This paper addresses the problem of retrieving, from a database of video sequences, the shots that match a query image. Existing architectures match images using a high-level representation of local features extracted from the video database and are mainly based on the Bag-of-Words model. Such architectures, however, lack the capability to scale up to very large databases. Recently, Fisher Vectors have shown promising results in large-scale image retrieval problems, but it is still not clear how they can best be exploited in video-related applications. In our work, we use compressed Fisher Vectors to represent the video shots, and we show that the inherent correlation between video frames can be effectively exploited. Experiments show that our proposed system achieves better performance while having lower computational requirements than similar architectures. © 2014 EURASIP.