Trying to come to grips with how a visual search engine works, I worked my way up to Chapter 7 of Programming Computer Vision: Searching Images. The chapter introduces the concept of a "bag of visual words" representation, which provides a means of searching images using techniques common to text-based search engines. The description of the technique begins:
Content-based image retrieval (CBIR) deals with the problem of retrieving visually similar images from a (large) database of images. This can be images with similar color, similar textures or similar objects or scenes, basically any information contained in the images themselves.
For high-level queries, like finding similar objects, it is not feasible to do a full comparison (for example using feature matching) between a query image and all images in the database. It would simply take too much time to return any results if the database is large. In the last couple of years, researchers have successfully introduced techniques from the world of text mining for CBIR problems making it possible to search millions of images for similar content. The most common weighting is tf-idf weighting (term frequency – inverse document frequency).

But what form exactly does this "information" take?
This diagram, from an informatics/CS course on information retrieval, presents a series of visualizations of documents as normalized vectors (or bags of words). In this case the "documents" (or segments of a text) are shown in spreadsheet form, with each row representing a different "bag of words".
In the above example, the presence of each word in its bag is not simply binary (in or out) but weighted to reflect how significant the word is. This is achieved by using the inverse document frequency (idf) to produce a sliding scale of significance. In this way "stopwords" emerge automatically: when a word is present in all documents, the idf factor becomes 0, effectively removing the term's significance (and thus its presence in the "bag"). As a term becomes less common among documents, the idf factor slides upward, amplifying the weight of the term and thus its overall significance in comparisons.
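The sliding scale can be sketched in a few lines of plain Python. This is a minimal illustration, not the book's code; the toy corpus and the `idf = log(N / df)` variant are my own assumptions (other idf formulas add smoothing). Note how "the", appearing in every document, gets an idf of exactly 0 and so vanishes from the weighted bag, behaving like a stopword:

```python
import math

# Toy corpus: each "document" is just a list of words (a bag).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]

N = len(docs)

# Document frequency: in how many documents does each word appear?
df = {}
for doc in docs:
    for word in set(doc):
        df[word] = df.get(word, 0) + 1

# idf = log(N / df). A word present in all N documents gets
# log(N / N) = 0, so it drops out of the weighted bag.
idf = {word: math.log(N / count) for word, count in df.items()}

def tfidf(doc):
    # Term frequency (normalized count) scaled by idf.
    tf = {w: doc.count(w) / len(doc) for w in set(doc)}
    return {w: tf[w] * idf[w] for w in tf}

print(idf["the"])      # 0.0 -- present in all three documents
print(tfidf(docs[0]))  # "the" weighted to zero, "sat" weighted highest
```

"sat" ends up with the largest weight in the first document because it appears nowhere else in the corpus, which is exactly the amplification of rare terms described above.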
Following a similar approach for images, it’s interesting to think of what the visual equivalent to stopwords would be (and look like). The text continues:
To apply text mining techniques to images, we first need to create the visual equivalent of words. This is usually done using local descriptors like the SIFT descriptor in Section 2.2. The idea is to quantize the descriptor space into a number of typical examples and assign each descriptor in the image to one of those examples. These typical examples are determined by analyzing a training set of images and can be considered as visual words and the set of all words is then a visual vocabulary (sometimes called a visual codebook). This vocabulary can be created specifically for a given problem or type of images or just try to represent visual content in general.
This is quite an important point: the collection of images and its organization into "examples" largely determines the kind of "similarity" the index is then able to support.
The visual words are constructed using some clustering algorithm applied to the feature descriptors extracted from a (large) training set of images. The most common choice is k-means, which is what we will use here. Visual words are nothing but a collection of vectors in the given feature descriptor space, in the case of k-means they are the cluster centroids. Representing an image with a histogram of visual words is then called a bag of visual words model.
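The quantization step the book describes can be sketched with NumPy alone. This is not PCV's implementation: the 2-D synthetic "descriptors" stand in for 128-D SIFT vectors, and `kmeans` here is a bare-bones Lloyd's algorithm (the book uses a library routine). The point is the shape of the pipeline: cluster a training set into centroids (the vocabulary), then histogram each image's descriptors against their nearest centroid:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for SIFT descriptors from a training set. Real descriptors
# are 128-D; three synthetic 2-D clusters keep the sketch readable.
training = np.vstack([
    rng.normal(loc, 0.1, size=(50, 2))
    for loc in [(0, 0), (1, 0), (0, 1)]
])

def kmeans(data, k, iters=20):
    """Plain Lloyd's algorithm; the centroids become the visual words."""
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest centroid.
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Recompute centroids, keeping the old one if a cluster empties.
        centroids = np.array([
            data[labels == i].mean(0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
    return centroids

vocab = kmeans(training, k=3)  # the visual vocabulary (codebook)

def bag_of_visual_words(descriptors, vocab):
    # Quantize each descriptor to its nearest visual word, then histogram.
    dists = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    words = dists.argmin(axis=1)
    return np.bincount(words, minlength=len(vocab))

# Descriptors from one "image", all drawn near the first cluster:
image_descriptors = rng.normal((0, 0), 0.1, size=(20, 2))
print(bag_of_visual_words(image_descriptors, vocab))
```

The resulting histogram is the image's bag-of-visual-words vector, and it is this vector (tf-idf weighted, as above) that gets indexed and compared instead of the raw descriptors.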
| task | technique |
| --- | --- |
| scalable search index of a large set of documents | vector space model (bag-of-words) |
| index term weighting / stopwords | tf-idf |
| visual words: description by local features | SIFT, image gradients |
| visual words: clustering descriptions | k-means |
In its very dense prose (frequent in PCV), the book presents a whole stack of techniques. This "layer cake" of methods is typical of computer vision, where different methods and algorithms are pulled out of the "vision toolbox" and stacked one on top of the other to perform a task. The difficulty is often in understanding how the assumptions and limitations of one layer influence the others, and how the whole stack is affected by the particularities of the individual underlying techniques.