Thank you very much for your interest in this work! Unfortunately, we are not able to release code at this time, but here are answers to some common questions which might help in understanding and/or replicating the results. I would be happy to answer any additional questions.
Please describe the feature extraction pipeline (from image+bounding box to feature vector)
On a dense grid in the image, compute local HOG descriptors at multiple scales. For example, you start by computing 16x16 HOG descriptors centered at pixels (8,8), (8,10), (8,12), ... and so on until you cover the whole image. Then you compute HOG descriptors on 25x25 patches, centered at pixels (13,13), (13,15), (13,17), ... If you want to make things simpler, you can just use a standard package to compute dense SIFT descriptors at multiple scales instead; I believe the vlfeat package can do that for you. The algorithm is not very sensitive to the exact low-level descriptors you use, but you should sample them densely and at a few scales, as mentioned in the paper.
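To make the sampling pattern concrete, here is a minimal numpy sketch that just enumerates the (x,y) descriptor centers for a dense grid at one patch size (the stride of 2 and the center convention are assumptions for illustration; the actual descriptor computation would come from your HOG/SIFT package):

```python
import numpy as np

def dense_grid_centers(img_h, img_w, patch_size, stride):
    """(x, y) centers of a dense grid of patch_size x patch_size patches,
    keeping every patch fully inside the image."""
    half = (patch_size + 1) // 2
    ys = np.arange(half, img_h - half + 1, stride)
    xs = np.arange(half, img_w - half + 1, stride)
    xx, yy = np.meshgrid(xs, ys)
    return np.stack([xx.ravel(), yy.ravel()], axis=1)

# e.g. 16x16 patches starting at center (8,8), then 25x25 patches at (13,13)
centers_16 = dense_grid_centers(300, 400, 16, 2)
centers_25 = dense_grid_centers(300, 400, 25, 2)
```

You would then compute one descriptor per center per scale and concatenate the resulting location/descriptor lists across scales.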
So now you have a list of N (x,y) locations and N descriptors at those locations (128 dimensions each). You take a subsample of these descriptors (as many as the memory restrictions on your computer can handle) and create a codebook, for example by running k-means clustering. We use a codebook of size 8192 in the paper, but to start with you can use a smaller codebook if you wish. This will affect the results, however -- bigger codebooks are needed to produce better results. You can use Matlab's kmeans function for this, or one of the many packages available on the web; I think vlfeat should be able to do this step for you as well.
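If you would rather not depend on an external package, the codebook step is just standard Lloyd's k-means. A minimal numpy sketch (the initialization and iteration count are assumptions; a real run would use a large descriptor subsample and k=8192):

```python
import numpy as np

def kmeans_codebook(descriptors, k, n_iters=20, seed=0):
    """Lloyd's k-means: rows of `descriptors` are e.g. 128-d vectors;
    returns a (k, dim) codebook of cluster centers."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k,
                                     replace=False)].astype(float)
    for _ in range(n_iters):
        # assign each descriptor to its nearest center
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # recompute centers (keep the old center for empty clusters)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(0)
    return centers
```

Note the pairwise-distance matrix here is O(N*k); for N in the millions and k=8192 you would process descriptors in chunks or use an approximate/hierarchical k-means instead.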
Now you have the N (x,y) locations and N descriptors, plus a codebook of common descriptor appearances. You do LLC coding to assign each descriptor to one or more codebook entries (we use k=5). Alternatively, you can use simple vector quantization (available in the vlfeat package, for example) to assign each descriptor to the single closest codebook entry. As a result, instead of N descriptors of size 128 each, you will have N feature vectors of size 8192 each, which are very sparse (between 1 and 5 positions have non-zero values; the rest are zeros).
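The simpler vector-quantization alternative mentioned above can be sketched in a few lines of numpy (this is hard assignment to the single nearest entry, not LLC; dense arrays stand in for a real sparse representation):

```python
import numpy as np

def hard_vq_codes(descriptors, codebook):
    """Assign each descriptor to its single nearest codebook entry and
    return one-hot codes of length len(codebook) (one row per descriptor)."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(1)
    codes = np.zeros((len(descriptors), len(codebook)))
    codes[np.arange(len(descriptors)), nearest] = 1.0
    return codes
```

With LLC you would instead solve a small locally-constrained least-squares problem over the k=5 nearest entries, so each row would have up to 5 non-zero weights rather than a single 1.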
So you have N (x,y) locations and N sparse feature vectors of size 8192 at these locations. So far this is very similar to a standard image-classification pipeline. The only difference is that instead of pooling the features together to form a full-image representation, we are given a bounding box and want to pool the features together to represent this bounding box. On the foreground (i.e., using only the features whose (x,y) locations fall within the bounding box) you do SPM pooling with 1x1 and 3x3 blocks, which gives a feature vector of length 10x8192; we use max pooling. Remember, we only pool together those of the N features whose (x,y) locations fall within the bounding box. For the background, we simply pool together all the remaining features, those that fall outside the bounding box, which gives a feature vector of length 8192 since our per-descriptor feature vectors have length 8192. We concatenate the foreground vector with the background vector to get a feature vector of size 11x8192 representing the entire bounding box.
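The foreground/background pooling step can be sketched as follows (a minimal numpy version: max pooling over a 1x1 plus 3x3 spatial grid inside the box, plus one background pool outside it; the exact cell-boundary convention for the 3x3 grid is an assumption):

```python
import numpy as np

def pool_box(locations, codes, box):
    """Pool N sparse codes (N x K) at `locations` (N x 2, columns x, y)
    into an 11*K vector for `box` = (x_min, y_min, x_max, y_max):
    1 full-box + 9 grid-cell foreground pools, then 1 background pool."""
    x, y = locations[:, 0], locations[:, 1]
    x0, y0, x1, y1 = box
    inside = (x >= x0) & (x <= x1) & (y >= y0) & (y <= y1)

    def max_pool(mask):
        return codes[mask].max(0) if mask.any() else np.zeros(codes.shape[1])

    parts = [max_pool(inside)]               # 1x1 block over the whole box
    xs = np.linspace(x0, x1, 4)              # 3x3 grid cell boundaries
    ys = np.linspace(y0, y1, 4)
    for i in range(3):
        for j in range(3):
            cell = inside & (x >= xs[i]) & (x <= xs[i + 1]) \
                          & (y >= ys[j]) & (y <= ys[j + 1])
            parts.append(max_pool(cell))
    parts.append(max_pool(~inside))          # background: everything outside
    return np.concatenate(parts)             # length 11 * K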
Do you use all trainval and test images?
Yes, we used about 7 million negative examples sampled from all trainval images. We used all test images in all experiments.
How exactly was localization accuracy computed?
TotalSet = set of images on which at least one of the object instances is neither difficult nor truncated
DetectedSet = subset of TotalSet images with at least one of the object instances correctly detected according to the IOU criterion
LocalizationAccuracy = numOfImages(DetectedSet)/numOfImages(TotalSet)
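The definition above can be sketched directly in Python (a minimal version assuming one predicted box per image and a 0.5 IOU threshold, the usual PASCAL convention; the input format is a hypothetical one chosen for illustration):

```python
def iou(a, b):
    """Intersection over union of boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def localization_accuracy(images, thresh=0.5):
    """images: list of (gt_boxes, predicted_box) pairs, where gt_boxes
    holds only instances that are neither difficult nor truncated (so an
    image with an empty gt_boxes list is excluded from TotalSet)."""
    total = [im for im in images if len(im[0]) > 0]
    detected = [im for im in total
                if any(iou(gt, im[1]) >= thresh for gt in im[0])]
    return len(detected) / len(total)
```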
I am able to replicate classification results but not weakly supervised detection results. Why is that?
It is indeed the case that getting weakly supervised detection to work is trickier than getting classification improvements. Using the candidate bounding boxes of van de Sande et al. is one key component; being able to handle large numbers of negative examples during training is another. If you can't use all of the negatives and have to bootstrap, what worked well was to take the N negative images and extract 3N negative examples as follows:
(1) find the top 10 highest-scoring windows within each image (after running non-maximal suppression to avoid duplicates, using the intersection-over-union criterion with, I think, 0.7 as the threshold)
(2) take the highest scoring window from each image (this gives N examples)
(3) of the remaining windows, take the 2N highest-scoring windows across all N images
This seems to provide good coverage of the negatives. You might want to run this bootstrapping process a couple of times, to use as many negatives as your training code can handle.
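The three steps above can be sketched as follows (a minimal Python version under assumed inputs: per-image candidate boxes with scores from the current detector; the greedy NMS is the standard formulation):

```python
import numpy as np

def iou(a, b):
    """Intersection over union of boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy non-maximal suppression; returns indices of kept boxes,
    highest score first."""
    keep = []
    for i in np.argsort(scores)[::-1]:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

def bootstrap_negatives(per_image):
    """per_image: list of (boxes, scores) for the N negative images.
    Returns ~3N (image_index, box_index) pairs: the top window from each
    image plus the 2N next-highest windows pooled across all images."""
    candidates = []  # (score, image_index, box_index), post-NMS top 10
    for img_idx, (boxes, scores) in enumerate(per_image):
        for b in nms(boxes, scores)[:10]:            # step (1)
            candidates.append((scores[b], img_idx, b))
    n = len(per_image)
    best = {}                                        # step (2)
    for s, i, b in candidates:
        if i not in best or s > best[i][0]:
            best[i] = (s, i, b)
    chosen = set((i, b) for _, i, b in best.values())
    rest = sorted((c for c in candidates if (c[1], c[2]) not in chosen),
                  reverse=True)                      # step (3)
    chosen.update((i, b) for _, i, b in rest[:2 * n])
    return sorted(chosen)
```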
What kind of normalization did you use for the fg and bg histograms?
This is a great question, and a very important one. The fg histogram has L2 norm 1 and the bg histogram has L2 norm 0.1 (the full feature vector is then renormalized to norm 1 for training, but I doubt this makes a big difference). Using the same weighting for fg and bg caused problems for us.
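In code, the normalization described above is just (a minimal numpy sketch; the epsilon guard against zero-norm histograms is an assumption):

```python
import numpy as np

def normalize_fg_bg(fg, bg, bg_weight=0.1):
    """L2-normalize fg to norm 1 and bg to norm `bg_weight`, then
    renormalize the concatenation to unit norm for training."""
    fg = fg / max(np.linalg.norm(fg), 1e-12)
    bg = bg / max(np.linalg.norm(bg), 1e-12) * bg_weight
    full = np.concatenate([fg, bg])
    return full / max(np.linalg.norm(full), 1e-12)
```

The final renormalization rescales both parts by the same factor, so the 10:1 foreground-to-background weighting is preserved.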
In Figure 5 there are 8 iterations, but only 7 iterations of the shrinking strategy. Do you just update the positive bounding boxes at the 8th iteration?
The number of iterations you run and the rate at which you shrink the box shouldn't matter too much. The key is to first run a handful of iterations (maybe 5) at the beginning while slowly shrinking the box (you can use sliding windows instead of just the candidate boxes to mine for extra negatives here). Then you can switch to using candidate bounding boxes of all sizes and run a couple more iterations. The algorithm should mostly converge at this point. Once you start using candidate boxes of all sizes, it becomes especially important to use a sufficiently large number of negative examples.
How did you tune the C parameter of SVM? Did it make any difference?
For the first iteration (and for the baseline method) we use 3-fold cross-validation for C since there's little data and we might as well. Once we put in all the negative examples we use C=0.01 but it makes very little difference at that point.