This paper poses the question: Can we really achieve a
gain in the recognition of objects using its spatial context?
We start from Bag-of-Words models and use the Pascal 2007
dataset. We use the rough object bounding boxes that come
with the dataset to investigate the fundamental gain context
can bring. Our main contributions are: (I) The result
of Zhang et al. in CVPR07 that context is superfluous derived
from the Pascal 2005 data set of 4 classes does not
generalize to this dataset. For our larger and more realistic
dataset, context is important indeed. (II) Using the rough
bounding box to limit or extend the scope of an object during
both training and testing, we find that the spatial extent
of an object is determined by its category: (a) well defined,
rigid objects have the object itself as the preferred spatial
extent. (b) Non-rigid objects have an unbounded spatial extent:
all spatial extents produce equally good results. (c)
Objects primarily categorised based on their function have
the whole image as their spatial extent. Finally, (III) using
the rough bounding box to treat object and context separately,
we found that the upper bound of improvement is
26% (12% absolute) in terms of Mean Average Precision,
and this bound is likely to be higher if the localisation is
done using segmentation. It is concluded that object localisation,
if done precise, helps considerably in the recognition
of objects.
@InProceedings{UijlingsCVPR2009,
author = "Uijlings, J. R. R. and Smeulders, A. W. M. and Scha, R. J. H.",
title = "What is the Spatial Extent of an Object?",
booktitle = "IEEE Conference on Computer Vision and Pattern Recognition",
year = "2009",
url = "https://ivi.fnwi.uva.nl/isis/publications/2009/UijlingsCVPR2009",
pdf = "https://ivi.fnwi.uva.nl/isis/publications/2009/UijlingsCVPR2009/UijlingsCVPR2009.pdf",
has_image = 1
}