Object detection is the task of automatically localizing and recognizing, in images, objects belonging to a predefined set of categories, and it remains a challenging computer vision problem. The difficulty stems from partial occlusions, dynamic backgrounds, and foreground clutter, as in people detection (Tu et al., 2008). The problem becomes harder when multiple objects and object categories are present simultaneously (Lim et al., 2010; Lim et al., 2011). In these cases, exploiting knowledge of the spatial context of objects with respect to one another can make detection more robust (Siyahjani & Doretto, 2012). In other cases, detection must succeed regardless of an object's orientation, which makes designing rotation invariant representations important (Doretto & Yao, 2010). When instead the goal is to detect a single specific object, the detector can easily overfit. A more robust approach is to leverage data from multiple objects and train a detector that can be adapted to a target with only one data sample, a.k.a. one-shot training or transfer learning (Yao & Doretto, 2010). Finally, auxiliary information might be available during detector training, and it can be exploited to design a more robust training procedure (Motiian et al., 2016).
References
CVPR
Information bottleneck learning using privileged information for visual recognition
Motiian, S.,
Piccirilli, M.,
Adjeroh, D.,
and Doretto, G.
In Proceedings of the IEEE Computer Society Conference on Computer Vision
and Pattern Recognition,
2016.
We explore the visual recognition problem from a main data view when an auxiliary data view is available during training. This is important because it allows improving the training of visual classifiers when paired additional data is cheaply available, and it improves the recognition from multi-view data when there is a missing view at testing time. The problem is challenging because of the intrinsic asymmetry caused by the missing auxiliary view during testing. We account for such view during training by extending the information bottleneck method, and by combining it with risk minimization. In this way, we establish an information theoretic principle for learning any type of visual classifier under this particular setting. We use this principle to design a large-margin classifier with an efficient optimization in the primal space. We extensively compare our method with the state-of-the-art on different visual recognition datasets, and with different types of auxiliary data, and show that the proposed framework has a very promising potential.
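The paper derives its large-margin classifier from the information bottleneck principle; the derivation itself is in the paper, but the underlying idea of letting the auxiliary view model the training slacks (as in Vapnik's SVM+ for learning using privileged information, which this setting resembles) can be sketched in a few lines. The following is a minimal illustrative sketch under that assumption, using a simple penalty relaxation and subgradient descent; the function name, hyperparameters, and relaxation are mine, not the paper's formulation.

import numpy as np

def svm_plus_sketch(X, X_priv, y, C=1.0, gamma=1.0, rho=10.0,
                    lr=1e-3, epochs=500, seed=0):
    """Toy subgradient solver for an SVM+-style objective (illustrative only).

    X      : (n, d)  main-view features, available at train and test time
    X_priv : (n, p)  privileged-view features, available only during training
    y      : (n,)    labels in {-1, +1}
    The privileged view parameterizes the slack of each training sample,
        xi_i = max(0, w_p . x*_i + b_p),
    and the hinge penalty charges only the margin violation that the
    modeled slack does not explain.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    p = X_priv.shape[1]
    w, b = rng.normal(0, 0.01, d), 0.0
    w_p, b_p = rng.normal(0, 0.01, p), 0.0
    for _ in range(epochs):
        slack_raw = X_priv @ w_p + b_p
        xi = np.maximum(0.0, slack_raw)                 # privileged slacks
        margin = y * (X @ w + b)
        viol = np.maximum(0.0, 1.0 - margin - xi)       # unexplained violation
        active = viol > 0                               # hinge subgradient support
        slack_on = slack_raw > 0
        # subgradients of: 0.5|w|^2 + 0.5*gamma*|w_p|^2 + C*sum(xi) + rho*sum(viol)
        g_w = w - rho * (y[active, None] * X[active]).sum(axis=0)
        g_b = -rho * y[active].sum()
        g_wp = gamma * w_p + C * (slack_on[:, None] * X_priv).sum(axis=0) \
               - rho * ((active & slack_on)[:, None] * X_priv).sum(axis=0)
        g_bp = C * slack_on.sum() - rho * (active & slack_on).sum()
        w, b = w - lr * g_w, b - lr * g_b
        w_p, b_p = w_p - lr * g_wp, b_p - lr * g_bp
    return w, b   # only the main-view classifier survives to test time

Note that only w and b are returned: the privileged view shapes training but is absent at testing, which mirrors the asymmetry the abstract describes.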
@inproceedings{motiianPAD16cvpr,
abbr = {CVPR},
author = {Motiian, S. and Piccirilli, M. and Adjeroh, D. and Doretto, G.},
title = {Information bottleneck learning using privileged information for visual recognition},
booktitle = {Proceedings of the IEEE Computer Society Conference on Computer Vision
and Pattern Recognition},
year = {2016},
pages = {1496--1505},
bib2html_pubtype = {Conferences}
}
ACCV
Learning a Context Aware Dictionary for Sparse Representation
Siyahjani, F.,
and Doretto, G.
In Proceedings of the Asian Conference on Computer Vision,
2012.
Oral
Recent successes in the use of sparse coding for many computer vision applications have triggered the attention towards the problem of how an over-complete dictionary should be learned from data. This is because the quality of a dictionary greatly affects performance in many respects, including computational. While so far the focus has been on learning compact, reconstructive, and discriminative dictionaries, in this work we propose to retain the previous qualities, and further enhance them by learning a dictionary that is able to predict the contextual information surrounding a sparsely coded signal. The proposed framework leverages the K-SVD for learning, fully inheriting its benefits of simplicity and efficiency. A model of structured prediction is designed around this approach, which leverages contextual information to improve the combined recognition and localization of multiple objects from multiple classes within one image. Results on the PASCAL VOC 2007 dataset are in line with the state-of-the-art, and clearly indicate that this is a viable approach for learning a context aware dictionary for sparse representation.
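Since the framework "leverages the K-SVD for learning," a compact reminder of what one K-SVD iteration does may help: sparse-code all signals with Orthogonal Matching Pursuit, then refit each dictionary atom (and the coefficients that use it) with a rank-1 SVD of the residual. This is a generic numpy sketch of plain K-SVD, not the context-aware variant proposed in the paper; function names are illustrative.

import numpy as np

def omp(D, x, k):
    """Greedy Orthogonal Matching Pursuit: k-sparse code of x over dictionary D."""
    resid, idx = x.copy(), []
    for _ in range(k):
        idx.append(int(np.argmax(np.abs(D.T @ resid))))  # most correlated atom
        sub = D[:, idx]
        coef, *_ = np.linalg.lstsq(sub, x, rcond=None)   # refit on chosen atoms
        resid = x - sub @ coef                           # resid is orthogonal to them
    code = np.zeros(D.shape[1])
    code[idx] = coef
    return code

def ksvd_iteration(Y, D, k):
    """One K-SVD pass: code all signals, then update each atom by rank-1 SVD.

    Y: (d, n) training signals as columns; D: (d, K) dictionary with
    unit-norm atoms; k: sparsity level.
    """
    X = np.column_stack([omp(D, Y[:, i], k) for i in range(Y.shape[1])])
    for j in range(D.shape[1]):
        users = np.nonzero(X[j, :])[0]        # signals that use atom j
        if users.size == 0:
            continue
        # residual of those signals with atom j's contribution removed
        E = Y[:, users] - D @ X[:, users] + np.outer(D[:, j], X[j, users])
        U, S, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, j] = U[:, 0]                      # best rank-1 refit of the atom
        X[j, users] = S[0] * Vt[0]             # and of its coefficients
    return D, X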
@inproceedings{siyahjaniD12accv,
abbr = {ACCV},
author = {Siyahjani, F. and Doretto, G.},
title = {Learning a Context Aware Dictionary for Sparse Representation},
booktitle = {Proceedings of the Asian Conference on Computer Vision},
year = {2012},
pages = {1--14},
bib2html_pubtype = {Conferences},
bib2html_rescat = {Dictionary Learning, Sparse Coding, Object Detection, Context, Object
Classification},
wwwnote = {Oral}
}
ISVC
Multi-class Object Layout with Unsupervised Image Classification
and Object Localization
Lim, S.,
Doretto, G.,
and Rittscher, J.
In International Symposium on Visual Computing,
2011.
Oral
Recognizing the presence of object classes in an image, or image classification, has become an increasingly important topic of interest. Equally important, however, is also the capability to locate these object classes in the image. We consider in this paper an approach to these two related problems with the primary goal of minimizing the training requirements so as to allow for ease of adding new object classes, as opposed to approaches that favor training a suite of object-specific classifiers. To this end, we provide the analysis of an exemplar-based approach that leverages unsupervised clustering for classification purposes, and sliding window matching for localization. While such an exemplar-based approach by itself is brittle towards intraclass and viewpoint variations, we achieve robustness by introducing a novel Conditional Random Field model that facilitates a straightforward accept/reject decision of the localized object classes. Performance of our approach on the PASCAL Visual Object Challenge 2007 dataset demonstrates its efficacy.
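As a concrete picture of the sliding-window matching used for localization, the sketch below scores every window of an image against an exemplar with normalized cross-correlation. It stands in for just that one component; the paper's pipeline additionally uses unsupervised clustering for classification and a CRF for the accept/reject decision, and the implementation details here are assumptions.

import numpy as np

def ncc_sliding_window(image, exemplar):
    """Slide an exemplar over an image and score each location with
    normalized cross-correlation; a simple stand-in for exemplar-based
    sliding window matching (not the paper's full pipeline).

    image: (H, W) grayscale array; exemplar: (h, w) grayscale template.
    Returns an (H-h+1, W-w+1) score map; its argmax is the best window.
    """
    H, W = image.shape
    h, w = exemplar.shape
    t = exemplar - exemplar.mean()
    tn = np.linalg.norm(t) + 1e-8
    scores = np.full((H - h + 1, W - w + 1), -1.0)
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            patch = image[i:i + h, j:j + w]
            p = patch - patch.mean()
            scores[i, j] = (p * t).sum() / (np.linalg.norm(p) * tn + 1e-8)
    return scores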
@inproceedings{limDR11isvc,
abbr = {ISVC},
author = {Lim, S. and Doretto, G. and Rittscher, J.},
title = {Multi-class Object Layout with Unsupervised Image Classification
and Object Localization},
booktitle = {International Symposium on Visual Computing},
year = {2011},
pages = {577--589},
bib2html_pubtype = {Conferences},
bib2html_rescat = {Object Detection, Context, Object Classification},
wwwnote = {Oral}
}
CVPR
Region Moments: Fast invariant descriptors for detecting small image structures
Doretto, G.,
and Yao, Y.
In Proceedings of the IEEE Computer Society Conference on Computer Vision
and Pattern Recognition,
2010.
This paper presents region moments, a class of appearance descriptors based on image moments applied to a pool of image features. A careful design of the moments and the image features makes the descriptors scale and rotation invariant, and therefore suitable for vehicle detection from aerial video, where targets appear at different scales and orientations. Region moments are linearly related to the image features. Thus, comparing descriptors by computing costly geodesic distances and non-linear classifiers can be avoided, because Euclidean geometry and linear classifiers are still effective. The descriptor computation is made efficient by designing a fast procedure based on the integral representation. An extensive comparison between region moments and the region covariance descriptors reports theoretical, qualitative, and quantitative differences among them, with a clear advantage of the region moments when used for detecting small image structures, such as vehicles in aerial video. The proposed descriptors hold the promise to become an effective building block in other applications.
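The integral-representation speedup the abstract mentions can be made concrete. Because a region moment is a sum of x^p y^q f(x, y) over a rectangle, one integral image per (p, q, feature) combination reduces each rectangle query to four lookups. The sketch below illustrates this standard construction; the paper additionally combines and normalizes such moments to obtain scale and rotation invariance, which is not reproduced here, and the names and coordinate convention are mine.

import numpy as np

def moment_integral_images(F, max_order=2):
    """Precompute integral images of x^p * y^q * F for all p + q <= max_order.

    F: (H, W) single feature channel. Returns a dict {(p, q): integral image}.
    After this O(HW) pass, any rectangle's moment costs four lookups.
    """
    H, W = F.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)   # row (y) and column (x) grids
    tables = {}
    for p in range(max_order + 1):
        for q in range(max_order + 1 - p):
            M = (xs ** p) * (ys ** q) * F
            # zero-padded double cumulative sum -> integral image of M
            tables[(p, q)] = np.pad(M, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    return tables

def region_moment(tables, p, q, top, left, bottom, right):
    """Sum of x^p y^q F(x, y) over rows [top, bottom) and cols [left, right)."""
    I = tables[(p, q)]
    return I[bottom, right] - I[top, right] - I[bottom, left] + I[top, left]

Each table costs one O(HW) pass, after which every candidate window is evaluated in O(1); this is what makes dense scanning for small targets affordable.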
@inproceedings{dorettoY10cvpr,
abbr = {CVPR},
author = {Doretto, G. and Yao, Y.},
title = {Region {M}oments: Fast invariant descriptors for detecting small
image structures},
booktitle = {Proceedings of the IEEE Computer Society Conference on Computer Vision
and Pattern Recognition},
year = {2010},
bib2html_pubtype = {Conferences},
bib2html_rescat = {Video Analysis, Appearance Modeling, Integral Image Computations},
}
ECCVW
Object Constellations: Scalable, simultaneous detection and
recognition of multiple specific objects
Lim, S.,
Doretto, G.,
and Rittscher, J.
In Proceedings of the ECCV Workshop on Vision for Cognitive Tasks,
2010.
Oral
Given a library of objects specified by example images, we describe a probabilistic framework for detecting multiple such objects in the scene, as well as estimating their positions and sizes. Detection of such an object constellation is faced with several major challenges. The approach has to be scalable to a large number of objects in the library while being robust towards outliers and noise. We propose to overcome these challenges by generating object and geometry hypotheses as priors for estimating the constellation. Generating object hypotheses avoids the need for evaluating the presence of all the objects in the image, which allows us to achieve scalability. Generating geometry hypotheses localizes the regions of interest (ROIs) in which detection can be conducted. Object constellations are then estimated through a feature matching procedure, where we make a recognition decision based on the number of feature matches for each object hypothesis, as well as the quality of the matches. We demonstrate the efficacy of our approach with both public datasets and TV broadcast image sequences.
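The recognition decision counts feature matches per object hypothesis and weighs their quality. As a toy stand-in for that matching step (the paper's exact criterion is not reproduced here), the sketch below counts descriptor matches with the common nearest-neighbor ratio test; the threshold value is an assumption.

import numpy as np

def match_count(desc_obj, desc_roi, ratio=0.8):
    """Count putative matches between an object hypothesis' descriptors and
    those extracted from a region of interest, using Lowe's ratio test.

    desc_obj: (m, d) object descriptors; desc_roi: (n, d) ROI descriptors, n >= 2.
    A match counts only if the nearest neighbor is clearly closer than the
    second nearest, i.e. the match is distinctive rather than ambiguous.
    """
    good = 0
    for d in desc_obj:
        dists = np.linalg.norm(desc_roi - d, axis=1)
        i1, i2 = np.argsort(dists)[:2]          # two nearest ROI descriptors
        if dists[i1] < ratio * dists[i2]:        # keep distinctive matches only
            good += 1
    return good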
@inproceedings{limDR10workshop,
abbr = {ECCVW},
author = {Lim, S. and Doretto, G. and Rittscher, J.},
title = {{O}bject {C}onstellations: {S}calable, simultaneous detection and
recognition of multiple specific objects},
booktitle = {Proceedings of the ECCV Workshop on Vision for Cognitive Tasks},
year = {2010},
bib2html_pubtype = {Conferences},
bib2html_rescat = {Object Detection, Context, Object Classification},
wwwnote = {Oral}
}
CVPR
Boosting for transfer learning with multiple sources
Yao, Y.,
and Doretto, G.
In Proceedings of the IEEE Computer Society Conference on Computer Vision
and Pattern Recognition,
2010.
Transfer learning allows leveraging the knowledge of source domains, available a priori, to help training a classifier for a target domain, where the available data is scarce. The effectiveness of the transfer is affected by the relationship between source and target. Rather than improving the learning, brute force leveraging of a source poorly related to the target may decrease the classifier performance. One strategy to reduce this negative transfer is to import knowledge from multiple sources to increase the chance of finding one source closely related to the target. This work extends the boosting framework for transferring knowledge from multiple sources. Two new algorithms, MultiSourceTrAdaBoost and TaskTrAdaBoost, are introduced, analyzed, and applied for object category recognition and specific object detection. The experiments demonstrate their improved performance by greatly reducing the negative transfer as the number of sources increases. TaskTrAdaBoost is a fast algorithm enabling rapid retraining over new targets.
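Both algorithms extend TrAdaBoost (Dai et al., 2007), whose per-round reweighting is the key mechanism: misclassified target samples gain weight as in AdaBoost, while misclassified source samples are decayed, since a source sample the current weak learner cannot fit is likely drawn from a distribution poorly related to the target. Below is a sketch of that single-source building block only; the paper's multi-source selection and the task-level transfer of TaskTrAdaBoost are not reproduced, and the function signature is mine.

import numpy as np

def tradaboost_reweight(w_src, w_tgt, wrong_src, wrong_tgt, eps_t, n_src, T):
    """One TrAdaBoost weight update (Dai et al., 2007).

    w_src, w_tgt         : current sample weights (source / target)
    wrong_src, wrong_tgt : boolean masks of weak-learner mistakes
    eps_t                : weighted error on the target set, required in (0, 0.5)
    n_src, T             : number of source samples, total boosting rounds
    """
    beta_t = eps_t / (1.0 - eps_t)                            # target update rate
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_src) / T))     # source decay rate
    # misclassified target samples gain weight (classic AdaBoost move) ...
    w_tgt = w_tgt * np.where(wrong_tgt, 1.0 / beta_t, 1.0)
    # ... while misclassified source samples lose weight: their influence
    # decays if they keep disagreeing with target-consistent hypotheses.
    w_src = w_src * np.where(wrong_src, beta, 1.0)
    s = w_src.sum() + w_tgt.sum()
    return w_src / s, w_tgt / s                               # renormalized weights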
@inproceedings{yaoD10cvpr,
abbr = {CVPR},
author = {Yao, Y. and Doretto, G.},
title = {Boosting for transfer learning with multiple sources},
booktitle = {Proceedings of the IEEE Computer Society Conference on Computer Vision
and Pattern Recognition},
year = {2010},
bib2html_pubtype = {Conferences},
bib2html_rescat = {Object Classification},
}
ECCV
Unified crowd segmentation
Tu, P.,
Sebastian, T.,
Doretto, G.,
Krahnstoever, N.,
Rittscher, J.,
and Yu, T.
In Proceedings of the European Conference on Computer Vision,
2008.
This paper presents a unified approach to crowd segmentation. A global solution is generated using an Expectation Maximization framework. Initially, a head and shoulder detector is used to nominate an exhaustive set of person locations and these form the person hypotheses. The image is then partitioned into a grid of small patches which are each assigned to one of the person hypotheses. A key idea of this paper is that while whole body monolithic person detectors can fail due to occlusion, a partial response to such a detector can be used to evaluate the likelihood of a single patch being assigned to a hypothesis. This captures local appearance information without having to learn specific appearance models. The likelihood of a pair of patches being assigned to a person hypothesis is evaluated based on low level image features such as uniform motion fields and color constancy. During the E-step, the single and pairwise likelihoods are used to compute a globally optimal set of assignments of patches to hypotheses. In the M-step, parameters which enforce global consistency of assignments are estimated. This can be viewed as a form of occlusion reasoning. The final assignment of patches to hypotheses constitutes a segmentation of the crowd. The resulting system provides a global solution that does not require background modeling and is robust with respect to clutter and partial occlusion.
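As a toy illustration of the patch-to-hypothesis data structure (emphatically not the paper's globally optimal E-step, which also uses the pairwise motion and color likelihoods), a unary-only greedy assignment could look like the sketch below; the background threshold is a made-up parameter.

import numpy as np

def greedy_e_step(log_lik, bg_thresh=-2.0):
    """Toy stand-in for the E-step: assign each image patch to the person
    hypothesis with the highest log-likelihood, with a background option.

    log_lik  : (n_patches, n_hypotheses) per-patch detector responses,
               e.g. partial responses of a monolithic person detector
    bg_thresh: illustrative threshold (not from the paper) below which a
               patch is labeled background
    Returns an (n_patches,) label array; -1 denotes background. The patch
    labels, rendered back onto the grid, form a crowd segmentation.
    """
    best = log_lik.argmax(axis=1)
    return np.where(log_lik.max(axis=1) > bg_thresh, best, -1)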
@inproceedings{tuSDKRY08eccv,
abbr = {ECCV},
author = {Tu, P. and Sebastian, T. and Doretto, G. and Krahnstoever, N. and Rittscher, J. and Yu, T.},
title = {Unified crowd segmentation},
booktitle = {Proceedings of the European Conference on Computer Vision},
year = {2008},
pages = {691--704},
bib2html_pubtype = {Conferences},
bib2html_rescat = {Video Analysis, People Detection, Integral Image Computations,
People Tracking},
}