ALOV300++ Dataset
The Amsterdam Library of Ordinary Videos for evaluating the robustness of visual trackers


The Amsterdam Library of Ordinary Videos for tracking, ALOV300, aims to cover circumstances as diverse as possible: illumination changes, transparency, specularity, confusion with similar objects, clutter, occlusion, zoom, severe shape changes, motion patterns, low contrast, and so on (see [ref to workshop paper] for a theoretical analysis of these aspects). In composing the ALOV300 dataset, preference was given to many assorted short videos over a few longer ones. For each aspect, we collected video sequences ranging from easy to difficult, with the emphasis on difficult videos. ALOV300 is also composed to be upward compatible with other tracking benchmarks by including standard video sequences frequently used in recent tracking papers, covering the aspects of light, albedo, transparency, motion smoothness, confusion, occlusion, and shaking camera. The dataset consists of 315 video sequences. The main source of the data is real-life videos from YouTube with 64 different types of targets, ranging from a human face, a person, a ball, an octopus, microscopic cells, and a plastic bag to a can. The collection is categorized into thirteen aspects of difficulty with many hard to very hard videos, such as a dancer, a rock singer in a concert, a completely transparent glass, an octopus, a flock of birds, a soldier in camouflage, a completely occluded object, and videos with extreme zooming that introduces abrupt target motion.

To maximize diversity, most of the sequences are short. The average sequence length across these aspects is 9.2 seconds, with a maximum of 35 seconds. One additional category contains ten long videos with durations between one and two minutes. These comprise three videos from \cite{Kalal2010} with a fast-moving motorbike in the desert, a low-contrast recording of a car on the highway, and a car chase; three videos from the 3DPeS dataset \cite{Baltieri2011} with varying illumination conditions and complex-motion objects; one video from the dataset in \cite{Benfold2011} with surveillance of multiple people; and three complex videos from YouTube.

The total number of frames in ALOV300 is 89,364. The data in ALOV300 are annotated every fifth frame with an axis-aligned rectangular bounding box of flexible size. In rare cases, when motion is rapid, the annotation is more frequent. The ground truth for the intermediate frames is obtained by linear interpolation. The ground-truth bounding box in the first frame is supplied to the trackers; it is the only source of target-specific information available to them. ALOV300 can be downloaded from this website.
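The interpolation scheme above can be sketched as follows. This is a minimal illustration, not the official annotation tool: it assumes keyframe annotations are available as `(frame_index, (x1, y1, x2, y2))` pairs, and the function name `interpolate_boxes` is hypothetical.

```python
def interpolate_boxes(keyframes):
    """Linearly interpolate bounding boxes between annotated keyframes.

    keyframes: list of (frame_index, (x1, y1, x2, y2)) tuples sorted by
    frame index, e.g. one annotation every fifth frame.
    Returns a dict mapping every frame index in the covered range to a box.
    """
    boxes = {}
    for (f0, b0), (f1, b1) in zip(keyframes, keyframes[1:]):
        span = f1 - f0
        for f in range(f0, f1):
            t = (f - f0) / span  # fractional position between the keyframes
            # interpolate each of the four coordinates independently
            boxes[f] = tuple(a + t * (b - a) for a, b in zip(b0, b1))
    last_f, last_b = keyframes[-1]
    boxes[last_f] = last_b  # the final keyframe is kept as annotated
    return boxes


# Example: two keyframes five frames apart, as in the every-fifth-frame scheme
boxes = interpolate_boxes([(0, (10.0, 10.0, 50.0, 50.0)),
                           (5, (20.0, 20.0, 60.0, 60.0))])
```

With the example above, the box at frame 1 lies one fifth of the way between the two annotated boxes.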