# ALOV300++ Dataset: 'Amsterdam Library of Ordinary Videos' for Evaluating Visual Tracker Robustness


(Our paper is under review in TPAMI. We will publish the data upon acceptance.)

## The Trackers Used in the Evaluation

[NCC] Normalized Cross-Correlation: The most basic concept of tracking is direct target \emph{matching} by normalized cross-correlation, which uses the intensity values in the initial target bounding box as the template. We use the version of \cite{Briechle2001}, which presents a fast algorithm. At each frame, the tracker samples candidate windows uniformly around the previous target position. Each candidate window is compared against the target template using normalized cross-correlation, and the candidate with the highest score is selected as the new target location. In this version of NCC, no updating of the target template takes place.
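The matching step above can be sketched in a few lines of numpy. This is a minimal illustration, not the fast algorithm of \cite{Briechle2001} (whose contribution is computing the same score efficiently); window size, search radius and the synthetic data are ours.

```python
import numpy as np

def ncc_score(template, candidate):
    # Zero-mean normalized cross-correlation between two equal-size patches.
    t = template - template.mean()
    c = candidate - candidate.mean()
    denom = np.sqrt((t * t).sum() * (c * c).sum())
    return float((t * c).sum() / denom) if denom > 0 else 0.0

def ncc_track(frame, template, prev_xy, radius=5):
    # Exhaustively score windows around the previous position; return the best.
    h, w = template.shape
    px, py = prev_xy
    best, best_xy = -np.inf, prev_xy
    for y in range(max(0, py - radius), min(frame.shape[0] - h, py + radius) + 1):
        for x in range(max(0, px - radius), min(frame.shape[1] - w, px + radius) + 1):
            s = ncc_score(template, frame[y:y + h, x:x + w])
            if s > best:
                best, best_xy = s, (x, y)
    return best_xy, best
```

On a synthetic frame containing an exact copy of the template, the search recovers the true location with a score of 1.0, the maximum of the NCC.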

[KLT] Lucas-Kanade Tracker: Ahead of its time by a wide margin, this tracker finds the \emph{affine-transformed match} between the target bounding box and candidate windows around the previous location. The affine transformation is computed by incremental image alignment based on spatiotemporal derivatives and warping, capable of dealing with scale, rotation and translation. The location of the target is determined by mapping the target position in the previous frame to the location in the current frame using the computed affine transformation. We use the computationally efficient version in \cite{Baker2004}.
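The translation-only special case conveys the incremental alignment idea: Gauss-Newton on the sum of squared differences, with the Jacobian fixed at the template. This numpy sketch is our simplification of the full affine method in \cite{Baker2004}; the bilinear sampler and the smooth test image are illustrative.

```python
import numpy as np

def warp(img, p):
    # Bilinearly sample img at (x + dx, y + dy), clamped at the borders.
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    fx = np.clip(xs + p[0], 0, w - 1.001)
    fy = np.clip(ys + p[1], 0, h - 1.001)
    x0, y0 = fx.astype(int), fy.astype(int)
    ax, ay = fx - x0, fy - y0
    return ((1 - ay) * ((1 - ax) * img[y0, x0] + ax * img[y0, x0 + 1])
            + ay * ((1 - ax) * img[y0 + 1, x0] + ax * img[y0 + 1, x0 + 1]))

def lk_translation(prev, curr, n_iter=30):
    # Gauss-Newton on the SSD for a pure translation p = (dx, dy), with the
    # Jacobian (and hence the 2x2 Hessian) fixed at the template gradients.
    gy, gx = np.gradient(prev.astype(float))
    J = np.stack([gx.ravel(), gy.ravel()], axis=1)   # (N, 2) Jacobian
    H = J.T @ J                                      # Gauss-Newton Hessian
    p = np.zeros(2)
    for _ in range(n_iter):
        err = (prev - warp(curr, p)).ravel()         # residual at current p
        p += np.linalg.solve(H, J.T @ err)           # Gauss-Newton update
    return p
```

Applied to a Gaussian blob shifted by (2, 1) pixels, the iteration recovers the displacement to sub-pixel accuracy.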

[KAT] Kalman Appearance Tracker: The paper \cite{Nguyen2004} addresses appearance change by \emph{appearance-predicted matching} for handling target change under occlusion. The target region is represented by a 20x20 template of intensities, each of which is associated with a separate Kalman filter (with the parameters pooled over all predictors). Under an additive Gaussian noise model, the filter predicts the development of each template intensity over time. The target motion is modeled by a 2-D translation at a single scale and searched in an area around the previous target position. In the subsequent frame, candidate windows around the predicted target position are reduced to a 20x20 template and compared with the predicted template, while penalizing large differences to reduce the effect of outliers. The candidate with the lowest overall difference is selected. In KAT, the template is updated with the newly predicted target.
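The per-pixel prediction can be sketched as a bank of scalar Kalman filters under a random-walk state model. The noise parameters below are illustrative stand-ins; \cite{Nguyen2004} additionally pools parameter estimates over all predictors and performs robust matching.

```python
import numpy as np

class PixelKalman:
    # One scalar Kalman filter per template pixel with a random-walk state:
    # intensity_t = intensity_{t-1} + process noise; observation adds noise.
    def __init__(self, template, q=1.0, r=4.0):
        self.x = template.astype(float)   # predicted intensities
        self.p = np.ones_like(self.x)     # per-pixel variance
        self.q, self.r = q, r             # process / observation noise (shared)

    def predict(self):
        self.p = self.p + self.q          # variance grows under the random walk
        return self.x

    def update(self, observed):
        k = self.p / (self.p + self.r)    # Kalman gain
        self.x = self.x + k * (observed - self.x)
        self.p = (1 - k) * self.p
        return self.x
```

Feeding a constant new observation, the predicted template converges toward it at a rate set by q and r, which is exactly the smooth appearance adaptation the tracker relies on.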

[FRT] Fragments-based Robust Tracking: The tracker in \cite{Adam2006} pursues \emph{matching the ensemble of patches} into which the target bounding box is broken. In this way partial occlusions and pose changes can be handled patch-by-patch. A fixed array of 10 by 2 concatenated patches, here called fragments, keeps track of the changes. When a new frame arrives, the tracker selects candidate windows uniformly around the previous position, including scale changes by shrinking and enlarging the target by $10\%$. Each candidate window is fragmented into the same twenty patches, each represented by an intensity histogram and compared to the corresponding patch in the target region by the Earth Mover's Distance. To be robust against outliers from the occluded patches, the $25\%$ quantile of the scores over all patches is selected as the final score of the candidate. In FRT, the candidate window with the smallest score wins the new target position. The target is not updated.
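The robust patch voting can be sketched as follows, with an L1 histogram distance as a simple stand-in for the paper's EMD on the integral histogram; the bin count and the [0, 1) intensity range are our assumptions for the synthetic data.

```python
import numpy as np

def fragment_score(template, candidate, grid=(10, 2), bins=16):
    # Split both windows into a fixed 10x2 grid of patches, compare per-patch
    # intensity histograms, and return the 25% quantile of the distances so
    # that occluded patches do not dominate the score. Lower is better.
    gy, gx = grid
    hs, ws = template.shape[0] // gy, template.shape[1] // gx
    dists = []
    for i in range(gy):
        for j in range(gx):
            sl = np.s_[i * hs:(i + 1) * hs, j * ws:(j + 1) * ws]
            ht, _ = np.histogram(template[sl], bins=bins, range=(0, 1), density=True)
            hc, _ = np.histogram(candidate[sl], bins=bins, range=(0, 1), density=True)
            dists.append(np.abs(ht - hc).sum())
    return float(np.quantile(dists, 0.25))
```

The quantile makes the robustness concrete: a candidate with half its patches occluded still scores 0 against its own template, while a fully different window scores strictly worse.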

[MST] Mean Shift Tracking: The famous tracker \cite{Comaniciu2000} performs \emph{matching with histograms} rather than using any spatial information about the pixels, making it suited for radical shape changes of the target. It represents the target by an RGB-color histogram. For each new frame, the tracker compares candidate windows with the target region on the basis of the Bhattacharyya metric between their histograms. In order to find the best target location in the new frame, mean shift is used to find the mode of the similarity function, i.e., the location which maximizes the Bhattacharyya coefficient between the histograms. MST uses the target histogram formed in the first frame without any update during tracking.
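One mean-shift iteration illustrates the mode seeking: each pixel is weighted by sqrt(target/candidate) for its histogram bin, and the window center moves to the weighted mean of pixel positions. This single-channel, flat-kernel version is a simplification of \cite{Comaniciu2000}, which uses RGB histograms with a kernel profile.

```python
import numpy as np

def histogram(patch, bins=8):
    h, _ = np.histogram(patch, bins=bins, range=(0, 1))
    return h / h.sum()

def mean_shift_step(frame, center, size, target_hist, bins=8):
    # One mean-shift iteration toward the mode of the Bhattacharyya surface.
    cy, cx = int(round(center[0])), int(round(center[1]))
    h, w = size
    y0, x0 = cy - h // 2, cx - w // 2
    win = frame[y0:y0 + h, x0:x0 + w]
    cand = histogram(win, bins)
    idx = np.minimum((win * bins).astype(int), bins - 1)   # bin of each pixel
    wgt = np.sqrt(target_hist[idx] / np.maximum(cand[idx], 1e-12))
    ys, xs = np.mgrid[0:h, 0:w]
    return (float((wgt * ys).sum() / wgt.sum()) + y0,
            float((wgt * xs).sum() / wgt.sum()) + x0)
```

Starting next to a bright synthetic target, a few iterations move the window center onto the target's centroid.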

[LOT] Locally Orderless Tracking: The tracker \cite{Oron2012} offers adaptation in object appearance by \emph{matching with flexible rigidness}. Given the initial bounding box, the target is segmented into super pixels. Each super pixel is represented by its center of mass and average HSV-values. The target's state is sampled using a Particle Filter with a Gaussian weight around the previous position. Each particle corresponds to a candidate window for which the super pixels are formed. The likelihood of each window is derived from a parameterized Earth Mover's Distance, see also FRT above, between the super pixels of the candidate and the target windows. The new target state is the likelihood-weighted sum over all windows. In LOT, the parameters of the distance steer the flexibility of the target. Updating is done via the parameters of the noise model and the parameters of the Earth Mover's Distance.
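The Earth Mover's Distance itself is easy to make concrete in one dimension: for normalized histograms with unit ground distance between adjacent bins, it reduces to the L1 norm of the difference of cumulative sums. LOT's actual EMD is parameterized and runs over 2-D super pixel signatures, so this is only the underlying idea.

```python
import numpy as np

def emd_1d(p, q):
    # EMD between two normalized 1-D histograms with unit distance between
    # adjacent bins: total mass moved times distance moved, which equals the
    # L1 norm of the difference of the cumulative distributions.
    return float(np.abs(np.cumsum(np.asarray(p, float) - np.asarray(q, float))).sum())
```

Moving one unit of mass across two bins costs exactly 2, and the distance of a histogram to itself is 0.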

[IVT] Incremental Visual Tracking: The tracker in \cite{Ross2008} recognizes that in tracking it is important to keep an \emph{extended model of appearances} capturing the full range of appearances of the target in the past. Eigen images of the target are computed by incremental PCA over the target's intensity-value template. They are stored in a leaking memory to slowly forget old observations. Candidate windows are sampled by Particle Filtering \cite{Isard1998} from the motion model, which is a Gaussian distribution around the previous position. The confidence of each sample is the distance of the intensity feature set of the candidate window to the target's Eigen image subspace. The candidate window with the minimum distance is selected.
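The subspace scoring can be sketched with batch PCA as a stand-in for the paper's incremental SVD with forgetting factor: build an eigenbasis of past target templates, then score a candidate by its reconstruction error against that subspace.

```python
import numpy as np

def build_subspace(templates, k=5):
    # PCA basis of past target appearances (batch stand-in for the paper's
    # incremental SVD with a leaking memory).
    X = np.stack([t.ravel() for t in templates], axis=1).astype(float)
    mean = X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(X - mean, full_matrices=False)
    return mean.ravel(), U[:, :k]

def reconstruction_error(mean, U, candidate):
    # Distance of a candidate window to the Eigen image subspace.
    x = candidate.ravel().astype(float) - mean
    proj = U @ (U.T @ x)
    return float(np.linalg.norm(x - proj))
```

A candidate lying in the span of the observed appearance variations scores near zero, while an unrelated window scores high, which is exactly the confidence used to rank the particles.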

[TAG] Tracking on the Affine Group: The paper \cite{KwonPark2009} also uses an \emph{extended model of appearances}. It extends the traditional \{translation, scale, rotation\} motion types to the more general 2-dimensional affine matrix group. The tracker departs from the extended model of IVT, adopting its appearance model including the incremental PCA of the target intensity values. The tracker samples all possible transformations of the target from the affine group using a Gaussian model.

[TST] Tracking by Sampling Trackers: The paper \cite{Kwon2011} observes that the real world varies severely over time, requiring the tracker to adapt to the current situation. Therefore, the method relies on tracking by sampling many trackers. In this way it maintains an \emph{extended model} of trackers. It can be conceived as the extended equivalent of IVT. Each tracker is made from four components: an appearance model, a motion model, a state representation and an observation model. Each component is further divided into sub-components. The state of the target stores the center, scale and spatial information, the latter further subdivided by vertical projection of edges, similar to the FRT tracker. Multiple locations and scales are considered. Sparse incremental PCA with leaking of HSI- and edge-features captures the state's appearance over the last five frames, similar to IVT. Only the states with the highest Eigen values are computed. The motion model is composed of multiple Gaussian distributions. The observation model consists of Gaussian filter responses of the intensity features. Basic trackers are formed from combinations of the four components. In a new frame, the basic tracker with the best target state is selected from the space of trackers. This one defines the location of the target in TST.

[TMC] Tracking by Monte Carlo sampling: The method \cite{Kwon2009} aims to track targets whose shape changes drastically over time by \emph{sparse optimization} over patch pairs. Given the target location in the first frame, the target is modeled by sampling a fixed number of target patches that are described by edge features and color histograms. Each patch is then associated with a corresponding background patch sampled outside the object boundaries. Patches are inserted as nodes in a star-shaped graph where the edges represent the relative distance to the center of the target. The best locations of the patches in the new frame are found by warping each target patch from its old position. Apart from the appearance probability, the geometric likelihood is based on the difference in location with the old one. The new target location is found by maximum a posteriori estimation. TMC has an elaborate update scheme: adding patches, removing them, shifting them to other locations, or slowly substituting their appearance with the current appearance.

[ACT] Adaptive Coupled-layer Tracking: The recent tracker \cite{Cehovin2011} also aims for rapid and significant appearance changes by \emph{sparse optimization in two layers}. The tracker constrains changes in the local layer by maintaining a global layer. In the local layer, at the start, patches receive uniform weights and are grouped in a regular grid within the target bounding box. They are represented by their gray-level histograms and locations. For a new frame, the locations of the patches are predicted by a constant-velocity Kalman filter and tuned to their positions in the new frame by an affine transformation. Patches which drift away from the target are removed. The global layer contains a representation of appearance, shape and motion. Color HSV-histograms of target and background assess the appearance likelihood per pixel. Motion is defined by computing the optical flow of a set of salient points by KLT. The difference between the velocity of the points and the velocity of the tracker assesses the likelihood of the motion per pixel. And, finally, the degree of being inside or outside the convex hull spanned around the patches gives the shape likelihood of a pixel. The local layer uses these three likelihoods to modify the weight of each patch and to decide whether to remove the patch or not. Finally, the three likelihoods are combined into an overall probability for each pixel to belong to the target. The local layer in ACT is updated by adding and removing patches. The global layer is slowly updated with the properties of the stable patches of the local layer.

[L1T] L1-minimization Tracker: The tracker \cite{Mei2009} employs \emph{sparse optimization by L1} from the past appearance. It starts by using the intensity values in target windows sampled near the target as the bases for a sparse representation. Individual, non-target intensity values are used as alternative bases. Candidate windows in the new frame are sampled from a Gaussian distribution centered at the previous target position by Particle Filtering. They are expressed as a linear combination of these sparse bases by L1-minimization such that many of the coefficients are zero. The tracker expands the number of candidates by also considering affine warps of the current candidates. The search is applied over all candidate windows, selecting the new target by the minimum L1-error. The method concludes with an elaborate target window update scheme.
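The sparse coding step can be sketched with a basic iterative soft-thresholding (ISTA) solver in place of the paper's L1 solver. The dictionary is the target templates plus trivial single-pixel templates (the alternative bases), and a candidate is scored by how well the target part alone reconstructs it; dimensions and the regularization weight are illustrative.

```python
import numpy as np

def ista_l1(A, y, lam=0.01, n_iter=500):
    # Solve min_x 0.5*||Ax - y||^2 + lam*||x||_1 by iterative soft-thresholding.
    L = np.linalg.norm(A, 2) ** 2           # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - A.T @ (A @ x - y) / L       # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # shrinkage
    return x

def candidate_error(target_bases, candidate, lam=0.01):
    # Dictionary = target templates + trivial (identity) templates; score the
    # candidate by the reconstruction error of the target coefficients only.
    A = np.hstack([target_bases, np.eye(len(candidate))])
    x = ista_l1(A, candidate, lam)
    k = target_bases.shape[1]
    return float(np.linalg.norm(target_bases @ x[:k] - candidate))
```

A candidate resembling the stored templates yields a small target-only error, while an unrelated window is mostly absorbed by the trivial bases and scores high, which is the ranking the tracker uses.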

[L1O] L1 Tracker with Occlusion detection: Advancing the \emph{sparse optimization by L1}, the paper \cite{Mei2011} uses L2 least-squares optimization to improve the speed. It also considers occlusion explicitly. The candidate windows are sorted on the basis of the reconstruction error in the least squares, and the ones whose error is bounded by a threshold are selected for L1-minimization. To detect occluded pixels, the tracker considers the coefficients of the alternative bases over a certain threshold. When more than $30\%$ of the pixels are occluded, L1O declares occlusion, which disables the model updating.

[FBT] Foreground-Background Tracker: The most basic approach to an incremental \emph{discriminative classifier} is \cite{Nguyen2006}, where a linear discriminant classifier is trained on Gabor texture feature vectors from the target region against feature vectors derived from the local background surrounding the target. The target is searched in a window centered at the previous location. The highest classification score determines the new position of the target in FBT. Updating is done by a leaking memory on the training data derived from old and new points in the target and in the surrounding. We use the version in \cite{Chu2010} with color SURF features rather than intensities as in the reference.
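The core is a two-class linear discriminant. A minimal Fisher-style sketch on synthetic feature vectors shows the idea; the Gabor/SURF features and the leaking-memory update of the training data are omitted here.

```python
import numpy as np

def train_lda(pos, neg):
    # Two-class linear discriminant: w = Sw^-1 (mu_pos - mu_neg), with the
    # decision threshold at the midpoint of the projected class means.
    mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
    Sw = np.cov(pos.T) + np.cov(neg.T) + 1e-6 * np.eye(pos.shape[1])
    w = np.linalg.solve(Sw, mu_p - mu_n)
    b = -0.5 * (w @ mu_p + w @ mu_n)
    return w, b

def score(w, b, x):
    # Classification score of a feature vector; > 0 means foreground.
    return float(w @ x + b)
```

During tracking, every search-window position yields a feature vector, and the position with the highest score becomes the new target location.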

[HBT] Hough-Based Tracking: The recent tracker \cite{Godec2011} aims at tracking non-rigid targets with a \emph{discriminative classifier with segmentation} of the target. A rectangular bounding box will introduce many errors in the target/background labels for the supervised classifier, especially for non-rigid and articulated targets. Therefore, the authors aim to locate the support of the target through back projection from a Hough Forest. A Hough Forest is an extension of a Random Forest \cite{Breiman2001} by a vector $d$, measuring for positive samples the displacement of each patch to the target center, similar to the R-table \cite{Ballard1981} in Generalized Hough transforms. The target given in the first frame is used to train the classifier on target and background samples. The Lab-color space, first- and second-order derivatives in the x- and y-directions, and a histogram of gradients are used as features for learning a rich appearance model. The Hough Forest provides a probability map of the target. The pixel with the maximum score is selected as the center of the target. The sparse pixels voting for that location are used to segment the target using the GrabCut algorithm \cite{Rother2004} and hence generate positive examples to start HBT anew in the next frame.
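The voting through displacement vectors $d$ can be sketched as an accumulator; in the tracker, the patch positions, displacements and weights come from the Hough Forest leaves, whereas here they are supplied directly for illustration.

```python
import numpy as np

def hough_vote(patch_positions, displacements, weights, shape):
    # Each patch votes for the target center at (its position + learned
    # displacement); the accumulator maximum is the detected center.
    acc = np.zeros(shape)
    for (y, x), (dy, dx), w in zip(patch_positions, displacements, weights):
        cy, cx = y + dy, x + dx
        if 0 <= cy < shape[0] and 0 <= cx < shape[1]:
            acc[cy, cx] += w
    return np.unravel_index(np.argmax(acc), shape)
```

Three consistent patches outvote a single outlier, so the center estimate is robust to a few wrong votes.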

[SPT] Super Pixel tracking: The recent method \cite{Wang2011} embeds the \emph{discriminative classifier in super pixel clustering}. The purpose is to handle changes in scale, motion, and shape with occlusion. The HSI-histograms of the super pixels are extracted in the first 4 frames from the extended target region. Using mean-shift clustering, super pixels are grouped on the basis of the super pixel histograms, with a cluster confidence derived from the overlap of the cluster with the target bounding box. In the new frame, candidate windows are sampled according to a Gaussian distribution around the previous location. The search space is enlarged by considering different scales of the candidate windows. The confidence of a super pixel is derived from the confidence of the cluster it belongs to and from the distance of the super pixel to the center of that cluster. The candidate window with the highest confidence summed over all super pixels in the window is selected as the target. SPT updates the target model every 15th frame.

[MIT] Multiple Instance learning Tracking: The paper \cite{Babenko2009} recognizes the difficulty in taking the current tracker region as the source for positive samples and the surrounding as the source for negative samples, as the target may not completely fill the bounding box or may include part of the background. Therefore, it learns a \emph{discriminative classifier} \cite{Dietterich1997} from positive and negative bags of samples. In the MIL-classifier, the target bounding box, supplemented with other rectangular windows at close range, is grouped into the positive bag. Multiple negative bags are filled with rectangular windows at a further distance. Haar-like features describe the windows. Candidate windows are sampled uniformly in a circular area around the previous location. The highest classification score determines the new position in MIT. In the update of the classifiers, the old classifier parameters are weighted with the input from the new data points.
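The bag likelihood at the heart of MIL fits in two lines: a positive bag is likely if at least one of its instances is, the Noisy-OR model combined with the boosted classifier in \cite{Babenko2009}.

```python
import numpy as np

def noisy_or_bag_probability(instance_probs):
    # MIL likelihood of a positive bag: the bag is positive if at least one
    # instance is positive (Noisy-OR over the instance probabilities).
    instance_probs = np.asarray(instance_probs, dtype=float)
    return float(1.0 - np.prod(1.0 - instance_probs))
```

This is why imprecise positive windows do no harm: as long as one window in the bag truly contains the target, the bag probability stays high.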

[TLD] Tracking, Learning and Detection: The paper in \cite{Kalal2010} aims at using labeled and unlabeled examples for \emph{discriminative classifier} learning. The method is applied to tracking by combining the results of a detector and an optical flow tracker. Given the target bounding box in the first frame, the detector learns an appearance model from many 2-bit binary patterns \cite{Kalal2009}, differentiated from patterns taken from a distant background. The authors use the fast Random Ferns \cite{Ozuysal2007} to learn the detector. When a new frame arrives, the locations with the top 50 or so detector scores are selected. The optical flow tracker applies a KLT to the target region and proposes a target window in the current frame. The normalized cross-correlation is computed for the candidate windows, and the system selects the candidate window with the highest similarity to the object model as the new target. Once the target is localized, positive samples are selected in and around the target and negative samples are selected at a further distance to update the detector target model. If neither of the two trackers outputs a window, TLD declares loss of target. In this way TLD can effectively handle short-term occlusion.
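A single random fern of the kind used in the detector can be sketched as follows; the features are simplified to raw pixel comparisons, whereas \cite{Kalal2009} uses 2-bit binary patterns, and the depth and counts here are illustrative.

```python
import numpy as np

class Fern:
    # One random fern: depth pixel-pair comparisons hash a patch into one of
    # 2^depth leaves; each leaf stores P(target | leaf) from smoothed counts.
    def __init__(self, shape, depth, rng):
        self.pairs = rng.integers(0, shape[0] * shape[1], size=(depth, 2))
        self.pos = np.ones(2 ** depth)   # Laplace-smoothed positive counts
        self.neg = np.ones(2 ** depth)   # Laplace-smoothed negative counts

    def leaf(self, patch):
        flat = patch.ravel()
        bits = flat[self.pairs[:, 0]] > flat[self.pairs[:, 1]]
        return int(bits @ (1 << np.arange(len(bits))))

    def train(self, patch, is_target):
        (self.pos if is_target else self.neg)[self.leaf(patch)] += 1

    def prob(self, patch):
        i = self.leaf(patch)
        return self.pos[i] / (self.pos[i] + self.neg[i])
```

After training on repeated views of a pattern as positives and random patches as negatives, the fern assigns the pattern a much higher target probability than a fresh random patch. TLD averages such probabilities over an ensemble of ferns.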

[STR] STRuck: Structured output tracking with kernels: The \emph{structured supervised classifier} \cite{Hare2011} circumvents the acquisition of positively and negatively labeled data altogether, as it integrates the labeling procedure into the learner in a common framework. Given the target bounding box, it considers different windows obtained by translation in the frame. Using these windows as input, the structured SVM (S-SVM) accepts training data of the form \{appearance of the window, translation\}. The window's appearance is described by Haar features, like MIT, arranged on a 4x4 grid at 2 scales, raw pixel intensities on a 16x16 rescaled version of the target, and intensity histograms. In a new frame, candidate windows are sampled uniformly around the previous position. The classifier computes the discriminant score for each window, and the window with the highest score is selected as the new target location. Subsequently, the new data for updating the S-SVM are derived from the new target location. While updating, the S-SVM learner enforces the constraint that the current location still attains the maximum. Locations which violate the constraint become support vectors. In practice, as maximization during updating is expensive, the learner uses a coarse sampling strategy.
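The Haar features shared by MIT and STR are box-sum differences, computed in constant time from an integral image; a minimal sketch of one two-rectangle feature:

```python
import numpy as np

def integral_image(img):
    # Summed-area table with a zero border: ii[y, x] = sum of img[:y, :x].
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(0).cumsum(1)
    return ii

def box_sum(ii, y, x, h, w):
    # Sum of img[y:y+h, x:x+w] from four table lookups.
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect(ii, y, x, h, w):
    # Two-rectangle Haar-like feature: left half minus right half.
    return box_sum(ii, y, x, h, w // 2) - box_sum(ii, y, x + w // 2, h, w // 2)
```

Because every feature costs only a handful of lookups, dense grids of such features can be evaluated for all candidate windows at frame rate.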

## REFERENCES

• [MST]: D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift," in Proc. CVPR, 2000, pp. 142–149.

• [NCC]: K. Briechle and U. Hanebeck, "Template matching using fast normalized cross correlation," in Proc. SPIE, vol. 4387, 2001, pp. 95–102.

• [KLT]: S. Baker and I. Matthews, "Lucas-Kanade 20 years on: A unifying framework," Int. J. Comput. Vision, vol. 56, no. 3, pp. 221–255, 2004.

• [KAT]: H. Nguyen and A. Smeulders, "Fast occluded object tracking by a robust appearance filter," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, pp. 1099–1104, 2004.

• [FBT]: H. Nguyen and A. Smeulders, "Robust tracking using foreground-background texture discrimination," Int. J. Comput. Vision, vol. 68, no. 3, pp. 277–294, 2006.

• [FRT]: A. Adam, E. Rivlin, and I. Shimshoni, "Robust fragments-based tracking using the integral histogram," in Proc. CVPR, 2006, pp. 798–805.

• [IVT]: D. Ross, J. Lim, and R.-S. Lin, "Incremental learning for robust visual tracking," Int. J. Comput. Vision, vol. 77, pp. 125–141, 2008.

• [MIT]: B. Babenko, M.-H. Yang, and S. Belongie, "Visual tracking with online multiple instance learning," in Proc. CVPR, 2009.

• [L1T]: X. Mei and H. Ling, "Robust visual tracking using L1 minimization," in Proc. ICCV, 2009.

• [TLD]: Z. Kalal, J. Matas, and K. Mikolajczyk, "P-N learning: Bootstrapping binary classifiers by structural constraints," in Proc. CVPR, 2010.

• [STR]: S. Hare, A. Saffari, and P. H. S. Torr, "Struck: Structured output tracking with kernels," in Proc. ICCV, 2011.

• [LOT]: S. Oron, A. Bar-Hillel, D. Levi, and S. Avidan, "Locally orderless tracking," in Proc. CVPR, 2012.

• [HBT]: M. Godec, P. M. Roth, and H. Bischof, "Hough-based tracking of non-rigid objects," in Proc. ICCV, 2011.

• [TAG]: J. Kwon and F. C. Park, "Visual tracking via geometric particle filtering on the affine group with optimal importance functions," in Proc. CVPR, 2009.

• [L1O]: X. Mei, H. Ling, Y. Wu, E. Blasch, and L. Bai, "Minimum error bounded efficient L1 tracker with occlusion detection," in Proc. CVPR, 2011.

• [TMC]: J. Kwon and K. M. Lee, "Tracking of a non-rigid object via patch-based dynamic appearance modeling and adaptive Basin Hopping Monte Carlo sampling," in Proc. CVPR, 2009.

• [SPT]: S. Wang, H. Lu, F. Yang, and M.-H. Yang, "Superpixel tracking," in Proc. ICCV, 2011.

• [ACT]: L. Cehovin, M. Kristan, and A. Leonardis, "An adaptive coupled-layer visual model for robust visual tracking," in Proc. ICCV, 2011.

• [TST]: J. Kwon and K. M. Lee, "Tracking by sampling trackers," in Proc. ICCV, 2011.

## IMPLEMENTATION

• [MST, NCC, KLT, KAT, FBT]: We re-implemented these trackers.

• [IVT]: http://www.cs.toronto.edu/~dross/ivt/ (Note: the randomization initialization with rand('state', 0) and randn('state', 0) produces the same result for repeated runs of the same video. Using rand('state', clock()) and randn('state', clock()) will produce stochastic results.)

• [ACT]: http://www.vicos.si/Research/LocalGlobalTracking (We used the implementation for the conference paper version.)