Adaptive Fragments-Based Tracking of
Non-Rigid Objects Using Level Sets

Prakash Chockalingam, Nalin Pradeep, and Stan Birchfield

Abstract

We present an approach to visual tracking based on dividing a target into multiple regions, or fragments. The target is represented by a Gaussian mixture model in a joint feature-spatial space, with each ellipsoid corresponding to a different fragment. The fragments are automatically adapted to the image data, being selected by an efficient region-growing procedure and updated according to a weighted average of the past and present image statistics. Modeling of target and background are performed in a Chan-Vese manner, using the framework of level sets to preserve accurate boundaries of the target. The extracted target boundaries are used to learn the dynamic shape of the target over time, enabling tracking to continue under total occlusion. Experimental results on a number of challenging sequences demonstrate the effectiveness of the technique.

Algorithm

Our visual tracking approach, based on Gaussian mixture model (GMM) modeling of the object and the level set framework, can be summarized as follows:

Initial Frame:

1. Divide the target into fragments using an efficient region-growing procedure.
2. Model the target and background with a Gaussian mixture model in the joint feature-spatial space, with one ellipsoid per fragment.

Subsequent Frames:

1. Estimate the motion of each fragment to predict the target's new location.
2. Evolve the level set contour, modeling target and background in a Chan-Vese manner, to recover an accurate target boundary.
3. Update the fragments according to a weighted average of the past and present image statistics.
4. Retain the extracted contour to learn the dynamic shape of the target over time.
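As a concrete illustration of the initial-frame step, the sketch below shows a minimal region-growing fragment selector and a per-fragment Gaussian fit in the joint feature-spatial space. It is a simplified stand-in, not the paper's implementation: it operates on a single intensity channel, uses a running-mean tolerance test, and the names (`grow_fragment`, `fragment_gmm`) and the `tol` parameter are our own.

```python
import numpy as np

def grow_fragment(image, seed, tol):
    """Greedy region growing: start from a seed pixel and absorb
    4-connected neighbors whose intensity is within `tol` of the
    fragment's running mean (a simplified stand-in for the paper's
    efficient region-growing procedure)."""
    h, w = image.shape
    visited = np.zeros((h, w), dtype=bool)
    stack = [seed]
    members = []
    total = 0.0
    while stack:
        y, x = stack.pop()
        if visited[y, x]:
            continue
        mean = total / len(members) if members else image[seed]
        if abs(image[y, x] - mean) > tol:
            continue
        visited[y, x] = True
        members.append((y, x))
        total += image[y, x]
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not visited[ny, nx]:
                stack.append((ny, nx))
    return members

def fragment_gmm(image, members):
    """Fit one Gaussian ellipsoid in the joint feature-spatial space
    (x, y, intensity) to a fragment's pixels: mean and covariance."""
    pts = np.array([(x, y, image[y, x]) for y, x in members], dtype=float)
    return pts.mean(axis=0), np.cov(pts.T)
```

Growing one fragment per seed and fitting a Gaussian to each yields the mixture that represents the target; the real procedure uses richer features (e.g. color) in place of raw intensity.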

Results

Below are some experiments showing the tracker's performance on multi-modal objects, objects undergoing drastic shape changes, and large unpredictable frame-to-frame motion. We also handle full occlusion. To accomplish this, the shape of the object contour is learned over time by retaining the output of the tracker in each image frame. To detect occlusion, the rate of decrease of the object size is measured over the previous few frames. Once the object is determined to be occluded, the learned database is searched for the contour that most closely matches, under the Hausdorff distance, the one seen just prior to the occlusion. Then, as long as the target is not visible, the sequence of contours following the match is used to hallucinate the contour; once the target reappears, tracking resumes. This approach prevents tracker failure during complete occlusion and predicts contours when the motion is periodic.
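The occlusion-handling logic described above can be sketched as follows. This is a hedged illustration under simplifying assumptions, not the paper's implementation: `occluded` uses a fixed shrink-ratio test in place of the rate-of-decrease criterion, contours are plain point arrays, and all function names are our own.

```python
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two contours,
    each given as an (N, 2) array of points."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def occluded(areas, shrink_ratio=0.5):
    """Flag occlusion when the target area over the last few frames
    has collapsed below `shrink_ratio` of its earlier size."""
    return len(areas) >= 2 and areas[-1] < shrink_ratio * areas[0]

def hallucinate(db, last_contour, steps):
    """Find the stored contour closest (Hausdorff) to the last one
    seen before occlusion, then replay the contours that followed it
    in the learned sequence while the target is hidden (clamping at
    the end of the database)."""
    i = min(range(len(db)), key=lambda k: hausdorff(db[k], last_contour))
    return [db[min(i + s, len(db) - 1)] for s in range(1, steps + 1)]
```

Replaying the learned sequence is what lets the tracker predict plausible contours when the hidden motion is periodic, as in the walking sequences.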

Tickle Me Elmo doll (MPEG, 4311 KB): The benefit of using a multi-modal framework is clearly shown, with accurate contours being computed despite the complexity of both the target and background as Elmo stands tall, falls down, and sits up.

Monkey sequence (MPEG, 1714 KB): A sequence in which a monkey undergoes rapid motion and drastic shape changes. For example, as the monkey swings around the tree, its shape changes substantially in just a few image frames, yet the algorithm remains locked onto the target and computes an accurate outline of the animal.

Walk behind tree sequence (MPEG, 1583 KB): A sequence in which a person walks behind a tree in a complex scene with many colors. Our approach predicts both the shape and the location of the object and displays the contour accordingly during the complete occlusion.

Girl sequence (MPEG, 2395 KB): A more complex scenario in which a girl, moving quickly in a circular path (a complete revolution occurs in just 35 frames), is frequently occluded by a boy. Our approach is able to handle this difficult scenario.

Walk behind car sequence (MPEG, 1810 KB): A sequence in which a person is partially occluded by a car. Although we do not explicitly handle partial occlusion, the tracker adjusts the contour to cover only the visible portions of the moving person. Note that although the contours are extracted accurately for the body of the person, there are some errors around the face region due to the complexity of the skin color.

Fish sequence (MPEG, 4402 KB): The fish are multicolored and swim in front of a complex, textured, multicolored background. Note that the fish are tracked successfully despite their changing shape.

Comparison

To provide a quantitative evaluation of our approach, we generated ground truth for the experiments by manually labeling the object pixels in some of the intermediate frames (every 5 frames for the monkey, tree, and car sequences; every 10 frames for Elmo; and every 4-6 frames for the girl sequence, avoiding occluded frames in the latter). We computed the error of each algorithm on an image of the sequence as the number of pixels misclassified as foreground or background, normalized by the image size. We compared our algorithm against two alternative trackers: the linear RGB histogram [1] and the RGB histogram. In the latter two cases the contours were extracted using the level set framework, but the fragment motion was not used. To evaluate the importance of using fragment motion, we also ran our algorithm without this component. Note that both versions of our algorithm were fully automatic, whereas the linear RGB histogram and the RGB histogram trackers were manually restarted after every occlusion, to simulate what they would be capable of achieving even with a perfect module for handling full occlusion. The table below provides the original BMP image sequences, the ground truth data, and a comparison of the results of the three approaches. The fourth column shows a plot of the normalized error against the frame number.
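The error metric used in this comparison is straightforward to state in code. The sketch below assumes boolean foreground masks; the name `normalized_error` is our own.

```python
import numpy as np

def normalized_error(pred_mask, gt_mask):
    """Per-frame tracking error: the number of pixels misclassified
    as foreground or background (i.e. where the predicted mask and
    the ground-truth mask disagree), divided by the image size."""
    return np.count_nonzero(pred_mask != gt_mask) / pred_mask.size
```

Averaging this quantity over the labeled frames of a sequence gives a single score per tracker, and plotting it per frame gives the curves in the fourth column of the table.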

Original BMP Sequence | Ground Truth Data | Comparison Results | Normalized Error Plot
19504 KB | 48 KB | 3127 KB | (plot)
17635 KB | 52 KB | 2019 KB | (plot)
23576 KB | 25 KB | 1186 KB | (plot)
33464 KB | 22 KB | 806 KB | (plot)
42417 KB | 49 KB | 1693 KB | (plot)

References

[1] R. Collins, Y. Liu, and M. Leordeanu. On-line selection of discriminative tracking features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1631–1643, Oct. 2005.

[2] A. Yilmaz, X. Li, and M. Shah. Contour-based object tracking with occlusion handling in video acquired using mobile cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(11):1531–1536, Nov. 2004.

[3] T. Zhang and D. Freedman. Tracking objects using density matching and shape priors. In Proceedings of the International Conference on Computer Vision, volume 2, pages 1056–1062, 2003.

[4] S. Jehan-Besson, M. Barlaud, G. Aubert, and O. Faugeras. Shape gradients for histogram segmentation using active contours. In Proceedings of the International Conference on Computer Vision, volume 1, pages 408–415, 2003.

[5] Y. Shi and W. C. Karl. Real-time tracking using level sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 34–41, 2005.
