ByteDance, the Chinese multinational IT company, recently released ByteTrack, a multi-object tracking (MOT) library that estimates the bounding boxes and identities of objects in a video.
The library mainly addresses what to do with objects that receive low detection scores. Most trackers simply throw such detections away, for instance those of occluded objects, which leads to a non-negligible number of missed true objects and fragmented trajectories. ByteTrack instead associates every detection box rather than only the high-score ones, making it a simple, effective and generic association method. For low-score detection boxes, it uses their similarity to existing tracklets to recover true objects while filtering out background detections.
Unlike traditional methods that keep only high-score detection boxes, BYTE, the new association method proposed by ByteDance, keeps every detection box and separates them into high-score and low-score groups. The method first associates the high-score detection boxes with the tracklets. Under occlusion, motion blur, or size change, some tracklets fail to match any high-score detection box. BYTE then associates the low-score detection boxes with these unmatched tracklets, which recovers the objects in low-score boxes and filters out the background at the same time.
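The two-stage matching described above can be illustrated with a minimal Python sketch. It is not ByteTrack's actual code: the names `Detection`, `greedy_match` and `byte_associate`, the thresholds, and the use of IoU as the similarity measure are all assumptions made for illustration, and a simple greedy matcher stands in for whatever assignment solver a production tracker would use.

```python
from dataclasses import dataclass

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Detection:
    box: Box
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def greedy_match(tracks: dict[int, Box], dets: list[Detection], min_iou: float):
    """Pair tracks with detections, highest IoU first (a simplification).

    Returns (matches, unmatched_track_ids, unmatched_detections).
    """
    pairs = sorted(
        ((iou(tbox, d.box), tid, j)
         for tid, tbox in tracks.items()
         for j, d in enumerate(dets)),
        reverse=True,
    )
    matches: dict[int, Detection] = {}
    used: set[int] = set()
    for overlap, tid, j in pairs:
        if overlap < min_iou:
            break  # every remaining pair overlaps even less
        if tid in matches or j in used:
            continue
        matches[tid] = dets[j]
        used.add(j)
    left_tracks = [tid for tid in tracks if tid not in matches]
    left_dets = [d for j, d in enumerate(dets) if j not in used]
    return matches, left_tracks, left_dets


def byte_associate(tracks, dets, high_thresh=0.6, low_thresh=0.1, min_iou=0.3):
    """Two-stage association; all threshold values here are assumed."""
    high = [d for d in dets if d.score >= high_thresh]
    low = [d for d in dets if low_thresh <= d.score < high_thresh]

    # Stage 1: high-score boxes get first pick of the tracklets.
    matches, left_tracks, new_candidates = greedy_match(tracks, high, min_iou)

    # Stage 2: low-score boxes may only recover tracklets left unmatched.
    remaining = {tid: tracks[tid] for tid in left_tracks}
    recovered, lost_tracks, _background = greedy_match(remaining, low, min_iou)
    matches.update(recovered)

    # Low-score boxes that match nothing are treated as background and dropped.
    return matches, lost_tracks, new_candidates


tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
dets = [
    Detection((1, 1, 11, 11), 0.9),    # clear view of object 1
    Detection((21, 21, 31, 31), 0.3),  # object 2, partly occluded
    Detection((50, 50, 60, 60), 0.2),  # background clutter
]
matches, lost, new = byte_associate(tracks, dets)
```

In this toy frame, a tracker that kept only high-score boxes would lose tracklet 2, whereas the second stage recovers it from the 0.3-score box and the clutter box is discarded because it overlaps no tracklet.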
To push the state of the art in MOT, ByteTrack pairs a high-performance detector, YOLOX, with the association method BYTE. YOLOX switches the YOLO series of detectors to an anchor-free design and adds further techniques such as an effective label assignment strategy. BYTE takes a video sequence as input, along with an object detector and a Kalman filter. Its output is the tracks of the video, which contain the bounding boxes and identities of the objects in each frame.
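In trackers of this kind, the Kalman filter is typically used to predict where each tracklet will be in the next frame before detections are matched to it. The following is a minimal sketch of that idea for a single coordinate, assuming a constant-velocity motion model; the class name and all noise parameters are hypothetical, and a real tracker would filter the full bounding-box state rather than one number.

```python
class ConstantVelocityKalman1D:
    """Tracks one coordinate (e.g. a box centre) with a [position, velocity] state."""

    def __init__(self, position, pos_var=1.0, vel_var=10.0,
                 process_var=0.01, meas_var=1.0):
        self.p = float(position)  # position estimate
        self.v = 0.0              # velocity estimate (unknown at start)
        self.P = [[pos_var, 0.0], [0.0, vel_var]]  # state covariance
        self.q = process_var      # assumed process noise
        self.r = meas_var         # assumed measurement noise

    def predict(self, dt=1.0):
        """Advance the state by dt: x <- F x, P <- F P F^T + Q."""
        (p00, p01), (p10, p11) = self.P
        self.p += self.v * dt
        self.P = [
            [p00 + dt * (p01 + p10) + dt * dt * p11 + self.q, p01 + dt * p11],
            [p10 + dt * p11, p11 + self.q],
        ]
        return self.p

    def update(self, measured_position):
        """Correct the state with a position-only measurement."""
        (p00, p01), (p10, p11) = self.P
        residual = measured_position - self.p
        s = p00 + self.r              # innovation variance
        k0, k1 = p00 / s, p10 / s     # Kalman gain
        self.p += k0 * residual
        self.v += k1 * residual
        self.P = [
            [(1 - k0) * p00, (1 - k0) * p01],
            [p10 - k1 * p00, p11 - k1 * p01],
        ]


kf = ConstantVelocityKalman1D(position=0.0)
for frame in range(1, 10):
    kf.predict()
    kf.update(float(frame))  # the object moves one unit per frame
next_position = kf.predict()  # where to look for the object in frame 10
```

After a few frames the filter has learned the object's velocity, so its prediction for the next frame lands close to the true position even if the detection there is weak or missing.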
ByteTrack was evaluated on the half validation set of MOT17 using different combinations of training data. Trained only on the half training set of MOT17, it achieves 75.8 MOTA (multi-object tracking accuracy), outperforming most methods. When further trained with the CrowdHuman, CityPersons, and ETHZ datasets, it achieves 76.7 MOTA and 79.7 IDF1 (identification F1 score). One possible reason for the improvement is the use of strong augmentations such as Mosaic and Mixup.
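For readers unfamiliar with the two metrics, the sketch below computes them from their standard definitions: MOTA penalizes missed objects, false alarms and identity switches relative to the number of ground-truth objects, while IDF1 is the F1 score over identity-matched detections. The function names are hypothetical and the counts in the usage lines are invented for illustration; they are not ByteTrack's results.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


# Invented counts, purely to show the arithmetic.
example_mota = mota(false_negatives=100, false_positives=50,
                    id_switches=10, num_gt=1000)
example_idf1 = idf1(idtp=800, idfp=100, idfn=200)
print(f"MOTA: {example_mota:.1%}, IDF1: {example_idf1:.1%}")
```

Because MOTA counts errors per ground-truth object, it mostly reflects detection quality, while IDF1 is more sensitive to whether each object keeps the same identity over time.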
BYTE is an effective data association method for multi-object tracking that can be incorporated into existing trackers to achieve consistent improvements. With ByteTrack ranking at the top of the official MOT challenge leaderboard, ByteDance presents it as a strong tracker.
ByteTrack is presented as a tracker that is very robust to occlusion and that tracks accurately because it also associates low-score detection boxes. The work also shows how multi-object tracking can be improved by making the best use of detection results. The ByteDance research team expects ByteTrack to be attractive and effective in real applications thanks to its high accuracy, fast speed, and simplicity.