Suppose you have a video in which a ball appears in every frame and needs to be tracked. What is the standard process for tracking it? As I understand it, there are two steps: training and tracking.
Training
Sample videos -> features (rough): use a detector (without a feature descriptor) to extract rough features from the sample videos.
Features (rough) -> feature descriptor: train a machine learning algorithm (e.g. an SVM) on the rough features; once trained, the algorithm can generate a feature descriptor that describes the features of the sample videos.
Tracking
Video -> features (relatively accurate): plug the feature descriptor into the detector, and use the detector (now with the feature descriptor) to extract features from the video to be tracked.
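To make the two steps concrete, here is a minimal sketch of the pipeline as described above. Everything in it is an assumption made for illustration: the frames are synthetic grayscale grids with a bright 3x3 "ball", the rough feature is just the mean and minimum intensity of a window, and a nearest-centroid classifier stands in for the SVM. The learned model plays the role of what the steps above call the feature descriptor. All function names are hypothetical and do not come from any tracking library.

```python
import random

PATCH = 3  # side length of the sliding window (assumption for this toy example)

def make_frame(size, ball_xy, rng):
    """Synthetic frame: dim noisy background with a bright PATCH x PATCH 'ball'."""
    frame = [[rng.uniform(0.0, 0.3) for _ in range(size)] for _ in range(size)]
    bx, by = ball_xy
    for y in range(by, by + PATCH):
        for x in range(bx, bx + PATCH):
            frame[y][x] = rng.uniform(0.8, 1.0)
    return frame

def windows(frame):
    """Yield the top-left corner of every window that fits inside the frame."""
    size = len(frame)
    for y in range(size - PATCH + 1):
        for x in range(size - PATCH + 1):
            yield x, y

def rough_feature(frame, x, y):
    """Rough feature of the window at (x, y): (mean intensity, min intensity)."""
    vals = [frame[j][i] for j in range(y, y + PATCH) for i in range(x, x + PATCH)]
    return (sum(vals) / len(vals), min(vals))

def centroid(features):
    """Component-wise mean of a list of 2-D feature tuples."""
    n = len(features)
    return tuple(sum(f[k] for f in features) / n for k in range(2))

def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def train(sample_frames, ball_positions):
    """Training step: learn one centroid per class from labelled windows."""
    ball, background = [], []
    for frame, pos in zip(sample_frames, ball_positions):
        for x, y in windows(frame):
            target = ball if (x, y) == pos else background
            target.append(rough_feature(frame, x, y))
    return {"ball": centroid(ball), "background": centroid(background)}

def track(video, model):
    """Tracking step: in each frame, pick the most ball-like window."""
    def score(frame, xy):
        f = rough_feature(frame, *xy)
        # Higher when f is near the ball centroid and far from the background one.
        return sq_dist(f, model["background"]) - sq_dist(f, model["ball"])
    return [max(windows(frame), key=lambda xy: score(frame, xy)) for frame in video]

# Usage: train on three labelled sample frames, then track a ball moving diagonally.
rng = random.Random(0)
train_positions = [(1, 1), (4, 2), (2, 5)]
model = train([make_frame(8, p, rng) for p in train_positions], train_positions)

true_path = [(0, 0), (1, 1), (2, 2), (3, 3)]
video = [make_frame(8, p, rng) for p in true_path]
tracked = track(video, model)
```

Here `tracked` holds one (x, y) window position per frame. A real tracker would replace the intensity statistics with a proper descriptor and the centroid model with a trained classifier, and would usually search only near the previous position instead of scanning the whole frame.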