A New and Superior Pipeline to Track Players in Sports Games
Piotr Skalski's player-tracking pipeline combines SAM 2.1, fine-tuned VLMs, and SigLIP clustering to deliver pixel-level accuracy on a single T4 GPU.

Piotr Skalski (@skalskip92) last week released a superior solution to a very interesting problem: identify and track each individual player, with pixel-by-pixel instance segmentation, in a basketball game frame by frame (see image).
Applications
This pipeline could serve a host of applications, including:
- Refereeing: Support officiating, either by assisting human referees or by running autonomously and making calls on its own.
- Commentary: Help commentators identify players and describe what they are doing, or go a step further to fully AI-driven commentary built on this model.
- Broadcasting: Add polished, informative visual annotations during live play or in replays.
- Training: Help coaches analyze the game and improve both team strategy and each player's skills.
Some of these benefit amateur games too: with just one or two cameras, you can have an AI referee (or VAR) and an AI commentator for the game.
The approach
Tracking players in sports has been researched before because of its applications, but no prior system reaches this level of quality. He tackled each well-known issue in this problem with newer methods.
Previous models used off-the-shelf optical character recognition (OCR) to detect players' shirt numbers, and it did not work well. He instead fine-tuned a dedicated vision-language model (VLM) for this task. A VLM may look like overkill, but identifying and tracking numbers correctly is key to solving the other issues below.
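The article does not name the VLM he fine-tuned, so the sketch below is a generic stand-in: a hypothetical fine-tuned checkpoint (the model ID is made up) queried through Hugging Face's vision-to-text interface to read the number from a single player crop.

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Hypothetical checkpoint; the article does not say which VLM was fine-tuned.
MODEL_ID = "your-org/jersey-number-vlm"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID)

# Crop of a single player, e.g. cut from the tracked mask's bounding box.
crop = Image.open("player_crop.jpg")
prompt = "What jersey number is this player wearing? Answer with digits only."

inputs = processor(images=crop, text=prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8)
print(processor.batch_decode(out, skip_special_tokens=True)[0])  # e.g. "23"
```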
Previous models used simple box trackers, which suffered in crowded spaces with many players. He uses the new SAM 2.1 model to track a pixel-by-pixel mask covering each player's silhouette rather than just a box. The mask gives the tracker more information for predicting the trajectory of each mask, so it can keep the correct mask assigned to each player during and after a crowded situation. Better identification of player numbers, as above, certainly helps too.
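A minimal sketch of mask propagation with the sam2 package from facebookresearch/sam2; the config and checkpoint names follow the SAM 2.1 release and may need adjusting, and the first-frame boxes here are hypothetical detector output.

```python
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

# Config/checkpoint names follow the SAM 2.1 release; verify against the repo.
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_t.yaml", "checkpoints/sam2.1_hiera_tiny.pt"
)

# Depending on the sam2 version, this is an .mp4 path or a directory of frames.
state = predictor.init_state(video_path="game_clip.mp4")

# Hypothetical first-frame player boxes (x1, y1, x2, y2) from the detector.
first_frame_boxes = [np.array([100, 200, 180, 420], dtype=np.float32)]

# Seed one tracked object per player detected in the first frame.
for obj_id, box in enumerate(first_frame_boxes):
    predictor.add_new_points_or_box(state, frame_idx=0, obj_id=obj_id, box=box)

# Propagate pixel-level masks for every player through the whole clip.
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    masks = (mask_logits > 0.0).cpu().numpy()  # boolean mask per tracked player
```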
Previously, players were assigned to teams based on shirt colors, but that could go wrong for many reasons, such as lighting and logos on shirts. He uses another vision-language model, SigLIP, for this task, together with a clustering algorithm that splits the players into two teams.
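The exact embedding-plus-clustering setup is not specified, but a minimal sketch, assuming the Hugging Face SigLIP checkpoint and scikit-learn's KMeans, could look like this:

```python
import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import AutoProcessor, SiglipVisionModel

CKPT = "google/siglip-base-patch16-224"
processor = AutoProcessor.from_pretrained(CKPT)
model = SiglipVisionModel.from_pretrained(CKPT)

# Hypothetical per-player crops cut out of one frame using the tracked masks.
player_crops = [Image.open(p) for p in ["p0.jpg", "p1.jpg", "p2.jpg", "p3.jpg"]]

with torch.no_grad():
    inputs = processor(images=player_crops, return_tensors="pt")
    embeddings = model(**inputs).pooler_output  # one embedding vector per crop

# k=2 clustering splits the embeddings into the two teams.
team_ids = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings.numpy())
```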
The pipeline
The pipeline runs in this order (a skeleton sketch follows the list):
- Sample frames from video, e.g., every 0.1 second.
- Run first-frame detection to detect players, the ball, shirt numbers, referees…
- Initialize object tracking with SAM 2.1.
- Assign each player to a team.
- Read each player's shirt number with the fine-tuned VLM.
- Assign each number to a player.
- Cross-frame validation: you cannot trust a single frame.
- Output annotated video.
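Put together, the orchestration might look like the skeleton below. Every helper function (sample_frames, detect_first_frame, and so on) is a hypothetical stand-in for the corresponding stage above; only the majority-vote logic for cross-frame validation is concrete.

```python
import collections


def track_players(video_path: str, output_path: str, sample_every: float = 0.1):
    """Hypothetical end-to-end skeleton; each helper stands in for a stage."""
    frames = sample_frames(video_path, interval_s=sample_every)

    # First-frame detection: players, the ball, shirt numbers, referees...
    detections = detect_first_frame(frames[0])

    # Initialize SAM 2.1 mask tracking from the first-frame player boxes.
    tracker = init_sam_tracking(video_path, detections.player_boxes)

    annotated = []
    number_votes = collections.defaultdict(collections.Counter)
    for i, frame in enumerate(frames):
        masks = tracker.masks_for_frame(i)
        teams = assign_teams(frame, masks)      # SigLIP embeddings + clustering
        numbers = read_numbers(frame, masks)    # fine-tuned VLM per player crop
        for player_id, number in numbers.items():
            number_votes[player_id][number] += 1  # accumulate per-frame votes
        annotated.append(render(frame, masks, teams))

    # Cross-frame validation: keep the majority-vote number per player.
    final = {p: votes.most_common(1)[0][0] for p, votes in number_votes.items()}
    write_video(output_path, annotated, final)
```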
Pros and cons
Even though it uses many advanced components, his unified solution can run on a single T4 GPU, one of the cheapest GPUs available from cloud providers, at around $250/month on-demand and $100 spot. One limitation is that it is not real-time yet, but he said it can be made real-time with optimization, quantization, and pruning.