AI Engineering · 3 min read

A New and Superior Pipeline to Track Players in Sports Games

Piotr Skalski's player-tracking pipeline combines SAM 2.1, fine-tuned VLMs, and SigLIP clustering to deliver pixel-level accuracy on a single T4 GPU.


Piotr Skalski (@skalskip92) last week released a superior solution to a very interesting problem: identify and track each individual player, with pixel-by-pixel instance segmentation, in a basketball game, frame by frame (see image).

Applications

That could be used for a host of applications, including:

  • Refereeing: Support officiating, either by assisting human referees or by running autonomously and making calls on its own.
  • Commentary: Help commentators identify players and describe what they are doing, or go a step further to AI commentators built on this model.
  • Broadcasting: Add attractive, informative visual annotations during the game or in replays.
  • Training: Help coaches analyze the game and improve both strategy and each player’s skills.

Some of these are valuable for amateur games too. Now all you need is one or two cameras, and you can have an AI referee (or VAR) and an AI commentator for the game.

Piotr Skalski tracking pipeline for basketball

The approach

Tracking players in sports has been researched before because of its applications, but no previous approach reaches this level of quality. He tackled each well-known issue in this problem with new methods.

  1. Previous models used off-the-shelf optical character recognition (OCR) to detect players’ shirt numbers, and that did not work very well. He fine-tuned a dedicated vision-language model (VLM) for this task. A VLM looks like overkill, but identifying and tracking numbers correctly is key to solving the other issues below.

  2. Previous models used simple box trackers, and they suffered in crowded spaces with many players. He uses the new SAM 2.1 model to predict a mask covering the silhouette of each player, pixel by pixel, rather than just a box. With full masks, the tracker has more information to predict the trajectory of each mask and assign the correct mask to each player during and after a crowded situation. Better identification of player numbers, as above, certainly helps too.

  3. Before, assigning players to teams was done based on shirt colors, but that could go wrong for many reasons, such as lighting and logos on shirts. He uses another vision-language model, SigLIP, to embed each player's image, together with a clustering algorithm that splits the players into two teams (see the sketch after this list).
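To make the team-assignment step concrete, here is a minimal sketch, not his actual code: crop each detected player, embed the crops with a SigLIP vision encoder from Hugging Face transformers, and split the embeddings into two clusters with k-means. The checkpoint name and the choice of k-means are assumptions for illustration.

```python
# Minimal sketch: cluster SigLIP embeddings of player crops into two teams.
# The checkpoint and k-means are assumptions, not necessarily what Piotr used.
import numpy as np
import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import AutoProcessor, SiglipVisionModel

CHECKPOINT = "google/siglip-base-patch16-224"  # assumed SigLIP checkpoint
processor = AutoProcessor.from_pretrained(CHECKPOINT)
model = SiglipVisionModel.from_pretrained(CHECKPOINT).eval()


def embed_crops(crops: list[Image.Image]) -> np.ndarray:
    """Return one pooled SigLIP embedding per player crop."""
    inputs = processor(images=crops, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.pooler_output.cpu().numpy()


def split_into_teams(crops: list[Image.Image]) -> np.ndarray:
    """Cluster player crops into two groups; returns a 0/1 team label per crop."""
    embeddings = embed_crops(crops)
    return KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
```

The cluster labels are arbitrary (0 vs. 1), so a real pipeline still has to map each cluster to a named team, e.g., from a reference player's jersey.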

The pipeline

The pipeline runs in this order:

  1. Sample frames from the video, e.g., every 0.1 second (see the sketch after this list).
  2. Run detection on the first frame to find players, the ball, shirt numbers, referees…
  3. Initialize object tracking with SAM 2.1.
  4. Assign each player to a team.
  5. Run OCR to detect player numbers.
  6. Assign each number to a player.
  7. Run cross-frame validation; you cannot trust a single frame.
  8. Output annotated video.
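Step 1 is generic enough to show concretely. Below is a minimal sketch, assuming OpenCV and a 0.1-second interval (both illustrative choices, not details from his write-up), of sampling frames at a fixed rate before handing them to detection and tracking.

```python
# Minimal sketch of step 1: sample frames from a video at a fixed interval.
# OpenCV and the 0.1 s default are assumptions for illustration.
import cv2


def sample_frames(video_path: str, interval_s: float = 0.1):
    """Yield one frame roughly every `interval_s` seconds of video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    step = max(1, round(fps * interval_s))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield frame  # hand this frame to detection, tracking, and OCR downstream
        index += 1
    cap.release()
```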

Pros and cons

Even though it uses many advanced components, his unified solution can run on a single T4 GPU, one of the cheapest GPUs available from cloud providers at around $250/month on-demand and about $100/month spot. One limitation is that it is not real-time yet, but he says it can reach real time with optimization, quantization, and pruning.
