compressing dashcam video for comma.ai

comma.ai ran a video compression challenge with a simple premise: take a 1-minute, 37.5 MB dashcam clip and make it as small as possible while preserving what matters. What matters is not pixel fidelity but whether a downstream perception stack can still segment the road and track vehicle pose from the compressed frames. The contest drew 107 submissions across two months, with the winner achieving a score of 0.193 -- compressing the video to 408 KB, a 92x reduction, while inducing negligible distortion in the models that consume the data.

the metric

The scoring function combines three terms into a single number (lower is better):

score = 100 * segnet_distortion + 25 * compression_rate + sqrt(10 * posenet_distortion)

SegNet distortion measures per-pixel class disagreement between the original and reconstructed frames. The network segments each frame into categories like road, car, and sky; the distortion is the fraction of pixels whose class changes. A score of 0.0 means identical segmentations.

PoseNet distortion measures the MSE between pose predictions on consecutive frame pairs. PoseNet extracts a 12-dimensional pose vector (position and orientation) from two sequential frames. This tests whether temporal dynamics survive compression. The sqrt wrapper dampens its contribution relative to the linear terms.

Compression rate is the ratio of compressed archive size to original size (37,545,489 bytes). The 25x multiplier makes rate competitive with the distortion terms: at the baseline rate of 0.06, rate contributes 1.5 points to the score.
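Putting the three terms together, the metric is a one-liner. A minimal sketch (the weights come from the formula above; distortion inputs are placeholders you would get from running the perception models):

```python
import math

def challenge_score(segnet_distortion: float,
                    posenet_distortion: float,
                    compressed_bytes: int,
                    original_bytes: int = 37_545_489) -> float:
    """Combined challenge score -- lower is better."""
    rate = compressed_bytes / original_bytes
    return (100 * segnet_distortion
            + 25 * rate
            + math.sqrt(10 * posenet_distortion))

# sanity check against a number quoted in the post: an uncompressed,
# distortion-free submission scores exactly the pure rate penalty
print(challenge_score(0.0, 0.0, 37_545_489))  # 25.0
```

This also makes the weighting concrete: at the baseline rate of 0.06 the rate term alone contributes 25 × 0.06 = 1.5 points.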

The baseline submission uses ffmpeg with x265 at CRF 31 on downscaled 512x384 frames, scoring 4.39. No compression at all scores 25.0 (pure rate penalty, zero distortion).
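A baseline along these lines is a single ffmpeg invocation. The exact flags the organizers used live in the repo; treat this command builder as an approximation (the filenames are hypothetical):

```python
import shlex

def baseline_encode_cmd(src: str, dst: str, crf: int = 31) -> list[str]:
    """Approximate the x265 baseline: downscale to 512x384, encode at CRF 31."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vf", "scale=512:384",   # downscale before encoding
        "-c:v", "libx265",
        "-crf", str(crf),         # the main quality/size knob
        dst,
    ]

print(shlex.join(baseline_encode_cmd("drive.mp4", "out.mp4")))
```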

The evaluation runs over 600 samples drawn from the provided test video, with a 30-minute time limit on either a GitHub-hosted T4 GPU runner (16 GB VRAM) or a 4-core CPU runner (16 GB RAM). Submissions provide a compressed archive and an inflate script that reconstructs raw frames. External tools and libraries are fair game; any learned artifacts, such as network weights, count toward the compressed size.
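In its simplest form, an inflate script just unpacks the archive and streams fixed-size raw frames to disk. A minimal skeleton, assuming a hypothetical archive that is nothing more than a zlib-compressed blob of raw RGB frames (real submissions are far more elaborate):

```python
import zlib

FRAME_W, FRAME_H, CHANNELS = 512, 384, 3   # hypothetical frame geometry
FRAME_BYTES = FRAME_W * FRAME_H * CHANNELS

def inflate(archive: bytes) -> list[bytes]:
    """Unpack a zlib blob into fixed-size raw frames (illustrative only)."""
    raw = zlib.decompress(archive)
    assert len(raw) % FRAME_BYTES == 0, "archive is not a whole number of frames"
    return [raw[i:i + FRAME_BYTES] for i in range(0, len(raw), FRAME_BYTES)]

# round-trip two synthetic frames
frames = [bytes([n]) * FRAME_BYTES for n in (0, 255)]
archive = zlib.compress(b"".join(frames))
assert inflate(archive) == frames
```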

two families of approaches

Submissions split cleanly into two camps: traditional codecs and neural representations.

Codec approaches (AV1, HEVC, VP9) dominate the lower half of the leaderboard. The best pure-codec result scores 1.891, more than 9x worse than the winner. These submissions explore encoder parameter tuning: downscaling, sharpening filters, region-of-interest encoding, GOP structure, and grain synthesis. Several use SVT-AV1 with psychovisual optimizations. A grid search over ffmpeg parameters (included in the repo) shows the Pareto frontier of compression rate vs. distortion for traditional codecs.
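The shape of such a parameter sweep is easy to sketch: enumerate encoder configurations, encode with each, and score the results. A hedged illustration (the grid values and filenames are invented; the repo's actual sweep covers more knobs):

```python
from itertools import product

# hypothetical search grid over the two biggest rate/distortion knobs
crfs = [27, 31, 35, 39]
scales = ["640:480", "512:384", "384:288"]

def encode_cmd(src: str, dst: str, crf: int, scale: str) -> list[str]:
    return ["ffmpeg", "-y", "-i", src, "-vf", f"scale={scale}",
            "-c:v", "libsvtav1", "-crf", str(crf), dst]

grid = [encode_cmd("drive.mp4", f"out_{crf}_{scale.replace(':', 'x')}.mp4",
                   crf, scale)
        for crf, scale in product(crfs, scales)]
print(len(grid))   # one command per (CRF, scale) configuration
```

Each output would then be scored with the metric above; the lower envelope of those points is the codec Pareto frontier.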

Neural representation approaches take the top 25 slots. The dominant architecture is HNeRV (Hierarchical Neural Representation for Video), which trains a small neural network to memorize the video content. Instead of storing pixel data, the submission stores trained network weights. At inflate time, the network reconstructs frames by evaluating on frame indices. The winning submission (#101, score 0.193) packages a decoder network, temporal latent codes, and a correction sidecar into 408 KB.

how HNeRV compression works

A HNeRV-based compressor works in three stages:

  1. Training: a convolutional decoder network learns to reconstruct frames from compact latent representations. Each frame (or group of frames) gets its own latent code. The decoder is shared across all frames, while latents are per-frame. Training minimizes reconstruction loss -- how well the output matches the original.

  2. Quantization: trained weights and latents are quantized to reduce storage. Float32 parameters are mapped to lower precision (8-bit, 6-bit, or even 4-bit). Quantization-aware fine-tuning recovers lost quality.

  3. Entropy coding: the quantized parameters are packed with a lossless codec (typically a variant of arithmetic coding or ANS). This removes statistical redundancy in the weight distribution.
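Stages 2 and 3 can be illustrated without any neural network: uniformly quantize a float32 tensor to 8 bits, then losslessly pack the result. zlib stands in here for the arithmetic/ANS coder a real submission would use:

```python
import zlib
import numpy as np

def quantize_u8(w: np.ndarray):
    """Map float32 weights onto 256 uniform levels; keep (lo, hi) to invert."""
    lo, hi = float(w.min()), float(w.max())
    q = np.round((w - lo) / (hi - lo) * 255).astype(np.uint8)
    return q, lo, hi

def dequantize_u8(q: np.ndarray, lo: float, hi: float) -> np.ndarray:
    return (q.astype(np.float32) / 255) * (hi - lo) + lo

rng = np.random.default_rng(0)
weights = rng.normal(size=10_000).astype(np.float32)   # stand-in for trained weights

q, lo, hi = quantize_u8(weights)
packed = zlib.compress(q.tobytes())            # stage 3: lossless entropy coding
restored = dequantize_u8(q, lo, hi)

print(len(packed) / weights.nbytes)            # fraction of the float32 size
print(float(np.abs(weights - restored).max())) # worst-case quantization error
```

The worst-case error is half a quantization step, which is what the quantization-aware fine-tuning pass then trains around.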

The resulting archive contains the decoder weights, per-frame latents, and any metadata needed to reconstruct frames in order. Inflate loads the decoder, feeds each latent through it, and writes raw frame data to disk.
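The training stage and the inflate loop can be condensed into a toy PyTorch example. Everything here is invented for illustration -- tiny random frames stand in for the video, and the architecture sizes are arbitrary -- but the structure (per-frame latents, one shared decoder, memorization by reconstruction loss) matches the description above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# toy "video": 4 frames of 3x16x16 pixels
frames = torch.rand(4, 3, 16, 16)

# per-frame latents plus one shared convolutional decoder
latents = nn.Parameter(torch.randn(4, 8, 4, 4))
decoder = nn.Sequential(
    nn.ConvTranspose2d(8, 16, 4, stride=2, padding=1),  # 4x4 -> 8x8
    nn.ReLU(),
    nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),  # 8x8 -> 16x16
    nn.Sigmoid(),
)

opt = torch.optim.Adam([latents, *decoder.parameters()], lr=1e-2)
initial_loss = None
for _ in range(300):
    loss = nn.functional.mse_loss(decoder(latents), frames)
    if initial_loss is None:
        initial_loss = loss.item()
    opt.zero_grad()
    loss.backward()
    opt.step()

# "inflate": reconstruct frame i by pushing latent i through the decoder
recon = decoder(latents).detach()
print(initial_loss, loss.item())   # memorization drives the loss down
```

The archive in this toy would be the quantized `decoder` weights plus the four latents; the frame index is implicit in the latent ordering.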

Several winning entries add a "correction sidecar": a compact representation of the residual between the neural reconstruction and the original, applied as a refinement pass after the main decode. The trick is that the main decode gets close enough that the residual is sparse and compresses well.
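The sidecar idea is simple to sketch with numpy: quantize the residual coarsely so that near-zero entries collapse to exactly zero, losslessly pack the result, and add it back at decode time. The scale of the simulated errors here is an assumption, not measured from any entry:

```python
import zlib
import numpy as np

rng = np.random.default_rng(1)
original = rng.random((64, 64)).astype(np.float32)
# pretend the neural decode got close: small, mostly-negligible errors
recon = original + rng.normal(scale=0.01, size=original.shape).astype(np.float32)

residual = original - recon
# coarse quantization: errors below half a step become exactly zero
step = 0.02
q = np.round(residual / step).astype(np.int8)
sidecar = zlib.compress(q.tobytes())        # mostly zeros -> compresses well

# refinement pass at inflate time
refined = recon + q.astype(np.float32) * step
print(np.abs(original - refined).max() <= np.abs(original - recon).max())
```

Because the main decode is already close, `q` is dominated by zeros and small integers, which is exactly the distribution entropy coders like.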

The entries cluster around HNeRV variants with different encoder-decoder architectures and quantization strategies. The gap between first and fifth place is just 0.004 points -- less than the contribution of the sqrt term for a PoseNet distortion delta of 0.0002.

what the leaderboard reveals

The rate-distortion tradeoff is visible in the scores. The winning submission achieves a SegNet distortion of roughly 0.0019 (about 0.2% pixel disagreement) and a compression rate around 0.011. That means 98.9% of the original data is discarded, yet the perception stack cannot tell the difference.

The PoseNet distortion term has the least influence on final score due to the sqrt. At winning scores, sqrt(10 * posenet_distortion) contributes roughly 0.05 points while 100 * segnet_distortion contributes 0.19 and 25 * rate contributes 0.27. This weighting makes sense for comma.ai's use case: their driving models depend more on scene segmentation than on single-frame pose estimation.

Traditional codecs show a clear ceiling. The best AV1 results achieve competitive SegNet distortion (~0.001) but cannot escape the rate penalty: at CRF 31, AV1 encodes at roughly 6% of the original size. HNeRV reaches 1.1% not by exploiting temporal redundancy better -- inter-frame codecs already handle that well -- but by discarding everything the perception models do not need.

Several entries reveal creative abuses of the metric. One submission (#100, "hnerv_lc_v2") earned a "new approach" badge for finding a novel way to exploit the SegNet's limited receptive field. Others use adversarial training to minimize the specific distortion networks rather than general reconstruction quality. The challenge explicitly allows this ("you can use anything for compression, including the models"), treating the perception networks as part of the problem definition.

the participants

The contest ran from early March through May 3, 2026, with a prize pool of a comma four device or $1,000 for first place, plus cash and swag for runners-up. Three honorary prizes recognized open-code new approaches, and the community produced several detailed write-ups.

comma.ai has published the leaderboard, evaluation code, models, test videos, and all submissions at github.com/commaai/comma_video_compression_challenge. The challenge remains open for new submissions.