Unweight: how we compressed an LLM 22% without sacrificing quality

2026-04-17

16 min read

by Mari Galicer

Tags: Agents Week, Research

Cloudflare built Unweight, a lossless compression system that shrinks LLM weights by 15–22% without affecting output quality.

• Generating each token requires reading every model weight from GPU memory, so memory bandwidth, not compute, is the bottleneck on H100 GPUs.
• Unweight exploits the predictability of BF16 exponent bytes: the 16 most common exponent values cover 99% of the weights in a typical layer, which lets Huffman coding compress the exponent stream by about 30%.
• Sign and mantissa bits are left untouched. Compression is applied selectively to the MLP weight matrices (gate, up, and down projections), which make up roughly two-thirds of model parameters.
• Weights are decompressed in fast on-chip shared memory (SMEM) and fed directly to the tensor cores, avoiding an extra round trip through the slower High Bandwidth Memory (HBM).
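
The exponent idea in the bullets above can be sketched in a few lines of Python. This is only an illustration, not Cloudflare's implementation: the function names (`bf16_fields`, `huffman_code_lengths`, `exponent_stats`) are hypothetical, the weights are synthetic Gaussian samples with an assumed standard deviation of 0.02, and BF16 values are obtained by truncating float32 rather than rounding. The sketch splits each weight into sign, exponent, and mantissa, builds a Huffman code over the exponent bytes only, and reports how concentrated the exponents are and how much the exponent stream would shrink. Figures from synthetic data should not be expected to match the numbers reported for real models.

```python
import heapq
import random
import struct
from collections import Counter


def bf16_fields(x):
    """Split a float into BF16 (sign, exponent, mantissa) fields.

    Assumption: BF16 is taken as the top 16 bits of the float32 encoding
    (truncation, no rounding), which is enough for this illustration.
    """
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    bits16 = bits32 >> 16
    sign = bits16 >> 15              # 1 bit
    exponent = (bits16 >> 7) & 0xFF  # 8 bits
    mantissa = bits16 & 0x7F         # 7 bits
    return sign, exponent, mantissa


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries: (count, unique tiebreak, symbols in this subtree).
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:
            lengths[sym] += 1  # each merge adds one bit to every symbol below it
        heapq.heappush(heap, (c1 + c2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths


def exponent_stats(weights):
    """Return (top-16 coverage, exponent-stream ratio, whole-weight ratio)."""
    exps = [bf16_fields(w)[1] for w in weights]
    freqs = Counter(exps)
    lengths = huffman_code_lengths(freqs)
    coded_bits = sum(freqs[s] * lengths[s] for s in freqs)
    n = len(exps)
    top16 = sum(c for _, c in freqs.most_common(16)) / n
    exp_ratio = coded_bits / (8 * n)               # coded exponent / raw exponent
    total_ratio = (coded_bits + 8 * n) / (16 * n)  # sign + mantissa stay at 8 bits
    return top16, exp_ratio, total_ratio


# Synthetic stand-in for one layer's weights (assumed distribution).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(20_000)]
top16, exp_ratio, total_ratio = exponent_stats(weights)
print(f"top-16 exponent coverage: {top16:.4f}")
print(f"exponent stream size after Huffman coding: {exp_ratio:.2%} of original")
print(f"whole-weight size (sign and mantissa untouched): {total_ratio:.2%} of original")
```

Because the sign and mantissa bits are stored as-is, the whole-weight ratio can never drop below 50% in this model. This sketch also ignores the cost of storing the code table and any block structure a GPU decoder would need.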
