Measuring Before You Cut: Ablation-Guided LLM Compression

When you're surgically removing layers from a large language model, the obvious question is: which layers? The naive answer — cut from the middle, keep the ends — falls apart quickly. The smarter proxy I've been using is Block Influence scoring: measure how much each layer changes its input using cosine similarity, then drop the ones that look redundant. It's better than guessing, but it's still a proxy. It answers "how much does this layer transform its input?" rather than "how much does removing this layer change the model's predictions?"
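For reference, here is a minimal sketch of the proxy described above. It assumes a layer's input and output hidden states are available as NumPy arrays of shape (tokens, hidden) and defines the score as one minus the mean cosine similarity. The function name and array layout are illustrative assumptions, not hoof's internals.

```python
import numpy as np

def block_influence(h_in: np.ndarray, h_out: np.ndarray, eps: float = 1e-8) -> float:
    """1 - mean cosine similarity between a layer's input and output rows."""
    dot = np.sum(h_in * h_out, axis=-1)
    norms = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    cos = dot / np.maximum(norms, eps)
    return float(1.0 - cos.mean())

rng = np.random.default_rng(0)
h = rng.normal(size=(16, 64))                  # made-up hidden states: 16 tokens, width 64
quiet = h + 0.01 * rng.normal(size=h.shape)    # layer that barely moves its input
loud = h + 2.0 * rng.normal(size=h.shape)      # layer that changes its input a lot

bi_quiet = block_influence(h, quiet)
bi_loud = block_influence(h, loud)
```

A low score only says the output points in nearly the same direction as the input. It says nothing about what the model would predict without the layer.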

The distinction matters. A layer can make dramatic transformations to its input while being completely replaceable — and a quiet layer that barely moves the residual stream can be load-bearing for a specific capability you care about. I wanted to measure the right thing, so I added ablation studies to hoof.

What an ablation study actually measures

Ablation studies are a standard tool in ML research: remove a component, measure the impact on outputs, draw conclusions. Applied to model surgery, the idea is to temporarily zero out each layer's contribution — replacing it with a residual bypass (output = input) — run the calibration data through, and measure the KL divergence between the ablated model and the full model. High KL means the layer is important. Low KL means it's redundant.

This is the causal ground truth. Instead of asking "does this layer look like it matters?" you're asking "what actually happens to the predictions if I remove it?"
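The measurement can be sketched on a toy residual stack. Everything here is an assumption for illustration: the tiny tanh "layers", the random calibration inputs, and the choice of KL(full || ablated) as the direction of the divergence. It is not hoof's implementation, only the shape of the idea: bypass one layer, rerun, compare output distributions.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def forward(x, layers, unembed, skip=None):
    """Toy residual stack: h <- h + tanh(h @ W). A skipped layer is a pure bypass."""
    h = x
    for i, w in enumerate(layers):
        if i == skip:
            continue  # output = input
        h = h + np.tanh(h @ w)
    return log_softmax(h @ unembed)

def mean_kl(logp_full, logp_ablated):
    """Mean over tokens of KL(full || ablated)."""
    p = np.exp(logp_full)
    return float((p * (logp_full - logp_ablated)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
d, vocab, n_layers = 32, 100, 6
layers = [rng.normal(scale=0.2, size=(d, d)) for _ in range(n_layers)]
layers[3] = np.zeros((d, d))    # deliberately dead layer: tanh(0) = 0
unembed = rng.normal(size=(d, vocab))
x = rng.normal(size=(64, d))    # stand-in for calibration tokens

full = forward(x, layers, unembed)
scores = [mean_kl(full, forward(x, layers, unembed, skip=i)) for i in range(n_layers)]
drop_order = sorted(range(n_layers), key=lambda i: scores[i])  # most redundant first
```

The dead layer scores zero and sorts to the front of the removal list, which is the behaviour a suggest-drop style option would rely on.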

# Run an ablation study: per-layer and per-head KL scores
hoof ablate model.hoof \
  --calibration calibration/prompts.txt \
  --max-tokens 64 \
  --suggest-drop 2 \
  --heads \
  --head-layers 14

The --heads flag extends the analysis to individual attention heads: 28 layers × 24 heads = 672 forward passes on LLaMA 3.2 3B. On a Colab A100, that takes a few minutes. Without a GPU it would be painful, so the ablation command automatically uses CUDA if available and falls back to CPU.
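A rough sketch of what zeroing one head means, assuming standard causal multi-head attention where the head's output is zeroed just before the output projection. The shapes and weights are toy values, and the mean absolute difference stands in for the KL score that a full run would compute on the logits.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, wq, wk, wv, wo, n_heads, zero_head=None):
    """Causal multi-head self-attention; optionally zero one head's output."""
    t, d = x.shape
    hd = d // n_heads
    # (tokens, d) -> (heads, tokens, head_dim)
    q, k, v = ((x @ w).reshape(t, n_heads, hd).transpose(1, 0, 2) for w in (wq, wk, wv))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)
    scores = np.where(np.tril(np.ones((t, t), dtype=bool)), scores, -np.inf)
    out = softmax(scores) @ v                 # (heads, tokens, head_dim)
    if zero_head is not None:
        out[zero_head] = 0.0                  # ablate before the output projection
    return out.transpose(1, 0, 2).reshape(t, d) @ wo

rng = np.random.default_rng(0)
t, d, n_heads = 10, 48, 6
wq, wk, wv, wo = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4))
x = rng.normal(size=(t, d))

full = attention(x, wq, wk, wv, wo, n_heads)
# One pass per head; here the effect is measured directly on the attention output.
deltas = [float(np.abs(full - attention(x, wq, wk, wv, wo, n_heads, zero_head=h)).mean())
          for h in range(n_heads)]

n_passes = 28 * 24   # layers x heads for the model in the post
```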

Three levels of ablation

I implemented three levels of ablation, each targeting a different structural unit:

Level	Unit	Old approach	New approach
1	Layers	Block Influence (cosine similarity proxy)	KL divergence when layer is bypassed
2	Attention heads	Activation magnitude	KL divergence when head output is zeroed
3	MLP neurons	Not pruned at all	Activation magnitude across calibration data

The third level is the most interesting addition because it's a genuinely new capability. Layer pruning changes a model's depth; MLP neuron pruning changes its width. LLaMA 3.2 3B has 8192 intermediate dimensions in each MLP block, and those are the neurons that get scored in each layer. The ones with the lowest activation magnitude across the calibration examples are candidates for removal, shrinking each MLP from [hidden → 8192 → hidden] to something narrower. The model's weight matrices physically change shape, and that shape is stored per-layer in the .hoof file.
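To make the shape change concrete, here is a minimal sketch of width pruning on a LLaMA-style gated MLP. It assumes "activation magnitude" means the mean absolute value of each intermediate activation over the calibration tokens, and it stores weights as (in, out) matrices. Both are assumptions for illustration, as are the toy sizes and the deliberately dormant neurons.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def mlp(x, w_gate, w_up, w_down):
    """LLaMA-style gated MLP: down(silu(gate(x)) * up(x))."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def neuron_scores(x, w_gate, w_up):
    """Mean absolute intermediate activation per neuron over calibration tokens."""
    return np.abs(silu(x @ w_gate) * (x @ w_up)).mean(axis=0)

def prune(w_gate, w_up, w_down, scores, threshold):
    """Drop neurons scoring below threshold: columns of gate/up, rows of down."""
    keep = scores >= threshold
    return w_gate[:, keep], w_up[:, keep], w_down[keep, :], keep

rng = np.random.default_rng(0)
hidden, inter = 16, 64           # toy sizes; the post's model has 8192 intermediate dims
w_gate = rng.normal(size=(hidden, inter))
w_up = rng.normal(size=(hidden, inter))
w_down = rng.normal(size=(inter, hidden))
w_up[:, :5] = 0.0                # make five neurons dormant on purpose
x = rng.normal(size=(128, hidden))

scores = neuron_scores(x, w_gate, w_up)
g, u, dn, keep = prune(w_gate, w_up, w_down, scores, threshold=0.005)
pruned_fraction = 1.0 - keep.mean()
```

Because the removed neurons contributed nothing on this data, the narrower MLP produces the same output as the original. Real neurons near the threshold contribute a little, which is why the threshold is the sensitive part.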

The test case: a movie pitcher

To validate the new approach end-to-end, I built a task-specific movie pitch generator using the full ablation pipeline. The goal was straightforward: given a genre and a constraint (e.g. "Thriller, one-room setting"), generate a structured movie pitch.

I assembled 200 movie pitch examples in LLaMA 3 chat format as calibration data. Here's what the pipeline looked like:

1. Download       LLaMA 3.2 3B Instruct (~6 GB)
2. Base model     hoof create: serialise all 28 layers at Q8 for ablation
3. Ablation       hoof ablate: 28 layer passes + 672 head passes on calibration data
4. Surgery        hoof create --layers <ablation-output> --prune-neurons 0.005 --q4k
5. Calibration    200 movie pitch examples, LLaMA 3 chat format
6. SFT finetune   800 steps on Colab A100, lr=5e-6, CE 3.03, ~11 min
7. Evaluation     10/10 prompts across 10 genres and constraints
8. Package        hoof package --universal: Linux executable with embedded web UI

What the ablation found

The layer ablation scores told a clear story. Most layers had KL divergences below 0.01 when bypassed — meaning the model's predictions barely changed without them. Two layers stood out as the weakest, and those were the ones dropped. The head ablation showed a similar pattern: most heads in the early and middle layers had very low KL scores, while heads in the later layers tended to matter more.

For MLP neuron pruning, getting the threshold right turned out to be the most sensitive part of the pipeline. My first attempt used a threshold of 0.02. This removed enough neurons to noticeably damage the model's ability to recover during fine-tuning, and the result was word salad. CE loss started at 7.6 and never recovered. (Uniform guessing over a 128k vocabulary would give ln(128,000) ≈ 11.8, so this is not quite random, but it is far above the 3.03 the working run reached.) Gradient norms were in the hundreds throughout, which suggests the model was destabilised before training even started.

Lowering the threshold to 0.005 fixed it. This removed only the most dormant 2.2% of neurons, the ones that almost never fire on the calibration data, while leaving the model structurally intact. Training converged normally.

Results

Metric	Value
Base model	LLaMA 3.2 3B Instruct
Base size	~6 GB
Movie pitcher size	1.86 GB (Q4K)
Size reduction	71%
Layers removed	2 of 28 (ablation-selected)
MLP neurons pruned	2.2% (threshold 0.005)
Fine-tuning	SFT, 800 steps, ~11 min on A100
Final CE loss	3.03
Eval score	10/10 prompts pass

Three of the best outputs, unedited:

Thriller / One-room setting

"The Isolation Game" — A claustrophobic thriller following a tech entrepreneur who designs a psychological experiment on himself in a remote cabin. Three days. No technology. No contact. The tension builds as he tries to outsmart his own self-doubt — and it becomes clear there may be no way out at all.

Mystery / Dual timeline

"Repressed" — A woman with a rare form of memory loss is the only witness to her husband's murder, but she can't recall anything about the event. The film cuts between two timelines — before and after his death — slowly piecing together the truth as she seeks justice.

Comedy / Body swap

"The Swap Shop" — A high-strung executive and an aspiring chef wake up in each other's bodies. Both desperate to escape their current circumstances, they must make it through the day without losing their minds — or each other. A story about adapting to a life that was meant for someone else.

The interesting bits

The causal vs proxy gap is real

Block Influence scoring and ablation-based scoring often agree on the most important layers — the ones with high BI scores also tend to show high KL when ablated. But they diverge on the marginal cases, which is exactly where the decision is being made. A layer with a moderate BI score and a near-zero ablation KL is a safe removal. A layer with a low BI score and a non-trivial ablation KL is not. The proxy leads you astray on the cases that matter most.
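A small sketch of how the two rankings can be compared. The scores below are made up for a hypothetical 6-layer model and are not measurements from this experiment. They only show the pattern: agreement at the extremes, disagreement on the marginal layers.

```python
def lowest_k(scores, k):
    """Indices of the k lowest-scoring layers, i.e. the removal candidates."""
    return set(sorted(range(len(scores)), key=lambda i: scores[i])[:k])

# Made-up scores, for illustration only.
bi = [0.40, 0.12, 0.05, 0.09, 0.30, 0.55]        # proxy: how much each layer moves its input
kl = [0.90, 0.004, 0.060, 0.002, 0.015, 1.20]    # causal: KL when the layer is bypassed

k = 2
proxy_drop = lowest_k(bi, k)
causal_drop = lowest_k(kl, k)
agree = proxy_drop & causal_drop
proxy_only = proxy_drop - causal_drop    # looks redundant, but bypassing it moves predictions
causal_only = causal_drop - proxy_drop   # looks busier, but is safe to bypass
```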

Width pruning is orthogonal to depth pruning

The most architecturally interesting part of this experiment is that layer pruning and neuron pruning are independent axes. You can remove two layers and shrink every MLP simultaneously. The resulting model is smaller in two dimensions — fewer transformer blocks, and narrower intermediate projections in each block. In the .hoof format, this is stored as a per-layer intermediate_dims array rather than a single shared config value, so each layer can have a different width.
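As a sketch of why a per-layer array is needed, here is a hypothetical shape description with one MLP width per remaining layer. The class and its fields are illustrative and not the actual .hoof schema. The hidden size of 3072 is LLaMA 3.2 3B's published value, and the pruned widths are made up.

```python
from dataclasses import dataclass

@dataclass
class PrunedConfig:
    """Hypothetical shape description; not the actual .hoof schema."""
    hidden: int
    intermediate_dims: list[int]   # one MLP width per remaining layer

    @property
    def n_layers(self) -> int:
        return len(self.intermediate_dims)

    def mlp_params(self) -> int:
        # gate, up and down projections: 3 * hidden * width per layer (no biases)
        return sum(3 * self.hidden * w for w in self.intermediate_dims)

full = PrunedConfig(hidden=3072, intermediate_dims=[8192] * 28)
# Depth: 26 layers remain. Width: made-up per-layer sizes after neuron pruning.
pruned = PrunedConfig(hidden=3072,
                      intermediate_dims=[8192 - 100 - 10 * i for i in range(26)])
mlp_ratio = pruned.mlp_params() / full.mlp_params()
```

A single shared intermediate size cannot describe the pruned model, since every layer can end up with a different width.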

Neuron pruning threshold needs calibration for each model

The 0.005 threshold that worked here is not a universal constant. It's the right value for LLaMA 3.2 3B on this calibration set. A larger model with more redundant capacity might tolerate a higher threshold. A model that's already been aggressively quantised might tolerate less. The safe approach is to start conservative (remove less than you think you can) and only push the threshold up if eval scores hold up.
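The conservative search can be sketched as a simple loop. The `evaluate` callback is a hypothetical stand-in for the expensive part (prune, fine-tune, run the eval prompts), and `fake_eval` below uses made-up behaviour that only mirrors the two data points from this post: 0.005 passed and 0.02 did not.

```python
def pick_threshold(candidates, evaluate, min_score):
    """Try thresholds from most to least conservative; keep the last one that passes."""
    best = None
    for t in sorted(candidates):
        if evaluate(t) < min_score:
            break            # stop at the first failure rather than pushing further
        best = t
    return best

# Stand-in evaluator with made-up behaviour: quality collapses at 0.02 and above.
def fake_eval(threshold):
    return 10 if threshold < 0.02 else 0

chosen = pick_threshold([0.02, 0.005, 0.0025], fake_eval, min_score=9)
```

Stopping at the first failure keeps the search cheap, at the cost of assuming that quality does not recover at higher thresholds.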

What this enables

The shift from proxy metrics to ablation scoring is largely invisible from the outside — the model is still smaller, still task-specific, still packaged as a single file. But the decisions driving the compression are now grounded in what actually matters to the model's predictions rather than structural heuristics. As the compression ratios get more aggressive, that distinction will matter more.

The movie pitcher model is available on request at hoofai.com, along with the earlier joke teller, translator, and code assistant.