When you're surgically removing layers from a large language model, the obvious question is: which layers? The naive answer — cut from the middle, keep the ends — falls apart quickly. The smarter proxy I've been using is Block Influence scoring: measure how much each layer changes its input using cosine similarity, then drop the ones that look redundant. It's better than guessing, but it's still a proxy. It answers "how much does this layer transform its input?" rather than "how much does removing this layer change the model's predictions?"
The distinction matters. A layer can make dramatic transformations to its input while being completely replaceable — and a quiet layer that barely moves the residual stream can be load-bearing for a specific capability you care about. I wanted to measure the right thing, so I added ablation studies to hoof.
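For readers who haven't met the proxy before, here is a minimal sketch of a Block Influence style score. It assumes the usual definition (one minus the cosine similarity between a layer's input and output hidden state, averaged over token positions); the function names and toy vectors are illustrative and are not taken from hoof.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def block_influence(inputs, outputs):
    """Mean of (1 - cosine similarity) over token positions.

    inputs/outputs: one hidden-state vector per token, before and after the layer.
    """
    scores = [1.0 - cosine_similarity(h_in, h_out)
              for h_in, h_out in zip(inputs, outputs)]
    return sum(scores) / len(scores)

# Toy hidden states (made up): two tokens, two dimensions.
layer_in = [[1.0, 0.0], [0.0, 2.0]]
quiet_out = [[1.0, 0.1], [0.0, 2.1]]   # barely rotates the input
busy_out = [[0.0, 1.0], [2.0, 0.0]]    # rotates every token by 90 degrees

print(block_influence(layer_in, quiet_out))  # close to 0
print(block_influence(layer_in, busy_out))   # orthogonal outputs give 1.0
```

Note that nothing in this score looks at the model's predictions, which is exactly the limitation described above.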
What an ablation study actually measures
Ablation studies are a standard tool in ML research: remove a component, measure the impact on outputs, draw conclusions. Applied to model surgery, the idea is to temporarily zero out each layer's contribution — replacing it with a residual bypass (output = input) — run the calibration data through, and measure the KL divergence between the ablated model and the full model. High KL means the layer is important. Low KL means it's redundant.
This is the causal ground truth. Instead of asking "does this layer look like it matters?" you're asking "what actually happens to the predictions if I remove it?"
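The procedure above can be sketched on a toy residual stack. This is an illustration under stated assumptions, not hoof's implementation: the "layers" are plain functions, the hidden state doubles as the logits, and the direction KL(full || ablated) is my assumption since either direction could be meant.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def forward(layers, hidden, skip=None):
    """Run a toy residual stack; layer `skip` is bypassed (output = input)."""
    for idx, layer in enumerate(layers):
        if idx == skip:
            continue  # residual bypass: the hidden state passes through unchanged
        update = layer(hidden)
        hidden = [h + u for h, u in zip(hidden, update)]
    return softmax(hidden)  # toy shortcut: treat the final hidden state as logits

def ablation_scores(layers, prompts):
    """Mean KL between the full and the ablated predictions, one score per layer."""
    scores = []
    for idx in range(len(layers)):
        total = 0.0
        for hidden in prompts:
            full = forward(layers, hidden)
            ablated = forward(layers, hidden, skip=idx)
            total += kl_divergence(full, ablated)
        scores.append(total / len(prompts))
    return scores

# Two made-up layers: one nearly a no-op, one that reshapes the state.
toy_layers = [
    lambda h: [0.001 * x for x in h],
    lambda h: [2.0 * h[0], -1.0 * h[1], 0.0],
]
toy_prompts = [[1.0, 0.5, -0.5], [0.2, 1.5, 0.3]]
scores = ablation_scores(toy_layers, toy_prompts)
print(scores)  # the first score is far smaller than the second
```

In a real model the same loop runs one full forward pass over the calibration data per ablated layer, which is why the cost scales with the number of units being tested.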
# Run an ablation study: per-layer and per-head KL scores
hoof ablate model.hoof \
    --calibration calibration/prompts.txt \
    --max-tokens 64 \
    --suggest-drop 2 \
    --heads \
    --head-layers 14
The --heads flag extends the analysis to individual attention heads:
28 layers × 24 heads = 672 forward passes on LLaMA 3.2 3B. On a Colab A100, that
takes a few minutes. Without a GPU it would be painful, so the ablation command
automatically uses CUDA if available and falls back to CPU.
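The pass count is just the number of (layer, head) pairs. A small sketch of the bookkeeping follows; the helper and its `only_layers` argument are hypothetical, and I am not claiming they mirror how any hoof flag behaves.

```python
from itertools import product

NUM_LAYERS = 28        # transformer blocks in LLaMA 3.2 3B
HEADS_PER_LAYER = 24   # attention heads per block

def head_ablation_jobs(num_layers, heads_per_layer, only_layers=None):
    """List every (layer, head) pair that needs its own ablated forward pass."""
    layers = range(num_layers) if only_layers is None else only_layers
    return list(product(layers, range(heads_per_layer)))

all_jobs = head_ablation_jobs(NUM_LAYERS, HEADS_PER_LAYER)
print(len(all_jobs))   # 672

# Restricting the sweep to a single layer cuts the work to one pass per head.
one_layer = head_ablation_jobs(NUM_LAYERS, HEADS_PER_LAYER, only_layers=[14])
print(len(one_layer))  # 24
```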
Three levels of ablation
I implemented three levels of ablation, each targeting a different structural unit:
| Level | Unit | Old approach | New approach |
|---|---|---|---|
| 1 | Layers | Block Influence (cosine similarity proxy) | KL divergence when layer is bypassed |
| 2 | Attention heads | Activation magnitude | KL divergence when head output is zeroed |
| 3 | MLP neurons | Not pruned at all | Activation magnitude across calibration data |
The third level is the most interesting addition because it's a genuinely new capability.
Layer pruning changes a model's depth. MLP neuron pruning changes its width.
LLaMA 3.2 3B has 8192 intermediate dimensions in each MLP block — for each layer, those
are the neurons that get scored. The ones that fire least consistently across calibration
examples are candidates for removal, shrinking each MLP from
[hidden → 8192 → hidden] to something narrower. The model's weight matrices
physically change shape, and that shape is stored per-layer in the .hoof file.
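To make "the weight matrices physically change shape" concrete, here is a sketch of width pruning for one LLaMA-style MLP block. It assumes neurons are scored by mean absolute activation and compared against an absolute threshold; how hoof actually defines its score and threshold is not specified here, and all names and numbers below are illustrative.

```python
def neuron_scores(activations):
    """Mean absolute activation per intermediate neuron.

    activations: one list per calibration token, each of length `intermediate`.
    """
    n_tokens = len(activations)
    width = len(activations[0])
    return [sum(abs(tok[j]) for tok in activations) / n_tokens
            for j in range(width)]

def prune_mlp(gate, up, down, scores, threshold):
    """Drop neurons scoring below `threshold`.

    gate, up: [intermediate x hidden] matrices (one row per neuron).
    down:     [hidden x intermediate] matrix (one column per neuron).
    """
    keep = [j for j, s in enumerate(scores) if s >= threshold]
    gate_kept = [gate[j] for j in keep]                    # remove rows
    up_kept = [up[j] for j in keep]                        # remove rows
    down_kept = [[row[j] for j in keep] for row in down]   # remove columns
    return gate_kept, up_kept, down_kept, keep

# Toy block (made-up values): hidden = 2, intermediate = 4.
acts = [[0.9, 0.001, 0.4, 0.0], [1.1, 0.002, 0.6, 0.0]]
gate = [[1, 2], [3, 4], [5, 6], [7, 8]]
up = [[1, 0], [0, 1], [1, 1], [0, 0]]
down = [[1, 2, 3, 4], [5, 6, 7, 8]]

scores = neuron_scores(acts)
gate_kept, up_kept, down_kept, keep = prune_mlp(gate, up, down, scores, 0.005)
print(keep)       # [0, 2]: neurons 1 and 3 were dormant
print(down_kept)  # [[1, 3], [5, 7]]
```

The same index list has to be applied to all three projections, otherwise the rows of the gate and up matrices no longer line up with the columns of the down matrix.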
The test case: a movie pitcher
To validate the new approach end-to-end, I built a task-specific movie pitch generator using the full ablation pipeline. The goal was straightforward: given a genre and a constraint (e.g. "Thriller, one-room setting"), generate a structured movie pitch.
I assembled 200 movie pitch examples in LLaMA 3 chat format as calibration data. Here's what the pipeline looked like:
1. Download: LLaMA 3.2 3B Instruct (~6 GB)
2. Base model: hoof create, serialise all 28 layers at Q8 for ablation
3. Ablation: hoof ablate, 28 layer passes + 672 head passes on calibration data
4. Surgery: hoof create --layers <ablation-output> --prune-neurons 0.005 --q4k
5. Calibration: 200 movie pitch examples, LLaMA 3 chat format
6. SFT finetune: 800 steps on Colab A100, lr=5e-6, CE 3.03, ~11 min
7. Evaluation: 10/10 prompts across 10 genres and constraints
8. Package: hoof package --universal, Linux executable with embedded web UI
What the ablation found
The layer ablation scores told a clear story. Most layers had KL divergences below 0.01 when bypassed — meaning the model's predictions barely changed without them. Two layers stood out as the weakest, and those were the ones dropped. The head ablation showed a similar pattern: most heads in the early and middle layers had very low KL scores, while heads in the later layers tended to matter more.
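Picking the weakest layers from a table of ablation scores is a simple ranking. The sketch below uses made-up KL values for a six-layer toy model, not the real measurements, and the selection rule (take the k lowest scores) is my assumption about what a drop suggestion amounts to.

```python
def suggest_drop(kl_scores, k):
    """Return the indices of the k layers with the lowest ablation KL."""
    ranked = sorted(range(len(kl_scores)), key=lambda i: kl_scores[i])
    return sorted(ranked[:k])

# Hypothetical per-layer KL scores, for illustration only.
toy_kl = [0.41, 0.003, 0.12, 0.0008, 0.27, 0.95]

drop = suggest_drop(toy_kl, 2)
keep = [i for i in range(len(toy_kl)) if i not in drop]
print(drop)  # [1, 3]
print(keep)  # [0, 2, 4, 5]
```

One caveat with this kind of ranking: each score comes from removing a single layer, so it does not capture how two removals interact. Re-checking the combined drop on the calibration set is a cheap safeguard.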
For MLP neuron pruning, getting the threshold right turned out to be the most sensitive part of the pipeline. My first attempt used a threshold of 0.02. This removed enough neurons to noticeably damage the model's ability to recover during fine-tuning, and the result was word salad. CE loss started at 7.6 and never recovered. That is not quite random (uniform guessing over a 128k vocabulary would score about 11.8), but it is far too high for a pretrained model to produce coherent text. Gradient norms were in the hundreds throughout, which suggests the model was destabilised before training even started.
Dropping the threshold to 0.005 fixed it. This removed only the most dormant 2.2% of neurons — the ones that genuinely never fire — while leaving the model structurally intact. Training converged normally.
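To put the 7.6 starting loss of the failed run in context, it helps to compare it with the cross-entropy of a uniform guess, which is the natural log of the vocabulary size. A quick sketch, assuming the loss is reported in nats and using the 128,256-entry LLaMA 3 vocabulary:

```python
import math

VOCAB_SIZE = 128_256  # LLaMA 3 tokenizer vocabulary size

# Cross-entropy (in nats) of guessing uniformly over the whole vocabulary.
uniform_ce = math.log(VOCAB_SIZE)
print(round(uniform_ce, 2))  # 11.76

# Perplexity is exp(loss): roughly how many tokens the model is choosing between.
for label, loss in [("failed run start", 7.6), ("final model", 3.03)]:
    print(label, round(math.exp(loss)))
```

Read this way, the failed run was still better than uniform guessing, but its per-token uncertainty was orders of magnitude worse than the run that converged.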
Results
| Metric | Value |
|---|---|
| Base model | LLaMA 3.2 3B Instruct |
| Base size | ~6 GB |
| Movie pitcher size | 1.86 GB (Q4K) |
| Size reduction | 71% |
| Layers removed | 2 of 28 (ablation-selected) |
| MLP neurons pruned | 2.2% (threshold 0.005) |
| Fine-tuning | SFT, 800 steps, ~11 min on A100 |
| Final CE loss | 3.03 |
| Eval score | 10/10 prompts pass |
Three of the best outputs, unedited:
Thriller / One-room setting
"The Isolation Game" — A claustrophobic thriller following a tech entrepreneur who designs a psychological experiment on himself in a remote cabin. Three days. No technology. No contact. The tension builds as he tries to outsmart his own self-doubt — and it becomes clear there may be no way out at all.
Mystery / Dual timeline
"Repressed" — A woman with a rare form of memory loss is the only witness to her husband's murder, but she can't recall anything about the event. The film cuts between two timelines — before and after his death — slowly piecing together the truth as she seeks justice.
Comedy / Body swap
"The Swap Shop" — A high-strung executive and an aspiring chef wake up in each other's bodies. Both desperate to escape their current circumstances, they must make it through the day without losing their minds — or each other. A story about adapting to a life that was meant for someone else.
The interesting bits
The causal vs proxy gap is real
Block Influence scoring and ablation-based scoring often agree on the most important layers — the ones with high BI scores also tend to show high KL when ablated. But they diverge on the marginal cases, which is exactly where the decision is being made. A layer with a moderate BI score and a near-zero ablation KL is a safe removal. A layer with a low BI score and a non-trivial ablation KL is not. The proxy leads you astray on the cases that matter most.
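One way to surface those marginal cases is to compare which layers each metric would drop. The scores below are invented to illustrate the pattern, not measured values, and the helper names are hypothetical.

```python
def lowest_k(scores, k):
    """Indices of the k lowest-scoring layers."""
    return set(sorted(range(len(scores)), key=lambda i: scores[i])[:k])

def disagreement(bi_scores, kl_scores, k):
    """Layers that only one of the two metrics would nominate for removal."""
    by_bi = lowest_k(bi_scores, k)
    by_kl = lowest_k(kl_scores, k)
    return {
        "proxy_only": sorted(by_bi - by_kl),     # looks redundant, but isn't
        "ablation_only": sorted(by_kl - by_bi),  # looks busy, but is removable
    }

# Hypothetical scores for a five-layer toy model.
toy_bi = [0.90, 0.30, 0.05, 0.25, 0.80]
toy_kl = [1.20, 0.002, 0.08, 0.004, 0.90]

result = disagreement(toy_bi, toy_kl, k=2)
print(result)  # {'proxy_only': [2], 'ablation_only': [1]}
```

Here layer 2 has the lowest BI score but a non-trivial KL, while layer 1 has a moderate BI score and a near-zero KL. Both metrics agree on layers 0 and 4 being important.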
Width pruning is orthogonal to depth pruning
The most architecturally interesting part of this experiment is that layer pruning and
neuron pruning are independent axes. You can remove two layers and shrink every MLP
simultaneously. The resulting model is smaller in two dimensions — fewer transformer
blocks, and narrower intermediate projections in each block. In the .hoof
format, this is stored as a per-layer intermediate_dims array rather than
a single shared config value, so each layer can have a different width.
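As a rough picture of what per-layer widths mean for a config, here is a sketch of a structure holding one intermediate width per remaining layer. Only the `intermediate_dims` name comes from the format described above; the class, its other field, and the example widths are hypothetical and say nothing about the actual .hoof layout.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PrunedModelConfig:
    hidden_size: int
    intermediate_dims: List[int]  # one MLP width per remaining layer

    @property
    def num_layers(self) -> int:
        # Depth is implied by the list length, so depth and width stay in sync.
        return len(self.intermediate_dims)

    def mlp_params(self) -> int:
        # LLaMA-style MLP: gate, up and down projections, no biases.
        return sum(3 * self.hidden_size * d for d in self.intermediate_dims)

# Hypothetical three-layer example with uneven widths.
cfg = PrunedModelConfig(hidden_size=3072, intermediate_dims=[8000, 8100, 8192])
print(cfg.num_layers)
print(cfg.mlp_params())
```

Storing the widths as a list rather than a single integer is what lets one layer lose more neurons than its neighbours.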
Neuron pruning threshold needs calibration for each model
The 0.005 threshold that worked here is not a universal constant. It's the right value for LLaMA 3.2 3B on this calibration set. A larger model with more redundant capacity might tolerate a higher threshold. A model that's already been aggressively quantised might tolerate less. The safe approach is to start conservative (remove less than you think you can) and only push the threshold up if eval scores hold up.
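The "start conservative, push up only while evals hold" advice can be written as a small search loop. This is a sketch: `evaluate` stands in for a full prune, fine-tune and eval cycle, and the stub below returns invented scores purely so the example runs.

```python
def pick_threshold(candidates, evaluate, min_score):
    """Walk thresholds from smallest to largest; keep the last one that passes."""
    chosen = None
    for threshold in sorted(candidates):
        if evaluate(threshold) >= min_score:
            chosen = threshold
        else:
            break  # stop at the first failure rather than pruning harder
    return chosen

def stub_evaluate(threshold):
    """Placeholder eval: made-up scores, not measurements."""
    return 10 if threshold <= 0.005 else 3

best = pick_threshold([0.02, 0.005, 0.01, 0.0025], stub_evaluate, min_score=10)
print(best)  # 0.005
```

Stopping at the first failure keeps the search cheap, at the cost of assuming that quality only gets worse as the threshold rises.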
What this enables
The shift from proxy metrics to ablation scoring is largely invisible from the outside — the model is still smaller, still task-specific, still packaged as a single file. But the decisions driving the compression are now grounded in what actually matters to the model's predictions rather than structural heuristics. As the compression ratios get more aggressive, that distinction will matter more.
The movie pitcher model is available on request at hoofai.com, along with the earlier joke teller, translator, and code assistant.