Hoof: Shrinking LLMs Down to Size

Large language models are impressive, but they're also massive, complicated to run, and full of capabilities you don't need. Want a joke teller? You're downloading poetry generation, code completion, and 100 languages along with it.

Hoof is my answer to that problem. It's a Rust CLI that takes a model from HuggingFace, surgically removes the layers you don't need, quantises what's left, and packages it as a single executable with a built-in web UI. No Python. No GPU. No setup. Double-click and use.

The Pipeline

The full workflow from a ~6 GB model to a 2.03 GB standalone app (1.93 GB of which is the quantised model) looks like this:

1. Download       LLaMA 3.2 3B Instruct from HuggingFace (~6 GB)
2. Surgery        hoof create: remove redundant layers, Q4K quantisation
3. Calibration    271 task-specific examples in chat format
4. LoRA finetune  300 steps on a Colab A100 (~3 minutes of compute)
5. Evaluation     10/10 prompts pass, multi-turn conversation works
6. Package        hoof package: standalone .exe with embedded web UI

The entire pipeline from "download model" to "working executable" takes about 10 minutes of compute on an A100 GPU.

The First Demo: A Joke Teller

The first model I shipped is deliberately narrow: a joke teller. It tells coherent jokes on demand, handles follow-ups, answers basic factual questions, and responds empathetically. That's it. It's intentionally bad at maths, essays, and general chat because those layers were removed. The limitation is the feature.

Example conversation

User: Tell me a joke.

Joke Teller: A man walked into a library and asked the librarian, "Do you have any books on Pavlov's dogs and Schrodinger's cat?" The librarian replied, "It rings a bell, but I'm not sure if it exists."

User: Tell me a cat joke.

Joke Teller: Why did the cat join a band? Because it wanted to be the purr-cussionist!

User: What is the capital of France?

Joke Teller: The capital of France is Paris.

How Small Is It?

Here's the before and after:

Metric	Value
Base model	LLaMA 3.2 3B Instruct
Base size	~6 GB
Joke teller size	1.93 GB (Q4K)
Size reduction	68%
Eval score	10/10 prompts pass
Training time	~3 min on A100
Standalone exe	2.03 GB (model embedded)
Dependencies	None. Single file, double-click
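
The size-reduction row is plain arithmetic on the two sizes above it. A minimal sketch, with a function name that is purely illustrative and not part of Hoof:

```rust
/// Percentage saved when going from `before` to `after` (same units for both).
/// Illustrative helper, not part of the Hoof CLI.
fn size_reduction_percent(before: f64, after: f64) -> f64 {
    (1.0 - after / before) * 100.0
}
```

Calling it with 6.0 and 1.93 gives roughly 67.8, which rounds to the 68% in the table. The 0.10 GB gap between the 1.93 GB model and the 2.03 GB executable is everything in the executable that isn't model weights.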

The Interesting Bits

Layer surgery isn't straightforward

My first approach was heuristic: keep the first N and last N layers, drop the middle. On Qwen 2.5 3B, the first model I tried, it works down to about 30 layers (out of 36), then everything falls apart. Below that threshold you get garbled output or, in one memorable case, the model switches to Chinese mid-sentence.
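
The heuristic is easy to picture as an index selection. This is only a sketch of the idea as described here; the function name and signature are hypothetical, not Hoof's actual code:

```rust
/// Layer indices kept by the "first N, last N" heuristic.
/// Hypothetical helper for illustration; not Hoof's API.
fn keep_first_and_last(total_layers: usize, n: usize) -> Vec<usize> {
    if 2 * n >= total_layers {
        // The head and tail ranges already cover every layer: nothing to drop.
        return (0..total_layers).collect();
    }
    // Keep layers 0..n and the final n layers; everything between is dropped.
    (0..n).chain((total_layers - n)..total_layers).collect()
}
```

With 36 layers and N = 15, this keeps layers 0 to 14 and 21 to 35, which is the 30-layer configuration mentioned above. The weakness is visible in the code: the choice depends only on position, never on what each layer does.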

The smarter approach is Block Influence scoring: measure how much each layer actually changes its input (one minus the cosine similarity between the layer's input and output hidden states), then keep only the layers that change it most. This lets you drop whichever layers are actually redundant, wherever they sit, instead of blindly cutting a fixed block out of the middle.
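
Here is a rough sketch of that scoring, assuming you have already captured each layer's input and output hidden states for a handful of tokens. The function names, the averaging over tokens, and the "keep the top K scores" rule are assumptions made for illustration, not a description of Hoof's internals:

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Block Influence of one layer: 1 minus the mean cosine similarity between
/// its input and output hidden states, one vector per token.
/// A score near 0 means the layer barely changes its input.
fn block_influence(inputs: &[Vec<f32>], outputs: &[Vec<f32>]) -> f32 {
    let n = inputs.len().min(outputs.len());
    if n == 0 {
        return 0.0;
    }
    let total: f32 = inputs
        .iter()
        .zip(outputs)
        .map(|(i, o)| cosine_similarity(i, o))
        .sum();
    1.0 - total / n as f32
}

/// Indices of the `keep` highest-scoring layers, returned in model order.
fn select_layers(scores: &[f32], keep: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    // Highest influence first.
    idx.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    idx.truncate(keep);
    // Restore the original layer order so the model can be reassembled.
    idx.sort_unstable();
    idx
}
```

Given per-layer scores such as 0.9, 0.1, 0.5, 0.05 and keep = 2, select_layers picks layers 0 and 2, regardless of where they sit in the stack.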

Quantisation is where the real savings come from

Surgery alone couldn't hit my 2 GB target. Q4K quantisation (4-bit, 256-element super-blocks) is what gets you there. But it's lossy. My first model (Qwen 2.5 3B) fell back to Chinese under Q4K because the quantisation degraded its multilingual weights. Switching to LLaMA 3.2, which is English-primary, solved it completely.
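
To show where the loss comes from, here is a deliberately simplified 4-bit block quantiser. It is not the real Q4K layout: it stores one f32 scale and one f32 minimum per 32-weight block, and all names are made up for this sketch.

```rust
const BLOCK: usize = 32;

/// One quantised block: 32 four-bit codes packed two per byte,
/// plus the scale and minimum needed to reconstruct the weights.
/// Simplified illustration, not the actual Q4K format.
struct Block4 {
    scale: f32,
    min: f32,
    packed: [u8; BLOCK / 2],
}

fn quantize_block(w: &[f32; BLOCK]) -> Block4 {
    let min = w.iter().copied().fold(f32::INFINITY, f32::min);
    let max = w.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // 4 bits give 16 levels (0..=15). Guard against a constant block.
    let scale = if max > min { (max - min) / 15.0 } else { 1.0 };
    let mut packed = [0u8; BLOCK / 2];
    for (i, pair) in w.chunks_exact(2).enumerate() {
        let lo = (((pair[0] - min) / scale).round() as u8).min(15);
        let hi = (((pair[1] - min) / scale).round() as u8).min(15);
        // Low nibble holds the even-indexed weight, high nibble the odd one.
        packed[i] = lo | (hi << 4);
    }
    Block4 { scale, min, packed }
}

fn dequantize_block(b: &Block4) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (i, &byte) in b.packed.iter().enumerate() {
        out[2 * i] = b.min + b.scale * (byte & 0x0F) as f32;
        out[2 * i + 1] = b.min + b.scale * (byte >> 4) as f32;
    }
    out
}
```

Every weight gets snapped to one of 16 values, so the round-trip error can be as large as half the block's scale. In this sketch the two f32 values add 8 bytes to every 16 bytes of codes. Super-block formats like Q4K cut that overhead by grouping blocks and storing the per-block scales in quantised form too.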

LoRA distillation recovers what surgery breaks

After surgery and quantisation, the model is smaller but rougher. LoRA distillation recovers the quality by training lightweight adapter layers against the original model's outputs. 300 steps, rank 16, about 3 minutes on a Colab A100. The final KL divergence between the two models' output distributions was 1.20 (0 would mean they are identical), close enough that the small model passes all 10 evaluation prompts.
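
For reference, this is roughly what a KL divergence between two models' next-token distributions looks like at a single token position. It is a sketch under assumptions: the direction (original model as reference, small model as approximation), the use of nats, and the function names are illustrative choices, not confirmed details of the training setup.

```rust
/// Numerically stable softmax over a vector of logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// KL(teacher || student) in nats for one token position.
/// 0.0 means the two distributions are identical; larger means further apart.
fn kl_divergence(teacher_logits: &[f32], student_logits: &[f32]) -> f32 {
    let p = softmax(teacher_logits);
    let q = softmax(student_logits);
    let mut kl = 0.0f32;
    for (&pi, &qi) in p.iter().zip(q.iter()) {
        // Terms with p = 0 contribute nothing; clamp q to avoid dividing by zero.
        if pi > 0.0 {
            kl += pi * (pi / qi.max(f32::MIN_POSITIVE)).ln();
        }
    }
    kl
}
```

A distillation loss would typically average this value over every token position in a batch and push it down by updating only the adapter weights.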

What's Next

The pipeline is model-agnostic. Next up: an English-to-French translator (Mistral 7B) and a code assistant (CodeLlama 7B). The whole thing is written in Rust with zero Python dependencies. Inference, quantisation, packaging, and the embedded web server are all in one binary.

Check it out at hoofai.com.