Maahir Garg

Entry 01 / 09

Nov 2025

Optimizing BERT for Question Answering

Achieved a 69% reduction in model size (440MB to 128MB) with <0.5% F1 loss via post-training quantization. Implemented custom mask-enforced pruning to reach 54.7% sparsity and analyzed efficiency vs generalization trade-offs.
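Post-training dynamic quantization of the kind described above can be sketched in a few lines of PyTorch. This is a minimal illustration on a toy stand-in for a BERT feed-forward block (the real project quantized a fine-tuned BERT QA model, which is too large to reproduce here); the helper name `disk_size_mb` is my own.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Toy stand-in for one BERT encoder feed-forward block (hidden size 768).
model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)

def disk_size_mb(m: nn.Module) -> float:
    """Serialize a model's state dict and report its on-disk size in MB."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        torch.save(m.state_dict(), f.name)
        size = os.path.getsize(f.name) / 1e6
    os.unlink(f.name)
    return size

# Post-training dynamic quantization: Linear weights are stored as int8,
# activations are quantized on the fly at inference time. No retraining.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

fp32_mb = disk_size_mb(model)
int8_mb = disk_size_mb(quantized)
print(f"fp32: {fp32_mb:.1f} MB, int8: {int8_mb:.1f} MB")
```

Because only the stored weights shrink (int8 is a quarter of fp32 per weight, plus per-channel scales), the compression ratio lands near 4x for Linear-heavy models like BERT.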

Stack

BERT · Quantization · Pruning · NLP · Python

Links

🔒 Private repo

Notes

I went in expecting quantization to be the headline. It wasn't: int8 was basically free, and the real story was pruning. Hugging Face's built-in pruning rounds the mask too aggressively for fine-grained sparsity targets, so I enforced the zero-mask inside the forward hook to keep exactly the sparsity I asked for. Past 50% sparsity the F1 curve develops a clear elbow: cheap until it isn't, and the cliff erases the long tail of generalization first. The interesting decision wasn't "how small can it get" but "where do I stop."
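The mask-in-the-forward-hook idea can be sketched as follows: build a magnitude mask at the target sparsity, then re-apply it on every forward pass so optimizer updates can't resurrect pruned weights. This is a simplified sketch on a single `nn.Linear`, not the project's actual implementation; the function name `attach_mask_enforced_pruning` is my own.

```python
import torch
import torch.nn as nn

def attach_mask_enforced_pruning(linear: nn.Linear, sparsity: float):
    """Zero the smallest-magnitude weights and hard-enforce the mask
    before every forward pass via a forward pre-hook."""
    w_abs = linear.weight.detach().abs()
    k = int(sparsity * w_abs.numel())
    # k-th smallest magnitude becomes the pruning threshold.
    threshold = torch.kthvalue(w_abs.flatten(), k).values
    mask = (w_abs > threshold).float()

    def enforce(module, inputs):
        # Re-zero pruned positions; gradient updates between forwards
        # can otherwise drift them away from exactly zero.
        module.weight.data.mul_(mask)

    linear.register_forward_pre_hook(enforce)
    return mask

torch.manual_seed(0)
layer = nn.Linear(64, 64)
attach_mask_enforced_pruning(layer, sparsity=0.547)
layer(torch.randn(1, 64))  # hook fires here, masking the weights
achieved = 1.0 - layer.weight.count_nonzero().item() / layer.weight.numel()
print(f"achieved sparsity: {achieved:.3f}")
```

Because the threshold comes from an exact k-th order statistic rather than a rounded fraction per tensor, the achieved sparsity tracks the requested 54.7% closely instead of snapping to a coarser grid.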