s3nh
Knowledge Distillation (KD) is one of my favorite topics, but I have to confess that I'm not a huge fan of the term because I find it confusing (or at least, it has become so over time).
The idea behind KD is not novel; it was there almost a decade before the paper came out (and arguably even a decade before that, back to 1990-91). But this paper is the one that clicked, the one that made the topic much more popular and introduced it to a broader audience.
First, the timing and the authors played a big role: we have Geoffrey Hinton, Oriol Vinyals, and Jeff Dean here. And second, Geoffrey Hinton is really good at idea branding: Model compression?! No, no, no! Let's call it "Knowledge Distillation" and use evocative terms such as "Dark Knowledge" to describe what is being transferred.
It's a great name, but as time has passed, the term has become a bit of a relic. KD is no longer solely about compression (KD used to be introduced as a method for model compression, but now model compression is just one application of KD). The other issue is that the word "distillation" implies some sort of potency, as if the student were somehow more powerful than the teacher, which is not the case (though counterarguments could be made, for example, that the student is more powerful than the same model trained with no teacher).
Nevertheless, the paper is incredibly well-written, short, and fun to read. It's one of the few papers that I have read several times. Check it out, and maybe share your thoughts on the topic with us here!
If you had to choose another name for Knowledge Distillation, what would it be?
HarEmb - PII: a single-transformer-block layer distilled from the OpenMed PII Privacy filter. It's a very tiny model that reaches comparable results at PII classification through Viterbi BIOES decoding, retaining ~98% of the original model's performance at a tiny fraction of the base model's size.
It doubles throughput (tokens/s) and dramatically reduces the active parameters and the VRAM footprint.
The evaluation and benchmarking are in the model repository and can be reproduced. I trained it on an RTX 4090 without issues; it is compatible with the OpenMed suite and is an in-place replacement for the openai privacy-filter model.
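For readers unfamiliar with the Viterbi BIOES decoding mentioned above, here is a minimal sketch in plain Python. It is not taken from the HarEmb repository: the single `PII` entity type, the `TAGS` list, and the function names `allowed` and `viterbi_bioes` are hypothetical, and the per-token scores are made up. It only illustrates the general idea of picking the highest-scoring tag sequence that respects the BIOES transition rules, instead of taking a per-token argmax.

```python
import math

# Hypothetical label set with a single entity type; a real model has its own tags.
TAGS = ["O", "B-PII", "I-PII", "E-PII", "S-PII"]


def allowed(prev: str, curr: str) -> bool:
    """Return True if the BIOES scheme permits the transition prev -> curr."""
    prev_open = prev[0] in ("B", "I")   # an entity is still open after prev
    curr_cont = curr[0] in ("I", "E")   # curr continues an open entity
    if prev_open != curr_cont:
        return False
    # When continuing an entity, the entity types must match.
    return (not curr_cont) or prev[2:] == curr[2:]


def viterbi_bioes(scores, tags=TAGS):
    """scores[t][k] is the log-score of tag k at token t.
    Returns the best-scoring tag sequence that is valid under BIOES."""
    if not scores:
        return []
    k = len(tags)
    neg = -math.inf
    # A sequence cannot start in the middle of an entity.
    best = [scores[0][j] if tags[j][0] not in ("I", "E") else neg for j in range(k)]
    back = []
    for t in range(1, len(scores)):
        new_best, ptr = [], []
        for j in range(k):
            cands = [best[i] if allowed(tags[i], tags[j]) else neg for i in range(k)]
            i_star = max(range(k), key=cands.__getitem__)
            new_best.append(cands[i_star] + scores[t][j])
            ptr.append(i_star)
        best = new_best
        back.append(ptr)
    # A sequence cannot end with an entity still open.
    final = [best[j] if tags[j][0] not in ("B", "I") else neg for j in range(k)]
    j = max(range(k), key=final.__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [tags[j] for j in path]


# Made-up scores: a per-token argmax would give O, I-PII, E-PII, which is invalid.
example_scores = [
    [0.0, -2.0, -5.0, -5.0, -3.0],
    [-3.0, -1.0, -0.5, -5.0, -2.0],
    [-3.0, -5.0, -5.0, -0.2, -2.0],
]
print(viterbi_bioes(example_scores))  # ['O', 'B-PII', 'E-PII']
```

The transition mask is the whole point: the decoder trades a slightly lower score on the second token for a sequence that is a well-formed entity span.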
fblgit/haremb-privacy-filter-opennemo
I'm looking for people who want to co-author, contribute to, or endorse the HarEmb research and the technical paper for the model.
Contact xavi@juanako.ai
An introduction to a little-known but highly effective model reduction method: 𝗧𝗿𝗶𝗺𝗺𝗶𝗻𝗴✂️
We show how to reduce model size (we went up to 87.24% reduction) while preserving its performance.
We applied this technique to 16 different model families across several modalities to illustrate that it works on any architecture (as long as the embedding layer is the last one of the model) and on any modality involving text.
From these 16 families, we generated over 𝟱,𝟱𝟬𝟬 𝗺𝗼𝗻𝗼𝗹𝗶𝗻𝗴𝘂𝗮𝗹 𝗺𝗼𝗱𝗲𝗹𝘀 𝗶𝗻 𝟭𝟮𝟰 𝗱𝗶𝗳𝗳𝗲𝗿𝗲𝗻𝘁 𝗹𝗮𝗻𝗴𝘂𝗮𝗴𝗲𝘀 🌍
Key takeaways from our experiments:
1️⃣ Trimming does not require a GPU. Our models were obtained on a CPU.
2️⃣ This method scales up to at least 4B parameters (we did not test beyond that).
3️⃣ A trimmed model is smaller than the original while preserving its performance. If you observe a slight performance drop, just fine-tune it to recover or even surpass the original performance.
4️⃣ For an equivalent compute budget, it is better to trim and then fine-tune than to fine-tune the original model. Since the trimmed model is smaller, you can run more epochs or show it more data and end up with a better model than the original.
5️⃣ Trimming is a competitive alternative to distillation and quantization. E.g. we obtained our alternative to DistilBERT in 9 minutes on CPU vs. 90 hours of GPU for the latter.
6️⃣ Trimming could generate reasoning traces in the language of the trimmed model. This could be an alternative to generating traces in English and then translating them into the desired language.
And many other things (such as how much data are needed, the impact of the database used, the order in which it should be done, etc.) are available in the blogpost!
Blogpost: https://huggingface.co/blog/lbourdois/introduction-to-trimming
Models: alphaedge-ai/Trimming_models_search
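To make the idea concrete, here is a toy sketch in plain Python. It assumes that trimming here means vocabulary trimming, i.e. keeping only the embedding rows of tokens that actually occur in a target-language corpus and remapping token ids accordingly; the post itself does not spell out the procedure, so see the blogpost for the real method. The function names `trim_vocabulary` and `remap`, and the tiny 10-token vocabulary, are hypothetical.

```python
def trim_vocabulary(embeddings, corpus_token_ids, special_ids=()):
    """Keep only the embedding rows of tokens seen in the corpus (plus special tokens).
    Returns (trimmed_embeddings, old_to_new), where old_to_new maps old ids to new ids."""
    seen = {tid for seq in corpus_token_ids for tid in seq}
    keep = sorted(set(special_ids) | seen)
    old_to_new = {old: new for new, old in enumerate(keep)}
    trimmed = [embeddings[old] for old in keep]
    return trimmed, old_to_new


def remap(seq, old_to_new, unk_id):
    """Translate old token ids to trimmed ids; dropped tokens fall back to unk_id."""
    return [old_to_new.get(t, old_to_new[unk_id]) for t in seq]


# Toy setup: 10-token vocabulary, 4-dimensional embeddings, ids 0 and 1 are special.
embeddings = [[float(i)] * 4 for i in range(10)]
corpus = [[2, 5, 7], [5, 2]]  # token ids observed in the target-language corpus

trimmed, id_map = trim_vocabulary(embeddings, corpus, special_ids=(0, 1))
reduction = 1 - len(trimmed) / len(embeddings)

print(len(trimmed))                       # 5 rows kept out of 10
print(f"{reduction:.0%}")                 # 50% fewer embedding rows
print(remap([2, 5, 9], id_map, unk_id=1)) # [2, 3, 1]
```

No gradient step is involved, which is consistent with the observation above that trimming runs on a CPU: the work is counting tokens and slicing a matrix.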
Standard quantization places levels on a uniform grid. ICRB-Q places them on geodesics of the Fisher-Rao statistical manifold — the Riemannian manifold (M, g_F) where the metric tensor is the Fisher information. This means:
High-Fisher-curvature regions (where small weight changes cause large output changes) get exponentially denser levels.
Low-curvature, "flat" regions (e.g. many heads in early transformer layers) get coarse 2-bit or 3-bit quantization automatically.
The codebook construction reduces to solving: place 2^b points in parameter space to minimize expected geodesic distance from any weight to its nearest level.
This strictly generalizes AWQ's per-channel scaling (which is a zero-order approximation to this manifold geometry) and GPTQ's second-order correction (which is a local linearization).
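The codebook idea can be illustrated with a much simpler stand-in. The sketch below is not ICRB-Q: it works on scalar weights, assumes a per-weight (diagonal) Fisher value is already available, and replaces the geodesic distance with a Fisher-weighted squared error, which is only a local approximation. The names `fisher_weighted_codebook` and `distortion`, and the synthetic `weights` and `fisher` data, are hypothetical. What it shows is the qualitative behavior described above: starting from a uniform grid, Lloyd-style updates pull levels toward regions where the Fisher weight is high.

```python
import math


def distortion(weights, fisher, levels):
    """Fisher-weighted squared error of snapping each weight to its nearest level."""
    return sum(f * min((w - c) ** 2 for c in levels) for w, f in zip(weights, fisher))


def fisher_weighted_codebook(weights, fisher, bits, iters=50):
    """Lloyd-style iteration that places 2**bits levels to reduce
    sum_i fisher[i] * (weights[i] - nearest_level)**2."""
    k = 2 ** bits
    lo, hi = min(weights), max(weights)
    # Start from the uniform grid that standard quantization would use.
    levels = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    for _ in range(iters):
        num = [0.0] * k
        den = [0.0] * k
        for w, f in zip(weights, fisher):
            j = min(range(k), key=lambda c: (w - levels[c]) ** 2)
            num[j] += f * w
            den[j] += f
        # Each level moves to the Fisher-weighted mean of its assigned weights;
        # a level with no assigned mass stays where it is.
        levels = [num[j] / den[j] if den[j] > 0 else levels[j] for j in range(k)]
    return sorted(levels)


# Synthetic data: weights spread over [-1, 1], Fisher mass concentrated near 0.
weights = [i / 50 - 1 for i in range(101)]
fisher = [math.exp(-((w / 0.2) ** 2)) + 1e-3 for w in weights]

uniform_levels = [-0.75, -0.25, 0.25, 0.75]  # uniform 2-bit grid on [-1, 1]
learned_levels = fisher_weighted_codebook(weights, fisher, bits=2)

print(learned_levels)
print(distortion(weights, fisher, uniform_levels))
print(distortion(weights, fisher, learned_levels))
```

Each Lloyd step cannot increase the weighted objective, so the learned codebook is never worse than the uniform grid under this proxy metric. A true geodesic formulation would also change how nearest levels are assigned, which this sketch does not attempt.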