s3nh (s3nh)

reacted to their post with ❤️ about 11 hours ago

Post

226

Existing methods (GPTQ, AWQ, llama.cpp's k-quants) minimize empirical loss heuristically. None of them is proven optimal in any information-theoretic sense. ICRB-Q builds a quantization scheme that is provably optimal via the Cramér-Rao lower bound (CRB): no unbiased estimator of the weights θ can have covariance below [F(θ)]⁻¹ (in the positive semidefinite ordering), where F is the Fisher information matrix.
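To make the bound itself concrete, here is a minimal sketch that checks the Cramér-Rao bound on a textbook case: estimating the mean of a Gaussian with known standard deviation. It is not ICRB-Q and says nothing about quantization; the function names and the toy numbers (`sigma=2.0`, `n=25`) are assumptions chosen for illustration.

```python
import random
import statistics


def fisher_information_gaussian_mean(sigma: float, n: int) -> float:
    """Fisher information about the mean carried by n i.i.d. N(mu, sigma^2) samples."""
    return n / sigma**2


def empirical_variance_of_sample_mean(mu, sigma, n, trials, seed=0):
    """Monte Carlo variance of the sample mean, an unbiased estimator of mu."""
    rng = random.Random(seed)
    estimates = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.pvariance(estimates)


# Cramér-Rao bound: Var(estimator) >= 1 / F. Here 1 / F = sigma^2 / n.
crb = 1.0 / fisher_information_gaussian_mean(sigma=2.0, n=25)
emp = empirical_variance_of_sample_mean(mu=0.5, sigma=2.0, n=25, trials=4000)
print(f"CRB = {crb:.4f}, empirical variance of sample mean = {emp:.4f}")
```

The sample mean attains the bound in this case, so the two numbers should agree up to Monte Carlo noise; for most estimators the empirical variance sits strictly above the bound.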

liked 2 models about 11 hours ago

deepreinforce-ai/Ornith-1.0-9B

Text Generation • 1.47M • Updated 8 days ago • 58.4k • 354

deepreinforce-ai/Ornith-1.0-35B

Text Generation • 665k • Updated 8 days ago • 186k • 312

reacted to mmhamdy's post with 🧠 3 days ago

Post

271

It has been more than a decade now since the knowledge distillation paper came out.

Knowledge Distillation (KD) is one of my favorite topics, but I have to confess that I'm not a huge fan of the term because I find it confusing (or at least, it has become so over time).

The idea behind KD is not novel; it was there almost a decade before the paper came out (and arguably even a decade before that, back to 1990-91). But this paper is the one that clicked, the one that made the topic much more popular and introduced it to a broader audience.

First, the timing and the authors played a big role: we have Geoffrey Hinton, Oriol Vinyals, and Jeff Dean here. And second, Geoffrey Hinton is really good at idea branding: Model compression?! No, no, no! Let's call it "Knowledge Distillation" and use evocative terms such as "Dark Knowledge" to describe what is being transferred.

It's a great name, but as time has passed, the term has become a bit of a relic. KD is no longer solely about compression (KD used to be introduced as a method for model compression, but now model compression is just one application of KD). The other thing is that the word "distillation" implies some sort of potency, that the student is somehow more powerful than the teacher, which is not the case (though many counterarguments could be made: for example, the student is more powerful than another model trained with no teacher).

Nevertheless, the paper is incredibly well-written, short, and fun to read. It's one of the few papers that I have read several times. Check it out, and maybe share your thoughts on the topic with us here!

If you had to choose another name for Knowledge Distillation, what would it be?

upvoted a paper 15 days ago

Sumi: Open Uniform Diffusion Language Model from Scratch

Paper • 2606.19005 • Published 16 days ago • 11

liked a model 16 days ago

DJLougen/Qwen3.6-35B-A3B-REAP-90pct-GGUF

Text Generation • 6B • Updated 19 days ago • 4.65k • 14

reacted to fblgit's post with 👍 17 days ago

Post

184

Introducing HarEmb - PII: a single-transformer-block layer distilled from the OpenMed PII Privacy filter.

It's a very small model that reaches comparable results at PII classification through Viterbi BIOES decoding, retaining ~98% of the original model's performance at a tiny fraction of the base model's size.
It doubles throughput (tokens/s) and dramatically reduces the active parameters and the VRAM footprint.
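For readers unfamiliar with Viterbi BIOES decoding, here is a minimal sketch of the idea: pick the highest-scoring tag sequence that respects the BIOES transition rules, rather than taking the per-token argmax. This is not HarEmb's actual decoder; the single entity type, the `viterbi_bioes` name, and the toy log-scores are all assumptions for illustration.

```python
import math

TAGS = ("O", "B", "I", "E", "S")
# Which tags may follow each tag under BIOES with one entity type.
ALLOWED = {
    "O": ("O", "B", "S"),
    "B": ("I", "E"),
    "I": ("I", "E"),
    "E": ("O", "B", "S"),
    "S": ("O", "B", "S"),
}
START = ("O", "B", "S")  # a sequence cannot open inside an entity
END = ("O", "E", "S")    # and cannot close with an unfinished entity


def viterbi_bioes(emissions):
    """emissions: list of dicts mapping tag -> log-score. Returns the best valid path."""
    if not emissions:
        return []
    score = {t: (emissions[0][t] if t in START else -math.inf) for t in TAGS}
    backpointers = []
    for emit in emissions[1:]:
        new_score, ptr = {}, {}
        for t in TAGS:
            # Best predecessor among tags that are allowed to precede t.
            prev = max((p for p in TAGS if t in ALLOWED[p]), key=lambda p: score[p])
            new_score[t] = score[prev] + emit[t]
            ptr[t] = prev
        score = new_score
        backpointers.append(ptr)
    path = [max(END, key=lambda t: score[t])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]


def row(**scores):
    """Build one emission row; unspecified tags get a low default log-score."""
    return {t: scores.get(t, -5.0) for t in TAGS}


emissions = [
    row(O=0.0),
    row(B=-0.1, S=-0.5, O=-3.0),
    row(O=-0.2, E=-1.5, I=-3.0),
    row(O=0.0),
]
greedy = [max(TAGS, key=lambda t: e[t]) for e in emissions]
decoded = viterbi_bioes(emissions)
print(greedy)   # per-token argmax, may violate BIOES
print(decoded)  # constrained path
```

In this toy input the per-token argmax yields `B` followed by `O`, which is not a valid BIOES sequence, while the constrained decode switches the second token to `S`.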

The evaluation and benchmarking are in the model repository and can be reproduced. I trained it on an RTX 4090 without issues, and it is compatible with the OpenMed suite and is an in-place replacement for the openai privacy-filter model.

fblgit/haremb-privacy-filter-opennemo

I'm looking for people who want to co-author, contribute to, or endorse the HarEmb research and the technical paper for the model.

Contact xavi@juanako.ai

reacted to lbourdois's post with 🤗 17 days ago

Post

1017

New blog post!
An introduction to a little-known but highly effective model reduction method: 𝗧𝗿𝗶𝗺𝗺𝗶𝗻𝗴✂️
We show how to reduce model size (we went up to 87.24% reduction) while preserving its performance.

We applied this technique to 16 different model families across several modalities to illustrate that it works on any architecture (as long as the embedding layer is the last one of the model) and on any modality involving text.
From these 16 families, we generated over 𝟱,𝟱𝟬𝟬 𝗺𝗼𝗻𝗼𝗹𝗶𝗻𝗴𝘂𝗮𝗹 𝗺𝗼𝗱𝗲𝗹𝘀 𝗶𝗻 𝟭𝟮𝟰 𝗱𝗶𝗳𝗳𝗲𝗿𝗲𝗻𝘁 𝗹𝗮𝗻𝗴𝘂𝗮𝗴𝗲𝘀 🌍

Key takeaways from our experiments:
1️⃣ Trimming does not require a GPU. Our models were obtained on a CPU.
2️⃣ This method scales up to at least 4B parameters (we did not test beyond that).
3️⃣ The trimmed model is smaller than the original while preserving its performance. If you observe a slight performance drop, just fine-tune to recover or even surpass the original performance.
4️⃣ For an equivalent compute budget, it is better to trim and then fine-tune rather than fine-tune the original model. Since the model is smaller, you can run more epochs or show more data and ultimately get a better model than the original.
5️⃣ Trimming is a competitive alternative to distillation and quantization. E.g. we obtained our alternative to DistilBERT in 9 minutes on CPU vs. 90 hours of GPU for the latter.
6️⃣ Trimming could generate reasoning traces in the language of the trimmed model. This could be an alternative to generating traces in English and then translating them into the desired language.

And many other things (such as how much data are needed, the impact of the database used, the order in which it should be done, etc.) are available in the blogpost!

Blogpost: https://huggingface.co/blog/lbourdois/introduction-to-trimming
Models: alphaedge-ai/Trimming_models_search
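
As a rough illustration of why this can run on a CPU in minutes, here is a minimal sketch assuming that Trimming means vocabulary trimming: dropping the embedding rows of tokens that never occur in the target language and remapping the remaining token ids. The post does not spell out the procedure, so this reading, the `trim_embeddings` helper, and the toy vocabulary are all assumptions; see the blog post for the actual method.

```python
def trim_embeddings(vocab, embeddings, corpus_tokens, always_keep=("<unk>",)):
    """Return (new_vocab, new_embeddings) restricted to tokens seen in the corpus.

    vocab: dict mapping token -> row index into embeddings
    embeddings: list of rows, one list of floats per token
    """
    keep = set(always_keep) | set(corpus_tokens)
    # Walk the vocabulary in original id order so the remapping is deterministic.
    kept_tokens = [
        tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1]) if tok in keep
    ]
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept_tokens)}
    new_embeddings = [embeddings[vocab[tok]] for tok in kept_tokens]
    return new_vocab, new_embeddings


# Toy multilingual vocabulary with a 2-dimensional embedding per token.
vocab = {"<unk>": 0, "hello": 1, "bonjour": 2, "monde": 3, "world": 4, "hola": 5}
embeddings = [[float(i), float(i) + 0.5] for i in range(len(vocab))]
french_corpus = ["bonjour", "monde", "bonjour"]

new_vocab, new_embeddings = trim_embeddings(vocab, embeddings, french_corpus)
reduction = 1 - len(new_embeddings) / len(embeddings)
print(new_vocab, f"embedding rows removed: {reduction:.0%}")
```

No gradient step is involved: the kept rows are copied unchanged, which is consistent with the claim that no GPU is needed.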

reacted to appvoid's post with 👀 19 days ago

Post

155

yikes! i missed the small model hackathon i guess i'll have to make sota for people to notice

liked a model 25 days ago

merve/rf-detr-mobile-ui

Object Detection • 33.4M • Updated 28 days ago • 49 • 1

replied to their post 25 days ago

Standard quantization places levels on a uniform grid. ICRB-Q places them on geodesics of the Fisher-Rao statistical manifold — the Riemannian manifold (M, g_F) where the metric tensor is the Fisher information. This means:

High-Fisher-curvature regions (where small weight changes cause large output changes) get exponentially denser levels.
Low-curvature, "flat" regions (e.g. many heads in early transformer layers) get coarse 2-bit or 3-bit quantization automatically.
The codebook construction reduces to solving: place 2^b points in parameter space to minimize expected geodesic distance from any weight to its nearest level.

This strictly generalizes AWQ's per-channel scaling (which is a zero-order approximation to this manifold geometry) and GPTQ's second-order correction (which is a local linearization).
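
The geometric idea can be sketched in one dimension, where the Fisher-Rao length element is sqrt(F(θ)) dθ. The sketch below places 2^b levels at equal spacing in that arc length, so levels crowd into high-Fisher regions. It is not the ICRB-Q algorithm: equal arc-length spacing is a simplification of the expected-distance objective described above, and `toy_fisher`, `geodesic_levels`, the interval, and the grid size are all assumptions.

```python
import bisect
import math


def geodesic_levels(fisher, lo, hi, bits, grid=2001):
    """Place 2**bits levels in [lo, hi], equally spaced in 1-D Fisher-Rao arc length."""
    step = (hi - lo) / (grid - 1)
    thetas = [lo + i * step for i in range(grid)]
    speed = [math.sqrt(fisher(t)) for t in thetas]
    # Cumulative arc length s(theta) by the trapezoidal rule.
    arc = [0.0]
    for i in range(1, grid):
        arc.append(arc[-1] + 0.5 * (speed[i - 1] + speed[i]) * step)
    total = arc[-1]
    k = 2**bits
    levels = []
    for j in range(k):
        target = total * (j + 0.5) / k  # midpoint of the j-th equal-length cell
        i = min(max(bisect.bisect_left(arc, target), 1), grid - 1)
        # Invert s(theta) by linear interpolation between grid points.
        frac = (target - arc[i - 1]) / (arc[i] - arc[i - 1])
        levels.append(thetas[i - 1] + frac * step)
    return levels


def toy_fisher(theta):
    """Hypothetical sensitivity profile: sharply peaked near theta = 0."""
    return 1.0 + 50.0 * math.exp(-((theta / 0.1) ** 2))


levels = geodesic_levels(toy_fisher, -1.0, 1.0, bits=3)
gaps = [b - a for a, b in zip(levels, levels[1:])]
print([round(x, 3) for x in levels])
```

With this toy profile the gap between the two central levels is several times smaller than the gap between the two outermost ones, which is the "denser levels where curvature is high" behavior in miniature.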
