gemma-4-A4B-98e-v6-coder-it-GGUF
GGUF quantizations of ManniX-ITA/gemma-4-A4B-98e-v6-coder-it — the LCB-targeted code prune of Gemma 4 26B-A4B (98 experts/layer, ~20.8B). All quants use imatrix with calibration data v5. The recipe + full Q6_K 9-bench scoreboard are in the embedded model card below.
Available Quantizations
Per-tier HumanEval+ (164q, chat-mode, pass@1, greedy) from the llama.cpp sweep
(omk_eval humanevalplus_full, canonical summary.json .score), 2-GPU shard on pod
37588132. Q6_K is the 9-bench reference run; F16 is the unquantized baseline.
CD-* tiers are ContribDynamic per-layer quants (expert-contribution-aware mixed precision).
Sizes are decimal GB (÷10⁹), bpw computed against the cohort 20.8B param count to match the
v5-coder card.
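The bpw column follows from the file size and the 20.8B parameter count stated above. A minimal Python sketch of that arithmetic, assuming decimal GB as described; the function name is illustrative and not part of any tool in this repo:

```python
def bits_per_weight(size_gb: float, params_billion: float = 20.8) -> float:
    """Bits per weight from a decimal-GB file size and a parameter count in billions."""
    # decimal GB -> bytes -> bits, divided by the number of parameters
    return (size_gb * 1e9 * 8) / (params_billion * 1e9)


# Sizes copied from the quantization table on this card.
for tier, size_gb in [("Q8_0", 21.16), ("Q4_K_M", 13.24), ("IQ4_XS", 11.01), ("IQ2_S", 7.83)]:
    print(f"{tier}: {bits_per_weight(size_gb):.2f} bpw")
```

Because the published sizes are already rounded to two decimals, a value recomputed this way can differ from the table by about 0.01.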
imatrix policy. The plain K-quant ladder was rebuilt with imatrix (calibration data v5) on 2026-05-25, which lifted the low-bit K-quants the most (e.g. Q3_K_M 85.98 → 92.68, Q3_K_S 82.93 → 87.80, Q3_K_L/XL 88.41 → 92.07). Q4_K_M is the one exception — built without imatrix, because imatrix measurably lowered its HE+ (90.85 imatrix vs 92.07 plain). IQ-tiers always carry imatrix; Q8_0/Q4_0/Q4_1 never do.
| Quant | Size (GB) | bpw | HumanEval+ pass@1 (164) |
|---|---|---|---|
| F16 | 39.79 | 16.00 | — (baseline) |
| Q8_0 | 21.16 | 8.14 | 93.29% |
| Q6_K_L | 17.98 | 6.91 | 92.68% |
| Q6_K | 17.81 | 6.84 | 93.29% |
| CD-Q6_K | 15.54 | 5.97 | 92.07% |
| Q5_K_L | 15.25 | 5.86 | 91.46% |
| Q5_K_M | 15.07 | 5.79 | 92.68% |
| Q5_K_S | 14.19 | 5.45 | 92.07% |
| Q4_K_L | 13.42 | 5.16 | 92.68% |
| Q4_K_M | 13.24 | 5.09 | 92.07% (plain — no imatrix) |
| CD-Q5_K_M | 13.07 | 5.03 | 90.85% |
| Q4_1 | 12.61 | 4.85 | 91.46% |
| Q4_K_S | 12.21 | 4.69 | 92.68% |
| IQ4_NL | 11.42 | 4.39 | 91.46% |
| Q4_0 | 11.42 | 4.39 | 90.85% |
| IQ4_XS | 11.01 | 4.23 | 92.07% |
| Q3_K_L | 10.94 | 4.21 | 92.07% |
| Q3_K_XL | 10.69 | 4.11 | 92.07% |
| CD-Q4_K_M | 10.65 | 4.10 | 91.46% |
| Q3_K_M | 10.51 | 4.04 | 92.68% |
| CD-IQ4_K_M | 10.29 | 3.96 | 91.46% |
| IQ3_M | 9.82 | 3.78 | 92.07% |
| Q3_K_S | 9.68 | 3.72 | 87.80% |
| IQ3_XS | 9.22 | 3.54 | 92.07% |
| IQ3_XXS | 8.95 | 3.44 | 90.85% |
| IQ2_M | 8.22 | 3.16 | 90.24% |
| IQ2_S | 7.83 | 3.01 | 89.02% |
| IQ2_XS | 7.77 | 2.99 | 70.73% |
Recommended: IQ4_XS (11.01 GB / 92.07%) is the safe 4-bit default. The imatrix rebuild opened a 92.68% cluster just below it — Q3_K_M (10.51 GB), Q4_K_S (12.21 GB), Q4_K_L (13.42 GB), Q5_K_M (15.07 GB) — so Q3_K_M is now the smallest tier at the top of the HE+ band (the ≤0.61pp steps here are within HE+'s 1-question granularity, so treat the 92.07/92.68 cluster as a tie and pick by disk). v6-coder holds ≥92.07% down to IQ3_XS (9.22 GB / 3.54 bpw). Max fidelity: Q6_K / Q8_0 (93.29%). Avoid IQ2_XS (70.73% — 2-bit cliff; use IQ2_S 89.02% at 7.83 GB instead).
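The "1-question granularity" remark can be checked by converting the pass@1 percentages back to solved-problem counts. A small Python sketch, assuming each score is simply solved / 164 rounded to two decimals (the helper name is illustrative):

```python
HEP_QUESTIONS = 164  # HumanEval+ problem count used on this card


def solved_count(pass_at_1_percent: float, n: int = HEP_QUESTIONS) -> int:
    """Recover the number of solved problems from a pass@1 percentage."""
    return round(pass_at_1_percent / 100 * n)


step_pp = 100 / HEP_QUESTIONS  # one question, expressed in percentage points
print(f"one question = {step_pp:.2f}pp")
for tier, score in [("Q6_K", 93.29), ("Q3_K_M", 92.68), ("IQ4_XS", 92.07), ("IQ2_XS", 70.73)]:
    print(f"{tier}: {solved_count(score)}/{HEP_QUESTIONS}")
```

Under that assumption the 92.07 / 92.68 / 93.29 tiers sit exactly one solved problem apart, which is why they are best treated as a tie.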
How to Use
With llama.cpp (the canonical eval recipe):
llama-server -m gemma-4-A4B-98e-v6-coder-it-IQ4_XS.gguf \
-c 32768 -ngl 99 --reasoning-format deepseek --reasoning-budget 12288
With ollama:
ollama pull mannix/gemma4-98e-v6-coder:Q4_K_M # any tier tag; :latest = Q4_K_M
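Once llama-server is running, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal standard-library Python sketch, not part of the eval recipe. It assumes the server listens on its default local port (8080); the helper names are illustrative.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on its default port (8080).
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # greedy decoding, matching the evals on this card
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(chat("Write a Python function that reverses a linked list."))
```

If the server was started on a different host or port, adjust BASE_URL accordingly.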
Comparison vs the 14–22B coder field
For sense-of-scale on the Q6_K headline (HE 98.78 / HE+ 93.29 / LCB-medium-55 96.36). Peer
numbers are official model-card / paper / blog (linked); cross-lab numbers are noisy by ±2–5pp
(different prompts, sampling, chat-vs-base framings). v6-coder is run greedy (T=0).
Coder-specialized 14–22B:
| Model | Params | HE | HE+ | LCB (version) | Source |
|---|---|---|---|---|---|
| 98e v6-coder (this) | 20.8B / 4B MoE | 98.78 | 93.29 | 96.36 (LCB-medium-55 v4) | this card |
| 98e v5-coder | 20.8B / 4B MoE | 99.39 | 93.29 | 85.45 (LCB-medium-55 v4) | v5-coder card |
| Qwen2.5-Coder-14B-Instruct | 14.7B dense | 89.6 | 87.2 | 23.4 (LCB 07/24–11/24, pre-v4) | arXiv:2409.12186 |
| DeepSeek-Coder-V2-Lite-Instruct | 16B / 2.4B MoE | 81.1 | — | 24.3 (LCB 12/01–06/01) | arXiv:2406.11931 |
| Codestral-22B v1 | 22B dense | 81.1 | — | (not published) | Mistral blog |
| IBM Granite-20B-Code-Instruct | 20B dense | 60.4 | — | — | arXiv:2405.04324 |
Generalist 14–22B (notable code/reasoning scores):
| Model | Params | HE | MATH | GPQA-D | IFEval | Source |
|---|---|---|---|---|---|---|
| 98e v6-coder (this) | 20.8B / 4B MoE | 98.78 | 91.00 (MATH-500) | 67.17 | 92.00 | this card |
| Phi-4 | 14B dense | 82.6 | 80.4 (MATH) | 56.1 | 63.0 | arXiv:2412.08905 |
| Qwen2.5-14B-Instruct | 14.7B dense | 81.7–86.2 | 73.0 (MATH) | 40.9 | 80.0 | Qwen blog |
| Mistral-Small-3 (24B, just above band) | 24B dense | ~84 | 70.6 (MATH) | 45.3 | 82.1 | Mistral blog |
Where v6-coder sits:
- HE / HE+: top of the band — 98.78 / 93.29, ~+9 / +6pp above Qwen2.5-Coder-14B's 89.6 / 87.2 (the published field leader). Essentially tied with v5-coder (−0.61 HE, identical HE+) while gaining +10.91pp on LCB.
- LCB: v6-coder's 96.36 is LCB-medium-55 on v4 problems. The published Qwen2.5-Coder / DS-Coder-V2-Lite numbers above (23.4 / 24.3) are full LCB on pre-v4 windows — a different, non-comparable subset. The apples-to-apples number now exists: run on the same LCB-medium-55 v4 split and the same llama.cpp stack, Qwen2.5-Coder-14B scores 18.18% vs v6-coder's 96.36%. Read it as a same-rig measurement, not a definitive ranking — the v4 split is only 55 problems, v6-coder was specifically LCB-v4-targeted (that is the entire recipe), and Qwen takes the same llama-server chat-template penalty seen in the same-stack HE+ sweep below. Don't read the raw +78pp as general coding superiority.
- MATH-500 91.00 / GSM8K 91.00 / AIME 63.33: top of the band for math-on-text reasoning. Phi-4's 80.4 MATH is the closest generalist; v6-coder beats it by ~11pp. AIME 63.33 is far above any published 14–22B coder AIME.
- GPQA-Diamond 67.17 / IFEval 92.00: GPQA materially above Phi-4 (56.1) and Qwen2.5-14B (40.9). IFEval 92 beats Phi-4 (63.0) and Qwen2.5-14B (80.0).
v6-coder Q4_K_M vs Qwen3.6-35B-A3B-UD-IQ3_XXS — iso-disk 10-bench head-to-head
A whole-suite comparison at matched disk footprint: v6-coder Q4_K_M (13.24 GB / 5.09 bpw;
20.8B-total / 4B-active MoE) vs the Unsloth-Dynamic Qwen3.6-35B-A3B-UD-IQ3_XXS (14.07 GB / 3.44 bpw;
35B-total / 3B-active MoE). All 10 canonical templates, greedy T=0, through the omk_eval
llama.cpp path; each model on its own validated build (v6-coder d6be315 with the 128e tokenizer;
Qwen the af6528e #23643 MTP build). This is an iso-disk, cross-architecture read — a code-pruned
20.8B MoE vs a 35B/3B-active generalist at ~the same GB — not iso-bpw or iso-params. v6-coder Q4_K_M is
the single plain, no-imatrix tier, i.e. its weakest-calibrated 4-bit, not a tuned imatrix build.
| Bench | v6-coder Q4_K_M | Qwen3.6-35B-A3B IQ3_XXS | Δ (v6 − Qwen) |
|---|---|---|---|
| HumanEval | 96.95 | 95.12 | +1.83 |
| HumanEval+ | 93.29 | 90.24 | +3.05 |
| MultiPL-E-100 (rs/java/js) | 88.33 | 70.33 | +18.00 |
| LCB-medium-55 (v4) | 83.64 | 78.18 | +5.46 |
| GSM8K-100 | 92.00 | 98.00 ‡ | −6.00 |
| MATH-500-100 | 89.00 | 98.00 ‡ | −9.00 |
| AIME-30 | 50.00 | 63.33 | −13.33 |
| GPQA-Diamond (198) | 65.66 | 77.27 | −11.61 |
| IFEval-100 | 92.00 | 93.00 | −1.00 |
| ARC-Challenge | 96.08 | 96.84 | −0.76 |
‡ Qwen's gsm8k / math500 are the Qwen-tuned-template scores. On the vanilla canonical templates Qwen
posts 0% / 4% because Qwen3.x verbalizes the few-shot delimiter ("Question:" / "Problem:") inside its CoT
and trips the lm-eval stop → empty content. Re-running with the redundant stop dropped (same few-shot,
greedy, thinking budget; chat turn closes on <|im_end|>) recovers the true 98% / 98%. Gemma-4 CoT
never emits the delimiter, so v6-coder's canonical numbers are unaffected. v6-coder GPQA here is 65.66
(Q4_K_M, plain) vs the 67.17 Q6_K headline elsewhere on this card — different quant tier, as expected.
Read:
- Code (the v6-coder remit) — v6 wins all four. MultiPL-E +18.00pp (88.33 vs 70.33), LCB-medium-55 +5.46, HE+ +3.05, HE +1.83. The smaller 20.8B code-pruned MoE beats the 35B generalist on every execution-scored code bench, at lower disk and on its weakest (no-imatrix) 4-bit tier.
- Math / science — Qwen wins. GPQA-D +11.61, AIME +13.33, gsm8k / math500 +6 / +9 (on the corrected templates). A full-breadth 35B generalist out-reasons the code-specialized prune on math-on-text and graduate science — expected, and the gap is the cost of the specialization.
- General — a wash. IFEval (92 vs 93) and ARC-Challenge (96.08 vs 96.84) are within a point.
Bottom line: at the same ~13–14 GB on disk, choose v6-coder Q4_K_M for code generation (it leads HE / HE+ / MultiPL-E / LCB) and Qwen3.6-35B-A3B for math / science reasoning. The two are complementary at this footprint, not strictly ranked — which is the point of the iso-disk framing.
Same-Stack GGUF HE+ Sweep — v6-coder vs Qwen2.5-Coder-14B-Instruct
Head-to-head HumanEval+ (164q, chat-aware shadow task) on identical hardware (single RTX 3090
24 GB) and identical eval recipe (llama-server -c 32768 -ngl 99 --parallel 2, omk_eval llama
backend, lm-eval humanevalplus_full, greedy T=0). Qwen GGUFs are bartowski's
Qwen2.5-Coder-14B-Instruct-GGUF.
The "Comparison" tables above use paper-reported numbers; this section is what the same rig and
same scorer actually measure. v6-coder per-tier HE+ is the Available Quantizations
table above (complete).
Qwen2.5-Coder-14B-Instruct (14.7B dense) — bartowski quants
| Tier | File size | bpw | HE+ pass@1 |
|---|---|---|---|
| IQ4_XS | 8.12 GB | 4.42 | 84.76% |
| Q4_0 | 8.54 GB | 4.65 | 84.15% |
| Q4_K_M | 8.99 GB | 4.89 | 85.37% |
| Q5_K_M | 10.51 GB | 5.72 | 83.54% |
| Q6_K | 12.12 GB | 6.60 | 84.76% |
| Q8_0 | 15.70 GB | 8.54 | 84.76% |
Qwen sits at 83–85% across the whole tier ladder. The paper-reported 87.2 HE+ is ~2pp above what bartowski's GGUFs deliver on this stack — a known llama-server chat-template vs vLLM-temp=0 quirk, not a quant defect.
Head-to-head by file size (v6-coder runs lower bpw at the same disk)
Pairing by tier name is misleading: v6-coder is a 20.8B-total MoE and Qwen is a 14.7B dense model, so the same tier name maps to different file sizes. The fair comparison is iso-disk: at a given GB budget, which model wins HE+? In every paired band below, v6-coder uses roughly 1.3–2.8 bpw less than Qwen and still scores higher (all tiers landed).
| Disk band | Qwen2.5-Coder-14B (size / bpw / HE+) | v6-coder best (size / bpw / HE+) | Δ HE+ |
|---|---|---|---|
| ~21 GB | (none — Qwen ceiling is Q8_0 15.70 GB) | Q8_0 21.16 / 8.14 / 93.29% | — |
| ~18 GB | (none) | Q6_K 17.81 / 6.84 / 93.29% | new top |
| ~15.7 GB | Q8_0 15.70 / 8.54 / 84.76% | Q5_K_M 15.07 / 5.79 / 92.68% | +7.92 |
| ~13 GB | Q6_K 12.12 / 6.60 / 84.76% | Q4_K_L 13.42 / 5.16 / 92.68% | +7.92 |
| ~12 GB | Q6_K 12.12 / 6.60 / 84.76% | Q4_K_S 12.21 / 4.69 / 92.68% | +7.92 |
| ~11 GB | Q5_K_M 10.51 / 5.72 / 83.54% | IQ4_XS 11.01 / 4.23 / 92.07% ⭐ | +8.53 |
| ~10.5 GB (iso-disk) | Q5_K_M 10.51 / 5.72 / 83.54% | Q3_K_M 10.51 / 4.04 / 92.68% | +9.14 |
| ~10 GB | Q5_K_M 10.51 / 5.72 / 83.54% | IQ3_M 9.82 / 3.78 / 92.07% ⭐ | +8.53 |
| ~9.2 GB | Q4_K_M 8.99 / 4.89 / 85.37% | IQ3_XS 9.22 / 3.54 / 92.07% | +6.70 |
| ~9 GB | Q4_K_M 8.99 / 4.89 / 85.37% | IQ3_XXS 8.95 / 3.44 / 90.85% | +5.48 |
| ~8 GB | IQ4_XS 8.12 / 4.42 / 84.76% | IQ2_M 8.22 / 3.16 / 90.24% | +5.48 |
| ~7.8 GB | IQ4_XS 8.12 / 4.42 / 84.76% | IQ2_S 7.83 / 3.01 / 89.02% | +4.26 |
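The pairing in this table can be approximated by matching each v6-coder tier with the Qwen tier closest in file size. A short Python sketch of that idea, using a subset of rows copied from the tables on this card; the nearest-size rule is an assumption about how the bands were paired, not a documented procedure:

```python
# (tier, size_gb, bpw, he_plus) copied from the tables on this card.
QWEN = [
    ("IQ4_XS", 8.12, 4.42, 84.76),
    ("Q4_0", 8.54, 4.65, 84.15),
    ("Q4_K_M", 8.99, 4.89, 85.37),
    ("Q5_K_M", 10.51, 5.72, 83.54),
    ("Q6_K", 12.12, 6.60, 84.76),
    ("Q8_0", 15.70, 8.54, 84.76),
]
V6 = [
    ("Q5_K_M", 15.07, 5.79, 92.68),
    ("Q4_K_S", 12.21, 4.69, 92.68),
    ("IQ4_XS", 11.01, 4.23, 92.07),
    ("Q3_K_M", 10.51, 4.04, 92.68),
    ("IQ3_XS", 9.22, 3.54, 92.07),
    ("IQ2_S", 7.83, 3.01, 89.02),
]


def nearest_by_size(tier, candidates):
    """Return the candidate whose file size is closest to the given tier's."""
    return min(candidates, key=lambda c: abs(c[1] - tier[1]))


def compare(v6_tier):
    """Pair a v6-coder tier with the nearest-size Qwen tier and diff HE+ and bpw."""
    q = nearest_by_size(v6_tier, QWEN)
    return {
        "v6": v6_tier[0],
        "qwen": q[0],
        "delta_he_plus": round(v6_tier[3] - q[3], 2),
        "delta_bpw": round(v6_tier[2] - q[2], 2),
    }


for tier in V6:
    print(compare(tier))
```

A stricter budget rule (best tier at or below a given size) would pair some rows differently, so the deltas depend on which pairing convention is chosen.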
Reads:
- A 92.68% cluster over a wide 92.07% plateau. The imatrix rebuild lifted the K-quants into a 92.68% top cluster — Q3_K_M (10.51 GB), Q4_K_S (12.21 GB), Q4_K_L (13.42 GB), Q5_K_M (15.07 GB) — sitting just above a broad 92.07% plateau (IQ3_XS 9.22 GB up through IQ4_XS / Q4_K_M / Q5_K_S / IQ3_M / Q3_K_L / Q3_K_XL). The 0.61pp steps are within HE+'s 1-question granularity, so pick by disk: Q3_K_M (10.51 GB) is the smallest tier in the top cluster, IQ3_XS (9.22 GB) the smallest on the plateau, IQ4_XS (11.01 GB) the safe 4-bit default.
- vs Qwen, every band wins by +4 to +9pp at roughly 1.3–2.8 bpw less. The biggest gap is the iso-disk ~10.5 GB point: v6-coder Q3_K_M (10.51 GB / 4.04 bpw / 92.68%) vs Qwen Q5_K_M (10.51 GB / 5.72 bpw / 83.54%), which is +9.14pp at the exact same file size and −1.68 bpw.
- Sub-8 GB code-grade. IQ2_S (7.83 GB / 3.01 bpw / 89.02%) beats Qwen's IQ4_XS (8.12 GB / 4.42 bpw / 84.76%) by +4.26pp at smaller disk and lower bpw. IQ2_XS (7.77 GB) is the 2-bit cliff at 70.73%; use IQ2_S.
- Sweep complete; imatrix lifted the low-bit K-quants most. Q8_0 / Q6_K top the cohort at 93.29%; the imatrix rebuild moved the 3-bit K-quants up sharply (Q3_K_M 85.98 → 92.68, Q3_K_S 82.93 → 87.80, Q3_K_L/XL 88.41 → 92.07), so a 3-bit tier (Q3_K_M, 10.51 GB) now leads the value frontier. The 4 CD-* contrib-dynamic tiers fill the 10–15.5 GB bands (CD-Q6_K 92.07, CD-Q4_K_M / CD-IQ4_K_M 91.46, CD-Q5_K_M 90.85). From about 9 GB down the plain IQ ladder leads (IQ3_XS 9.22 GB / 92.07%, IQ2_S 7.83 GB / 89.02%).
CD recipes are open-source — generator at omnimergekit/scripts/generate_cd_maps.py.
Original Model Card
Gemma 4 A4B 98-Expert v6-coder (C6v3lcb) — LCB-targeted code prune (~20.8B)
Eval complete (Q6_K / llama.cpp). The full canonical 9-bench suite plus the extended LCB-medium-100 and MultiPL-E-100 code benches are filled below, every cell read from summary.json (greedy, cohort-pinned recipe). The 128e and 98e-v5-coder anchor columns are the matching Q6_K reference runs. NVFP4A16 is a deployment format and is not separately benchmarked (cohort policy).
Headline: the LCB targeting worked. LCB-medium-55 96.36% (+10.91pp vs v5-coder, and +1.81pp past the unpruned 128e), closing the −9.10pp hole that motivated the recipe; MultiPL-E macro 88.0% (+7 vs v5-coder, ≈128e); AIME recovers +10pp (53.33 → 63.33). The budget is paid on the non-code axes the LCB-only class weights deprioritized (MATH −3, IFEval −2 vs v5-coder).
A research checkpoint that prunes the unpruned Gemma 4 26B-A4B-it (128 experts, top-8 + shared, 30 layers) down to 98 experts per layer using a drop map that is the most code-faithful member of the v6-coder family: v5-coder's gentle C6 layer-relevance-weighted v4-floor recipe, re-derived on the corrected v3 code-pass calibration data, then steered specifically at LiveCodeBench-medium — the one code bench where expert pruning hurt most (−9.10pp vs 128e on v5-coder). Same 98e shape, same router, same attention, same norms as the rest of the cohort, plus the mandatory shared-FFN α=1.2 upweight all coder variants carry.
This is the head of the C6 → C6v3 → C6v3lcb line: each step holds the gentle C6 recipe fixed and changes exactly one variable (data, then target weighting), so the deltas are attributable.
Quantized formats
| Format | Repo | Notes |
|---|---|---|
| bf16 (this repo) | ManniX-ITA/gemma-4-A4B-98e-v6-coder-it | 9 shards, ~40.9 GB. The C6v3lcb drop map + shared α=1.2. |
| GGUF (llama.cpp / ollama) | ManniX-ITA/gemma-4-A4B-98e-v6-coder-it-GGUF | Bartowski tier sweep + ContribDynamic CD-* per-layer quants + F16 baseline. Tier sweep complete: imatrix K-quants (Q4_K_M plain), all HE+-scored. |
| NVFP4A16 (vLLM) | ManniX-ITA/gemma-4-A4B-98e-v6-coder-NVFP4A16 (planned) | ~13 GB, native vLLM, via modelopt==0.43.0. Deployment format, not separately benchmarked. |
| Ollama | mannix/gemma4-98e-v6-coder | GGUF tier sweep, ollama pull mannix/gemma4-98e-v6-coder:<tier> (:latest = Q4_K_M). All tiers pushed. |
Eval is llama.cpp / Q6_K only for this cohort. NVFP4A16 is published as a vLLM-deployable format but is not benchmarked separately — the Q6_K scoreboard below is representative of the model's quality.
At a glance
| | 128e (base) | 98e v5-coder | 98e v6-coder (C6v3lcb) |
|---|---|---|---|
| Total params | ~26B | ~20.8B | ~20.8B |
| Active params / token | ~4B (top-8 + shared) | ~4B | ~4B |
| Experts per layer | 128 | 98 (30 dropped) | 98 (30 dropped) |
| Layers | 30 | 30 | 30 |
| Drop map | — | C6 layer-relevance v4-floor, breadth=50, _fixed data | C6 v4-floor, breadth=50, v3 data + outlier-fix, LCB-targeted |
| Calibration classes weighted | — | code + HE + HE+ + LCB | code (3×) + LCB-medium (2×), HE/HE+ targeting OFF |
| Shared FFN α | 1.0 | 1.0 (none) | 1.2 (mlp.down_proj) |
| Built from | — | 98e v4 (re-mapped) | 128e original (fresh prune) |
Recipe
The drop map is produced by generate_drop_map_multiclass.py (omnimergekit) from
per-expert, per-class contribution scores, then applied with expert_drop.py,
then the shared expert is upweighted. Four design choices define C6v3lcb; each is
held constant from v5-coder except where noted.
1. C6 gentle base recipe (unchanged from v5-coder)
The aggregation that ranks experts for dropping:
strategy = max # per-expert score = MAX over classes (not mean/geomean)
normalize = rank # rank-normalize within each (layer, class) before aggregating
protect_top = 16 # the 16 highest-scoring experts/layer are never dropped
alpha = 2.0 # contribution sharpening exponent
breadth_bonus = 0.5 # reward experts active across many classes (anti-overfit)
v4_floor_map = v4_layer_floor_map.json # per-layer floor: never drop below v4's keep on protected layers
baseline = teacher_force_98e_p16_clean.json # tie-break anchor
strategy=max + breadth_bonus is the load-bearing pair: it favors experts that
are strongly useful to at least one class and broadly useful across classes,
rather than experts with a high average that no single task depends on. This is
the optimizer-off-manifold
lesson encoded as a recipe — max/percentile beats mean/geomean for importance.
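A toy Python sketch of why max plus a breadth bonus ranks experts differently from a mean. This is not the code in generate_drop_map_multiclass.py. The scores, the ACTIVE_THRESHOLD, and the exact way the bonus is added are hypothetical; only alpha = 2.0 and breadth_bonus = 0.5 come from the recipe above.

```python
from statistics import mean

ALPHA = 2.0             # contribution sharpening exponent (from the recipe)
BREADTH_BONUS = 0.5     # breadth bonus (from the recipe)
ACTIVE_THRESHOLD = 0.5  # hypothetical: rank at which an expert counts as active in a class

# Hypothetical rank-normalized scores in [0, 1], one value per class.
experts = {
    "specialist": [0.95, 0.10, 0.10],    # strongly useful to one class
    "flat_average": [0.55, 0.55, 0.55],  # decent average, no class depends on it
    "broad_strong": [0.90, 0.70, 0.80],  # strong and broad
}


def score_mean(ranks):
    """Baseline aggregation: plain mean over classes."""
    return mean(ranks)


def score_max_breadth(ranks):
    """Max of sharpened ranks plus a bonus for the fraction of classes where the expert is active."""
    peak = max(r ** ALPHA for r in ranks)
    breadth = sum(r >= ACTIVE_THRESHOLD for r in ranks) / len(ranks)
    return peak + BREADTH_BONUS * breadth


def ranking(score_fn):
    """Expert names ordered from most to least important under score_fn."""
    return sorted(experts, key=lambda name: score_fn(experts[name]), reverse=True)


print("mean:         ", ranking(score_mean))
print("max + breadth:", ranking(score_max_breadth))
```

In this toy setup the mean places the flat-average expert above the specialist, while max plus breadth keeps the specialist ahead of it, which is the behavior the paragraph above describes.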
2. v3 calibration data + outlier fix (changed from v5-coder)
v5-coder ranked experts using the _fixed code-pass traces, which turned out to be
~86% NaN in the deep layers — accidentally NaN-blind where the model has real
signal. v6-coder uses the corrected expert_neuron_v5_code_v3.json (real deep-layer
signal restored, T73.0 fp32-hot-path patch), and scrubs the residual bf16 weight-norm
artifacts with a median-based outlier clamp:
data = expert_neuron_v5_code_v3.json
outlier_mode = median
outlier_wnorm_thresh = 1e4 # clamp expert weight-norms above 1e4 to the layer median
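A minimal Python sketch of a median-based clamp of this kind. It assumes the median is taken over the non-outlier values in the layer, which the settings above do not specify; the function name is illustrative and this is not the omnimergekit implementation.

```python
from statistics import median

OUTLIER_WNORM_THRESH = 1e4  # threshold from the recipe


def clamp_weight_norms(layer_norms, thresh=OUTLIER_WNORM_THRESH):
    """Replace weight-norms above `thresh` with the layer median.

    Assumption: the median is computed over the values at or below the threshold.
    """
    inliers = [w for w in layer_norms if w <= thresh]
    if not inliers:  # nothing trustworthy to anchor on; leave the layer untouched
        return list(layer_norms)
    layer_median = median(inliers)
    return [w if w <= thresh else layer_median for w in layer_norms]


print(clamp_weight_norms([1.0, 2.0, 3.0, 5e4]))
```

Clamping to the median rather than dropping the expert keeps every expert in the ranking while removing the artifact's influence on it.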
Calibration corpus (v5_code_pass_traces.json, 360 traces): Tier-A 128-token
comprehension prompts (342 traces, 1×) + Tier-B 2048-token windowed pass-traces
(18 traces, 3×) — the long-window tier captures sustained code reasoning, which is
where pruned variants ruminate.
3. LCB-only targeting — the "lcb" in C6v3lcb (changed)
Eight contribution classes are scored; the class weights steer which specialists are protected. v6-coder zeroes the HumanEval targeting and concentrates the targeted budget on LiveCodeBench-medium:
| Class | C6v3 weight | C6v3lcb weight |
|---|---|---|
| generic_math | 1 | 1 |
| generic_logic | 1 | 1 |
| generic_code | 3 | 3 |
| generic_science | 1 | 1 |
| generic_creative | 1 | 1 |
| targeted_humaneval | 2 | 0 |
| targeted_humanevalplus | 2 | 0 |
| targeted_lcb_medium_55 | 2 | 2 |
Rationale: HE/HE+ were already at/above 128e on v5-coder (+1.22 / +1.22), so spending protection budget on them is wasted; LCB-medium was the −9.10pp hole. Removing the HE targeting frees the floor to protect LCB-relevant experts harder. The resulting map is 95% identical to C6v3 (generic_code 3× dominates), so HE+/IFEval are expected at ≈C6v3 — the bench that tests the hypothesis is LCB-medium.
4. Mandatory shared-FFN α=1.2 (cohort rule)
After expert drop, router_shared_upweight.py --alpha 1.2 --target mlp.down_proj.weight
scales the always-on shared expert's down-projection by 1.2×. Without it the pruned
model is the "weak/ruminating" pre-T18 baseline; every coder variant carries it or
the v5-coder comparison is unfair. Verified by the .shared_applied marker in this repo.
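Because the down-projection is a linear map, scaling its weight by α scales the shared expert's output by the same factor. A toy plain-Python illustration of that property; the matrix and input are made up, and this is not router_shared_upweight.py itself:

```python
ALPHA = 1.2  # shared-expert upweight from the recipe


def matvec(weight, x):
    """Plain-Python matrix-vector product: y = W @ x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weight]


def upweight(weight, alpha=ALPHA):
    """Return a copy of the weight matrix with every entry multiplied by alpha."""
    return [[alpha * w_ij for w_ij in row] for row in weight]


# Toy stand-in for the shared expert's down-projection (2 outputs, 3 inputs).
down_proj = [[0.5, -1.0, 2.0],
             [1.5, 0.0, -0.5]]
hidden = [1.0, 2.0, 3.0]

before = matvec(down_proj, hidden)
after = matvec(upweight(down_proj), hidden)
print(before, after)
```

Each entry of `after` equals 1.2 times the matching entry of `before`, up to floating-point rounding.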
Expert mapping
- Uniform 30 experts dropped per layer, all 30 layers → 98 kept/layer (128 → 98).
- Mean overlap with the teacher-force baseline: 17.8 / 30 dropped per layer (min 11, max 24). The recipe re-ranks ~12 of 30 dropped slots/layer away from the naive teacher-force drop — that re-ranking is where the code-faithfulness lives.
- Aggregated keep scores (rank-normalized, α=2.0, breadth-bonus) span ~0.0–3.5 per layer with mean ~1.85; the protected top-16 sit at the high end, the dropped 30 at the low end, with boundary_ties_within_1pct of 2–39 per layer (highest in the shallow layers 0/10, which have the flattest expert-importance profile).
Full per-layer keep lists are in expert_drop_metadata.json (per_layer_keep) in
this repo; the ranking provenance (per-layer agg min/max/mean, ties, overlap-vs-baseline,
v5-only non-overlap experts) is in scripts/v6coder_C6v3lcb_drop_map.json.summary.json.
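The overlap-vs-baseline figures are per-layer set intersections between two drop maps. A small Python sketch with made-up expert ids; it does not assume anything about the layout of the JSON files named above:

```python
from statistics import mean


def overlap_stats(drops_a, drops_b):
    """Per-layer overlap between two drop maps, each a list of per-layer expert-id lists."""
    overlaps = [len(set(a) & set(b)) for a, b in zip(drops_a, drops_b)]
    return {"mean": mean(overlaps), "min": min(overlaps), "max": max(overlaps)}


# Hypothetical 3-layer example with 4 dropped experts per layer.
recipe_drops = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
baseline_drops = [[0, 1, 2, 9], [4, 5, 20, 21], [8, 9, 10, 11]]
print(overlap_stats(recipe_drops, baseline_drops))
```

With 30 drops per layer, a mean overlap of 17.8 leaves about 12 slots per layer where the two maps disagree.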
Problems solved
- Rumination on hard code/reasoning problems. Pruned Gemma 4 variants fall into <think> loops on the hardest LCB / GPQA / AIME problems, blowing the token budget without converging. C6v3lcb attacks this two ways: (a) the LCB-targeted drop map keeps the experts that the hard-problem pass-traces actually use; (b) at eval time the lcb_medium_*_v4 template applies a thinking_token_budget=12288 forcing function (parser=gemma4 + enable_thinking) that caps the rumination loop. The T109 rumination-signal gate (GPQA-48 / AIME probes) confirmed the variant generates and converges rather than looping.
- Data-vs-recipe disentanglement (T102). v6-coder isolates whether v3's earlier regressions were the data or the recipe: holding v5-coder's gentle C6 recipe fixed and only swapping in v3 data + the outlier fix. C6v3 ≈ v5-coder in smoke → the v3 data is fine and C12's aggressive recipe was the rumination cause; C6v3lcb then re-steers that clean baseline at LCB.
- Deep-layer calibration corruption (T73.0). The _fixed traces were ~86% NaN in deep layers; v6-coder uses real-signal v3 data with a median outlier clamp so deep-layer experts are ranked on genuine contribution, not NaN-blindness.
- Unfair-comparison trap. The shared α=1.2 step is mandatory and marker-verified, so v6-coder vs v5-coder is apples-to-apples.
Scoreboard — Q6_K GGUF, llama.cpp, greedy
Full 9-bench llama.cpp Q6_K run, llama-server --reasoning-format deepseek --reasoning-budget 12288 --parallel 2, greedy (T=0, top_p=1, top_k=0, do_sample=false).
The 128e and v5-coder columns are the bartowski-Q6_K / v5-coder-Q6_K reference runs under
the identical recipe (apples-to-apples within the llama.cpp/Q6_K backend).
| Bench (n) | 128e Q6_K | v5-coder Q6_K | v6-coder Q6_K | Δ (v6 − v5c) | Δ (v6 − 128e) |
|---|---|---|---|---|---|
| ARC-Challenge-chat (1172) | 97.10% | 95.73% | 95.82% | +0.09 | −1.28 |
| GPQA Diamond flex (198) | 72.73% | 65.15% | 67.17% | +2.02 | −5.56 |
| GSM8K-100 flex | 92.00% | 87.00% | 91.00% | +4.00 | −1.00 |
| MATH-500-100 math_verify | 94.00% | 94.00% | 91.00% | −3.00 | −3.00 |
| AIME 2024 (30) | 83.33% | 53.33% | 63.33% | +10.00 | −20.00 |
| IFEval-100 (prompt_strict) | 97.00% | 94.00% | 92.00% | −2.00 | −5.00 |
| HumanEval-164 chat | 96.34% | 99.39% | 98.78% | −0.61 | +2.44 |
| HumanEval+-164 chat | 90.85% | 93.29% | 93.29% | 0.00 | +2.44 |
| LCB-medium-55 v4 | 94.55% | 85.45% | 96.36% | +10.91 | +1.81 |
Read: v6-coder lands on or above v5-coder on 6 of 9 benches (the HumanEval miss, −0.61, is a single question) and beats the unpruned 128e on every code bench (HE +2.44, HE+ +2.44, LCB-55 +1.81). The headline is LCB-medium-55 +10.91pp vs v5-coder: the targeted hole is not just closed but pushed past the base model. AIME recovers +10pp (53.33 → 63.33). The cost lands on MATH (−3) and IFEval (−2) vs v5-coder, the non-code generalist axes that the LCB-only class weights (1 1 3 1 1 0 0 2) deliberately deprioritized.
Extended code benches — LCB-medium-100 + MultiPL-E-100 (Q6_K, llama.cpp, greedy)
Run on solidpc (T112), same Q6_K / greedy recipe; MultiPL-E scored via the
nuprl/multipl-e-evaluation Docker image. 128e and v5-coder columns from the v5-coder card.
LCB-medium-100 (100-problem superset of LCB-medium-55 v4):
| Bench (n) | 128e Q6_K | v5-coder Q6_K | v6-coder Q6_K | Δ (v6 − v5c) |
|---|---|---|---|---|
| LCB-medium-100 | 95.00% | 91.00% | 96.00% | +5.00 |
(v6-coder also clears 128e on LCB-100 by +1.00pp — the LCB-targeting win holds on the 100-problem superset, not just the 55-problem v4 set.)
MultiPL-E-100 (HumanEval → Rust / Java / JS, 100/lang, chat-mode + code extraction):
| Language (n=100) | 128e Q6_K | v5-coder Q6_K | v6-coder Q6_K | Δ (v6 − v5c) |
|---|---|---|---|---|
| Rust | 83.00% | 76.00% | 82.00% | +6.00 |
| Java | 91.00% | 81.00% | 89.00% | +8.00 |
| JavaScript | 95.00% | 86.00% | 93.00% | +7.00 |
| Macro mean | 89.67% | 81.00% | 88.00% | +7.00 |
v6-coder near-fully recovers MultiPL-E to the 128e level (macro −1.67pp) from v5-coder's −8.67pp gap — code generalization in non-Python languages tracks the LCB-targeting win, not just the in-distribution LCB benches. (264/300 passed; micro = macro = 88.0%.)
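The macro and micro means coincide here because every language has the same number of problems. A short Python sketch of both, using the v6-coder pass counts from the table above (function names are illustrative):

```python
# (language, passed, total) for v6-coder Q6_K on MultiPL-E-100.
RESULTS = [("Rust", 82, 100), ("Java", 89, 100), ("JavaScript", 93, 100)]


def macro_mean(results):
    """Average of per-language pass rates (each language weighted equally)."""
    return sum(100 * passed / total for _, passed, total in results) / len(results)


def micro_mean(results):
    """Pooled pass rate over all problems (each problem weighted equally)."""
    passed_all = sum(passed for _, passed, _ in results)
    total_all = sum(total for _, _, total in results)
    return 100 * passed_all / total_all


print(f"macro = {macro_mean(RESULTS):.2f}%, micro = {micro_mean(RESULTS):.2f}%")
```

With unequal per-language problem counts the two would diverge, and the macro mean would be the one that weights each language equally.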
Coder-field comparison — v6-coder vs Qwen2.5-Coder-14B / 7B + Qwen3.5-9B (Q6_K, llama.cpp, greedy)
The 9 canonical benches + MultiPL-E-100, all on the identical llama.cpp Q6_K / greedy
recipe (reasoning models served with --reasoning-format deepseek --reasoning-budget 12288 --parallel 2). Architectures differ — this is a same-harness comparison, not a same-class one:
- v6-coder — Gemma-4 26B-A4B MoE pruned to 98 experts (~20.8B total, ~A4B active), reasoning.
- Qwen2.5-Coder-14B / 7B-Instruct — dense, non-reasoning code specialists (bartowski Q6_K).
- Qwen3.5-9B — dense reasoning model (bartowski Q6_K).
| Bench (n) | v6-coder Q6_K | Qwen2.5-Coder-14B | Qwen2.5-Coder-7B | Qwen3.5-9B |
|---|---|---|---|---|
| ARC-Challenge-chat (1172) | 95.82% | 90.53% | 85.58% | 96.76% |
| GPQA Diamond flex (198) | 67.17% | 34.85% | 26.26% | 73.74% |
| GSM8K-100 flex | 91.00% | 89.00% | 80.00% | 79.00% |
| MATH-500-100 math_verify | 91.00% | 62.00% | 66.00% | 59.00% |
| AIME 2024 (30) | 63.33% | 10.00% | 10.00% | 56.67% |
| IFEval-100 (prompt_strict) | 92.00% | 68.00% | 54.00% | 93.00% |
| HumanEval-164 chat | 98.78% | 90.85% | 87.20% | 89.02% |
| HumanEval+-164 chat | 93.29% | 84.76% † | 83.54% | 80.49% |
| LCB-medium-55 v4 | 96.36% | 18.18% † | 12.73% | 58.18% |
| MultiPL-E-100 (macro) | 88.00% | 84.67% | 80.67% | 80.33% |
† Qwen2.5-Coder-14B HumanEval+ / LCB-medium-55 reused from the same-stack GGUF HE+ sweep
(not re-run in this chain). All comparison cells filled 2026-05-28 (9B Fix-A re-eval + 14B Q6_K reference).
Legend: "running" marks an eval in flight; the table is updated as results land.
Note on Qwen3.5-9B. Qwen3.5-9B is a verbose, slow thinking model: it emits long <think> reasoning chains (often ≥1900 tokens even on a trivial GSM8K question), so it runs several× slower per question than the non-reasoning Qwen2.5-Coder models, well beyond what its 9B size would suggest. Its GSM8K / MATH-500 / GPQA cells are being re-run after a harness fix: under batched, reasoning-parsed serving the verbose thinking intermittently left the final answer inside the reasoning block (mis-scored as empty content). ARC (96.76%) and AIME (56.67%) were unaffected.
Answer-length analysis (anti-rumination)
v6-coder thinks longer on average than 128e on the LCB code benches: it spends the full bounded thinking_token_budget=12288 where 128e often answers in a few hundred tokens. The question this section answers is whether that extra length is rumination (long thinking that fails to reach an answer, the failure mode the LCB targeting was built to fix) or productive reasoning that lands on a PASS. The tables below compare per-problem completion length against 128e and v5-coder on the same problems, from omk_eval token_stats (characters from the raw completion text; tokens via the 128e tokenizer), on the real-n benches. 128e and v5-coder are the matching Q6_K runs. The short answer: by the rumination-as-wasted-thinking definition, v6-coder ruminates less than 128e, not more. 128e's longest LCB outputs (up to 51k chars) are failures, while v6-coder's long outputs overwhelmingly pass.
Per-problem completion length — characters (p50 / p90 / max):
| Bench (n) | 128e | v5-coder | v6-coder |
|---|---|---|---|
| GPQA Diamond (198) | 2558 / 17661 / 26572 | 2602 / 16768 / 26518 | 2584 / 17162 / 25218 |
| AIME 2024 (30) | 2091 / 6192 / 7198 | 2405 / 9239 / 10974 | 2190 / 8404 / 9689 |
| LCB-medium-55 | 4369 / 14894 / 51568 | 31222 / 38456 / 44868 | 30685 / 36487 / 41301 |
| LCB-medium-100 | 1899 / 14894 / 51568 | 30035 / 36846 / 44868 | 29514 / 36221 / 43845 |
| MultiPL-E-100 (300) | 244 / 592 / 2944 | 257 / 764 / 3083 | 246 / 594 / 2169 |
| MATH-500 (100) | 1054 / 1970 / 7520 | 1145 / 1925 / 7189 | 1087 / 1961 / 8420 |
| GSM8K (100) | 254 / 699 / 3438 | 264 / 698 / 2386 | 272 / 682 / 1332 |
| IFEval (100) | 781 / 3595 / 7425 | 862 / 4150 / 17539 | 850 / 3803 / 7200 |
| HumanEval (164) | 704 / 1303 / 8578 | 696 / 1408 / 3923 | 721 / 1296 / 4033 |
| HumanEval+ (164) | 684 / 1238 / 5704 | 709 / 1419 / 3578 | 715 / 1451 / 4498 |
| ARC-Challenge (1172) | 1203 / 1662 / 6407 | 1201 / 1655 / 43570 | 1217 / 1663 / 43927 |
Per-problem completion length — tokens (p50 / p90 / max):
| Bench (n) | 128e | v5-coder | v6-coder |
|---|---|---|---|
| GPQA Diamond (198) | 843 / 8189 / 8189 | 855 / 8189 / 8189 | 876 / 8189 / 8189 |
| AIME 2024 (30) | 960 / 3993 / 4021 | 1156 / 4012 / 4022 | 939 / 4011 / 4021 |
| LCB-medium-55 | 1213 / 4644 / 15585 | 12804 / 13295 / 15724 | 12799 / 13184 / 15886 |
| LCB-medium-100 | 565 / 4644 / 15947 | 12726 / 13156 / 15945 | 12735 / 13103 / 15724 |
| MultiPL-E-100 (300) | 85 / 184 / 1012 | 89 / 247 / 1017 | 86 / 184 / 1013 |
| MATH-500 (100) | 423 / 925 / 3035 | 445 / 802 / 3319 | 404 / 872 / 3318 |
| GSM8K (100) | 123 / 253 / 1107 | 114 / 221 / 858 | 120 / 254 / 396 |
| IFEval (100) | 202 / 826 / 1427 | 217 / 912 / 4058 | 234 / 850 / 4060 |
| HumanEval (164) | 226 / 431 / 2890 | 226 / 413 / 1363 | 226 / 443 / 1305 |
| HumanEval+ (164) | 223 / 430 / 1744 | 229 / 420 / 1209 | 226 / 495 / 1251 |
| ARC-Challenge (1172) | 255 / 359 / 1453 | 258 / 360 / 16266 | 262 / 361 / 16266 |
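For readers who want to recompute the p50 / p90 / max triples from their own per-problem lengths, the following is a minimal sketch. It assumes the nearest-rank percentile definition; the card does not state which percentile method omk_eval token_stats uses, so results may differ slightly from the tables. The function name `length_summary` and the example values are illustrative only.

```python
def length_summary(lengths):
    """Return (p50, p90, max) for per-problem completion lengths.

    Assumption: nearest-rank percentiles. The interpolation used by the
    original token_stats tooling is not documented in the card.
    """
    if not lengths:
        raise ValueError("need at least one completion length")
    ordered = sorted(lengths)
    n = len(ordered)

    def nearest_rank(p):
        # Integer ceiling of p% of n, clamped so the rank is at least 1.
        rank = max(1, (p * n + 99) // 100)
        return ordered[rank - 1]

    return nearest_rank(50), nearest_rank(90), ordered[-1]


# Made-up token counts for illustration (not data from the tables).
example_token_counts = [120, 85, 254, 396, 101, 97, 143, 88, 110, 230]
p50, p90, longest = length_summary(example_token_counts)
print(f"{p50} / {p90} / {longest}")  # 110 / 254 / 396
```

The same function applies to character counts; only the input list changes.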
Budget-saturation incidence — share of problems whose completion reached ≥12k tokens
(at/near the thinking_token_budget=12288 cap). Saturation by itself is not rumination:
a saturated output that reaches a correct PASS is productive use of the budget. The pruned
variants saturate on nearly everything; 128e almost never does — but see the next table.
| Bench (n) | 128e | v5-coder | v6-coder |
|---|---|---|---|
| LCB-medium-55 | 1 / 55 (1.8%) | 54 / 55 (98.2%) | 53 / 55 (96.4%) |
| LCB-medium-100 | 2 / 100 (2.0%) | 99 / 100 (99.0%) | 97 / 100 (97.0%) |
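The incidence figures above are a simple threshold count. A minimal sketch, assuming a list of per-problem completion token counts and the ≥12k-token threshold stated in the text (the function name and sample values are hypothetical):

```python
SATURATION_THRESHOLD = 12_000  # tokens; the ">=12k" cut-off quoted in the text


def saturation_incidence(completion_tokens, threshold=SATURATION_THRESHOLD):
    """Return (saturated, total, percent) for one bench."""
    total = len(completion_tokens)
    if total == 0:
        raise ValueError("no completions to count")
    saturated = sum(1 for tokens in completion_tokens if tokens >= threshold)
    return saturated, total, 100.0 * saturated / total


# Made-up token counts: two of the four reach the threshold.
print(saturation_incidence([500, 12_000, 12_799, 11_999]))  # (2, 4, 50.0)

# Sanity check on a table cell: 53 of 55 saturated outputs is 96.4%.
print(round(100 * 53 / 55, 1))  # 96.4
```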
Rumination — long thinking that fails to PASS. The right metric is not median length (128e looks short only because it answers easy problems fast). It is the share of the model's budget-saturated outputs that still fail — i.e. tokens burned without a correct answer:
| Bench (n) | 128e | v5-coder | v6-coder |
|---|---|---|---|
| LCB-medium-55 — saturated-and-failed | 1 / 1 (100%) | 8 / 54 (15%) | 2 / 53 (3.8%) |
| LCB-medium-100 — saturated-and-failed | 2 / 2 (100%) | 9 / 99 (9%) | 4 / 97 (4.1%) |
| LCB-100 — mean completion tokens, PASS vs FAIL | 1468 vs 8368 | 12687 vs 14241 | 12616 vs 14302 |
Every 128e output that reaches the budget cap is a failure (100%), and 128e's failed problems run 5.7× longer than its passed ones (mean 8368 vs 1468 tok): 128e only thinks long when it is lost. v6-coder thinks long on ~97% of problems but 96% of those long outputs PASS — its long thinking is overwhelmingly productive, not rumination.
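The rumination metric used here, the share of budget-saturated outputs that still fail, can be sketched as follows. The record layout (`tokens`, `passed`) is a hypothetical simplification of the per-problem samples, and the sample records are made up; only the two arithmetic checks at the end use figures quoted in this section.

```python
from statistics import mean


def wasted_thinking_rate(records, threshold=12_000):
    """Share of budget-saturated outputs that did not PASS (None if none saturate)."""
    saturated = [r for r in records if r["tokens"] >= threshold]
    if not saturated:
        return None
    failed = sum(1 for r in saturated if not r["passed"])
    return failed / len(saturated)


def pass_fail_mean_tokens(records):
    """Mean completion tokens for passed and for failed problems."""
    passed = [r["tokens"] for r in records if r["passed"]]
    failed = [r["tokens"] for r in records if not r["passed"]]
    return (mean(passed) if passed else None, mean(failed) if failed else None)


# Made-up records: three saturated outputs, one of which fails.
records = [
    {"tokens": 12_700, "passed": True},
    {"tokens": 12_650, "passed": True},
    {"tokens": 14_300, "passed": False},
    {"tokens": 900, "passed": True},
]
print(wasted_thinking_rate(records))
print(pass_fail_mean_tokens(records))

# Arithmetic behind two figures quoted above.
print(round(8368 / 1468, 1))   # 5.7  (128e FAIL vs PASS mean tokens)
print(round(100 * 2 / 53, 1))  # 3.8  (v6-coder saturated-and-failed, LCB-55)
```

Conditioning on saturation is what separates this metric from median length: a model that answers easy problems quickly can have a short median and still waste every long output.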
Key findings:
- LCB: v6-coder ruminates less than 128e under the only definition that matters (long thinking that fails to reach a PASS). The naive read of the median says the opposite (128e's median LCB completion is ~0.6–1.2k tokens vs v6-coder's budget-capped ~12.7k), but that short median is an artifact of mixing easy and hard problems. 128e only thinks long when it is lost, and then it fails: every 128e output that saturates the budget is a failure (2/2 on LCB-100, 1/1 on LCB-55), and the single longest output by character count across all three models, 51,568 chars / 15,585 tokens on lcb/leetcode/3659, is a 128e failure, one that both v6-coder (PASS, 12,691 tok / 32,676 chars) and v5-coder solve in about two-thirds the length. v6-coder, by contrast, saturates the budget on ~97% of problems but 96% of those long outputs PASS; its longest LCB-55 output is a correct answer and its max length (41–43k chars) sits below 128e's 51k. Wasted long-thinking rate: 128e 100%, v6-coder ~4%. And v6-coder beats 128e on score on both axes (96.36 / 96.00 vs 94.55 / 95.00): more correct and a lower worst-case tail.
- 128e's routing is erratic in both directions, and the targeted map repairs both. 128e fails LCB two ways: by over-thinking (3659: 51k chars, no answer) and by under-thinking (lcb/leetcode/3000: a premature wrong answer in 579 tokens; 3793: 3101 tok). On the very same problems v6-coder spends the full ~13k-token budget and passes. This bidirectional instability is the fingerprint of an imperfect base router; the 51k outliers and the lower 128e score are the symptom. v6-coder's LCB-targeted expert map corrects 3 of the 5 problems 128e fails on LCB-100 (2 of 3 on LCB-55). The thinking_token_budget=12288 forcing-function only bounds the loop; the targeted map is what makes the bounded thinking land on a PASS instead of a 51k-char dead end. (Versus v5-coder, v6-coder is essentially the same length, median 12799 vs 12804 tok on LCB-55, yet passes far more: 53/55 vs 47/55, 96/100 vs 91/100.)
- AIME: genuinely less rumination, +10pp. Median completion 939 tok vs v5-coder's 1156, p90 char tail 8404 < 9239, scoring 63.33% vs 53.33%: shorter chains and more correct.
- GSM8K: tightest of the three (max 396 tok; 128e 1107, v5-coder 858); no rumination tail.
- IFEval: no ruminating outlier (max 7200 chars vs v5-coder's 17539; ≈128e 7425).
- GPQA: a base-model ceiling, not a prune artifact. p90 completion tokens peg at ~8189 on all three models; the hard-question rumination is inherited from Gemma 4 and is identical across the cohort, not widened by pruning.
- ARC: one shared single-sample outlier (~16266 tok / ~44k chars) on both pruned variants, absent on 128e; p50/p90 normal and identical, so not systemic length growth.
- MultiPL-E: most compact code of the three. On the bench where the answer is the code (extracted final block, no reasoning trace), v6-coder has the tightest tail (max 2169 chars vs 128e 2944 / v5-coder 3083, p90 594 chars / 184 tok) at ≈128e pass-rate (macro 88.0%).
- Net: v6-coder is more accurate than 128e on both LCB axes (96.36 / 96.00) and holds ≈128e pass-rate on MultiPL-E (macro 88.0), at a higher average inference cost (it thinks the full bounded budget) but a lower worst case (max 41–43k chars vs 128e's 51k) and a far lower wasted-thinking rate (~4% vs 100%). On the short-reasoning benches (AIME / GSM8K / IFEval) it additionally reduces length vs v5-coder. The targeted map did not trade length for accuracy; it traded 128e's erratic, sometimes-catastrophic tail for a higher but bounded and productive average.
Methodology / data integrity. Per-problem lengths come from omk_eval token_stats over each bench's per-problem samples archive: lm-eval samples_*.jsonl for the 9-bench, and the native runners' lcb_result.samples.jsonl / mpe_result.samples.jsonl (mirrored into the durable sqlite resume cache) for LCB/MPE. MultiPL-E is tabulated, but its rows measure extracted-code length, not reasoning: its samples store only the final code block (no <think> trace, finish_reasons all terminal, thinking_tokens_est=0), so MPE is a code-conciseness reference, not a rumination metric, on which v6-coder is the most compact. The reasoning-bearing benches (LCB / GPQA / AIME / MATH) carry the rumination signal; MPE tokens there are tokenizer-estimated (the code-extraction path returns no server token count). thinking_tokens_est is 0 across all benches under llama.cpp --reasoning-format deepseek (reasoning is emitted inline, not as a separable block), so the completion-token columns are the length signal. The LCB/MPE token_stats are computed over per-problem rows keyed on doc_id (the native runners now emit it, fixed 2026-05-25). Earlier card revisions briefly collapsed these to n=1; all numbers here are at full n (LCB-55 n=55, LCB-100 n=100, MultiPL-E n=300).
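The doc_id keying matters because rows without a stable per-problem key cannot be told apart and collapse into a single entry. The sketch below shows the idea on JSONL input. It is not the omk_eval implementation, and the `completion` field name is a hypothetical stand-in for wherever a given runner stores the raw completion text.

```python
import io
import json
from collections import defaultdict


def load_rows_by_doc_id(lines):
    """Group JSONL sample rows by doc_id, refusing rows that lack the key."""
    rows = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "doc_id" not in record:
            # Without a per-problem key, all rows would collapse together.
            raise KeyError("sample row has no doc_id")
        rows[record["doc_id"]].append(record)
    return dict(rows)


def char_lengths(rows_by_doc, text_field="completion"):
    """One character length per problem, taken from its first sample.

    `text_field` is an assumed field name, not a documented schema.
    """
    return {doc_id: len(samples[0][text_field])
            for doc_id, samples in rows_by_doc.items()}


# In-memory stand-in for a samples file (made-up rows).
sample_file = io.StringIO(
    '{"doc_id": 0, "completion": "abc"}\n'
    '{"doc_id": 1, "completion": "hello"}\n'
)
rows = load_rows_by_doc_id(sample_file)
print(len(rows), char_lengths(rows))
```

Checking that the number of distinct doc_id values equals the expected n (55, 100, or 300 here) is a cheap guard against the n=1 collapse mentioned above.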
Reproduce
# 1. drop map (omnimergekit)
python scripts/generate_drop_map_multiclass.py \
--data scripts/expert_neuron_v5_code_v3.json --target 98 \
--protect-top 16 --alpha 2.0 --strategy max --normalize rank \
--breadth-bonus 0.5 --v4-floor-map scripts/v4_layer_floor_map.json \
--baseline-drop-map scripts/teacher_force_98e_p16_clean.json \
--outlier-mode median --outlier-wnorm-thresh 1e4 \
--class-weights 1 1 3 1 1 0 0 2 \
--output scripts/v6coder_C6v3lcb_drop_map.json
# 2. prune + mandatory shared upweight (see scripts/build_v6coder_C6v3lcb.sh)
python scripts/expert_drop.py --source-dir google/gemma-4-26B-A4B-it \
--drop-map scripts/v6coder_C6v3lcb_drop_map.json --suffix=-v6-coder-C6v3lcb-it
python scripts/router_shared_upweight.py \
--model-dir google/gemma-4-A4B-98e-v6-coder-C6v3lcb-it \
--alpha 1.2 --target mlp.down_proj.weight
Lineage & provenance
- Base: google/gemma-4-26B-A4B-it (128e, fresh prune, not derived from v4/v5 weights).
- Cohort: v5-coder (C6_fixed) → C6v3 (C6 v3-data) → C6v3lcb (this, LCB-targeted).
- Tooling: omnimergekit (expert_drop.py, generate_drop_map_multiclass.py, router_shared_upweight.py); eval via omk_eval.py + EVAL_PROTOCOL.
- Eval sampler is frozen GREEDY across the whole cohort for apples-to-apples comparison.
License: Gemma Terms of Use. This is a derivative of Gemma 4 and inherits those terms.