gemma-4-A4B-98e-v6-coder-it-GGUF

GGUF quantizations of ManniX-ITA/gemma-4-A4B-98e-v6-coder-it — the LCB-targeted code prune of Gemma 4 26B-A4B (98 experts/layer, ~20.8B). All quants use imatrix with calibration data v5. The recipe + full Q6_K 9-bench scoreboard are in the embedded model card below.

Available Quantizations

Per-tier HumanEval+ (164q, chat-mode, pass@1, greedy) from the llama.cpp sweep (omk_eval humanevalplus_full, canonical summary.json .score), 2-GPU shard on pod 37588132. Q6_K is the 9-bench reference run; F16 is the unquantized baseline. CD-* tiers are ContribDynamic per-layer quants (expert-contribution-aware mixed precision). Sizes are decimal GB (÷10⁹), bpw computed against the cohort 20.8B param count to match the v5-coder card.
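As a quick check on the bpw column, the sketch below recomputes bits-per-weight from the decimal file size, assuming the 20.8B cohort parameter count stated above. The function name is illustrative and not part of any tool named on this card.

```python
COHORT_PARAMS = 20.8e9  # cohort parameter count the card uses for every tier

def bits_per_weight(size_gb: float, params: float = COHORT_PARAMS) -> float:
    """Convert a decimal-GB file size (bytes / 1e9) to average bits per weight."""
    return size_gb * 1e9 * 8 / params

# Sizes copied from the quantization table below.
for tier, size_gb in [("Q8_0", 21.16), ("IQ4_XS", 11.01), ("Q3_K_M", 10.51)]:
    print(f"{tier}: {bits_per_weight(size_gb):.2f} bpw")
```

Because the table sizes are already rounded to two decimals, a recomputed value can differ from the printed bpw in the last digit.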

imatrix policy. The plain K-quant ladder was rebuilt with imatrix (calibration data v5) on 2026-05-25, which lifted the low-bit K-quants the most (e.g. Q3_K_M 85.98 → 92.68, Q3_K_S 82.93 → 87.80, Q3_K_L/XL 88.41 → 92.07). Q4_K_M is the one exception — built without imatrix, because imatrix measurably lowered its HE+ (90.85 imatrix vs 92.07 plain). IQ-tiers always carry imatrix; Q8_0/Q4_0/Q4_1 never do.

Quant Size (GB) bpw HumanEval+ pass@1 (164)
F16 39.79 16.00 — (baseline)
Q8_0 21.16 8.14 93.29%
Q6_K_L 17.98 6.91 92.68%
Q6_K 17.81 6.84 93.29%
CD-Q6_K 15.54 5.97 92.07%
Q5_K_L 15.25 5.86 91.46%
Q5_K_M 15.07 5.79 92.68%
Q5_K_S 14.19 5.45 92.07%
Q4_K_L 13.42 5.16 92.68%
Q4_K_M 13.24 5.09 92.07% (plain — no imatrix)
CD-Q5_K_M 13.07 5.03 90.85%
Q4_1 12.61 4.85 91.46%
Q4_K_S 12.21 4.69 92.68%
IQ4_NL 11.42 4.39 91.46%
Q4_0 11.42 4.39 90.85%
IQ4_XS 11.01 4.23 92.07%
Q3_K_L 10.94 4.21 92.07%
Q3_K_XL 10.69 4.11 92.07%
CD-Q4_K_M 10.65 4.10 91.46%
Q3_K_M 10.51 4.04 92.68%
CD-IQ4_K_M 10.29 3.96 91.46%
IQ3_M 9.82 3.78 92.07%
Q3_K_S 9.68 3.72 87.80%
IQ3_XS 9.22 3.54 92.07%
IQ3_XXS 8.95 3.44 90.85%
IQ2_M 8.22 3.16 90.24%
IQ2_S 7.83 3.01 89.02%
IQ2_XS 7.77 2.99 70.73%

Recommended: IQ4_XS (11.01 GB / 92.07%) is the safe 4-bit default. The imatrix rebuild opened a 92.68% cluster around it: Q3_K_M (10.51 GB), Q4_K_S (12.21 GB), Q4_K_L (13.42 GB) and Q5_K_M (15.07 GB), so Q3_K_M is now the smallest tier in that cluster. The 0.61pp step between 92.07% and 92.68% is a single HE+ question (1/164), so treat the two levels as a tie and pick by disk. The smallest tier that still scores ≥92.07% is IQ3_XS (9.22 GB / 3.54 bpw), although a few larger tiers fall below that line (Q3_K_S 87.80%, Q4_0 90.85%). Max fidelity: Q6_K / Q8_0 (93.29%). Avoid IQ2_XS (70.73%, the 2-bit cliff); use IQ2_S (89.02% at 7.83 GB) instead.
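The "treat it as a tie and pick by disk" rule can be sketched as a small lookup. The tier list is a subset copied from the table above; the helper name is hypothetical and the threshold is whatever HE+ floor the reader chooses.

```python
HE_PLUS_QUESTIONS = 164
ONE_QUESTION_PP = 100 / HE_PLUS_QUESTIONS  # one question is about 0.61 percentage points

# (tier, size in GB, HE+ pass@1 in %), copied from the table above
TIERS = [
    ("Q5_K_M", 15.07, 92.68), ("Q4_K_L", 13.42, 92.68), ("Q4_K_S", 12.21, 92.68),
    ("IQ4_XS", 11.01, 92.07), ("Q3_K_M", 10.51, 92.68), ("IQ3_M", 9.82, 92.07),
    ("Q3_K_S", 9.68, 87.80), ("IQ3_XS", 9.22, 92.07), ("IQ3_XXS", 8.95, 90.85),
]

def smallest_tier_at_or_above(min_score: float):
    """Return the smallest-on-disk tier whose HE+ is >= min_score, or None."""
    candidates = [t for t in TIERS if t[2] >= min_score]
    return min(candidates, key=lambda t: t[1]) if candidates else None

print(round(ONE_QUESTION_PP, 2))
print(smallest_tier_at_or_above(92.68))
print(smallest_tier_at_or_above(92.07))
```

Choosing by a score floor and then by size keeps the decision stable against one-question noise, which is the point of the recommendation.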

How to Use

With llama.cpp (the canonical eval recipe):

llama-server -m gemma-4-A4B-98e-v6-coder-it-IQ4_XS.gguf \
    -c 32768 -ngl 99 --reasoning-format deepseek --reasoning-budget 12288

With ollama:

ollama pull mannix/gemma4-98e-v6-coder:Q4_K_M   # any tier tag; :latest = Q4_K_M
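
Once llama-server is running, it can be queried over its OpenAI-compatible chat endpoint. The sketch below is a minimal standard-library client, assuming the server listens on its default address (http://127.0.0.1:8080); the prompt and the max_tokens value are placeholders, not settings taken from this card.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # assumption: llama-server default host and port

def build_chat_request(prompt: str, max_tokens: int = 16384) -> dict:
    """Build a greedy (temperature 0) chat payload, mirroring the card's T=0 setting."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,  # placeholder; should leave room for the reasoning budget
    }

def ask(prompt: str) -> str:
    """POST the payload to /v1/chat/completions and return the final answer text."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Example call (needs a running llama-server):
# print(ask("Write a Python function that reverses a string."))
```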

Comparison vs the 14–22B coder field

For a sense of scale on the Q6_K headline (HE 98.78 / HE+ 93.29 / LCB-medium-55 96.36), the tables below place v6-coder next to published peers. Peer numbers are the official model-card / paper / blog figures (linked); cross-lab numbers are noisy by ±2–5pp (different prompts, sampling, chat-vs-base framings). v6-coder is run greedy (T=0).

Coder-specialized 14–22B:

Model Params HE HE+ LCB (version) Source
98e v6-coder (this) 20.8B / 4B MoE 98.78 93.29 96.36 (LCB-medium-55 v4) this card
98e v5-coder 20.8B / 4B MoE 99.39 93.29 85.45 (LCB-medium-55 v4) v5-coder card
Qwen2.5-Coder-14B-Instruct 14.7B dense 89.6 87.2 23.4 (LCB 07/24–11/24, pre-v4) arXiv:2409.12186
DeepSeek-Coder-V2-Lite-Instruct 16B / 2.4B MoE 81.1 24.3 (LCB 12/01–06/01) arXiv:2406.11931
Codestral-22B v1 22B dense 81.1 (not published) Mistral blog
IBM Granite-20B-Code-Instruct 20B dense 60.4 arXiv:2405.04324

Generalist 14–22B (notable code/reasoning scores):

Model Params HE MATH GPQA-D IFEval Source
98e v6-coder (this) 20.8B / 4B MoE 98.78 91.00 (MATH-500) 67.17 92.00 this card
Phi-4 14B dense 82.6 80.4 (MATH) 56.1 63.0 arXiv:2412.08905
Qwen2.5-14B-Instruct 14.7B dense 81.7–86.2 73.0 (MATH) 40.9 80.0 Qwen blog
Mistral-Small-3 (24B, just above band) 24B dense ~84 70.6 (MATH) 45.3 82.1 Mistral blog

Where v6-coder sits:

  • HE / HE+: top of the band — 98.78 / 93.29, ~+9 / +6pp above Qwen2.5-Coder-14B's 89.6 / 87.2 (the published field leader). Essentially tied with v5-coder (−0.61 HE, identical HE+) while gaining +10.91pp on LCB.
  • LCB: v6-coder's 96.36 is LCB-medium-55 on v4 problems. The published Qwen2.5-Coder / DS-Coder-V2-Lite numbers above (23.4 / 24.3) are full LCB on pre-v4 windows — a different, non-comparable subset. The apples-to-apples number now exists: run on the same LCB-medium-55 v4 split and the same llama.cpp stack, Qwen2.5-Coder-14B scores 18.18% vs v6-coder's 96.36%. Read it as a same-rig measurement, not a definitive ranking — the v4 split is only 55 problems, v6-coder was specifically LCB-v4-targeted (that is the entire recipe), and Qwen takes the same llama-server chat-template penalty seen in the same-stack HE+ sweep below. Don't read the raw +78pp as general coding superiority.
  • MATH-500 91.00 / GSM8K 91.00 / AIME 63.33: top of the band for math-on-text reasoning. Phi-4's 80.4 MATH is the closest generalist; v6-coder beats it by ~11pp. AIME 63.33 is far above any published 14–22B coder AIME.
  • GPQA-Diamond 67.17 / IFEval 92.00: GPQA materially above Phi-4 (56.1) and Qwen2.5-14B (40.9). IFEval 92 beats Phi-4 (63.0) and Qwen2.5-14B (80.0).

v6-coder Q4_K_M vs Qwen3.6-35B-A3B-UD-IQ3_XXS — iso-disk 10-bench head-to-head

A whole-suite comparison at matched disk footprint: v6-coder Q4_K_M (13.24 GB / 5.09 bpw; 20.8B-total / 4B-active MoE) vs the Unsloth-Dynamic Qwen3.6-35B-A3B-UD-IQ3_XXS (14.07 GB / 3.44 bpw; 35B-total / 3B-active MoE). All 10 canonical templates, greedy T=0, through the omk_eval llama.cpp path; each model on its own validated build (v6-coder d6be315 with the 128e tokenizer; Qwen the af6528e #23643 MTP build). This is an iso-disk, cross-architecture read — a code-pruned 20.8B MoE vs a 35B/3B-active generalist at ~the same GB — not iso-bpw or iso-params. v6-coder Q4_K_M is the single plain, no-imatrix tier, i.e. its weakest-calibrated 4-bit, not a tuned imatrix build.

Bench v6-coder Q4_K_M Qwen3.6-35B-A3B IQ3_XXS Δ (v6 − Qwen)
HumanEval 96.95 95.12 +1.83
HumanEval+ 93.29 90.24 +3.05
MultiPL-E-100 (rs/java/js) 88.33 70.33 +18.00
LCB-medium-55 (v4) 83.64 78.18 +5.46
GSM8K-100 92.00 98.00 ‡ −6.00
MATH-500-100 89.00 98.00 ‡ −9.00
AIME-30 50.00 63.33 −13.33
GPQA-Diamond (198) 65.66 77.27 −11.61
IFEval-100 92.00 93.00 −1.00
ARC-Challenge 96.08 96.84 −0.76

‡ Qwen's gsm8k / math500 are the Qwen-tuned-template scores. On the vanilla canonical templates Qwen posts 0% / 4% because Qwen3.x verbalizes the few-shot delimiter ("Question:" / "Problem:") inside its CoT and trips the lm-eval stop → empty content. Re-running with the redundant stop dropped (same few-shot, greedy, thinking budget; chat turn closes on <|im_end|>) recovers the true 98% / 98%. Gemma-4 CoT never emits the delimiter, so v6-coder's canonical numbers are unaffected. v6-coder GPQA here is 65.66 (Q4_K_M, plain) vs the 67.17 Q6_K headline elsewhere on this card — different quant tier, as expected.
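
The stop-string failure described in the footnote can be illustrated with a toy example. This is a sketch of the mechanism only: the helper names are hypothetical, the sample generation is invented for illustration, and the real harness and server logic are more involved.

```python
def truncate_at_stop(text: str, stops: list[str]) -> str:
    """Cut the generation at the earliest stop string, as a stop sequence would."""
    hits = [i for i in (text.find(s) for s in stops) if i != -1]
    return text[: min(hits)] if hits else text

def visible_content(text: str, close_tag: str = "</think>") -> str:
    """Return the text after the reasoning block; empty if the block never closed."""
    _, separator, tail = text.partition(close_tag)
    return tail.strip() if separator else ""

# Invented sample: the model repeats the few-shot delimiter inside its reasoning.
raw = "<think>The Question: asks for 2 + 3, which is 5.</think>The answer is 5."

with_stop = visible_content(truncate_at_stop(raw, ["Question:"]))
without_stop = visible_content(truncate_at_stop(raw, []))
print(repr(with_stop))
print(repr(without_stop))
```

With the delimiter as a stop string the generation is cut inside the reasoning block, so nothing is left to score; dropping the redundant stop lets the answer through.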

Read:

  • Code (the v6-coder remit) — v6 wins all four. MultiPL-E +18.00pp (88.33 vs 70.33), LCB-medium-55 +5.46, HE+ +3.05, HE +1.83. The smaller 20.8B code-pruned MoE beats the 35B generalist on every execution-scored code bench, at lower disk and on its weakest (no-imatrix) 4-bit tier.
  • Math / science — Qwen wins. GPQA-D +11.61, AIME +13.33, gsm8k / math500 +6 / +9 (on the corrected templates). A full-breadth 35B generalist out-reasons the code-specialized prune on math-on-text and graduate science — expected, and the gap is the cost of the specialization.
  • General — a wash. IFEval (92 vs 93) and ARC-Challenge (96.08 vs 96.84) are within a point.

Bottom line: at the same ~13–14 GB on disk, choose v6-coder Q4_K_M for code generation (it leads HE / HE+ / MultiPL-E / LCB) and Qwen3.6-35B-A3B for math / science reasoning. The two are complementary at this footprint, not strictly ranked — which is the point of the iso-disk framing.


Same-Stack GGUF HE+ Sweep — v6-coder vs Qwen2.5-Coder-14B-Instruct

Head-to-head HumanEval+ (164q, chat-aware shadow task) on identical hardware (single RTX 3090 24 GB) and identical eval recipe (llama-server -c 32768 -ngl 99 --parallel 2, omk_eval llama backend, lm-eval humanevalplus_full, greedy T=0). Qwen GGUFs are bartowski's Qwen2.5-Coder-14B-Instruct-GGUF. The "Comparison" tables above use paper-reported numbers; this section is what the same rig and same scorer actually measure. v6-coder per-tier HE+ is the Available Quantizations table above (complete).

Qwen2.5-Coder-14B-Instruct (14.7B dense) — bartowski quants

Tier File size bpw HE+ pass@1
IQ4_XS 8.12 GB 4.42 84.76%
Q4_0 8.54 GB 4.65 84.15%
Q4_K_M 8.99 GB 4.89 85.37%
Q5_K_M 10.51 GB 5.72 83.54%
Q6_K 12.12 GB 6.60 84.76%
Q8_0 15.70 GB 8.54 84.76%

Qwen sits at 83–85% across the whole tier ladder. The paper-reported 87.2 HE+ is ~2pp above what bartowski's GGUFs deliver on this stack — a known llama-server chat-template vs vLLM-temp=0 quirk, not a quant defect.

Head-to-head by file size (v6-coder runs lower bpw at the same disk)

Pairing by tier name is misleading: v6-coder is a 20.8B-total MoE and Qwen is a 14.7B dense model, so the same tier name maps to different file sizes. The fair comparison is iso-disk: at a given GB budget, which model wins HE+? In every band where both models have a tier, v6-coder runs about 1.3–2.8 bpw lower than Qwen and still scores higher (all tiers landed).

| Disk band | Qwen2.5-Coder-14B (size / bpw / HE+) | v6-coder best (size / bpw / HE+) | Δ HE+ |
|---|---|---|---|
| ~21 GB | (none; Qwen ceiling is Q8_0 15.70 GB) | Q8_0 21.16 / 8.14 / 93.29% | |
| ~18 GB | (none) | Q6_K 17.81 / 6.84 / 93.29% | new top |
| ~15.7 GB | Q8_0 15.70 / 8.54 / 84.76% | Q5_K_M 15.07 / 5.79 / 92.68% | +7.92 |
| ~13 GB | Q6_K 12.12 / 6.60 / 84.76% | Q4_K_L 13.42 / 5.16 / 92.68% | +7.92 |
| ~12 GB | Q6_K 12.12 / 6.60 / 84.76% | Q4_K_S 12.21 / 4.69 / 92.68% | +7.92 |
| ~11 GB | Q5_K_M 10.51 / 5.72 / 83.54% | IQ4_XS 11.01 / 4.23 / 92.07% | +8.53 |
| ~10.5 GB (iso-disk) | Q5_K_M 10.51 / 5.72 / 83.54% | Q3_K_M 10.51 / 4.04 / 92.68% | +9.14 |
| ~10 GB | Q5_K_M 10.51 / 5.72 / 83.54% | IQ3_M 9.82 / 3.78 / 92.07% | +8.53 |
| ~9.2 GB | Q4_K_M 8.99 / 4.89 / 85.37% | IQ3_XS 9.22 / 3.54 / 92.07% | +6.70 |
| ~9 GB | Q4_K_M 8.99 / 4.89 / 85.37% | IQ3_XXS 8.95 / 3.44 / 90.85% | +5.48 |
| ~8 GB | IQ4_XS 8.12 / 4.42 / 84.76% | IQ2_M 8.22 / 3.16 / 90.24% | +5.48 |
| ~7.8 GB | IQ4_XS 8.12 / 4.42 / 84.76% | IQ2_S 7.83 / 3.01 / 89.02% | +4.26 |
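
One way to reproduce the pairing in the table above is to match each v6-coder tier to the Qwen tier closest in file size, then take the HE+ and bpw differences. This is a sketch under that nearest-size assumption; the card does not state how the bands were chosen, and only a subset of v6-coder tiers is listed here.

```python
# (tier, size GB, bpw, HE+ %), copied from the tables on this card
QWEN = [("IQ4_XS", 8.12, 4.42, 84.76), ("Q4_0", 8.54, 4.65, 84.15),
        ("Q4_K_M", 8.99, 4.89, 85.37), ("Q5_K_M", 10.51, 5.72, 83.54),
        ("Q6_K", 12.12, 6.60, 84.76), ("Q8_0", 15.70, 8.54, 84.76)]
V6 = [("Q5_K_M", 15.07, 5.79, 92.68), ("IQ4_XS", 11.01, 4.23, 92.07),
      ("Q3_K_M", 10.51, 4.04, 92.68), ("IQ3_XS", 9.22, 3.54, 92.07),
      ("IQ2_S", 7.83, 3.01, 89.02)]

def pair_by_disk(v6_tier):
    """Match a v6-coder tier to the Qwen tier nearest in file size.

    Returns (qwen_tier_name, HE+ delta in pp, bpw delta), both as v6 minus Qwen.
    """
    _, size, bpw, score = v6_tier
    q_name, _, q_bpw, q_score = min(QWEN, key=lambda q: abs(q[1] - size))
    return q_name, round(score - q_score, 2), round(bpw - q_bpw, 2)

for tier in V6:
    print(tier[0], "vs Qwen", *pair_by_disk(tier))
```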

Reads:

  1. A 92.68% cluster over a wide 92.07% plateau. The imatrix rebuild lifted the K-quants into a 92.68% top cluster (Q3_K_M 10.51 GB, Q4_K_S 12.21 GB, Q4_K_L 13.42 GB, Q5_K_M 15.07 GB) sitting just above a broad 92.07% plateau (IQ3_XS 9.22 GB up through IQ4_XS / Q4_K_M / Q5_K_S / IQ3_M / Q3_K_L / Q3_K_XL). The 0.61pp steps are within HE+'s 1-question granularity, so pick by disk: Q3_K_M (10.51 GB) is the smallest tier in the top cluster, IQ3_XS (9.22 GB) the smallest on the plateau, IQ4_XS (11.01 GB) the safe 4-bit default.
  2. vs Qwen, every band wins by +4 to +9pp at about 1.3–2.8 bpw less. The biggest gap is at the iso-disk ~10.5 GB point, where v6 Q3_K_M (10.51 GB / 4.04 bpw / 92.68%) meets Qwen Q5_K_M (10.51 GB / 5.72 bpw / 83.54%): +9.14pp at the exact same file size, −1.68 bpw.
  3. Sub-8 GB code-grade. IQ2_S (7.83 GB / 3.01 bpw / 89.02%) beats Qwen's IQ4_XS (8.12 GB / 4.42 bpw / 84.76%) by +4.26pp at smaller disk and lower bpw. IQ2_XS (7.77 GB) is the 2-bit cliff at 70.73% — use IQ2_S.
  4. Sweep complete; imatrix lifted the low-bit K-quants most. Q8_0 / Q6_K top the cohort at 93.29%; the imatrix rebuild moved the 3-bit K-quants up sharply (Q3_K_M 85.98 → 92.68, Q3_K_S 82.93 → 87.80, Q3_K_L/XL 88.41 → 92.07), so a 3-bit tier (Q3_K_M, 10.51 GB) now leads the value frontier. The 4 CD-* contrib-dynamic tiers fill the 10–15.5 GB bands (CD-Q6_K 92.07, CD-Q4_K_M / CD-IQ4_K_M 91.46, CD-Q5_K_M 90.85). Below 9 GB the plain IQ ladder leads (IQ3_XS 9.22 GB / 92.07%, IQ2_S 7.83 GB / 89.02%).

CD recipes are open-source — generator at omnimergekit/scripts/generate_cd_maps.py.


Original Model Card

Gemma 4 A4B 98-Expert v6-coder (C6v3lcb) — LCB-targeted code prune (~20.8B)

Eval complete (Q6_K / llama.cpp). The full canonical 9-bench suite plus the extended LCB-medium-100 and MultiPL-E-100 code benches are filled below, every cell read from summary.json (greedy, cohort-pinned recipe). The 128e and 98e-v5-coder anchor columns are the matching Q6_K reference runs. NVFP4A16 is a deployment format and is not separately benchmarked (cohort policy).

Headline — the LCB targeting worked. LCB-medium-55 96.36% (+10.91pp vs v5-coder, and +1.81pp past the unpruned 128e), closing the −9.10pp hole that motivated the recipe; MultiPL-E macro 88.0% (+7 vs v5-coder, ≈128e); AIME recovers +10pp (53.33 → 63.33). The budget is paid on the non-code axes the LCB-only class weights deprioritized (MATH −3, IFEval −2 vs v5-coder).

A research checkpoint that prunes the unpruned Gemma 4 26B-A4B-it (128 experts, top-8 + shared, 30 layers) down to 98 experts per layer using a drop map that is the most code-faithful member of the v6-coder family: v5-coder's gentle C6 layer-relevance-weighted v4-floor recipe, re-derived on the corrected v3 code-pass calibration data, then steered specifically at LiveCodeBench-medium — the one code bench where expert pruning hurt most (−9.10pp vs 128e on v5-coder). Same 98e shape, same router, same attention, same norms as the rest of the cohort, plus the mandatory shared-FFN α=1.2 upweight all coder variants carry.

This is the head of the C6 → C6v3 → C6v3lcb line: each step holds the gentle C6 recipe fixed and changes exactly one variable (data, then target weighting), so the deltas are attributable.

Quantized formats

Format Repo Notes
bf16 (this repo) ManniX-ITA/gemma-4-A4B-98e-v6-coder-it 9 shards, ~40.9 GB. The C6v3lcb drop map + shared α=1.2.
GGUF (llama.cpp / ollama) ManniX-ITA/gemma-4-A4B-98e-v6-coder-it-GGUF Bartowski tier sweep + ContribDynamic CD-* per-layer quants + F16 baseline. Tier sweep complete — imatrix K-quants (Q4_K_M plain), all HE+-scored.
NVFP4A16 (vLLM) ManniX-ITA/gemma-4-A4B-98e-v6-coder-NVFP4A16 (planned) ~13 GB, native vLLM, via modelopt==0.43.0. Deployment format — not separately benchmarked.
Ollama mannix/gemma4-98e-v6-coder GGUF tier sweep, ollama pull mannix/gemma4-98e-v6-coder:<tier> (:latest = Q4_K_M). All tiers pushed.

Eval is llama.cpp / Q6_K only for this cohort. NVFP4A16 is published as a vLLM-deployable format but is not benchmarked separately — the Q6_K scoreboard below is representative of the model's quality.

At a glance

| | 128e (base) | 98e v5-coder | 98e v6-coder (C6v3lcb) |
|---|---|---|---|
| Total params | ~26B | ~20.8B | ~20.8B |
| Active params / token | ~4B (top-8 + shared) | ~4B | ~4B |
| Experts per layer | 128 | 98 (30 dropped) | 98 (30 dropped) |
| Layers | 30 | 30 | 30 |
| Drop map | | C6 layer-relevance v4-floor, breadth=50, _fixed data | C6 v4-floor, breadth=50, v3 data + outlier-fix, LCB-targeted |
| Calibration classes | | weighted code + HE + HE+ + LCB | code (3×) + LCB-medium (2×), HE/HE+ targeting OFF |
| Shared FFN α | 1.0 | 1.0 (none) | 1.2 (mlp.down_proj) |
| Built from | | 98e v4 (re-mapped) | 128e original (fresh prune) |

Recipe

The drop map is produced by generate_drop_map_multiclass.py (omnimergekit) from per-expert, per-class contribution scores, then applied with expert_drop.py, then the shared expert is upweighted. Four design choices define C6v3lcb; each is held constant from v5-coder except where noted.

1. C6 gentle base recipe (unchanged from v5-coder)

The aggregation that ranks experts for dropping:

strategy      = max              # per-expert score = MAX over classes (not mean/geomean)
normalize     = rank             # rank-normalize within each (layer, class) before aggregating
protect_top   = 16               # the 16 highest-scoring experts/layer are never dropped
alpha         = 2.0              # contribution sharpening exponent
breadth_bonus = 0.5              # reward experts active across many classes (anti-overfit)
v4_floor_map  = v4_layer_floor_map.json   # per-layer floor: never drop below v4's keep on protected layers
baseline      = teacher_force_98e_p16_clean.json   # tie-break anchor

strategy=max + breadth_bonus is the load-bearing pair: it favors experts that are strongly useful to at least one class and broadly useful across classes, rather than experts with a high average that no single task depends on. This is the optimizer-off-manifold lesson encoded as a recipe — max/percentile beats mean/geomean for importance.
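
To make the max-plus-breadth idea concrete, here is a minimal sketch of how such an aggregation could rank the experts of one layer. It is not the code of generate_drop_map_multiclass.py: the way alpha, the class weights and the breadth bonus combine is an assumption, as are the function names and the "above the middle rank" definition of breadth. The v4 floor map and the tie-break baseline are left out.

```python
from typing import Dict, List

def rank_normalize(values: List[float]) -> List[float]:
    """Map raw scores to [0, 1] by rank within one (layer, class)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    denom = max(len(values) - 1, 1)
    ranks = [0.0] * len(values)
    for position, idx in enumerate(order):
        ranks[idx] = position / denom
    return ranks

def aggregate_keep_scores(contrib: Dict[str, List[float]],
                          weights: Dict[str, float],
                          alpha: float = 2.0,
                          breadth_bonus: float = 0.5) -> List[float]:
    """Per-expert keep score: weighted MAX over classes plus a breadth bonus."""
    active = [c for c in contrib if weights.get(c, 0) > 0]  # zero-weight classes are ignored
    ranked = {c: rank_normalize(contrib[c]) for c in active}
    n_experts = len(next(iter(contrib.values())))
    scores = []
    for e in range(n_experts):
        # Assumed form: sharpen the rank with alpha, then scale by the class weight.
        best = max(weights[c] * ranked[c][e] ** alpha for c in active)
        # Assumed breadth: share of classes in which the expert ranks in the upper half.
        breadth = sum(ranked[c][e] > 0.5 for c in active) / len(active)
        scores.append(best + breadth_bonus * breadth)
    return scores

def choose_drops(scores: List[float], n_drop: int, protect_top: int) -> List[int]:
    """Drop the n_drop lowest-scoring experts, never touching the protected top."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    protected = set(order[-protect_top:]) if protect_top else set()
    return sorted([i for i in order if i not in protected][:n_drop])

# Toy layer with 4 experts and 2 classes (invented numbers).
toy = {"generic_code": [0.1, 0.9, 0.5, 0.2], "generic_math": [0.8, 0.1, 0.4, 0.3]}
toy_scores = aggregate_keep_scores(toy, {"generic_code": 3, "generic_math": 1})
print(toy_scores, choose_drops(toy_scores, n_drop=1, protect_top=1))
```

In the toy layer, expert 0 is weak on code but the best on math, and the max keeps it well ahead of expert 3, which is weak everywhere. A mean-style aggregate would rank expert 0 lower, while the max keeps any specialist that at least one class depends on.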

2. v3 calibration data + outlier fix (changed from v5-coder)

v5-coder ranked experts using the _fixed code-pass traces, which turned out to be ~86% NaN in the deep layers — accidentally NaN-blind where the model has real signal. v6-coder uses the corrected expert_neuron_v5_code_v3.json (real deep-layer signal restored, T73.0 fp32-hot-path patch), and scrubs the residual bf16 weight-norm artifacts with a median-based outlier clamp:

data                = expert_neuron_v5_code_v3.json
outlier_mode        = median
outlier_wnorm_thresh = 1e4       # clamp expert weight-norms above 1e4 to the layer median
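
A median-based clamp of this kind could look like the sketch below. It assumes the median is taken over all weight-norms in the layer, outliers included; the function name is hypothetical and the real implementation in omnimergekit may differ.

```python
from statistics import median
from typing import List

def clamp_weight_norms(norms: List[float], thresh: float = 1e4) -> List[float]:
    """Replace every weight-norm above `thresh` with the layer median."""
    layer_median = median(norms)  # assumption: median over the whole layer
    return [layer_median if n > thresh else n for n in norms]

# Toy layer: three ordinary norms and one bf16-style artifact.
print(clamp_weight_norms([1.0, 2.0, 3.0, 5e6]))
```

A median is a natural choice here because a few extreme values barely move it, whereas a mean would be dragged toward the artifacts it is meant to replace.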

Calibration corpus (v5_code_pass_traces.json, 360 traces): Tier-A 128-token comprehension prompts (342 traces, 1×) + Tier-B 2048-token windowed pass-traces (18 traces, 3×) — the long-window tier captures sustained code reasoning, which is where pruned variants ruminate.

3. LCB-only targeting — the "lcb" in C6v3lcb (changed)

Eight contribution classes are scored; the class weights steer which specialists are protected. v6-coder zeroes the HumanEval targeting and concentrates the targeted budget on LiveCodeBench-medium:

Class C6v3 weight C6v3lcb weight
generic_math 1 1
generic_logic 1 1
generic_code 3 3
generic_science 1 1
generic_creative 1 1
targeted_humaneval 2 0
targeted_humanevalplus 2 0
targeted_lcb_medium_55 2 2

Rationale: HE/HE+ were already at/above 128e on v5-coder (+1.22 / +1.22), so spending protection budget on them is wasted; LCB-medium was the −9.10pp hole. Removing the HE targeting frees the floor to protect LCB-relevant experts harder. The resulting map is 95% identical to C6v3 (generic_code 3× dominates), so HE+/IFEval are expected at ≈C6v3 — the bench that tests the hypothesis is LCB-medium.

4. Mandatory shared-FFN α=1.2 (cohort rule)

After expert drop, router_shared_upweight.py --alpha 1.2 --target mlp.down_proj.weight scales the always-on shared expert's down-projection by 1.2×. Without it the pruned model is the "weak/ruminating" pre-T18 baseline; every coder variant carries it or the v5-coder comparison is unfair. Verified by the .shared_applied marker in this repo.
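
The upweight itself is a plain scalar multiply on one family of tensors. The sketch below shows the idea on a toy state dict of flat float lists; the tensor names are hypothetical, and the real router_shared_upweight.py operates on checkpoint tensors rather than Python lists.

```python
from typing import Dict, List

def upweight_shared_down_proj(state: Dict[str, List[float]],
                              alpha: float = 1.2,
                              target: str = "mlp.down_proj.weight") -> Dict[str, List[float]]:
    """Return a copy of `state` with tensors whose name ends in `target` scaled by alpha."""
    scaled = {}
    for name, values in state.items():
        factor = alpha if name.endswith(target) else 1.0
        scaled[name] = [v * factor for v in values]
    return scaled

# Hypothetical names: one shared down-projection and one routed-expert down-projection.
toy_state = {
    "layers.0.mlp.down_proj.weight": [1.0, 2.0],
    "layers.0.experts.3.down_proj.weight": [1.0, 2.0],
}
print(upweight_shared_down_proj(toy_state))
```

Matching on the full suffix leaves the routed experts untouched in this toy naming scheme; with a different checkpoint layout the match rule would need to be adapted.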


Expert mapping

  • Uniform 30 experts dropped per layer, all 30 layers → 98 kept/layer (128 → 98).
  • Mean overlap with the teacher-force baseline: 17.8 / 30 dropped per layer (min 11, max 24). The recipe re-ranks ~12 of 30 dropped slots/layer away from the naive teacher-force drop — that re-ranking is where the code-faithfulness lives.
  • Aggregated keep scores (rank-normalized, α=2.0, breadth-bonus) span ~0.0–3.5 per layer with mean ~1.85; the protected top-16 sit at the high end, the dropped 30 at the low end, with boundary_ties_within_1pct 2–39/layer (highest in the shallow layers 0/10 — those layers have the flattest expert-importance profile).

Full per-layer keep lists are in expert_drop_metadata.json (per_layer_keep) in this repo; the ranking provenance (per-layer agg min/max/mean, ties, overlap-vs-baseline, v5-only non-overlap experts) is in scripts/v6coder_C6v3lcb_drop_map.json.summary.json.


Problems solved

  1. Rumination on hard code/reasoning problems. Pruned Gemma 4 variants fall into <think> loops on the hardest LCB / GPQA / AIME problems, blowing the token budget without converging. C6v3lcb attacks this two ways: (a) the LCB-targeted drop map keeps the experts that the hard-problem pass-traces actually use; (b) at eval time the lcb_medium_*_v4 template applies a thinking_token_budget=12288 forcing function (parser=gemma4 + enable_thinking) that caps the rumination loop. T109 rumination-signal gate (GPQA-48 / AIME probes) confirmed the variant generates and converges rather than looping.
  2. Data-vs-recipe disentanglement (T102). v6-coder isolates whether v3's earlier regressions were the data or the recipe: holding v5-coder's gentle C6 recipe fixed and only swapping in v3 data + the outlier fix. C6v3 ≈ v5-coder in smoke → the v3 data is fine and C12's aggressive recipe was the rumination cause; C6v3lcb then re-steers that clean baseline at LCB.
  3. Deep-layer calibration corruption (T73.0). The _fixed traces were ~86% NaN in deep layers; v6-coder uses real-signal v3 data with a median outlier clamp so deep-layer experts are ranked on genuine contribution, not NaN-blindness.
  4. Unfair-comparison trap. The shared α=1.2 step is mandatory and marker-verified, so v6-coder vs v5-coder is apples-to-apples.

Scoreboard — Q6_K GGUF, llama.cpp, greedy

Full 9-bench llama.cpp Q6_K run, llama-server --reasoning-format deepseek --reasoning-budget 12288 --parallel 2, greedy (T=0, top_p=1, top_k=0, do_sample=false). The 128e and v5-coder columns are the bartowski-Q6_K / v5-coder-Q6_K reference runs under the identical recipe (apples-to-apples within the llama.cpp/Q6_K backend).

Bench (n) 128e Q6_K v5-coder Q6_K v6-coder Q6_K Δ (v6 − v5c) Δ (v6 − 128e)
ARC-Challenge-chat (1172) 97.10% 95.73% 95.82% +0.09 −1.28
GPQA Diamond flex (198) 72.73% 65.15% 67.17% +2.02 −5.56
GSM8K-100 flex 92.00% 87.00% 91.00% +4.00 −1.00
MATH-500-100 math_verify 94.00% 94.00% 91.00% −3.00 −3.00
AIME 2024 (30) 83.33% 53.33% 63.33% +10.00 −20.00
IFEval-100 (prompt_strict) 97.00% 94.00% 92.00% −2.00 −5.00
HumanEval-164 chat 96.34% 99.39% 98.78% −0.61 +2.44
HumanEval+-164 chat 90.85% 93.29% 93.29% 0.00 +2.44
LCB-medium-55 v4 94.55% 85.45% 96.36% +10.91 +1.81

Read: v6-coder lands on or above v5-coder on 6 of 9 benches (the −0.61 HE dip is a single question out of 164) and beats the unpruned 128e on every code bench (HE +2.44, HE+ +2.44, LCB-55 +1.81). The headline is LCB-medium-55 +10.91pp vs v5-coder: the targeted hole is not just closed but pushed past the base model. AIME recovers +10pp (53.33 → 63.33). The cost lands on MATH (−3) and IFEval (−2) vs v5-coder, the non-code generalist axes the LCB-only class weights (1 1 3 1 1 0 0 2) deliberately deprioritized.

Extended code benches — LCB-medium-100 + MultiPL-E-100 (Q6_K, llama.cpp, greedy)

Run on solidpc (T112), same Q6_K / greedy recipe; MultiPL-E scored via the nuprl/multipl-e-evaluation Docker image. 128e and v5-coder columns from the v5-coder card.

LCB-medium-100 (100-problem superset of LCB-medium-55 v4):

Bench (n) 128e Q6_K v5-coder Q6_K v6-coder Q6_K Δ (v6 − v5c)
LCB-medium-100 95.00% 91.00% 96.00% +5.00

(v6-coder also clears 128e on LCB-100 by +1.00pp — the LCB-targeting win holds on the 100-problem superset, not just the 55-problem v4 set.)

MultiPL-E-100 (HumanEval → Rust / Java / JS, 100/lang, chat-mode + code extraction):

Language (n=100) 128e Q6_K v5-coder Q6_K v6-coder Q6_K Δ (v6 − v5c)
Rust 83.00% 76.00% 82.00% +6.00
Java 91.00% 81.00% 89.00% +8.00
JavaScript 95.00% 86.00% 93.00% +7.00
Macro mean 89.67% 81.00% 88.00% +7.00

v6-coder near-fully recovers MultiPL-E to the 128e level (macro −1.67pp) from v5-coder's −8.67pp gap — code generalization in non-Python languages tracks the LCB-targeting win, not just the in-distribution LCB benches. (264/300 passed; micro = macro = 88.0%.)
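
The macro and micro figures quoted above can be rechecked from the per-language pass counts in the table. This is plain arithmetic on the card's own numbers; the variable names are illustrative.

```python
# v6-coder passes per language, out of 100 problems each (from the table above)
passed = {"Rust": 82, "Java": 89, "JavaScript": 93}
N_PER_LANG = 100

per_lang = {lang: 100 * p / N_PER_LANG for lang, p in passed.items()}
macro = sum(per_lang.values()) / len(per_lang)                      # mean of per-language rates
micro = 100 * sum(passed.values()) / (N_PER_LANG * len(passed))     # pooled over all 300 problems

print(sum(passed.values()), macro, micro)
```

Macro and micro coincide here only because every language has the same number of problems.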

Coder-field comparison — v6-coder vs Qwen2.5-Coder-14B / 7B + Qwen3.5-9B (Q6_K, llama.cpp, greedy)

The 9 canonical benches + MultiPL-E-100, all on the identical llama.cpp Q6_K / greedy recipe (reasoning models served with --reasoning-format deepseek --reasoning-budget 12288 --parallel 2). Architectures differ — this is a same-harness comparison, not a same-class one:

  • v6-coder — Gemma-4 26B-A4B MoE pruned to 98 experts (~20.8B total, ~A4B active), reasoning.
  • Qwen2.5-Coder-14B / 7B-Instruct — dense, non-reasoning code specialists (bartowski Q6_K).
  • Qwen3.5-9B — dense reasoning model (bartowski Q6_K).
Bench (n) v6-coder Q6_K Qwen2.5-Coder-14B Qwen2.5-Coder-7B Qwen3.5-9B
ARC-Challenge-chat (1172) 95.82% 90.53% 85.58% 96.76%
GPQA Diamond flex (198) 67.17% 34.85% 26.26% 73.74%
GSM8K-100 flex 91.00% 89.00% 80.00% 79.00%
MATH-500-100 math_verify 91.00% 62.00% 66.00% 59.00%
AIME 2024 (30) 63.33% 10.00% 10.00% 56.67%
IFEval-100 (prompt_strict) 92.00% 68.00% 54.00% 93.00%
HumanEval-164 chat 98.78% 90.85% 87.20% 89.02%
HumanEval+-164 chat 93.29% 84.76% † 83.54% 80.49%
LCB-medium-55 v4 96.36% 18.18% † 12.73% 58.18%
MultiPL-E-100 (macro) 88.00% 84.67% 80.67% 80.33%

† Qwen2.5-Coder-14B HumanEval+ / LCB-medium-55 reused from the same-stack GGUF HE+ sweep (not re-run in this chain). All comparison cells filled 2026-05-28 (9B Fix-A re-eval + 14B Q6_K reference). Running legend: eval in flight; table updated as results land.

Note on Qwen3.5-9B. Qwen3.5-9B is a verbose, slow thinking model: it emits long <think> reasoning chains (often ≥1900 tokens even on a trivial GSM8K question), so it runs several× slower per question than the non-reasoning Qwen2.5-Coder models — well beyond what its 9B size would suggest. Its GSM8K / MATH-500 / GPQA cells are being re-run after a harness fix: under batched, reasoning-parsed serving the verbose thinking intermittently left the final answer inside the reasoning block (mis-scored as empty content). ARC (96.76%) and AIME (56.67%) were unaffected.


Answer-length analysis (anti-rumination)

v6-coder thinks longer on average than 128e on the code benches — it spends the full bounded thinking_token_budget=12288 where 128e often answers in a few hundred tokens. The question this section answers is whether that extra length is rumination (long thinking that fails to reach an answer — the failure mode the LCB targeting was built to fix) or productive reasoning that lands on a PASS. The tables below compare per-problem completion length against 128e and v5-coder on the same problems, from omk_eval token_stats (characters from the raw completion text; tokens via the 128e tokenizer), on the real-n benches. 128e and v5-coder are the matching Q6_K runs. The short answer: by the rumination-as-wasted-thinking definition, v6-coder ruminates less than 128e, not more — 128e's longest LCB outputs (up to 51k chars) are failures, while v6-coder's long outputs overwhelmingly pass.

Per-problem completion length — characters (p50 / p90 / max):

Bench (n) 128e v5-coder v6-coder
GPQA Diamond (198) 2558 / 17661 / 26572 2602 / 16768 / 26518 2584 / 17162 / 25218
AIME 2024 (30) 2091 / 6192 / 7198 2405 / 9239 / 10974 2190 / 8404 / 9689
LCB-medium-55 4369 / 14894 / 51568 31222 / 38456 / 44868 30685 / 36487 / 41301
LCB-medium-100 1899 / 14894 / 51568 30035 / 36846 / 44868 29514 / 36221 / 43845
MultiPL-E-100 (300) 244 / 592 / 2944 257 / 764 / 3083 246 / 594 / 2169
MATH-500 (100) 1054 / 1970 / 7520 1145 / 1925 / 7189 1087 / 1961 / 8420
GSM8K (100) 254 / 699 / 3438 264 / 698 / 2386 272 / 682 / 1332
IFEval (100) 781 / 3595 / 7425 862 / 4150 / 17539 850 / 3803 / 7200
HumanEval (164) 704 / 1303 / 8578 696 / 1408 / 3923 721 / 1296 / 4033
HumanEval+ (164) 684 / 1238 / 5704 709 / 1419 / 3578 715 / 1451 / 4498
ARC-Challenge (1172) 1203 / 1662 / 6407 1201 / 1655 / 43570 1217 / 1663 / 43927

Per-problem completion length — tokens (p50 / p90 / max):

Bench (n) 128e v5-coder v6-coder
GPQA Diamond (198) 843 / 8189 / 8189 855 / 8189 / 8189 876 / 8189 / 8189
AIME 2024 (30) 960 / 3993 / 4021 1156 / 4012 / 4022 939 / 4011 / 4021
LCB-medium-55 1213 / 4644 / 15585 12804 / 13295 / 15724 12799 / 13184 / 15886
LCB-medium-100 565 / 4644 / 15947 12726 / 13156 / 15945 12735 / 13103 / 15724
MultiPL-E-100 (300) 85 / 184 / 1012 89 / 247 / 1017 86 / 184 / 1013
MATH-500 (100) 423 / 925 / 3035 445 / 802 / 3319 404 / 872 / 3318
GSM8K (100) 123 / 253 / 1107 114 / 221 / 858 120 / 254 / 396
IFEval (100) 202 / 826 / 1427 217 / 912 / 4058 234 / 850 / 4060
HumanEval (164) 226 / 431 / 2890 226 / 413 / 1363 226 / 443 / 1305
HumanEval+ (164) 223 / 430 / 1744 229 / 420 / 1209 226 / 495 / 1251
ARC-Challenge (1172) 255 / 359 / 1453 258 / 360 / 16266 262 / 361 / 16266
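
The p50 / p90 / max triples in the two tables above can be computed from a list of per-problem lengths along these lines. The sketch uses the nearest-rank percentile convention, which is an assumption: the card does not say which convention omk_eval token_stats uses, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

def nearest_rank(sorted_vals: List[int], pct: float) -> int:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    k = max(1, math.ceil(pct / 100 * len(sorted_vals)))
    return sorted_vals[k - 1]

def length_summary(lengths: List[int]) -> Tuple[int, int, int]:
    """Return (p50, p90, max) for a list of per-problem completion lengths."""
    vals = sorted(lengths)
    return nearest_rank(vals, 50), nearest_rank(vals, 90), vals[-1]

# Invented lengths for illustration only.
print(length_summary([300, 100, 1000, 200, 500, 400, 900, 700, 600, 800]))
```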

Budget-saturation incidence — share of problems whose completion reached ≥12k tokens (at/near the thinking_token_budget=12288 cap). Saturation by itself is not rumination: a saturated output that reaches a correct PASS is productive use of the budget. The pruned variants saturate on nearly everything; 128e almost never does — but see the next table.

Bench (n) 128e v5-coder v6-coder
LCB-medium-55 1 / 55 (1.8%) 54 / 55 (98.2%) 53 / 55 (96.4%)
LCB-medium-100 2 / 100 (2.0%) 99 / 100 (99.0%) 97 / 100 (97.0%)

Rumination — long thinking that fails to PASS. The right metric is not median length (128e looks short only because it answers easy problems fast). It is the share of the model's budget-saturated outputs that still fail — i.e. tokens burned without a correct answer:

Bench (n) 128e v5-coder v6-coder
LCB-medium-55 — saturated-and-failed 1 / 1 (100%) 8 / 54 (15%) 2 / 53 (3.8%)
LCB-medium-100 — saturated-and-failed 2 / 2 (100%) 9 / 99 (9%) 4 / 97 (4.1%)
LCB-100 — mean completion tokens, PASS vs FAIL 1468 vs 8368 12687 vs 14241 12616 vs 14302

Every 128e output that reaches the budget cap is a failure (100%), and 128e's failed problems run 5.7× longer than its passed ones (mean 8368 vs 1468 tok): 128e only thinks long when it is lost. v6-coder thinks long on ~97% of problems but 96% of those long outputs PASS — its long thinking is overwhelmingly productive, not rumination.
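
The saturated-and-failed share used above reduces to a simple count over per-problem records. The sketch below assumes each record is a (completion_tokens, passed) pair and uses the ≥12k-token saturation threshold from the incidence table; the record layout and function name are hypothetical, and the sample data is invented.

```python
from typing import Iterable, Tuple

SATURATION_TOKENS = 12_000  # "at/near the cap" threshold from the saturation table

def wasted_thinking(records: Iterable[Tuple[int, bool]]) -> Tuple[int, int, float]:
    """Return (saturated_and_failed, saturated, failure share among saturated outputs)."""
    saturated = [(tokens, ok) for tokens, ok in records if tokens >= SATURATION_TOKENS]
    failed = sum(1 for _, ok in saturated if not ok)
    rate = failed / len(saturated) if saturated else 0.0
    return failed, len(saturated), rate

# Invented records: short outputs are ignored regardless of outcome.
sample = [(500, True), (12_500, True), (13_000, False), (12_800, True), (900, False)]
print(wasted_thinking(sample))
```

Conditioning on saturation is what separates this metric from a plain failure rate: a short wrong answer counts against accuracy but not against rumination.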

Key findings:

  • LCB — v6-coder ruminates less than 128e, under the only definition that matters (long thinking that fails to reach a PASS). The naive read of the median says the opposite — 128e's median LCB completion is ~0.6–1.2k tokens vs v6-coder's budget-capped ~12.7k — but that short median is an artifact of mixing easy and hard problems. 128e only thinks long when it is lost, and then it fails: every 128e output that saturates the budget is a failure (2/2 on LCB-100, 1/1 on LCB-55), and the single longest output across all three models — 51,568 chars / 15,585 tokens on lcb/leetcode/3659 — is a 128e failure, one that both v6-coder (PASS, 12,691 tok / 32,676 chars) and v5-coder solve in two-thirds the length. v6-coder, by contrast, saturates the budget on ~97% of problems but 96% of those long outputs PASS; its longest LCB-55 output is a correct answer and its max length (41–43k chars) sits below 128e's 51k. Wasted long-thinking rate: 128e 100%, v6-coder ~4%. And v6-coder beats 128e on score on both axes (96.36 / 96.00 vs 94.55 / 95.00) — more correct and a lower worst-case tail.
  • 128e's routing is erratic in both directions, and the targeted map repairs both. 128e fails LCB two ways: by over-thinking (3659: 51k chars, no answer) and by under-thinking (lcb/leetcode/3000: a premature wrong answer in 579 tokens; 3793: 3101 tok) — on the very same problems v6-coder spends the full ~13k-token budget and passes. This bidirectional instability is the fingerprint of an imperfect base router; the 51k outliers and the lower 128e score are the symptom. v6-coder's LCB-targeted expert map corrects 3 of the 5 problems 128e fails on LCB-100 (2 of 3 on LCB-55). The thinking_token_budget=12288 forcing-function only bounds the loop; the targeted map is what makes the bounded thinking land on a PASS instead of a 51k-char dead end. (Versus v5-coder, v6-coder is statistically the same length — median 12799 vs 12804 tok on LCB-55 — yet passes far more: 53/55 vs 47/55, 96/100 vs 91/100.)
  • AIME — genuinely less rumination, +10pp. Median completion 939 tok vs v5-coder's 1156, p90 char tail 8404 < 9239, scoring 63.33% vs 53.33% — shorter chains and more correct.
  • GSM8K — tightest of the three (max 396 tok; 128e 1107, v5-coder 858); no rumination tail.
  • IFEval — no ruminating outlier (max 7200 chars vs v5-coder's 17539; ≈128e 7425).
  • GPQA — a base-model ceiling, not a prune artifact. p90 completion tokens peg at ~8189 on all three models; the hard-question rumination is inherited from Gemma 4 and is identical across the cohort, not widened by pruning.
  • ARC — one shared single-sample outlier (~16266 tok / ~44k chars) on both pruned variants, absent on 128e; p50/p90 normal and identical, so not systemic length growth.
  • MultiPL-E — most compact code of the three. On the bench where the answer is the code (extracted final block, no reasoning trace), v6-coder has the tightest tail — max 2169 chars vs 128e 2944 / v5-coder 3083, p90 591 chars / 184 tok — at ≈128e pass-rate (macro 88.0%).
  • Net: v6-coder is more accurate than 128e on both LCB sets (96.36 / 96.00) and roughly level on MultiPL-E (macro 88.0), at a higher average inference cost (it thinks the full bounded budget) but with a lower worst case (max 41–43k chars vs 128e's 51k) and a far lower wasted-thinking rate (~4% vs 100%). On the short-reasoning benches (AIME / GSM8K / IFEval) it additionally reduces length vs v5-coder. The targeted map did not trade length for accuracy: it traded 128e's erratic, sometimes-catastrophic tail for a higher but bounded and productive average.
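
The "wasted long-thinking rate" used in the LCB bullets above can be recomputed from per-problem rows. The following is a minimal sketch, not the omk_eval implementation: the `Row` fields, the helper name `wasted_thinking_rate`, and the choice of 12288 tokens as the saturation threshold are assumptions for illustration, and the rows are synthetic toy data rather than bench results.

```python
from dataclasses import dataclass


@dataclass
class Row:
    doc_id: str             # hypothetical per-problem key
    completion_tokens: int  # hypothetical length field
    passed: bool            # hypothetical PASS/FAIL flag


def wasted_thinking_rate(rows, saturation_tokens):
    """Return (saturation_rate, wasted_rate).

    saturation_rate: share of problems whose output reaches saturation_tokens.
    wasted_rate: share of those saturated outputs that did not PASS.
    """
    if not rows:
        return 0.0, 0.0
    saturated = [r for r in rows if r.completion_tokens >= saturation_tokens]
    if not saturated:
        return 0.0, 0.0
    failed = sum(1 for r in saturated if not r.passed)
    return len(saturated) / len(rows), failed / len(saturated)


# Synthetic "erratic" profile: mostly short answers, long outputs all fail.
toy_erratic = [Row(str(i), 900, True) for i in range(8)]
toy_erratic += [Row("8", 15000, False), Row("9", 13000, False)]

# Synthetic "bounded" profile: every output is long, one of 25 fails.
toy_bounded = [Row(str(i), 12700, i != 0) for i in range(25)]

print(wasted_thinking_rate(toy_erratic, 12288))
print(wasted_thinking_rate(toy_bounded, 12288))
```

The two rates are kept separate on purpose: a model can have a short median and still waste every long output, which is the pattern described for 128e, or saturate nearly always while wasting very little, which is the pattern described for v6-coder.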

Methodology / data integrity. Per-problem lengths come from omk_eval token_stats over each bench's per-problem samples archive: lm-eval samples_*.jsonl for the 9-bench, and the native runners' lcb_result.samples.jsonl / mpe_result.samples.jsonl (mirrored into the durable sqlite resume cache) for LCB/MPE. MultiPL-E is tabulated, but its rows measure extracted-code length, not reasoning: its samples store only the final code block (no <think> trace, finish_reasons all terminal, thinking_tokens_est=0), and its token counts are tokenizer-estimated because the code-extraction path returns no server token count. MPE is therefore a code-conciseness reference, not a rumination metric, and on it v6-coder is the most compact. The reasoning-bearing benches (LCB / GPQA / AIME / MATH) carry the rumination signal. thinking_tokens_est is 0 across all benches under llama.cpp --reasoning-format deepseek (reasoning is emitted inline, not as a separable block), so the completion-token columns are the length signal. The LCB/MPE token_stats are computed over per-problem rows keyed on doc_id, which the native runners emit as of the 2026-05-25 fix; earlier card revisions briefly collapsed these to n=1. All numbers here are at full n (LCB-55 n=55, LCB-100 n=100, MultiPL-E n=300).
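
As a rough illustration of the per-problem keying described above, the sketch below computes n, p50, p90 and max over a samples JSONL. It is not the omk_eval code: the helper names `percentile` and `token_stats`, the `completion_tokens` field name, and the nearest-rank percentile definition are assumptions. Rows are keyed on `doc_id`, so a resumed or repeated row does not inflate n, and a row without `doc_id` is rejected instead of being silently collapsed into one bucket.

```python
import json
import math


def percentile(sorted_vals, q):
    """Nearest-rank percentile for an integer q in (0, 100]."""
    rank = max(1, math.ceil(q * len(sorted_vals) / 100))
    return sorted_vals[rank - 1]


def token_stats(jsonl_lines):
    """Length statistics over per-problem rows keyed on doc_id."""
    per_doc = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if "doc_id" not in row:
            # Without a key, all rows would fall into one bucket (n=1).
            raise ValueError("row without doc_id")
        # A later row for the same doc_id replaces the earlier one.
        per_doc[row["doc_id"]] = int(row["completion_tokens"])
    if not per_doc:
        raise ValueError("no rows")
    vals = sorted(per_doc.values())
    return {
        "n": len(vals),
        "p50": percentile(vals, 50),
        "p90": percentile(vals, 90),
        "max": vals[-1],
    }


# Synthetic rows: ten problems plus one repeated row for doc_id 0.
demo = [json.dumps({"doc_id": i, "completion_tokens": 100 * (i + 1)})
        for i in range(10)]
demo.append(json.dumps({"doc_id": 0, "completion_tokens": 100}))
print(token_stats(demo))
```

Checking that the reported n matches the bench size (55, 100 or 300 here) is a cheap guard against the collapse to n=1 mentioned above.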


Reproduce

# 1. drop map (omnimergekit)
python scripts/generate_drop_map_multiclass.py \
    --data scripts/expert_neuron_v5_code_v3.json --target 98 \
    --protect-top 16 --alpha 2.0 --strategy max --normalize rank \
    --breadth-bonus 0.5 --v4-floor-map scripts/v4_layer_floor_map.json \
    --baseline-drop-map scripts/teacher_force_98e_p16_clean.json \
    --outlier-mode median --outlier-wnorm-thresh 1e4 \
    --class-weights 1 1 3 1 1 0 0 2 \
    --output scripts/v6coder_C6v3lcb_drop_map.json

# 2. prune + mandatory shared upweight (see scripts/build_v6coder_C6v3lcb.sh)
python scripts/expert_drop.py --source-dir google/gemma-4-26B-A4B-it \
    --drop-map scripts/v6coder_C6v3lcb_drop_map.json --suffix=-v6-coder-C6v3lcb-it
python scripts/router_shared_upweight.py \
    --model-dir google/gemma-4-A4B-98e-v6-coder-C6v3lcb-it \
    --alpha 1.2 --target mlp.down_proj.weight

Lineage & provenance

  • Base: google/gemma-4-26B-A4B-it (128e, fresh prune — not derived from v4/v5 weights).
  • Cohort: v5-coder (C6 _fixed) → C6v3 (C6 v3-data) → C6v3lcb (this, LCB-targeted).
  • Tooling: omnimergekit (expert_drop.py, generate_drop_map_multiclass.py, router_shared_upweight.py); eval via omk_eval.py + EVAL_PROTOCOL.
  • Eval sampler is frozen GREEDY across the whole cohort for apples-to-apples comparison.

License: Gemma Terms of Use. This is a derivative of Gemma 4 and inherits those terms.
