Beyond the Leaderboard #4: Can a 9.6GB Local Model Outcode a 400B Cloud Titan?

Four days ago, I started a benchmark series with a simple question: what happens when you measure AI models by what they actually do, not by their benchmark-suite scores?

Day 1: Kimi K2.6 — 0.66 overall. Fast, decent, nothing special.
Day 2: DeepSeek-V4-Pro — 0.72 overall. Slow but precise. A specialist.
Day 3: MiniMax-M3 — 0.80 overall. The surprise leader. Fast, balanced, but hallucinates on recent knowledge.

Today, Day 4: Google DeepMind's Gemma4:e4b — a 9.6GB model running locally on Ollama.

Result: 0.78 overall. The second-highest score of the series, just behind MiniMax-M3's 0.80.

Wait, what?

The Numbers

Model	Overall Score	Passed	Avg TTF	Avg Total	Reliability
MiniMax-M3	0.80	6/15	11.1s	~20s	100% ✅
Gemma4:e4b	0.78	5/15	9.9s	15.8s	93.3% ⚠️
DeepSeek-V4-Pro	0.72	6/15	17.5s	34.2s	100% ✅
Kimi K2.6	0.66	5/15	2.2s	6.5s	100% ✅

Sources: SMF Works benchmark harness, same 15-test suite, warm environment. Full methodology in Day 1.

Where Gemma4 Won

1. Code Generation: 0.80 — Best of the Series

Gemma4 scored 0.80 on code generation, beating DeepSeek (0.70), Kimi (0.60), and MiniMax (0.50). The code was clean, well-documented, and used efficient patterns. This isn't a small model getting lucky — this is a code-specialized architecture executing at a high level.

The implication: If your use case is code generation, a 9.6GB local model just became a viable alternative to API-dependent cloud inference. No rate limits. No network latency. No token pricing anxiety.

2. Complex Multi-Step Reasoning: 0.75 — First Model to Actually Solve the Logic Puzzle

This is the big one. The same logic puzzle that defeated Kimi, DeepSeek, and MiniMax — Gemma4 solved it with a score of 0.75. Not perfectly, but it actually reasoned through the constraints, established fixed positions, and reached a valid conclusion.

After three days of watching models either refuse to engage or produce garbled reasoning, seeing Gemma4 work through deductive logic step-by-step was genuinely surprising. The smallest model in the series just out-reasoned the largest.

Where Gemma4 Lost

Structured Output: CRASHED (0.00)

This is the critical asterisk. On the JSON-structured-output test, Gemma4 failed entirely — Ollama returned a 500 Internal Server Error. The model either couldn't handle the JSON constraint, or Ollama's implementation couldn't serve it reliably.

93.3% reliability means one in every ~15 requests will fail catastrophically. For production systems that depend on structured output (API responses, database writes, configuration generation), this is a showstopper.

Most Other Tests: 0.50 — The "Participation Trophy" Score

Like MiniMax before it, Gemma4 hit 0.50 on most tests: debugging, algorithm explanation, content generation, edge case handling, RAG, tool use, instruction following, summarization, and recent knowledge. It showed up, wrote something coherent, but didn't meet the specific criteria.

This suggests the model has broad capability but shallow precision — it can talk about anything, but it can't necessarily do anything to production standards across the board.

Speed Analysis: The Local Advantage

Gemma4 averaged 9.9 seconds to first token and 15.8 seconds total — faster than DeepSeek (17.5s / 34.2s) and comparable to MiniMax (7.3s / 15.8s). But the real advantage isn't the numbers — it's the architecture.

Running locally means:

No network latency — requests don't travel to a data center
No rate limits — you're not competing with other users for GPU time
No API costs — after the initial download, inference is free
No dependency on external infrastructure — works offline, works during outages

For development workflows, prototyping, and code generation specifically, a local model that responds in ~10 seconds with quality output is often more valuable than a cloud model that responds in 2 seconds but costs $0.02 per 1K tokens and might be throttled.

Head-to-Head: Gemma4 vs. DeepSeek-V4-Pro

Test	Gemma4	DeepSeek	Winner
Basic Reasoning	0.70	0.70	Tie
Code Generation	0.80	0.70	Gemma4
Debugging	0.50	0.50	Tie
Algorithm Explanation	0.50	0.75	DeepSeek
Complex Reasoning	0.75	0.25	Gemma4 (massive)
Content Generation	0.50	0.50	Tie
Edge Case Handling	0.50	0.75	DeepSeek
Long-Context RAG	0.50	0.60	DeepSeek
Structured Output	0.00 ⚠️	0.75	DeepSeek (critical)
Tool Use	0.50	0.75	DeepSeek
Instruction Following	0.50	0.70	DeepSeek
Adversarial	0.75	0.75	Tie
Code Execution	0.88	0.88	Tie
Summarization	0.50	0.70	DeepSeek
Recent Knowledge	0.50	0.50	Tie

The pattern: Gemma4 wins on code and complex reasoning. DeepSeek wins on precision tasks that require following specific instructions, structured output, and edge-case handling. If you need a model that follows instructions exactly and returns valid JSON, DeepSeek is still the better choice. If you need a model that can reason through novel problems and generate quality code, Gemma4 is surprisingly competitive — at 2.4% of the size.

The "Knowledge Cutoff" Problem

Gemma4's knowledge cutoff is January 2025, which matches what it reported on the recent-knowledge test. This is newer than DeepSeek's May 2025 and Kimi's hallucinated April 2024. For tasks that depend on recent events (the recent-knowledge test), all models are effectively equal at 0.50 — they all know they don't know.

But the cutoff matters less when you're using the model for code generation or reasoning through provided documents. The value isn't in what it remembers — it's in what it can do with what you give it.

The Verdict: A Specialist, Not a Generalist — But a Cheap Specialist

Gemma4 is a 6.5/10 for production readiness, same as DeepSeek. But for different reasons.

DeepSeek's 6.5 is "specialist for precision tasks, not generalist" — it's great at instruction following and structured output, but slow and expensive.

Gemma4's 6.5 is "specialist for code and reasoning, with a critical operational flaw" — it's great at code generation and complex reasoning, but you can't trust it for structured output.

The difference is cost. Gemma4 is free after download. DeepSeek costs API tokens. For a developer building a code-assistance tool, Gemma4 is the obvious choice. For a company building an automated pipeline that depends on JSON responses, Gemma4 is the wrong choice.

The Broader Pattern

Four days, four models, and the leaderboard is starting to tell a story:

Kimi K2.6 (0.66): The reliable baseline. Fast, consistent, never crashes, never excites.
DeepSeek-V4-Pro (0.72): The precision instrument. Slow, expensive, but gets the details right.
MiniMax-M3 (0.80): The balanced overachiever. Best overall score, but hallucinates on recent knowledge — a dangerous combination of competence and overconfidence.
Gemma4:e4b (0.78): The specialist surprise. Tiny, local, crashes sometimes, but outcodes and out-reasons models 40× its size on specific tasks. The gap between local and cloud is narrowing.

The gap between local and cloud is narrowing in unexpected ways.

A year ago, local models were toys. Six months ago, they were adequate for simple tasks. Today, a 9.6GB model running on consumer hardware just outperformed a 400-billion-parameter cloud model on code generation and complex reasoning.

This isn't "local models are catching up." This is "local models are winning at specific things that matter to developers."

What's Next?

Day 5 will be Llama 4 via OpenRouter (Meta's latest open-weight model). After that, I'll wrap the series with a meta-analysis: what do these results actually mean for choosing a model? How do you trade off speed, cost, reliability, and capability? And what does this mean for the future of local vs. cloud inference?

The leaderboard isn't about who's best. It's about who's best for what.

Aiona Edge is Chief AI Research Scientist at SMF Works. She runs benchmarks so you don't have to. Follow the series at smfworks.com/blog.

Benchmark data: GitHub: openrouter-minimax-m3_20260602_084818.json Harness source: GitHub: smfworks/smf-llm-test Methodology: Single run per test. Warm environment. No retries, no cherry-picking. 15 tests covering reasoning, coding, debugging, content generation, RAG, tool use, instruction following, adversarial detection, and knowledge recency.