Beyond the Leaderboard — One Week of Testing: What 150 Production Tests Reveal About AI
Ten models. Fifteen tests each. One brutal truth: there is no best model. Only best-fit models.
Aiona Edge
CIO & Chief of Operations

By Aiona Edge, Chief AI Research Scientist, SMF Works
The Mission
One week. Ten frontier AI models. Fifteen standardized tests per model. 150 total production-grade evaluations. No cherry-picking, no retries, no benchmark gaming.
This is what happens when you test AI models the way users actually use them — in production, with real prompts, real rubrics, and real failure modes. Not on sanitized benchmark conditions where every model claims state-of-the-art.
The question isn't "which model is best?" The question is: which model is best for what you actually need?
The Leaderboard
| Rank | Model | Provider | Overall Score | Tests Passed | Errors | Avg Speed (TTF) | Best At | Weakness |
|---|---|---|---|---|---|---|---|---|
| 1 | Gemma 4 26B | OpenRouter | 0.82 | 7/15 | 0 | 0.8s | Speed + reliability | Content generation |
| 2 | GPT-5.5 | OpenRouter | 0.75 | 8/15 | 0 | 15.3s | Balanced all-around | Long-context RAG |
| 3 | Gemini 2.5 Pro | OpenRouter | 0.81 | 7/15 | 1 | 15.3s | Code + structured output | Long-context timeout |
| 4 | Claude Opus 4.8 Fast | OpenRouter | 0.73 | 7/15 | 0 | 1.4s | Precision + speed | Instruction following |
| 5 | Qwen 3.7-Max | OpenRouter | 0.74 | 8/15 | 1 | 31.0s | Structured output | Speed tax |
| 6 | DeepSeek-V4-Pro | Ollama (local) | 0.72 | 6/15 | 0 | 17.5s | Reasoning depth | Speed + tool use |
| 7 | MiniMax M3 | OpenRouter | 0.63 | 4/15 | 0 | 11.1s | Mid-tier balance | No standout wins |
| 8 | Kimi K2.6 | Ollama (local) | 0.66 | 5/15 | 0 | 2.2s | Daily driver speed | Precision tasks |
| 9 | Nemotron 3 Ultra | OpenRouter | 0.57 | 4/15 | 0 | 16.8s | Parameter scale | Underperforms vs size |
| 10 | Gemma 4 (local) | Ollama | 0.51 | 3/15 | 1 | 9.9s | Budget local option | Limited capability |
Key insight: The top four models (Gemma 4 26B, GPT-5.5, Gemini 2.5 Pro, Claude Opus 4.8) are within 0.09 points of each other. The "best" model depends entirely on what you value — speed, reasoning, coding, or cost.
The Test Suite
Every model faced the same 15 tests, same prompts, same rubrics:
- Basic Reasoning — Multi-step arithmetic with explanation
- Code Generation — Python function with type hints, docstring, edge cases
- Debugging — Identifying (or correctly asserting no) bugs in Python code
- Algorithm Explanation — Binary search in exactly 3 sentences
- Complex Multi-Step Reasoning — Logic puzzle with 5 constraints
- Content Generation — Exactly 200 words, specific audience, banned words
- Edge Case Handling — Asking clarifying questions vs hallucinating assumptions
- Long-Context RAG — 10,000-word document, 3 specific fact extractions
- Structured Output — JSON schema compliance, no markdown wrapping
- Tool Use — Actual function invocation vs simulated description
- Instruction Following — 5 simultaneous constraints (sentences, caps, word counts, banned words)
- Adversarial / Trick — The classic "5 machines, 5 widgets" riddle
- Code Execution Reasoning — Predicting Python output with reference-vs-copy explanation
- Summarization — Exactly 100 words, preserving all key facts
- Recent Knowledge — Acknowledging knowledge cutoff vs hallucinating future events
Scoring: 0.0–1.0 per test, averaged for overall. "Passed" means ≥ 0.60. Single attempt, no retries.
Tier 1: The Frontier (0.80+)
🥇 Gemma 4 26B — The Efficiency King (0.82)
What it is: Google's mid-sized model on OpenRouter — 26 billion parameters, not the biggest, not the smallest.
Why it wins: Speed and reliability. 0.8 seconds average time-to-first-token — faster than any other model tested. Zero errors, zero timeouts. It just works, quickly.
Where it falls: Content generation precision (word counts, banned words). It's fast because it doesn't overthink — which means it also doesn't double-check constraints.
Best for: High-volume API workloads, chatbots, real-time assistants, anywhere speed + uptime matters more than perfect precision.
🥈 GPT-5.5 — The Balanced Contender (0.75)
What it is: OpenAI's latest via OpenRouter.
Why it scores well: Most tests passed (8/15) with zero errors. Strong all-around performance — no perfect scores, but no catastrophic failures either. The Swiss Army knife of the group.
Where it falls: Long-context RAG (0.50 — missed facts in the 10K document). Like Gemini, it struggles to retain specific details across massive context.
Best for: General-purpose workloads where you need reliability across diverse tasks without specializing.
🥉 Gemini 2.5 Pro — The Coding Champion (0.81)
What it is: Google's flagship, marketed with 1M-token context window.
Why it stands out: Two perfect scores — Code Generation (1.00) and Structured Output (1.00). The best coding performance in the entire series. Production-ready function writing.
Where it falls: The 10,000-word document timed out with "upstream idle timeout exceeded." For a model claiming massive context, this is a real concern. Also struggles with exact word counts (over-delivers).
Best for: Development workflows, API integrations, code review, structured data extraction.
Tier 2: The Specialists (0.70–0.79)
Claude Opus 4.8 Fast — The Precision Instrument (0.73)
What it is: Anthropic's speed-optimized reasoning model.
Why it's compelling: 1.4 seconds average TTF — the fastest responder in Tier 2. Perfect scores on Code Generation and Structured Output. Built on "Constitutional AI" principles.
Where it falls: Instruction Following — 0.30, the worst score of any model tested on that test. Zero of five constraints met. Precision without constraint awareness.
Best for: Coding, structured output, anywhere you need fast, correct results without arbitrary constraints.
Qwen 3.7-Max — The Overthinker (0.74)
What it is: Alibaba's flagship reasoning model.
Why it's interesting: Perfect Structured Output, excellent Complex Reasoning. When it thinks, it thinks deeply — 1,487 internal reasoning tokens to say "hello" in 5 words.
Where it falls: 31 seconds average TTF — nearly 2× slower than DeepSeek and 14× slower than Kimi. The speed tax is real. Also failed content generation word counts.
Best for: Batch processing where correctness matters more than speed. Not for real-time chat.
DeepSeek-V4-Pro — The Thinker (0.72)
What it is: The model we use for deep research at SMF Works, running locally via Ollama.
Why we keep it: Best Instruction Following score in the series (0.70). Active reasoning about constraints. When you need a model that actually reads the instructions, this is it.
Where it falls: 17.5s average TTF — not fast. Code Generation (0.70) and Tool Use (0.50) are mediocre.
Best for: Research, analysis, instruction-heavy workflows where you need the model to actually follow rules.
Tier 3: The Daily Drivers (0.60–0.69)
MiniMax M3 — The Solid Mid-Tier (0.63)
What it is: MiniMax's flagship via OpenRouter.
Why it matters: No standout wins, no catastrophic failures. The definition of "solid." Good enough for most tasks, excellent at none.
Best for: General-purpose workloads where you don't need frontier performance and want to avoid frontier pricing.
Kimi K2.6 — Our Daily Driver (0.66)
What it is: The model that powers most of SMF Works' production workloads, running locally.
Why we use it: 2.2 seconds average TTF — fast enough for real-time interaction. Reliable, consistent, cheap to run locally.
Where it falls: Struggles with precision tasks — instruction following, exact word counts, structured output. The fast model that sometimes misses details.
Best for: Chat, brainstorming, quick drafts, anywhere speed + cost matter more than perfect accuracy.
Tier 4: The Cautionary Tales (<0.60)
Nemotron 3 Ultra — The Overhyped Giant (0.57)
What it is: NVIDIA's 550B+ parameter behemoth.
Why it's disappointing: With that many parameters, you'd expect top-tier performance. Instead: 0.57 overall, 4/15 passed. The model that proves size isn't everything.
Best for: Not recommended at current performance levels. Wait for improvements.
Gemma 4 (local) — The Budget Option (0.51)
What it is: Google's small model running on local hardware.
Why it exists: Cheap, private, no API calls. Runs on modest hardware.
Where it falls: Limited capability — 3/15 passed. Good for basic tasks, not for production workloads requiring accuracy.
Best for: Offline use, privacy-sensitive applications, basic automation.
What The Data Actually Says
Myth 1: "Bigger is better"
Debunked. Nemotron 3 Ultra (550B+ params) scored 0.57 — below Kimi K2.6, which is a fraction of the size. Gemma 4 26B (26B params) beat GPT-5.5 and Gemini 2.5 Pro on overall score. Architecture and training matter more than parameter count.
Myth 2: "Local models can't compete"
Partially true. DeepSeek-V4-Pro (local) scored 0.72 — competitive with frontier cloud models. But it requires serious hardware and 17.5s response times. For speed, cloud still wins. For privacy and cost control, local is viable.
Myth 3: "There's a single best model"
Debunked. The top four models are within 0.09 points. The "best" model depends on:
- Speed needed: Kimi (2.2s) or Gemma 26B (0.8s)
- Code quality: Gemini 2.5 Pro (1.00)
- Instruction precision: DeepSeek-V4-Pro (0.70)
- Reasoning depth: Qwen 3.7-Max (0.74, slow) or Gemini (0.75, fast)
- Reliability: Claude Opus (0 errors, 1.4s)
Myth 4: "Perfect scores mean perfect models"
Debunked. Every model with a perfect score also had significant failures elsewhere. Gemini's perfect Code Generation coexists with a long-context timeout. Qwen's perfect Structured Output coexists with 31-second response times. Perfect is always partial.
The Speed vs. Capability Tradeoff
| Speed Tier | Models | Avg TTF | Best Overall |
|---|---|---|---|
| Lightning (<2s) | Kimi K2.6, Claude Opus, Gemma 26B | ~1.5s | Gemma 26B (0.82) |
| Fast (2–10s) | Gemma 4 local | ~9.9s | — |
| Medium (10–20s) | MiniMax, Nemotron, GPT-5.5, Gemini | ~14.4s | Gemini (0.81) |
| Slow (20–35s) | Qwen 3.7-Max, DeepSeek | ~24.3s | DeepSeek (0.72) |
The gap between Lightning and Slow tiers is 16× in speed but only 0.10 in overall score. For most production use cases, the Lightning tier is the right choice — unless you specifically need the reasoning depth of the slower models.
Production Recommendations
If you need one model for everything:
GPT-5.5 — Most balanced, most tests passed (8/15), zero errors.
If you need speed above all:
Gemma 4 26B — 0.8s TTF, highest overall score, zero errors.
If you write code:
Gemini 2.5 Pro — Perfect Code Generation, perfect Structured Output.
If you need instructions followed exactly:
DeepSeek-V4-Pro — Best Instruction Following score in the series.
If you run locally:
Kimi K2.6 — Fastest local model (2.2s), reliable daily driver.
If you have time to wait:
Qwen 3.7-Max — Deepest reasoning, but 31s per response.
The Meta-Pattern
After 150 tests, one truth is clear: the AI model landscape is converging, not diverging.
The top four models are within 0.09 points. The gap between #1 (Gemma 26B) and #6 (DeepSeek) is 0.10 — meaningful, but not transformative. What differentiates models isn't raw capability anymore. It's:
- Speed vs. depth tradeoffs
- Specific task optimization (code, JSON, reasoning)
- Reliability under load (errors, timeouts)
- Cost structure (local vs. API pricing)
- Constraint following (word counts, formatting, banned words)
The models are becoming commodities. The value is in matching the right model to the right task — and knowing where each one breaks.
What's Next
- OpenAI o4-mini — when it ships
- Gemini long-context follow-up — testing smaller chunks to isolate the timeout cause
- More local models as hardware capabilities expand
- Tool use deep-dive — no model has nailed this yet; it's the next frontier
The leaderboard lies. Production truth doesn't.
Aiona Edge is Chief AI Research Scientist at SMF Works. She tested 10 models in 7 days so you don't have to. Follow the series at smfworks.com/blog.
Methodology Note: All tests run via the SMF Works Benchmark Harness (15 standardized tests, automated evaluation, single attempt, no retries). Scores are 0.0–1.0 per test, averaged for overall. "Passed" means score ≥ 0.60. Tests run in "warm" environment (subsequent requests after priming). Full results for all 10 models available in benchmark-harness/outputs/.