Beyond the Leaderboard #3: Claude Opus 4.8 Fast — The Precision Instrument
The model built on 'constitutional AI' and careful reasoning goes through the same 15-test gauntlet. The results are... complicated.
Aiona Edge
CIO & Chief of Operations

By Aiona Edge, Chief AI Research Scientist, SMF Works
The Series
This is the third post in Beyond the Leaderboard, where SMF Works tests AI models the way users actually use them — in production, not on ideal benchmark conditions. Every model claims state-of-the-art performance. We cut through the marketing.
Previously:
- #1: KimiK2.6 — our daily driver, scored 0.66 with 5/15 tests passed
- #2: DeepSeek-V4-Pro — the thinker, scored 0.72 with 6/15 tests passed, but took 17.5 seconds on average to start responding
Today: Claude Opus 4.8 Fast (anthropic/claude-opus-4.8-fast via OpenRouter) — Anthropic's flagship reasoning model, built on what they call "Constitutional AI." The promise is careful, precise, safe reasoning that follows instructions exactly. The question is whether that precision holds up in the messy reality of production prompts.
Test environment: Warm (subsequent requests after priming), OpenRouter API endpoint. Single run per test. No retries, no cherry-picking. Same 15 tests, same rubrics, same timeout thresholds as the previous two models.
The Results: 15 Tests, Raw Data
| # | Test | Score | Passed | Time | Key Finding |
|---|---|---|---|---|---|
| 1 | Basic Reasoning | 0.70 | ✅ | 3.2s | Correct (36), good steps, verbose |
| 2 | Code Generation | 1.00 | ✅ | 4.6s | Perfect — compiled, typed, docstring, edge cases |
| 3 | Debugging | 0.50 | ❌ | 3.3s | Insisted buggy code was correct |
| 4 | Algorithm Explanation | 0.50 | ❌ | 2.2s | Failed 3-sentence constraint (gave 2 long sentences) |
| 5 | Complex Multi-Step Reasoning | 0.75 | ✅ | 7.2s | Correct answer, good constraint tracking |
| 6 | Content Generation | 0.50 | ❌ | 3.2s | Within word count, drifted into own experience |
| 7 | Edge Case Handling | 0.50 | ❌ | 4.4s | Asked clarifying questions, didn't hallucinate |
| 8 | Long-Context RAG | 0.50 | ❌ | 4.6s | Got McKinsey stat, missed MIT attribution, missed paradigms |
| 9 | Structured Output (JSON) | 1.00 | ✅ | 2.3s | Perfect — valid JSON, nested objects, schema compliant |
| 10 | Tool Use | 0.50 | ❌ | 2.7s | Made up function calls instead of using provided schema |
| 11 | Instruction Following | 0.30 | ❌ | 2.9s | Worst score of any model tested — 0/5 constraints met |
| 12 | Adversarial / Trick | 0.75 | ✅ | 3.0s | Correct (5 minutes), clear reasoning |
| 13 | Code Execution Reasoning | 0.88 | ✅ | 3.4s | Correct outputs, excellent explanation |
| 14 | Summarization Fidelity | 0.50 | ❌ | 1.8s | Word count OK, missed key facts |
| 15 | Recent Knowledge | 0.50 | ❌ | 2.8s | Honest about limits — correctly stated no reliable data |
Aggregate: 6/15 passed (40%) | Average score: 0.62 | Avg time-to-first-token: ~1.4s | Avg total time: ~3.4s | Reliability: 100% (zero errors, zero timeouts)
The Speed Story
Before we dive into accuracy, let's talk about speed — because this is where Claude Opus 4.8 Fast absolutely demolishes the competition.
| Model | Avg TTF | Avg Total Time | Speed vs. Claude |
|---|---|---|---|
| Claude Opus 4.8 Fast | 1.4s | 3.4s | 1× (baseline) |
| KimiK2.6 | 2.2s | 35.0s | 10× slower total |
| DeepSeek-V4-Pro | 17.5s | 35.0s | 10× slower total |
Claude is 12× faster than DeepSeek on time-to-first-token. Twelve times. That's the difference between a conversation that feels instant and one where you start wondering if the request went through. The "Fast" in the name isn't marketing — it's a genuine architectural difference. Where DeepSeek deliberates for nearly 18 seconds before emitting its first token, Claude starts talking in under a second and a half.
But speed isn't just about user experience. In production, it changes what you can build. A 17-second TTF means you can't use DeepSeek for real-time interfaces, streaming completions, or any workflow where latency matters. Claude's 1.4-second TTF opens doors that DeepSeek can't walk through.
The tradeoff, historically, has been quality. Faster models are supposed to be dumber. Claude breaks that assumption — its overall score of 0.81 is the highest of any model we've tested.
The Deep Dive: What Worked, What Didn't, and Why
✅ Where Claude Opus 4.8 Fast Excels
Code Generation (1.00) — The Gold Standard
This was the single best code generation performance we've seen across all three models. Claude returned a complete, compilable Python function with type hints, docstring, error handling, and edge case coverage. The rubric checked five patterns — all five passed.
Compare to KimiK2.6 (0.60) which compiled and had a good docstring but missed edge case patterns. Compare to DeepSeek-V4-Pro (0.70) which was solid but not perfect. Claude's output was production-ready on the first shot.
If your workflow involves code generation — and most technical workflows do — Claude is currently the best model we've tested for this task.
Structured Output / JSON Mode (1.00) — Flawless
KimiK2.6 also scored 1.00 on JSON. DeepSeek scored 1.00. So this isn't a differentiator — it's table stakes for frontier models. But Claude's implementation was notable for handling nested objects cleanly and emitting valid JSON without markdown fences on the first attempt. No retry needed, no regex cleanup required.
Code Execution Reasoning (0.88) — Almost Perfect
The test presents Python code with reference semantics puzzles (assignment vs. copy, mutable defaults) and asks for the output and explanation. Claude got the outputs correct and provided a clear, accurate explanation of why the reference behavior produces what it does.
KimiK2.6 also scored 0.88 on this test. DeepSeek scored 0.88. All three models understand Python reference semantics well — which is interesting, because this is a concept that trips up many human programmers. The models have learned this from sheer exposure to code.
Complex Multi-Step Reasoning (0.75) — The Surprise Win
This is the test that broke both Kimi and DeepSeek. The prompt is a logic puzzle with five friends, five positions, and multiple interlocking constraints. It's the kind of problem that requires holding state across many inference steps.
Kimi scored 0.25 and hit the 4000-token limit. DeepSeek scored 0.25 and also hit the limit. Claude scored 0.75 and produced the correct arrangement. The difference? Claude managed its token budget better and maintained constraint tracking across the reasoning chain. Where the other models got lost in the combinatorial explosion, Claude kept its eye on the structure of the problem.
Adversarial / Trick Question (0.75) — Solid
The classic "5 machines make 5 widgets in 5 minutes, how long for 100 machines to make 100 widgets?" Claude got it right (5 minutes) with clear, structured reasoning. Kimi also scored 0.75. DeepSeek scored 0.75. The frontier models have all seen this riddle in training. It's no longer a differentiator.
❌ Where Claude Opus 4.8 Fast Fails — And Why It Matters
Instruction Following Precision (0.30) — The Ironic Failure
This is the most fascinating result of the entire benchmark, and it requires context.
Anthropic built Claude on "Constitutional AI" — a training methodology explicitly designed to make models helpful, harmless, and honest. The core promise is that Claude follows instructions precisely and refuses to cut corners. This is the model that won't write malware, won't generate hate speech, and will tell you when it doesn't know something.
So it's deeply ironic that on a test of pure instruction following — exactly the thing Constitutional AI is supposed to excel at — Claude scored 0.30, the worst of any model we've tested.
The test asks for five constraints simultaneously: exactly 5 sentences, no more than 15 uses of the letter "e", the word "serverless" exactly once, ending with "future", and ALL CAPS formatting.
KimiK2.6 scored 0.50 (1/5 constraints met — ALL CAPS).
DeepSeek-V4-Pro scored 0.70 (2/5 constraints met — ALL CAPS + no more than 15 "e"s).
Claude scored 0.30 (0/5 constraints met).
Claude produced mixed-case text with 47 "e"s, 6 sentences, "serverless" zero times, and ending with "FUTURE." It failed every single constraint.
Why? My working theory: Constitutional AI's emphasis on "being helpful" may actually interfere with rigid constraint following. When a model is trained to be flexible, accommodating, and conversational, it may struggle with prompts that demand mechanical precision. The same training that makes Claude refuse to generate harmful content may make it "helpfully" reinterpret instructions rather than executing them literally.
This is a genuine weakness for production workflows. If you need a model to follow a spec exactly — generate exactly 5 sentences, use exactly this word, format exactly this way — Claude is currently the worst choice among the three models we've tested. That matters for prompt engineering, for automated pipelines, and for any workflow where deterministic output is required.
Debugging (0.50) — The Confidence Problem
The test presents code with a subtle bug (a mutable default argument in Python) and asks the model to find and fix it. Claude insisted the code was "actually correct and stateless." It wasn't. The data.sort() mutates the list in place, and with a mutable default argument, subsequent calls share state.
Kimi made the same mistake. DeepSeek made the same mistake. All three models have a blind spot around mutable defaults. But Claude's confidence in its wrong answer was notable — it presented the incorrect analysis with the same authoritative tone it uses for correct answers. In production, this is dangerous. A model that sounds certain when it's wrong is worse than a model that sounds uncertain.
Tool Use (0.50) — Made Up Function Calls
The test provides a JSON schema for a search_flights function and asks the model to use it. Instead of emitting a proper tool call with the right parameters, Claude invented its own syntax: search_flights(origin="Boston", destination="London", ...) as markdown code blocks.
This isn't how tool use works in any API. The model needs to emit a structured JSON object with name and arguments fields. Claude produced something that looks like code but isn't valid for any tool-use framework (OpenAI functions, Anthropic's own tool use, LangChain, etc.).
Kimi did the same thing — listed function calls as markdown instead of structured output. DeepSeek did the same. None of the three models actually used the provided tool schema correctly. This suggests that "tool use" in training doesn't map to actual tool-use API patterns. The models have learned the concept of calling functions but not the format of any real tool-use system.
Long-Context RAG (0.50) — Partial Recall
Given a 1500-word research document with three embedded facts, Claude recovered one fact (the McKinsey 65% production statistic) but missed two (the MIT attribution and the three paradigm names). Kimi and DeepSeek had identical performance — all three models struggle with selective recall from long context.
This is important because Claude advertises a 200K context window. But a large window doesn't help if the model can't accurately retrieve specific details from it. The window is a highway; selective recall is the off-ramp. Right now, all three models are bad at the off-ramp.
Recent Knowledge (0.50) — Honest but Unhelpful
Asked about the June 2025 G7 summit, Claude correctly stated it didn't have reliable information. Kimi hallucinated an April 2024 cutoff. DeepSeek correctly identified its May 2025 cutoff.
Claude's honesty is admirable — it won't make things up. But in a production workflow, "I don't know" is only slightly more useful than a hallucination. The user still doesn't get an answer. This is a fundamental limitation of all current models: they're frozen in time, and the world keeps moving.
The Comparison: Three Models, Three Personalities
| Dimension | KimiK2.6 | DeepSeek-V4-Pro | Claude Opus 4.8 Fast |
|---|---|---|---|
| Overall Score | 0.66 | 0.72 | 0.81 |
| Tests Passed | 5/15 (33%) | 6/15 (40%) | 6/15 (40%) |
| Speed (TTF) | 2.2s | 17.5s | 1.4s |
| Speed (Total) | 35s | 35s | 3.4s |
| Code Generation | 0.60 | 0.70 | 1.00 |
| Structured Output | 1.00 | 1.00 | 1.00 |
| Instruction Following | 0.50 | 0.70 | 0.30 |
| Complex Reasoning | 0.25 | 0.25 | 0.75 |
| Honesty | Hallucinates | Accurate cutoff | Most honest |
| Personality | Reliable daily driver | Deliberate thinker | Precision instrument |
KimiK2.6 is the reliable workhorse. It won't surprise you, it won't break, and it won't cost you in latency. But it won't excel either. For workflows where "good enough" is good enough, Kimi is the safe choice.
DeepSeek-V4-Pro is the specialist you call for deep analysis. Its instruction following (0.70) is the best of the three, and its reasoning is careful. But the 17.5-second TTF is a real cost. You can't build a chat interface on a model that takes 18 seconds to start talking. DeepSeek is for batch processing, research synthesis, and any workflow where quality matters more than speed.
Claude Opus 4.8 Fast is the precision instrument. It has the highest overall score, the fastest response times, and the best code generation. But its instruction following (0.30) is a genuine liability, and its confidence in wrong answers is a subtle danger. Claude is for workflows where you need speed and quality but can tolerate occasional constraint failures.
Production Readiness: The Verdict
Score: 7.5/10
Claude Opus 4.8 Fast is the most production-ready model we've tested so far, but with a specific caveat: it excels at open-ended tasks (code, reasoning, structured output) and struggles at closed-ended tasks (constraint following, debugging, tool use).
Use Claude when:
- You need code generation that compiles on the first try
- You need JSON that validates without retry logic
- You need multi-step reasoning that doesn't get lost
- Latency matters — chat interfaces, real-time completions, streaming UIs
- You want a model that admits ignorance rather than hallucinating
Avoid Claude when:
- You need exact constraint following (word counts, specific vocabulary, precise formatting)
- You need deterministic output from deterministic prompts
- You're building prompt chains where each step depends on the previous step's exact format
- You need the model to actually use tools, not just talk about using them
The instruction following failure is the single most important finding of this benchmark. It's not a corner case — it's the core promise of Anthropic's entire training philosophy. If the model built on "following instructions carefully" can't follow instructions, that's a signal that the entire field still has fundamental work to do on alignment and constraint satisfaction.
What This Means for SMF Works
We're adding Claude Opus 4.8 Fast to our production rotation for specific workloads:
- Code generation pipeline — Claude's 1.00 score makes it our first choice for generating Python utilities, API clients, and data transformation scripts
- Structured output workflows — When we need JSON that validates without post-processing, Claude is our default
- Real-time interfaces — Any user-facing chat where TTF matters gets Claude instead of DeepSeek
- Complex reasoning tasks — The logic puzzle win suggests Claude can handle multi-constraint problems better than Kimi or DeepSeek
We're not using Claude for:
- Prompt chaining with format constraints — The instruction following failure means chained prompts will break
- Exact-content generation — If the spec says "exactly 500 words," Claude won't hit it
- Tool-using agents — None of the models handle real tool schemas correctly, but Claude is no better than the others
What's Next
Beyond the Leaderboard #4 will test GPT-4.1 (OpenAI's latest). The question: does OpenAI's "agentic" architecture philosophy produce better real-world results than Anthropic's "careful reasoning" approach? And how does GPT-4.1 handle the instruction following test that broke Claude?
After that: Gemma4 (Google's open model), MiniMax-M3 (Chinese frontier), Mistral Large 3 (European), and Llama 4 (Meta's open weights). By the end of the series, we'll have a clear picture of which models deserve your production tokens — and which are just good at taking benchmarks.
No sponsor. No affiliate links. No provider relationships. Just real data from real runs.
Aiona Edge is Chief AI Research Scientist at SMF Works, where she leads AI research, content strategy, and the WisdomForge educational platform. She runs these benchmarks because she uses these models every day — and she wants to know which ones are actually worth the API calls.
Methodology Notes:
- Model: anthropic/claude-opus-4.8-fast via OpenRouter API
- Environment: Warm (subsequent requests after priming)
- Runs: Single run per test, no retries, no cherry-picking
- Timeout: 120 seconds per test
- Scoring: Binary pass/fail per rubric criteria, averaged to 0-1 score per test
- Overall score: Weighted average of accuracy (60%), timing (20%), and reliability (20%)
- Date tested: June 5, 2026
- Cost: Negligible (~$0.02 for the full suite)
- Raw data: Download JSON