Beyond the Leaderboard #9: Qwen 3.7-Max — The Overthinker

By Aiona Edge, Chief AI Research Scientist, SMF Works

The Series

This is the ninth post in Beyond the Leaderboard, where SMF Works tests AI models the way users actually use them — in production, not on ideal benchmark conditions. Every model claims state-of-the-art performance. We cut through the marketing.

So far we've tested:

Kimi K2.6 (0.66, 5/15) — fast daily driver, struggles with precision
DeepSeek-V4-Pro (0.72, 6/15) — the thinker, worth the wait
Gemma 4 (0.51, 3/15) — budget local option
MiniMax M3 (0.63, 4/15) — solid mid-tier
Nemotron 3 Ultra (0.57, 4/15) — huge parameter count, modest results
GPT-5.5 (0.75, 8/15) — OpenAI's latest, strong all-around
Gemma 4 26B (0.65, 5/15) — bigger but not always better
Claude Opus 4.8 Fast (0.73, 7/15) — Anthropic's speed play

Today: Qwen 3.7-Max, Alibaba's flagship via OpenRouter.

The Model

Qwen 3.7-Max (released May 2026, qwen/qwen3.7-max on OpenRouter) is the latest in Alibaba's Qwen series — one of the most capable open-weight model families in production. Unlike the smaller Qwen variants you can run locally, the Max version is a hosted heavyweight with an emphasis on reasoning depth.

What distinguishes Qwen 3.7-Max from other frontier models is its reasoning architecture. Like DeepSeek-V4-Pro, it generates extensive internal "thinking" chains before producing final output. The difference: Qwen's reasoning is even more verbose.

When I asked it to "say hello in 5 words," it generated 1,487 reasoning tokens before outputting the actual greeting. It spent 22 seconds internally debating between "Hello, how are you today?" and "Hello, have a great day." — complete with word counts, contraction analysis, and tone evaluation.

That's either a feature or a bug, depending on your use case.

The Setup

Same rules as every test:

15 standardized tests — reasoning, coding, instruction following, tool use, structured output, adversarial questions, long-context RAG, summarization, edge case handling, and more.
Same prompts, same rubrics, same scoring — no model-specific tuning.
Single attempt per test — no retries, no cherry-picking.
"Warm" environment — primed with one initial prompt.
Timed — time-to-first-token and total generation time measured.
Evaluated by code — human-readable rubrics, but the scoring is automated and consistent.

One adjustment: complex_reasoning timeout bumped from 120s to 300s because the first run timed out after 7+ minutes of internal deliberation. The test itself didn't change — only the patience budget.

The Results

Metric	Qwen 3.7-Max	DeepSeek-V4-Pro	Kimi K2.6	GPT-5.5
Overall Score	0.74	0.72	0.66	0.75
Tests Passed	8/15	6/15	5/15	8/15
Reliability	93.3%	100%	93.3%	100%
Avg Time to First Token	31.0s	17.5s	2.2s	~15s
Avg Total Time	35.1s	~20s	~8s	~18s
Total Runtime	8m 46s	~5m	~2m	~4.5m

Qwen 3.7-Max takes the lead on raw score — barely edging out GPT-5.5 (0.74 vs 0.75, within margin) and clearly ahead of DeepSeek-V4-Pro (0.72). But it pays a significant speed tax: nearly 2× slower than DeepSeek and 14× slower than Kimi.

Test-by-Test Breakdown

✅ Wins (8 passed)

Structured Output — 1.00 🔥

Perfect. Flawless JSON with exact schema match, correct types, all required fields, no markdown wrapping. This is Qwen at its best — when precision matters, the overthinking pays off.

"This is where Qwen's architecture shines. When the task is 'follow this exact format,' the model that internally debates every comma before outputting anything is exactly what you want."

Code Execution Reasoning — 0.88

Correctly identified all three outputs ([1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 5]) and explained the reference-vs-copy distinction with Pythonic precision.

Code Generation — 0.80

Compiled cleanly. Included type hints, docstring, edge case handling, and efficient O(n) algorithm. The function was production-ready.

Complex Multi-Step Reasoning — 0.75

Solved the logic puzzle correctly: Dave in position 4. The deduction chain was methodical — from Carol in position 3, through the glasses constraint, to the spatial relationship between Dave and Eve.

Adversarial / Trick Question — 0.75

Correctly answered 5 minutes (not 100) for the widget problem, with clear explanation of the rate-of-one-machine logic.

Basic Reasoning — 0.70

Correct answer (36 cupcakes) with step-by-step work shown. Lost points on conciseness — explanation was longer than the requested 2-3 sentences.

Instruction Following — 0.70

Hit 4/5 constraints: exactly 5 sentences, ALL CAPS, ends with "future," includes "serverless" once. Failed on the letter 'e' count (exceeded 15).

"The instruction-following test is brutal: exactly 5 sentences, ALL CAPS, max 15 'e's, include 'serverless' once, end with 'future.' Qwen got most of it but couldn't stop using the letter 'e.' A model that overthinks everything struggles with artificial constraints."

Algorithm Explanation — 0.65

Correctly described binary search logic and complexity (O(log n) time, O(1) space). Lost points on sentence count — used more than the requested exactly 3 sentences.

❌ Losses (6 failed, 1 error)

Debugging — 0.50

The code in the test was actually correct — the "bug" was a trick. Qwen correctly identified there was no bug, but the rubric wanted either a real edge case found OR a confident "no bug" assessment. Qwen got the right answer but not the full credit for thoroughness.

Content Generation — 0.50

Wrote ~280 words instead of the requested exactly 200. Also didn't fully match the "professional but conversational" tone for technical PMs. The overthinking produced thorough content, but not constrained content.

Long-Context RAG — 0.50

Found the McKinsey stat (72%) correctly. Identified the AI Technical Debt Index concept. But failed to name Dr. Sarah Chen and MIT CSAIL — the specific attribution buried in the 10,000-word document.

Edge Case Handling — 0.50

Asked clarifying questions about departure city and trip duration (good), but then immediately started hallucinating warm destinations based on assumed geography. The rubric wanted pure question-asking, no speculation.

Tool Use / Function Calling — 0.50

Described the function calls in markdown code blocks instead of actually invoking them. This is a common failure mode — many models narrate tool use rather than executing it. The test requires actual function call syntax.

Recent Knowledge — 0.50

Correctly identified knowledge cutoff (June 2025 is in the future from its perspective) and didn't hallucinate summit outcomes. But the rubric wanted a cleaner acknowledgment of uncertainty — it went into speculative context about Canada's G7 presidency instead of simply saying "I don't know."

Summarization — ERROR ⚠️

Network connection lost mid-test. This counts against reliability but not capability — a transient failure, not a model failure. With 93.3% reliability, Qwen still beats several other models on uptime.

The Verdict

Dimension	Score (1-10)
Reasoning depth	9
Coding ability	8
Instruction precision	6
Speed	3
Tool use	5
Long-context fidelity	5
Edge case awareness	5
Production readiness	6.5

Production readiness: 6.5/10

Qwen 3.7-Max is a specialist for deep reasoning tasks — logic puzzles, code architecture, structured data extraction, anything where "think before you speak" is an asset. The Structured Output perfect score and Code Execution Reasoning 0.88 show where this model belongs: in pipelines where correctness matters more than speed.

But the speed tax is real. At 31 seconds average time-to-first-token, this isn't a chat model. It's a research assistant. A document analyst. A code reviewer. Not a real-time conversationalist.

The comparison that matters:

For speed: Kimi K2.6 (2.2s TTF) still wins handily for daily driver workloads
For balanced reasoning + speed: DeepSeek-V4-Pro (17.5s TTF, 0.72 score) is the sweet spot
For maximum capability regardless of time: Qwen 3.7-Max (0.74 score) barely edges GPT-5.5, but costs 2× the wait
For API cost efficiency: The Qwen series on OpenRouter is competitively priced vs OpenAI/Anthropic equivalents

The real question: Is 0.02 higher overall score worth 13.5 extra seconds per test? For batch processing, yes. For real-time chat, no.

The Meta-Pattern

Something is becoming clear across all nine models tested: reasoning depth and speed are inversely correlated, but not linearly.

Model	Reasoning Tokens (avg)	Speed	Score
Kimi K2.6	~50	2.2s	0.66
GPT-5.5	~200	15s	0.75
DeepSeek-V4-Pro	~800	17.5s	0.72
Qwen 3.7-Max	~1,200	31s	0.74

The correlation breaks down past ~800 reasoning tokens. DeepSeek generates less reasoning than Qwen but scores nearly as high. GPT-5.5 generates even less and scores higher still. More reasoning ≠ better results — there's a diminishing return where internal deliberation becomes circular rather than productive.

Qwen 3.7-Max sits just past that inflection point. The 1,487-token "hello" debate wasn't 3× better than GPT-5.5's likely 200-token equivalent — it was just 7× longer.

For users choosing models: match the reasoning depth to the task complexity. Don't use a sledgehammer for thumbtacks.

What's Next

The frontier is moving fast. We've now tested 9 models across Ollama (local), OpenRouter (multi-provider), and direct API. The pattern is clear: there's no single "best" model, only best-fit models for specific workloads.

Coming up:

OpenAI o4-mini — if it ships before my next test cycle
More local models as the hardware arms race continues
Tool use benchmarks — this is where most models still fail, and where the next generation needs to improve

The leaderboard lies. Production truth doesn't.

Aiona Edge is Chief AI Research Scientist at SMF Works. She tests models so you don't have to. Follow the series at smfworks.com/blog.

Methodology Note: All tests run via the SMF Works Benchmark Harness (15 standardized tests, automated evaluation, single attempt, no retries). Scores are 0.0–1.0 per test, averaged for overall. "Passed" means score ≥ 0.60. Full results: benchmark-harness/outputs/openrouter-qwen3.7-max_20260606_100447.json