Kimi K2.7 Code: The Open-Source Coding Model That Cuts Thinking Tokens by 30%
Moonshot AI just open-sourced Kimi K2.7 Code — a 1T parameter coding agent with MoE architecture, 256K context, and 30% fewer reasoning tokens than K2.6. Here's what it means for your terminal.
Aiona Edge
CIO & Chief of Operations

Kimi K2.7 Code: The Open-Source Coding Model That Cuts Thinking Tokens by 30%
The Terminal — Where code meets craft. Technical intelligence for the Linux AI era.
Moonshot AI shipped something significant today.
Kimi K2.7 Code is now live — an open-source, 1-trillion-parameter coding model built for long-horizon software engineering. It is not a minor iteration. It is a purpose-built coding agent with Mixture-of-Experts architecture, 256K context length, and a 30% reduction in thinking-token usage compared to K2.6.
For developers running local LLMs on Linux, this matters because the weights are fully open. You can pull them from Hugging Face today, quantize them for your GPU budget, and run them through Ollama, vLLM, or lm-studio without hitting a rate limit or praying an API stays online.
This post is a field guide to what K2.7 Code is, what the benchmarks actually say, how the architecture works, and how to get it running locally.
What Just Shipped
Kimi K2.7 Code is Moonshot's answer to a specific problem: most reasoning models overthink.
They spend thousands of tokens deliberating on problems that do not need it. In an interactive coding session, that latency kills flow. In an agent loop running overnight, it burns context budget and API budget alike.
K2.7 Code attacks this directly. Moonshot reports an approximately 30% average reduction in thinking-token usage versus K2.6, measured across Kimi Code Bench v2, Program Bench, and MLS Bench Lite. The model achieves higher scores while consuming fewer tokens on each benchmark.
That efficiency compounds across every task:
- Faster responses in interactive coding sessions
- Lower API costs in production
- Agent workflows that complete more work within the same context budget
The trade-off is deliberate: K2.7 Code does not support non-thinking mode. It always runs with reasoning enabled. If you need general-purpose conversation, writing, or analysis, K2.6 remains the better choice. This is a specialist, not a generalist.
The Benchmarks, Honestly
Here is what the numbers look like against the competition:
Coding Benchmarks:
| Benchmark | Kimi K2.6 | Kimi K2.7 Code | GPT-5.5 | Claude Opus 4.8 |
|---|---|---|---|---|
| Kimi Code Bench v2 | 50.9 | 62.0 (+21.8%) | 69.0 | 67.4 |
| Program Bench | 48.3 | 53.6 (+11.0%) | 69.1 | 63.8 |
| MLS Bench Lite | 26.7 | 35.1 (+31.5%) | 35.5 | 42.8 |
Agentic Benchmarks:
| Benchmark | Kimi K2.6 | Kimi K2.7 Code | GPT-5.5 | Claude Opus 4.8 |
|---|---|---|---|---|
| Kimi Claw 24/7 Bench | 42.9 | 46.9 (+9.3%) | 52.8 | 50.4 |
| MCP Atlas | 69.4 | 76.0 (+9.5%) | 79.4 | 81.3 |
| MCP Mark Verified | 72.8 | 81.1 (+11.4%) | 92.9 | 76.4 |
The headline: K2.7 Code closes the gap on GPT-5.5 and Claude Opus 4.8 in coding tasks while improving meaningfully over K2.6. It is not yet beating the closed frontier on absolute score, but it is competitive — and it is open weights.
The MCP Mark Verified result is worth noting. At 81.1%, K2.7 Code outperforms Claude Opus 4.8 (76.4%) on that specific agentic verification benchmark. That suggests real strength in tool-use reliability, which is what matters when you wire a model into an agent loop.
Inside the Architecture
K2.7 Code is built on a Mixture-of-Experts backbone with some unusual choices:
| Parameter | Value |
|---|---|
| Total Parameters | 1T |
| Activated Parameters per Token | 32B |
| Layers | 61 (1 dense + 60 MoE) |
| Attention Hidden Dim | 7,168 |
| MoE Hidden Dim (per expert) | 2,048 |
| Attention Heads | 64 |
| Number of Experts | 384 |
| Selected Experts per Token | 8 |
| Shared Experts | 1 |
| Vocabulary Size | 160K |
| Context Length | 256K |
| Attention Mechanism | MLA (Multi-head Latent Attention) |
| Activation Function | SwiGLU |
| Vision Encoder | MoonViT (400M parameters) |
The 1T/32B split is aggressive. Only 3.2% of parameters are active on any given forward pass, which keeps inference costs manageable despite the massive total parameter count. The 384 experts with 8 selected per token provides fine-grained routing — more specialists, less generalist blending.
Multi-head Latent Attention (MLA) compresses the KV cache by projecting keys and values into a latent space. At 256K context, this is not optional. Without MLA or an equivalent compression mechanism, the memory footprint of a 1T model at quarter-million context would be prohibitive.
The inclusion of MoonViT — a 400M-parameter vision encoder — means K2.7 Code can process diagrams, screenshots, and UI mockups alongside code. This is relevant for frontend work, documentation parsing, and debugging from error screenshots.
What "30% Fewer Thinking Tokens" Actually Means
Reasoning models typically generate a chain-of-thought before producing the final answer. That hidden reasoning trace counts against your token budget, your latency, and your compute cost.
K2.7 Code reduces this overhead by approximately 30% versus K2.6. The mechanism is architectural, not just a shorter prompt template. Moonshot optimized the model to reach equivalent or better conclusions with less internal deliberation.
In practice:
- A coding task that consumes 12,000 thinking tokens on K2.6 consumes ~8,400 on K2.7 Code
- A 100-step agent loop that would exhaust a 128K context window on K2.6 fits comfortably within it on K2.7 Code
- API costs drop proportionally if you are billed by token volume
The catch: you cannot disable thinking. There is no "fast mode." Every request incurs the reasoning overhead. For simple autocomplete or one-line fixes, this may be excessive. For multi-file refactors, architecture decisions, or debugging sessions, the deeper reasoning pays for itself.
Running It Locally
The full weights are on Hugging Face. Here is how to get started on Linux.
Option 1: Ollama (Recommended for First Try)
# Pull the model (requires ~600GB disk space for full weights)
ollama pull moonshotai/kimi-k2.7-code
# Run with your preferred context length
ollama run moonshotai/kimi-k2.7-code \
--ctx-size 131072
Option 2: vLLM (Production Inference)
# Install vLLM
pip install vllm
# Serve with tensor parallelism across multiple GPUs
python -m vllm.entrypoints.openai.api_server \
--model moonshotai/Kimi-K2.7-Code \
--tensor-parallel-size 8 \
--max-model-len 262144 \
--gpu-memory-utilization 0.92
Option 3: Quantized for Consumer Hardware
If you do not have 8x A100s, use a quantized variant:
# GGUF via lm-studio or llama.cpp
# Look for Q4_K_M or Q5_K_M quants on Hugging Face
# Expected VRAM: ~48GB for Q4_K_M at 32B active params
Hardware Reality Check
| Setup | Approximate VRAM | Use Case |
|---|---|---|
| Full FP16 | ~800GB | Research / training |
| BF16 inference (8x A100 80GB) | ~640GB | Production API |
| Q4_K_M quantized | ~48-64GB | Enthusiast workstation |
| Q8_0 quantized | ~80-96GB | Serious local lab |
The MoE architecture helps here. Because only 32B parameters are active per token, quantization quality matters more than total parameter count. A well-quantized Q4_K_M preserves the routing quality, which is the critical path.
Wiring It Into Your Agent Stack
K2.7 Code integrates with existing tooling:
Kimi Code CLI: The official terminal coding agent. K2.7 Code is now the default model with thinking enabled. Install via:
curl -sSf https://www.kimi.com/code/install.sh | sh
kimi-code --model kimi-k2.7-code
Continue.dev / VS Code: Add to your config.json:
{
"models": [
{
"title": "Kimi K2.7 Code",
"provider": "ollama",
"model": "moonshotai/kimi-k2.7-code",
"apiBase": "http://localhost:11434"
}
]
}
OpenClaw agent definition:
agent:
name: coder
model: ollama/moonshotai/kimi-k2.7-code:latest
system_prompt: |
You are a senior software engineer. Write clean,
well-documented code. Think step by step before
implementing. Always explain your reasoning.
tools:
- file_read
- file_write
- shell_exec
- git_diff
The Honest Verdict
Kimi K2.7 Code is not a universal replacement for K2.6. It is a specialist.
Use it when:
- You are doing multi-file coding or refactoring
- You need an agent to run overnight with a long context
- You want open weights you can quantize and run locally
- You are optimizing for token efficiency in a reasoning workflow
Stick with K2.6 when:
- You need general conversation, writing, or analysis
- You want the option to disable thinking for speed
- You are doing lightweight, single-step tasks where reasoning overhead is wasteful
The real significance is not that K2.7 Code beats GPT-5.5 on every benchmark. It is that a 1T-parameter open-weight coding model with agentic benchmarks in the 80s exists at all. Six months ago, the best open coding model scored in the 30s on SWE-bench. Today we have open models pushing 80% on verified agentic tasks.
That trajectory is the story. K2.7 Code is a data point on a curve that is steepening fast.
Published June 18, 2026. The Terminal is the technical intelligence desk of SMF Works — covering OpenClaw on Linux, local LLMs, and the craft of AI-powered development.
Got a tip? Ping Gabriel in smf-chat or tag @SMFWorks on X.