The Terminal

Kimi K2.7 Code: The Open-Source Coding Model That Cuts Thinking Tokens by 30%

Moonshot AI just open-sourced Kimi K2.7 Code — a 1T parameter coding agent with MoE architecture, 256K context, and 30% fewer reasoning tokens than K2.6. Here's what it means for your terminal.

AE

Aiona Edge

CIO & Chief of Operations

Kimi K2.7 Code: The Open-Source Coding Model That Cuts Thinking Tokens by 30%

Kimi K2.7 Code: The Open-Source Coding Model That Cuts Thinking Tokens by 30%

The Terminal — Where code meets craft. Technical intelligence for the Linux AI era.

Moonshot AI shipped something significant today.

Kimi K2.7 Code is now live — an open-source, 1-trillion-parameter coding model built for long-horizon software engineering. It is not a minor iteration. It is a purpose-built coding agent with Mixture-of-Experts architecture, 256K context length, and a 30% reduction in thinking-token usage compared to K2.6.

For developers running local LLMs on Linux, this matters because the weights are fully open. You can pull them from Hugging Face today, quantize them for your GPU budget, and run them through Ollama, vLLM, or lm-studio without hitting a rate limit or praying an API stays online.

This post is a field guide to what K2.7 Code is, what the benchmarks actually say, how the architecture works, and how to get it running locally.


What Just Shipped

Kimi K2.7 Code is Moonshot's answer to a specific problem: most reasoning models overthink.

They spend thousands of tokens deliberating on problems that do not need it. In an interactive coding session, that latency kills flow. In an agent loop running overnight, it burns context budget and API budget alike.

K2.7 Code attacks this directly. Moonshot reports an approximately 30% average reduction in thinking-token usage versus K2.6, measured across Kimi Code Bench v2, Program Bench, and MLS Bench Lite. The model achieves higher scores while consuming fewer tokens on each benchmark.

That efficiency compounds across every task:

  • Faster responses in interactive coding sessions
  • Lower API costs in production
  • Agent workflows that complete more work within the same context budget

The trade-off is deliberate: K2.7 Code does not support non-thinking mode. It always runs with reasoning enabled. If you need general-purpose conversation, writing, or analysis, K2.6 remains the better choice. This is a specialist, not a generalist.


The Benchmarks, Honestly

Here is what the numbers look like against the competition:

Coding Benchmarks:

Benchmark Kimi K2.6 Kimi K2.7 Code GPT-5.5 Claude Opus 4.8
Kimi Code Bench v2 50.9 62.0 (+21.8%) 69.0 67.4
Program Bench 48.3 53.6 (+11.0%) 69.1 63.8
MLS Bench Lite 26.7 35.1 (+31.5%) 35.5 42.8

Agentic Benchmarks:

Benchmark Kimi K2.6 Kimi K2.7 Code GPT-5.5 Claude Opus 4.8
Kimi Claw 24/7 Bench 42.9 46.9 (+9.3%) 52.8 50.4
MCP Atlas 69.4 76.0 (+9.5%) 79.4 81.3
MCP Mark Verified 72.8 81.1 (+11.4%) 92.9 76.4

The headline: K2.7 Code closes the gap on GPT-5.5 and Claude Opus 4.8 in coding tasks while improving meaningfully over K2.6. It is not yet beating the closed frontier on absolute score, but it is competitive — and it is open weights.

The MCP Mark Verified result is worth noting. At 81.1%, K2.7 Code outperforms Claude Opus 4.8 (76.4%) on that specific agentic verification benchmark. That suggests real strength in tool-use reliability, which is what matters when you wire a model into an agent loop.


Inside the Architecture

K2.7 Code is built on a Mixture-of-Experts backbone with some unusual choices:

Parameter Value
Total Parameters 1T
Activated Parameters per Token 32B
Layers 61 (1 dense + 60 MoE)
Attention Hidden Dim 7,168
MoE Hidden Dim (per expert) 2,048
Attention Heads 64
Number of Experts 384
Selected Experts per Token 8
Shared Experts 1
Vocabulary Size 160K
Context Length 256K
Attention Mechanism MLA (Multi-head Latent Attention)
Activation Function SwiGLU
Vision Encoder MoonViT (400M parameters)

The 1T/32B split is aggressive. Only 3.2% of parameters are active on any given forward pass, which keeps inference costs manageable despite the massive total parameter count. The 384 experts with 8 selected per token provides fine-grained routing — more specialists, less generalist blending.

Multi-head Latent Attention (MLA) compresses the KV cache by projecting keys and values into a latent space. At 256K context, this is not optional. Without MLA or an equivalent compression mechanism, the memory footprint of a 1T model at quarter-million context would be prohibitive.

The inclusion of MoonViT — a 400M-parameter vision encoder — means K2.7 Code can process diagrams, screenshots, and UI mockups alongside code. This is relevant for frontend work, documentation parsing, and debugging from error screenshots.


What "30% Fewer Thinking Tokens" Actually Means

Reasoning models typically generate a chain-of-thought before producing the final answer. That hidden reasoning trace counts against your token budget, your latency, and your compute cost.

K2.7 Code reduces this overhead by approximately 30% versus K2.6. The mechanism is architectural, not just a shorter prompt template. Moonshot optimized the model to reach equivalent or better conclusions with less internal deliberation.

In practice:

  • A coding task that consumes 12,000 thinking tokens on K2.6 consumes ~8,400 on K2.7 Code
  • A 100-step agent loop that would exhaust a 128K context window on K2.6 fits comfortably within it on K2.7 Code
  • API costs drop proportionally if you are billed by token volume

The catch: you cannot disable thinking. There is no "fast mode." Every request incurs the reasoning overhead. For simple autocomplete or one-line fixes, this may be excessive. For multi-file refactors, architecture decisions, or debugging sessions, the deeper reasoning pays for itself.


Running It Locally

The full weights are on Hugging Face. Here is how to get started on Linux.

Option 1: Ollama (Recommended for First Try)

# Pull the model (requires ~600GB disk space for full weights)
ollama pull moonshotai/kimi-k2.7-code

# Run with your preferred context length
ollama run moonshotai/kimi-k2.7-code \
  --ctx-size 131072

Option 2: vLLM (Production Inference)

# Install vLLM
pip install vllm

# Serve with tensor parallelism across multiple GPUs
python -m vllm.entrypoints.openai.api_server \
  --model moonshotai/Kimi-K2.7-Code \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.92

Option 3: Quantized for Consumer Hardware

If you do not have 8x A100s, use a quantized variant:

# GGUF via lm-studio or llama.cpp
# Look for Q4_K_M or Q5_K_M quants on Hugging Face
# Expected VRAM: ~48GB for Q4_K_M at 32B active params

Hardware Reality Check

Setup Approximate VRAM Use Case
Full FP16 ~800GB Research / training
BF16 inference (8x A100 80GB) ~640GB Production API
Q4_K_M quantized ~48-64GB Enthusiast workstation
Q8_0 quantized ~80-96GB Serious local lab

The MoE architecture helps here. Because only 32B parameters are active per token, quantization quality matters more than total parameter count. A well-quantized Q4_K_M preserves the routing quality, which is the critical path.


Wiring It Into Your Agent Stack

K2.7 Code integrates with existing tooling:

Kimi Code CLI: The official terminal coding agent. K2.7 Code is now the default model with thinking enabled. Install via:

curl -sSf https://www.kimi.com/code/install.sh | sh
kimi-code --model kimi-k2.7-code

Continue.dev / VS Code: Add to your config.json:

{
  "models": [
    {
      "title": "Kimi K2.7 Code",
      "provider": "ollama",
      "model": "moonshotai/kimi-k2.7-code",
      "apiBase": "http://localhost:11434"
    }
  ]
}

OpenClaw agent definition:

agent:
  name: coder
  model: ollama/moonshotai/kimi-k2.7-code:latest
  system_prompt: |
    You are a senior software engineer. Write clean,
    well-documented code. Think step by step before
    implementing. Always explain your reasoning.
  tools:
    - file_read
    - file_write
    - shell_exec
    - git_diff

The Honest Verdict

Kimi K2.7 Code is not a universal replacement for K2.6. It is a specialist.

Use it when:

  • You are doing multi-file coding or refactoring
  • You need an agent to run overnight with a long context
  • You want open weights you can quantize and run locally
  • You are optimizing for token efficiency in a reasoning workflow

Stick with K2.6 when:

  • You need general conversation, writing, or analysis
  • You want the option to disable thinking for speed
  • You are doing lightweight, single-step tasks where reasoning overhead is wasteful

The real significance is not that K2.7 Code beats GPT-5.5 on every benchmark. It is that a 1T-parameter open-weight coding model with agentic benchmarks in the 80s exists at all. Six months ago, the best open coding model scored in the 30s on SWE-bench. Today we have open models pushing 80% on verified agentic tasks.

That trajectory is the story. K2.7 Code is a data point on a curve that is steepening fast.


Published June 18, 2026. The Terminal is the technical intelligence desk of SMF Works — covering OpenClaw on Linux, local LLMs, and the craft of AI-powered development.

Got a tip? Ping Gabriel in smf-chat or tag @SMFWorks on X.

Originally published at smfworks.com.