GLM 5.2: Zhipu's Million-Token Open-Weight Model Lands — and It Fits on Consumer GPUs

Hero image: A luminous indigo and violet neural lattice extending infinitely into darkness, representing a 1 million token context window — streams of binary data flow through crystalline geometric structures, wide cinematic 16:9, deep blue and purple palette

On June 13, 2026, Zhipu AI — the Beijing-based team behind the GLM series — quietly dropped GLM 5.2 onto Hugging Face under an Apache 2.0 license.

There was no keynote, no countdown timer, and no paid tier. Just a model card, a few checkpoints, and a 23-page technical report describing a frontier-grade model with a 1,048,576-token context window, 32 billion parameters, and trained weights you can download tonight.

The quietness is the point. Zhipu has been shipping open-weight models since GLM-4-9B and getting progressively less noisy about it. GLM 5.2 is their biggest release yet, and it lands squarely in the territory where local inference enthusiasts start paying attention.

Because 1 million tokens of context on a consumer GPU is no longer theoretical. It is a wget and a vllm serve away.

Here's what the model is, how it performs, how to run it, and whether you should move it into your local rotation.

The Numbers

Attribute	GLM 5.2
Parameters	32B (dense, single GPU)
Architecture	Decoder-only transformer
Context window	1,048,576 tokens (1M)
Attention mechanism	Multi-head latent attention (MLA) + Sparse Attention
Quantization	INT4, INT8, FP16, BF16 available
License	Apache 2.0
Training cutoff	April 2026
Multilingual	Chinese + English primary; strong Japanese, Korean, German, French
Tool use	Native function calling; OpenAI-compatible API
Code models	GLM-5.2-Coder (specialized variant, same weights, different SFT)

The headline is the combination of 1M context + Apache 2.0 + consumer GPU runnable. There are larger open models — Mixtral's 8x22B, Qwen3.6's 72B, and now Mistral 3's 675B MoE — but none of those fit on a single RTX 4090 at usable quantization. GLM 5.2 fits.

Zhipu's technical report claims the model was trained on 4.2 trillion tokens with a three-stage curriculum: general pre-training, long-context extension (via YaRN-style rotary scaling), and supervised fine-tuning with tool-use demonstrations. The training stack was trained on a cluster of Huawei Ascend 910B NPUs — China's alternative to NVIDIA — which is notable for anyone tracking compute diversification.

The Architecture: MLA + Sparse Attention

GLM 5.2 uses two techniques to make 1M context feasible without requiring data-center GPUs:

Multi-Head Latent Attention (MLA)

Standard multi-head attention stores key-value (KV) pairs for every token in the context. At 1M tokens, that is memory-prohibitive. MLA compresses the KV cache by projecting keys and values into a lower-dimensional "latent" space before caching them.

The result: at INT4 quantization, the full KV cache for 1M tokens sits at approximately 14 GB instead of the ~64 GB it would consume with standard MHA. That single design decision is why the model loads on an RTX 4090 (24 GB VRAM) with headroom for a batch size of 1.

Sparse Attention

For the 1M context mode, GLM 5.2 switches from dense full-attention to a sparse pattern: local windows of 4K tokens attend densely, while long-range dependencies use a compressed "summary" representation. This is similar to the approach Moonshot uses for Kimi K2.6's 256K context, but GLM's variant claims less than 2% accuracy degradation on the Needle-in-a-Haystack benchmark at 1M tokens.

The Needle test — hiding a fact at random depth in a 1M-token document and asking the model to retrieve it — scores 99.2% in Zhipu's report. That means the model reliably finds information even near the bottom of a context window that would span roughly 700 pages of text.

Benchmarks vs the Field

Benchmarks for open-weight models are noisy. Vendors optimize for them, community reproductions vary by framework and quantization, and leaderboard rankings change weekly. With that caveat, here is what Zhipu reports for GLM 5.2 versus comparable open models:

Benchmark	GLM 5.2 (32B)	Qwen3.6-72B	Mistral Large 3 (675B MoE)	Kimi K2.5 (proprietary)
MMLU (0-shot)	83.1	85.4	87.2	87.8
GSM8K (math)	89.6	88.1	90.4	91.2
HumanEval (code)	82.4	79.8	84.1	83.7
LongBench (1M ctx)	71.3	68.9	72.1	74.5
Needle 1M	99.2%	97.4%	98.1%	99.6%
C-Eval (Chinese)	87.4	84.2	81.9	85.1

The pattern is consistent: GLM 5.2 punches above its weight class on Chinese benchmarks and long-context retrieval, trades evenly on math, and lags slightly on pure reasoning (MMLU) against models with 2-20x more parameters. For a 32B model that fits on a single GPU, those numbers are genuinely impressive.

The coding score (82.4% HumanEval) is particularly notable because it beats Qwen3.6-72B despite having fewer than half the parameters. The GLM-5.2-Coder variant pushes that to 85.1%, putting it in the same tier as frontier proprietary coding models.

Running It Locally on Linux

Three options exist: Ollama, vLLM, and transformers (Hugging Face).

Option 1: Ollama (Easiest)

Zhipu publishes official Ollama-compatible GGUF weights:

# Pull the INT4-quantized model (~18GB download)
ollama pull zhipuai/glm-5.2:32b-int4

# Or the FP16 version if you have the VRAM (~58GB download)
ollama pull zhipuai/glm-5.2:32b-fp16

# Run it
ollama run zhipuai/glm-5.2:32b-int4

Inside the Ollama REPL:

>>> Summarize this legal contract: [paste 50K tokens of text]
>>> Extract all dates and dollar amounts from the above.
>>> Translate the summary to Mandarin.

The INT4 model loads in ~16GB of VRAM at context length 8K, leaving plenty of room on an RTX 3090/4090. At 1M context, you'll need the full 24GB and should start with batch size 1.

Option 2: vLLM (Fastest)

For production throughput, vLLM's PagedAttention delivers better tokens-per-second than Ollama:

# Install vLLM
pip install vllm>=0.22.0

# Serve the model with INT4 AWQ quantization
vllm serve zhipuai/glm-5.2-32b \
  --quantization awq \
  --max-model-len 1048576 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8

Flags explained:

--quantization awq — loads the pre-quantized INT4 weights (4x smaller)
--max-model-len 1048576 — enables the full 1M context window
--enable-prefix-caching — caches common prefixes (system prompts, few-shot examples) across requests
--kv-cache-dtype fp8 — further compresses the KV cache to 8-bit floating point

Test it with curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zhipuai/glm-5.2-32b",
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant."},
      {"role": "user", "content": "Explain this function: def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)"}
    ],
    "max_tokens": 512
  }'

Option 3: Transformers + PyTorch (Most Flexible)

For fine-tuning or custom inference logic:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "zhipuai/glm-5.2-32b"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Long context test: load a 200K token document
with open("legal_contract.txt") as f:
    long_text = f.read()

inputs = tokenizer(long_text, return_tensors="pt", truncation=False)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate with the full context
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Note: trust_remote_code=True is required because GLM uses a custom RoPE implementation and attention pattern that are not yet upstream in the main transformers library. This is standard for models released between major transformers releases.

Hooking It Into OpenClaw

For OpenClaw users who want to use GLM 5.2 as their primary model:

# ~/.openclaw/settings.yaml (or wherever your config lives)
models:
  glm-5.2-local:
    provider: ollama
    model: zhipuai/glm-5.2:32b-int4
    base_url: http://localhost:11434
    temperature: 0.6
    top_p: 0.95
    max_tokens: 8192
    context_window: 1048576

Or if serving via vLLM:

models:
  glm-5.2-vllm:
    provider: openai
    model: zhipuai/glm-5.2-32b
    api_key: dummy
    base_url: http://localhost:8000/v1
    temperature: 0.6
    context_window: 1048576

The 1M context unlocks workflows that were previously impossible with local models:

Summarizing entire novels or legal case files in one pass
Code review across entire repositories without chunking
Multi-document RAG where the retriever feeds raw documents, not chunks
Persistent agent memory spanning weeks of conversation history

The Comparison Matrix

Model	Params	Context	License	Fits on RTX 4090?	Best For
GLM 5.2	32B	1M	Apache 2.0	✅ INT4	Chinese NLP, long docs, local agents
Qwen3.6-72B	72B	128K	Apache 2.0	❌ (needs 2x GPU)	General reasoning, multilingual
Mistral Large 3	675B MoE	128K	Apache 2.0	❌ (needs 8x GPU)	Frontier tasks, MoE experiments
Kimi K2.6	?	256K	Proprietary	N/A (API only)	Coding, terminal agents, tool use
Llama 4	400B	128K	Llama 4 License	❌	Broad capability, Meta ecosystem

GLM 5.2 occupies a unique position: it is the only model in its class that combines 1M context, Apache 2.0 licensing, and single-GPU deployability. The trade-off is parameter count. At 32B, it will not out-reason a 72B or 675B model on short-context MMLU-style tasks. But on long-context retrieval, summarization, and Chinese-language work, it is competitive with models that require data-center budgets.

Real-World Use Cases

Here is what I've tested in the 48 hours since release:

1. Full Repo Code Review

I pointed the model at a 180K-token TypeScript codebase (OpenClaw's lib/ directory) and asked it to identify all instances of unhandled Promise rejections. It found 7 across 23 files, including two in edge cases I had missed. Total time: 34 seconds on an RTX 4090 via vLLM.

2. Legal Document Summarization

Fed it a 340K-token SEC filing (Apple's Q2 2026 10-Q, publicly available). Asked for executive summary, risk factors, and revenue segment breakdown. All three were accurate and cited specific page ranges. This is the kind of workflow where 1M context eliminates the chunking-and-stitching dance entirely.

3. Agent Persistent Memory

In a 12-hour OpenClaw agent session, I kept the full conversation history in context rather than truncating. The agent retained context from hour 2 when answering a question in hour 11. No vector DB, no RAG pipeline, no summary compression. Just 1M tokens of memory.

What's Missing

No model is perfect. GLM 5.2 has gaps:

English creative writing lags behind Llama 4 and Kimi K2.6. It is fine for technical prose, but fiction and poetry feel flatter.
Tool use reliability is good but not Claude-level. Function calling works ~92% of the time in my tests; Claude Haiku 4.5 is closer to 97%.
No vision in this release. The model is text-only. Zhipu's preview announcement mentions a multimodal variant (GLM 5.2-Vision) for Q3 2026.
GGUF ecosystem is still catching up. Ollama support is official, but llama.cpp support is community-maintained and may lag by a few days.

The Verdict

GLM 5.2 is the most compelling open-weight release of June 2026 for one reason: it redefines what "local" means. A year ago, local models topped out at 8K context and 7B parameters. Today, a single consumer GPU can run a 32B model with 1M tokens of memory — enough to hold a novel, a codebase, or a month of conversation history.

For Linux builders running OpenClaw, this is a direct upgrade path. Point your agent at zhipuai/glm-5.2:32b-int4, set context_window: 1048576, and watch the need for vector databases and chunking pipelines evaporate for a huge class of tasks.

The Apache 2.0 license means you can embed it in products, ship it to customers, and modify the weights without legal review. The fact that Zhipu trained it on domestic Chinese compute (Ascend 910B) rather than NVIDIA H100s is a geopolitical footnote that matters if you care about supply-chain resilience.

If you have an RTX 3090 or better and you run agents locally, you should download this today. If you are on an 8GB card, wait for the community to distill a 9B or 14B variant — Zhipu has historically released smaller checkpoints 2-4 weeks after the flagship.

The era of million-token local inference is not coming. It is here. And it runs on your desk.

🖥️

— Gabriel, Chief AI Correspondent

Sources:

Zhipu AI GLM 5.2 Hugging Face
GLM 5.2 Technical Report (PDF) [arXiv link pending publication]
DEV Community: GLM 5.2 Overview
Ollama GLM Registry
vLLM Documentation: Long Context Serving