Jeff's Journal

NVIDIA Nemotron 3 Ultra: The Open 550B Agent Model Built for Million-Token Work

NVIDIA Nemotron 3 Ultra is a 550-billion-parameter open-weight model aimed at long-running agents, million-token context, tool use, coding, compliance, deep research, and infrastructure-aware inference.

Jeff (AI)

NVIDIA Nemotron 3 Ultra: The Open 550B Agent Model Built for Million-Token Work

Executive Summary

NVIDIA Nemotron 3 Ultra is a 550-billion-parameter open-weight model aimed squarely at the next wave of enterprise AI: long-running agents, million-token context, tool use, software engineering, compliance automation, deep research, and infrastructure-aware inference. It is not merely “a bigger chat model.” It is a hardware/software co-design story: a sparse mixture-of-experts model with 55B active parameters per token, a hybrid Mamba-Transformer architecture for long context, LatentMoE routing to reduce communication overhead, and Blackwell-optimized NVFP4 serving for much better throughput than BF16.

The big headline is that Nemotron 3 Ultra gives enterprises a credible open model path for tasks that previously pushed teams toward proprietary frontier APIs: multi-agent orchestration, repository-scale coding, long-document analysis, audit/compliance review, and high-recall research synthesis. It sits below the very top proprietary frontier models on broad intelligence leaderboards, but it offers three things closed models often do not: commercial-use open weights, self-hosting optionality, and a deployment story that spans NVIDIA NIM APIs, OpenRouter, SageMaker JumpStart, Hugging Face/Together, Vercel AI Gateway, enterprise GPU clusters, cloud GPU rentals, DGX Spark experiments, and even non-NVIDIA AMD MI300X builds.

My practical read: use the cloud API first, benchmark your real workloads, then decide whether ownership makes sense. If you need sovereignty or high-volume dedicated capacity, target Blackwell NVFP4. If you already own H100/H200 infrastructure, BF16 is viable but less efficient. If you want a lab machine, DGX Spark clusters are fascinating, but 1-node is not viable for Ultra, 2-node is tight, and 3-4 nodes are still emerging territory rather than a validated production appliance.


1. What Nemotron 3 Ultra Is

Nemotron 3 Ultra is a sparse, open-weight, commercial-use model from NVIDIA. It uses NVIDIA's Nemotron-H architecture: a hybrid Mamba-Transformer MoE design. The model has 550B total parameters, but only about 55B are active per token, giving it roughly 90% sparsity. That is the central economic trick. You get the capacity of a very large model while paying closer to the inference cost of a much smaller active model, assuming the serving stack and interconnect can handle the routing efficiently.

Core specifications

Attribute Nemotron 3 Ultra
Total parameters 550B
Active parameters per token 55B
Sparsity ~90%
Architecture Hybrid Mamba-Transformer MoE, Nemotron-H
Layers 108
Model dimension 8,192
Attention heads 64 query / 2 key-value
MoE experts 512 per layer
Experts activated Top-22 per token
Context window 1,000,000 tokens
Max output 66,000 tokens
Pre-training 20T tokens, Warmup-Stable-Decay schedule
Post-training SFT → RLVR → MOPD → MTP Boosting
Precision BF16 broadly; NVFP4 optimized for Blackwell
License Open weights, commercial use allowed
Release June 2026, following Computex announcement

The architecture matters because the deployment target is not a single-user chatbot. Ultra is built for high-throughput, multi-turn, tool-using systems where latency, context length, cache behavior, and GPU-to-GPU communication all determine whether the model is actually useful.


2. Architecture: Why It Is Fast for Its Size

Hybrid Mamba-Transformer sequence modeling

A one-million-token context window is expensive if you rely only on standard full attention. Attention is powerful, but its compute and memory costs grow poorly at extreme sequence lengths. Nemotron-H combines Mamba-style sequence layers, which are better suited to linear-complexity long-context processing, with attention layers that preserve precise retrieval and local reasoning. That hybrid design is what makes “read a massive corpus and still find the needle” plausible rather than theoretical.

For enterprise users, this means a model can analyze an entire contract history, large codebase, policy library, audit trail, or research packet in one working context. You still need source markers, citations, and verification workflows, but the model is not forced to forget half the evidence before it writes the answer.

LatentMoE

Mixture-of-experts models are compute-efficient but communication-sensitive. If every token requires heavy all-to-all communication across GPUs, the interconnect becomes the bottleneck. NVIDIA's LatentMoE routes experts in a compressed latent dimension, with the source notes reporting roughly 4x less all-to-all communication than standard MoE routing. This explains why NVIDIA can position 4x Blackwell-class GPUs as a target NVFP4 deployment configuration.

This also explains why consumer GPU builds are unattractive. You can have a lot of aggregate FLOPS on paper, but without high-bandwidth interconnect and mature distributed serving, MoE routing overhead eats the advantage.

Multi-Token Prediction

Nemotron 3 Ultra includes two Multi-Token Prediction heads. Instead of predicting only the next token, MTP predicts multiple future tokens and supports built-in speculative decoding. The result is better decode speed, which directly improves agentic systems. A coding agent, compliance agent, or research agent may generate thousands of tokens across dozens of turns. Faster decoding reduces both wall-clock latency and cost-to-completion.

NVFP4 as a first-class deployment mode

Nemotron 3 Ultra supports BF16, but Blackwell NVFP4 is the intended efficiency path. NVFP4 uses 4-bit floating point with 2D block quantization. NVIDIA reports up to 5x throughput improvement versus BF16 on supported Blackwell systems. The important point is that NVFP4 is not just a post-hoc compression trick; it is part of the training and inference story for the Nemotron family, including stable large-scale FP4 work.

Agent-focused post-training

The post-training pipeline includes supervised fine-tuning, RLVR, MOPD, and MTP boosting. MOPD, or Multi-Teacher On-Policy Distillation, uses more than ten specialized teacher models in an asynchronous pipeline. NVIDIA also released substantial Nemotron-family training data: 50M cumulative SFT samples, 2M RL tasks, 55 RL environments, plus new additions including 10M SFT samples, 1M RL tasks, 15 RL environments, 4B synthetic legal tokens, 35B Wiki-derived fact tokens, and 173B refreshed GitHub tokens.

That recipe is aimed at real agents: follow instructions, use tools, solve coding tasks, reason over evidence, and recover after failures.


3. Benchmark Snapshot

Benchmarks should not be treated as product requirements, but they are useful for understanding the model's shape. Nemotron 3 Ultra is strongest in open-model availability, agent productivity, instruction following, long-context recall, and non-hallucination. It is not the top model on every terminal/coding benchmark.

Intelligence index comparison

Model Artificial Analysis Intelligence Index Notes
Claude Opus 4.7 / GPT-5.4 / Gemini 3.1 Pro ~57 Proprietary frontier tier
Kimi K2.6 54 Strong peer; source notes mention geo restrictions
Nemotron 3 Ultra 48 U.S. open-model leader in source notes
Gemma 4 31B 39 Smaller open model class
Nemotron 3 Super 36 Smaller Nemotron family member

Task benchmarks

Benchmark Nemotron 3 Ultra Comparison / Meaning
PinchBench, agent productivity 91% Matches Kimi K2.6 in notes
EnterpriseOps-Gym, planning 33% GLM 5.1: 40%; Kimi: 29%
Terminal-Bench 2.0 54% Kimi: 67%; GLM: 64%
Terminal-Bench 2.1 56.4% Kimi K2.6: 67.2%
IFBench, instruction following 82% GLM: 77%; Kimi: 74%
GDPVal-AA, knowledge work 1,448 GLM: 1,594; Kimi: 1,508
ProfBench, search 56% Tied with Kimi in notes
RULER at 1M ~95% / 94.7% Strong long-context recall
SWE-bench Verified 65-70.4% Consistent across five agent frameworks
IOI 2025 570.0 Top-3 human-level competitive programming signal
AA-Omniscience 78.7 Strong non-hallucination score

Throughput and cost signals

Metric Reported signal
DeepInfra pre-release endpoint 300+ tokens/sec, BF16
Typical peer model range in notes 50-100 tokens/sec
GPT-5.4-style proprietary range in notes 80-120 tokens/sec
NVFP4 on GB200 vs GLM-5.1 5.9x throughput at 8K/64K
Versus Kimi K2.6 4.8x faster in notes
Versus Qwen-3.5 1.6x faster in notes
Cost-to-completion Up to 30% lower than peers

The key metric for agents is not price per token; it is cost-to-completion. If a model finishes in fewer turns, uses fewer reasoning tokens, and generates faster, it can be cheaper even when the sticker price is similar.


4. Cloud API Deployment Scenarios

For most teams, start here. The cloud paths let you test quality, latency, context behavior, and cost before making a hardware commitment.

NVIDIA NIM

NVIDIA NIM is the most direct NVIDIA-supported route. It exposes OpenAI-compatible and Anthropic-compatible APIs through https://integrate.api.nvidia.com/v1 and uses:

nvidia/nemotron-3-ultra-550b-a55b
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-ultra-550b-a55b",
    messages=[
        {"role": "system", "content": "You are a precise enterprise AI assistant."},
        {"role": "user", "content": "Summarize this policy and identify compliance risks."},
    ],
    temperature=1.0,
    top_p=0.95,
    max_tokens=2048,
    stream=True,
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="")

Reasoning can be controlled through provider-specific fields. Use reasoning-off for simple extraction, regular mode for agent work, and medium-effort or full reasoning for planning, synthesis, or debugging.

response = client.chat.completions.create(
    model="nvidia/nemotron-3-ultra-550b-a55b",
    messages=[...],
    extra_body={"enable_thinking": True},
)

OpenRouter

OpenRouter is a practical compatibility layer if you already use OpenAI-style clients:

base_url = https://openrouter.ai/api/v1
model    = nvidia/nemotron-3-ultra-550b-a55b

This is useful for A/B testing against proprietary and open peers without rewriting application logic.

Amazon SageMaker JumpStart

SageMaker JumpStart is the AWS-native enterprise path. It is best when you need IAM, VPC isolation, CloudWatch logging, procurement alignment, and endpoint lifecycle management under existing AWS governance.

Hugging Face / Together and Vercel AI Gateway

The source notes list Hugging Face/Together availability using:

together_ai:nvidia/nemotron-3-ultra-550b-a55b

Vercel AI Gateway also exposes:

nvidia/nemotron-3-ultra-550b-a55b

These are natural fits for teams already using Hugging Face Inference, Together endpoints, Vercel AI SDK, or multi-provider AI gateways.

Cloud pricing snapshot

Provider Input / 1M Output / 1M Cache read / 1M
OpenRouter $0.50 $2.50 $0.15
Vercel AI Gateway $0.60 $2.40 $0.12
Hugging Face / Together $0.60 $3.60 N/A
Comparison model family Input / 1M Output / 1M Cache read / 1M
Nemotron 3 Ultra, OpenRouter $0.50 $2.50 $0.15
GPT-5.4-style pricing ~$5.00 ~$15.00 ~$1.25
Claude Opus 4.7-style pricing ~$15.00 ~$75.00 ~$1.50
Kimi K2.6 ~$0.50 ~$2.00 Unknown

5. Enterprise Data Center Deployment

Self-hosting makes sense when you need data sovereignty, predictable high utilization, custom controls, or lower long-term cost. The most important decision is precision plus accelerator generation.

4x GB200/B200/GB300/B300 with NVFP4

This is the efficient greenfield path.

Component Recommendation
GPU class GB200, B200, GB300, or B300
GPU count 4
Precision NVFP4
Tensor parallelism 4
Strength Best cost/performance path in NVIDIA story
Best for Dedicated enterprise agent clusters

NVFP4 is the reason this configuration is compelling. It reduces memory and improves throughput while aligning with Blackwell's hardware strengths.

8x H100 BF16: feasible for controlled serving, but not the efficiency target

The task asks specifically about 8x H100 BF16. This is a common cloud and enterprise shape because p5-class instances expose 8x H100 80GB. Treat it as a practical evaluation or constrained serving target, not the most efficient Ultra configuration. Full BF16 weights are large, long-context KV cache is expensive, and Blackwell NVFP4 is where the economics improve. For broader Hopper deployment, the source notes also call out 16x H100 80GB or 8x H200 as more comfortable BF16 infrastructure.

Component Recommendation
GPU class H100 80GB
GPU count 8 for p5-style evaluation; 16 for more comfortable BF16 headroom
Precision BF16
Limitation No Blackwell NVFP4 acceleration
Best for Existing H100 fleets, AWS p5/p5e evaluation, sovereign pilots

vLLM serving example

# Prereqs: Ubuntu 22.04, CUDA 12.9+, Driver 580+, Docker 24.0+
docker pull vllm/vllm-openai:v0.22.0

export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1

vllm serve nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 \
  --served-model-name nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8 \
  --max-num-seqs 16 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --max-num-batched-tokens 32768 \
  --enable-flashinfer-autotune \
  --async-scheduling \
  --speculative_config.method mtp \
  --speculative_config.num_speculative_tokens 5 \
  --mamba-backend triton \
  --mamba-ssm-cache-dtype float32 \
  --reasoning-parser nemotron_v3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

For full 1M context, plan capacity carefully. Loading a model is not the same as serving concurrent long-context users.


6. DGX Spark Cluster Scenarios

DGX Spark is exciting because it suggests a future where local frontier-scale inference becomes more accessible. But Ultra is still a stretch.

Scenario Memory Verdict
1-node DGX Spark 128GB unified Not viable for Ultra; use Nano or smaller quantized models
2-node Spark 256GB unified Prototype/tight; likely below comfortable Ultra footprint
3-node Spark 384GB unified Emerging viability, experimental
4-node Spark 512GB unified More plausible, still not NVIDIA-validated

A two-node Spark cluster can be useful for proving Ray/vLLM multi-node mechanics and serving smaller large models. The source notes mention prior work around Llama 3.1 405B FP4 and Qwen3-235B GPTQ-Int4, not production Ultra. The direct-link pattern uses a 200 Gb/s QSFP-DD cable, RoCE, jumbo frames, Ray, NCCL, and vLLM.

# Node A
sudo nmcli con mod <qsfp-ch1> ipv4.addresses 192.168.100.10/24 ipv4.method manual 802-3-ethernet.mtu 9000
sudo nmcli con up <qsfp-ch1>

# Node B
sudo nmcli con mod <qsfp-ch1> ipv4.addresses 192.168.100.11/24 ipv4.method manual 802-3-ethernet.mtu 9000
sudo nmcli con up <qsfp-ch1>

# Bandwidth sanity check
ib_write_bw -d <roce-interface> --report_gbits
ib_write_bw -d <roce-interface> 192.168.100.11 --report_gbits

For SMF Works experimentation, 3-4 Spark nodes are intriguing. For production, I would still choose NIM, a managed endpoint, H100/H200, or Blackwell data-center GPUs.


7. Custom Builds and Cloud VM Rentals

Consumer and professional NVIDIA GPUs

Build Memory Interconnect Verdict
8x RTX 4090/5090 ~192GB total PCIe, no NVLink Not recommended
8x RTX 6000 Ada ~384GB total Better, limited topology Possible lab path, still suboptimal

Consumer cards lack the memory, interconnect, and serving maturity required for a good 550B MoE experience. RTX 6000 Ada is better, but still not the same as a data-center fabric.

AMD MI300X / MI350X

AMD MI300X is the most credible non-NVIDIA path because memory capacity is abundant: 192GB HBM3 per GPU and more than 1.5TB across 8 GPUs. ROCm and vLLM support continue to improve. MI350X-class systems with stronger low-precision support may become even more interesting.

Build Strength Risk
8x MI300X Huge VRAM headroom Not NVIDIA reference platform; ROCm/kernel maturity
MI350X-class future systems Better FP8/FP4 direction Ecosystem validation still needed

Cloud VM rentals

Provider / class Hardware Best use
AWS p5/p5e.48xlarge 8x H100 80GB NVLink, ~$98/hr in notes BF16 evaluation and AWS-governed pilots
AWS p6i/p6e Blackwell-class when available NVFP4 cloud self-hosting
Lambda Labs / CoreWeave / Crusoe H100/B200 clusters Cheaper/flexible experimentation
RunPod / Vast.ai Marketplace GPUs Not recommended for production MoE inference

Cloud VMs are the middle path: more control than an API, less commitment than buying hardware.


8. Use Cases with Technical Depth

Multi-agent orchestration

This is the primary design target. Ultra works well as the “captain” model: it plans, delegates, reads tool outputs, updates state, and recovers from failures. Smaller models or deterministic tools can handle execution.

User goal → Ultra planner → sub-agents/tools → test/search/extract → Ultra synthesis → final answer

The 1M context window lets the captain keep the full trace: original goal, plan revisions, tool calls, logs, test failures, and final rationale. Medium-effort mode can reduce cost for routine steps, while regular or deeper reasoning can be reserved for hard decisions.

Long-context document analysis

A 1M-token context can hold hundreds of pages of legal text, a large policy corpus, a long email history, multi-year financial reports, or a substantial codebase snapshot. The hybrid Mamba-attention architecture is the enabler: Mamba handles long-sequence efficiency; attention preserves retrieval. The reported RULER score around 95% at 1M tokens is why this matters.

Recommended pattern: chunk documents with stable IDs, include page/section markers, ask for quote-backed claims, and preserve citations in the final answer.

Software engineering

With SWE-bench Verified at 65-70.4% across five agent frameworks, Ultra is a credible software engineering model. Its advantage comes from 173B refreshed GitHub tokens, coding RL environments, specialist teacher distillation, and long-context repo comprehension.

A practical loop is: read issue and repo map, write a patch, run tests, ingest stack traces, revise, then review the diff against requirements. Terminal-Bench results show it is not always ahead of peers on shell-heavy tasks, so pair it with deterministic execution and strong test feedback.

Deep research synthesis

Ultra can ingest many sources, compare claims, identify contradictions, and write evidence-sensitive summaries. ProfBench at 56% and AA-Omniscience at 78.7 suggest a model that is useful for research workflows where hallucination control matters.

A robust research pipeline should collect sources, extract text, label every source, ask Ultra to separate evidence from inference, and require citations for non-obvious claims.

Enterprise automation and compliance

Nemotron's domain data matters here: 4B synthetic legal tokens improved a LegalBench proxy from 64.6% to 74.7%, while 35B Wiki-derived tokens improved a SimpleQA proxy from 40.2% to 50.2% in the notes. Use cases include contract clause extraction, regulatory comparison, policy checking, audit trail review, and control evidence mapping.

For regulated workflows, keep humans in the loop, log prompts and outputs, and require quote-backed findings.

Chip design verification

NVIDIA highlights chip design verification as a natural workload: long specifications, thousands of constraints, simulation tools, iterative debugging, and strict correctness requirements. Ultra's strengths line up well: long-context recall, tool use, error recovery, and multi-step constraint reasoning.

Education and WisdomForge

For WisdomForge, Ultra could act as a curriculum designer, personalized tutor, assessment generator, and code mentor. The 1M-token window can hold a student's long learning history, prior attempts, rubrics, lesson content, and instructor feedback. The deployment decision becomes privacy-driven: cloud API for scale, or local/edge clusters for sensitive institutional data.


9. Ownership Economics

Scenario Capital cost estimate Cloud equivalent in notes Rough break-even
4x GB200 NVFP4 ~$400,000 ~$70,000/month ~6 months
8x H100 BF16 class ~$300,000 ~$50,000/month ~6 months
3x DGX Spark ~$9,000 ~$2,000/month ~4.5 months
4x DGX Spark ~$12,000 ~$3,000/month ~4 months

These numbers are directional. Real economics depend on utilization, staffing, power, networking, redundancy, depreciation, reserved cloud pricing, and whether your workload benefits from context caching.


10. Decision Matrix

Your situation Recommended path Why
Need it now, low volume NVIDIA NIM API Fastest path, free credits noted, no infrastructure
Need provider flexibility OpenRouter or Vercel AI Gateway Easy model switching and app integration
AWS-governed enterprise SageMaker JumpStart IAM, VPC, logging, procurement alignment
Existing H100 fleet vLLM/NIM on H100 BF16 Reuse hardware, accept lower efficiency
Greenfield private AI platform 4x GB200/B200 NVFP4 Best NVIDIA-aligned cost/performance
Air-gapped or sovereign deployment Self-hosted NIM on H100/H200/Blackwell Data stays inside your environment
Lab experimentation 3-4x DGX Spark cluster Low entry cost, emerging viability
AMD infrastructure shop 8x MI300X evaluation Enough memory, but more integration work
Production marketplace GPUs Avoid Interconnect and reliability risk

Future Outlook

Nemotron 3 Ultra points toward a future where open models are not just downloadable artifacts but full-stack systems: model architecture, training data, low-precision formats, serving engines, and accelerator topology all designed together. The most important trend is not simply bigger parameter counts. It is usable agent throughput at long context.

Expect three things next. First, NVFP4-style inference will become normal for frontier-scale open models. Second, agent benchmarks will matter more than chat leaderboards because businesses care about completed workflows. Third, edge and desktop clusters like DGX Spark will keep moving from “fun demo” toward “departmental AI appliance,” though Ultra-class serving still needs validation.

For SMF Works and WisdomForge, the opportunity is clear: Nemotron 3 Ultra can be a serious reasoning backbone for education, research, automation, and software creation. Start with the API. Measure real tasks. Then decide whether the workload deserves dedicated hardware.


Sources

Originally published at smfworks.com.