How is this model different from Nemotron 3 Nano (30B/A3B)?

They use different architectures. Nvidia Nemotron Nano 9B V2 is a dense 9B model with a context window of 131.1K tokens. Nemotron 3 Nano is a sparse MoE (30B total, 3B active) with a 1M-token context for multi-agent throughput. Choose based on whether your constraint is footprint (9B v2) or context scale (Nemotron 3 Nano).

What does thinking budget control mean in practice?

You can instruct the model to reason briefly or in depth on a per-request basis. Brief reasoning produces faster, cheaper responses for straightforward tasks. Deep reasoning takes longer but improves accuracy on complex problems.

Where are input and output prices listed?

Pricing appears on this page and updates as providers adjust their rates. AI Gateway routes traffic through the configured provider.

Dashboard

Nvidia Nemotron Nano 9B V2

Nvidia Nemotron Nano 9B V2 is a dense hybrid Mamba-Transformer reasoning model that matches or exceeds Qwen3-8B accuracy at up to 6x the throughput, with built-in thinking budget control.

ReasoningTool Use

index.ts

import { streamText } from 'ai'

const result = streamText({
  model: 'nvidia/nemotron-nano-9b-v2',
  prompt: 'Why is the sky blue?'
})

Overview Playground About Providers Throughput Latency Uptime Status Similar FAQ

Playground

Try out Nvidia Nemotron Nano 9B V2 by NVIDIA. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

Provider

Context	Latency	Throughput	Input	Output	Cache	Web Search	Per Query	Capabilities	ZDR	No Training	Release Date

Amazon Bedrock

131K

0.2s

152tps

$0.06/M

$0.23/M

—

08/18/2025

DeepInfra

131K

0.3s

51tps

$0.04/M

$0.16/M

—

08/18/2025

More models by NVIDIA

Model

Context	Latency	Throughput	Input	Output	Cache	Web Search	Per Query	Capabilities	Providers	ZDR	No Training	Release Date

nvidia/nemotron-3-ultra-550b-a55b

0.3s

222tps

$0.37/M

$1.08/M

Read:$0.12/M

Write:—

—

06/04/2026

nvidia/nemotron-3-super-120b-a12b

256K

0.2s

126tps

$0.15/M

$0.65/M

Read:—

Write:$0.06/M

—

03/18/2026

nvidia/nemotron-3-nano-30b-a3b

262K

0.4s

32tps

$0.05/M

$0.24/M

—

12/01/2024

nvidia/nemotron-nano-12b-v2-vl

131K

0.2s

$0.20/M

$0.60/M

—

12/01/2024

About Nvidia Nemotron Nano 9B V2

NVIDIA released Nvidia Nemotron Nano 9B V2 on August 18, 2025 as the compressed reasoning variant of the Nemotron Nano 2 family. It is a 9B-parameter model with a context window of 131.1K tokens.

Nvidia Nemotron Nano 9B V2 matches or exceeds Qwen3-8B on complex reasoning tasks at up to 6x the throughput. The hybrid Mamba-Transformer architecture contributes to this efficiency. Mamba layers handle sequence processing with sub-quadratic memory scaling, while Transformer attention layers maintain precision on retrieval-heavy tasks within the context window.

Nvidia Nemotron Nano 9B V2 also supports thinking budget control. You can prompt it to reason briefly for simple tasks (faster, cheaper) or thoroughly for hard problems (slower, more accurate). Adjust the latency-accuracy tradeoff at inference time without switching models. Technical report and assets: https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html.

What To Consider When Choosing a Provider

Configuration: Nvidia Nemotron Nano 9B V2 is a compact dense reasoning model. Evaluate whether its capability tier fits your workload before committing at production scale. Compare $0.06 and $0.23.
Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.

When to Use Nvidia Nemotron Nano 9B V2

Best For

High-throughput reasoning: Workloads where 6x speed over comparable models matters
Thinking budget control: Applications that vary reasoning depth per request
Cost-sensitive production: Compact reasoning models that reduce infrastructure spend

Consider Alternatives When

1M-token context: Nemotron 3 Nano (30B/3B active) supports that scale
Vision or multimodal: Nemotron Nano 12B v2 VL is the right choice
Multi-agent orchestration: The sparse MoE design of Nemotron 3 Nano is better suited to that pattern

Conclusion

Nvidia Nemotron Nano 9B V2 is a dense reasoning model. It delivers high throughput and accuracy with thinking budget control for tuning the speed-accuracy tradeoff per request. Route it through AI Gateway.

Frequently Asked Questions

How is this model different from Nemotron 3 Nano (30B/A3B)?
They use different architectures. Nvidia Nemotron Nano 9B V2 is a dense 9B model with a context window of 131.1K tokens. Nemotron 3 Nano is a sparse MoE (30B total, 3B active) with a 1M-token context for multi-agent throughput. Choose based on whether your constraint is footprint (9B v2) or context scale (Nemotron 3 Nano).
What does thinking budget control mean in practice?
You can instruct the model to reason briefly or in depth on a per-request basis. Brief reasoning produces faster, cheaper responses for straightforward tasks. Deep reasoning takes longer but improves accuracy on complex problems.
Where are input and output prices listed?
Pricing appears on this page and updates as providers adjust their rates. AI Gateway routes traffic through the configured provider.

AI Cloud

Core Platform

Security

Company

Learn

Open Source

Use Cases

Tools

Users

Nvidia Nemotron Nano 9B V2

Playground

Providers

More models by NVIDIA

About Nvidia Nemotron Nano 9B V2

What To Consider When Choosing a Provider

When to Use Nvidia Nemotron Nano 9B V2

Best For

Consider Alternatives When

Conclusion

Frequently Asked Questions