Nemotron 3 Nano 30B A3B

Nemotron 3 Nano 30B A3B is a sparse hybrid Mamba-Transformer mixture-of-experts (MoE) model with 30B total parameters but only 3B active per token. It supports a context window of 262.1K tokens with throughput closer to a 3B dense model than a 30B one.

Reasoning

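import { streamText } from 'ai'

// Request a streamed completion, addressing the model by its gateway slug
const result = streamText({
  model: 'nvidia/nemotron-3-nano-30b-a3b',
  prompt: 'Why is the sky blue?'
})

// Print each text chunk as it arrives
for await (const textPart of result.textStream) {
  process.stdout.write(textPart)
}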

Playground

Try out Nemotron 3 Nano 30B A3B by NVIDIA. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

About Nemotron 3 Nano 30B A3B

NVIDIA announced Nemotron 3 Nano 30B A3B on December 1, 2024 as the first model in the Nemotron 3 family. The core idea is architectural efficiency at scale. 30B total parameters provide a broad knowledge base, but only 3B activate for any given token. This keeps inference cost and speed in the range of much smaller models.

Three layer types interleave throughout the architecture. Mamba-2 layers handle sequence processing with linear-time complexity. This makes the context window of 262.1K tokens feasible without the quadratic memory growth of pure attention. Transformer attention layers appear at strategic depths to maintain precise associative recall: the ability to pick out a specific fact from a large context. Mixture-of-experts (MoE) routing selects which expert parameters activate for each token, keeping compute proportional to the 3B active count rather than the full 30B.
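The routing idea can be illustrated with a toy top-k gate. This is a minimal sketch, not NVIDIA's implementation: the expert count (8), the value of k (2), the router scores, and the renormalized-softmax weighting are all assumptions chosen for illustration, and the names `moeForward`, `topK`, and `demoExperts` are hypothetical.

```typescript
// Toy top-k mixture-of-experts routing for a single token.
// All sizes and scores are illustrative assumptions, not Nemotron's real configuration.
type Expert = (x: number[]) => number[];

// Numerically stable softmax over a small list of scores.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Indices of the k largest router scores, highest first.
function topK(scores: number[], k: number): number[] {
  return scores
    .map((score, index) => ({ score, index }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((entry) => entry.index);
}

// Only the k selected experts run; the rest contribute no compute for this token.
function moeForward(
  x: number[],
  routerLogits: number[],
  experts: Expert[],
  k: number
): { output: number[]; used: number[] } {
  const used = topK(routerLogits, k);
  const weights = softmax(used.map((i) => routerLogits[i]));
  const output = new Array<number>(x.length).fill(0);
  used.forEach((expertIndex, j) => {
    const y = experts[expertIndex](x);
    y.forEach((value, d) => {
      output[d] += weights[j] * value;
    });
  });
  return { output, used };
}

// Eight hypothetical experts; expert i simply scales its input by (i + 1).
const demoExperts: Expert[] = Array.from({ length: 8 }, (_, i) => {
  return (x: number[]) => x.map((v) => v * (i + 1));
});
const demoLogits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2];
const demo = moeForward([1, 2], demoLogits, demoExperts, 2);
console.log(demo.used, demo.output);
```

In this toy setup only two of the eight experts execute for the token, which mirrors how the active parameter count, rather than the total, drives per-token compute.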

Weights and recipes are available under the NVIDIA Open Model License. Deployment cookbooks for vLLM, SGLang, and TensorRT-LLM are also provided. Overview and techniques: https://deepinfra.com/nvidia/Nemotron-3-Nano-30B-A3B.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

Provider	Context	Latency	Throughput	Input	Output	Cache	Release Date
(name not captured)	262K	0.7s	84tps	$0.05/M	$0.24/M	—	12/01/2024

Legal: Terms • Privacy
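A provider preference can be expressed as a small options object. The sketch below assumes the gateway reads an ordered list of slugs from `providerOptions.gateway.order`; confirm the exact option name in the AI Gateway docs before relying on it. The helper `preferProviders` and the slug `example-provider` are hypothetical placeholders, not values from this page.

```typescript
// Hypothetical helper that builds a provider-preference object.
// Assumption: the gateway accepts an ordered slug list under `gateway.order`.
type GatewayProviderOptions = { gateway: { order: string[] } };

function preferProviders(slugs: string[]): GatewayProviderOptions {
  if (slugs.length === 0) {
    throw new Error('At least one provider slug is required');
  }
  // Copy the array so later changes to the caller's list do not leak in.
  return { gateway: { order: [...slugs] } };
}

// 'example-provider' is a placeholder; copy the real slug from the provider list.
const preferred = preferProviders(['example-provider']);
console.log(JSON.stringify(preferred));
```

The resulting object would be passed as the `providerOptions` argument alongside `model` and `prompt` in a `streamText` call.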

More models by NVIDIA

Model	Context	Latency	Throughput	Input	Output	Cache	Release Date
(name not captured)	256K	0.5s	192tps	$0.15/M	$0.65/M	—	03/18/2026
(name not captured)	131K	0.3s	136tps	$0.06/M	$0.23/M	—	08/18/2025
(name not captured)	131K	0.4s	140tps	$0.20/M	$0.60/M	—	12/01/2024

What To Consider When Choosing a Provider

Configuration: With a context window of 262.1K tokens, entire codebases or multi-document evidence sets fit in a single call. Filling the window is possible, but plan context usage carefully and model the cost and latency ahead of time: input is billed at $0.05 per million tokens and output at $0.24 per million tokens.
Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
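The cost and latency modeling mentioned under Configuration can be done with simple arithmetic. The sketch below uses the rates and speeds listed for the provider on this page ($0.05/M input, $0.24/M output, 0.7s latency, 84tps). It assumes the listed latency stays fixed regardless of prompt length, which is unlikely to hold for very long prompts, so treat the time figure as a lower bound. The function name `estimateCall` is hypothetical.

```typescript
// Figures copied from the provider row on this page; they may change or vary by provider.
const INPUT_USD_PER_MILLION = 0.05;
const OUTPUT_USD_PER_MILLION = 0.24;
const LATENCY_SECONDS = 0.7; // listed latency, assumed constant here
const THROUGHPUT_TPS = 84; // listed output tokens per second

interface CallEstimate {
  costUsd: number;
  seconds: number;
}

// Estimate the price and wall-clock time of one request.
function estimateCall(inputTokens: number, outputTokens: number): CallEstimate {
  const costUsd =
    (inputTokens / 1e6) * INPUT_USD_PER_MILLION +
    (outputTokens / 1e6) * OUTPUT_USD_PER_MILLION;
  const seconds = LATENCY_SECONDS + outputTokens / THROUGHPUT_TPS;
  return { costUsd, seconds };
}

// A nearly full window: 260,000 prompt tokens plus a 2,000-token answer.
const fullWindow = estimateCall(260_000, 2_000);
console.log(fullWindow.costUsd.toFixed(5), fullWindow.seconds.toFixed(1));
```

For this example the estimate is about $0.0135 per call. A single call is cheap, but an agent that resends a near-full window on every step multiplies that figure by the number of steps.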

When to Use Nemotron 3 Nano 30B A3B

Best For

Concurrent multi-agent systems: Running many lightweight agents where per-agent throughput matters
Long-context tasks: Holding entire codebases, extended session histories, or multi-document sets in one call
Agentic tool-calling workflows: Multi-step pipelines with chained actions

Consider Alternatives When

Maximum reasoning depth: Nemotron 3 Super (120B/12B active) handles complex multi-agent planning
Vision-language tasks: Nemotron Nano 12B v2 VL is the multimodal option
Smaller context needs: If a 128K context window is sufficient, the 262.1K-token capacity goes unused
Compact dense reasoning: Nemotron Nano 9B v2 targets a dense model profile

Conclusion

Nemotron 3 Nano 30B A3B delivers the throughput of a small model with the knowledge breadth of a large one. Its hybrid Mamba-Transformer MoE architecture and context window of 262.1K tokens suit tasks that require holding large amounts of information in a single pass. Use AI Gateway to route traffic with unified auth.

Frequently Asked Questions

Why does "30B total, 3B active" matter for inference cost?
You pay for compute proportional to the active parameters, not the total. Nemotron 3 Nano 30B A3B runs at speeds and costs closer to a 3B dense model but draws on 30B parameters of learned knowledge. The MoE routing mechanism selects the relevant subset per token.
How does the Mamba architecture enable the context of 262.1K tokens?
Mamba layers process sequences with linear-time complexity rather than the quadratic scaling of standard attention. That makes it practical to hold 262.1K tokens in context without the memory explosion that would make pure-attention models infeasible at that length.
How does Nemotron 3 Nano 30B A3B differ from Nemotron Nano 9B v2?
They use different architectures. Nemotron 3 Nano 30B A3B is a sparse MoE with 30B total/3B active parameters and a context window of 262.1K tokens. Nemotron Nano 9B v2 is a dense 9B model with a 128K-token context window. Choose Nemotron 3 Nano 30B A3B for throughput across multi-agent systems and Nano 9B v2 as a compact reasoning model.
Where are hosted input and output prices listed?
Current pricing is shown on this page. AI Gateway routes across providers, and rates may vary by provider.
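The linear-versus-quadratic contrast from the Mamba answer above can be made concrete with a back-of-the-envelope comparison. The sketch counts pairwise attention scores (n × n) against a single linear pass (n) and ignores constant factors, layer counts, and implementation optimizations, so it shows growth rates only, not real memory or time figures. The function names are hypothetical.

```typescript
// Growth-rate comparison only; constant factors and real kernels are ignored.
function attentionPairs(n: number): number {
  return n * n; // standard attention scores every token against every token
}

function linearSteps(n: number): number {
  return n; // a linear-time layer visits each token once
}

// How much more work each approach needs when the context grows.
function growthWhenContextGrows(from: number, to: number) {
  return {
    attention: attentionPairs(to) / attentionPairs(from),
    linear: linearSteps(to) / linearSteps(from),
  };
}

// Going from a 128K window to the 262.1K window discussed on this page.
const growth = growthWhenContextGrows(128_000, 262_100);
console.log(growth.linear.toFixed(2), growth.attention.toFixed(2));
```

Roughly doubling the context doubles the linear work but roughly quadruples the pairwise count, and that gap widens as the window grows.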
