
Nemotron 3 Nano 30B A3B

Nemotron 3 Nano 30B A3B is a sparse hybrid Mamba-Transformer mixture-of-experts (MoE) model with 30B total parameters, only 3B of which are active per token. It supports a 262.1K-token context window and delivers throughput closer to that of a 3B dense model than a 30B one.

Reasoning
index.ts
import { streamText } from 'ai'

const result = streamText({
  model: 'nvidia/nemotron-3-nano-30b-a3b',
  prompt: 'Why is the sky blue?',
})
// Print tokens to stdout as they arrive
for await (const text of result.textStream) {
  process.stdout.write(text)
}

What To Consider When Choosing a Provider

  • Configuration: With a context window of 262.1K tokens, entire codebases or multi-document evidence sets fit in a single call. Plan context usage carefully: filling the window is possible, but model the cost and latency implications ahead of time by weighing the input rate ($0.05) against the output rate ($0.24); a worked cost estimate follows this list.
  • Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token, so you do not need to manage provider credentials directly; a minimal setup sketch follows this list.
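
To make the configuration bullet concrete, here is a back-of-envelope cost estimate. It assumes the listed rates are the hosted input and output prices per million tokens; confirm the units on the pricing page before relying on these numbers.

// Rough cost of one maximally full call (rates assumed per 1M tokens).
const INPUT_RATE = 0.05 / 1_000_000   // $ per input token (assumption)
const OUTPUT_RATE = 0.24 / 1_000_000  // $ per output token (assumption)

const inputTokens = 262_144  // a completely full context window
const outputTokens = 4_096   // hypothetical response length

const cost = inputTokens * INPUT_RATE + outputTokens * OUTPUT_RATE
console.log(`~$${cost.toFixed(4)} per call`) // ~$0.0141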
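
For the authentication bullet, a minimal setup sketch using the AI SDK's gateway provider. It assumes the @ai-sdk/gateway package and an AI_GATEWAY_API_KEY environment variable; adjust to your deployment.

import { createGateway } from '@ai-sdk/gateway'
import { generateText } from 'ai'

// Explicit API-key configuration; with AI_GATEWAY_API_KEY set in the
// environment, the package's default `gateway` export works without this.
const gateway = createGateway({
  apiKey: process.env.AI_GATEWAY_API_KEY,
})

const { text } = await generateText({
  model: gateway('nvidia/nemotron-3-nano-30b-a3b'),
  prompt: 'Summarize the MoE routing idea in one sentence.',
})
console.log(text)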

When to Use Nemotron 3 Nano 30B A3B

Best For

  • Concurrent multi-agent systems: Running many lightweight agents where per-agent throughput matters
  • Long-context tasks: Holding entire codebases, extended session histories, or multi-document sets in one call
  • Agentic tool-calling workflows: Multi-step pipelines with chained actions (see the sketch after this list)
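
As a sketch of the tool-calling pattern above: a single getWeather tool wired into a multi-step loop with the AI SDK. The tool, its schema, and the step limit are hypothetical choices for illustration, not a prescribed setup.

import { generateText, tool, stepCountIs } from 'ai'
import { z } from 'zod'

const { text } = await generateText({
  model: 'nvidia/nemotron-3-nano-30b-a3b',
  tools: {
    // Hypothetical tool; a real one would call a weather API
    getWeather: tool({
      description: 'Look up current weather for a city',
      inputSchema: z.object({ city: z.string() }),
      execute: async ({ city }) => ({ city, tempC: 21 }), // stubbed result
    }),
  },
  stopWhen: stepCountIs(5), // allow up to 5 chained tool-call steps
  prompt: 'What should I wear in Tokyo today?',
})
console.log(text)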

Consider Alternatives When

  • Maximum reasoning depth: Nemotron 3 Super (120B/12B active) handles complex multi-agent planning
  • Vision-language tasks: Nemotron Nano 12B v2 VL is the multimodal option
  • Smaller context needs: A 128K context window is sufficient and the 262.1K-token capacity would go unused
  • Compact dense reasoning: Nemotron Nano 9B v2 targets a dense model profile

Conclusion

Nemotron 3 Nano 30B A3B delivers the throughput of a small model with the knowledge breadth of a large one. Its hybrid Mamba-Transformer MoE architecture and 262.1K-token context window suit tasks that require holding large amounts of information in a single pass. Use AI Gateway to route traffic with unified authentication.

Frequently Asked Questions

  • Why does "30B total, 3B active" matter for inference cost?

    You pay for compute proportional to the active parameters, not the total. Nemotron 3 Nano 30B A3B runs at speeds and costs closer to those of a 3B dense model while drawing on 30B parameters of learned knowledge. The MoE routing mechanism selects a relevant subset of experts per token, as sketched below.
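
    A toy illustration of that routing step. The 8-expert pool and top-2 selection are made up for the example and are not Nemotron's actual configuration.

    // Toy top-k MoE router: score the experts for one token, keep the best k.
    function route(scores: number[], k: number): number[] {
      return scores
        .map((score, expert) => ({ score, expert }))
        .sort((a, b) => b.score - a.score)
        .slice(0, k)
        .map(({ expert }) => expert)
    }

    const gateScores = [0.1, 2.3, -0.4, 1.7, 0.0, -1.2, 0.9, 0.3] // one token, 8 experts
    console.log(route(gateScores, 2)) // [1, 3]: only these experts run for this token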

  • How does the Mamba architecture enable the 262.1K-token context window?

    Mamba layers process sequences with linear-time complexity rather than the quadratic scaling of standard attention. That makes it practical to hold 262.1K tokens in context without the memory explosion that would make pure-attention models infeasible at that length; the rough numbers below show why.
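
    A rough fp16 back-of-envelope comparison, assuming a naively materialized attention score matrix (optimized kernels such as FlashAttention avoid storing it, but compute still scales quadratically):

    // Naive attention materializes an n x n score matrix per head per layer.
    const n = 262_144  // context length in tokens
    const bytes = 2    // fp16
    const scoreMatrixGB = (n * n * bytes) / 1e9
    console.log(`~${scoreMatrixGB.toFixed(0)} GB per head per layer`) // ~137 GB
    // A Mamba layer instead carries a fixed-size recurrent state, so memory
    // stays flat and time grows linearly as n increases.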

  • How does Nemotron 3 Nano 30B A3B differ from Nemotron Nano 9B v2?

    They use different architectures. Nemotron 3 Nano 30B A3B is a sparse MoE with 30B total/3B active parameters and a context window of 262.1K tokens. Nemotron Nano 9B v2 is a dense 9B model with a 128K-token context window. Choose Nemotron 3 Nano 30B A3B for throughput across multi-agent systems and Nano 9B v2 as a compact reasoning model.

  • Where are hosted input and output prices listed?

    Current pricing is shown on this page. AI Gateway routes across providers, and rates may vary by provider.