Nvidia released Nemotron 3 Super on Wednesday, a 120-billion-parameter open model that activates only 12 billion parameters at inference time. The numbers matter, so let’s walk through the math.
Multi-agent systems generate up to 15 times the token volume of a standard chatbot conversation. That volume is the single biggest constraint on deploying agents at enterprise scale. Nemotron 3 Super is Nvidia’s answer to this “thinking tax,” and the architecture tells you exactly where the industry is heading.
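To put a rough number on that 15x figure, here is a back-of-the-envelope sketch. The 2,000-token chat baseline and the $1.00-per-million-token price are made-up placeholders, not figures from Nvidia or any provider, so swap in your own.

```python
# Back-of-the-envelope token math for the "thinking tax".
# CHAT_TOKENS and USD_PER_MILLION are hypothetical placeholders.
AGENT_MULTIPLIER = 15      # "up to 15 times" a standard chatbot conversation
CHAT_TOKENS = 2_000        # assumed size of one chatbot conversation
USD_PER_MILLION = 1.00     # assumed blended price per million tokens


def agent_tokens(chat_tokens, multiplier=AGENT_MULTIPLIER):
    """Upper-bound token volume for a multi-agent run of the same task."""
    return chat_tokens * multiplier


def cost_usd(tokens, usd_per_million=USD_PER_MILLION):
    """Cost of a token volume at a flat per-million-token price."""
    return tokens / 1_000_000 * usd_per_million


chat_cost = cost_usd(CHAT_TOKENS)
agent_cost = cost_usd(agent_tokens(CHAT_TOKENS))
print(f"chat:  {CHAT_TOKENS} tokens, ${chat_cost:.4f}")
print(f"agent: {agent_tokens(CHAT_TOKENS)} tokens, ${agent_cost:.4f}")
```

Cost scales linearly with the multiplier, so any saving in cost per token carries straight through to the agent bill.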
Time to geek out. The model combines three techniques that have been developing independently: Mamba state-space layers for memory efficiency, transformer attention layers for precision reasoning, and a novel “Latent” mixture-of-experts routing system that compresses tokens before they reach the experts. The result, per Nvidia: 4x as many experts activated for the same inference cost. Nvidia claims 5x higher throughput than the previous Nemotron Super and 2.2x higher throughput than comparably sized open models like GPT-OSS-120B.
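Why would compressing tokens buy you more experts? Here is a toy cost model, not Nvidia's implementation. It assumes each expert is a plain two-layer feed-forward block, that compression shrinks the width the expert sees by a factor of 4, and that the cost of the compression step and the router is ignored. The dimensions and the baseline of 2 active experts are hypothetical.

```python
# Toy cost model for experts that operate on a compressed (latent) token.
# All sizes below are hypothetical; this is not Nvidia's architecture.
D_MODEL = 4096          # assumed full token width
D_HIDDEN = 8192         # assumed expert hidden width
COMPRESSION = 4         # assumed latent compression factor
BASELINE_EXPERTS = 2    # assumed experts active per token without compression


def expert_flops(d_in, d_hidden):
    """FLOPs per token for one expert: up-projection plus down-projection,
    counting 2 FLOPs per multiply-add."""
    return 2 * d_in * d_hidden + 2 * d_hidden * d_in


def experts_within_budget(budget_flops, d_in, d_hidden):
    """How many experts fit in a fixed per-token FLOP budget."""
    return budget_flops // expert_flops(d_in, d_hidden)


# Budget = what the uncompressed baseline spends on its active experts.
budget = BASELINE_EXPERTS * expert_flops(D_MODEL, D_HIDDEN)
latent_experts = experts_within_budget(budget, D_MODEL // COMPRESSION, D_HIDDEN)
print(f"baseline experts: {BASELINE_EXPERTS}, latent experts: {latent_experts}")
```

Under these assumptions, per-expert cost is proportional to the input width, so a 4x narrower input lets 4x as many experts run in the same budget. A real system pays something for the compression and routing, so treat the ratio as an upper bound.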
Nvidia pre-trained this model natively in 4-bit floating point (NVFP4) on Blackwell GPUs. Most quantized models start at full precision and get compressed after training, which typically costs some accuracy. Nemotron 3 Super learned to be accurate within the constraints of 4-bit arithmetic from the first gradient update. On Blackwell hardware, Nvidia says this delivers 4x faster inference than 8-bit models on Hopper, with no meaningful accuracy degradation.
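The precision choice also shows up directly in memory. A quick sketch of weight storage alone for a 120-billion-parameter model at different bit widths follows. It deliberately ignores the KV cache, activations, runtime overhead, and the scaling factors that low-precision formats store alongside the weights, so real footprints will be higher.

```python
# Weight-only memory footprint at different precisions.
# Ignores KV cache, activations, and per-block scaling factors.
TOTAL_PARAMS = 120_000_000_000  # 120 billion parameters, per the article


def weight_memory_gb(num_params, bits_per_param):
    """Gigabytes (10^9 bytes) needed to store the raw weights."""
    return num_params * bits_per_param / 8 / 1e9


for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
```

Halving the bit width halves the bytes that have to be stored and moved, which is a large part of why low-precision inference is faster as well as smaller.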
The one-million-token context window is the other big lever. In a multi-agent workflow, agents pass context back and forth constantly. A short context window forces the system to trim or summarize earlier context and then reason over it again, which destroys efficiency. A million tokens means a software agent can load an entire codebase at once. A financial analysis agent can hold thousands of pages of reports in memory without segmentation. Goal drift drops significantly when the model never has to trim its own memory.
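To see what the window size means in practice, here is a rough sketch that estimates how many segments a body of text has to be split into. The 4-characters-per-token ratio is a common rule of thumb, not a property of this model's tokenizer. The 50,000-token output reserve, the 128,000-token comparison window, and the 3-million-character codebase are hypothetical.

```python
import math

# Rough context-budget check. CHARS_PER_TOKEN and RESERVE_TOKENS are
# assumptions for illustration, not properties of any specific model.
CHARS_PER_TOKEN = 4.0
RESERVE_TOKENS = 50_000   # assumed headroom for instructions and output


def estimate_tokens(num_chars, chars_per_token=CHARS_PER_TOKEN):
    """Crude token estimate from a character count."""
    return math.ceil(num_chars / chars_per_token)


def segments_needed(num_tokens, context_window, reserve=RESERVE_TOKENS):
    """Number of separate passes needed to cover num_tokens of input."""
    usable = context_window - reserve
    if usable <= 0:
        raise ValueError("reserve leaves no room for input")
    return max(1, math.ceil(num_tokens / usable))


codebase_tokens = estimate_tokens(3_000_000)  # hypothetical 3 MB of source
print("tokens:", codebase_tokens)
print("1M window segments:", segments_needed(codebase_tokens, 1_000_000))
print("128k window segments:", segments_needed(codebase_tokens, 128_000))
```

Every extra segment is another point where the agent has to summarize, hand off, or re-read, which is where the efficiency loss comes from.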
The “open” release is the most interesting part. Nvidia published weights, over 10 trillion tokens of training data, 15 reinforcement learning environments, and full training recipes on Hugging Face. The license (Nvidia Open Model License) is commercially permissive with attribution requirements and a patent termination clause if you litigate against Nvidia. It is not Apache 2.0, but it is good enough for most production deployments. Nvidia makes no claim on outputs you generate. You own your derivative models.
According to Nvidia, Perplexity integrated Nemotron 3 Super for search. CodeRabbit, Factory, and Greptile are building it into their code review agents. Enterprise platforms including Palantir, Siemens, Dassault Systemes, and Cadence are customizing it for workflow automation. Cloud availability spans Google Cloud Vertex AI, Oracle Cloud, and (soon) AWS Bedrock and Azure.
If you’re running multi-agent deployments or evaluating agent infrastructure, the math works. Nemotron 3 Super gives you frontier-class reasoning depth at a fraction of the inference cost. The open weights mean you control the model, your data stays on your infrastructure, and you can fine-tune for your domain.
There is a downside: you need at least 8x H100 GPUs for self-hosting, which prices out smaller teams. Cloud API access solves that, but you trade some control for convenience.
Author’s note: This is not a sponsored post. I am the author of this article and it expresses my own opinions. I am not, nor is my company, receiving compensation for it. This work was created with the assistance of various generative AI models.