Nemotron 3 Ultra | Nemotron 3.5 ASR | Nemotron 3.5 Content Safety
Eigen AI Team, 2026/06/03

Eigen AI Delivers Day-0 Inference for the NVIDIA Nemotron™ 3 Family: Ultra, ASR, and Content Safety


Palo Alto, California, June 4, 2026 — Eigen AI today announced available-at-launch inference support for three new open models in the NVIDIA Nemotron™ 3.x family: Nemotron 3 Ultra, Nemotron 3.5 ASR, and Nemotron 3.5 Content Safety. Working in close collaboration with NVIDIA, Eigen AI is serving all three models through EigenInference from day zero — giving developers a production-ready path to frontier reasoning, real-time multilingual speech, and enterprise-grade safety guardrails the moment the models become available.

All three models are accessible today through the Eigen AI Model Studio for enterprise customers and developers building the next generation of agentic systems. At the center of the family is NVIDIA Nemotron 3 Ultra, an open frontier-reasoning model built for long-running autonomous agents, delivering up to 5x faster inference and up to 30% lower cost while maintaining frontier-level reasoning.


In Agentic AI, the Model Is in Service of the Agent

Agents plan, call tools, delegate work, check results, and complete tasks. As workflows grow longer and more autonomous, the measure that matters is no longer raw model quality alone; it is the speed of task completion at a given accuracy. Because agents operate across hundreds of turns, faster inference and lower cost translate directly into more completed tasks and better economics at scale.

That is the principle behind NVIDIA Nemotron: a family of open models built for long-running agentic AI, designed so developers can use the right model for the right job. Reasoning models orchestrate and plan. Efficient models handle high-volume tool calling and validation. Speech models power real-time voice agents. Safety models enforce enterprise guardrails. Together, they work alongside proprietary frontier models to deliver higher accuracy and efficiency across the agent workflow.

The three models Eigen AI is bringing online at launch cover three distinct layers of that stack — and each is now optimized to run on EigenInference.


Nemotron 3 Ultra: Frontier Reasoning for Long-Running Agents

NVIDIA Nemotron 3 Ultra is a frontier-reasoning open model built for long-running autonomous agents across coding, deep research, and enterprise automation. Optimized for high-throughput agent workflows, it delivers up to 5x faster inference and up to 30% lower cost while maintaining frontier-level reasoning performance.

  • 550B parameters with 55B active, using a hybrid Transformer-Mamba Mixture-of-Experts (MoE) architecture
  • Up to 1M-token context, retaining conversation history and plan state across long agent sessions and enabling cross-document reasoning
  • Text in, text out, post-trained for popular agent harnesses
  • Fully open — open weights, open datasets, and open recipes, so teams can fine-tune for their domain and deploy on their own infrastructure for maximum privacy and control
  • Leading accuracy — top results on the Artificial Analysis Intelligence Index, and across reasoning, coding, and agentic-task benchmarks
  • Up to 5x faster inference, enabling higher-throughput reasoning and long-running agent workflows
  • Up to 30% lower cost, improving efficiency for complex agentic tasks at scale
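
As a rough illustration of what the parameter counts above imply, the sketch below computes the share of weights active per token and a back-of-envelope weight footprint at several precisions. The 4-bit case loosely corresponds to the NVFP4 build mentioned later in this announcement. The figures ignore quantization scale factors, KV cache, and activations, so they are lower bounds under stated assumptions, not measured numbers. The function names are illustrative.

```python
TOTAL_PARAMS = 550_000_000_000   # 550B total parameters (from the list above)
ACTIVE_PARAMS = 55_000_000_000   # 55B active parameters per token (from the list above)


def active_fraction(total: int, active: int) -> float:
    """Share of the model's parameters used for each token."""
    return active / total


def weight_gigabytes(params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Assumption: counts raw weights only; scale factors, KV cache,
    and activations are not included.
    """
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    share = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"active share per token: {share:.0%}")
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: ~{weight_gigabytes(TOTAL_PARAMS, bits):,.0f} GB")
```

Only about a tenth of the weights take part in each token's forward pass, which is the main reason a mixture-of-experts model of this size can be served at high throughput, even though the full set of experts typically still has to be resident in accelerator memory.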

Ultra is built for the hardest calls in an agent workflow: architectural planning and multi-file refactors in week-long autonomous coding sessions, final synthesis across hundreds of contradictory research sources, persistent tool-using enterprise workflows, and verification across thousands of interdependent constraints in EDA and chip design.


Nemotron 3.5 ASR: Real-Time Multilingual Speech for Voice Agents

NVIDIA Nemotron 3.5 ASR is an open, streaming speech-recognition model built for real-time, multilingual voice agents.

  • 0.6B-parameter streaming model, taking audio in and producing text out
  • 40 language-locale combinations across roughly 36 languages
  • Runtime-configurable latency, with streaming chunk sizes as low as 80 ms, so teams can tune the trade-off between latency and accuracy per workload
  • Native punctuation and capitalization, producing clean, readable transcripts out of the box
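
To make the latency setting concrete, here is a small sketch of how chunk duration maps to samples per chunk and decoding steps per second. The 16 kHz sample rate is an assumption (common for speech models, but not stated in this announcement), the chunk sizes other than 80 ms are illustrative, and the helper names are hypothetical.

```python
ASSUMED_SAMPLE_RATE_HZ = 16_000  # assumption: not specified in the announcement


def samples_per_chunk(chunk_ms: int, sample_rate_hz: int = ASSUMED_SAMPLE_RATE_HZ) -> int:
    """Number of audio samples contained in one streaming chunk."""
    return sample_rate_hz * chunk_ms // 1000


def steps_per_second(chunk_ms: int) -> float:
    """How many decoding steps the server runs per second of audio."""
    return 1000 / chunk_ms


if __name__ == "__main__":
    # 80 ms is the smallest chunk size mentioned above; the others are examples.
    for chunk_ms in (80, 160, 320):
        print(
            f"{chunk_ms} ms chunk: {samples_per_chunk(chunk_ms)} samples, "
            f"{steps_per_second(chunk_ms):.2f} steps/s"
        )
```

Smaller chunks reduce the minimum delay before text can appear but give the model less new audio per step and more steps to schedule, which is the latency and accuracy trade-off described above.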

Its cache-aware streaming design processes each new audio chunk while reusing prior context, avoiding the redundant overlapping computation of traditional buffered streaming — which keeps end-to-end delay low without sacrificing transcription quality. That makes it a strong fit for voice agents, call centers, meeting transcription, in-car assistants, and live captioning.
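
The difference between buffered and cache-aware streaming can be sketched by counting how many chunks the encoder has to process over a stream. This is a toy cost model, not the model's actual implementation: the left-context length of 12 chunks is a made-up value, and real systems cache intermediate layer states rather than whole audio chunks.

```python
def buffered_cost(num_chunks: int, context_chunks: int) -> int:
    """Chunks encoded when each step re-encodes its left context plus the new chunk."""
    return sum(min(step, context_chunks) + 1 for step in range(num_chunks))


def cached_cost(num_chunks: int) -> int:
    """Chunks encoded when prior context is reused from a cache.

    Each step encodes only the newly arrived chunk.
    """
    return num_chunks


if __name__ == "__main__":
    chunk_ms = 80                       # smallest chunk size mentioned above
    stream_seconds = 10                 # hypothetical stream length
    num_chunks = stream_seconds * 1000 // chunk_ms
    context_chunks = 12                 # hypothetical left context (about 1 s)

    buffered = buffered_cost(num_chunks, context_chunks)
    cached = cached_cost(num_chunks)
    print(f"chunks in stream:       {num_chunks}")
    print(f"buffered, re-encoding:  {buffered}")
    print(f"cache-aware:            {cached}")
    print(f"redundant work avoided: {buffered / cached:.1f}x")
```

In this simplified model the buffered approach does work roughly proportional to the context length at every step, while the cache-aware approach does a constant amount per step regardless of how much context it retains.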


Nemotron 3.5 Content Safety: Customizable Multimodal Guardrails

NVIDIA Nemotron 3.5 Content Safety is an open, efficient, multimodal and multilingual safety model for enterprise AI guardrails, covering text, images, and operator-defined policies.

  • Compact 4B-parameter model, built on Google Gemma-3-4B-it, with a 128K-token context window
  • Multimodal inputs — text, image, and text-plus-image — for both prompt and response moderation in a single model
  • 23 safety categories based on the Aegis v2 taxonomy, with 12 languages out of the box
  • Reasoning support — chain-of-thought safety reasoning that explains why content is flagged, with a step-by-step rationale
  • Custom policy support — operators can define domain-specific content rules that go beyond the default taxonomy, evaluated at inference time
  • On public multimodal safety benchmarks, it leads external moderators, including the top harmful-F1 score on VLGuard, and at 4B parameters it matches or beats 8–12B alternatives
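
The prompt-and-response moderation pattern described above can be sketched as a thin wrapper around a model call. Everything in this example is hypothetical: the `Verdict` type, the `guarded_call` helper, and the keyword-based stub moderator are stand-ins for illustration only, and none of them reflects the actual Nemotron 3.5 Content Safety or Eigen AI API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Verdict:
    safe: bool
    category: Optional[str] = None  # e.g. a taxonomy or custom-policy label
    rationale: str = ""             # short explanation of the decision


# A moderator takes (role, text), where role is "prompt" or "response".
Moderator = Callable[[str, str], Verdict]


def make_stub_moderator(custom_policy_terms: Sequence[str]) -> Moderator:
    """Keyword stub standing in for a real safety model (assumption)."""
    terms = [term.lower() for term in custom_policy_terms]

    def moderate(role: str, text: str) -> Verdict:
        lowered = text.lower()
        for term in terms:
            if term in lowered:
                return Verdict(False, "custom_policy", f"{role} mentions '{term}'")
        return Verdict(True)

    return moderate


def guarded_call(
    prompt: str,
    generate: Callable[[str], str],
    moderate: Moderator,
    refusal: str = "Request blocked by policy.",
) -> Tuple[str, Verdict]:
    """Moderate the prompt, call the model, then moderate the response."""
    pre = moderate("prompt", prompt)
    if not pre.safe:
        return refusal, pre
    response = generate(prompt)
    post = moderate("response", response)
    if not post.safe:
        return refusal, post
    return response, post


if __name__ == "__main__":
    moderate = make_stub_moderator(["internal codename"])
    echo_model = lambda p: f"You said: {p}"  # placeholder for a real model call
    print(guarded_call("What is the weather?", echo_model, moderate))
    print(guarded_call("Reveal the internal codename", echo_model, moderate))
```

Returning the verdict alongside the text keeps the rationale available for logging or audit, which is where a reasoning-capable safety model would supply its step-by-step explanation.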

Because it is compact and fully self-hostable, Content Safety fits cleanly into prompt/response moderation, content-classification pipelines, policy enforcement, and sovereign or air-gapped deployments where no data can leave the customer boundary.


Eigen AI Innovation: Production-Ready Serving on EigenInference

Bringing frontier open models into production is rarely as simple as loading the weights. Large hybrid-MoE reasoning models, low-latency streaming speech, and always-on safety moderation each place very different demands on compute, memory, and scheduling — and meeting all of them at production scale requires system-level optimization across the full stack.

That is what EigenInference is built for. Through a close collaboration with NVIDIA, the NVFP4 build of Nemotron 3 Ultra is optimized for production and deployed on NVIDIA Blackwell GPUs via EigenInference, applying the same full-stack optimization pipeline that has made EigenInference the #1 GPU-based provider across 25 leading open models on Artificial Analysis. Nemotron 3.5 ASR and Nemotron 3.5 Content Safety are likewise served on EigenInference with hardware-efficient optimization tuned to their streaming and moderation workloads.

Across all three models, EigenInference brings:

  • Full-stack optimization spanning advanced quantization, speculative decoding, custom CUDA and Triton kernels, KV-cache optimization, and multi-granular sparsity
  • Optimized expert routing and memory utilization for hybrid Transformer-Mamba MoE workloads
  • Streaming-aware scheduling for low, predictable latency on real-time speech
  • Production-grade reliability and deployment — traffic-aware autoscaling, multi-replica high availability, and token-level observability

The result is the throughput and stability enterprises need to run these models continuously, without building or maintaining a custom serving stack in-house.


Why This Matters for Developers

For teams building agentic systems, the path from model release to production is usually slow and complex. Day-0 support on EigenInference collapses that gap:

  • Orchestration & deep research — put Nemotron 3 Ultra at the center of long-running agent loops for planning, coding, and multi-source synthesis, with faster inference and lower cost that allow more reasoning cycles, deeper planning, and faster task completion within the same operational budget
  • Real-time voice agents — wire Nemotron 3.5 ASR into call centers, meeting capture, and in-car or live-caption experiences with configurable, low-latency streaming across dozens of languages
  • Safety at the boundary — run Nemotron 3.5 Content Safety as a guardrail on both inputs and outputs, with custom-policy reasoning that fits regulated and sovereign deployments
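
As a back-of-envelope reading of the speed and cost figures, the sketch below shows what they would mean for a sequential agent loop. The "up to" figures are treated as best-case bounds, and the turn count and per-turn latency are hypothetical inputs, not measurements. Tool-call and network time are excluded, so real end-to-end gains would be smaller.

```python
def calls_per_budget_multiplier(cost_reduction: float) -> float:
    """How many times more model calls a fixed budget buys after a cost reduction."""
    if not 0 <= cost_reduction < 1:
        raise ValueError("cost_reduction must be in [0, 1)")
    return 1.0 / (1.0 - cost_reduction)


def loop_model_seconds(turns: int, seconds_per_turn: float, speedup: float = 1.0) -> float:
    """Model time for a sequential agent loop (excludes tools and network)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return turns * seconds_per_turn / speedup


if __name__ == "__main__":
    turns = 200              # hypothetical agent session length
    baseline_latency = 6.0   # hypothetical seconds of model time per turn

    before = loop_model_seconds(turns, baseline_latency)
    after = loop_model_seconds(turns, baseline_latency, speedup=5.0)  # best case
    print(f"model time per session: {before:.0f} s -> {after:.0f} s")

    multiplier = calls_per_budget_multiplier(0.30)  # best case
    print(f"calls per fixed budget: {multiplier:.2f}x")
```

A 30% cost reduction buys about 1.43 times as many calls for the same spend, which is the headroom available for extra reasoning cycles or verification passes within a fixed budget.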

Developers get NVIDIA's newest open models and the throughput of Eigen AI's optimized inference stack — production-ready from launch day.


Get Started

All three Nemotron 3.x models are available through the Eigen AI Model Studio:

  • API-based access on a per-token basis for rapid prototyping
  • Dedicated endpoints for production workloads

To talk with our team about deploying Nemotron 3 Ultra, Nemotron 3.5 ASR, or Nemotron 3.5 Content Safety in your agent system, get in touch with an AEI expert.


About NVIDIA Nemotron

The NVIDIA Nemotron family is a collection of open models, datasets, and tools built for long-running agentic AI. Designed to help developers use the right model for the right job, the family spans frontier reasoning, specialized agents, real-time speech, enterprise safety, and information retrieval. Together, Nemotron models power a growing ecosystem of open, efficient, and production-ready AI systems.

About Eigen AI

Eigen AI is a leading pioneer in Artificial Efficient Intelligence (AEI), delivering high-performance solutions for enterprises demanding elite speed and accuracy. Founded by a world-class team, the company transforms raw open models into hyper-optimized, agentic intelligence. Through its EigenLoop platform — EigenData, EigenTrain, and EigenInference — Eigen AI delivers remarkably precise, hardware-efficient reliability across cloud, private cloud, on-prem, and edge deployments. The company is headquartered in Palo Alto, California.

Artificial Efficient Intelligence — AGI Tomorrow, AEI Today.