Inference Optimization
Eigen AI Team · 2026/03/15

Eigen AI Achieves #1 Speed Across 25 Leading Open Models on Artificial Analysis

In November 2025, we announced that Eigen AI's serving APIs were officially listed on Artificial Analysis with record-breaking throughput for our first set of models. In February 2026, we expanded that leadership to 11 models. Today, we're proud to share that Eigen AI now holds the #1 output speed across 25 of the most widely used open-source models tracked by Artificial Analysis — more than doubling our footprint in just six weeks.

This milestone reflects the depth and breadth of EigenInference’s full-stack optimization across every major model family, architecture type, and workload category in the open-source ecosystem.

25 Models, One Consistent Result: Fastest in Class

Our infrastructure delivers peak output throughput (tokens/sec) across a massive diversity of architectures — from dense language models and Mixture-of-Experts (MoE) to vision-language models, reasoning specialists, and coding-optimized variants. On Artificial Analysis, Eigen AI consistently ranks as the #1 GPU-based provider* for all 25 models listed below.

| Model | Eigen Output Speed (tokens/sec) | Workload |
| --- | --- | --- |
| GPT-OSS-120B (high) | 913 | General |
| GPT-OSS-120B (low) | 738 | General |
| Kimi K2.5 (Non-reasoning) | 401 | General |
| GLM-5 (Non-reasoning) | 200 | General |
| DeepSeek V3.2 (Reasoning) | 80 | Reasoning |
| DeepSeek V3.1 (Reasoning) | 279 | Reasoning |
| DeepSeek V3.1 Terminus (Reasoning) | 143 | Reasoning |
| DeepSeek V3.1 Terminus | 140 | General |
| Llama-4 Maverick | 338 | General |
| Llama-4 Scout | 506 (1k coding) | Coding |
| Llama-3.3-70B | 294 | General |
| Llama-3.1-8B | 764 (1k coding) | Coding |
| Qwen3.5 397B A17B (Reasoning) | 142 | Reasoning |
| Qwen3.5 397B A17B (Non-reasoning) | 149 | General |
| Qwen3 Coder 480B | 250 (10k general) / 374 (1k coding) | General / Coding |
| Qwen3 235B A22B 2507 (Reasoning) | 178 | Reasoning |
| Qwen3 235B A22B 2507 Instruct | 320 | General |
| Qwen3 Next 80B A3B (Reasoning) | 322 | Reasoning |
| Qwen3-VL 235B A22B | 81 | Vision-Language |
| Qwen3-VL 30B A3B (Reasoning) | 255 | Vision-Language / Reasoning |
| Qwen3-VL 30B A3B (Non-reasoning) | 262 | Vision-Language |
| Qwen3 30B A3B (Reasoning) | 259 | Reasoning |
| Qwen3 30B A3B (Non-reasoning) | 269 | General |
| Qwen3 8B (Reasoning) | 360 | General |
| Qwen3 8B (Non-reasoning) | 384 | General |

* #1 speed across GPU providers, excluding ASIC providers, under default Artificial Analysis settings (10K input) unless noted otherwise. Benchmarks as of March 15, 2026.

Eigen AI Output Speed — 25 Top #1 Speed Models

Figure 1: Eigen AI ranks as the #1 GPU-based provider for all 25 models on Artificial Analysis.

GPT-OSS-120B-High Output Speed

Figure 2: Eigen AI achieves 913 output tokens per second for GPT-OSS-120B (high) on Artificial Analysis.

Qwen3 Coder 480B Output Speed

Figure 3: Eigen AI achieves 250 output tokens per second for Qwen3 Coder 480B on Artificial Analysis.

From 11 to 25: What Changed

The rapid expansion from 11 to 25 #1 rankings wasn’t just about adding more models to our catalog — it reflects continued investment across the full EigenInference optimization stack:

Broader architecture coverage. We extended our optimizations to the latest model releases, including DeepSeek V3.2, GLM-5, Kimi-K2.5, MiniMax-M2.5, Qwen3.5, Qwen3, and the Llama family. Each architecture brings unique challenges — new attention mechanisms, larger expert counts, hybrid reasoning modes — and our team has delivered leading throughput on every one of them.

Deeper MoE optimization. Mixture-of-Experts models like Qwen3 Coder 480B, Qwen3.5 397B, and DeepSeek V3.2 require careful expert routing, GPU scheduling, and KV cache and memory management to perform well in production. Our custom CUDA and Triton kernels, combined with advanced speculative decoding and quantization, deliver consistent speed advantages on these architectures.
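To make the routing step concrete, here is a minimal top-k expert router in plain Python. This is an illustrative sketch of standard MoE gating, not EigenInference code, and the router logits are made-up values.

```python
import math

def top_k_route(logits, k=2):
    """Pick the top-k experts for one token and renormalize their
    gate weights with a softmax over only the selected experts."""
    # Indices of the k highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the chosen experts' logits.
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# One token's router scores over 4 experts (hypothetical numbers).
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
# The token is dispatched only to the two winning experts, each
# weighted by its renormalized gate probability.
```

In production the hard part is not this arithmetic but batching tokens per expert and overlapping the resulting all-to-all communication, which is where custom kernels and scheduling pay off.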

Reasoning and multi-modal breadth. We now hold #1 rankings across general, reasoning, coding, and vision-language workloads — demonstrating that EigenInference is not tuned for a single modality but optimized across the full spectrum of production use cases.

The EigenInference Engine: Why We’re Consistently #1

Our performance leadership is the result of deep, full-stack optimization at every layer. The key pillars of the EigenInference engine include:

Advanced Quantization. Our pioneering NVFP4 and FP3 quantization methods, combined with system-level optimizations, match or exceed FP16/BF16 accuracy while delivering the throughput benefits of low-precision compute. This includes high-fidelity PTQ (Post-Training Quantization) and QAT (Quantization-Aware Training).
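The core idea behind block-scaled FP4 formats can be sketched in a few lines: each block of values shares one scale, and each value is rounded to the nearest entry of the small E2M1 grid. This is a simplified illustration of the general technique, not our production NVFP4 path, and the input values are made-up.

```python
import math

# Positive representable magnitudes of the E2M1 (FP4) format.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize-dequantize one block with a shared scale (fake quant)."""
    # Choose the scale so the block's largest magnitude maps to 6.0,
    # the top of the FP4 grid.
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0
    out = []
    for x in block:
        # Round the scaled magnitude to the nearest grid point.
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out

vals = [0.03, -0.6, 0.31, 0.0]
deq = quantize_block(vals)  # each weight now occupies 4 bits plus a shared scale
```

The round trip shows the trade-off directly: large values survive nearly intact, while small values absorb most of the rounding error, which is why per-block scales (rather than one tensor-wide scale) matter so much at 4 bits.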

Speculative Decoding & Custom Kernels. We leverage state-of-the-art auto-regressive methods (such as Eagle-3) and diffusion-based methods (such as DFlash) to predict future tokens and accelerate generation. Our hand-optimized CUDA and Triton kernels for GEMM and Attention are purpose-built for NVFP4, extracting maximum performance from the hardware.
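The draft-and-verify pattern behind speculative decoding can be shown with toy stand-in models. Real systems accept draft tokens probabilistically against the target distribution; this sketch uses exact-match acceptance with deterministic toy "models" purely to illustrate the control flow.

```python
def speculative_step(prefix, draft, target, k=4):
    """One speculative step: the cheap draft proposes k tokens, the
    target keeps the longest agreeing prefix, then adds its own token."""
    # Draft phase: propose k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # Verify phase: accept draft tokens while the target agrees.
    accepted, ctx = [], list(prefix)
    for t in proposed:
        if target(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    # The target always contributes one token of its own, so even a
    # fully rejected draft still makes progress.
    accepted.append(target(ctx))
    return accepted

# Toy deterministic models: "next token = last token + 1".
result_agree = speculative_step([0], draft=lambda c: c[-1] + 1,
                                target=lambda c: c[-1] + 1, k=4)
# A draft that always disagrees still yields one target token per step.
result_reject = speculative_step([0], draft=lambda c: 9,
                                 target=lambda c: c[-1] + 1, k=4)
```

When the draft agrees, one verification pass emits k+1 tokens for roughly the cost of a single target forward pass over the proposed positions, which is where the speedup comes from.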

KV Cache Optimization. Long-context generation is often bottlenecked by memory bandwidth. We address this with NVFP4 KV cache quantization and intelligent cache pruning, enabling larger batch sizes and longer sequences without sacrificing speed.
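The memory savings come from storing KV entries as narrow integer or float codes plus a shared scale, and dequantizing on read. The sketch below uses int8 with a per-row scale for simplicity; it illustrates the general scale-and-round scheme rather than the NVFP4 format itself, and the values are made-up.

```python
def kv_quantize(row):
    """Store one KV-cache row as int8 codes plus one float scale."""
    amax = max(abs(x) for x in row) or 1.0
    scale = amax / 127.0
    codes = [round(x / scale) for x in row]  # each code fits in 8 bits
    return codes, scale

def kv_dequantize(codes, scale):
    """Reconstruct approximate values on read."""
    return [c * scale for c in codes]

k_row = [0.12, -0.5, 0.33]          # a hypothetical key vector slice
codes, scale = kv_quantize(k_row)
recon = kv_dequantize(codes, scale)
# Per value: 8 bits instead of 16 (FP16), halving KV-cache bandwidth;
# 4-bit formats halve it again at the cost of coarser rounding.
```

Because long-context decoding is memory-bandwidth bound, shrinking each cached value directly translates into larger batches and longer sequences at the same speed.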

Multi-Granular Sparsity. EigenInference employs a multi-level sparsity strategy — architectural sparsity (pruning channels, heads, experts, and layers), hardware-aware N:M weight sparsity for Tensor Core acceleration, and dynamic token-level sparsity (token dropping and merging) to reduce the quadratic cost of attention in real-time.
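The hardware-aware N:M pattern is simple to state: in every group of M consecutive weights, keep only the N largest magnitudes. Here is an illustrative 2:4 pruner, the pattern sparse Tensor Cores accelerate, with made-up weights.

```python
def prune_2_of_4(weights):
    """Enforce 2:4 sparsity: keep the 2 largest-magnitude weights in
    every group of 4 and zero the other 2."""
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest magnitudes in this group.
        keep = sorted(range(len(group)),
                      key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out

row = [0.9, -0.1, 0.05, -1.2, 0.3, 0.02, -0.4, 0.01]
sparse = prune_2_of_4(row)  # exactly two zeros in every group of four
```

Because the pattern is fixed (2 of every 4), the hardware can skip the zeroed multiplications with a compact metadata index, which is what makes this form of sparsity a throughput win rather than just a storage win.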

System-Level Runtime Optimizations. Continuous batching, parallel computing, graph capture, and CPU offloading ensure maximum resource utilization across every GPU cycle.
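Continuous batching is the easiest of these to sketch: when a sequence finishes, its batch slot is refilled from the queue on the very next step instead of waiting for the whole batch to drain. The scheduler below is a toy single-threaded model of that idea, with hypothetical request lengths; real engines do this per decode step on the GPU.

```python
from collections import deque

def serve(requests, batch_size):
    """Toy continuous-batching loop; requests are (id, tokens_remaining)."""
    queue = deque(requests)
    slots = {}                       # slot index -> [request_id, remaining]
    completed = []
    while queue or slots:
        # Refill any free slots from the queue (the "continuous" part).
        for s in range(batch_size):
            if s not in slots and queue:
                rid, n = queue.popleft()
                slots[s] = [rid, n]
        # One decode step: every active sequence emits one token.
        for s in list(slots):
            slots[s][1] -= 1
            if slots[s][1] == 0:
                completed.append(slots[s][0])
                del slots[s]         # slot becomes reusable next step
    return completed

done = serve([("a", 2), ("b", 5), ("c", 1), ("d", 2)], batch_size=2)
# Short requests "a" and "c" finish and free their slot while the long
# request "b" is still decoding, so the batch never sits half-empty.
```

Compared with static batching, no GPU cycles are spent padding short sequences out to the length of the longest one, which is most of the utilization win on mixed workloads.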

Production-Grade Reliability at Scale

Speed is only meaningful when it comes with stability. Every Eigen AI endpoint is backed by production-grade deployment infrastructure through EigenDeploy:

  • Traffic-aware autoscaling that adjusts to real-time demand
  • Hot restarts and multi-replica high availability to eliminate downtime
  • Token-level traces and real-time metrics for full observability
  • Flexible deployment across serverless, on-demand, and dedicated instances

What’s Next

We’re continuing to expand our model coverage and deepen our optimization stack. As new frontier open models are released — with increasingly complex architectures, larger expert counts, and multi-modal capabilities — Eigen AI will be there on day one with production-ready, #1-speed implementations.

Experience top-tier inference for your own applications.

Artificial Efficient Intelligence — AGI Tomorrow, AEI Today.