Introduction
Large language models (LLMs) have made remarkable progress in scientific reasoning, yet two persistent bottlenecks remain.
First, explicit retrieval interrupts reasoning, forcing models to pause mid-thought to call external tools — a phenomenon we term the Tool Tax.
Second, multi-agent workflows often average all solutions equally, diluting strong reasoning with weaker paths.
Eigen-1 addresses both with a unified framework combining implicit retrieval and hierarchical collaboration.
It achieves 48.3 % accuracy on Humanity’s Last Exam (HLE) Bio/Chem Gold — the highest reported result — while cutting token usage by 53.5 % and agent steps by 43.7 %.
The system delivers faster, more coherent reasoning across scientific domains and is also validated on SuperGPQA and TRQA.
Code and resources: https://github.com/tangxiangru/Eigen-1

When LLMs confront expert-level problems (e.g., biochemistry or genetics), factual recall alone fails.
Existing retrieval-augmented systems require explicit tool calls: the model stops, queries a database, then rebuilds its context.
This stop-and-resume cycle introduces the Tool Tax — lost tokens, latency, and broken logical flow.
Our analysis on HLE Bio/Chem revealed that knowledge gaps and reasoning failures co-occur in 85 % of cases, showing that knowledge integration and reasoning continuity must be solved together.

Architecture Overview
Eigen-1 is built atop SGLang-compatible serving infrastructure and integrates three core modules:
1. Monitor-based RAG: a token-level retrieval system that detects semantic uncertainty in real time and injects external evidence without pausing reasoning.
2. Hierarchical Solution Refinement (HSR): instead of democratic averaging, Eigen-1 introduces anchor–reference refinement. Each candidate solution temporarily acts as an anchor and is improved by targeted repairs from peer references, which fix logic gaps, numerical slips, or methodological flaws.
3. Quality-Aware Iterative Reasoning (QAIR): a dynamic evaluation loop that scores logic, correctness, and explanation on a 0–5 scale. Low-scoring solutions trigger automatic corrections until convergence, ensuring adaptive improvement without endless iterations.
Together, these modules form a continuous reasoning-retrieval loop that keeps thought coherence while improving factual precision.
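The refinement and evaluation steps above can be illustrated with a minimal sketch. It is not Eigen-1's implementation: the `repair` and `score` callables, the threshold of 4.0, and the three-round cap are all assumptions chosen for illustration, and the toy functions at the bottom stand in for what would be LLM-backed agents.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins for LLM-backed agents (not part of any Eigen-1 API).
Repair = Callable[[str, List[str]], str]  # (anchor, references) -> repaired anchor
Score = Callable[[str], float]            # solution text -> quality in [0, 5]


@dataclass
class Solution:
    text: str
    score: float


def anchor_reference_refine(texts: List[str], repair: Repair) -> List[str]:
    """Let each candidate act as the anchor once, with its peers as references."""
    return [
        repair(anchor, texts[:i] + texts[i + 1:])
        for i, anchor in enumerate(texts)
    ]


def quality_loop(candidates: List[str], repair: Repair, score: Score,
                 threshold: float = 4.0, max_rounds: int = 3) -> Solution:
    """Re-refine low-scoring solutions until all pass or the round cap is hit."""
    solutions = [Solution(c, score(c)) for c in candidates]
    for _ in range(max_rounds):  # the cap prevents endless iteration
        if all(s.score >= threshold for s in solutions):
            break
        refined = anchor_reference_refine([s.text for s in solutions], repair)
        # Keep solutions that already pass; replace only the low scorers.
        solutions = [
            s if s.score >= threshold else Solution(r, score(r))
            for s, r in zip(solutions, refined)
        ]
    return max(solutions, key=lambda s: s.score)


# Toy agents: a solution is a "; "-separated list of steps.
REQUIRED = ["define theta", "use diploid factor 4", "plug in Ne",
            "plug in mu", "compute"]


def toy_score(text: str) -> float:
    # One point per required step present, so the maximum is 5.
    return float(sum(step in text for step in REQUIRED))


def toy_repair(anchor: str, references: List[str]) -> str:
    # Borrow any step a reference has that the anchor lacks.
    steps = anchor.split("; ")
    for ref in references:
        for step in ref.split("; "):
            if step not in steps:
                steps.append(step)
    return "; ".join(steps)


best = quality_loop(
    ["define theta; plug in Ne; compute", "use diploid factor 4; plug in mu"],
    toy_repair,
    toy_score,
)
print(best.score, best.text)
```

The point of the design is that peers contribute targeted repairs to an anchor rather than being averaged with it, and that only solutions scoring below the threshold are revised again.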
Traditional LLMs often misremember formulas (e.g., confusing θ = 2Nₑμ with θ = 4Nₑμ) or fail to reintegrate retrieved facts.
Eigen-1’s Monitor-Based RAG automatically detects such uncertainty, retrieves the correct relation, and re-injects it seamlessly — solving the problem in one coherent reasoning stream.
In another example, a haplotype-counting task across F₁ → F₃ generations required understanding recombination constraints.
Eigen-1’s Monitor triggered retrieval (“maximum number of recombination change-points”) and injected key facts (“at most one breakpoint per gamete”).
The reasoning then converged correctly on 30 unique haplotypes.
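The detect, retrieve, and inject cycle in these examples can be sketched roughly as follows. This is a simplified illustration under several assumptions: it works per reasoning step rather than per token, it uses a keyword heuristic in place of real semantic-uncertainty detection, and the names `monitor`, `querier`, `injector`, and the `KNOWLEDGE` table are hypothetical.

```python
from typing import Dict, List, Optional

# Toy knowledge base (assumption); a real system would query a retrieval index.
KNOWLEDGE: Dict[str, str] = {
    "theta": "For a diploid population, theta = 4 * Ne * mu.",
    "recombination": "At most one breakpoint per gamete.",
}

# Keyword heuristic standing in for semantic-uncertainty detection (assumption).
UNCERTAINTY_MARKERS = ("not sure", "i think", "if i recall")


def monitor(step: str) -> bool:
    """Flag a reasoning step that sounds uncertain."""
    lowered = step.lower()
    return any(marker in lowered for marker in UNCERTAINTY_MARKERS)


def querier(step: str) -> Optional[str]:
    """Pick a knowledge-base topic mentioned in the step, if any."""
    lowered = step.lower()
    for topic in KNOWLEDGE:
        if topic in lowered:
            return topic
    return None


def injector(step: str, fact: str) -> str:
    """Attach the evidence inline so the reasoning stream is not restarted."""
    return f"{step} [evidence: {fact}]"


def reason_with_monitor(steps: List[str]) -> List[str]:
    """Pass steps through unchanged unless the monitor triggers a retrieval."""
    output = []
    for step in steps:
        if monitor(step):
            topic = querier(step)
            if topic is not None:
                step = injector(step, KNOWLEDGE[topic])
        output.append(step)
    return output


trace = reason_with_monitor([
    "I think theta is 2 * Ne * mu for this diploid species.",
    "Count the haplotypes produced in each generation.",
])
for line in trace:
    print(line)
```

Only the first step is flagged, so the second passes through untouched. The evidence is appended to the existing step rather than triggering a separate tool call and a context rebuild.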

Experimental Results
| Benchmark | Accuracy (%) ↑ | Token Usage ↓ | Step Count ↓ | Base Model |
|---|---|---|---|---|
| HLE Bio/Chem Gold | 48.3 | −53.5 % | −43.7 % | DeepSeek V3.1 |
| SuperGPQA (Biology Hard) | 69.6 | — | — | DeepSeek V3.1 |
| TRQA (Literature) | 54.7 | — | — | DeepSeek V3.1 |

These results confirm that implicit augmentation and hierarchical refinement improve both accuracy and efficiency.

| Configuration | Accuracy (%) | Tokens (K) | Steps |
|---|---|---|---|
| Baseline (no RAG) | 25.3 | 483.6 | 43.4 |
| Explicit RAG | 41.4 | 470.6 | 94.8 |
| + Monitor + Querier + Injector | 40.3 | 229.5 | 53.1 |
| + HSR + QAIR (Full Eigen-1) | 48.3 | 218.9 | 53.4 |
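
Key insight: explicit retrieval more than doubles reasoning steps (43.4 to 94.8) while barely reducing token usage.
Monitor-based retrieval reaches comparable accuracy (40.3% vs. 41.4%) with about half the tokens and 44% fewer steps, and adding HSR and QAIR lifts accuracy to 48.3% at a similar cost.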
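
The headline efficiency figures can be checked against the ablation table. The short calculation below assumes the reported reductions are measured against the Explicit RAG row, which is the reference that reproduces the published numbers. The dictionary and function names are illustrative.

```python
# Rows copied from the ablation table: (accuracy %, tokens in K, steps).
ABLATION = {
    "Baseline (no RAG)": (25.3, 483.6, 43.4),
    "Explicit RAG": (41.4, 470.6, 94.8),
    "Monitor + Querier + Injector": (40.3, 229.5, 53.1),
    "Full Eigen-1": (48.3, 218.9, 53.4),
}


def pct_reduction(reference: float, value: float) -> float:
    """Percentage by which `value` is lower than `reference`, to one decimal."""
    return round(100.0 * (reference - value) / reference, 1)


_, ref_tokens, ref_steps = ABLATION["Explicit RAG"]
_, full_tokens, full_steps = ABLATION["Full Eigen-1"]

token_cut = pct_reduction(ref_tokens, full_tokens)
step_cut = pct_reduction(ref_steps, full_steps)

# How much explicit retrieval inflates step count relative to no retrieval.
step_inflation = round(ref_steps / ABLATION["Baseline (no RAG)"][2], 2)

print(f"token reduction: {token_cut}%")
print(f"step reduction:  {step_cut}%")
print(f"explicit RAG steps vs. baseline: {step_inflation}x")
```

Measured against the no-RAG baseline instead, the full system would show more steps, not fewer, which is why the choice of reference matters when reading the table.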

Furthermore, analysis of multi-agent diversity shows that
Implications and Future Work
Eigen-1 demonstrates that architectural design, not scale alone, drives progress in scientific reasoning.
By embedding retrieval into continuous generation and structuring multi-agent collaboration hierarchically, the system achieves both efficiency and transparency.
Future directions include:
Eigen-1 was developed by the Eigen AI Research Team in collaboration with researchers from Yale University, Shanghai Jiao Tong University, Fudan University, UCLA, Oxford, and Shanghai AI Lab.