Function calling enables LLM agents to interact with tools, APIs, and external environments. As LLM agents are increasingly deployed in real-world applications, the reliability of function calling systems has become a critical bottleneck. However, existing function calling datasets suffer from systemic issues in environment consistency, tool reliability, and evaluation correctness.
In this post, we introduce EigenData, a self-evolving system for generating, auditing, and repairing function calling data. Applied to real-world benchmarks like BFCL, EigenData reveals that 71.5% of samples contain critical errors and provides a system-level solution to fix them.
Read more: https://arxiv.org/abs/2603.05553v1
EigenData CLI:
https://docs.eigenai.com/products/eigendata-cli/intro
Unlike standard language modeling, a function-calling data sample involves structured inputs (user queries, tool definitions, and environment state) and multi-step outputs (assistant actions, tool calls, and state transitions).
This introduces coupled sources of complexity across the environment, the tools, and the interaction itself, making function-calling datasets fundamentally harder to construct and evaluate than standard LLM training data.
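To make the shape of such a sample concrete, here is a minimal sketch in Python. The field names (`tool_definitions`, `initial_state`, `turns`, and so on) are illustrative assumptions, not the actual EigenData or BFCL format:

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical schema for illustration; field names are not from the paper.
@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]

@dataclass
class Turn:
    user_query: str
    tool_calls: list[ToolCall]      # multi-step assistant actions
    state_after: dict[str, Any]     # environment state after this turn

@dataclass
class FunctionCallingSample:
    tool_definitions: list[dict[str, Any]]  # structured input: tool schemas
    initial_state: dict[str, Any]           # structured input: environment state
    turns: list[Turn]                       # multi-step outputs

sample = FunctionCallingSample(
    tool_definitions=[{"name": "add_item", "parameters": {"item": "string"}}],
    initial_state={"cart": []},
    turns=[Turn(
        user_query="Add milk to my cart",
        tool_calls=[ToolCall("add_item", {"item": "milk"})],
        state_after={"cart": ["milk"]},
    )],
)
```

Note that correctness of a sample depends on all three parts staying mutually consistent: the calls must match the tool schemas, and the state transitions must match what the tools actually do.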
Empirically, we observe that existing datasets suffer from systemic issues spanning environment consistency, tool reliability, and evaluation correctness.
Notably, when applied to the Berkeley Function-Calling Leaderboard (BFCL), EigenData reveals that 71.5% of samples contain critical issues affecting correctness or evaluation.

These issues can mislead model selection and obscure genuine progress.
EigenData formulates data generation as a system-level problem, rather than a prompt engineering task.
The system consists of three interacting components, a DatabaseAgent, a CodingAgent, and a DataAgent, coordinated by a central controller, EigenCore:
Each component is responsible for a distinct layer of the data generation pipeline.

The DatabaseAgent generates structured environments that serve as ground truth, ensuring that generated tasks are grounded in valid and diverse state spaces.
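A minimal sketch of what "valid and diverse state spaces" can mean in practice: generated states satisfy their own integrity constraints by construction. The schema here (users and orders with a referential constraint) is a made-up example, not EigenData's actual environment format:

```python
import random

# Illustrative sketch of ground-truth environment generation; the real
# DatabaseAgent's schemas and constraints are not specified here.
def validate(env: dict) -> bool:
    # Integrity constraint: every order must reference an existing user.
    users = set(env["users"])
    return all(o["owner"] in users for o in env["orders"])

def generate_environment(seed: int) -> dict:
    rng = random.Random(seed)  # seeded for reproducible ground truth
    users = [f"user_{i}" for i in range(rng.randint(2, 5))]
    env = {
        "users": users,
        "orders": [{"id": i, "owner": rng.choice(users)}
                   for i in range(rng.randint(1, 4))],
    }
    assert validate(env), "generated state must satisfy its own constraints"
    return env
```

Seeding by sample index gives diversity across samples while keeping each environment reproducible and checkable.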
The CodingAgent generates executable tool environments. Crucially, it operates in a closed-loop debugging process, which guarantees that tool behavior is functionally correct and verifiable, rather than merely syntactically plausible.
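The closed-loop idea can be sketched as a generate-execute-repair cycle. In this sketch, `revise` stands in for an LLM call that rewrites the code given test failures; the loop structure, not the stub, is the point:

```python
# Sketch of a closed-loop debug cycle, assuming a `revise` callback that
# proposes a new implementation from the current source and its failures.
def run_tests(namespace: dict, tests) -> list[str]:
    failures = []
    for name, fn in tests:
        try:
            fn(namespace)
        except Exception as e:
            failures.append(f"{name}: {e}")
    return failures

def debug_loop(source: str, tests, revise, max_iters: int = 5) -> str:
    for _ in range(max_iters):
        namespace: dict = {}
        exec(source, namespace)          # build the candidate tool
        failures = run_tests(namespace, tests)
        if not failures:
            return source                # functionally verified
        source = revise(source, failures)
    raise RuntimeError("tool could not be repaired within budget")
```

The acceptance criterion is execution, not plausibility: a tool only leaves the loop once its tests actually pass.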

The DataAgent generates multi-turn interaction trajectories. Beyond generation, it also evaluates and iteratively refines them.
To improve data quality, EigenData employs a self-evolving strategy that allows the system to adaptively refine its generation strategies over time.
Existing benchmarks for function-calling agents predominantly rely on action-level evaluation, comparing predicted tool calls against a reference trajectory. This formulation implicitly assumes a single canonical trajectory, which does not hold in real-world systems where multiple execution paths can yield the same valid result.
As a consequence, current evaluation protocols can penalize valid alternative execution paths while failing to reflect genuine task success.
EigenData instead adopts an outcome-based evaluation paradigm, where correctness is defined over the resulting environment state rather than the executed action sequence. This shift decouples correctness from specific execution traces and aligns evaluation with task-level objectives.
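The distinction can be shown in a few lines: two trajectories that took different action orders both pass an outcome check because they reach the same final state. The canonicalization step is a hypothetical detail; real state comparison would depend on the environment's semantics:

```python
# Outcome-based check: correctness is defined over the final environment
# state, not the action trace that produced it. A minimal sketch; real
# comparison may need richer canonicalization than shown here.
def canonical(state: dict) -> dict:
    # Example assumption: list-valued fields are unordered collections.
    return {k: sorted(v) if isinstance(v, list) else v
            for k, v in state.items()}

def outcome_correct(final_state: dict, expected_state: dict) -> bool:
    return canonical(final_state) == canonical(expected_state)

# Two valid trajectories that added the same items in different orders:
trace_a = {"cart": ["milk", "eggs"]}
trace_b = {"cart": ["eggs", "milk"]}
expected = {"cart": ["milk", "eggs"]}
```

An action-level matcher would mark `trace_b`'s trajectory wrong for deviating from the reference order; the outcome check accepts both.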
EigenData formulates data generation as a closed-loop system that integrates synthesis, validation, and repair.
The pipeline operates iteratively, alternating between generating data, validating it, and repairing detected inconsistencies.

Rather than regenerating full trajectories, EigenData performs targeted, agent-driven repair through verification–modification loops, enabling localized fixes to schemas, implementations, and trajectory segments.
At each iteration, detected inconsistencies are attributed to one of three sources, the environment schema, the tool implementation, or the trajectory itself, and the corresponding component is then updated.
This process induces a self-improving feedback loop, where system components co-evolve to reduce error rates over time.
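The attribute-then-dispatch step can be sketched as follows. The rule-based classifier and the repair stubs are stand-ins for illustration; the error categories (schema, implementation, trajectory) follow the three sources named in the post:

```python
from typing import Callable

# Sketch of the verification-modification loop: each inconsistency is
# attributed to a source and routed to a localized repair, rather than
# triggering full regeneration. Repairers here are illustrative stubs.
REPAIRERS: dict[str, Callable[[dict], dict]] = {
    "schema": lambda s: {**s, "schema_fixed": True},
    "implementation": lambda s: {**s, "impl_fixed": True},
    "trajectory": lambda s: {**s, "traj_fixed": True},
}

def attribute(error: str) -> str:
    # Hypothetical rule-based attribution; a real system could use an
    # agent or richer diagnostics to classify the failure.
    if "type mismatch" in error:
        return "schema"
    if "exception" in error:
        return "implementation"
    return "trajectory"

def repair(sample: dict, errors: list[str]) -> dict:
    for e in errors:
        sample = REPAIRERS[attribute(e)](sample)
    return sample
```

Routing each error to the smallest component that can fix it is what keeps repair cheap relative to regenerating whole trajectories.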
As a result, over successive iterations the system effectively bootstraps its own data quality, reducing reliance on manual curation.
We instantiate EigenData on the BFCL-V3 benchmark to study its ability to audit, diagnose, and repair real-world datasets.
Rather than assuming dataset correctness, EigenData treats existing benchmarks as imperfect artifacts and performs structured analysis across multiple dimensions, auditing environment states, tool implementations, and interaction trajectories.
This process reveals that dataset errors are not isolated, but systematically distributed across components, spanning schemas, tool implementations, and trajectories.
To address these issues, EigenData applies a component-wise repair pipeline coordinated by EigenCore.
Crucially, repairs are propagated across components to maintain cross-system consistency, ensuring that updates to schemas, implementations, and trajectories remain aligned.
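One concrete cross-component consistency check is that every call in a trajectory must match a declared tool schema. The field names below are illustrative, not BFCL's actual format:

```python
# Sketch of one cross-component audit: each tool call in a trajectory must
# reference a declared tool and supply exactly its declared parameters.
def audit_sample(tool_defs: list[dict], trajectory: list[dict]) -> list[str]:
    schemas = {t["name"]: set(t["parameters"]) for t in tool_defs}
    issues = []
    for i, call in enumerate(trajectory):
        if call["name"] not in schemas:
            issues.append(f"step {i}: undeclared tool {call['name']!r}")
        elif set(call["arguments"]) != schemas[call["name"]]:
            issues.append(f"step {i}: argument mismatch for {call['name']!r}")
    return issues
```

Checks like this are what make propagation necessary: if a repair renames a schema parameter, every trajectory that calls the tool must be updated in the same pass or the audit will flag it again.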
After repair, the benchmark exhibits substantially improved consistency and correctness.
EigenData suggests a shift in how data is conceptualized in agent systems.
Rather than treating data as a static artifact, it should be viewed as a system-level object co-defined by the environment, the tools, and the interaction trajectories that traverse them. Under this view, data generation becomes a closed-loop systems problem, requiring synthesis, validation, and repair to operate in concert.
This perspective aligns data construction with the same principles used in system design: correctness, composability, and feedback-driven improvement. These improvements are not just theoretical. In practice, they translate into significantly more stable and effective post-training for interactive agents.
We demonstrate this in our work on Reliable Post-Training for Interactive Tool-Using Agents, where self-evolving data and verifiable rewards lead to large gains on real-world benchmarks.
This formulation has several implications for production ML systems.
First, data infrastructure becomes a primary scaling bottleneck, rather than a secondary concern. Improving model performance requires not only better architectures, but also tighter control over data generation and validation pipelines.
Second, evaluation must shift from trace-level supervision to outcome-level verification, ensuring that metrics reflect task success rather than intermediate behavior.
Finally, agent performance is fundamentally tied to environment fidelity. Without accurate and executable environments, improvements in model capability do not reliably translate into real-world performance.
EigenData formulates function-calling data generation as a self-evolving system that jointly models environments, tools, and interaction trajectories. By integrating these components into a unified, feedback-driven pipeline, EigenData enables scalable generation, auditing, and repair of high-quality function-calling data.
This reframing positions data not as a byproduct of model development, but as a central object of system design.
Taken together, these system-level improvements extend beyond data quality itself, shaping how interactive agents can be reliably trained and evaluated in practice. In particular, they enable more stable post-training and verifiable optimization in realistic deployment settings, as we further explored in Reliable Post-Training for Interactive Tool-Using Agents.
Resources:
To make EigenData accessible to practitioners, we have released a command-line interface (CLI) that exposes the platform’s core capabilities—including data generation, schema refinement, auditing, and repair—through a unified, scriptable workflow.
The CLI and its documentation are available at: https://docs.eigenai.com/products/eigendata-cli/intro
Additional evaluation results: https://arxiv.org/abs/2603.05553v1