While recent Vision-Language Models (VLMs) have demonstrated impressive capabilities in describing images, they often struggle when asked to “think” with them—especially when solving multi-step problems that require external tools like calculators or object detectors.
Today, we are introducing VISTA-Gym, a scalable training environment designed to incentivize tool-integrated reasoning, and VISTA-R1, a model trained within this environment that significantly outperforms state-of-the-art baselines.
VISTA-R1-8B outperforms comparable open-source models by 9.51%–18.72% across 11 reasoning-intensive benchmarks, demonstrating that reinforcement learning (RL) is key to unlocking true agentic capabilities in VLMs.

Current approaches to equipping VLMs with tools often rely on simple prompting or supervised fine-tuning. However, our research shows a counter-intuitive result: directly augmenting VLMs with tools often degrades accuracy.
Without explicit training on when and how to use tools, models treat tool outputs as distractors: they struggle to decide when a tool call will actually help and how to fold its output back into the reasoning chain.
We concluded that solving this requires models to transition from passive perception to active, tool-integrated reasoning (TIR).
To bridge this gap, we built VISTA-Gym, a high-throughput RL environment compatible with Ray and Gymnasium APIs.
Key features of VISTA-Gym include high-throughput, parallel rollout collection on Ray and a standard Gymnasium-style reset/step interface.
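To make the interaction contract concrete, here is a minimal, self-contained sketch of what a Gymnasium-style tool environment can look like. Everything in it is illustrative: the class name, the single fake detector tool, and the text observation/action spaces are our assumptions, not the actual VISTA-Gym API.

```python
import gymnasium as gym
from gymnasium import spaces

class ToolEnvSketch(gym.Env):
    """Minimal sketch of a tool-integrated reasoning environment.

    Observations are text (the question plus any tool outputs so far);
    actions are strings that either invoke a tool or give a final answer.
    The class name, the fake detector tool, and the text spaces are all
    illustrative assumptions, not the actual VISTA-Gym API.
    """

    def __init__(self):
        self.observation_space = spaces.Text(max_length=4096)
        self.action_space = spaces.Text(max_length=1024)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._obs = "Q: How many cats are in the image?"
        return self._obs, {}

    def step(self, action: str):
        if action.startswith("CALL detector"):
            # A real environment would run an object detector here.
            self._obs += "\nTOOL: detector -> 3 cats"
            return self._obs, 0.0, False, False, {}  # no reward mid-episode
        # Anything else is treated as a final answer; the reward is sparse.
        reward = 1.0 if "3" in action else 0.0
        return self._obs, reward, True, False, {}

env = ToolEnvSketch()
obs, _ = env.reset(seed=0)
obs, r, term, trunc, _ = env.step("CALL detector on the image")
obs, r, term, trunc, _ = env.step("ANSWER: 3")
print(r)  # 1.0
```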
VISTA-R1 is trained with a two-stage recipe designed to instill a “Think-Before-You-Act” pattern: supervised fine-tuning for tool syntax, followed by online RL.
We first use supervised fine-tuning (SFT) to teach the model the basic syntax of tool usage. We synthesized expert trajectories using GPT-5 and densified the reasoning steps using Qwen3-VL-Thinking to create high-quality training data.
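For illustration, here is what one synthesized “Think-Before-You-Act” trajectory could look like as an SFT example. The chat schema and the <think>/<tool_call>/<answer> tags are assumptions made for this sketch; the exact trajectory template may differ.

```python
# One synthesized SFT example in a hypothetical chat schema. The <think>,
# <tool_call>, and <answer> tags are illustrative assumptions about the
# trajectory format, not the exact training template.
sft_example = {
    "image": "chart_0421.png",
    "messages": [
        {"role": "user",
         "content": "What is the ratio of the tallest bar to the shortest?"},
        {"role": "assistant",
         "content": ("<think>I should read the exact bar values before dividing.</think>\n"
                     '<tool_call>{"name": "chart_parser", "args": {"region": "all"}}</tool_call>')},
        {"role": "tool", "content": '{"values": [12, 3, 8]}'},
        {"role": "assistant",
         "content": "<think>12 / 3 = 4.</think>\n<answer>4</answer>"},
    ],
}
```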
The crucial performance leap comes from online RL. We employ Group Relative Policy Optimization (GRPO), which estimates each rollout’s advantage by normalizing its reward against the mean and standard deviation of a group of rollouts sampled for the same prompt, reducing variance without a separate value model.
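Concretely, GRPO’s advantage for each sampled output is its reward standardized within its group. A minimal NumPy sketch:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: standardize each rollout's reward by the
    mean and std of its own group (all rollouts for the same prompt),
    so no learned value baseline is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts for one prompt; only the third earned the sparse reward.
group = np.array([0.0, 0.0, 1.0, 0.0])
print(grpo_advantages(group))  # the successful rollout gets a large positive advantage
```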
Our reward function is designed to be sparse yet strictly format-aware: a rollout is rewarded only at the end of a trajectory, and only when its output both follows the required format and reaches the correct final answer.
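As a sketch of what “sparse yet strictly format-aware” can mean in practice: a response scores zero unless it parses into the required tag structure, and scores 1.0 only if the extracted answer is also correct. The tags and the binary scheme below are our assumptions, not the exact reward used in training.

```python
import re

def sparse_format_reward(response: str, gold: str) -> float:
    """Sketch of a sparse, format-aware reward: zero unless the response
    follows the required tag structure AND the final answer matches.
    The <think>/<answer> tags and the 0/1 scheme are assumptions."""
    m = re.fullmatch(r"(?s)<think>.*</think>\s*<answer>(.*?)</answer>\s*", response)
    if m is None:
        return 0.0  # malformed output earns nothing, regardless of content
    return 1.0 if m.group(1).strip() == gold.strip() else 0.0

print(sparse_format_reward("<think>12/3=4</think><answer>4</answer>", "4"))  # 1.0
print(sparse_format_reward("The answer is 4", "4"))                          # 0.0
```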
We evaluated VISTA-R1 against top-tier proprietary and open-source models across 5 in-distribution and 6 out-of-distribution (OOD) benchmarks.

Key finding: RL, not SFT, is the primary driver of these gains. In our ablations, SFT alone provides a +3.46% improvement, while adding RL contributes a further +10.19%, showing that the model learns to self-correct and coordinate tools through trial and error.

VISTA-Gym and VISTA-R1 demonstrate that equipping VLMs with tools requires more than just API access—it requires a fundamental shift in training methodology towards Agentic RL. By scaling up the environment and refining the reward mechanism, we can unlock models that truly “think” with images.
The code and environment are available now for the community to build upon.