From Repetitive Benchmark Analysis to Self-Automating Agents: A Copilot Applied Science Story

In the world of software engineering, automation often starts as a way to eliminate drudgery, only to create new systems that need maintenance. For an AI researcher on the Copilot Applied Science team, this pattern took a revolutionary turn: they built a tool that automates the intellectual toil of analyzing coding agent trajectories, freeing themselves and their team to focus on higher-level creative work. This Q&A explores the journey, the challenges, and the insights gained from creating eval-agents, a framework that turns repetitive analysis into an autonomous process.

What sparked the creation of eval-agents?

The daily work involved evaluating coding agent performance against standardized benchmarks like TerminalBench2 or SWEBench-Pro. Each task generates a trajectory—a detailed JSON file recording the agent's thoughts and actions. With dozens of tasks per benchmark and many runs per day, that meant wading through hundreds of thousands of lines of trajectory output. The researcher initially used GitHub Copilot to surface patterns, reducing the reading load to a few hundred lines, but the engineer inside saw this as a repetitive loop begging for automation. Thus, eval-agents was born: a tool to automate the intellectual labor of pattern discovery, letting the researcher focus on insights rather than data sifting.
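The post doesn't spell out what a trajectory file actually looks like, so the sketch below is purely illustrative: it assumes each trajectory is a JSON file with a steps array whose entries carry thought, action, and exit_code fields, and it shows the kind of first-pass filtering that previously had to be done by eye.

```python
import json
from pathlib import Path

# Hypothetical trajectory layout: each file holds a "steps" list, and each step
# records the agent's reasoning ("thought"), the command it ran ("action"), and
# the command's "exit_code". These field names are assumptions for illustration,
# not the actual schema used by eval-agents.
def failed_steps(trajectory_dir: str) -> list[tuple[str, str]]:
    """Collect (task, action) pairs for steps whose command exited non-zero."""
    failures = []
    for path in Path(trajectory_dir).glob("*.json"):
        trajectory = json.loads(path.read_text())
        for step in trajectory.get("steps", []):
            if step.get("exit_code", 0) != 0:
                failures.append((path.stem, step.get("action", "")))
    return failures

if __name__ == "__main__":
    for task, action in failed_steps("trajectories/"):
        print(f"{task}: {action}")
```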

How does eval-agents change the analysis workflow?

Instead of manually inspecting trajectory files, the eval-agents tool runs custom agents that autonomously review the data. The user defines analysis goals (e.g., find failure patterns) and the agent executes the search, summarizing findings in natural language. This cuts the time from hours to minutes and eliminates the mental overhead of repeated pattern-matching. The output is a concise report, highlighting key trends and anomalies—exactly what a researcher needs to make decisions about agent improvements. The key insight is that the same agent technology being evaluated can now evaluate itself, creating a self-accelerating loop.
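The article doesn't show eval-agents' internals, so the following is only a rough sketch of the shape such a reviewer could take: a stated analysis goal, a pass over each trajectory, and a final condensing step. The run_model function is a placeholder for whatever model client the tool actually uses, not a real API.

```python
from pathlib import Path

def run_model(prompt: str) -> str:
    """Placeholder for an LLM call; wire this to your own model provider."""
    raise NotImplementedError

def analyze(goal: str, trajectory_dir: str) -> str:
    """Review each trajectory against a goal, then condense the notes into one report."""
    notes = []
    for path in sorted(Path(trajectory_dir).glob("*.json")):
        prompt = (
            f"Analysis goal: {goal}\n"
            f"Trajectory for task {path.stem}:\n{path.read_text()}\n"
            "List anything relevant to the goal as short bullet points."
        )
        notes.append(f"## {path.stem}\n{run_model(prompt)}")
    return run_model(
        "Combine these per-task notes into a short report of recurring "
        "patterns and anomalies:\n\n" + "\n\n".join(notes)
    )

# Example: report = analyze("find failure patterns", "trajectories/")
```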

What were the design principles behind eval-agents?

Three core goals guided the implementation: easy sharing and reuse, simple authoring of new agents, and making coding agents the primary contribution vehicle. The first two leverage GitHub's collaborative nature—the project is hosted on GitHub, with clear documentation and templates. The third principle ensures that any team member can create an agent by writing a script or a configuration file, no deep AI expertise needed. This aligns with the researcher's background as an OSS maintainer on the GitHub CLI, where accessibility and community-driven development were paramount.
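The actual authoring format isn't shown in the post; as a purely hypothetical illustration of what "a script or a configuration file" could amount to, an agent definition might be little more than a declarative object:

```python
from dataclasses import dataclass, field

# Every field below is an assumption made for illustration; it is not the
# real eval-agents authoring format.
@dataclass
class AnalysisAgent:
    name: str
    benchmark: str
    goal: str
    questions: list[str] = field(default_factory=list)
    output: str = "markdown-report"

failure_triage = AnalysisAgent(
    name="failure-triage",
    benchmark="TerminalBench2",
    goal="Explain why tasks failed and group the failures by root cause.",
    questions=[
        "Which tasks did the agent fail on?",
        "What common mistakes appear across the failing trajectories?",
    ],
)
```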

How did the tool improve team collaboration?

Before eval-agents, each analyst worked in isolation, manually wrangling trajectories. Now, agents are shared via GitHub repositories—anyone can clone, modify, or extend them. Team members contribute new agents for specific benchmarks or analysis types, creating a library of reusable components. This shifts the team's focus from manual data processing to agent development and refinement. The researcher notes that this has enabled faster, more consistent evaluations across the team, as everyone uses the same automated framework. It also lowers the barrier for junior members to perform complex analyses.

What lessons were learned about using GitHub Copilot in this process?

The researcher discovered that combining Copilot with agent-driven development unlocks unprecedented speed. Copilot helped generate initial analysis scripts and patterns, but the real power came from treating those scripts as building blocks for autonomous agents. A key lesson: don't just automate the repetitive, automate the exploration. By teaching agents to ask questions of the data (e.g., "Which tasks did the agent fail on?" and "What common mistakes appear?"), the researcher created a system that continuously learns and adapts. This iterative loop—use Copilot, explore, automate—can be applied to many data-intensive tasks beyond benchmarks.
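As a concrete example of turning one of those questions into code (again reusing the hypothetical schema from the earlier sketch), a rough first answer to the question about common mistakes can be as simple as counting recurring failure signatures across runs:

```python
import json
from collections import Counter
from pathlib import Path

def common_mistakes(trajectory_dir: str, top_n: int = 10) -> list[tuple[str, int]]:
    """Tally recurring failure signatures across all trajectories (assumed schema)."""
    counts: Counter[str] = Counter()
    for path in Path(trajectory_dir).glob("*.json"):
        for step in json.loads(path.read_text()).get("steps", []):
            if step.get("exit_code", 0) == 0:
                continue
            # Use the first stderr line as a rough failure signature; the
            # "stderr" field is an assumption, like the rest of the schema.
            stderr = step.get("stderr", "")
            counts[stderr.splitlines()[0] if stderr else "unknown failure"] += 1
    return counts.most_common(top_n)
```

An analysis agent can run this kind of tally on its own and fold the result into its report, which is what shifts the human work from counting to interpreting.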

What's the future of agent-driven development on the team?

The immediate plan is to expand the agent library to cover more benchmarks and analysis types. The team is also exploring how to collaboratively improve agents through peer review and pull requests. Longer-term, the researcher envisions a self-service platform where any engineer can deploy an agent to monitor or evaluate their own code. The ultimate goal is to remove toil not just for analysts, but for every developer—making agent-driven development a standard part of the software lifecycle. The success of eval-agents shows that automating intellectual work is not only possible but essential for scaling talent.
