Build Your Own Evaluation Agent with GitHub Copilot: A Step-by-Step Guide

Introduction

Are you tired of manually sifting through hundreds of thousands of lines of code to analyze how your AI coding agents perform? As an AI researcher at GitHub, I faced this exact challenge when evaluating agents on benchmarks like TerminalBench2 and SWEBench-Pro. The repetitive task of reading trajectories — JSON files that capture every thought and action an agent takes — was a perfect candidate for automation. Using GitHub Copilot, I created a tool called eval-agents that not only automated my own analysis but also enabled my entire team to build custom solutions. In this guide, I'll walk you through the same process so you can create your own agent-driven analysis tool. By the end, you'll have a reusable system that saves hours of intellectual toil.

What You Need

  • GitHub Copilot subscription (Individual, Business, or Enterprise)
  • Programming environment (VS Code or any Copilot-compatible editor)
  • Python 3.8+ installed (or your language of choice for scripting)
  • Sample evaluation data (e.g., trajectory JSON files from benchmarks like SWEBench-Pro)
  • GitHub repository for version control and collaboration
  • Basic understanding of JSON, agents, and benchmarks (optional but helpful)

Step-by-Step Guide

Step 1: Identify the Repetitive Analysis Pattern

The first step is to pinpoint the exact task you want to automate. In my case, I was analyzing agent trajectories to find common failure modes or successful strategies. Each trajectory is a JSON file with hundreds of lines — and I had dozens of such files per benchmark run. The repetitive pattern was: load a trajectory, search for specific actions or errors, and compile statistics. Write down your own repetitive loop. For example:

  • Open a trajectory JSON file.
  • Extract the agent's action sequence.
  • Check for specific tool calls or error messages.
  • Aggregate results across all trajectories.

This pattern becomes the foundation for your agent.
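
To make that loop concrete, here is a minimal sketch in Python. The "steps", "action", and "error" field names are assumptions about the trajectory schema, not a fixed format; adjust them to whatever your benchmark actually emits.

import json
from pathlib import Path
from collections import Counter

def summarize_trajectories(folder):
    """Tally error messages across all trajectory files in a folder."""
    # Assumed schema: {"steps": [{"action": ..., "error": ...}, ...]}
    error_counts = Counter()
    for path in Path(folder).glob("*.json"):
        trajectory = json.loads(path.read_text())
        for step in trajectory.get("steps", []):
            if step.get("error"):
                error_counts[step["error"]] += 1
    return error_counts

Even a rough sketch like this is useful: it pins down the inputs, the per-file work, and the aggregation you will ask Copilot to flesh out.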

Step 2: Use GitHub Copilot to Explore Data Patterns

Before building a full automation, explore your data with Copilot’s help. In VS Code, open a few trajectory files and use Copilot Chat or inline suggestions to write quick exploratory scripts. For instance, ask Copilot: “Write a Python script that reads all JSON files in a folder and prints the first action of each trajectory.” This gives you a feel for the data structure and helps you discover patterns. Copilot can also suggest regex patterns for extracting specific information. Save these snippets — they’ll be the building blocks of your agent.
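
If your trajectories follow a similar shape, the exploratory script Copilot produces might look roughly like the following. Again, the "steps" and "action" keys are assumed field names:

import json
from pathlib import Path

# Print the first action of every trajectory in a folder.
# "trajectories", "steps", and "action" are assumptions; rename to fit your data.
for path in sorted(Path("trajectories").glob("*.json")):
    trajectory = json.loads(path.read_text())
    steps = trajectory.get("steps", [])
    first_action = steps[0].get("action", "<none>") if steps else "<empty trajectory>"
    print(f"{path.name}: {first_action}")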

Step 3: Design Your Evaluation Agent

Now, design the agent that will automate your pattern. An agent is simply a script that performs a series of steps autonomously. Define the inputs (trajectory files), processing logic (pattern detection), and outputs (summaries or reports). Use Copilot to brainstorm by describing your design in comments, e.g., # This agent should: 1. Load each trajectory 2. Extract tool calls 3. Count errors 4. Output a CSV. Copilot will generate code blocks that you can adapt. Ensure your agent is modular so it can be reused or extended later.
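
One way to sketch that modular design is as a file of stubs before any logic exists, which also gives Copilot clear targets to fill in. The function names below are illustrative, not part of eval-agents:

# Illustrative design skeleton; names are hypothetical.
def load_trajectory(filepath):
    """1. Load a single trajectory JSON from disk."""

def extract_tool_calls(trajectory):
    """2. Pull out the sequence of tool calls the agent made."""

def count_errors(tool_calls):
    """3. Count failed or erroring calls."""

def write_report(rows, out_path):
    """4. Write the aggregated results to a CSV."""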

Step 4: Implement the Agent with Copilot

Start coding with Copilot by your side. Create a new Python file and begin typing your agent’s skeleton. Copilot will suggest function signatures, loops, and data processing logic. Let it generate the bulk of the code while you guide it with clear variable names and comments. For example, type:

def analyze_trajectory(filepath):
    """Extract key metrics from a single trajectory JSON."""
    # Copilot will fill in the rest

Iterate quickly: as you accept suggestions, test with a sample file. Use Copilot’s inline chat to fix errors or optimize performance. The goal is to get a working prototype in minutes.
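
For reference, here is one plausible completion of that skeleton, assuming the same "steps"/"error" schema as before; Copilot will tailor its suggestion to whatever format it sees in your open files:

import json

def analyze_trajectory(filepath):
    """Extract key metrics from a single trajectory JSON."""
    with open(filepath) as f:
        trajectory = json.load(f)
    steps = trajectory.get("steps", [])  # assumed schema
    return {
        "file": filepath,
        "num_steps": len(steps),
        "num_errors": sum(1 for step in steps if step.get("error")),
    }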

Step 5: Test and Refine the Agent

Run your agent on a small set of trajectories first. Check that the output matches what you expected. If something is off, highlight the problematic code and ask Copilot to debug it — for example, “This function returns None when the file is missing; add error handling.” Refine the agent to handle edge cases like empty trajectories or malformed JSON. You can also create unit tests with Copilot’s assistance by describing the test cases. Once it works on sample data, run it on your full dataset.
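
As a sketch, the unit tests might cover exactly those edge cases. This assumes the analyze_trajectory function from Step 4 lives in an importable module; the eval_agent module name here is hypothetical:

import json
import pytest
from eval_agent import analyze_trajectory  # hypothetical module name

def test_empty_trajectory(tmp_path):
    # An empty step list should yield zero metrics, not crash.
    path = tmp_path / "empty.json"
    path.write_text(json.dumps({"steps": []}))
    assert analyze_trajectory(str(path))["num_steps"] == 0

def test_malformed_json(tmp_path):
    # Malformed files should fail loudly (or be skipped, if you prefer).
    path = tmp_path / "bad.json"
    path.write_text("{not valid json")
    with pytest.raises(json.JSONDecodeError):
        analyze_trajectory(str(path))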

Step 6: Package and Share the Agent

To make your agent easy to share and use (like I did with eval-agents), create a GitHub repository. Structure the repo with a clear README, a src/ folder for the agent code, and a tests/ folder. Use Copilot to generate the README by describing the agent’s purpose and usage. Add a requirements.txt for dependencies. Consider making the agent configurable via command-line arguments or a config file. This allows teammates to run it on their own data without modifying the code. Push your repo and invite collaborators.
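
A minimal argparse entry point is one way to make the agent configurable without code changes. The flags below are illustrative:

import argparse

def run_agent(trajectory_dir, out_path, benchmark):
    """Placeholder for the analysis pipeline built in Steps 4 and 5."""
    print(f"Analyzing {trajectory_dir} ({benchmark}) -> {out_path}")

def main():
    parser = argparse.ArgumentParser(description="Analyze agent trajectories.")
    parser.add_argument("trajectory_dir", help="Folder of trajectory JSON files")
    parser.add_argument("--out", default="report.csv", help="Output CSV path")
    parser.add_argument("--benchmark", default="swebench-pro",
                        help="Benchmark name, used to pick the parsing schema")
    args = parser.parse_args()
    run_agent(args.trajectory_dir, args.out, args.benchmark)

if __name__ == "__main__":
    main()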

Step 7: Enable the Team to Author New Agents

The real power comes when others can contribute their own agents. In your repository, create a template for new agents. Use Copilot to document the template with comments and examples. Teach your team to fork the repo, copy the template, and customize it with Copilot’s help. Encourage them to share improvements via pull requests. I found that this approach turned my team into active contributors — they built agents for specific benchmarks or custom metrics. The result: a thriving ecosystem of analysis tools.
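
The template itself can be a short stub file with documented hooks for contributors to fill in. A possible version, with all names hypothetical:

"""Template for a new evaluation agent.

Copy this file, rename it, and implement the two hooks below with Copilot's help.
"""

def process_trajectory(trajectory):
    """Return the metrics you care about for one parsed trajectory dict."""
    raise NotImplementedError

def aggregate(results):
    """Combine per-trajectory metrics into a final report."""
    raise NotImplementedError

Keeping the hooks this small lowers the barrier to contribution: a teammate only writes the logic unique to their analysis, and the shared runner handles file loading, iteration, and output.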

Tips for Success

  • Start small. Don’t try to automate everything at once. Focus on one repetitive task and expand.
  • Leverage Copilot’s context. Keep related files open in your editor so Copilot understands your project structure.
  • Document as you go. Use comments and READMEs — they help both humans and Copilot generate better code later.
  • Version control everything. Even experiments. It makes rollback and collaboration easier.
  • Share early, share often. Get feedback from colleagues to refine the agent’s utility.
  • Iterate based on real data. Your agent is only as good as the patterns it finds. Continuously update it as new benchmarks emerge.

By following these steps, you can transform tedious analysis into an automated, collaborative process. You might even find yourself shifting from manual reviewer to tool builder — just like I did. Happy automating!
