How to Diagnose Multi-Agent System Failures with Automated Attribution: A Step-by-Step Guide
Introduction
Large Language Model (LLM) multi-agent systems are powerful tools for tackling complex tasks through collaboration. However, when these systems fail, developers often face a tedious debugging process: sifting through massive logs to pinpoint which agent caused the failure and at what step. Researchers from Penn State University, Duke University, and partners like Google DeepMind recently introduced a solution called Automated Failure Attribution, along with the Who&When benchmark dataset. This guide will walk you through applying this method to your own multi-agent projects, transforming a frustrating “needle-in-a-haystack” hunt into a streamlined diagnostic workflow.

What You Need
- Access to the Who&When dataset (hosted on Hugging Face) and the open-source code (available on GitHub). Both are linked in the original research paper.
- Basic familiarity with LLM multi-agent systems and their interaction logs (e.g., agent messages, task states).
- A Python environment with libraries such as PyTorch, Transformers, and datasets (used in Step 2 to load Who&When).
- A sample failing multi-agent scenario (either from your own system or a simulated one) to test the attribution methods.
Step-by-Step Guide
Step 1: Understand the Automated Failure Attribution Problem
Before diving into code, grasp the core challenge: given a failed multi-agent task, you need to identify which agent was responsible at which point in the interaction chain. This is not about blame, but about root-cause localization. The research defines this as a new problem—Automated Failure Attribution—and provides the first benchmark to evaluate solutions. Read the paper (linked above) to understand the formal definition and existing manual debugging pitfalls.
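To make the "which agent, at which step" framing concrete, here is a minimal sketch of how a failure log and an attribution could be represented. The class and function names (Step, Attribution, locate) are illustrative assumptions and do not come from the paper or its repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One turn in the interaction chain (hypothetical structure)."""
    index: int    # sequential position in the log
    agent: str    # name of the agent that produced this turn
    content: str  # message text


@dataclass
class Attribution:
    """The answer to 'who and when': an agent plus a step index."""
    agent: str
    step: int


def locate(log: List[Step], attribution: Attribution) -> Step:
    """Return the attributed step, checking that it is consistent with the log."""
    for step in log:
        if step.index == attribution.step:
            if step.agent != attribution.agent:
                raise ValueError(
                    f"step {step.index} belongs to {step.agent}, "
                    f"not {attribution.agent}"
                )
            return step
    raise ValueError(f"step {attribution.step} not found in log")


log = [
    Step(0, "planner", "Split the task into search and summarize."),
    Step(1, "searcher", "Returned results for the wrong year."),
    Step(2, "writer", "Summarized the results."),
]
culprit = locate(log, Attribution(agent="searcher", step=1))
print(culprit.content)
```

Treating an attribution as a plain (agent, step) pair keeps it easy to compare against ground-truth labels later.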
Step 2: Set Up the Who&When Dataset
The Who&When dataset contains multiple multi-agent task scenarios with labeled failure points. To get started:
- Visit the Hugging Face dataset page (Kevin355/Who_and_When).
- Download the dataset using the datasets library:
    from datasets import load_dataset; dataset = load_dataset('Kevin355/Who_and_When')
- Familiarize yourself with the structure: each entry includes a log of agent interactions, the final outcome (success/failure), and ground-truth labels for the responsible agent and timestamp.
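Once a record is loaded, a quick summary helps confirm you are reading it correctly. The sketch below works on an inline sample rather than the real download, and the field names (history, name, mistake_agent, mistake_step) are assumptions about the schema; check them against the dataset card before relying on them.

```python
from typing import Any, Dict


def summarize_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize one failure record (field names are assumed, not confirmed)."""
    history = record.get("history", [])
    raw_step = record.get("mistake_step")
    return {
        "num_steps": len(history),
        "agents": sorted({turn.get("name", "unknown") for turn in history}),
        "labeled_agent": record.get("mistake_agent"),
        # The step label may be stored as a string, so normalize it to int.
        "labeled_step": int(raw_step) if raw_step is not None else None,
    }


# Hand-written sample standing in for one dataset entry.
sample = {
    "history": [
        {"name": "Orchestrator", "content": "Find the release date."},
        {"name": "WebSurfer", "content": "Opened an unrelated page."},
        {"name": "Orchestrator", "content": "Reported the wrong date."},
    ],
    "mistake_agent": "WebSurfer",
    "mistake_step": "1",
}

summary = summarize_record(sample)
print(summary)
```

If the real records use different keys, only the string literals inside summarize_record need to change.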
Step 3: Choose an Automated Attribution Method
The paper evaluates several methods. This guide focuses on the simplest baseline, Rule-based Chain-of-Thought (CoT), and the most effective one, Attribution via Agent Tracing (AAT). You can find implementations in the GitHub repository.
- Clone the repository:
    git clone https://github.com/mingyin1/Agents_Failure_Attribution
- Install dependencies:
    pip install -r requirements.txt
- Run the baseline method on a sample from Who&When to confirm setup:
    python baseline_cot.py --dataset who_and_when
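After a baseline run, you will want to compare predictions against the ground-truth labels. The following is a minimal sketch of one way to score them, assuming each prediction and label is an (agent, step) pair; the function name and the exact-match definition of step accuracy are assumptions, not the repository's evaluation script.

```python
from typing import Dict, Iterable, Tuple

Label = Tuple[str, int]  # (responsible agent, step index)


def attribution_accuracy(
    predictions: Iterable[Label], ground_truth: Iterable[Label]
) -> Dict[str, float]:
    """Compute agent-level and step-level accuracy over paired labels."""
    preds = list(predictions)
    truth = list(ground_truth)
    if len(preds) != len(truth):
        raise ValueError("predictions and ground truth differ in length")
    if not truth:
        raise ValueError("no examples to score")
    n = len(truth)
    # Agent accuracy: the right agent was named, regardless of step.
    agent_hits = sum(p[0] == t[0] for p, t in zip(preds, truth))
    # Step accuracy: the exact step index was named.
    step_hits = sum(p[1] == t[1] for p, t in zip(preds, truth))
    return {"agent_accuracy": agent_hits / n, "step_accuracy": step_hits / n}


predicted = [("a", 1), ("b", 2), ("c", 3), ("a", 0)]
labeled = [("a", 1), ("b", 5), ("c", 7), ("y", 9)]
scores = attribution_accuracy(predicted, labeled)
print(scores)
```

Reporting the two numbers separately shows whether a method finds the right agent but misses the exact step, which is the harder part of the task.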
Step 4: Apply Attribution to Your Own System’s Logs
To diagnose failures in your own multi-agent system, you’ll need to format your logs to match the dataset’s structure. The code expects a JSON or CSV file with fields for agent names, timestamps, and messages. Follow these sub-steps:

- Extract interaction logs from your system. Each message should include sender, recipient, content, and a sequential step number.
- Add a label column if you already know the failure cause (for testing). Otherwise, leave it blank.
- Run the AAT model on your logs:
    python aat_model.py --input my_logs.json --output attributions.json
- For each log, the output lists the predicted responsible agent and the step where the error likely occurred.
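The formatting sub-steps above can be sketched as a small converter. This is a sketch under stated assumptions: the output layout (a messages list plus a label field) and the helper name normalize_messages are hypothetical, so adjust them to whatever schema the repository's scripts actually expect.

```python
import json
from typing import Any, Dict, List, Optional

REQUIRED_FIELDS = ("sender", "recipient", "content")


def normalize_messages(
    raw_messages: List[Dict[str, Any]], label: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Attach sequential step numbers and validate required fields."""
    messages = []
    for step, raw in enumerate(raw_messages):
        missing = [key for key in REQUIRED_FIELDS if key not in raw]
        if missing:
            raise ValueError(f"message {step} is missing fields: {missing}")
        messages.append(
            {
                "step": step,
                "sender": raw["sender"],
                "recipient": raw["recipient"],
                "content": raw["content"],
            }
        )
    # label stays None when the failure cause is not yet known.
    return {"messages": messages, "label": label}


raw_log = [
    {"sender": "planner", "recipient": "coder", "content": "Write a parser."},
    {"sender": "coder", "recipient": "planner", "content": "Done, no tests run."},
]
payload = normalize_messages(raw_log)
serialized = json.dumps(payload, indent=2)
print(serialized)
```

Writing the serialized string to a file such as my_logs.json gives you an input to experiment with; failing fast on missing fields avoids feeding incomplete logs to the attribution step.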
Step 5: Interpret Results and Iterate
With attributions in hand, you can now efficiently debug. For each failure:
- Review the predicted agent’s actions around the identified step.
- Check for communication errors, misunderstood instructions, or knowledge gaps.
- Apply fixes, e.g., adjust the prompt, improve agent memory, or add validation checks.
- Re-run the system to verify the fix.
The benchmark shows that automated attribution accelerates debugging by up to 3× compared to manual log archaeology.
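Reviewing the predicted agent's actions "around the identified step" is easier with a small helper that pulls out the neighboring messages. The sketch below assumes the log is an ordered list of message strings; the function names and the default radius are illustrative choices.

```python
from typing import List, Tuple


def context_window(
    messages: List[str], step: int, radius: int = 2
) -> List[Tuple[int, str]]:
    """Return (index, message) pairs within `radius` steps of `step`."""
    if not 0 <= step < len(messages):
        raise IndexError(f"step {step} is outside the log")
    start = max(0, step - radius)
    end = min(len(messages), step + radius + 1)
    return [(i, messages[i]) for i in range(start, end)]


def format_window(window: List[Tuple[int, str]], step: int) -> str:
    """Render the window as text, marking the attributed step with '>>'."""
    lines = []
    for i, message in window:
        marker = ">>" if i == step else "  "
        lines.append(f"{marker} [{i}] {message}")
    return "\n".join(lines)


messages = ["m0", "m1", "m2", "m3", "m4", "m5"]
window = context_window(messages, step=4, radius=1)
print(format_window(window, step=4))
```

A small radius is usually enough to see what the agent received just before the error and how downstream agents reacted to it.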
Tips for Success
- Start with provided examples before using your own logs. The Who&When dataset includes diverse failure types (e.g., planning, retrieval, reasoning errors).
- Combine methods: Use rule-based CoT for a quick first pass, then AAT for deeper analysis.
- Log verbosely: The more structured your logs, the better the attribution accuracy. Include task context and final outputs.
- Contribute back: The dataset is open for expansion—if you encounter a unique failure, consider adding it to Who&When.
- Stay updated: The research was accepted as a Spotlight at ICML 2025; watch for future improvements in attribution models.
By following this guide, you’ll turn the daunting task of debugging multi-agent failures into a systematic, automated process. Happy diagnosing!