ChaosLLM - Opening a New Research Line on Dependability Testing for Tool-Calling Agents

I’m excited to share my latest paper, “ChaosLLM: A Dependability Testing Approach for Tool-calling Agents”. The work introduces a new direction for evaluating AI systems — shifting the focus from prompt-based attacks to realistic failures in everyday deployments.

Modern LLM agents no longer operate in isolation. They rely on external tools, APIs, memory, and multi-step reasoning loops to complete complex tasks. While most research has focused on adversarial prompts and jailbreaks, we still know surprisingly little about how these agents behave when the environment itself breaks: tools hang, crash, return subtle errors, or behave unpredictably.

That gap motivated the creation of ChaosLLM, a lightweight fault-injection framework inspired by chaos engineering in distributed systems. Instead of modifying the agent’s code, ChaosLLM sits between the agent and its tools, deliberately injecting realistic failures such as unreachable services, slow responses, hangs, and subtly incorrect outputs. This allows us to measure dependability using concrete metrics such as task success rate, hallucination rate, and timeout behavior.
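To make the idea concrete, here is a minimal sketch of what fault injection at the tool boundary can look like. It is illustrative only, not ChaosLLM's actual implementation: the wrapper, fault types, and probabilities below are hypothetical, and in practice the injector sits at the agent framework's tool interface rather than around a bare Python function.

```python
import random
import time

def inject_faults(tool_fn, fault_rate=0.3, max_delay_s=20.0, seed=None):
    """Wrap a tool callable so that some calls fail in realistic ways.

    Illustrative sketch: fault categories and rates are hypothetical,
    not ChaosLLM's actual API.
    """
    rng = random.Random(seed)

    def wrapped(*args, **kwargs):
        if rng.random() >= fault_rate:
            return tool_fn(*args, **kwargs)            # no fault injected
        fault = rng.choice(["unreachable", "slow", "hang", "corrupt"])
        if fault == "unreachable":
            raise ConnectionError("injected fault: service unreachable")
        if fault == "slow":
            time.sleep(rng.uniform(1.0, max_delay_s))  # delayed but correct response
            return tool_fn(*args, **kwargs)
        if fault == "hang":
            time.sleep(max_delay_s)                    # simulate a hang until timeout
            raise TimeoutError("injected fault: tool call timed out")
        # "corrupt": return a plausible but subtly wrong result
        result = tool_fn(*args, **kwargs)
        return result + " (stale)" if isinstance(result, str) else result

    return wrapped

# Usage: wrap a tool before registering it with the agent.
def weather_lookup(city: str) -> str:
    return f"Sunny, 24°C in {city}"

chaotic_weather = inject_faults(weather_lookup, fault_rate=0.5, seed=42)
```

Because the wrapper only intercepts tool calls, the agent under test stays untouched, which is what makes the approach framework-agnostic.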

Our preliminary experiments on LangChain ReAct agents revealed several insights. Agents handled obvious failures reasonably well and were largely unaffected by slow responses under generous timeouts. However, plausible but slightly incorrect tool outputs proved a major weakness, frequently leading to confident yet incorrect answers. The findings suggest that robustness is not only a matter of stronger models; it also depends heavily on tool validation strategies and prompting patterns.
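One such validation strategy is to sanity-check tool outputs before the agent reasons over them instead of trusting them blindly. The sketch below is a hypothetical example of that idea; the `validated_call` helper, its checks, and its retry policy are not part of ChaosLLM.

```python
def validated_call(tool_fn, *args, check, retries=2, **kwargs):
    """Call a tool and re-invoke it if the output fails a sanity check."""
    last = None
    for _ in range(retries + 1):
        last = tool_fn(*args, **kwargs)
        if check(last):
            return last            # output passed validation
    raise ValueError(f"tool output failed validation: {last!r}")

# Example: a currency-conversion tool should return a positive rate
# within a plausible range for the requested pair.
rate = validated_call(
    lambda: 1.08,                  # stand-in for a real exchange-rate tool
    check=lambda r: isinstance(r, float) and 0.5 < r < 2.0,
)
```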

More importantly, this paper is only the starting point. ChaosLLM lays the foundation for a broader research agenda on dependability testing for LLM ecosystems, including multi-agent systems, transient real-world failures, MCP-based tool infrastructures, and complex real-world workloads. As AI agents increasingly move into mission-critical settings, understanding how they fail — not just how they succeed — becomes essential.

If you’re working on AI agents, reliability engineering, or testing methodologies, I’d love to connect and explore collaborations. The goal is simple: make LLM systems not only powerful, but dependable by design.

Learn more about it here.
