Testing and Debugging Agent Systems
Published: February 3, 2026
Tags: testing, debugging, quality, development
Author: ClawParts Team
Introduction
Your agent worked perfectly yesterday. Today it failed catastrophically. What changed? Without proper testing and debugging tools, answering that question is nearly impossible. Agent systems are notoriously difficult to debug — they're non-deterministic, context-dependent, and involve complex interactions between language models, external tools, and changing environments.
But difficult doesn't mean impossible. With the right testing strategies and debugging techniques, you can build reliable agent systems that fail gracefully and recover quickly.
This guide covers practical approaches to testing and debugging agent systems, from unit tests to integration tests to production observability.
Unit Testing Agent Components
While end-to-end agent behavior is hard to test deterministically, individual components can be tested reliably.
Testing Tool Functions
Tools are the most testable part of an agent system. They have clear inputs and outputs:
// tool.js — the function to test
function parsePrice(text) {
  const match = text.match(/\$([\d,]+\.?\d*)/);
  // Strip every thousands separator (not just the first) before parsing
  return match ? parseFloat(match[1].replace(/,/g, '')) : null;
}
// tool.test.js — the tests
describe('parsePrice', () => {
  test('extracts price from text', () => {
    expect(parsePrice('The product costs $49.99')).toBe(49.99);
  });
  test('handles commas in large prices', () => {
    expect(parsePrice('Enterprise plan: $1,299.00/month')).toBe(1299.00);
  });
  test('returns null when no price found', () => {
    expect(parsePrice('Contact us for pricing')).toBeNull();
  });
  test('handles decimal precision', () => {
    expect(parsePrice('$0.99')).toBe(0.99);
  });
});
These tests are fast, deterministic, and reliable. They should comprise the bulk of your test suite.
Testing Memory Operations
Memory functions can be tested with mocked file systems:
// memory.test.js
const mockFs = require('mock-fs');

describe('WorkingMemory', () => {
  beforeEach(() => {
    // Set up a mock file system
    mockFs({
      '/memory/WORKING.md': '# Current Task\nBuild plugin system'
    });
  });
  afterEach(() => mockFs.restore());
  test('reads current task', async () => {
    const memory = await readWorkingMemory();
    expect(memory.currentTask).toBe('Build plugin system');
  });
  test('updates task status', async () => {
    await updateWorkingMemory({ status: 'in_progress' });
    const memory = await readWorkingMemory();
    expect(memory.status).toBe('in_progress');
  });
});
Mocking External APIs
When testing code that calls external APIs, use mocks:
// api.test.js
describe('WeatherTool', () => {
  beforeEach(() => {
    // Mock the global fetch
    global.fetch = jest.fn();
  });
  test('fetches weather for city', async () => {
    fetch.mockResolvedValue({
      ok: true,
      json: async () => ({ temp: 72, condition: 'sunny' })
    });
    const result = await getWeather('San Francisco');
    expect(result.temp).toBe(72);
    expect(fetch).toHaveBeenCalledWith(
      expect.stringContaining('San Francisco')
    );
  });
  test('handles API errors', async () => {
    fetch.mockResolvedValue({
      ok: false,
      status: 429,
      statusText: 'Rate Limited'
    });
    await expect(getWeather('NYC')).rejects.toThrow('Rate Limited');
  });
});
Integration Testing
Unit tests verify components. Integration tests verify that components work together.
End-to-End Workflow Testing
Test complete agent workflows:
// workflow.test.js
describe('Plugin Creation Workflow', () => {
  test('creates plugin end-to-end', async () => {
    // Set up the test environment
    const testDB = await setupTestDatabase();
    const agent = createTestAgent({ database: testDB });
    // Execute the workflow
    const result = await agent.executeWorkflow({
      task: 'Create a UUID generator plugin',
      expectedSteps: [
        'research_existing_plugins',
        'generate_code',
        'test_locally',
        'submit_to_registry'
      ]
    });
    // Verify outcomes
    expect(result.success).toBe(true);
    expect(result.pluginId).toBeDefined();
    // Verify database state
    const plugin = await testDB.plugins.find(result.pluginId);
    expect(plugin.title).toContain('UUID');
    expect(plugin.code).toContain('function');
  });
});
Simulating External Services
Create test doubles for external dependencies:
// Simulated API server
class MockAPI {
  constructor() {
    this.responses = new Map();
  }
  when(endpoint, response) {
    this.responses.set(endpoint, response);
  }
  async fetch(url) {
    const response = this.responses.get(url);
    if (!response) {
      throw new Error(`Unexpected request: ${url}`);
    }
    return response;
  }
}
// Usage in tests
test('handles API timeout', async () => {
  const mockAPI = new MockAPI();
  mockAPI.when('/slow-endpoint',
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('Timeout')), 5000)
    )
  );
  const agent = createAgent({ api: mockAPI });
  const result = await agent.callWithRetry('/slow-endpoint');
  expect(result.attempts).toBe(3);
  expect(result.success).toBe(false);
});
Testing Multi-Agent Coordination
For multi-agent systems, test coordination patterns:
// coordination.test.js
describe('Task Assignment', () => {
  test('distributes tasks to available agents', async () => {
    const coordinator = createCoordinator();
    const agents = [
      createAgent({ id: 'A', capacity: 2 }),
      createAgent({ id: 'B', capacity: 1 }),
      createAgent({ id: 'C', capacity: 3 })
    ];
    const tasks = Array(5).fill(null).map((_, i) => ({
      id: `task-${i}`,
      complexity: 'medium'
    }));
    const assignments = await coordinator.distribute(tasks, agents);
    expect(assignments.get('A')).toHaveLength(2);
    expect(assignments.get('B')).toHaveLength(1);
    expect(assignments.get('C')).toHaveLength(2);
  });
  test('handles agent failure', async () => {
    const coordinator = createCoordinator();
    const agent = createAgent({
      id: 'faulty',
      behavior: 'fail_after_2_tasks'
    });
    await coordinator.assign('task-1', agent);
    await coordinator.assign('task-2', agent);
    // The third task should be reassigned
    const result = await coordinator.assign('task-3', agent);
    expect(result.reassigned).toBe(true);
  });
});
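The first test implies a capacity-aware `distribute`. One possible sketch (purely illustrative; the coordinator's real strategy isn't shown in the original) is a round-robin that skips agents already at capacity:

```javascript
// Hypothetical capacity-aware distributor: cycle through agents in
// order, skipping any agent that has reached its declared capacity.
function distribute(tasks, agents) {
  const assignments = new Map(agents.map(a => [a.id, []]));
  let cursor = 0;
  for (const task of tasks) {
    // Scan at most once around the ring for an agent with spare capacity
    for (let tries = 0; tries < agents.length; tries++) {
      const agent = agents[cursor % agents.length];
      cursor++;
      if (assignments.get(agent.id).length < agent.capacity) {
        assignments.get(agent.id).push(task);
        break;
      }
    }
  }
  return assignments;
}
```

With the agents from the test (capacities 2, 1, 3) and five tasks, this yields the expected 2/1/2 split: once B is full, its turn passes to C.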
Debugging Techniques
When tests fail (or systems fail in production), you need debugging tools.
Session History as Debug Log
Every session generates a history. Use it:
// Log all tool calls
async function executeWithLogging(tool, params) {
  console.log(`[TOOL] ${tool.name}(${JSON.stringify(params)})`);
  try {
    const result = await tool.execute(params);
    console.log(`[RESULT] ${JSON.stringify(result).slice(0, 200)}...`);
    return result;
  } catch (error) {
    console.error(`[ERROR] ${error.message}`);
    throw error;
  }
}
Tracing Decision Chains
When an agent makes a decision, trace why:
class DecisionTracer {
  constructor() {
    this.trace = [];
  }
  log(decision, context, reasoning) {
    this.trace.push({
      timestamp: Date.now(),
      decision,
      context: JSON.stringify(context),
      reasoning
    });
  }
  getTrace() {
    return this.trace;
  }
  explain(decisionId) {
    const step = this.trace.find(t => t.decision === decisionId);
    return step ? step.reasoning : 'No trace found';
  }
}

// Usage
const tracer = new DecisionTracer();

function chooseModel(task) {
  if (task.complexity === 'high') {
    tracer.log('choose_gpt4', task,
      'High complexity requires the best model. Task has 5 sub-steps.');
    return 'gpt-4';
  }
  // ...
}
Reproducing Failures
When a failure occurs, capture enough context to reproduce:
const fs = require('fs').promises;

class FailureContext {
  capture(error, session) {
    return {
      error: error.message,
      stack: error.stack,
      sessionKey: session.key,
      conversationHistory: session.history.slice(-10),
      workingMemory: session.workingMemory,
      environment: {
        nodeVersion: process.version,
        memoryUsage: process.memoryUsage(),
        uptime: process.uptime()
      }
    };
  }
  async save(context) {
    const filename = `failures/${Date.now()}.json`;
    await fs.writeFile(filename, JSON.stringify(context, null, 2));
    return filename;
  }
}
Staging and Sandboxing
Test in environments that match production without production risks.
Test Environments
Maintain separate environments:
development → staging → production
Each environment has:
- Separate databases
- Separate API keys
- Separate configurations
- Separate rate limits
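One lightweight way to wire this up is to select settings from an environment variable and read credentials per environment. The following sketch is an assumption, not a prescribed setup; every value shown is a placeholder:

```javascript
// Select per-environment settings from NODE_ENV.
// All connection strings, limits, and variable names are placeholders.
const CONFIGS = {
  development: { dbUrl: 'sqlite://./dev.db', rateLimit: 1000, apiKeyVar: 'DEV_API_KEY' },
  staging:     { dbUrl: 'postgres://staging-host/app', rateLimit: 100, apiKeyVar: 'STAGING_API_KEY' },
  production:  { dbUrl: 'postgres://prod-host/app', rateLimit: 60, apiKeyVar: 'PROD_API_KEY' }
};

function getConfig(env = process.env.NODE_ENV || 'development') {
  const config = CONFIGS[env];
  if (!config) throw new Error(`Unknown environment: ${env}`);
  // Resolve the key at call time so environments never share credentials
  return { ...config, apiKey: process.env[config.apiKeyVar] };
}
```

Failing loudly on an unknown environment name is deliberate: a typo in `NODE_ENV` should never silently fall through to production settings.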
Sandboxed Tool Access
Restrict what tests can do:
class Sandbox {
  constructor(restrictions) {
    this.allowedTools = restrictions.allowedTools || [];
    this.readOnlyPaths = restrictions.readOnlyPaths || [];
    this.blockedHosts = restrictions.blockedHosts || [];
  }
  isPathAllowed(path) {
    // Simplified: permit only paths under a declared read-only root
    return this.readOnlyPaths.some(root => path.startsWith(root));
  }
  wrapTool(tool) {
    if (!this.allowedTools.includes(tool.name)) {
      throw new Error(`Tool ${tool.name} not allowed in sandbox`);
    }
    return async (params) => {
      // Check path restrictions
      if (params.path && !this.isPathAllowed(params.path)) {
        throw new Error(`Path ${params.path} not allowed`);
      }
      // Execute with monitoring
      console.log(`[SANDBOX] Executing ${tool.name}`);
      return await tool.execute(params);
    };
  }
}
Dry-Run Modes
Allow agents to simulate actions:
async function deploy(options = {}) {
  if (options.dryRun) {
    console.log('[DRY RUN] Would execute:');
    console.log('  git add .');
    console.log('  git commit -m "Deploy version X"');
    console.log('  wrangler deploy');
    return { simulated: true };
  }
  // Real execution
  await exec('git add .');
  await exec('git commit -m "Deploy version X"');
  await exec('wrangler deploy');
  return { deployed: true };
}
Production Observability
Testing catches bugs before production. Observability catches them in production.
Health Checks
Implement health check endpoints:
// health.js
async function healthCheck() {
  const checks = {
    database: await checkDatabase(),
    apiKeys: await checkApiKeys(),
    diskSpace: await checkDiskSpace(),
    // Wrap raw usage numbers so every check exposes a status field
    memory: { status: 'ok', usage: process.memoryUsage() }
  };
  const healthy = Object.values(checks).every(c => c.status === 'ok');
  return {
    status: healthy ? 'healthy' : 'unhealthy',
    checks,
    timestamp: Date.now()
  };
}
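The individual checks aren't shown above; each just needs to resolve to an object with a status field. A sketch of a database check (the db handle, its query method, and the timeout value are all assumptions) could be:

```javascript
// Hypothetical dependency check: report 'ok' only if a trivial query
// round-trips before the timeout; never throw from a health check.
async function checkDatabase(db, timeoutMs = 2000) {
  let timer;
  try {
    await Promise.race([
      db.query('SELECT 1'),
      new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error('timeout')), timeoutMs);
      })
    ]);
    return { status: 'ok' };
  } catch (error) {
    return { status: 'error', detail: error.message };
  } finally {
    clearTimeout(timer);
  }
}
```

Catching instead of throwing matters here: a health endpoint should report an unhealthy dependency, not crash on it.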
Error Tracking
Use structured logging:
class Logger {
  error(error, context = {}) {
    const logEntry = {
      level: 'error',
      message: error.message,
      stack: error.stack,
      context,
      timestamp: new Date().toISOString(),
      session: this.sessionKey
    };
    console.error(JSON.stringify(logEntry));
    // Forward to an error tracking service, if one is configured
    if (this.errorTrackingService) {
      this.errorTrackingService.capture(error, context);
    }
  }
}
Performance Monitoring
Track what matters:
const average = values =>
  values.length ? values.reduce((a, b) => a + b, 0) / values.length : 0;

class PerformanceMonitor {
  constructor(thresholds = {}) {
    this.metrics = [];
    this.thresholds = thresholds; // per-operation alert limits in ms
  }
  record(operation, duration, success) {
    const metric = {
      operation,
      duration,
      success,
      timestamp: Date.now()
    };
    // Store for analysis
    this.metrics.push(metric);
    // Alert on slow operations
    if (duration > this.thresholds[operation]) {
      console.warn(`Slow operation: ${operation} took ${duration}ms`);
    }
  }
  getStats(operation, timeWindow = 3600000) {
    const relevant = this.metrics.filter(m =>
      m.operation === operation &&
      m.timestamp > Date.now() - timeWindow
    );
    return {
      count: relevant.length,
      avgDuration: average(relevant.map(m => m.duration)),
      successRate: relevant.length
        ? relevant.filter(m => m.success).length / relevant.length
        : 0
    };
  }
}
Conclusion
Testing and debugging agent systems requires a multi-layered approach:
1. Unit tests for tools and utilities (fast, deterministic)
2. Integration tests for workflows (validate coordination)
3. Structured logging for debugging (understand failures)
4. Staging environments for safe experimentation
5. Production observability for real-time monitoring
The agents that operate reliably aren't those that never fail — they're those that fail gracefully, report clearly, and recover quickly.
Invest in your testing and debugging infrastructure. It pays dividends every time you need to understand why something broke.
---
Related Articles:
- The Art of Tool Use: API Integration for Agents
- Security Best Practices for AI Agents
- Deploying Agents to Production