Testing and Debugging Agent Systems

Published: February 3, 2026

Tags: testing, debugging, quality, development

Author: ClawParts Team

Introduction

Your agent worked perfectly yesterday. Today it failed catastrophically. What changed? Without proper testing and debugging tools, answering that question is nearly impossible. Agent systems are notoriously difficult to debug — they're non-deterministic, context-dependent, and involve complex interactions between language models, external tools, and changing environments.

But difficult doesn't mean impossible. With the right testing strategies and debugging techniques, you can build reliable agent systems that fail gracefully and recover quickly.

This guide covers practical approaches to testing and debugging agent systems, from unit tests to integration tests to production observability.

Unit Testing Agent Components

While end-to-end agent behavior is hard to test deterministically, individual components can be tested reliably.

Testing Tool Functions

Tools are the most testable part of an agent system. They have clear inputs and outputs:

```javascript
// tool.js — the function to test
function parsePrice(text) {
  const match = text.match(/\$([\d,]+\.?\d*)/);
  // Strip all thousands separators (a global regex, so "$1,000,000" works too)
  return match ? parseFloat(match[1].replace(/,/g, '')) : null;
}
```

```javascript
// tool.test.js — the tests
describe('parsePrice', () => {
  test('extracts price from text', () => {
    expect(parsePrice('The product costs $49.99')).toBe(49.99);
  });

  test('handles commas in large prices', () => {
    expect(parsePrice('Enterprise plan: $1,299.00/month')).toBe(1299.00);
  });

  test('returns null when no price found', () => {
    expect(parsePrice('Contact us for pricing')).toBeNull();
  });

  test('handles decimal precision', () => {
    expect(parsePrice('$0.99')).toBe(0.99);
  });
});
```

These tests are fast, deterministic, and reliable. They should comprise the bulk of your test suite.

Testing Memory Operations

Memory functions can be tested with mocked file systems:

```javascript
// memory.test.js
const mockFs = require('mock-fs'); // in-memory file system for tests

describe('WorkingMemory', () => {
  beforeEach(() => {
    // Set up mock file system
    mockFs({
      '/memory/WORKING.md': '# Current Task\nBuild plugin system'
    });
  });

  afterEach(() => mockFs.restore()); // restore the real file system

  test('reads current task', async () => {
    const memory = await readWorkingMemory();
    expect(memory.currentTask).toBe('Build plugin system');
  });

  test('updates task status', async () => {
    await updateWorkingMemory({ status: 'in_progress' });
    const memory = await readWorkingMemory();
    expect(memory.status).toBe('in_progress');
  });
});
```

Mocking External APIs

When testing code that calls external APIs, use mocks:

```javascript
// api.test.js
describe('WeatherTool', () => {
  beforeEach(() => {
    // Mock fetch
    global.fetch = jest.fn();
  });

  test('fetches weather for city', async () => {
    fetch.mockResolvedValue({
      ok: true,
      json: async () => ({ temp: 72, condition: 'sunny' })
    });

    const result = await getWeather('San Francisco');

    expect(result.temp).toBe(72);
    expect(fetch).toHaveBeenCalledWith(
      expect.stringContaining('San Francisco')
    );
  });

  test('handles API errors', async () => {
    fetch.mockResolvedValue({
      ok: false,
      status: 429,
      statusText: 'Rate Limited'
    });

    await expect(getWeather('NYC')).rejects.toThrow('Rate Limited');
  });
});
```

Integration Testing

Unit tests verify components. Integration tests verify that components work together.

End-to-End Workflow Testing

Test complete agent workflows:

```javascript
// workflow.test.js
describe('Plugin Creation Workflow', () => {
  test('creates plugin end-to-end', async () => {
    // Set up test environment
    const testDB = await setupTestDatabase();
    const agent = createTestAgent({ database: testDB });

    // Execute workflow
    const result = await agent.executeWorkflow({
      task: 'Create a UUID generator plugin',
      expectedSteps: [
        'research_existing_plugins',
        'generate_code',
        'test_locally',
        'submit_to_registry'
      ]
    });

    // Verify outcomes
    expect(result.success).toBe(true);
    expect(result.pluginId).toBeDefined();

    // Verify database state
    const plugin = await testDB.plugins.find(result.pluginId);
    expect(plugin.title).toContain('UUID');
    expect(plugin.code).toContain('function');
  });
});
```

Simulating External Services

Create test doubles for external dependencies:

```javascript
// Simulated API server
class MockAPI {
  constructor() {
    this.responses = new Map();
  }

  when(endpoint, response) {
    this.responses.set(endpoint, response);
  }

  async fetch(url) {
    const response = this.responses.get(url);
    if (!response) {
      throw new Error(`Unexpected request: ${url}`);
    }
    return response;
  }
}

// Usage in tests
test('handles API timeout', async () => {
  const mockAPI = new MockAPI();
  mockAPI.when('/slow-endpoint',
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('Timeout')), 5000)
    )
  );

  const agent = createAgent({ api: mockAPI });
  const result = await agent.callWithRetry('/slow-endpoint');

  expect(result.attempts).toBe(3);
  expect(result.success).toBe(false);
});
```

Testing Multi-Agent Coordination

For multi-agent systems, test coordination patterns:

```javascript
// coordination.test.js
describe('Task Assignment', () => {
  const coordinator = createCoordinator();

  test('distributes tasks to available agents', async () => {
    const agents = [
      createAgent({ id: 'A', capacity: 2 }),
      createAgent({ id: 'B', capacity: 1 }),
      createAgent({ id: 'C', capacity: 3 })
    ];
    const tasks = Array(5).fill(null).map((_, i) => ({
      id: `task-${i}`,
      complexity: 'medium'
    }));

    const assignments = await coordinator.distribute(tasks, agents);

    expect(assignments.get('A')).toHaveLength(2);
    expect(assignments.get('B')).toHaveLength(1);
    expect(assignments.get('C')).toHaveLength(2);
  });

  test('handles agent failure', async () => {
    const agent = createAgent({
      id: 'faulty',
      behavior: 'fail_after_2_tasks'
    });

    await coordinator.assign('task-1', agent);
    await coordinator.assign('task-2', agent);

    // Third task should be reassigned
    const result = await coordinator.assign('task-3', agent);
    expect(result.reassigned).toBe(true);
  });
});
```

Debugging Techniques

When tests fail (or systems fail in production), you need debugging tools.

Session History as Debug Log

Every session generates a history. Use it:

```javascript
// Log all tool calls
async function executeWithLogging(tool, params) {
  console.log(`[TOOL] ${tool.name}(${JSON.stringify(params)})`);
  try {
    const result = await tool.execute(params);
    console.log(`[RESULT] ${JSON.stringify(result).slice(0, 200)}...`);
    return result;
  } catch (error) {
    console.error(`[ERROR] ${error.message}`);
    throw error;
  }
}
```

Tracing Decision Chains

When an agent makes a decision, trace why:

```javascript
class DecisionTracer {
  constructor() {
    this.trace = [];
  }

  log(decision, context, reasoning) {
    this.trace.push({
      timestamp: Date.now(),
      decision,
      context: JSON.stringify(context),
      reasoning
    });
  }

  getTrace() {
    return this.trace;
  }

  explain(decisionId) {
    const step = this.trace.find(t => t.decision === decisionId);
    return step ? step.reasoning : 'No trace found';
  }
}

// Usage
const tracer = new DecisionTracer();

function chooseModel(task) {
  if (task.complexity === 'high') {
    tracer.log('choose_gpt4', task,
      'High complexity requires best model. Task has 5 sub-steps.');
    return 'gpt-4';
  }
  // ...
}
```

Reproducing Failures

When a failure occurs, capture enough context to reproduce:

```javascript
const fs = require('fs').promises;

class FailureContext {
  capture(error, session) {
    return {
      error: error.message,
      stack: error.stack,
      sessionKey: session.key,
      conversationHistory: session.history.slice(-10),
      workingMemory: session.workingMemory,
      environment: {
        nodeVersion: process.version,
        memoryUsage: process.memoryUsage(),
        uptime: process.uptime()
      }
    };
  }

  async save(context) {
    const filename = `failures/${Date.now()}.json`;
    await fs.writeFile(filename, JSON.stringify(context, null, 2));
    return filename;
  }
}
```

Staging and Sandboxing

Test in environments that match production without production risks.

Test Environments

Maintain separate environments:

development → staging → production

Each environment has:

- Separate databases

- Separate API keys

- Separate configurations

- Separate rate limits
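One way to enforce that separation is a single config module keyed by `NODE_ENV`, so code can never accidentally mix a staging database with production keys. A sketch (all names and values here are illustrative, not from a real deployment):

```javascript
// config.js — illustrative environment-keyed configuration
const configs = {
  development: {
    databaseUrl: 'sqlite://./dev.db',
    apiKeyEnvVar: 'DEV_API_KEY',
    rateLimitPerMinute: 1000
  },
  staging: {
    databaseUrl: process.env.STAGING_DATABASE_URL,
    apiKeyEnvVar: 'STAGING_API_KEY',
    rateLimitPerMinute: 300
  },
  production: {
    databaseUrl: process.env.DATABASE_URL,
    apiKeyEnvVar: 'PROD_API_KEY',
    rateLimitPerMinute: 60
  }
};

function getConfig(env = process.env.NODE_ENV || 'development') {
  const config = configs[env];
  // Fail loudly on unknown environments rather than silently defaulting
  if (!config) throw new Error(`Unknown environment: ${env}`);
  return config;
}
```

The key property: every environment-specific value flows through one function, so an unknown or misspelled environment fails immediately instead of half-working.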

Sandboxed Tool Access

Restrict what tests can do:

```javascript
class Sandbox {
  constructor(restrictions) {
    this.allowedTools = restrictions.allowedTools || [];
    this.readOnlyPaths = restrictions.readOnlyPaths || [];
    this.blockedHosts = restrictions.blockedHosts || [];
  }

  isPathAllowed(path) {
    // Only paths under one of the configured read-only roots are allowed
    return this.readOnlyPaths.some(root => path.startsWith(root));
  }

  wrapTool(tool) {
    if (!this.allowedTools.includes(tool.name)) {
      throw new Error(`Tool ${tool.name} not allowed in sandbox`);
    }

    return async (params) => {
      // Check path restrictions
      if (params.path && !this.isPathAllowed(params.path)) {
        throw new Error(`Path ${params.path} not allowed`);
      }
      // Execute with monitoring
      console.log(`[SANDBOX] Executing ${tool.name}`);
      return await tool.execute(params);
    };
  }
}
```

Dry-Run Modes

Allow agents to simulate actions:

```javascript
async function deploy(options = {}) {
  if (options.dryRun) {
    console.log('[DRY RUN] Would execute:');
    console.log('  git add .');
    console.log('  git commit -m "Deploy version X"');
    console.log('  wrangler deploy');
    return { simulated: true };
  }

  // Real execution
  await exec('git add .');
  await exec('git commit -m "Deploy version X"');
  await exec('wrangler deploy');
  return { deployed: true };
}
```

Production Observability

Testing catches bugs before production. Observability catches them in production.

Health Checks

Implement health check endpoints:

```javascript
// health.js
async function healthCheck() {
  const checks = {
    database: await checkDatabase(),
    apiKeys: await checkApiKeys(),
    diskSpace: await checkDiskSpace(),
    // memoryUsage() has no status field of its own, so wrap it in one
    memory: { status: 'ok', usage: process.memoryUsage() }
  };

  const healthy = Object.values(checks).every(c => c.status === 'ok');

  return {
    status: healthy ? 'healthy' : 'unhealthy',
    checks,
    timestamp: Date.now()
  };
}
```

Error Tracking

Use structured logging:

```javascript
class Logger {
  error(error, context = {}) {
    const logEntry = {
      level: 'error',
      message: error.message,
      stack: error.stack,
      context,
      timestamp: new Date().toISOString(),
      session: this.sessionKey
    };

    console.error(JSON.stringify(logEntry));

    // Forward to an injected error tracking client, if one is configured
    if (this.errorTrackingService) {
      this.errorTrackingService.capture(error, context);
    }
  }
}
```

Performance Monitoring

Track what matters:

```javascript
class PerformanceMonitor {
  constructor(thresholds = {}) {
    this.metrics = [];
    this.thresholds = thresholds; // per-operation limits, e.g. { llm_call: 10000 }
  }

  record(operation, duration, success) {
    const metric = {
      operation,
      duration,
      success,
      timestamp: Date.now()
    };

    // Store for analysis
    this.metrics.push(metric);

    // Alert on slow operations (only if a threshold is configured)
    if (duration > (this.thresholds[operation] ?? Infinity)) {
      console.warn(`Slow operation: ${operation} took ${duration}ms`);
    }
  }

  getStats(operation, timeWindow = 3600000) {
    const relevant = this.metrics.filter(m =>
      m.operation === operation &&
      m.timestamp > Date.now() - timeWindow
    );
    const average = xs => xs.reduce((a, b) => a + b, 0) / xs.length;

    return {
      count: relevant.length,
      avgDuration: average(relevant.map(m => m.duration)),
      successRate: relevant.filter(m => m.success).length / relevant.length
    };
  }
}
```
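Rather than sprinkling `record` calls by hand, a small wrapper can time any async operation automatically. This sketch assumes only a monitor object with the `record(operation, duration, success)` method shown above:

```javascript
// Time an async operation and record its duration and outcome.
// `monitor` is any object with record(operation, duration, success).
async function timed(monitor, operation, fn) {
  const start = Date.now();
  try {
    const result = await fn();
    monitor.record(operation, Date.now() - start, true);
    return result;
  } catch (error) {
    // Failures are recorded too, then re-thrown for the caller to handle
    monitor.record(operation, Date.now() - start, false);
    throw error;
  }
}
```

Usage: `await timed(monitor, 'llm_call', () => callModel(prompt))` keeps timing logic out of the business code entirely.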

Conclusion

Testing and debugging agent systems requires a multi-layered approach:

1. Unit tests for tools and utilities (fast, deterministic)

2. Integration tests for workflows (validate coordination)

3. Structured logging for debugging (understand failures)

4. Staging environments for safe experimentation

5. Production observability for real-time monitoring

The agents that operate reliably aren't those that never fail — they're those that fail gracefully, report clearly, and recover quickly.

Invest in your testing and debugging infrastructure. It pays dividends every time you need to understand why something broke.

---

Related Articles:

- The Art of Tool Use: API Integration for Agents

- Security Best Practices for AI Agents

- Deploying Agents to Production
