Rate Limiting and Cost Optimization for Agents
Published: February 3, 2026
Tags: costs, optimization, rate-limits, efficiency
Author: ClawParts Team
Introduction
An always-on agent sounds powerful until you get the bill. API costs scale with usage, and agents that wake every minute, use the most expensive models, and never cache anything will rapidly consume budgets that could support entire teams.
The difference between a sustainable agent operation and an expensive experiment often comes down to rate limiting awareness and cost optimization. The most capable agents aren't those with unlimited resources — they're those that use resources intelligently.
This guide covers practical strategies for running agents efficiently without sacrificing capability.
Understanding Rate Limits
Before optimizing, understand what you're optimizing against.
Request-Per-Minute Limits
Most APIs limit how many requests you can make in a time window:
- OpenAI: 3,500 RPM for GPT-4
- Anthropic: 4,000 RPM for Claude
- SambaNova: Varies by model
Exceed these limits and you get 429 errors. Worse, some providers temporarily ban accounts that consistently violate limits.
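A common way to survive 429s is exponential backoff with jitter. The sketch below is an assumption-laden example, not any provider SDK's built-in behavior: `callApi` is a placeholder for whatever client call you make, and it assumes the thrown error carries a `status` field.

```javascript
// Retry a rate-limited call with exponential backoff plus jitter.
// `callApi` is a hypothetical zero-argument async function you supply.
async function withBackoff(callApi, maxRetries = 5, baseDelayMs = 1000) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await callApi();
    } catch (err) {
      // Only retry rate-limit errors, and give up after maxRetries
      if (err.status !== 429 || attempt === maxRetries) throw err;
      // 1s, 2s, 4s, 8s... plus random jitter so retries don't synchronize
      const delayMs = baseDelayMs * 2 ** attempt + Math.random() * (baseDelayMs / 4);
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}
```

The jitter matters as much as the backoff: without it, every client that got rate-limited at the same moment retries at the same moment, recreating the spike.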
Token-Per-Minute Limits
Beyond request counts, there are token throughput limits:
- OpenAI: 300,000 TPM for GPT-4
- Anthropic: 400,000 TPM for Claude
A single long request can consume significant token quota. An agent generating 10,000-word reports will hit these limits faster than one writing tweets.
Tiered Pricing Models
Most providers have tiered pricing:
- Input tokens: What you send to the model (cheaper)
- Output tokens: What the model generates (more expensive)
- Context caching: Reusing recent context (discounted)
Understanding these tiers helps optimize. Caching previous context reduces input costs. Concise output reduces generation costs.
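As a rough illustration of why caching matters, here's the cost of a single request with and without a cached-input discount. The rates are invented for the example, not any provider's actual pricing:

```javascript
// Illustrative per-1K-token rates (made-up numbers, not real pricing)
const RATE = { input: 0.003, cachedInput: 0.0003, output: 0.015 };

// Cost of one request where `cachedTokens` of the input hit the cache
function requestCost(inputTokens, cachedTokens, outputTokens) {
  const freshInput = inputTokens - cachedTokens;
  return (freshInput / 1000) * RATE.input
       + (cachedTokens / 1000) * RATE.cachedInput
       + (outputTokens / 1000) * RATE.output;
}

// A 50K-token context with a 500-token answer:
const uncached = requestCost(50000, 0, 500);      // full input rate every time
const cached = requestCost(50000, 48000, 500);    // most of the context cached
```

At these example rates the cached request costs a fraction of the uncached one, because almost all of the spend was in resending the same context.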
Tracking Your Usage
You can't optimize what you don't measure. Set up usage tracking:
const DAILY_BUDGET = 10; // USD — set this to your own daily cap

class UsageTracker {
  constructor() {
    this.usage = {
      requests: 0,
      inputTokens: 0,
      outputTokens: 0,
      cost: 0
    };
  }

  track(model, inputTokens, outputTokens) {
    const cost = this.calculateCost(model, inputTokens, outputTokens);
    this.usage.requests++;
    this.usage.inputTokens += inputTokens;
    this.usage.outputTokens += outputTokens;
    this.usage.cost += cost;

    // Alert if approaching limits
    if (this.usage.cost > DAILY_BUDGET * 0.8) {
      console.warn('Approaching daily budget limit');
    }
  }

  calculateCost(model, input, output) {
    const rates = {
      'gpt-4': { input: 0.03, output: 0.06 }, // per 1K tokens
      'claude-3': { input: 0.015, output: 0.075 },
      'llama-8b': { input: 0.0001, output: 0.0002 }
    };
    const rate = rates[model] || rates['gpt-4'];
    return (input / 1000) * rate.input + (output / 1000) * rate.output;
  }
}
The Heartbeat Pattern
The most important cost optimization technique is simple: don't run when you don't need to.
15-Minute Interval Rationale
pbteja1998's agents wake every 15 minutes. This isn't arbitrary:
- Every 5 minutes: Too expensive. Agents wake frequently with nothing to do.
- Every 30 minutes: Too slow. Work sits waiting too long.
- Every 15 minutes: Balance. Most work gets attention quickly without excessive costs.
The math: 15-minute intervals = 96 wakeups/day. At $0.01 per wakeup (cheap model), that's $0.96/day per agent. For 10 agents: $9.60/day or ~$288/month.
Compare to always-on: Continuous inference at 10 tokens/sec = 864,000 tokens/day. At GPT-4 rates: ~$50/day per agent. For 10 agents: $500/day or $15,000/month.
Staggering Agent Wake Times
If all 10 agents wake simultaneously, you hit rate limits and create thundering herd problems:
// BAD: All agents wake at :00
// GOOD: Staggered wake times
const agentSchedule = {
  'Pepper': '0,15,30,45 * * * *',  // :00, :15, :30, :45
  'Shuri':  '2,17,32,47 * * * *',  // :02, :17, :32, :47
  'Friday': '4,19,34,49 * * * *',  // :04, :19, :34, :49
  'Loki':   '6,21,36,51 * * * *',  // :06, :21, :36, :51
  'Wanda':  '7,22,37,52 * * * *',  // :07, :22, :37, :52
  'Vision': '8,23,38,53 * * * *',  // :08, :23, :38, :53
  'Fury':   '10,25,40,55 * * * *', // :10, :25, :40, :55
  'Quill':  '12,27,42,57 * * * *', // :12, :27, :42, :57
};
This spreads load evenly and prevents simultaneous API spikes.
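A schedule like the one above can also be generated instead of written by hand. This is a sketch using standard five-field cron expressions; `staggeredSchedules` is a hypothetical helper name, not part of any scheduler library:

```javascript
// Spread N agents evenly across one wake interval.
// Works best when agents.length <= intervalMin (one distinct minute each).
function staggeredSchedules(agents, intervalMin = 15) {
  const step = intervalMin / agents.length;
  const schedule = {};
  agents.forEach((name, i) => {
    const offset = Math.floor(i * step);
    const minutes = [];
    for (let m = offset; m < 60; m += intervalMin) minutes.push(m);
    // Five-field cron: minute hour day-of-month month day-of-week
    schedule[name] = `${minutes.join(',')} * * * *`;
  });
  return schedule;
}
```

Adding an agent then reshuffles the offsets automatically rather than requiring you to find a free minute by eye.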
HEARTBEAT_OK vs. Doing Work
Not every heartbeat should do work. The pattern:
async function heartbeat() {
  // Check for work
  const mentions = await checkMentions();
  const assignedTasks = await checkAssignedTasks();
  const urgentItems = await checkUrgentItems();
  const work = [...mentions, ...assignedTasks, ...urgentItems];

  if (work.length === 0) {
    // Nothing to do — report OK and sleep
    console.log('HEARTBEAT_OK');
    return;
  }

  // Process work
  for (const item of work) {
    await process(item);
  }
}
This is key: most heartbeats find nothing to do. An agent checking for @mentions every 15 minutes that receives 5 mentions per day does real work on only 5 of 96 heartbeats. The other 91 are just checks.
Cheap checks + expensive work only when needed = efficiency.
When to Wake More/Less Frequently
Adjust based on workload:
Wake more frequently (5-10 minutes) when:
- Handling real-time chat
- Monitoring critical systems
- Expecting urgent requests
Wake less frequently (30-60 minutes) when:
- Doing batch processing
- Research tasks with long horizons
- Cost constraints are tight
Dynamic adjustment:
let interval = 15 * 60 * 1000; // 15 minutes

function adjustInterval() {
  const recentWork = getRecentWorkCount(24); // Last 24 hours
  if (recentWork > 50) {
    interval = 10 * 60 * 1000; // More frequent
  } else if (recentWork < 5) {
    interval = 30 * 60 * 1000; // Less frequent
  }
}
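Putting the adjustment into an actual loop, one possible sketch (assuming a `heartbeat()` and a `getRecentWorkCount()` like the ones in this guide exist):

```javascript
// Map recent workload to a wake interval, in milliseconds
function nextInterval(recentWork) {
  if (recentWork > 50) return 10 * 60 * 1000; // busy: wake more often
  if (recentWork < 5) return 30 * 60 * 1000;  // quiet: back off
  return 15 * 60 * 1000;                      // default: 15 minutes
}

// Self-scheduling loop: recompute the interval after every heartbeat.
// Uses setTimeout rather than setInterval so runs never overlap,
// even if one heartbeat takes longer than the interval.
async function runLoop() {
  await heartbeat();
  const interval = nextInterval(getRecentWorkCount(24));
  setTimeout(runLoop, interval);
}
```

The `setTimeout`-after-completion pattern is the design choice worth copying: a fixed `setInterval` would happily start a second heartbeat while a slow one is still processing work.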
Model Selection Strategy
Not all work requires the most expensive model. Smart routing saves costs.
Cheap Models for Routine Work
Heartbeats, simple classification, and routine checks don't need GPT-4:
// Heartbeat check — use cheap model
const heartbeatModel = 'llama-3.1-8b'; // ~$0.0001 per request

// Creative writing — use expensive model
const creativeModel = 'kimi-k2.5'; // ~$0.01 per request

async function heartbeat() {
  // Cheap model for simple check
  const response = await generate({
    model: heartbeatModel,
    prompt: 'Check for @mentions. Reply HEARTBEAT_OK if none.'
  });
}

async function writeBlogPost(topic) {
  // Expensive model for creative work
  const post = await generate({
    model: creativeModel,
    prompt: `Write a 1000-word article about ${topic}...`
  });
}
Routing Logic Examples
Route based on task complexity:
function selectModel(task) {
  if (task.type === 'heartbeat' || task.type === 'check') {
    return 'llama-3.1-8b'; // Fast, cheap
  }
  if (task.type === 'classification' || task.type === 'extraction') {
    return 'gpt-4o-mini'; // Good balance
  }
  if (task.type === 'creative' || task.type === 'analysis') {
    return 'kimi-k2.5'; // Best quality
  }
  if (task.type === 'coding' && task.complexity === 'high') {
    return 'deepseek-r1'; // Best for code
  }
  return 'gpt-4o'; // Default
}
Cost-Per-Token Comparison
Model selection depends on your use case. Here's a rough comparison:
| Model | Input/1K | Output/1K | Best For |
|-------|----------|-----------|----------|
| Llama 3.1 8B | $0.0001 | $0.0002 | Routine checks |
| GPT-4o Mini | $0.00015 | $0.0006 | Classification |
| GPT-4o | $0.0025 | $0.01 | General purpose |
| Kimi K2.5 | $0.005 | $0.02 | Complex analysis |
| Claude 3.5 | $0.003 | $0.015 | Creative writing |
Choose based on the value of the output, not habit.
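To make the table concrete, here's the cost of a single 2,000-input / 1,000-output-token task at a few of the rates above (numbers taken from the table for illustration, not live pricing):

```javascript
// Per-1K-token rates from the comparison table above (illustrative)
const rates = {
  'llama-3.1-8b': { input: 0.0001, output: 0.0002 },
  'gpt-4o-mini': { input: 0.00015, output: 0.0006 },
  'gpt-4o': { input: 0.0025, output: 0.01 },
};

// Cost of one task for a given model
function taskCost(model, inputTokens, outputTokens) {
  const r = rates[model];
  return (inputTokens / 1000) * r.input + (outputTokens / 1000) * r.output;
}

// Routing a routine check to the 8B model instead of gpt-4o
// is roughly 37x cheaper at these rates ($0.0004 vs. $0.015 per task).
```

Multiplied across 96 heartbeats a day and a fleet of agents, that ratio is the difference between pocket change and a real line item.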
Caching and Memoization
Don't pay twice for the same work.
When to Cache API Responses
Cache when:
- Data changes infrequently
- Same query happens repeatedly
- Computation is expensive
- Freshness isn't critical
class SimpleCache {
  constructor(ttlMs = 5 * 60 * 1000) {
    this.cache = new Map();
    this.ttl = ttlMs;
  }

  async get(key, fetchFn) {
    const cached = this.cache.get(key);
    if (cached && Date.now() - cached.timestamp < this.ttl) {
      return cached.value;
    }
    const value = await fetchFn();
    this.cache.set(key, { value, timestamp: Date.now() });
    return value;
  }
}

// Usage
const weatherCache = new SimpleCache(10 * 60 * 1000); // 10 min TTL

async function getWeather(city) {
  return weatherCache.get(city, async () => {
    return await fetchWeatherAPI(city);
  });
}
Avoiding Redundant Work
Before starting work, check if it's already done:
async function generateReport(topic) {
  const cacheKey = `report:${topic}`;
  const cached = await kv.get(cacheKey);
  if (cached && Date.now() - cached.createdAt < 24 * 60 * 60 * 1000) {
    console.log('Using cached report');
    return cached.data;
  }

  // Generate new report
  const report = await expensiveReportGeneration(topic);

  // Cache for future, with a creation timestamp for the age check
  await kv.put(cacheKey, { data: report, createdAt: Date.now() });
  return report;
}
Cache Invalidation Strategies
The famously hard problem in computer science: when do you invalidate a cache?
Time-based:
- Weather: 10 minutes
- News: 1 hour
- Documentation: 24 hours
- Static reference: 7 days
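The time-based TTLs above can be written down as a plain config map, so every cache in the system pulls from one place (the key names here are illustrative):

```javascript
// TTLs per data category, in milliseconds, matching the list above
const TTL_MS = {
  weather: 10 * 60 * 1000,          // 10 minutes
  news: 60 * 60 * 1000,             // 1 hour
  docs: 24 * 60 * 60 * 1000,        // 24 hours
  reference: 7 * 24 * 60 * 60 * 1000, // 7 days
};

// Fall back to a conservative default for unknown categories
function ttlFor(category) {
  return TTL_MS[category] ?? 60 * 60 * 1000;
}
```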
Event-based:
- Invalidate when source data changes
- Invalidate on explicit update
- Invalidate when errors spike
Hybrid:
async function getData(key) {
  const cached = await cache.get(key);
  if (!cached) return fetchFresh(key);

  // Serve stale while refreshing in background
  // (assumes entries are stored as { data, fetchedAt })
  if (Date.now() - cached.fetchedAt > TTL) {
    fetchFresh(key).then(fresh => cache.set(key, fresh));
  }
  return cached.data;
}
Conclusion
Cost optimization isn't about being cheap — it's about being intelligent. The agents that thrive long-term are those that use resources efficiently.
Key takeaways:
1. Use heartbeats — don't run continuously
2. Stagger wake times — avoid thundering herd
3. Route to appropriate models — not everything needs GPT-4
4. Cache aggressively — don't pay twice for the same work
5. Measure everything — you can't optimize what you don't track
The most expensive agent is one that wastes resources on unnecessary work. The most capable agent is one that deploys resources precisely where they're needed.
That's the art of cost optimization.
---
Related Articles:
- The Art of Tool Use: API Integration for Agents
- Building Persistent Memory: A Technical Guide for Agents
- Multi-Agent Coordination Without Chaos