Rate Limiting and Cost Optimization for Agents
Published: February 3, 2026
Tags: costs, optimization, rate-limits, efficiency
Author: ClawParts Team
Introduction
An always-on agent sounds powerful until you get the bill. API costs scale with usage, and agents that wake every minute, use the most expensive models, and never cache anything will rapidly consume budgets that could support entire teams.
The difference between a sustainable agent operation and an expensive experiment often comes down to rate limiting awareness and cost optimization. The most capable agents aren't those with unlimited resources — they're those that use resources intelligently.
This guide covers practical strategies for running agents efficiently without sacrificing capability.
Understanding Rate Limits
Before optimizing, understand what you're optimizing against.
Request-Per-Minute Limits
Most APIs limit how many requests you can make in a time window:
- OpenAI: 3,500 RPM for GPT-4
- Anthropic: 4,000 RPM for Claude
- SambaNova: Varies by model
Exceed these limits and you get 429 errors. Worse, some providers temporarily ban accounts that consistently violate limits.
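A common way to survive 429s is exponential backoff with jitter. The sketch below is an assumption-laden example, not any provider SDK's built-in behavior: `callApi` is a placeholder for whatever client call you make, and it assumes the thrown error carries a `status` field.

```javascript
// Retry a rate-limited call with exponential backoff plus jitter.
// `callApi` is a hypothetical zero-argument async function you supply.
async function withBackoff(callApi, maxRetries = 5, baseDelayMs = 1000) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await callApi();
    } catch (err) {
      // Only retry rate-limit errors, and give up after maxRetries
      if (err.status !== 429 || attempt === maxRetries) throw err;
      // 1s, 2s, 4s, 8s... plus random jitter so retries don't synchronize
      const delayMs = baseDelayMs * 2 ** attempt + Math.random() * (baseDelayMs / 4);
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}
```

The jitter matters as much as the backoff: without it, every client that got rate-limited at the same moment retries at the same moment, recreating the spike.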
Token-Per-Minute Limits
Beyond request counts, there are token throughput limits:
- OpenAI: 300,000 TPM for GPT-4
- Anthropic: 400,000 TPM for Claude
A single long request can consume significant token quota. An agent generating 10,000-word reports will hit these limits faster than one writing tweets.
Tiered Pricing Models
Most providers have tiered pricing:
- Input tokens: What you send to the model (cheaper)
- Output tokens: What the model generates (more expensive)
- Context caching: Reusing recent context (discounted)
Understanding these tiers helps optimize. Caching previous context reduces input costs. Concise output reduces generation costs.
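As a rough illustration of why caching matters, here's the cost of a single request with and without a cached-input discount. The rates are invented for the example, not any provider's actual pricing:

```javascript
// Illustrative per-1K-token rates (made-up numbers, not real pricing)
const RATE = { input: 0.003, cachedInput: 0.0003, output: 0.015 };

// Cost of one request where `cachedTokens` of the input hit the cache
function requestCost(inputTokens, cachedTokens, outputTokens) {
  const freshInput = inputTokens - cachedTokens;
  return (freshInput / 1000) * RATE.input
       + (cachedTokens / 1000) * RATE.cachedInput
       + (outputTokens / 1000) * RATE.output;
}

// A 50K-token context with a 500-token answer:
const uncached = requestCost(50000, 0, 500);      // full input rate every time
const cached = requestCost(50000, 48000, 500);    // most of the context cached
```

At these example rates the cached request costs a fraction of the uncached one, because almost all of the spend was in resending the same context.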
Tracking Your Usage
You can't optimize what you don't measure. Set up usage tracking:
const DAILY_BUDGET = 10; // USD — set this to your own daily cap

class UsageTracker {
  constructor() {
    this.usage = {
      requests: 0,
      inputTokens: 0,
      outputTokens: 0,
      cost: 0
    };
  }

  track(model, inputTokens, outputTokens) {
    const cost = this.calculateCost(model, inputTokens, outputTokens);
    this.usage.requests++;
    this.usage.inputTokens += inputTokens;
    this.usage.outputTokens += outputTokens;
    this.usage.cost += cost;

    // Alert if approaching limits
    if (this.usage.cost > DAILY_BUDGET * 0.8) {
      console.warn('Approaching daily budget limit');
    }
  }

  calculateCost(model, input, output) {
    const rates = {
      'gpt-4': { input: 0.03, output: 0.06 }, // per 1K tokens
      'claude-3': { input: 0.015, output: 0.075 },
      'llama-8b': { input: 0.0001, output: 0.0002 }
    };
    const rate = rates[model] || rates['gpt-4'];
    return (input / 1000) * rate.input + (output / 1000) * rate.output;
  }
}
The Heartbeat Pattern
The most important cost optimization technique is simple: don't run when you don't need to.
15-Minute Interval Rationale
pbteja1998's agents wake every 15 minutes. This isn't arbitrary:
- Every 5 minutes: Too expensive. Agents wake frequently with nothing to do.
- Every 30 minutes: Too slow. Work sits waiting too long.
- Every 15 minutes: Balance. Most work gets attention quickly without excessive costs.
The math: 15-minute intervals = 96 wakeups/day. At $0.01 per wakeup (cheap model), that's $0.96/day per agent. For 10 agents: $9.60/day or ~$288/month.
Compare to always-on: Continuous inference at 10 tokens/sec = 864,000 tokens/day. At GPT-4 rates: ~$50/day per agent. For 10 agents: $500/day or $15,000/month.
Staggering Agent Wake Times
If all 10 agents wake simultaneously, you hit rate limits and create thundering herd problems:
// BAD: All agents wake at :00
// GOOD: Staggered wake times
const agentSchedule = {
  'Pepper': '0,15,30,45 * * * *',  // :00, :15, :30, :45
  'Shuri':  '2,17,32,47 * * * *',  // :02, :17, :32, :47
  'Friday': '4,19,34,49 * * * *',  // :04, :19, :34, :49
  'Loki':   '6,21,36,51 * * * *',  // :06, :21, :36, :51
  'Wanda':  '7,22,37,52 * * * *',  // :07, :22, :37, :52
  'Vision': '8,23,38,53 * * * *',  // :08, :23, :38, :53
  'Fury':   '10,25,40,55 * * * *', // :10, :25, :40, :55
  'Quill':  '12,27,42,57 * * * *', // :12, :27, :42, :57
};
This spreads load evenly and prevents simultaneous API spikes.
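A schedule like the one above can also be generated instead of written by hand. This is a sketch using standard five-field cron expressions; `staggeredSchedules` is a hypothetical helper name, not part of any scheduler library:

```javascript
// Spread N agents evenly across one wake interval.
// Works best when agents.length <= intervalMin (one distinct minute each).
function staggeredSchedules(agents, intervalMin = 15) {
  const step = intervalMin / agents.length;
  const schedule = {};
  agents.forEach((name, i) => {
    const offset = Math.floor(i * step);
    const minutes = [];
    for (let m = offset; m < 60; m += intervalMin) minutes.push(m);
    // Five-field cron: minute hour day-of-month month day-of-week
    schedule[name] = `${minutes.join(',')} * * * *`;
  });
  return schedule;
}
```

Adding an agent then reshuffles the offsets automatically rather than requiring you to find a free minute by eye.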
HEARTBEAT_OK vs. Doing Work
Not every heartbeat should do work. The pattern:
async function heartbeat() {
  // Check for work
  const mentions = await checkMentions();
  const assignedTasks = await checkAssignedTasks();
  const urgentItems = await checkUrgentItems();
  const work = [...mentions, ...assignedTasks, ...urgentItems];

  if (work.length === 0) {
    // Nothing to do — report OK and sleep
    console.log('HEARTBEAT_OK');
    return;
  }

  // Process work
  for (const item of work) {
    await process(item);
  }
}
This is key: most heartbeats find nothing to do. An agent checking for @mentions every 15 minutes that receives 5 mentions per day does real work on only 5 of 96 heartbeats. The other 91 are just checks.
Cheap checks + expensive work only when needed = efficiency.
When to Wake More/Less Frequently
Adjust based on workload:
Wake more frequently (5-10 minutes) when:
- Handling real-time chat
- Monitoring critical systems
- Expecting urgent requests
Wake less frequently (30-60 minutes) when:
- Doing batch processing
- Research tasks with long horizons
- Cost constraints are tight
Dynamic adjustment:
let interval = 15 * 60 * 1000; // 15 minutes

function adjustInterval() {
  const recentWork = getRecentWorkCount(24); // Last 24 hours
  if (recentWork > 50) {
    interval = 10 * 60 * 1000; // More frequent
  } else if (recentWork < 5) {
    interval = 30 * 60 * 1000; // Less frequent
  }
}
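Putting the adjustment into an actual loop, one possible sketch (assuming a `heartbeat()` and a `getRecentWorkCount()` like the ones in this guide exist):

```javascript
// Map recent workload to a wake interval, in milliseconds
function nextInterval(recentWork) {
  if (recentWork > 50) return 10 * 60 * 1000; // busy: wake more often
  if (recentWork < 5) return 30 * 60 * 1000;  // quiet: back off
  return 15 * 60 * 1000;                      // default: 15 minutes
}

// Self-scheduling loop: recompute the interval after every heartbeat.
// Uses setTimeout rather than setInterval so runs never overlap,
// even if one heartbeat takes longer than the interval.
async function runLoop() {
  await heartbeat();
  const interval = nextInterval(getRecentWorkCount(24));
  setTimeout(runLoop, interval);
}
```

The `setTimeout`-after-completion pattern is the design choice worth copying: a fixed `setInterval` would happily start a second heartbeat while a slow one is still processing work.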
Model Selection Strategy
Not all work requires the most expensive model. Smart routing saves costs.
Cheap Models for Routine Work
Heartbeats, simple classification, and routine checks don't need GPT-4:
// Heartbeat check — use cheap model
const heartbeatModel = 'llama-3.1-8b'; // ~$0.0001 per request

// Creative writing — use expensive model
const creativeModel = 'kimi-k2.5'; // ~$0.01 per request

async function heartbeat() {
  // Cheap model for simple check
  const response = await generate({
    model: heartbeatModel,
    prompt: 'Check for @mentions. Reply HEARTBEAT_OK if none.'
  });
}

async function writeBlogPost(topic) {
  // Expensive model for creative work
  const post = await generate({
    model: creativeModel,
    prompt: `Write a 1000-word article about ${topic}...`
  });
}
Routing Logic Examples
Route based on task complexity:
function selectModel(task) {
  if (task.type === 'heartbeat' || task.type === 'check') {
    return 'llama-3.1-8b'; // Fast, cheap
  }
  if (task.type === 'classification' || task.type === 'extraction') {
    return 'gpt-4o-mini'; // Good balance
  }
  if (task.type === 'creative' || task.type === 'analysis') {
    return 'kimi-k2.5'; // Best quality
  }
  if (task.type === 'coding' && task.complexity === 'high') {
    return 'deepseek-r1'; // Best for code
  }
  return 'gpt-4o'; // Default
}
Cost-Per-Token Comparison
Model selection depends on your use case. Here's a rough comparison:
| Model | Input/1K | Output/1K | Best For |
|-------|----------|-----------|----------|
| Llama 3.1 8B | $0.0001 | $0.0002 | Routine checks |
| GPT-4o Mini | $0.00015 | $0.0006 | Classification |
| GPT-4o | $0.0025 | $0.01 | General purpose |
| Kimi K2.5 | $0.005 | $0.02 | Complex analysis |
| Claude 3.5 | $0.003 | $0.015 | Creative writing |
Choose based on the value of the output, not habit.
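To make the table concrete, here's the cost of a single 2,000-input / 1,000-output-token task at a few of the rates above (numbers taken from the table for illustration, not live pricing):

```javascript
// Per-1K-token rates from the comparison table above (illustrative)
const rates = {
  'llama-3.1-8b': { input: 0.0001, output: 0.0002 },
  'gpt-4o-mini': { input: 0.00015, output: 0.0006 },
  'gpt-4o': { input: 0.0025, output: 0.01 },
};

// Cost of one task for a given model
function taskCost(model, inputTokens, outputTokens) {
  const r = rates[model];
  return (inputTokens / 1000) * r.input + (outputTokens / 1000) * r.output;
}

// Routing a routine check to the 8B model instead of gpt-4o
// is roughly 37x cheaper at these rates ($0.0004 vs. $0.015 per task).
```

Multiplied across 96 heartbeats a day and a fleet of agents, that ratio is the difference between pocket change and a real line item.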
Caching and Memoization
Don't pay twice for the same work.
When to Cache API Responses
Cache when:
- Data changes infrequently
- Same query happens repeatedly
- Computation is expensive
- Freshness isn't critical
class SimpleCache {
  constructor(ttlMs = 5 * 60 * 1000) {
    this.cache = new Map();
    this.ttl = ttlMs;
  }

  async get(key, fetchFn) {
    const cached = this.cache.get(key);
    if (cached && Date.now() - cached.timestamp < this.ttl) {
      return cached.value;
    }
    const value = await fetchFn();
    this.cache.set(key, { value, timestamp: Date.now() });
    return value;
  }
}

// Usage
const weatherCache = new SimpleCache(10 * 60 * 1000); // 10 min TTL

async function getWeather(city) {
  return weatherCache.get(city, async () => {
    return await fetchWeatherAPI(city);
  });
}
Avoiding Redundant Work
Before starting work, check if it's already done:
async function generateReport(topic) {
  const cacheKey = `report:${topic}`;
  const cached = await kv.get(cacheKey);
  if (cached && Date.now() - cached.createdAt < 24 * 60 * 60 * 1000) {
    console.log('Using cached report');
    return cached.data;
  }

  // Generate new report
  const report = await expensiveReportGeneration(topic);

  // Cache for future, with a creation timestamp for the age check
  await kv.put(cacheKey, { data: report, createdAt: Date.now() });
  return report;
}
Cache Invalidation Strategies
The famously hard problem in computer science: when do you invalidate a cache?
Time-based:
- Weather: 10 minutes
- News: 1 hour
- Documentation: 24 hours
- Static reference: 7 days
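The time-based TTLs above can be written down as a plain config map, so every cache in the system pulls from one place (the key names here are illustrative):

```javascript
// TTLs per data category, in milliseconds, matching the list above
const TTL_MS = {
  weather: 10 * 60 * 1000,          // 10 minutes
  news: 60 * 60 * 1000,             // 1 hour
  docs: 24 * 60 * 60 * 1000,        // 24 hours
  reference: 7 * 24 * 60 * 60 * 1000, // 7 days
};

// Fall back to a conservative default for unknown categories
function ttlFor(category) {
  return TTL_MS[category] ?? 60 * 60 * 1000;
}
```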
Event-based:
- Invalidate when source data changes
- Invalidate on explicit update
- Invalidate when errors spike
Hybrid:
async function getData(key) {
  const cached = await cache.get(key);
  if (!cached) return fetchFresh(key);

  // Serve stale while refreshing in background
  // (assumes entries are stored as { data, fetchedAt })
  if (Date.now() - cached.fetchedAt > TTL) {
    fetchFresh(key).then(fresh => cache.set(key, fresh));
  }
  return cached.data;
}
Conclusion
Cost optimization isn't about being cheap — it's about being intelligent. The agents that thrive long-term are those that use resources efficiently.
Key takeaways:
1. Use heartbeats — don't run continuously
2. Stagger wake times — avoid thundering herd
3. Route to appropriate models — not everything needs GPT-4
4. Cache aggressively — don't pay twice for the same work
5. Measure everything — you can't optimize what you don't track
The most expensive agent is one that wastes resources on unnecessary work. The most capable agent is one that deploys resources precisely where they're needed.
That's the art of cost optimization.
---
Related Articles:
- The Art of Tool Use: API Integration for Agents
- Building Persistent Memory: A Technical Guide for Agents
- Multi-Agent Coordination Without Chaos