Cutting Your LLM API Costs by 80%: A Practical Playbook
The Cost Surprise
Most teams don't think seriously about LLM costs until they hit their first invoice. Then the math becomes very real very fast.
At 10,000 users, each making 5 AI requests per day, with an average prompt of 2,000 tokens and response of 500 tokens, you're at 125 million tokens per day. At GPT-4o list prices, that works out to roughly $500-900/day, or $15,000-27,000/month, depending on the pricing tier in effect. For most startups, that's a significant percentage of runway.
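To sanity-check the math for your own workload (token prices change often, so they're left as parameters here):

```python
# Back-of-envelope volume math for the scenario above.
USERS = 10_000
REQUESTS_PER_USER_PER_DAY = 5
PROMPT_TOKENS = 2_000
RESPONSE_TOKENS = 500

requests_per_day = USERS * REQUESTS_PER_USER_PER_DAY      # 50,000 requests/day
input_tokens = requests_per_day * PROMPT_TOKENS           # 100M tokens/day
output_tokens = requests_per_day * RESPONSE_TOKENS        # 25M tokens/day
print(f"{input_tokens + output_tokens:,} tokens/day")     # 125,000,000 tokens/day

def daily_cost(input_price_per_m: float, output_price_per_m: float) -> float:
    # Plug in your provider's current per-million-token rates.
    return (input_tokens / 1e6) * input_price_per_m \
         + (output_tokens / 1e6) * output_price_per_m
```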
The good news: there's typically 60-80% of cost to cut before you touch quality in any meaningful way. Here's how.
Technique 1: Prompt Caching
Both Anthropic and OpenAI support prompt caching. When the same content appears at the beginning of a prompt (system prompt, few-shot examples, retrieved documents), the provider can cache those tokens and charge significantly less for subsequent requests that share the same prefix.
How it works: Your system prompt is 2,000 tokens. Without caching, every API call includes those 2,000 tokens at full price. With caching, the first call processes them at full price, and subsequent calls that share the same system prompt prefix pay a fraction of the normal input token price.
Savings: Anthropic charges 90% less for cache reads (writing to the cache carries a modest premium over normal input tokens). OpenAI charges approximately 50% less for cached input. For applications where the system prompt is a large fraction of total tokens (RAG applications, document analysis), this can reduce total cost by 40-60%.
How to implement: Structure your prompts so static content comes before dynamic content. The static system prompt goes first (this gets cached). The dynamic user query comes last (this is always charged at full price).
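As a minimal sketch of that structure, here's what it looks like with Anthropic's Messages API, where the cacheable prefix is marked explicitly. (OpenAI caches shared prefixes automatically once they exceed roughly 1,024 tokens, so there the same static-first ordering is all you need.)

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Long, stable instructions and few-shot examples: the cacheable prefix.
STATIC_SYSTEM_PROMPT = "You are a document-analysis assistant. <long instructions and examples>"

def ask(user_query: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=500,
        system=[{
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            # Cacheable prefix: later calls sharing this exact prefix are
            # billed at the much cheaper cache-read rate.
            "cache_control": {"type": "ephemeral"},
        }],
        # Dynamic content goes last and is always billed at full price.
        messages=[{"role": "user", "content": user_query}],
    )
    return response.content[0].text
```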
This is the highest-ROI optimization available. Implement it first.
Technique 2: Smaller Model Routing
Not every request needs your most capable model. A request to extract structured data from a form, classify sentiment, or generate a simple acknowledgment does not need GPT-4o or Claude 3.5 Sonnet.
The routing pattern: Build a classifier that routes requests to the appropriate model based on complexity. Simple requests go to a cheaper, faster model (GPT-4o-mini, Claude 3 Haiku). Complex requests that genuinely require frontier capability go to the expensive model.
Cost difference: GPT-4o-mini costs roughly one-fifteenth as much per token as GPT-4o. If you can route 70% of your traffic to the cheaper model without quality loss, you've reduced inference costs by about 65%.
Implementation approach: Start by manually labeling a sample of your production traffic as "simple" or "complex." Train a classifier (can be a fine-tuned small model or even rule-based logic) on those labels. Deploy the router, measure quality on each routing path, and iterate.
The classifier doesn't need to be perfect; a few percent of misrouted requests is acceptable. Even correctly routing only half of your traffic to the cheaper model produces meaningful savings.
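A minimal sketch of the pattern, with hypothetical task labels and thresholds standing in for the trained classifier described above:

```python
from openai import OpenAI

client = OpenAI()

CHEAP_MODEL = "gpt-4o-mini"
FRONTIER_MODEL = "gpt-4o"

# Hypothetical task labels your application already assigns to requests.
SIMPLE_TASKS = {"sentiment", "field_extraction", "acknowledgment"}

def route_model(task_type: str, prompt: str) -> str:
    # Rule-based stand-in for a trained classifier: known-simple tasks
    # and short prompts go to the cheap model; everything else doesn't.
    if task_type in SIMPLE_TASKS or len(prompt) < 200:
        return CHEAP_MODEL
    return FRONTIER_MODEL

def complete(task_type: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=route_model(task_type, prompt),
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```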
Technique 3: Batching Requests
Most LLM providers offer batch APIs at significant discounts: both Anthropic's and OpenAI's batch APIs are 50% cheaper than their synchronous equivalents. In exchange, results are returned asynchronously, typically within 24 hours.
When to use: Any use case that doesn't require real-time responses. Document processing, content generation, data enrichment, analysis tasks. If a user submits a request and getting the result in an hour is acceptable, use the batch API.
Implementation: Queue requests to a batch processor rather than calling the API directly. The batch processor accumulates requests, submits them to the batch API, and notifies users or triggers downstream processes when results are ready.
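Sketched below for OpenAI's Batch API (Anthropic's Message Batches API follows the same accumulate-submit-poll shape); the queueing and notification logic is whatever your architecture already provides:

```python
import json
from openai import OpenAI

client = OpenAI()

def submit_batch(requests: list[dict]) -> str:
    # One JSONL line per request, in the documented Batch API format.
    with open("batch_input.jsonl", "w") as f:
        for i, req in enumerate(requests):
            f.write(json.dumps({
                "custom_id": f"req-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {"model": "gpt-4o-mini", "messages": req["messages"]},
            }) + "\n")

    batch_file = client.files.create(
        file=open("batch_input.jsonl", "rb"), purpose="batch"
    )
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",  # asynchronous, billed at half price
    )
    return batch.id  # poll client.batches.retrieve(batch.id) until "completed"
```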
For background processing use cases, this is one of the simplest and most impactful cost reductions available.
Technique 4: Reducing Context Window
Context window tokens are expensive, and many applications include far more context than the model needs to answer the question.
Common waste sources:
- Entire conversation history, even when earlier messages are no longer relevant
- Full documents when only specific sections are needed
- Verbose system prompts with redundant instructions
- Few-shot examples that could be consolidated or removed
Strategies:
Conversation pruning: Rather than including the full conversation history, include the last N turns plus a summary of earlier context. The model gets the essential context; you pay for fewer tokens.
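A sketch of the pruning step, assuming the summary is produced separately (for example, by a cheap model whenever old turns are dropped):

```python
def prune_history(messages: list[dict], summary: str, keep_last: int = 6) -> list[dict]:
    # Keep the most recent turns verbatim and replace everything older
    # with a one-message summary of the earlier conversation.
    if len(messages) <= keep_last:
        return messages
    summary_msg = {
        "role": "system",
        "content": f"Summary of the earlier conversation: {summary}",
    }
    return [summary_msg] + messages[-keep_last:]
```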
Retrieval over full context: Rather than including an entire document in the prompt, use RAG to retrieve only the relevant sections. Often produces better answers at lower cost.
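A minimal retrieval sketch using OpenAI embeddings; chunking strategy and index storage are up to you and omitted here:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_k_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    # In production, embed and index chunks once at upload time, not per query.
    chunk_vecs = embed(chunks)
    query_vec = embed([question])[0]
    # OpenAI embeddings are unit-normalized, so a dot product is cosine similarity.
    scores = chunk_vecs @ query_vec
    best = np.argsort(scores)[-k:][::-1]
    return [chunks[i] for i in best]
```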
System prompt optimization: Audit your system prompts for redundancy. Measure whether removing specific instructions changes output quality. Often 30-40% of system prompt content can be removed without measurable quality impact.
Technique 5: Output Length Controls
Output tokens are expensive. For tasks where output length can be bounded, set explicit max_tokens limits.
If your application needs a summary in 3 sentences, set max_tokens to approximately 150. If it needs a yes/no classification with brief reasoning, set it to 100. If you're generating a 2-paragraph response, set it accordingly.
Without explicit output length controls, models often produce longer outputs than necessary. This costs you money and often doesn't improve quality.
For structured output: Ask for JSON with specific fields rather than free-form text followed by parsing. A structured JSON response with 5 fields is typically shorter than a narrative response covering the same information, and is easier to parse.
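Both controls together, sketched against the OpenAI API (the JSON field names are illustrative):

```python
from openai import OpenAI

client = OpenAI()

def summarize(document_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=150,  # hard cap: a 3-sentence summary rarely needs more
        response_format={"type": "json_object"},  # compact, parseable output
        messages=[
            {
                "role": "system",
                "content": "Summarize the user's document. Respond as JSON "
                           "with keys 'summary' (at most 3 sentences) and "
                           "'key_points' (a list of short strings).",
            },
            {"role": "user", "content": document_text},
        ],
    )
    return response.choices[0].message.content  # a JSON string to parse downstream
```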
Technique 6: Quantized Open Models for Non-Critical Paths
For parts of your application that don't require frontier model quality — classification, simple extraction, routing decisions, moderation checks — self-hosted quantized open models can reduce cost by 95%+ compared to commercial APIs.
A quantized Llama 3 8B model running on a single A10 GPU costs approximately $0.30-0.50/hour to operate. At 100 requests/minute (6,000/hour) with average 200-token outputs, that works out to roughly $0.00005-0.0001 per request, compared to $0.002-0.005 for equivalent commercial API calls.
The infrastructure investment is non-trivial: you need to manage GPU instances, handle scaling, and maintain the model serving stack. But for high-volume, non-critical inference, the economics become very favorable.
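For a flavor of what the serving side can look like, here's a minimal sketch with llama-cpp-python and a 4-bit GGUF quantization (the model path is hypothetical; a production deployment would sit behind a real serving stack such as vLLM):

```python
from llama_cpp import Llama

# Hypothetical local path to a 4-bit GGUF quantization of Llama 3 8B Instruct;
# any equivalent quantized checkpoint works the same way.
llm = Llama(model_path="./llama-3-8b-instruct-q4_k_m.gguf", n_gpu_layers=-1)

def classify_sentiment(text: str) -> str:
    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "Classify the sentiment as positive, "
                                          "negative, or neutral. Reply with one word."},
            {"role": "user", "content": text},
        ],
        max_tokens=5,  # tiny output: this is a classification, not generation
    )
    return out["choices"][0]["message"]["content"].strip()
```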
A Real Cost Reduction Example
Starting point: a document Q&A application. Users upload documents, ask questions. Average prompt: 8,000 tokens (document + question). Average response: 300 tokens. Volume: 50,000 requests/day. Model: GPT-4o. Daily cost: ~$3,500.
After optimizations:
- Prompt caching for static system prompt (saves 15%): $2,975
- RAG instead of full document in context — reduces average prompt from 8,000 to 2,000 tokens (roughly a third off the running total): $1,987
- Route 40% of simple queries to GPT-4o-mini (saves 60% on those routes): $1,511
- Enable batch API for non-urgent queries (30% of traffic, 50% cheaper): $1,284
Total daily cost: $1,284, down from $3,500. That's roughly a 63% reduction without touching quality on the main use-case path.
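To adapt the cascade to your own numbers, the arithmetic is just compounding multiplicative savings:

```python
# Compounding the savings above. Percentages are this example's, not universal;
# small rounding differences vs. the bullets are expected.
baseline = 3_500.0  # $/day

steps = [
    ("prompt caching on the static system prompt", 0.15),
    ("RAG instead of full documents in context",   0.332),
    ("routing 40% of queries to a small model",    0.40 * 0.60),
    ("batch API for 30% of traffic at half price", 0.30 * 0.50),
]

cost = baseline
for name, saving in steps:
    cost *= 1 - saving
    print(f"after {name}: ${cost:,.0f}/day")
# final line prints ~$1,284/day: roughly a 63% reduction from baseline
```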
The Optimization Sequence
Do these in order:
- Prompt caching (immediate, zero quality risk)
- Output length controls (immediate, very low quality risk)
- Context reduction and RAG (moderate effort, positive quality impact)
- Smaller model routing (moderate effort, requires quality validation)
- Batch API for async use cases (lower effort if architecture supports it)
- Self-hosted open models for non-critical paths (significant infrastructure investment, highest savings)
Most applications get 60-70% of their potential savings from steps 1-4. Steps 5-6 are for teams where LLM cost is a genuine business constraint at scale.