Model Alloys: Combining AI Models for Better Product Outcomes

The Single-Model Fallacy

Most teams building their first AI feature pick one model and use it for everything. GPT-4 for the main feature, and GPT-4 for everything else. Or Claude for reasoning, and Claude for classification, and Claude for summarisation.

This works for prototypes. It breaks down in production.

Different tasks require different model characteristics. A task that needs blazing speed and low cost has different optimal model choices than a task that requires deep reasoning. A task that needs consistent structured output has different requirements than a task that requires nuanced language generation. Using one model for all of these is like using one tool for every job — possible, but not optimal.

The Case for Model Composition

Model composition — using different models for different tasks within the same system — lets you optimise each part of the system for its actual requirements. The common patterns:

Fast/cheap for triage, capable for analysis. Use a smaller, faster model to classify or route incoming requests, then escalate to a more capable model only for the cases that need it. 90% of the cost savings come from keeping the expensive model out of the easy cases.
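A minimal sketch of this triage pattern, assuming hypothetical `call_small_model` and `call_large_model` stand-ins for real API clients:

```python
def call_small_model(prompt: str) -> dict:
    # Placeholder for a fast, cheap classifier model: returns a difficulty
    # label and a confidence score. Real code would call a provider API.
    label = "easy" if len(prompt) < 200 else "hard"
    return {"label": label, "confidence": 0.9}

def call_large_model(prompt: str) -> str:
    # Placeholder for a slower, more capable model.
    return f"[deep analysis of: {prompt[:40]}]"

def handle_request(prompt: str, escalation_threshold: float = 0.8) -> str:
    triage = call_small_model(prompt)
    # Escalate only when the cheap model is unsure or flags the case as hard;
    # everything else never touches the expensive model.
    if triage["label"] == "hard" or triage["confidence"] < escalation_threshold:
        return call_large_model(prompt)
    return f"[quick answer to: {prompt[:40]}]"
```

The escalation threshold is the main tuning knob: lower it and more traffic stays on the cheap model, at the cost of occasionally underserving a hard case.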

Specialised models for specialised tasks. Embedding models for retrieval, vision models for image analysis, speech models for audio processing, reasoning models for multi-step logic. Each is better at its specific task than a general-purpose model.

Ensemble for reliability. For high-stakes outputs, run multiple models in parallel and compare results. Disagreement triggers human review. Agreement increases confidence. This is particularly valuable in healthcare, legal, and financial applications.
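The agreement check at the heart of this pattern can be sketched as a majority vote; the `2/3` threshold here is an illustrative choice, not a recommendation:

```python
from collections import Counter

def ensemble_decision(outputs: list[str], min_agreement: float = 2 / 3) -> dict:
    """Compare outputs from models run in parallel upstream."""
    counts = Counter(outputs)
    answer, votes = counts.most_common(1)[0]
    agreement = votes / len(outputs)
    if agreement >= min_agreement:
        # Models agree: return the majority answer with its agreement score.
        return {"answer": answer, "needs_review": False, "agreement": agreement}
    # Models disagree: withhold the answer and escalate to human review.
    return {"answer": None, "needs_review": True, "agreement": agreement}
```

For free-text outputs, exact string matching is too brittle; in practice the comparison would be a semantic one (e.g. embedding similarity), but the escalation logic stays the same.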

The Engineering Complexity Trade-off

Model composition adds engineering complexity. You're now managing multiple APIs, multiple rate limits, multiple latency characteristics, and multiple failure modes. You need routing logic that decides which model to use for each request. You need fallback logic when a model is unavailable. You need evaluation infrastructure that can compare outputs across models.
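The fallback piece of that logic can be sketched as an ordered provider chain; the provider names and callables here are hypothetical:

```python
def call_with_fallback(prompt: str, providers: list) -> str:
    """Try each (name, callable) provider in order; raise only if all fail.

    In production each callable would wrap a real API client with its own
    timeout and retry policy.
    """
    errors = []
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as exc:  # narrow this to provider-specific errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Routing and evaluation follow the same shape: thin, explicit layers around the model calls, so each concern can be tested and swapped independently.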

This complexity is worth it for production systems at scale. It's often not worth it for early-stage products where simplicity and iteration speed matter more than optimisation.

The rule of thumb: use a single model until you have evidence that the bottleneck is model capability or model cost. Then introduce composition selectively, starting with the highest-impact bottleneck.

What We've Learned From Production Systems

Building AI systems across dozens of client products, we've found a few consistent patterns:

  • Cost optimisation almost always pays back its engineering investment within 3 months when you're running at meaningful scale
  • Fallback logic is chronically underinvested. Most teams plan as if model availability is guaranteed. It isn't.
  • Evaluation infrastructure becomes the constraint, not model selection. The bottleneck is usually knowing whether the system is working, not which model it's using.
  • Model improvements make composition decisions obsolete regularly. Build the composition layer to be easily reconfigurable, not hardcoded.
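One way to keep the composition layer reconfigurable, sketched here with made-up task and model names, is to hold the task-to-model mapping in data rather than code:

```python
# Task-to-model routing kept as data, so swapping a model after the next
# release is a config change, not a code change. All names are illustrative.
ROUTES = {
    "triage": "small-fast-model-v1",
    "analysis": "large-capable-model-v2",
    "retrieval": "embedding-model-v3",
}

def model_for(task: str, routes: dict = ROUTES) -> str:
    # Fail loudly on unrouted tasks instead of silently defaulting.
    try:
        return routes[task]
    except KeyError:
        raise ValueError(f"no model configured for task {task!r}") from None
```

In a real system this mapping would live in a config file or feature-flag service, so model swaps can ship without a deploy.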

The next leap in AI product quality comes not from better individual models, but from better orchestration of existing ones. That's where the current frontier is.
