
How to Pick LLM Models and Providers for Production

A practical guide to selecting LLM models and providers based on your product's specific needs and constraints.

Self-host vs API

First question: self-host or use an API? The short answer: in most cases, using an API is better. Self-host only when:

  1. You must keep all data in-house for privacy, regulations, or contracts. As enterprises increasingly adopt LLMs, many are required to keep sensitive data within their infrastructure.
  2. You need strong control over the model via fine-tuning. Often you can get very far with good prompts, RAG, or agentic search, and rarely need fine-tuning.
  3. Your workload is fixed, like crunching thousands of docs for an internal task, so you can keep your own GPUs busy, then turn them off when the job is done. In that narrow case, it can be cheaper.

If you do not fit the 3 cases above, it is usually better to start with an API. Why:

  1. Pricing: you pay per token. With self-hosting you also pay for idle time, and there may be a lot of it (see the back-of-envelope sketch after this list).
  2. Quality and speed: top models and faster hardware from 3rd parties are hard to match on your own.
  3. Reliability: less ops, fewer on-call headaches, and easy scaling.
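
To make the idle-time point concrete, here is a rough sketch of the comparison. Every number is an assumption for illustration only; plug in your own volume, throughput, and GPU pricing.

```python
# Back-of-envelope: API (pay per token) vs self-host (pay for the GPU all day).
# Every number below is an assumption for illustration only.

API_PRICE_PER_M_OUTPUT = 1.0   # $/M output tokens (roughly Tier 2 pricing, see below)
GPU_HOURLY = 2.0               # $/hour for a rented GPU (assumed)
TOKENS_PER_DAY = 2_000_000     # daily output volume (assumed)
GPU_THROUGHPUT = 200           # tokens/sec your self-hosted setup sustains (assumed)

api_cost = TOKENS_PER_DAY / 1e6 * API_PRICE_PER_M_OUTPUT
busy_hours = TOKENS_PER_DAY / GPU_THROUGHPUT / 3600
self_host_cost = GPU_HOURLY * 24  # the GPU bills for idle hours too

print(f"API: ${api_cost:.2f}/day")
print(f"Self-host: ${self_host_cost:.2f}/day, busy only {busy_hours:.1f}h of 24h")
```

With these assumed numbers the GPU sits idle for most of the day, which is exactly the cost you avoid by paying per token.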

In the next sections I will show how to choose the right models and the right providers.

LLM Brokers

Say you have chosen to use APIs. For ideation and prototyping, my suggestion is OpenRouter. Think of it as a broker for model APIs. You get a wide list of models, and one model can have several providers. You can swap fast. One API, one bill. Good for exploration.
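
For example, here is a minimal sketch of calling OpenRouter through the OpenAI-compatible Python client. The API key and model id are placeholders; check OpenRouter's model list for current identifiers.

```python
# A minimal sketch of calling OpenRouter through the OpenAI-compatible client.
# The API key and model id are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

resp = client.chat.completions.create(
    model="deepseek/deepseek-chat",  # swap models freely; one API, one bill
    messages=[{"role": "user", "content": "Give me three names for a note-taking app."}],
)
print(resp.choices[0].message.content)
```

Switching models during exploration is just a change to the model string, which is the whole point of using a broker at this stage.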

When you go to production, decide based on the job. For internal tools with lighter SLA needs, a single provider can be fine and is simple to manage. If you serve many users, it is better to use more than one provider to keep uptime steady, with a middle layer that checks health and routes traffic.

Two popular options for that middle layer:

  • A third party like OpenRouter. Straightforward, but usually with a small surcharge (see their fees FAQ).
  • Host your own gateway, like LiteLLM. Needs some setup, but gives you full control over data and routing rules (see the sketch below).

Both add a bit of latency, usually tens of milliseconds, since there is one more hop.
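
If you go the self-hosted gateway route, a minimal sketch with LiteLLM's Router looks roughly like this. The model names, keys, and retry policy are placeholders; check the LiteLLM docs for the current configuration schema.

```python
# A minimal sketch of a self-hosted routing layer with LiteLLM's Router.
# Model names, keys, and retry policy are placeholders.
from litellm import Router

router = Router(
    model_list=[
        {  # two providers serving the same logical model
            "model_name": "prod-chat",
            "litellm_params": {"model": "openai/gpt-4o", "api_key": "OPENAI_KEY"},
        },
        {
            "model_name": "prod-chat",
            "litellm_params": {"model": "anthropic/claude-sonnet-4", "api_key": "ANTHROPIC_KEY"},
        },
    ],
    num_retries=2,  # retry before giving up on a deployment
)

resp = router.completion(
    model="prod-chat",  # the router picks a healthy deployment behind this name
    messages=[{"role": "user", "content": "Health check: reply with OK."}],
)
print(resp.choices[0].message.content)
```

The idea is that your application only ever talks to the logical model name, and the gateway decides which provider actually serves the request.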

LLM Providers

Now, picking providers. You can use closed-source or open-source models, and for each open model there can be many providers. Some criteria to look at:

  • Quality: if you want frontier quality, closed models like GPT, Gemini, Claude tend to be strong, at a higher price. Top open models are cheaper and often only a little behind on hard tasks.
  • Cost: if price matters most, DeepInfra is very cheap, though I find its speed and availability are not the best.
  • Speed: if you want raw throughput, Groq or Cerebras are great. Their custom chips can reach thousands of tokens per second on some models. This follows the same pattern I discussed about specialized inference hardware: companies building ASICs specifically for inference are achieving massive performance gains.
  • Latency: another metric is time to first token (TTFT), how long until the first token appears. It matters if your app needs real-time responses. I don’t see any particular provider consistently superior here, so test in your own flow (see the sketch after this list).
  • Privacy: avoid providers that log prompts by default. Bigger names can be a safer bet, but you should still read the data policy, for example OpenAI data usage and OpenRouter’s notes on privacy and logging.
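
To compare TTFT across providers, a small streaming sketch is enough. The base_url, API key, and model id below are placeholders for whichever provider you are testing.

```python
# A minimal sketch for measuring time to first token (TTFT) with a streaming call.
# The base_url, API key, and model id are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="google/gemini-2.5-flash",  # placeholder model id
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - start:.2f}s")
        break
```

Run it a few times per provider, ideally with prompts from your real traffic, since TTFT varies with load and prompt length.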

Models

I divide them into 3 tiers, based on cost per million output tokens. The price of a million input tokens is usually between 1/4 and 1/2 of the output price. This is a snapshot as of August 2025. New models come out every week. Check the LMArena leaderboard for current model performance rankings.

Tier 1 (around $10/M to $15/M): Use this tier when you want the best quality. Four candidates are GPT-5, Claude Sonnet 4, Gemini Pro 2.5, and Grok 4. This tier should mainly be for your paid users, as the cost can be very high. If you choose this tier, try all of them. If you want similar quality but cheaper, try DeepSeek V3.1/R1 or Kimi K2, usually around $3/M.

Tier 2 (around $0.6/M to $1/M): This tier is a great balance between price and quality. It can handle moderately complex tasks, and is very versatile. I think Gemini Flash 2.5 is the de facto choice here for quality and price. You should also check out Qwen3-235B or the newcomer GPT-oss-120B.

Tier 3 (around $0.03/M to $0.1/M): This tier is so cheap that you almost never have to worry about cost. The model size is usually around 4B–8B parameters. You should try small members in the Qwen, Gemma and Llama family. They are very good for the price. Note that their capability is limited. They are suitable for e.g. summarizing meeting notes or synthesizing answers from internet searches. These smaller models still excel at handling messy, unstructured input, making them valuable for basic automation tasks. If you want deep reasoning or to recall facts, then go up to Tier 2.

Why do I divide by price? Because model cost is likely one of the biggest expenses of an LLM product. If you are not careful, the cost may be higher than the revenue, which is not sustainable, as I covered in my LLM app strategy guide. Set the right price first, then you will know which tiers you should aim for. The price between tiers jumps by around 10x, which is very stark and gives you a lot of room to maneuver (see the sketch below).
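
As a sanity check, here is a quick per-request estimate using the tier prices quoted above. The token counts are assumptions; plug in numbers from your own traces.

```python
# Back-of-envelope cost per request, using the tier prices quoted above.
# The token counts are assumptions; plug in numbers from your own traces.

def cost_per_request(input_tokens, output_tokens, out_price_per_m, in_ratio=0.25):
    # Input tokens are priced at roughly 1/4 to 1/2 of the output price.
    in_price_per_m = out_price_per_m * in_ratio
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1e6

# Example: a RAG-style request with 3,000 input tokens and 800 output tokens.
for tier, out_price in [("Tier 1", 12.0), ("Tier 2", 0.8), ("Tier 3", 0.05)]:
    print(tier, f"${cost_per_request(3_000, 800, out_price):.4f} per request")
```

Multiply the per-request cost by your expected request volume per user and compare it against what that user pays you; that comparison is what decides the tier.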

If you want extended reasoning capability, all frontier models now offer that, together with Gemini Flash 2.5, Gemma-3 and the Qwen3 family, and some others. Note that you will still pay for the thinking tokens.

Key Takeaways

  • Start with APIs: Self-host only when you have strict data requirements or fixed workloads. APIs give you flexibility and lower operational burden.
  • Match tiers to revenue: Pick your model tier based on what your customers pay. Don’t burn money on Tier 1 models for free users.
  • Use brokers for prototyping: OpenRouter or similar services let you test many models quickly with one API.
  • Multi-provider for production: Single points of failure kill user trust. Route between providers for uptime.
  • Watch the 10x jumps: Price differences between tiers are massive. Small optimizations in prompt design can drop you a tier and save huge costs.

I’ll have another post about tips and tricks to save on your LLM bill while still ensuring quality.
