
How to Pick LLM Models and Providers for Production

A practical guide to selecting LLM models and providers based on your product's specific needs and constraints.

Self-host vs API

First question: self-host or use an API? The short answer: in most cases, using an API is better. Self-host only when:

  1. You must keep all data in-house for privacy, regulations, or contracts. As enterprises increasingly adopt LLMs, many are required to keep sensitive data within their infrastructure.
  2. You need strong control over the model via fine-tuning. Often you can get very far with good prompts, RAG, or agentic search, and rarely need fine-tuning.
  3. Your workload is fixed, like crunching thousands of docs for an internal task, so you can keep your own GPUs busy, then turn them off when the job is done. In that narrow case, it can be cheaper.

If you do not fit the 3 cases above, it is usually better to start with an API. Why:

  1. Pricing: you pay per token. With self-hosting you also pay for idle time, and there may be a lot of it (see the back-of-envelope sketch after this list).
  2. Quality and speed: top models and faster hardware from 3rd parties are hard to match on your own.
  3. Reliability: less ops, fewer on-call headaches, and easy scaling.
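
To make the idle-time point concrete, here is a rough sketch of the comparison. Every number is an assumption for illustration only; plug in your own volume, throughput, and GPU pricing.

```python
# Back-of-envelope: API (pay per token) vs self-host (pay for the GPU all day).
# Every number below is an assumption for illustration only.

API_PRICE_PER_M_OUTPUT = 1.0   # $/M output tokens (roughly Tier 2 pricing, see below)
GPU_HOURLY = 2.0               # $/hour for a rented GPU (assumed)
TOKENS_PER_DAY = 2_000_000     # daily output volume (assumed)
GPU_THROUGHPUT = 200           # tokens/sec your self-hosted setup sustains (assumed)

api_cost = TOKENS_PER_DAY / 1e6 * API_PRICE_PER_M_OUTPUT
busy_hours = TOKENS_PER_DAY / GPU_THROUGHPUT / 3600
self_host_cost = GPU_HOURLY * 24  # the GPU bills for idle hours too

print(f"API: ${api_cost:.2f}/day")
print(f"Self-host: ${self_host_cost:.2f}/day, busy only {busy_hours:.1f}h of 24h")
```

With these assumed numbers the GPU sits idle for most of the day, which is exactly the cost you avoid by paying per token.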

In the next sections I will show how to choose the right models and the right providers.

LLM Brokers

Say you have chosen to use APIs. For ideation and prototyping, my suggestion is OpenRouter. Think of it as a broker for model APIs. You get a wide list of models, and one model can have several providers. You can swap fast. One API, one bill. Good for exploration.
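
For example, here is a minimal sketch of calling OpenRouter through the OpenAI-compatible Python client. The API key and model id are placeholders; check OpenRouter's model list for current identifiers.

```python
# A minimal sketch of calling OpenRouter through the OpenAI-compatible client.
# The API key and model id are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

resp = client.chat.completions.create(
    model="deepseek/deepseek-chat",  # swap models freely; one API, one bill
    messages=[{"role": "user", "content": "Give me three names for a note-taking app."}],
)
print(resp.choices[0].message.content)
```

Switching models during exploration is just a change to the model string, which is the whole point of using a broker at this stage.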

When you go to production, decide based on the job. For internal tools with lighter SLA needs, a single provider can be fine and is simple to manage. If you serve many users, it is better to use more than one provider to keep uptime steady, with a middle layer that checks health and routes traffic.

Two popular options for that middle layer:

  • A third party like OpenRouter. Straightforward, but usually with a small surcharge (see their fees FAQ).
  • Host your own gateway, like LiteLLM. Needs some setup, but gives you full control over data and routing rules (see the sketch below).

Both add a bit of latency, usually tens of milliseconds, since there is one more hop.
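
If you go the self-hosted gateway route, a minimal sketch with LiteLLM's Router looks roughly like this. The model names, keys, and retry policy are placeholders; check the LiteLLM docs for the current configuration schema.

```python
# A minimal sketch of a self-hosted routing layer with LiteLLM's Router.
# Model names, keys, and retry policy are placeholders.
from litellm import Router

router = Router(
    model_list=[
        {  # two providers serving the same logical model
            "model_name": "prod-chat",
            "litellm_params": {"model": "openai/gpt-4o", "api_key": "OPENAI_KEY"},
        },
        {
            "model_name": "prod-chat",
            "litellm_params": {"model": "anthropic/claude-sonnet-4", "api_key": "ANTHROPIC_KEY"},
        },
    ],
    num_retries=2,  # retry before giving up on a deployment
)

resp = router.completion(
    model="prod-chat",  # the router picks a healthy deployment behind this name
    messages=[{"role": "user", "content": "Health check: reply with OK."}],
)
print(resp.choices[0].message.content)
```

The idea is that your application only ever talks to the logical model name, and the gateway decides which provider actually serves the request.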

LLM Providers

Now, picking providers. You can use closed-source or open-source models, and for each open model there can be many providers. Some criteria to look at:

  • Quality: if you want frontier quality, closed models like GPT, Gemini, Claude tend to be strong, at a higher price. Top open models are cheaper and often only a little behind on hard tasks.
  • Cost: if price matters most, DeepInfra is very cheap, though I find its speed and availability are not the best.
  • Speed: if you want raw throughput, Groq or Cerebras are great. Their custom chips can reach thousands of tokens per second on some models. This follows the same pattern I discussed about specialized inference hardware: companies building ASICs specifically for inference are achieving massive performance gains.
  • Latency: another metric is time to first token (TTFT), how long until the first token appears. It matters if your app needs real-time responses. I don’t see any particular provider consistently superior here, so test in your own flow (see the sketch after this list).
  • Privacy: avoid providers that log prompts by default. Bigger names can be a safer bet, but you should still read the data policy, for example OpenAI data usage and OpenRouter’s notes on privacy and logging.
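
To compare TTFT across providers, a small streaming sketch is enough. The base_url, API key, and model id below are placeholders for whichever provider you are testing.

```python
# A minimal sketch for measuring time to first token (TTFT) with a streaming call.
# The base_url, API key, and model id are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="google/gemini-2.5-flash",  # placeholder model id
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - start:.2f}s")
        break
```

Run it a few times per provider, ideally with prompts from your real traffic, since TTFT varies with load and prompt length.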

Models

I divide them into 3 tiers, based on cost per million output tokens. The price of a million input tokens is usually between 1/4 and 1/2 of the output price. This is a snapshot as of August 2025. New models come out every week. Check the LMArena leaderboard for current model performance rankings.

Tier 1 (around $10/M to $15/M): Use this tier when you want the best quality. Four candidates are GPT-5, Claude Sonnet 4, Gemini Pro 2.5, and Grok 4. This tier should mainly be for your paid users, as the cost can be very high. If you choose this tier, try all of them. If you want similar quality but cheaper, try DeepSeek V3.1/R1 or Kimi K2, usually around $3/M.

Tier 2 (around $0.6/M to $1/M): This tier is a great balance between price and quality. It can handle moderately complex tasks, and is very versatile. I think Gemini Flash 2.5 is the de facto choice here for quality and price. You should also check out Qwen3-235B or the newcomer GPT-oss-120B.

Tier 3 (around $0.03/M to $0.1/M): This tier is so cheap that you almost never have to worry about cost. The model size is usually around 4B–8B parameters. You should try small members in the Qwen, Gemma and Llama family. They are very good for the price. Note that their capability is limited. They are suitable for e.g. summarizing meeting notes or synthesizing answers from internet searches. These smaller models still excel at handling messy, unstructured input, making them valuable for basic automation tasks. If you want deep reasoning or to recall facts, then go up to Tier 2.

Why do I divide by price? Because model cost is likely one of the biggest expenses of an LLM product. If you are not careful, the cost may be higher than the revenue, which is not sustainable, as I covered in my LLM app strategy guide. Set the right price first, then you will know which tiers you should aim for. The price between tiers jumps by around 10x, which is very stark and gives you a lot of room to maneuver (see the sketch below).
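
As a sanity check, here is a quick per-request estimate using the tier prices quoted above. The token counts are assumptions; plug in numbers from your own traces.

```python
# Back-of-envelope cost per request, using the tier prices quoted above.
# The token counts are assumptions; plug in numbers from your own traces.

def cost_per_request(input_tokens, output_tokens, out_price_per_m, in_ratio=0.25):
    # Input tokens are priced at roughly 1/4 to 1/2 of the output price.
    in_price_per_m = out_price_per_m * in_ratio
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1e6

# Example: a RAG-style request with 3,000 input tokens and 800 output tokens.
for tier, out_price in [("Tier 1", 12.0), ("Tier 2", 0.8), ("Tier 3", 0.05)]:
    print(tier, f"${cost_per_request(3_000, 800, out_price):.4f} per request")
```

Multiply the per-request cost by your expected request volume per user and compare it against what that user pays you; that comparison is what decides the tier.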

If you want extended reasoning capability, all frontier models now offer that, together with Gemini Flash 2.5, Gemma-3 and the Qwen3 family, and some others. Note that you will still pay for the thinking tokens.

Key Takeaways

  • Start with APIs: Self-host only when you have strict data requirements or fixed workloads. APIs give you flexibility and lower operational burden.
  • Match tiers to revenue: Pick your model tier based on what your customers pay. Don’t burn money on Tier 1 models for free users.
  • Use brokers for prototyping: OpenRouter or similar services let you test many models quickly with one API.
  • Multi-provider for production: Single points of failure kill user trust. Route between providers for uptime.
  • Watch the 10x jumps: Price differences between tiers are massive. Small optimizations in prompt design can drop you a tier and save huge costs.

I’ll have another post about tips and tricks to save on your LLM bill while still ensuring quality.
