The concurrency problem nobody mentions
The benchmark everyone quotes is a single-stream number. A 128GB box running a 235B model at ~11 tokens/second is one person’s experience with nobody else on it. Token generation is a shared resource: two employees hitting it at once get roughly 5–6 tok/s each; at ten it’s about 1 tok/s each, which is unusable. The ceiling doesn’t move much with money you can actually spend: even a 16× H100 cluster running a 397B model sustains maybe 50–100 tok/s of total throughput, which across 20 concurrent users is roughly 2.5–5 tok/s each.
Cloud APIs absorb hundreds of concurrent requests at full per-user speed because they run thousands of GPUs behind a load balancer: your requests are spread across a fleet, not queued on one box. That gap is structural, not something you fix by buying better hardware at SMB price points. A single box is a single-lane road; the cloud is a highway. For a tool 50 people share during the workday, that’s decisive.
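The per-user arithmetic above can be sketched in a few lines. This is a minimal model, assuming aggregate throughput stays roughly fixed and is split evenly among active users; real inference servers batch requests, so the aggregate can shift with load. The function name `per_user_tok_s` is illustrative, and the throughput figures are the ones quoted in the text.

```python
def per_user_tok_s(aggregate_tok_s: float, concurrent_users: int) -> float:
    """Even-split model: aggregate throughput divided equally among active users."""
    if concurrent_users < 1:
        raise ValueError("concurrent_users must be at least 1")
    return aggregate_tok_s / concurrent_users


# Throughput figures quoted in the text; the even split itself is an assumption.
BOX_TOK_S = 11.0               # ~11 tok/s on the 128GB box
CLUSTER_TOK_S = (50.0, 100.0)  # 16x H100 aggregate range

if __name__ == "__main__":
    for users in (1, 2, 10):
        print(f"box, {users:>2} users: {per_user_tok_s(BOX_TOK_S, users):.1f} tok/s each")
    low, high = (per_user_tok_s(t, 20) for t in CLUSTER_TOK_S)
    print(f"cluster, 20 users: {low:.1f} to {high:.1f} tok/s each")
```

Plugging in your own peak concurrency, rather than headcount, gives the number that matters: it is the people generating at the same moment who share the budget.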
The accuracy gap is real, and understated
Open-weight models have closed much of the gap on benchmarks — but benchmarks measure narrow, well-defined tasks. Real enterprise work — multi-step reasoning, ambiguous instructions, edge cases, long context with retrieval — is where the frontier closed models are still materially ahead. The leaderboard number and the lived experience on messy real tasks are not the same thing.
Fine-tuning a self-hosted model on your domain data can close specific gaps — sometimes decisively, on a narrow task (see training an open model). But that requires an ML team, labeled data, and ongoing maintenance as models and needs move. Most companies don’t have that, and “we’ll fine-tune it” is doing a lot of quiet work in the hardware pitch.
The subscription math is misleading
The viral “$5,280/year in subscriptions pays for the box” calculation stacks up individual developer tools — Claude Code, ChatGPT Pro, Cursor, Gemini — and assumes one local model replaces all of them. It can’t. Those subscriptions are buying frontier model access, IDE integrations, uptime guarantees, and continuous model updates. A self-hosted quantized model at ~11 tok/s doesn’t replace a frontier model for a developer’s daily work — at best it supplements it for specific offline tasks. Cancelling the subscriptions and keeping the box is a downgrade, not a wash.
Where self-hosting genuinely wins
There are real cases — they’re just narrower than the hype, and concurrency is the tell. Self-hosting wins where per-user latency under load doesn’t matter:
- Batch workloads. Document processing, embedding pipelines, overnight classification jobs. Concurrency is irrelevant; cost-per-token is everything. This is the strongest economic case.
- Data sovereignty. Healthcare, legal, defense — where data physically cannot leave the building. The performance trade-off is accepted, not preferred.
- High-volume commodity tasks. 1B+ tokens/day on a narrow, well-defined task (entity extraction, one document type) where a fine-tuned smaller model matches frontier quality at a fraction of the cost. Our own Cost & TCO break-even math puts the crossover around 0.75–1.6B tokens/day vs. frontier API pricing — a lot of volume on one narrow task.
- Fine-tuning on proprietary data. A 70B model tuned on your domain can beat a general frontier model on your specific task — if you have the ML infrastructure to build and maintain it.
Notice what unites them: batch, sovereignty, one-narrow-task-at-scale. None is “replace the company’s general-purpose AI assistant with a box.”
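The break-even idea behind the high-volume case can be sketched as follows. This is a simplified model, not the Cost & TCO calculation itself: it assumes a fixed daily self-hosting cost (amortized hardware, power, operations) and flat per-token prices. The function name, parameters, and the dollar figures in the usage lines are hypothetical placeholders chosen only to show the shape of the math.

```python
def break_even_tokens_per_day(
    self_host_cost_per_day: float,
    api_price_per_million: float,
    self_host_marginal_per_million: float = 0.0,
) -> float:
    """Daily token volume at which self-hosting costs the same as the API."""
    saving_per_million = api_price_per_million - self_host_marginal_per_million
    if saving_per_million <= 0:
        raise ValueError("API price must exceed the self-hosted marginal cost per token")
    # Fixed daily cost divided by what each million tokens saves, converted to tokens.
    return self_host_cost_per_day / saving_per_million * 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs: $3,000/day all-in self-hosting, $3 per million API tokens.
    volume = break_even_tokens_per_day(3000.0, 3.0)
    print(f"break-even: {volume / 1e9:.2f}B tokens/day")
```

Below the break-even volume the API is cheaper; above it the fixed cost is spread thin enough for self-hosting to win, provided the smaller model really does match quality on the task.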
The honest frame for enterprise private AI
For a company weighing private AI, the self-hosting argument is almost entirely about sovereignty, not economics or performance. Choosing a private deployment over a cloud enterprise offering is buying one guarantee: your data never touches a third-party model. That’s a compliance and trust argument, and a strong one.
Sell it as “faster and cheaper than the cloud” and you lose the conversation almost every time, because at typical enterprise scale it isn’t. The pitch that holds up is: the same quality as cloud APIs (e.g. Claude via Bedrock), data isolation enforced at the infrastructure layer, and no model provider ever seeing your prompts. That’s a real, differentiated product. The local-hardware angle (a $2K box on a desk) is a genuinely different market: individual developers and researchers, not companies with 50+ employees sharing the tool all day.
So is self-hosting a trap?
No. It’s a tool with a specific shape. The mistake is the framing, not the hardware. Run the numbers for your actual case: how many people hit it at once, how hard are the tasks, and is the real driver cost, batch throughput, or data residency? If the honest answer is “a shared assistant for the whole team, on hard varied work, and we want it faster and cheaper,” the cloud wins, and the routing + caching levers are where your savings actually live. If it’s “a nightly batch job” or “regulated data that can’t leave,” self-host, eyes open about the trade-offs.
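The three questions above can be written down as a coarse decision rule. This is only a sketch of the article's framing, not a sizing tool: the `Workload` fields, the `recommend` function, and the cutoff of two or more simultaneous users for "shared" are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    peak_concurrent_users: int  # how many people hit it at once
    hard_varied_tasks: bool     # multi-step, ambiguous, long-context work
    driver: str                 # "cost", "batch", or "residency"


def recommend(w: Workload) -> str:
    """Map the three questions in the text to a coarse recommendation."""
    if w.driver not in {"cost", "batch", "residency"}:
        raise ValueError(f"unknown driver: {w.driver!r}")
    if w.driver in {"batch", "residency"}:
        return "self-host, eyes open about the trade-offs"
    # Cost-driven from here on. Treating 2+ simultaneous users as "shared"
    # is an assumed cutoff, not a figure from the article.
    if w.peak_concurrent_users >= 2 and w.hard_varied_tasks:
        return "cloud, with routing and caching"
    return "run the break-even numbers for this case"


if __name__ == "__main__":
    print(recommend(Workload(50, True, "cost")))
    print(recommend(Workload(1, False, "batch")))
```

Anything that falls through to the last branch is a case where the answer depends on volume and task difficulty rather than on the framing alone.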
- Token Tax Playbook — the self-hosting lever this tempers; route + cache first.
- Open-source · Cost & TCO — the actual break-even volumes + hardware tiers.
- Serving a local model — what production self-hosting takes beyond the endpoint.
- Using Chinese models safely — the other reason teams self-host: data isolation.
- Case studies — where the measured savings actually came from (routing, not hardware).