The thesis: GPUs are commodity, the stack is the moat
The honest version of every managed-inference pitch is the same. Anyone can rent an H100. What’s actually hard — and what you’d be rebuilding if you served the model yourself — is the layer between the GPU and the API:
- Custom kernels & decoding — speculative decoding, prefix/KV caching, quantization that doesn’t wreck quality. The difference between vanilla vLLM and a tuned stack is often 2–4× throughput on the same card.
- Cold starts & autoscaling — loading a 70B model onto a fresh GPU takes tens of seconds; the platforms that win hide that. Scale-to-zero without a cold-start penalty is genuinely hard engineering.
- Multi-cloud / VPC HA — spreading a workload across regions and clouds (or into your own VPC) for uptime, with one endpoint. This is the part a single self-hosted box structurally cannot match (the concurrency problem).
- Specialized runtimes — embeddings, transcription/diarization, and streaming TTS each want a different serving shape than chat. The platforms increasingly ship purpose-built runtimes for each.
That last point is where this category is pulling away from “just an OpenAI-compatible endpoint.” The interesting competition in 2026 isn’t chat tokens-per-second — it’s the non-chat modalities.
The field, as we read it
These split into two rough camps: software-optimization platforms (tune the stack on commodity GPUs) and custom-silicon platforms (their speed comes from their own chips). Both are legitimate; they trade off differently.
Baseten: the clearest articulation of “the stack is the moat.” Custom kernels + caching baked in, fast cold starts, managed cloud or your VPC, and specialized runtimes: BEI embeddings (claims >2× throughput / ~10% lower latency vs. alternatives), optimized transcription + diarization, streaming TTS for voice agents, and Chains for compound/multi-step inference with per-step hardware + autoscaling. Forward-deployed engineers, not just self-serve. Customer list is AI-native (Cursor, Notion, Abridge, Writer). The BEI throughput claim is exactly the kind of thing we should measure rather than repeat; see below.
A large catalog served on their own optimized kernels, with function-calling and fine-tuning. The pragmatic default when you want many open models fast behind one API.
The broadest open-model shelf plus a training/fine-tuning side. Good when the requirement is “serve a lot of different open models” rather than one tuned workload.
Aggressive per-token pricing on a broad open catalog. Often the cost floor for commodity open-model serving — worth checking against our measured pass-rate before assuming cheap == fine for the task.
Their edge is hardware, not just software: LPU / RDU / wafer-scale chips that hit tokens-per-second numbers commodity GPUs can’t. Decisive for latency-bound agents and voice. Trade-off: a narrower model menu, and you’re tied to their fleet. We benchmark Groq, SambaNova and Cerebras separately.
NVIDIA NIM: prepackaged, optimized inference microservices you run on NVIDIA GPUs, anywhere. The “bring it into your own infra” option from the silicon vendor itself; the baseline the independents position against.
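The earlier caution about assuming cheap == fine can be made concrete with a little arithmetic. This is a minimal sketch, not our benchmark methodology: it assumes a failed task is simply retried until it passes, so the expected number of attempts is 1 / pass-rate. The `Offer` type, the prices and the pass-rates are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    name: str
    usd_per_million_tokens: float  # hypothetical list price, not a real quote
    pass_rate: float               # measured fraction of tasks passed, in (0, 1]


def cost_per_pass(offer: Offer, tokens_per_task: int) -> float:
    """Expected spend per passing task.

    Assumption: failures are retried independently until one attempt passes,
    so the expected number of attempts is 1 / pass_rate.
    """
    if not 0.0 < offer.pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    cost_per_attempt = offer.usd_per_million_tokens * tokens_per_task / 1_000_000
    return cost_per_attempt / offer.pass_rate


# Hypothetical numbers, for illustration only.
cheap = Offer("cheap", usd_per_million_tokens=0.20, pass_rate=0.40)
tuned = Offer("tuned", usd_per_million_tokens=0.40, pass_rate=0.90)

for offer in (cheap, tuned):
    print(f"{offer.name}: ${cost_per_pass(offer, tokens_per_task=2000):.6f} per passing task")
```

With these made-up inputs the offer at twice the per-token price comes out cheaper per passing task, which is why price should be read next to a measured pass-rate rather than on its own.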
Where it gets interesting: the non-chat runtimes
Our own benchmarks today measure chat quality. But the differentiated products on these platforms are increasingly embeddings, transcription and TTS, and those have their own “which is actually best” question that nobody answers neutrally:
- Embeddings. Throughput and latency matter enormously for RAG ingestion at scale. Baseten’s BEI 2× claim is a concrete, testable number — a natural extension of what we already do for chat.
- Transcription + diarization. Accuracy and cost per audio-hour, plus time-to-first-token for live use.
- Streaming TTS. Time-to-first-byte is the metric for voice agents and phone — exactly the layer an agent runtime sits on top of.
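Time-to-first-byte is simple to measure once the response is exposed as an iterator of byte chunks. The sketch below assumes exactly that shape; `fake_tts_stream` is a hypothetical stand-in for a real streaming client, which would need its own adapter, and the injectable `clock` parameter is our own choice to keep the measurement testable.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def time_to_first_byte(
    stream: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Consume a chunk stream and return (ttfb_s, total_s, total_bytes).

    ttfb_s is the delay until the first non-empty chunk, or None if no
    non-empty chunk ever arrives.
    """
    start = clock()
    ttfb: Optional[float] = None
    total_bytes = 0
    for chunk in stream:
        if chunk and ttfb is None:
            ttfb = clock() - start
        total_bytes += len(chunk)
    return ttfb, clock() - start, total_bytes


def fake_tts_stream(first_delay_s: float, chunks: int, gap_s: float) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (not a real API)."""
    time.sleep(first_delay_s)      # simulated model warm-up before audio starts
    for _ in range(chunks):
        yield b"\x00" * 320        # one placeholder audio frame
        time.sleep(gap_s)


ttfb, total, size = time_to_first_byte(fake_tts_stream(0.05, chunks=3, gap_s=0.01))
print(f"ttfb={ttfb:.3f}s total={total:.3f}s bytes={size}")
```

For a voice agent the first number is the one the caller hears as silence; the total matters far less as long as audio arrives faster than it plays.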
Open thread for us. We can extend the same measured, vendor-neutral discipline we apply to chat into these modalities, starting with the Baseten BEI embeddings throughput claim. That’s a benchmark task, not a marketing line, and it’s queued.
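As a rough idea of what testing a throughput claim like that involves, here is a minimal harness sketch. It assumes each backend can be wrapped in a callable that takes a batch of strings and returns one vector per string; `embed_batch` and `fake_backend` are hypothetical names, not any vendor's API. It is also single-client and sequential, so it understates what a server can do under concurrent load.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def benchmark_embeddings(
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    texts: Sequence[str],
    batch_size: int = 32,
) -> Dict[str, float]:
    """Send texts through embed_batch in fixed-size batches and summarize the run."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    latencies: List[float] = []
    embedded = 0
    run_start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        t0 = time.perf_counter()
        vectors = embed_batch(batch)
        latencies.append(time.perf_counter() - t0)
        if len(vectors) != len(batch):
            raise RuntimeError("backend returned the wrong number of vectors")
        embedded += len(vectors)
    elapsed = time.perf_counter() - run_start
    return {
        "texts": float(embedded),
        "texts_per_s": embedded / elapsed if elapsed > 0 else 0.0,
        "p50_batch_latency_s": statistics.median(latencies) if latencies else 0.0,
        "max_batch_latency_s": max(latencies, default=0.0),
    }


def fake_backend(batch: Sequence[str]) -> List[List[float]]:
    """Hypothetical stand-in for a real embeddings client."""
    return [[float(len(text))] for text in batch]


report = benchmark_embeddings(fake_backend, ["some document chunk"] * 70)
print(report)
```

A fair comparison would run two backends over the same texts, the same batch size and the same embedding model, then compare `texts_per_s`; a "2×" claim is only meaningful once those are held fixed.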
Managed platform vs. DIY — the SecondClaw read
For private/self-hosted infrastructure (the SecondClaw question), the choice isn’t “cloud API vs. our own GPUs.” It’s a three-way split, and the managed platform usually beats raw DIY:
- DIY H100 + vLLM — maximum control and data residency, but you own the kernel tuning, cold starts, autoscaling and HA. That’s months of engineering to match what a platform ships day one. Justified only when nothing can leave your building, or at very high steady volume on one task.
- Managed platform → your VPC (single-tenant) — the middle path. Baseten and NIM can both run their optimized stack inside your own cloud. You get the optimization layer without rebuilding it; the open question for a regulated tenant is whether single-tenant isolation satisfies the compliance bar.
- Managed platform (their cloud) — fastest to ship, no GPU ops, but data transits a third party — which is the exact thing self-hosting exists to avoid. Fine for non-sensitive workloads, not the sovereignty case.
The honest verdict mirrors the self-hosting one: a managed platform replicating the optimization layer you’d struggle to build is the right default — unless the driver is strict data sovereignty, in which case you pay for VPC single-tenant or true DIY and accept the cost.
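The "very high steady volume" condition for DIY can be put into numbers with a simple break-even calculation. This is a sketch under strong assumptions: DIY is modeled as a fixed monthly cost (GPU rental plus engineering time), managed serving as pure per-token pricing, and every figure below is a hypothetical placeholder rather than a quote from any provider.

```python
HOURS_PER_MONTH = 730.0  # average hours in a month


def diy_fixed_usd_per_month(gpu_count: int, usd_per_gpu_hour: float,
                            engineering_usd_per_month: float) -> float:
    """Fixed monthly cost of an always-on box: GPU rental plus the people who run it."""
    return gpu_count * usd_per_gpu_hour * HOURS_PER_MONTH + engineering_usd_per_month


def break_even_tokens_per_month(fixed_usd_per_month: float,
                                managed_usd_per_million_tokens: float) -> float:
    """Monthly token volume above which the fixed-cost box undercuts per-token pricing."""
    if managed_usd_per_million_tokens <= 0:
        raise ValueError("managed price must be positive")
    return fixed_usd_per_month / managed_usd_per_million_tokens * 1_000_000


def sustained_tokens_per_second(tokens_per_month: float) -> float:
    """Average rate the box must hold around the clock to serve that volume."""
    return tokens_per_month / (HOURS_PER_MONTH * 3600.0)


# Hypothetical inputs, for illustration only.
fixed = diy_fixed_usd_per_month(gpu_count=1, usd_per_gpu_hour=2.50,
                                engineering_usd_per_month=8000.0)
volume = break_even_tokens_per_month(fixed, managed_usd_per_million_tokens=0.50)
print(f"fixed cost: ${fixed:,.0f}/month")
print(f"break-even: {volume / 1e9:.2f}B tokens/month "
      f"(~{sustained_tokens_per_second(volume):,.0f} tokens/s sustained)")
```

The last figure is the useful sanity check: if a single box cannot sustain that rate on the untuned stack you would actually run, the break-even point is not reachable without more GPUs, which moves the break-even further out.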
What we take from the scan
- The category is real and underrated. “Managed open-model serving” is the missing middle between frontier APIs and DIY, and for most teams it dominates raw self-hosting on every axis except pure data residency.
- We already cover the model-quality half. Every platform here is a benchmarked provider — so a client gets our measured pass-rate per model, not the vendor’s leaderboard.
- The next neutral signal to add is non-chat. Embeddings/transcription/TTS have the same “who’s actually best” gap chat had before we measured it. BEI throughput is the first concrete target.
- The self-hosting reality check — DIY’s honest limits; this is the managed alternative.
- What the vendors are building — the sibling scan, one layer up (agent runtimes).
- Providers — every platform here, with our measured per-model results.
- Open-source · Cost & TCO — the break-even math behind managed vs. DIY.
- Serving a local model — the optimization work these platforms do for you.