Who actually runs the open models — the managed inference layer

Between “rent a frontier API” and “stand up your own H100s” sits a third option most teams underrate: a managed inference platform that serves open-weight and custom models for you. This is the layer Baseten, Fireworks, Together, DeepInfra, Groq, SambaNova, Cerebras and NVIDIA NIM compete in — and their shared pitch is worth taking seriously: raw GPUs are table stakes; the moat is the optimization layer on top.

Our rule of thumb: recommend the best solution regardless of who makes it. So this is a fair read — and we already benchmark every platform here, so where we talk about model quality the number is measured, not the vendor’s claim.

The thesis: GPUs are commodity, the stack is the moat

The honest version of every managed-inference pitch is the same. Anyone can rent an H100. What’s actually hard — and what you’d be rebuilding if you served the model yourself — is the layer between the GPU and the API:

Custom kernels & decoding — speculative decoding, prefix/KV caching, quantization that doesn’t wreck quality. The difference between vanilla vLLM and a tuned stack is often 2–4× throughput on the same card.
Cold starts & autoscaling — loading a 70B model onto a fresh GPU takes tens of seconds; the platforms that win hide that. Scale-to-zero without a cold-start penalty is genuinely hard engineering.
Multi-cloud / VPC HA — spreading a workload across regions and clouds (or into your own VPC) for uptime, with one endpoint. This is the part a single self-hosted box structurally cannot match (the concurrency problem).
Specialized runtimes — embeddings, transcription/diarization, and streaming TTS each want a different serving shape than chat. The platforms increasingly ship purpose-built runtimes for each.
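To make the prefix/KV caching item concrete, here is a toy sketch of why a shared prompt prefix cuts prefill work. It is an illustration under loud assumptions: `ToyPrefixCache` and `prefill_cost` are hypothetical names, and real serving stacks cache KV tensors in blocks rather than remembering token tuples. Only the accounting is the point.

```python
from typing import List, Set, Tuple


class ToyPrefixCache:
    """Counts tokens that still need prefill once shared prefixes are cached.

    Hypothetical illustration: a real server stores KV tensors per block,
    not token tuples, and evicts entries under memory pressure.
    """

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, ...]] = set()

    def prefill_cost(self, tokens: List[str]) -> int:
        # Find the longest prefix of this request that is already cached.
        cached = 0
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self._seen:
                cached = n
                break
        # Remember every prefix of this request for later requests.
        for n in range(1, len(tokens) + 1):
            self._seen.add(tuple(tokens[:n]))
        # Only the uncached suffix has to be computed.
        return len(tokens) - cached


cache = ToyPrefixCache()
system_prompt = ["sys1", "sys2", "sys3"]
first = cache.prefill_cost(system_prompt + ["q1"])
second = cache.prefill_cost(system_prompt + ["q2", "q3"])
print(first, second)
```

The second request pays only for its own two question tokens because the three system-prompt tokens were cached by the first. That is the effect a tuned stack exploits when many requests share a long system prompt.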

That last point is where this category is pulling away from “just an OpenAI-compatible endpoint.” The interesting competition in 2026 isn’t chat tokens-per-second — it’s the non-chat modalities.

The field, as we read it

These split into two rough camps: software-optimization platforms (tune the stack on commodity GPUs) and custom-silicon platforms (their speed comes from their own chips). Both are legitimate; they trade off differently.

Baseten · the optimization-stack thesis, stated plainly

The clearest articulation of “the stack is the moat.” Custom kernels + caching baked in, fast cold starts, managed cloud or your VPC, and specialized runtimes: BEI embeddings (claims >2× throughput / ~10% lower latency vs. alternatives), optimized transcription + diarization, streaming TTS for voice agents, and Chains for compound/multi-step inference with per-step hardware + autoscaling. Forward-deployed engineers, not just self-serve. Customer list is AI-native (Cursor, Notion, Abridge, Writer). The BEI throughput claim is exactly the kind of thing we should measure rather than repeat; see below.

Fireworks · breadth + FireAttention kernels

A large catalog served on their own optimized kernels, with function-calling and fine-tuning. The pragmatic default when you want many open models fast behind one API.

Together · widest catalog, fine-tune + train

The broadest open-model shelf plus a training/fine-tuning side. Good when the requirement is “serve a lot of different open models” rather than one tuned workload.

DeepInfra · lowest sticker price

Aggressive per-token pricing on a broad open catalog. Often the cost floor for commodity open-model serving — worth checking against our measured pass-rate before assuming cheap == fine for the task.
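One way to check “cheap == fine” is to divide the price by the pass-rate. The sketch below assumes failed attempts are simply retried and that attempts are independent, so the expected number of attempts per success is 1 / pass-rate. The function name and every number in the example are hypothetical placeholders, not measured results or real price sheets.

```python
def cost_per_success(price_per_mtok: float, tokens_per_task: int, pass_rate: float) -> float:
    """Expected spend per successful task, assuming independent retries.

    price_per_mtok  -- price per million tokens (hypothetical input)
    tokens_per_task -- total tokens consumed by one attempt
    pass_rate       -- fraction of attempts that pass, in (0, 1]
    """
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    cost_per_attempt = price_per_mtok * tokens_per_task / 1_000_000
    # Expected attempts until the first success is 1 / pass_rate.
    return cost_per_attempt / pass_rate


# Placeholder numbers purely for illustration.
cheap = cost_per_success(price_per_mtok=0.20, tokens_per_task=2000, pass_rate=0.5)
pricier = cost_per_success(price_per_mtok=0.30, tokens_per_task=2000, pass_rate=0.9)
print(cheap > pricier)
```

With these made-up inputs, the provider with the lower sticker price ends up costing more per passed task, which is why the sticker price alone is not the comparison to make.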

Groq · SambaNova · Cerebras · custom silicon

Their edge is hardware, not just software: LPU / RDU / wafer-scale chips that hit tokens-per-second numbers commodity GPUs can’t. Decisive for latency-bound agents and voice. Trade-off: a narrower model menu, and you’re tied to their fleet. We benchmark Groq, SambaNova and Cerebras separately.

NVIDIA · NIM

Prepackaged, optimized inference microservices you run on NVIDIA GPUs — anywhere. The “bring it into your own infra” option from the silicon vendor itself; the baseline the independents position against.

Where it gets interesting: the non-chat runtimes

Our own benchmarks today measure chat quality. But the differentiated products on these platforms are increasingly embeddings, transcription and TTS, and those have their own “which is actually best” question that nobody answers neutrally:

Embeddings. Throughput and latency matter enormously for RAG ingestion at scale. Baseten’s BEI 2× claim is a concrete, testable number — a natural extension of what we already do for chat.
Transcription + diarization. Accuracy and cost per audio-hour, plus time-to-first-token for live use.
Streaming TTS. Time-to-first-byte is the metric for voice agents and phone — exactly the layer an agent runtime sits on top of.
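The two headline metrics above, embedding throughput and time-to-first-byte, can be measured with a very small harness. This is a minimal sketch, not our benchmark code. `embed_batch` and `stream` stand in for whatever client a given platform provides, and both helper names are hypothetical. It also assumes the stream is lazy, so that the request is issued when iteration starts.

```python
import time
from typing import Callable, Iterable, List, Sequence


def embedding_throughput(
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    texts: Sequence[str],
    batch_size: int = 32,
) -> float:
    """Return texts embedded per second for a caller-supplied client function."""
    start = time.perf_counter()
    done = 0
    for i in range(0, len(texts), batch_size):
        vectors = embed_batch(texts[i : i + batch_size])
        done += len(vectors)
    elapsed = time.perf_counter() - start
    return done / elapsed if elapsed > 0 else float("inf")


def time_to_first_chunk(stream: Iterable[bytes]) -> float:
    """Seconds from starting iteration until the first non-empty chunk arrives."""
    start = time.perf_counter()
    for chunk in stream:
        if chunk:
            return time.perf_counter() - start
    raise ValueError("stream produced no data")
```

A fair comparison would run both helpers against each platform with the same texts, batch size, region and concurrency, and report a distribution over many runs rather than a single number.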

Open thread for us. We can extend the same measured, vendor-neutral discipline we apply to chat into these modalities, starting with the Baseten BEI embeddings throughput claim. That’s a benchmark task, not a marketing line, and it’s queued.

Managed platform vs. DIY — the SecondClaw read

For private/self-hosted infrastructure (the SecondClaw question), the choice isn’t “cloud API vs. our own GPUs.” It’s a three-way split, and the managed platform usually beats raw DIY:

DIY H100 + vLLM: maximum control and data residency, but you own the kernel tuning, cold starts, autoscaling and HA. That’s months of engineering to match what a platform ships day one. Justified only when nothing can leave your building, or at very high steady volume on one task.
Managed platform → your VPC (single-tenant): the middle path. Baseten and NIM both offer running their optimized stack inside your own cloud. You get the optimization layer without rebuilding it; the open question for a regulated tenant is whether single-tenant isolation satisfies the compliance bar.
Managed platform (their cloud): fastest to ship, no GPU ops, but data transits a third party, which is the exact thing self-hosting exists to avoid. Fine for non-sensitive workloads, not the sovereignty case.

The honest verdict mirrors the self-hosting one: a managed platform replicating the optimization layer you’d struggle to build is the right default — unless the driver is strict data sovereignty, in which case you pay for VPC single-tenant or true DIY and accept the cost.
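The three-way split and the verdict can be written down as a small decision function. This is only a sketch of the reasoning in this section. The flag names are hypothetical, and a real decision would also weigh cost, team capacity and the specific compliance regime.

```python
from enum import Enum


class Deployment(Enum):
    DIY = "DIY H100 + vLLM"
    MANAGED_VPC = "managed platform in your VPC (single-tenant)"
    MANAGED_CLOUD = "managed platform (their cloud)"


def recommend(
    nothing_can_leave_building: bool,
    very_high_steady_volume_one_task: bool,
    sensitive_data: bool,
    single_tenant_meets_compliance: bool,
) -> Deployment:
    """Encode this section's rule of thumb; all inputs are hypothetical flags."""
    # DIY is justified only in the two cases the text names.
    if nothing_can_leave_building or very_high_steady_volume_one_task:
        return Deployment.DIY
    if sensitive_data:
        # The middle path works only if single-tenant isolation clears the bar.
        if single_tenant_meets_compliance:
            return Deployment.MANAGED_VPC
        return Deployment.DIY
    # Non-sensitive workloads: the fastest path to ship is the default.
    return Deployment.MANAGED_CLOUD


print(recommend(False, False, True, True).value)
```

The ordering of the checks carries the argument: DIY is the exception that has to be justified, and the managed options are the default everywhere else.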

What we take from the scan

The category is real and underrated. “Managed open-model serving” is the missing middle between frontier APIs and DIY, and for most teams it dominates raw self-hosting on every axis except pure data residency.
We already cover the model-quality half. Every platform here is a benchmarked provider — so a client gets our measured pass-rate per model, not the vendor’s leaderboard.
The next neutral signal to add is non-chat. Embeddings/transcription/TTS have the same “who’s actually best” gap chat had before we measured it. BEI throughput is the first concrete target.


In context

The self-hosting reality check — DIY’s honest limits; this is the managed alternative.
What the vendors are building — the sibling scan, one layer up (agent runtimes).
Providers — every platform here, with our measured per-model results.
Open-source · Cost & TCO — the break-even math behind managed vs. DIY.
Serving a local model — the optimization work these platforms do for you.