Why 'OpenAI-compatible' is necessary but not sufficient
“OpenAI-compatible” means a server speaks the same request/response shape as OpenAI’s Chat Completions API, so existing SDKs and tools point at it with a URL change. That’s genuinely useful — it’s why you can swap a local model in at all. But Chat Completions is stateless and minimal: you send the full message history every turn, you get text back, and that’s mostly it. The compatibility promise covers the protocol, not the capabilities — and local servers implement different subsets, often the thinnest one.
The result: the demo works, then production exposes everything the endpoint doesn’t do for you. Below is the checklist of what “more than an endpoint” actually means.
The gap, item by item
Stateful conversations
Chat Completions is stateless — youstore and resend the whole history every turn, which gets expensive and fiddly as conversations grow. OpenAI’s newer Responses API adds server-side state (a previous_response_id to continue a thread) so the server tracks context for you. Most local servers either don’t expose the Responses API or expose only its stateless flavor — so you rebuild conversation persistence yourself.
Tool / function calling
Agents need the model to call tools reliably and the server to round-trip the call/result. Compat endpoints vary wildly in tool-calling fidelity by model and implementation — some advertise it and silently return prose instead of a structured call. If your product is agentic, this is the first thing to test, not assume. (See tool use.)
Structured output
Getting guaranteedvalid JSON (schema-constrained output) is a server feature, not a given. Without it you get “mostly JSON” and write brittle parsers. Some local stacks support constrained decoding / JSON schema; many don’t — and it’s exactly the kind of thing a benchmark on your tasks surfaces fast.
Streaming — and streaming events
Token streaming is common, but streaming structured events (tool-call deltas, reasoning steps, partial structured output) is where compat endpoints diverge. A chat UI may be fine; an agent UI that shows tool steps live needs richer streaming than the floor provides.
Files & multimodal
Uploading a file or image and referencing it across turns is a managed capability on the big APIs. Locally you typically wire up your own storage, extraction, and context injection — the endpoint just takes text.
Token accounting & observability
You can’t optimize what you can’t see. Per-request token counts, latency, error rates, and traces are how you catch a regression or a runaway. A bare endpoint logs little; production needs tracing (open-source options like Langfuse-style observability exist) so cost and quality are measurable — the same reason this whole site measures models.
Error handling, limits & safety
Context-length overflows, rate limits, timeouts, retries, auth, and per-user caps are all on you. The hosted APIs absorb a lot of this; a local endpoint hands it back. (The runaway-cost angle is in the Token Tax Playbook.)
The standard to watch: Open Responses
This isn’t just a local-tooling complaint — the ecosystem is moving. OpenAI shifted its own recommended interface from Chat Completions to the Responses API (stateful, tools, streaming events, multimodal built in), and then co-developed an open specification — Open Responses — with a wide group of ecosystem players including NVIDIA, Vercel, OpenRouter, Hugging Face, LM Studio, Ollama, and vLLM. The direction of travel is clear: the local-serving baseline is rising from “a chat endpoint” toward “a stateful, tool-capable Responses endpoint.” When you choose a serving stack today, check where it sits on that curve, not just whether /chat/completions answers.
Build it or front it — the decision
You have two honest options for closing the gap, and the right one depends on scale:
- Build the missing layer yourself — conversation store, tool round-tripping, schema validation, tracing. Full control, real engineering cost; sensible when your needs are narrow or unusual.
- Front the model with a gateway that adds the layer. This is a product category, not one tool: API gateways / proxies (e.g. LiteLLM-class proxies for budgets, keys, and routing), observability layers (Langfuse-class tracing), and a newer class of Responses-API gateways that put stateful conversations, tools, files, and observability in front of a local runtime like Ollama. Evaluate them on which gap items they actually close for your stack — not the feature list.
The same “decide what, let the gateway do how” split we draw for routing applies here: keep the parts that are your product’s value, delegate the plumbing that isn’t.
The bottom line
Self-hosting an open model is the easy 80%. An OpenAI-compatible endpoint gets you a working request in minutes — and then the remaining 20% (state, tools, structured output, files, observability, limits) is most of the actual product. Plan the serving layer as deliberately as you plan the model: pick a stack by what it serves, not just what it answers, and measure the capabilities you depend on before you ship.A model that passes your benchmark but can’t reliably call a tool through your endpoint hasn’t actually shipped.
- How to train an open model — the step before this; serving is what comes after.
- The Token Tax Playbook — the self-hosting lever + break-even math.
- Using Chinese models safely — network isolation for self-hosted serving.
- Prompt routing · Tool use — the capabilities the serving layer has to carry.