The shape: layers between the client and the model
Every robust AI API has the same skeleton — a gateway for auth and throttling, an application layer for validation, prompt construction, cost tracking and error handling, the model itself, a store for usage accounting, and a metrics pipeline with alarms:
Client → Gateway (auth + throttling) → App layer (validation, prompt build, cost tracking, errors) → Model (inference) → Usage store (per-key token accounting) → Metrics (structured logs → alarms)
The reference stack is AWS-shaped (API Gateway → Lambda → Bedrock → DynamoDB → CloudWatch), but the pattern is cloud-agnostic: the same five responsibilities map cleanly onto a Postgres/Supabase + edge-function stack. What matters is that each property below has a layer that owns it, rather than being assumed.
Property 1 — Predictable behavior under all conditions
Production traffic includes the happy path and every failure path: a huge input, a slow upstream, a burst of concurrent requests. The first defense is cheap — bound the input before it ever reaches the model.
- Length bounds. A min/max on user input (e.g. 3–3,000 chars) caps the token cost of any single request before you pay for inference.
- Reject malformed input. Null bytes and control characters are never legitimate user text — drop them at the door.
- Resilience on the upstream. Exponential backoff on throttling (e.g. 1s / 2s / 4s, three retries), and a retry loop that distinguishes retryable from terminal — retry a throttle, fail fast on a timeout or a validation error rather than hammering a doomed call.
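The three bullets above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `ValidationError`, `ThrottledError`, `validate_input` and `call_with_backoff` are hypothetical names, the 3–3,000 bound and the 1s / 2s / 4s delays are the example values from the text, and allowing newlines and tabs through the control-character check is an assumption you may not share.

```python
import time

MIN_CHARS, MAX_CHARS = 3, 3000  # example bounds from the text; tune per use case


class ValidationError(ValueError):
    """Terminal: the request itself is bad, so retrying cannot help."""


class ThrottledError(Exception):
    """Retryable: hypothetical stand-in for the upstream's throttling error."""


def validate_input(text):
    """Bound and sanity-check user input before it reaches the model."""
    if not isinstance(text, str):
        raise ValidationError("input must be a string")
    if not MIN_CHARS <= len(text) <= MAX_CHARS:
        raise ValidationError(f"input must be {MIN_CHARS}-{MAX_CHARS} chars")
    # Reject null bytes and control characters (assumption: \n, \r, \t are fine).
    if any((ord(ch) < 32 and ch not in "\n\r\t") or ch == "\x7f" for ch in text):
        raise ValidationError("input contains control characters")
    return text


def call_with_backoff(call, retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry only on throttling; any other exception propagates immediately."""
    for attempt in range(retries + 1):
        try:
            return call()
        except ThrottledError:
            if attempt == retries:
                raise  # retries exhausted, surface the throttle to the caller
            sleep(base_delay * 2 ** attempt)  # 1s, then 2s, then 4s
```

Passing `sleep` as a parameter keeps the retry loop testable without real delays. In production you would likely add jitter to the delay so that throttled clients do not retry in lockstep.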
Property 2 — Security at every layer (and an honest caveat)
The AI-specific threats are prompt injection, data exfiltration via crafted prompts, token bombing, and abuse of leaked credentials. A cheap first-pass filter catches the laziest attempts — regex for patterns like “ignore previous instructions,” “you are now…,” “developer/jailbreak mode,” “reveal your system prompt.”
The honest caveat: don't mistake this for a security boundary. Regex injection filtering is trivially bypassed by paraphrasing, encoding, or switching languages. It's a thin pre-filter for logging and triage, not a solution to injection. Real defense is structural: privilege separation between system and user content, structured-output constraints on what the model can return, and, most importantly, not putting secrets in the system prompt in the first place, so there's nothing to exfiltrate even if the boundary fails.
- Hash credentials before they touch storage or logs. Use a hash of the API key as the usage-tracking key; never log a raw credential.
- Constrain the output, not just the input. An envelope the app enforces (only emit text matching an expected structured shape; drop anything else) is a defense-in-depth layer against both injection and malformed responses — the same idea behind our chatbot bridge's structured-output enforcement.
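A rough sketch of how these pieces might look in Python follows. Everything named here is hypothetical: the pattern list only mirrors the phrases quoted above, `usage_key` assumes API keys are long random strings (so a plain SHA-256 is adequate; a keyed HMAC would be a safer choice for low-entropy credentials), and the `{"answer": ...}` envelope is an invented shape standing in for whatever structure your application actually expects.

```python
import hashlib
import json
import re
from typing import List, Optional

# Triage signal only: these patterns are trivially bypassed by paraphrasing.
INJECTION_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"ignore\s+(all\s+)?previous\s+instructions",
        r"you\s+are\s+now\b",
        r"(developer|jailbreak)\s+mode",
        r"reveal\s+your\s+system\s+prompt",
    )
]


def injection_flags(text: str) -> List[str]:
    """Return the patterns that matched, for logging; not a security decision."""
    return [p.pattern for p in INJECTION_PATTERNS if p.search(text)]


def usage_key(api_key: str) -> str:
    """Hash the credential so storage and logs never see the raw value."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def enforce_envelope(raw_model_output: str) -> Optional[dict]:
    """Accept only {"answer": <str>}; return None for anything else."""
    try:
        data = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return None
    if (
        isinstance(data, dict)
        and set(data) == {"answer"}
        and isinstance(data["answer"], str)
    ):
        return data
    return None
```

The envelope check is deliberately strict: an extra key is grounds for dropping the response, which is what makes it useful against both injected content and ordinary malformed output.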
Property 3 — Cost control that actually scales
Soft monitoring tells you about a runaway bill after you've paid it. The production version is a hard ceiling enforced before the spend:
- A per-key daily token quota, enforced with an atomic conditional update: DynamoDB's ADD … ConditionExpression, or in Postgres a SELECT … FOR UPDATE / upsert-then-check. Atomicity is the point: two concurrent requests can't both slip past the limit.
- Estimate cost pre-call (a chars/4 heuristic for input + the max output budget) and reject an over-quota request before spending a cent on inference.
- A same-day cost alarm — publish per-request cost as a metric and alarm on a daily-spend threshold, so you catch a runaway the same day instead of at month-end billing.
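To make the atomic check concrete, here is a small runnable sketch. It swaps in SQLite (from Python's standard library) for DynamoDB or Postgres so the example is self-contained; the idea it illustrates is the same conditional update, where the increment and the limit check happen in a single statement. The table layout, the function names, and the quota and output-budget values are all assumptions made for illustration.

```python
import sqlite3

DAILY_TOKEN_QUOTA = 50_000  # hypothetical per-key daily limit
MAX_OUTPUT_TOKENS = 1_000   # hypothetical output budget per request


def estimate_tokens(prompt, max_output_tokens=MAX_OUTPUT_TOKENS):
    """Pre-call estimate: chars/4 for the input plus the full output budget."""
    return len(prompt) // 4 + max_output_tokens


def init_store(conn):
    """Create the per-key, per-day usage table if it does not exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS usage ("
        " key_hash TEXT NOT NULL, day TEXT NOT NULL,"
        " tokens INTEGER NOT NULL DEFAULT 0,"
        " PRIMARY KEY (key_hash, day))"
    )


def try_reserve(conn, key_hash, day, tokens, quota=DAILY_TOKEN_QUOTA):
    """Add `tokens` to the day's usage only if the total stays within quota."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO usage (key_hash, day, tokens) VALUES (?, ?, 0)",
            (key_hash, day),
        )
        # The limit check lives in the WHERE clause, so check and increment
        # are a single statement rather than a read followed by a write.
        cur = conn.execute(
            "UPDATE usage SET tokens = tokens + ? "
            "WHERE key_hash = ? AND day = ? AND tokens + ? <= ?",
            (tokens, key_hash, day, tokens, quota),
        )
        return cur.rowcount == 1  # zero rows updated means over quota
```

A caller would run `try_reserve(conn, key_hash, today, estimate_tokens(prompt))` before invoking the model and refuse the request when it returns `False`. Reserving the estimate up front and reconciling against actual usage afterwards is one possible design; the text does not prescribe it.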
This is the engineering form of the token-tax playbook's “cap what can run away” lever. Our own DLR chatbot bridge runs exactly this shape — a tenant daily spend cap with alerts at a per-request threshold, at 75% of the cap, and on any daily spike over 2× the rolling average; past the cap, new chats are refused until the day rolls over. For a multi-tenant stack like SecondClaw, the per-key quota maps directly onto per-client isolation, and the atomic-update pattern ports straight from DynamoDB to Supabase/Postgres.
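The alert conditions described above reduce to a handful of comparisons. The sketch below is an illustration of that logic only, not the bridge's actual code: the function names, argument names, and the choice of a simple mean over recent daily totals as the "rolling average" are assumptions.

```python
from statistics import mean


def spend_alerts(request_cost, spent_today, daily_cap,
                 recent_daily_totals, per_request_limit):
    """Return the names of the alert conditions that currently hold."""
    alerts = []
    if request_cost > per_request_limit:
        alerts.append("per_request_threshold")
    if spent_today >= 0.75 * daily_cap:
        alerts.append("cap_75_percent")
    # Spike check: today's spend is more than 2x the rolling daily average.
    if recent_daily_totals and spent_today > 2 * mean(recent_daily_totals):
        alerts.append("daily_spike")
    return alerts


def accept_new_chat(spent_today, daily_cap):
    """Past the cap, new chats are refused until the day rolls over."""
    return spent_today < daily_cap
```

For example, with a hypothetical cap of 10.0 and 8.0 already spent against a recent daily average of 3.0, `spend_alerts(0.01, 8.0, 10.0, [3.0, 3.0, 3.0], 0.05)` reports both the 75% and the spike conditions.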
Property 4 — Observability you can query
When something breaks silently, unstructured print() logs leave you doing grep archaeology. The fix is structured logs, where every line is a JSON object carrying a correlating request_id:
{ "level": "info", "event": "inference_done", "ts": "…",
  "request_id": "…", "in_tok": 412, "out_tok": 188,
  "cost_usd": 0.0021, "model": "…", "latency_ms": 740 }
- Correlate by request_id so a log aggregator can answer “most expensive requests in the last 24h” or “error breakdown by type” without regex.
- Log injection detections separately so attack patterns become a queryable trend, not noise buried in the request stream.
- Alarm on the few that matter (tune per use case): error rate > 5% over 5 min; estimated daily cost crossing a fixed threshold; P95 latency > 5,000ms across two consecutive 5-min windows.
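As a sketch of what this might look like in application code, the snippet below emits one JSON object per line and evaluates two of the example alarm conditions locally. In a real deployment the alarm evaluation would normally live in the metrics system rather than in the app; it is written out here only to make the thresholds concrete. All function names are hypothetical, and the nearest-rank percentile is one of several common P95 definitions.

```python
import json
import sys
from datetime import datetime, timezone


def log_event(event, request_id, level="info", stream=sys.stdout, **fields):
    """Write one JSON object per line so an aggregator can query by field."""
    record = {
        "level": level,
        "event": event,
        "ts": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        **fields,
    }
    line = json.dumps(record)
    stream.write(line + "\n")
    return line


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def error_rate_alarm(statuses, threshold=0.05):
    """True if the share of 'error' statuses in one window exceeds threshold."""
    return bool(statuses) and statuses.count("error") / len(statuses) > threshold


def latency_alarm(prev_window_ms, curr_window_ms, threshold_ms=5000):
    """True only if P95 latency exceeds the threshold in both windows."""
    return (
        bool(prev_window_ms)
        and bool(curr_window_ms)
        and p95(prev_window_ms) > threshold_ms
        and p95(curr_window_ms) > threshold_ms
    )
```

A call such as `log_event("injection_detected", request_id, level="warn", patterns=2)` gives injection detections their own event name, which is what makes them queryable as a separate trend.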
This is the same instinct behind everything we publish: a number you can't query isn't observability, and, as the routing Pareto trap shows, an aggregate number you can't segment will hide the failure that actually matters.
The through-line: measurement is the product
None of these four properties is glamorous, and that's the point: a production AI API is mostly the boring infrastructure that makes the model's output trustworthy, affordable, and debuggable. Cost ceilings enforced before the spend, security that doesn't rely on a bypassable regex, and logs you can actually query are the difference between a demo and a service. It's the same discipline this whole site argues for, applied one layer down: don't assume the happy path, measure and bound every path.