The demo works. Now make it survive production.

A model call in a notebook and a model call behind a customer-facing endpoint are different engineering problems. The demo dies in production four predictable ways: an unbounded prompt times out, request volume blows up the bill, a crafted prompt leaks the system prompt, and when it breaks there's no way to see why. None of the fixes are exotic — they're the boring layers between the client and the model. Here are the four properties to check for, and the patterns that deliver them.

The shape: layers between the client and the model

Every robust AI API has the same skeleton — a gateway for auth and throttling, an application layer for validation, prompt construction, cost tracking and error handling, the model itself, a store for usage accounting, and a metrics pipeline with alarms:

Client
  → Gateway        (auth + throttling)
  → App layer      (validation, prompt build, cost tracking, errors)
  → Model          (inference)
  → Usage store    (per-key token accounting)
  → Metrics        (structured logs → alarms)

The reference stack is AWS-shaped (API Gateway → Lambda → Bedrock → DynamoDB → CloudWatch), but the pattern is cloud-agnostic: the same five responsibilities map cleanly onto a Postgres/Supabase + edge-function stack. What matters is that each property below has a layer that owns it, rather than being assumed.

Property 1 — Predictable behavior under all conditions

Production traffic includes the happy path and every failure path: a huge input, a slow upstream, a burst of concurrent requests. The first defense is cheap — bound the input before it ever reaches the model.

Length bounds. A min/max on user input (e.g. 3–3,000 chars) caps the token cost of any single request before you pay for inference.
Reject malformed input. Null bytes and control characters are never legitimate user text — drop them at the door.
Resilience on the upstream. Exponential backoff on throttling (e.g. 1s / 2s / 4s, three retries), and a retry loop that distinguishes retryable from terminal — retry a throttle, fail fast on a timeout or a validation error rather than hammering a doomed call.
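
The three checks above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names validate_input, call_with_backoff, ValidationError and ThrottledError are hypothetical, the bounds are the example values from the text, and allowing tab and newline characters through is an assumption you may not want.

```python
import re
import time

MIN_CHARS, MAX_CHARS = 3, 3000  # example bounds from the text; tune per use case
# Control characters (including the null byte). Tab, newline and carriage
# return are allowed here, which is an assumption about legitimate input.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


class ValidationError(Exception):
    """Terminal: the request itself is bad, so retrying cannot help."""


class ThrottledError(Exception):
    """Retryable: the upstream asked us to slow down (hypothetical type)."""


def validate_input(text):
    """Reject input that is out of bounds or contains control characters."""
    if not isinstance(text, str):
        raise ValidationError("input must be a string")
    if not MIN_CHARS <= len(text) <= MAX_CHARS:
        raise ValidationError(f"input must be {MIN_CHARS}-{MAX_CHARS} characters")
    if _CONTROL_CHARS.search(text):
        raise ValidationError("input contains control characters")
    return text


def call_with_backoff(call, retries=3, base_delay=1.0, sleep=time.sleep):
    """Run call(); retry only on ThrottledError, waiting 1s, 2s, 4s by default."""
    for attempt in range(retries + 1):
        try:
            return call()
        except ThrottledError:
            if attempt == retries:
                raise  # retries exhausted: surface the throttle to the caller
            sleep(base_delay * 2 ** attempt)
        # Any other exception (a timeout, a ValidationError) propagates at once.
```

Passing sleep as a parameter keeps the retry loop testable without real waiting. Only the throttle is caught, so terminal errors fail fast by construction.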

Property 2 — Security at every layer (and an honest caveat)

The AI-specific threats are prompt injection, data exfiltration via crafted prompts, token bombing, and abuse of leaked credentials. A cheap first-pass filter catches the laziest attempts — regex for patterns like “ignore previous instructions,” “you are now…,” “developer/jailbreak mode,” “reveal your system prompt.”
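
A first-pass filter of this kind might look like the sketch below. The four patterns are illustrative, built from the example phrases above, and the function name flag_injection is hypothetical. It returns the names of the patterns that matched so the result can be logged for triage. It does not decide whether to block the request.

```python
import re

# Illustrative patterns only, derived from the example phrases in the text.
_INJECTION_PATTERNS = {
    "ignore_previous": re.compile(
        r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.I
    ),
    "role_override": re.compile(r"\byou\s+are\s+now\b", re.I),
    "mode_switch": re.compile(r"\b(developer|jailbreak)\s+mode\b", re.I),
    "prompt_leak": re.compile(r"reveal\s+(your\s+)?system\s+prompt", re.I),
}


def flag_injection(text):
    """Return the names of every pattern the text matches (empty if none)."""
    return [
        name
        for name, pattern in _INJECTION_PATTERNS.items()
        if pattern.search(text)
    ]
```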

The honest caveat: don't mistake this for a security boundary. Regex injection filtering is trivially bypassed by paraphrasing, encoding, or switching languages. It's a thin pre-filter for logging and triage, not a solved problem. Real defense is structural: privilege separation between system and user content, structured-output constraints on what the model can return, and, most importantly, not putting secrets in the system prompt in the first place, so there's nothing to exfiltrate even if the boundary fails.

Hash credentials before they touch storage or logs. Use a hash of the API key as the usage-tracking key; never log a raw credential.
Constrain the output, not just the input. An envelope the app enforces (only emit text matching an expected structured shape; drop anything else) is a defense-in-depth layer against both injection and malformed responses — the same idea behind our chatbot bridge's structured-output enforcement.
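
Both points can be sketched with the Python standard library. The envelope shape here (an object with exactly an "answer" string and a "sources" list) is a hypothetical example, as are the names usage_key and enforce_envelope. Plain SHA-256 assumes the API keys are long and random. If they could be guessed, a keyed hash (HMAC with a server-side secret) would be the safer choice.

```python
import hashlib
import json


def usage_key(api_key):
    """SHA-256 hex digest of the API key, to store and log instead of the raw key."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


# Hypothetical envelope: a JSON object with exactly these fields and types.
EXPECTED_FIELDS = {"answer": str, "sources": list}


def enforce_envelope(raw_output):
    """Return the parsed object if it matches the envelope, else None (drop it)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(EXPECTED_FIELDS):
        return None  # missing or unexpected fields
    for field, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(data[field], expected_type):
            return None
    return data
```

Requiring an exact field set, rather than only checking that the expected fields are present, means an extra field smuggled into the response is dropped along with everything else.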

Property 3 — Cost control that actually scales

Soft monitoring tells you about a runaway bill after you've paid it. The production version is a hard ceiling enforced before the spend:

A per-key daily token quota, enforced with an atomic conditional update — DynamoDB's ADD … ConditionExpression, or in Postgres a SELECT … FOR UPDATE / upsert-then-check. Atomicity is the point: two concurrent requests can't both slip past the limit.
Estimate cost pre-call (a chars/4 heuristic for input + the max output budget) and reject an over-quota request before spending a cent on inference.
A same-day cost alarm — publish per-request cost as a metric and alarm on a daily-spend threshold, so you catch a runaway the same day instead of at month-end billing.
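
To show the quota check and the pre-call estimate without any cloud dependency, the sketch below swaps in SQLite for DynamoDB or Postgres. The idea is the same: the limit test sits in the WHERE clause of the UPDATE that increments the counter, so the check and the write are a single statement. The table layout, the limits, and the function names are all assumptions for illustration.

```python
import sqlite3

DAILY_TOKEN_LIMIT = 100_000  # hypothetical per-key daily quota
MAX_OUTPUT_TOKENS = 512      # hypothetical output budget per request


def estimate_tokens(prompt):
    """Pre-call estimate: chars/4 for the input plus the full output budget."""
    return len(prompt) // 4 + MAX_OUTPUT_TOKENS


def open_usage_store(path=":memory:"):
    """Create the usage table if needed and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS usage ("
        " key_hash TEXT NOT NULL, day TEXT NOT NULL,"
        " tokens INTEGER NOT NULL DEFAULT 0,"
        " PRIMARY KEY (key_hash, day))"
    )
    return conn


def try_reserve(conn, key_hash, day, tokens, limit=DAILY_TOKEN_LIMIT):
    """Add tokens to the day's usage only if the total stays within the limit."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT OR IGNORE INTO usage (key_hash, day, tokens) VALUES (?, ?, 0)",
            (key_hash, day),
        )
        cur = conn.execute(
            "UPDATE usage SET tokens = tokens + ?"
            " WHERE key_hash = ? AND day = ? AND tokens + ? <= ?",
            (tokens, key_hash, day, tokens, limit),
        )
        return cur.rowcount == 1  # no row updated means the quota would be exceeded
```

A caller would run try_reserve(conn, key_hash, today, estimate_tokens(prompt)) and reject the request when it returns False, before any inference call is made.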

This is the engineering form of the token-tax playbook's “cap what can run away” lever. Our own DLR chatbot bridge runs exactly this shape — a tenant daily spend cap with alerts at a per-request threshold, at 75% of the cap, and on any daily spike over 2× the rolling average; past the cap, new chats are refused until the day rolls over. For a multi-tenant stack like SecondClaw, the per-key quota maps directly onto per-client isolation, and the atomic-update pattern ports straight from DynamoDB to Supabase/Postgres.
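
As a rough illustration of the three alert conditions and the refusal rule described above (not the bridge's actual code), the checks reduce to a few comparisons. The function and alert names are hypothetical, and treating "past the cap" as spend at or above the cap is an assumption.

```python
def spend_alerts(request_cost, spent_today, daily_cap,
                 rolling_avg_daily, per_request_threshold):
    """Return the names of the alerts that fire for a tenant's current spend."""
    alerts = []
    if request_cost > per_request_threshold:
        alerts.append("request_over_threshold")
    if spent_today >= 0.75 * daily_cap:
        alerts.append("cap_75_percent")
    # Skip the spike check until there is a rolling average to compare against.
    if rolling_avg_daily > 0 and spent_today > 2 * rolling_avg_daily:
        alerts.append("daily_spike")
    return alerts


def accepts_new_chats(spent_today, daily_cap):
    """New chats are refused once the day's spend reaches the cap (assumption)."""
    return spent_today < daily_cap
```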

Property 4 — Observability you can query

When something breaks silently, unstructured print() logs leave you doing grep archaeology. The fix is structured logs, with every line a JSON object carrying a correlating request_id:

{ "level": "info", "event": "inference_done", "ts": "…",
  "request_id": "…", "in_tok": 412, "out_tok": 188,
  "cost_usd": 0.0021, "model": "…", "latency_ms": 740 }
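
A logger that emits lines in that shape needs nothing beyond the standard library. The sketch below is one possible version: log_event and new_request_id are hypothetical names, and the timestamp format and the use of a UUID for the request ID are assumptions.

```python
import json
import sys
import time
import uuid


def new_request_id():
    """Generate an ID once per request and pass it to every log call."""
    return uuid.uuid4().hex


def log_event(event, request_id, level="info", stream=None, **fields):
    """Write one JSON object per line and return the serialized line."""
    record = {
        "level": level,
        "event": event,
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "request_id": request_id,
        **fields,  # e.g. in_tok, out_tok, cost_usd, model, latency_ms
    }
    line = json.dumps(record)
    out = sys.stdout if stream is None else stream
    out.write(line + "\n")
    return line
```

A call such as log_event("inference_done", rid, in_tok=412, out_tok=188, cost_usd=0.0021, latency_ms=740) produces a line with the same fields as the example above.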

Correlate by request_id so a log aggregator can answer “most expensive requests in the last 24h” or “error breakdown by type” without regex.
Log injection detections separately so attack patterns become a queryable trend, not noise buried in the request stream.
Alarm on the few that matter (tune per use case): error rate > 5% over 5 min; estimated daily cost crossing a fixed threshold; P95 latency > 5,000ms across two consecutive 5-min windows.
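
Once every log line is a parsed JSON object, the questions and alarm conditions above become small functions over a list of records. The sketch below makes several assumptions: one terminal record per request, an error_type field on error records, and a nearest-rank P95. In practice a log aggregator would run the equivalent queries, and all function names here are hypothetical.

```python
from collections import Counter


def most_expensive(records, n=5):
    """Top-n completed requests by cost."""
    done = [r for r in records if r.get("event") == "inference_done"]
    return sorted(done, key=lambda r: r["cost_usd"], reverse=True)[:n]


def error_breakdown(records):
    """Count error records by their error_type field (an assumed field name)."""
    return Counter(
        r.get("error_type", "unknown") for r in records if r.get("level") == "error"
    )


def error_rate(records):
    """Share of records that are errors, assuming one terminal record per request."""
    if not records:
        return 0.0
    return sum(r.get("level") == "error" for r in records) / len(records)


def p95(values):
    """Nearest-rank 95th percentile; None for an empty window."""
    ordered = sorted(values)
    if not ordered:
        return None
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def latency_alarm(windows, threshold_ms=5000):
    """True if P95 latency exceeds the threshold in two consecutive windows."""
    breaches = [(p95(w) or 0) > threshold_ms for w in windows]
    return any(a and b for a, b in zip(breaches, breaches[1:]))
```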

This is the same instinct behind everything we publish: a number you can't query isn't observability, and, as the routing Pareto trap shows, an aggregate number you can't segment will hide the failure that actually matters.

The through-line: measurement is the product

None of these four layers is glamorous, and that's the point — a production AI API is mostly the boring infrastructure that makes the model's output trustworthy, affordable, and debuggable. Cost ceilings enforced before the spend, security that doesn't rely on a bypassable regex, and logs you can actually query are the difference between a demo and a service. It's the same discipline this whole site argues for, applied one layer down: don't assume the happy path, measure and bound every path.
