The contract: one folder, one agent
The core idea is a filesystem-first contract: an agent is a directory, and each file or folder declares one component. There is no central registry to wire up — the framework discovers the pieces at build time. The layout reads like the anatomy of an agent:
my-agent/
  agent.ts          # model config → "anthropic/claude-opus-4.8"
  instructions.md   # system prompt, loaded on every call
  tools/            # executable capabilities (TypeScript + Zod schemas)
  skills/           # knowledge loaded contextually (Markdown)
  connections/      # secure MCP servers + OpenAPI integrations
  channels/         # where it's reachable: Slack, Discord, Telegram, GitHub…
  schedules/        # cron: autonomous, unattended runs
  subagents/        # specialist child agents it can delegate to
Two files — agent.ts and instructions.md — are enough to get a working agent; everything else is opt-in. Scaffold with npx eve init, run it locally with eve dev, deploy with vercel deploy. The same definition serves HTTP, Slack, Telegram, GitHub and more at once — you write the agent once and pick the surfaces.
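To make the directory contract concrete, here is a minimal sketch of what build-time discovery could look like. It is not Eve's actual implementation: the names `AgentManifest`, `discoverAgent`, and `isRunnable` are hypothetical, and the function takes a plain list of top-level entry names rather than reading the filesystem, so the logic is easy to follow.

```typescript
// Hypothetical sketch of filesystem-first discovery; not Eve's real code.
type ComponentKind =
  | "tools"
  | "skills"
  | "connections"
  | "channels"
  | "schedules"
  | "subagents";

interface AgentManifest {
  hasModelConfig: boolean;     // agent.ts was found
  hasInstructions: boolean;    // instructions.md was found
  components: ComponentKind[]; // opt-in folders that were found
}

// The opt-in folders from the layout above, in the order they are listed.
const OPTIONAL_FOLDERS: readonly ComponentKind[] = [
  "tools",
  "skills",
  "connections",
  "channels",
  "schedules",
  "subagents",
];

// Takes the top-level entry names of an agent directory (a trailing
// slash on folder names is tolerated) and reports which parts exist.
function discoverAgent(entries: string[]): AgentManifest {
  const names = new Set(entries.map((entry) => entry.replace(/\/$/, "")));
  return {
    hasModelConfig: names.has("agent.ts"),
    hasInstructions: names.has("instructions.md"),
    components: OPTIONAL_FOLDERS.filter((folder) => names.has(folder)),
  };
}

// The two required files are enough; every folder is optional.
function isRunnable(manifest: AgentManifest): boolean {
  return manifest.hasModelConfig && manifest.hasInstructions;
}
```

Under these assumptions, `discoverAgent(["agent.ts", "instructions.md"])` yields a runnable manifest with no optional components, and unknown entries such as a README are simply ignored. No registry has to be edited when a folder appears or disappears.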
What the framework gives you for free
The value is the production machinery that every serious agent needs and nobody enjoys rebuilding. Eve ships six pieces of it:
- Durable execution — every conversation is a checkpointed workflow, so an agent survives a crash or a deploy mid-task and resumes where it left off instead of starting over.
- Sandboxed compute — agent code runs isolated (Docker locally, a hosted sandbox in production), so a tool call can’t reach the rest of your system.
- Human-in-the-loop approvals — any action can carry a needsApproval flag and pause for a human before it fires.
- Secure connections — auth brokerage to Slack, GitHub, Notion, Salesforce, Snowflake and the rest, instead of hand-rolling each OAuth dance.
- Multi-channel deployment — one agent answers across every channel simultaneously.
- Observability — OpenTelemetry tracing and evals built in, so you can see what the agent did and test it.
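The approval item in the list above is the easiest one to sketch. The following is an illustration of the idea only, assuming a hypothetical `Action` shape with an optional `needsApproval` field; Eve's real types and pause mechanism are not shown in this article and may differ. A flagged action returns a pending outcome instead of running, and nothing fires until a human decision is passed to `resume`.

```typescript
// Hypothetical action shape; the real framework's types may differ.
interface Action<A> {
  name: string;
  needsApproval?: boolean;
  run: (args: A) => string;
}

type Outcome =
  | { status: "done"; result: string }
  | { status: "rejected" }
  | { status: "pending"; resume: (approved: boolean) => Outcome };

// Runs unflagged actions immediately. Flagged actions are parked:
// the caller gets a `resume` callback and the side effect only
// happens if a human later approves.
function execute<A>(action: Action<A>, args: A): Outcome {
  const fire = (): Outcome => ({ status: "done", result: action.run(args) });
  if (!action.needsApproval) {
    return fire();
  }
  return {
    status: "pending",
    resume: (approved) => (approved ? fire() : { status: "rejected" }),
  };
}
```

Modeling the pause as a returned value rather than a blocking wait is a design choice made for this sketch: a pending outcome can be stored and picked up later, which fits the checkpointed style of execution described in the list.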
The payoff is real: Vercel cites internal agents like a data analyst fielding 30,000+ questions a month, a support agent resolving 92% of tickets on its own, and an autonomous SDR that “costs about $5,000 a year and returns 32× that.” The framework is what let a small team build each one in weeks rather than quarters.
Why this looks familiar to us
Eve is, almost line for line, the architecture we already run. The agents behind this site and the DLR chatbot are built from the same parts: a directory of Markdown skills loaded on demand, a set of tools (our read-only -eye lenses, a secrets CLI, database lenses), MCP connections, channels (Telegram bots), schedules (the nightly and morning cron jobs), subagents for delegated work, and a hard human-approval gate on anything destructive. When Vercel open-sources the same shape and runs a hundred agents on it, that is confirmation that the pattern is right, not a reason to rebuild.
So our read is validation, not migration. We deliberately keep the agent runtime local — agents execute on our own hardware at $0, and the only metered cost is the model call itself. A hosted framework moves that execution (and the durable-workflow checkpointing that re-sends the prompt on every step) onto billed cloud compute, and brokers your connection auth through a third party. For us, both of those trade the two things we guard hardest — predictable cost and tight control of secrets — for convenience we don’t need. The lesson worth borrowing is the shape: the directory contract, the approval flag, the built-in tracing. The runtime we keep at home. (See the self-hosting reality check for when running it yourself genuinely wins.)
The one line the framework hardcodes
Look back at the directory. The model lives in one place — agent.ts, a single string: "anthropic/claude-opus-4.8". A framework makes everything around the model effortless and then leaves the most consequential choice as a build-time constant. Pin a frontier model and every task — the yes/no intent, the templated lookup, the genuinely hard reasoning — pays frontier prices. Pin a cheap one and the hard tasks quietly fail.
That constant is exactly what this site exists to turn into a measurement. The framework moves the request and runs the loop; it does not know which model is right for each thing the agent does. EyesInAI benchmarks the candidates on the real tasks, graded against ground truth, and produces the answer the agent.ts line should hold — the least-expensive model that still passes, per task. On the DLR chatbot that one change measured ~31% in savings per turn at the same accuracy.
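The selection rule described here, the least-expensive model that still passes for each task, can be sketched in a few lines. This is an illustration under stated assumptions, not EyesInAI's actual pipeline: the `BenchmarkResult` shape, the `cheapestPassing` function, and the accuracy threshold are all hypothetical, and any model names or prices fed into it would be placeholders.

```typescript
// Hypothetical benchmark row: one model graded on one task type.
interface BenchmarkResult {
  model: string;
  task: string;
  accuracy: number;    // fraction of cases graded correct, 0..1
  costPerCall: number; // USD per call, illustrative
}

// For each task, keep the cheapest model whose accuracy meets the bar.
// A task where no model passes is left out of the routing table.
function cheapestPassing(
  results: BenchmarkResult[],
  minAccuracy: number,
): Record<string, string> {
  const best: Record<string, BenchmarkResult> = {};
  for (const r of results) {
    if (r.accuracy < minAccuracy) continue; // fails the task, never eligible
    const current = best[r.task];
    if (current === undefined || r.costPerCall < current.costPerCall) {
      best[r.task] = r;
    }
  }
  const routes: Record<string, string> = {};
  for (const task of Object.keys(best)) {
    const winner = best[task];
    if (winner) routes[task] = winner.model;
  }
  return routes;
}
```

With a routing table like this, a simple intent check can go to a small model while hard reasoning stays on a frontier one, instead of both paying the price of whichever single string was pinned in agent.ts. A task that no candidate passes is a signal to revisit the task or the candidates, not to silently fall back to the cheapest option.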
So the two layers stack cleanly: a framework like Eve is the how it runs; a measured routing signal is the which model runs it. Adopt the framework pattern for the plumbing, and feed its model choice from a number you can defend — not a string you guessed at build time.
Frameworks are the app layer growing up — the same forces that decide which model-powered companies survive decide which agents are worth running. The economics are in the field guide; the measured model choice is the rest of this site.