Eleventh — AI engineering is contract engineering


INGEST source → EMBED vector → RETRIEVE pgvector → RERANK cohere → GENERATE claude → EVAL deterministic → SHIP prod

// RAG · agents · backends · eval

Eleventh turns fragile retrieval, agent, and AI-backend prototypes into systems that survive real users, real data, and real failures.

Book a Diagnostic · Inspect the proof · See the systems

git · main · live · ↑ 6 ↓ 0 · last 4h

$ git log --oneline --since=2d

9f2a3c1 feat(rag): hybrid retrieval + cohere rerank · 4h ago

8e1b2af refactor(eval): split harness into stages · 6h ago

4d2cdf9 feat(agent): add approval gate retry · 1d ago


// refreshed 7s ago · design preview · stub data

// trace 01 was this page load · every system below follows the same spine


//capabilities

Four pipelines.
All shipped, all in public repos.

Each row is one engineering capability. The diagram on the right is how the system runs in production. Every row links to its source repo and (when applicable) the live deploy. Read the source before the first call.

hybrid retrieval + rerank

Retrieval Contracts for RAG Systems

Your retrieval works in demos but fails on real user queries. Precision drops, irrelevant results surface, and answer quality degrades under messy, unstructured input.

Hybrid retrieval (pgvector + BM25) with reranking, instrumented end-to-end. Built for the gap between a demo that retrieves and a system that survives the next 10,000 queries.

LangGraph
pgvector
FastAPI

// audit: NexusRAG ↗ · live deploy ↗

// packet traces the live hybrid path · live
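One common way to merge a vector ranking with a BM25 ranking before reranking is reciprocal rank fusion. The sketch below assumes that technique; the page does not say which fusion method NexusRAG uses, and the document ids and the `k=60` default are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties break on doc id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical pgvector order
bm25_hits = ["doc_b", "doc_d", "doc_a"]    # hypothetical BM25 order
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Documents that both retrievers agree on rise to the top, and the fused list is what a reranker would then reorder.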

state graph + approval gate

LLM Agent Infrastructure

Your agent works on the demo path but breaks when users deviate. State gets lost between steps, tool calls fail silently, and there is no way to audit what happened.

Multi-step agent workflows on LangGraph with persistent state, tool orchestration, approval gates, retry logic, and deterministic evaluation. Designed for reliability under real interaction patterns.

LangGraph
Claude API
Celery

// audit: agent-runbook-orchestrator ↗ · live deploy ↗

// the loop is durable, the gate is human · live
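The approval gate, retry logic, and audit trail described above can be sketched in plain Python. This is not the LangGraph implementation in the repo; `RunState`, `call_with_retry`, `run_gated_step`, and the flaky tool are hypothetical names chosen for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RunState:
    """State carried between steps; a real system would persist this."""
    step: str
    audit_log: list = field(default_factory=list)


def call_with_retry(tool, args, attempts=3, base_delay=0.0):
    """Call tool(**args), retrying on exception with exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return tool(**args)
        except Exception as exc:  # a real system would catch narrower errors
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))
    # Fail loudly so a broken tool call can never pass silently.
    raise RuntimeError(f"tool failed after {attempts} attempts") from last_error


def run_gated_step(state, tool, args, approve):
    """Run a tool only if the approver says yes; record every decision."""
    if not approve(state.step, args):
        state.audit_log.append((state.step, "rejected"))
        return None
    result = call_with_retry(tool, args)
    state.audit_log.append((state.step, "executed"))
    return result


calls = {"n": 0}


def flaky_restart(service):
    """Hypothetical tool that fails once, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("transient")
    return f"restarted {service}"


state = RunState(step="restart_service")
result = run_gated_step(
    state, flaky_restart, {"service": "api"}, approve=lambda step, args: True
)
```

In a deployed workflow the `approve` callable would block on a human decision; here it is a lambda so the sketch runs on its own.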

request latency waterfall

FastAPI Backends for AI Products

Your AI feature needs a real backend, not a notebook wrapped in an endpoint. You need async APIs, structured data access, migration discipline, and deployment automation from day one.

Asynchronous FastAPI backends purpose-built for LLM-powered apps. pgvector, Alembic migrations, health checks, and operational readiness on day one.

FastAPI
PostgreSQL
Docker

// audit: langgraph-fastapi-starter ↗ · production template, no demo deploy by design

// p95 · 280ms · measured prod, last 24h · reference
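A readiness check of the kind mentioned above usually runs each dependency check concurrently with a timeout. The sketch below uses only `asyncio` so it runs without FastAPI installed; the check names and the report shape are assumptions, not the template's actual API.

```python
import asyncio


async def run_check(name, check, timeout_s=1.0):
    """Run one dependency check; a timeout or exception counts as unhealthy."""
    try:
        await asyncio.wait_for(check(), timeout=timeout_s)
        return name, "ok"
    except asyncio.TimeoutError:
        return name, "timeout"
    except Exception as exc:
        return name, f"error: {exc}"


async def readiness(checks, timeout_s=1.0):
    """Run all checks concurrently and summarise them in one report."""
    results = await asyncio.gather(
        *(run_check(name, check, timeout_s) for name, check in checks.items())
    )
    statuses = dict(results)
    return {
        "ready": all(status == "ok" for status in statuses.values()),
        "checks": statuses,
    }


async def check_database():
    await asyncio.sleep(0)  # stand-in for a cheap query such as SELECT 1


async def check_slow_dependency():
    await asyncio.sleep(5)  # stand-in for a hung dependency


report = asyncio.run(
    readiness(
        {"database": check_database, "reranker": check_slow_dependency},
        timeout_s=0.05,
    )
)
print(report)
```

In a FastAPI service the `readiness` coroutine would sit behind a health route and map `ready` to the HTTP status code.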

regression detection · 12 runs

AI Reliability Engineering

You are scaling prompts before you have evaluation discipline. Quality is checked by eyeballing outputs. Cost and latency are unmeasured. When something breaks in production, there is no way to tell what changed.

Deterministic evaluation frameworks, structured output validation, cost and latency observability, regression detection. The layer most teams skip between prototype and production.

Python
pytest
Prometheus

// audit: evalops-workbench ↗ · live deploy ↗

// deploy @ run 8 · regression caught at run 9 · live
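Structured output validation means rejecting a model response that does not match the expected shape instead of passing it downstream. A minimal stdlib sketch follows; the `Answer` fields (`text`, `citations`, `confidence`) are a hypothetical schema, not one taken from evalops-workbench.

```python
import json
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    citations: list
    confidence: float


def parse_answer(raw):
    """Parse model output into an Answer, or raise ValueError with the reason."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")

    text = data.get("text")
    citations = data.get("citations")
    confidence = data.get("confidence")

    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")
    if not isinstance(citations, list) or not all(
        isinstance(c, str) for c in citations
    ):
        raise ValueError("citations must be a list of strings")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return Answer(text=text, citations=citations, confidence=float(confidence))


good = parse_answer('{"text": "42", "citations": ["doc_a"], "confidence": 0.9}')
```

Every rejection carries a reason, so a failed parse can be logged and counted rather than eyeballed.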

//portfolio · 1 flagship + 5 systems

Six systems.
One carries the weight.

Every frame below is the real system, captured from its live deploy. One runs in production, three run public benchmarks, two are honest showcases. All six are open source and running right now. Click any frame and use the real thing.

flagship · Production

NexusRAG

Multi-tenant RAG platform

Multi-tenant, multi-cloud RAG platform with SSO, SCIM, RBAC/ABAC, envelope encryption, multi-region failover, and SOC 2 automation. The shipped reference every other repo points to.

Read the outcome · Live deploy · GitHub

NexusRAG dashboard, captured from the live deploy

// supporting fleet · every frame is the system, live

Data Watchtower dashboard, captured from the live deploy

Data Watchtower · Benchmark · Read the outcome

EvalOps dashboard, captured from the live deploy

EvalOps · Benchmark · Read the outcome

Revenue Signal dashboard, captured from the live deploy

Revenue Signal · Benchmark · Read the outcome

Agent Runbook dashboard, captured from the live deploy

Agent Runbook · Showcase · Source

Repo RAG Debugger dashboard, captured from the live deploy

Repo RAG Debugger · Showcase · Source

// captured from the live deploys · June 2026 · see the full fleet, by stage

//clients

Who we build for,
and what the shape looks like.

Three engagement profiles. Each starts with a measurable problem, ends with deployed code in your repo, and gets handed off with the eval harness, observability, and the docs we used to ship it.

FOR ·01

Startups moving past prototype

You demoed the AI feature. Now the team is asking for retrieval that doesn’t fall over, an eval harness with regressions caught, and an API your iOS engineer can actually call.

// engagement: 4-week build · live API · full handoff

FOR ·02

Agencies delivering AI products

Your team builds the front end and the brand. We ship the AI backend underneath: RAG, agents, eval. Then we step away before launch, leaving the source in your client’s repo.

// engagement: embed · ship · transfer ownership

FOR ·03

Teams evaluating capability

You’re considering an AI investment and want a working reference architecture, an eval against your real data, and a number on what production reliability actually costs.

// engagement: 2-week pilot · benchmark · go/no-go

//method

Five phases,
one shipping commit.

Every engagement runs through the same sequence. Discovery sets the spec and the evaluation bars; the next four phases build and ship against them. Adjustable by week, not by order.

W1 · Day 1 | W2 · Day 6 | W3 · Day 11 | W4 · Day 16 | Ongoing · Day 21+

// 4-week standard · adjustable by week

Phase 01 / 05 · the spec

Discovery

Days 1–5 · 5 work-days

We sit with the brief, write the system spec, define the evaluation bars, and surface every risk before any code ships. The spec we write here is the contract every phase below is measured against.

// deliverables: 4 files · spec-only

spec/v1.md: System spec + architecture sketch
eval/bars.yaml: Evaluation criteria + scoring bars
risk/register.md: Risk register + open questions
ops/deploy.tf: Repo + deploy targets confirmed
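To show how evaluation bars can act as a contract, here is a sketch that checks measured metrics against thresholds. The metric names and numbers are hypothetical and do not come from a real `eval/bars.yaml`; the bars are written as a Python dict so the example needs no YAML parser.

```python
# Hypothetical bars, standing in for what eval/bars.yaml might define.
BARS = {
    "retrieval_precision_at_5": {"min": 0.80},
    "answer_faithfulness": {"min": 0.90},
    "p95_latency_ms": {"max": 1500},
}


def check_bars(measured, bars):
    """Return a list of readable failures; an empty list means every bar passes."""
    failures = []
    for metric, bounds in bars.items():
        if metric not in measured:
            # An unmeasured metric fails the contract instead of passing by default.
            failures.append(f"{metric}: not measured")
            continue
        value = measured[metric]
        if "min" in bounds and value < bounds["min"]:
            failures.append(f"{metric}: {value} is below min {bounds['min']}")
        if "max" in bounds and value > bounds["max"]:
            failures.append(f"{metric}: {value} is above max {bounds['max']}")
    return failures


measured = {"retrieval_precision_at_5": 0.84, "p95_latency_ms": 1720}
failures = check_bars(measured, BARS)
for line in failures:
    print(line)
```

A phase would count as done only when `check_bars` returns an empty list.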

//telemetry

What the systems
are actually doing right now.

Every system ships a public stats endpoint. The console below probes all six from your browser every 30 seconds and prints what comes back, including the measured round-trip of each request. When a system is quiet, it says so. Nothing here is simulated.
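The probing loop described above can be sketched as follows. The page's console runs in the browser; this Python version only illustrates the idea of polling a stats endpoint and timing each request. The function names and the injected `fetch` and `sleep` parameters are assumptions made so the sketch can run without network access.

```python
import json
import time
import urllib.request


def http_get_json(url, timeout_s=5.0):
    """Fetch a URL and decode the body as JSON (stdlib only)."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


def probe(url, fetch=http_get_json):
    """Probe one endpoint and record the round-trip time, even on failure."""
    started = time.perf_counter()
    try:
        stats, error = fetch(url), None
    except Exception as exc:
        stats, error = None, str(exc)
    rtt_ms = (time.perf_counter() - started) * 1000.0
    return {
        "url": url,
        "ok": error is None,
        "rtt_ms": rtt_ms,
        "stats": stats,
        "error": error,
    }


def poll(urls, rounds, interval_s=30.0, fetch=http_get_json, sleep=time.sleep):
    """Probe every URL once per round, waiting interval_s between rounds."""
    log = []
    for round_index in range(rounds):
        log.extend(probe(url, fetch) for url in urls)
        if round_index < rounds - 1:
            sleep(interval_s)
    return log
```

A failed probe is logged with its error rather than dropped, which matches the idea that a quiet or unreachable system should say so.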

fleet-ops console · 6 public endpoints · probes run from this browser

NexusRAG

Multi-cloud RAG · production workload

source ↗ · live ↗ · the work →

Queries · total: …
Queries · 24h: …
Uptime · 30d: …
Indexed chunks: …

GET /api/stats → …

//engineering charter

The model isn’t the system.
The system is the work.

The difference is not the model. It’s the system around it. Below are the four laws we won’t ship without. Each is paired with the anti-pattern that gives the principle its enforcement edge.

Law 01 · 01 / 04

Deterministic evaluation

Replacing subjective “looks good” testing with measurable scoring and regression checks. Every prompt change ships with a delta against a frozen baseline.

// won’t accept: “it worked when I tried it”
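A delta against a frozen baseline can be as small as the sketch below. Exact-match scoring is an assumption chosen for simplicity; the case ids, answers, and function names are hypothetical.

```python
def score_exact_match(outputs, expected):
    """Fraction of cases where the output matches the expected answer exactly."""
    hits = sum(
        1 for case_id, want in expected.items() if outputs.get(case_id) == want
    )
    return hits / len(expected)


def regression_delta(candidate_score, baseline_score, tolerance=0.0):
    """Return (delta, passed); passed is False when the score drops past tolerance."""
    delta = candidate_score - baseline_score
    return delta, delta >= -tolerance


expected = {"q1": "paris", "q2": "4", "q3": "blue", "q4": "yes"}
baseline_outputs = {"q1": "paris", "q2": "4", "q3": "blue", "q4": "no"}
candidate_outputs = {"q1": "paris", "q2": "5", "q3": "blue", "q4": "no"}

baseline = score_exact_match(baseline_outputs, expected)    # frozen once, then stored
candidate = score_exact_match(candidate_outputs, expected)  # recomputed per prompt change
delta, passed = regression_delta(candidate, baseline)
```

Because the scoring is deterministic, the same outputs always yield the same delta, so a failing check points at the change rather than at noise.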

Law 02 · 02 / 04

Cost control

Preventing uncontrolled token usage through architecture, not afterthought optimization. Cost ceilings, caching, and routing live at the request layer.

// won’t accept: “we’ll optimize when it gets expensive”
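A cost ceiling and a cache at the request layer might look like the sketch below. `CostGuard`, `BudgetExceeded`, the price, and the token estimates are all hypothetical; routing between models is left out to keep the example short.

```python
class BudgetExceeded(Exception):
    """Raised when a request would push spend past the ceiling."""


class CostGuard:
    def __init__(self, ceiling_usd, price_per_1k_tokens):
        self.ceiling_usd = ceiling_usd
        self.price_per_1k_tokens = price_per_1k_tokens
        self.spent_usd = 0.0
        self.cache = {}

    def call(self, prompt, estimated_tokens, model_call):
        # Cache hits cost nothing and never reach the model.
        if prompt in self.cache:
            return self.cache[prompt]
        cost = estimated_tokens / 1000 * self.price_per_1k_tokens
        # Refuse before spending, not after.
        if self.spent_usd + cost > self.ceiling_usd:
            raise BudgetExceeded(
                f"request would bring spend to {self.spent_usd + cost:.4f} USD, "
                f"ceiling is {self.ceiling_usd:.4f} USD"
            )
        result = model_call(prompt)
        self.spent_usd += cost
        self.cache[prompt] = result
        return result


guard = CostGuard(ceiling_usd=0.01, price_per_1k_tokens=0.002)
fake_model = lambda prompt: prompt.upper()  # stand-in for a real model client

first = guard.call("hello", estimated_tokens=2000, model_call=fake_model)
again = guard.call("hello", estimated_tokens=2000, model_call=fake_model)  # cached
second = guard.call("world", estimated_tokens=2000, model_call=fake_model)
```

The guard rejects a request before the model is called, so the ceiling holds even when a single request would be expensive.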

Law 03 · 03 / 04

Latency engineering

Designing systems that respond in real time, not seconds too late. p95 is a feature; warm paths and streaming are architectural decisions, not optimizations.

// won’t accept: “the model takes 8s, deal with it later”
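Treating p95 as a feature starts with computing it the same way every time. The sketch below uses the nearest-rank definition, which is one of several percentile conventions; the sample latencies and the budget are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct / 100 * n), 1-indexed."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number percentiles stay exact.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def within_budget(latencies_ms, budget_ms, pct=95):
    """True when the chosen percentile of observed latency meets the budget."""
    return percentile(latencies_ms, pct) <= budget_ms


latencies_ms = [120, 135, 150, 160, 180, 190, 210, 230, 260, 900]
p95 = percentile(latencies_ms, 95)
print(p95, within_budget(latencies_ms, budget_ms=500))
```

The mean of these samples looks healthy while the p95 does not, which is why the budget is set on the tail.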

Law 04 · 04 / 04

Operational reliability

Building systems that continue working under load, failure, and scale. Health checks, retries, idempotency, and circuit breakers ship on day one.

// won’t accept: “works in dev, ship it”
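Of the four mechanisms named above, the circuit breaker is the least familiar, so here is a minimal sketch. The class, thresholds, and injected clock are assumptions for illustration; a production breaker would also need thread safety and metrics.

```python
import time


class CircuitOpen(Exception):
    """Raised when the breaker is open and calls are being rejected."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpen("circuit is open; failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Re-open on a failed trial, or open once the threshold is reached.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpen` instead of waiting on a dependency that is already known to be failing.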

//position

AI engineering is contract engineering.

01Build the control plane.
02Defend the workflow.
03Run layered evaluation.

AI infrastructure, not AI theater.

// from the engineering charter

//contact

Tell us the system.
We’ll tell you the bar.

Tell us about the system you need shipped. We respond within 24 hours with a specific technical plan, a measurable bar, and an honest estimate of weeks and cost.

First call inside one business day.

Brief us in two paragraphs. If we’re not the right fit, we’ll tell you fast and route you to a team that is.

// prefer email? hello@eleventh.dev

Name · Work email

// about you: Company · Role · Company or product URL

// your system: Project type · Current stage · Current stack · Traffic / volume · Data sensitivity

// your engagement: Timeline · Budget band · Can this become a public reference? (Yes / Possibly / No)

// what you need: Describe what you need

// 24h response · specific technical plan · honest estimate

Your AI demo works. The system behind it doesn’t.
