// first contact
01 INGEST · source
02 EMBED · vector
03 RETRIEVE · pgvector
04 RERANK · cohere
05 GENERATE · claude
06 EVAL · deterministic
07 SHIP · prod
// RAG · agents · backends · eval

Eleventh turns fragile retrieval, agent, and AI-backend prototypes into systems that survive real users, real data, and real failures.

git · main · live · ↑ 6 ↓ 0 · last 4h
$ git log --oneline --since=2d
9f2a3c1 feat(rag): hybrid retrieval + cohere rerank · 4h ago
8e1b2af refactor(eval): split harness into stages · 6h ago
4d2cdf9 feat(agent): add approval gate retry · 1d ago
// trace 01 was this page load · every system below follows the same spine
//capabilities

Four pipelines.
All shipped, all in public repos.

Each row is one engineering capability. The diagram on the right is how the system runs in production. Every row links to its source repo and (when applicable) the live deploy. Read the source before the first call.

hybrid retrieval + rerank

Retrieval Contracts for RAG Systems

Your retrieval works in demos but fails on real user queries. Precision drops, irrelevant results surface, and answer quality degrades under messy, unstructured input.

Hybrid retrieval (pgvector + BM25) with reranking, instrumented end-to-end. Built for the gap between a demo that retrieves and a system that survives the next 10,000 queries.

  • LangGraph
  • pgvector
  • FastAPI
// packet traces the live hybrid path · live
QUERY → EMBED → DENSE (pgvector) + BM25 (keyword) → RERANK (cohere) → TOP-K
P@5 0.94 (baseline 0.62 ↑) · queries, 24h: 14,217 (↑ 12%) · cache hit: 0.83 (↑ 0.04) · p99 retrieval: 218ms
vector space · k=5 · relevance dist, top-1k · mean 0.81
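
One common way to merge a dense (pgvector) ranking with a BM25 ranking before the reranker is reciprocal rank fusion. The sketch below is an illustration under that assumption, not necessarily the fusion these systems use; the `rrf_fuse` name, the constant `k=60`, and the document IDs are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=5):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties break on document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_k]]


# Hypothetical candidate lists: one from vector search, one from BM25.
dense_hits = ["d3", "d1", "d7", "d2"]
bm25_hits = ["d1", "d9", "d3", "d4"]

fused = rrf_fuse([dense_hits, bm25_hits], top_k=3)
print(fused)  # ['d1', 'd3', 'd9']
```

Documents found by both retrievers (`d1`, `d3`) rise above documents found by only one, which is the property a hybrid path relies on before the reranker sees the candidates.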

state graph + approval gate

LLM Agent Infrastructure

Your agent works on the demo path but breaks when users deviate. State gets lost between steps, tool calls fail silently, and there is no way to audit what happened.

Multi-step agent workflows on LangGraph with persistent state, tool orchestration, approval gates, retry logic, and deterministic evaluation. Designed for reliability under real interaction patterns.

  • LangGraph
  • Claude API
  • Celery
// the loop is durable, the gate is human · live
PLAN (decompose) → ACT (tool call) → OBSERVE (tool response) → EVAL (score + log) → APPROVE (human gate) · retry · replan
iteration 07 / 12 max · retries, 5m: 2 · 2 approvals · running · context tokens: 3.2k / 8k (40%, headroom 4.8k)
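
The loop in the diagram can be sketched in plain Python, without LangGraph, to show where the iteration budget, the retry counter, and the approval gate sit. Everything here is a hypothetical illustration: `run_agent`, `RunState`, and the stub tool are not taken from the actual repos.

```python
from dataclasses import dataclass, field


@dataclass
class RunState:
    """State carried across iterations so a run can be audited afterwards."""
    iteration: int = 0
    retries: int = 0
    approvals: int = 0
    log: list = field(default_factory=list)


def run_agent(steps, call_tool, approve, max_iterations=12, max_retries=2):
    """Run each planned step through act -> observe -> approve.

    `call_tool(step)` may raise; failures are retried up to `max_retries`
    times per step. `approve(step, result)` is the human gate: a rejected
    result is recorded and the run stops.
    """
    state = RunState()
    for step in steps:
        attempts = 0
        while True:
            if state.iteration >= max_iterations:
                state.log.append(("halt", step, "iteration budget exhausted"))
                return state
            state.iteration += 1
            try:
                result = call_tool(step)
            except Exception as exc:  # a failed tool call is logged, never silent
                state.log.append(("error", step, str(exc)))
                if attempts >= max_retries:
                    return state
                attempts += 1
                state.retries += 1
                continue
            break
        if not approve(step, result):
            state.log.append(("rejected", step, result))
            return state
        state.approvals += 1
        state.log.append(("ok", step, result))
    return state


# Hypothetical tool that fails once on the "fetch" step, then succeeds.
failures = {"fetch": 1}


def flaky_tool(step):
    if failures.get(step, 0) > 0:
        failures[step] -= 1
        raise RuntimeError("timeout")
    return f"{step}: done"


final = run_agent(["fetch", "summarize"], flaky_tool, approve=lambda s, r: True)
print(final.iteration, final.retries, final.approvals)  # 3 1 2
```

Because every outcome lands in `state.log`, the run can be replayed or audited after the fact, which is the failure mode the section describes.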

request latency waterfall

FastAPI Backends for AI Products

Your AI feature needs a real backend, not a notebook wrapped in an endpoint. You need async APIs, structured data access, migration discipline, and deployment automation from day one.

Asynchronous FastAPI backends purpose-built for LLM-powered apps. pgvector, Alembic migrations, health checks, and operational readiness on day one.

  • FastAPI
  • PostgreSQL
  • Docker
// audit: langgraph-fastapi-starter · production template, no demo deploy by design
// p95 · 280ms · measured prod, last 24h · reference
HTTP 12ms · VALIDATE 8ms · AUTH 18ms · QUERY 42ms · GENERATE 185ms · RESPOND 15ms · 200 OK
p95 total 280ms · p50 88ms · p99 412ms · rps now: 1,247 (↑ 8%) · concurrent: 24 / 100 max
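
A latency waterfall like this one comes from timing each request stage separately. The sketch below shows one way to collect those timings with a context manager; the `Waterfall` class and `fake_clock` helper are hypothetical, and the durations are replayed from the stage figures in the waterfall so the example is deterministic.

```python
import time
from contextlib import contextmanager


class Waterfall:
    """Collects per-stage wall-clock durations for a single request."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.stages = []  # (name, duration in ms), in call order

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - start) * 1000.0
            self.stages.append((name, elapsed_ms))

    def total_ms(self):
        return sum(duration for _, duration in self.stages)


def fake_clock(durations_ms):
    """Return a clock whose successive readings step through the durations."""
    readings, now = [], 0.0
    for duration in durations_ms:
        readings.append(now)          # reading taken when a stage starts
        now += duration / 1000.0
        readings.append(now)          # reading taken when it ends
    iterator = iter(readings)
    return lambda: next(iterator)


measured = [("HTTP", 12), ("VALIDATE", 8), ("AUTH", 18),
            ("QUERY", 42), ("GENERATE", 185), ("RESPOND", 15)]
wf = Waterfall(clock=fake_clock([ms for _, ms in measured]))
for name, _ in measured:
    with wf.stage(name):
        pass  # real handler work would run here

slowest = max(wf.stages, key=lambda s: s[1])[0]
print(round(wf.total_ms()), slowest)  # 280 GENERATE
```

With the default clock the same class times real handlers; the breakdown makes it obvious that generation, not the database query, dominates the request.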

regression detection · 12 runs

AI Reliability Engineering

You are scaling prompts before you have evaluation discipline. Quality is checked by eyeballing outputs. Cost and latency are unmeasured. When something breaks in production, there is no way to tell what changed.

Deterministic evaluation frameworks, structured output validation, cost and latency observability, regression detection. The layer most teams skip between prototype and production.

  • Python
  • pytest
  • Prometheus
// deploy @ run 8, regression caught at run 9 · live
score axis 0.6 to 1.0 · run 1 to run 12 · ± 1 sd
DEPLOY · 9F2A3C1 · REGRESSION DETECTED · run 9 · −0.19 delta
sample evals: 2,432 per run · mean score, 7d: 0.91 (↓ 0.04 wk) · cost per eval: $0.038 (↓ 12%) · cohere + claude haiku
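
Regression detection of this kind reduces to comparing each run against a frozen baseline. The sketch below is a minimal version under assumed parameters: the `detect_regressions` name, the three-run baseline, the 0.10 threshold, and the scores themselves are hypothetical, chosen so the drop lands at run 9 as in the chart.

```python
from statistics import mean


def detect_regressions(scores, baseline_runs=3, max_drop=0.10):
    """Return (run_number, delta) for every run that falls more than
    `max_drop` below the frozen baseline.

    The baseline is the mean of the first `baseline_runs` scores and is not
    updated afterwards, so slow drift cannot hide inside a moving average.
    """
    baseline = mean(scores[:baseline_runs])
    flagged = []
    for run_number, score in enumerate(scores, start=1):
        if run_number <= baseline_runs:
            continue
        delta = score - baseline
        if delta < -max_drop:
            flagged.append((run_number, round(delta, 2)))
    return flagged


# Hypothetical scores for 12 runs; the drop at run 9 mimics a bad deploy.
run_scores = [0.90, 0.91, 0.92, 0.91, 0.90, 0.92,
              0.91, 0.91, 0.72, 0.88, 0.90, 0.91]
print(detect_regressions(run_scores))  # [(9, -0.19)]
```

In CI, a non-empty result would fail the build and point at the commit deployed just before the flagged run.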
//portfolio · 1 flagship + 5 systems

Six systems.
One carries the weight.

Every frame below is the real system, captured from its live deploy. One runs in production, three run public benchmarks, two are honest showcases. All six are open source and running right now. Click any frame and use the real thing.

// supporting fleet · every frame is the system, live

//clients

Who we build for,
and what the shape looks like.

Three engagement profiles. Each starts with a measurable problem, ends with deployed code in your repo, and gets handed off with the eval harness, observability, and the docs we used to ship it.

FOR ·01

Startups moving past prototype

You demoed the AI feature. Now the team is asking for retrieval that doesn’t fall over, an eval harness with regressions caught, and an API your iOS engineer can actually call.

FOR ·02

Agencies delivering AI products

Your team builds the front end and the brand. We ship the AI backend underneath: RAG, agents, eval. Then we disappear before launch with the source in your client’s repo.

FOR ·03

Teams evaluating capability

You’re considering an AI investment and want a working reference architecture, an eval against your real data, and a number on what production reliability actually costs.

//method

Five phases,
one shipping commit.

Every engagement runs through the same sequence. Discovery sets the spec and the evaluation bars; the next four phases build and ship against them. Adjustable by week, not by order.

W1 · Day 1 | W2 · Day 6 | W3 · Day 11 | W4 · Day 16 | Ongoing · Day 21+
// 4-week standard · adjustable by week
Phase 01 / 05 · the spec

Discovery

Days 1–5 · 5 work-days

We sit with the brief, write the system spec, define the evaluation bars, and surface every risk before any code ships. The spec we write here is the contract every phase below is measured against.

// deliverables · 4 files · spec-only

  • spec/v1.md: System spec + architecture sketch
  • eval/bars.yaml: Evaluation criteria + scoring bars
  • risk/register.md: Risk register + open questions
  • ops/deploy.tf: Repo + deploy targets confirmed
//telemetry

What the systems
are actually doing right now.

Every system ships a public stats endpoint. The console below probes all six from your browser every 30 seconds and prints what comes back, including the measured round-trip of each request. When a system is quiet, it says so. Nothing here is simulated.

fleet-ops console · 6 public endpoints · probes run from this browser

NexusRAG
Multi-cloud RAG · production workload

// reported per system via GET /api/stats: queries (total) · queries (24h) · uptime (30d) · indexed chunks · probe RTT (this session) · last activity
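
The console runs in the browser, but the probe itself is simple enough to sketch in Python: request `/api/stats`, time the round trip, and parse the body. The `probe` function, the injectable `fetch` parameter, the stub URL, and the `queries_24h` field name are all hypothetical; only the `/api/stats` path comes from the page.

```python
import json
import time
import urllib.request


def probe(base_url, fetch=None, clock=time.perf_counter):
    """Request `<base_url>/api/stats` once and return (stats, rtt_ms).

    `fetch` takes a URL and returns the response body as bytes; it is
    injectable so the probe can be exercised without a network.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=5) as response:
                return response.read()
    start = clock()
    body = fetch(base_url.rstrip("/") + "/api/stats")
    rtt_ms = (clock() - start) * 1000.0
    return json.loads(body), rtt_ms


# Offline demonstration with a stubbed response; the field name is made up.
seen_urls = []


def stub_fetch(url):
    seen_urls.append(url)
    return b'{"queries_24h": 0}'


stats, rtt_ms = probe("https://example.invalid/", fetch=stub_fetch)
print(stats, seen_urls)
```

A scheduler calling `probe` every 30 seconds for each endpoint would reproduce the polling behaviour described above.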

//engineering charter

The model isn’t the system.
The system is the work.

The difference is not the model. It’s the system around it. Below are the four laws we won’t ship without. Each is paired with the anti-pattern that gives the principle its enforcement edge.

Law 01 · 01 / 04

Deterministic evaluation

Replacing subjective “looks good” testing with measurable scoring and regression checks. Every prompt change ships with a delta against a frozen baseline.

// won’t accept: “it worked when I tried it”

Law 02 · 02 / 04

Cost control

Preventing uncontrolled token usage through architecture, not afterthought optimization. Cost ceilings, caching, and routing live at the request layer.

// won’t accept: “we’ll optimize when it gets expensive”
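
A cost ceiling and cache at the request layer can be sketched as a thin wrapper around the model call. This is an illustration only: `CostGuard`, `BudgetExceeded`, the flat per-token rate, and the stub model are hypothetical, and a real system would price input and output tokens separately.

```python
class BudgetExceeded(Exception):
    """Raised when a request would push spend past the configured ceiling."""


class CostGuard:
    """Wraps a model call with a response cache and a hard spend ceiling."""

    def __init__(self, call_model, ceiling_usd, usd_per_1k_tokens):
        self._call_model = call_model
        self._ceiling = ceiling_usd
        self._rate = usd_per_1k_tokens
        self._cache = {}
        self.spent_usd = 0.0

    def complete(self, prompt, max_tokens):
        if prompt in self._cache:  # cache hits cost nothing
            return self._cache[prompt]
        # Check the worst-case cost up front so the ceiling is never crossed.
        worst_case = max_tokens / 1000.0 * self._rate
        if self.spent_usd + worst_case > self._ceiling:
            raise BudgetExceeded(f"ceiling of {self._ceiling} USD would be exceeded")
        text, tokens_used = self._call_model(prompt, max_tokens)
        self.spent_usd += tokens_used / 1000.0 * self._rate
        self._cache[prompt] = text
        return text


# Hypothetical model stub: uppercases the prompt and reports 500 tokens used.
def fake_model(prompt, max_tokens):
    return prompt.upper(), 500


guard = CostGuard(fake_model, ceiling_usd=0.01, usd_per_1k_tokens=0.01)
guard.complete("hello", max_tokens=500)  # charged 0.005 USD
guard.complete("hello", max_tokens=500)  # cache hit, no charge
try:
    guard.complete("another prompt", max_tokens=1000)  # worst case exceeds budget
    blocked = False
except BudgetExceeded:
    blocked = True
print(round(guard.spent_usd, 4), blocked)  # 0.005 True
```

Refusing the request before the call is made is what makes the ceiling architectural: spend cannot overshoot even if the model returns the maximum number of tokens.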

Law 03 · 03 / 04

Latency engineering

Designing systems that respond in real time, not seconds too late. p95 is a feature; warm paths and streaming are architectural decisions, not optimizations.

// won’t accept: “the model takes 8s, deal with it later”
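
Treating p95 as a feature means computing it from real samples and checking it against a budget. The sketch below uses the nearest-rank definition of a percentile; the function names, the 300 ms budget, and the latency samples are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def within_budget(samples_ms, p95_budget_ms):
    """True when the p95 of the observed latencies meets the budget."""
    return percentile(samples_ms, 95) <= p95_budget_ms


# Hypothetical latencies in ms: mostly fast, with a slow tail.
latencies = [80, 85, 88, 90, 92, 95, 100, 110, 120, 130,
             140, 150, 160, 180, 200, 220, 240, 260, 280, 410]
print(percentile(latencies, 50), percentile(latencies, 95),
      within_budget(latencies, 300))  # 130 280 True
```

The single 410 ms outlier does not move p95 here, but it would dominate a mean, which is one reason to budget on percentiles rather than averages.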

Law 04 · 04 / 04

Operational reliability

Building systems that continue working under load, failure, and scale. Health checks, retries, idempotency, and circuit breakers ship on day one.

// won’t accept: “works in dev, ship it”
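
Of the mechanisms named in this law, the circuit breaker is the least obvious, so here is a minimal sketch. The `CircuitBreaker` class, its thresholds, and the failing dependency are hypothetical; a production breaker would also need locking for concurrent callers.

```python
import time


class CircuitOpen(Exception):
    """Raised when the breaker is open and the call is refused immediately."""


class CircuitBreaker:
    """Opens after `threshold` consecutive failures and refuses calls until
    `cooldown_s` seconds have passed, then lets one trial call through."""

    def __init__(self, threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self._threshold = threshold
        self._cooldown = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._cooldown:
                raise CircuitOpen("dependency marked unhealthy")
            # Cooldown elapsed: half-open, so allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


# Hypothetical dependency that always fails, driven by a fake clock.
now = [0.0]
breaker = CircuitBreaker(threshold=2, cooldown_s=10.0, clock=lambda: now[0])


def failing():
    raise ConnectionError("upstream down")


outcomes = []
for _ in range(3):
    try:
        breaker.call(failing)
    except CircuitOpen:
        outcomes.append("refused")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 11.0  # cooldown has elapsed, so the next call is a trial
recovered = breaker.call(lambda: "ok")
print(outcomes, recovered)  # ['failed', 'failed', 'refused'] ok
```

The third call is refused without touching the dependency, which protects both the caller's latency and the struggling upstream.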

//position

AI engineering is contract engineering.

  1. Build the control plane.
  2. Defend the workflow.
  3. Run layered evaluation.

AI infrastructure, not AI theater.

// from the engineering charter

//contact

Tell us the system.
We’ll tell you the bar.

Tell us about the system you need shipped. We respond in 24 hours with a specific technical plan, a measurable bar, and an honest estimate of weeks and cost.

First call inside one business day.

Brief us in two paragraphs. If we’re not the right fit, we’ll tell you fast and route you to a team that is.

// prefer email? hello@eleventh.dev


// 24h response · specific technical plan · honest estimate