
Emily's Tech Stack vs Other Harnesses

#emily-os#tech-stack#comparison#harness#architecture

If you diff Emily's requirements.txt against a typical LLM harness (LangChain app, LlamaIndex RAG stack, a custom agent framework), you'll find a lot of overlap and a few telling differences. The overlap is the generic substrate any Python LLM app needs. The differences are the shape of Emily's bet.

The stack, briefly

  • Web: FastAPI (0.115), uvicorn with 4 workers
  • Async: asyncio, asyncpg for Postgres, aiohttp
  • Storage: PostgreSQL with pgvector, per-user database
  • Queue: Celery 5.6 with gevent pool, Redis broker, beat scheduler at 10s for Helios
  • LLMs: Anthropic (Claude 4.6+), Google (Gemini), Grok (xAI), OpenAI; wired directly, not via an abstraction
  • Embeddings: OpenAI text-embedding-3-large (1536-dim)
  • Frontend: Next.js / React with SSE for streaming
  • Auth: Hardcoded single-user on v3-dev; JWT on main
  • Observability: structured logs, cognitive_tracer.py, six health monitors

Nothing exotic. The interesting claims are about composition, not ingredients.

Vs LangChain-style harnesses

LangChain (and most of its descendants) treat the LLM as the center of the system and everything else as a chain of transformations feeding it. The harness owns: prompt templates, memory buffers, retrieval chains, tool executors, agent loops.

Emily inverts this. The LLM is a downstream step, not the center. The center is the cognition layer: the tiered memory, the frameworks, the per-user state. The LLM is invoked late in the turn, with a context package assembled by Emily's cognition, and its output is scored by EARL after the fact.

Concretely, this shows up in:

| Concern | LangChain-style | Emily |
| --- | --- | --- |
| Memory | ConversationBufferMemory or a vector store | L1/L3/L4 with promotion rules and consolidation |
| Retrieval | Top-k cosine on a shared index | ECGL-weighted retrieval from per-user vector store |
| Learning | Out of scope; retrain the model | EARL outcome weights + EARL v2 self-correction |
| Identity | Prompt engineering | Memory graph with stability scores |
| Self-correction | None | Golden Baseline monitor + autonomous EARL v2 |
| Multi-tenancy | Row-level security or tenant_id in prompts | One database per user |

Neither is wrong. They're solving different problems. LangChain solves "help me build a workflow that uses an LLM." Emily solves "help me build a persistent cognition that speaks with an LLM's voice."
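The inversion can be sketched as a turn pipeline: cognition first, the model late, scoring last. This is an illustrative sketch, not Emily's real API; the names (CognitionLayer, ContextPackage, score_outcome) are hypothetical, and the stubs stand in for real retrieval, a provider SDK call, and EARL scoring.

```python
from dataclasses import dataclass

@dataclass
class ContextPackage:
    memories: list[str]
    frameworks: list[str]
    user_message: str

class CognitionLayer:
    """Owns per-user state; assembles context before any model call."""
    def __init__(self, memory_store: dict[str, list[str]]):
        self.memory_store = memory_store

    def assemble(self, user_id: str, message: str) -> ContextPackage:
        # Retrieval would be ECGL-weighted in the real system; naive here.
        memories = self.memory_store.get(user_id, [])[-3:]
        return ContextPackage(memories, ["identity"], message)

def call_llm(package: ContextPackage) -> str:
    # Stand-in for a direct provider SDK call (Anthropic, OpenAI, ...).
    return f"reply using {len(package.memories)} memories"

def score_outcome(reply: str) -> float:
    # Stand-in for EARL's after-the-fact scoring.
    return 1.0 if reply else 0.0

def run_turn(layer: CognitionLayer, user_id: str, message: str) -> tuple[str, float]:
    package = layer.assemble(user_id, message)   # cognition assembles first
    reply = call_llm(package)                    # the LLM is invoked late
    return reply, score_outcome(reply)           # and scored after the fact
```

The shape is the point: the chain-style framework would own `run_turn` and call out to memory; here the cognition layer owns the turn and calls out to the model.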

Vs LlamaIndex-style RAG stacks

LlamaIndex and similar RAG frameworks are excellent at turning documents into retrievable chunks and surfacing relevant context to an LLM. If you're building a "chat with your docs" product, they're probably what you want.

Emily's memory is different because it's not documents. It's turns. Every user message and every Emily response is a potential memory. That changes everything downstream:

  • Chunking is not a thing. Turns are already the right granularity.
  • Embeddings are applied per-turn, not per-chunk.
  • Metadata is not "which file did this come from." It's the full EMEB/EARL/ECGL score vector.
  • Retrieval is not "find the document that answers this question." It's "find the memories that the user and I have together that bear on this question."

A RAG stack treats the knowledge base as static and the query as novel. Emily treats the relationship as dynamic and every turn as both query and new knowledge.
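That dual role, every turn as both query and new knowledge, can be sketched as follows. The embed() stub stands in for text-embedding-3-large, and the store's field names are hypothetical; the real metadata would carry the full EMEB/EARL/ECGL score vector rather than just role and text.

```python
import math

def embed(text: str) -> list[float]:
    # Toy deterministic embedding; the real system calls an embedding API.
    return [float(ord(c) % 7) for c in text[:8].ljust(8)]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class TurnMemory:
    """Per-user store whose unit is a whole turn, never a chunk."""
    def __init__(self):
        self.turns: list[dict] = []

    def recall(self, query: str, k: int = 2) -> list[dict]:
        q = embed(query)
        return sorted(self.turns, key=lambda t: cosine(q, t["vec"]), reverse=True)[:k]

    def record(self, role: str, text: str) -> None:
        self.turns.append({"role": role, "text": text, "vec": embed(text)})

def handle_turn(store: TurnMemory, user_message: str) -> list[dict]:
    relevant = store.recall(user_message)  # the turn acting as a query
    store.record("user", user_message)     # the same turn becoming knowledge
    return relevant
```

Note there is no chunker anywhere: record() takes the turn as-is, which is why "chunking is not a thing" in this design.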

Vs AutoGPT / BabyAGI / agent frameworks

The autonomous-agent frameworks solve: "let an LLM drive a loop until it finishes a task." They use planning prompts, tool executors, and re-prompting loops.

Emily's autonomy (Project Helios) looks superficially similar but differs in one critical way: the loop is not driven by the LLM. It's driven by autonomous_worker.py polling a task registry every 10 seconds, claiming steps atomically, executing them via a sandboxed ExecutionEngine, and verifying outcomes deterministically.

The LLM is called when a step requires language generation. It is never called to decide what the next step should be; that's the task template's job, defined at task creation time. This is why Helios has 122/122 tests passing and ran a 10,445-memory autonomous correction without a human in the loop. LLMs hallucinate; deterministic workers don't.

What we inherit from "just Python web"

A fair bit, and that's the point. Emily is a boring Python web app with a cognition layer on top, not a novel runtime. Things we get for free:

  • FastAPI's speed and OpenAPI docs
  • asyncpg's excellent async Postgres driver
  • Celery's mature task queue
  • pgvector's surprisingly fast HNSW indexes
  • Next.js's build system and React's component model
  • Standard debugging, profiling, and deployment tooling

The exotic parts are in emily/core/. Everything else is a deliberately boring choice so the exotic parts can be the focus.
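To make the "boring on purpose" storage choice concrete, here is a hedged sketch of one-database-per-user with a pgvector HNSW index. The DSN naming scheme and the DDL are illustrative guesses, not Emily's actual migrations; only the pgvector syntax (vector column type, hnsw index method) follows the extension's documented form.

```python
def user_dsn(user_id: str, host: str = "localhost") -> str:
    # Isolation by database, not by a tenant_id column or row-level
    # security. Naming scheme is hypothetical.
    return f"postgresql://{host}/emily_user_{user_id}"

# Illustrative per-user schema: one turn per row, one embedding per turn.
MEMORY_DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS memories (
    id bigserial PRIMARY KEY,
    turn_text text NOT NULL,
    embedding vector(1536)  -- matches the 1536-dim embeddings above
);
CREATE INDEX IF NOT EXISTS memories_hnsw
    ON memories USING hnsw (embedding vector_cosine_ops);
"""
```

Running the DDL once per user database is what makes cross-user leakage a connection-string bug rather than a query bug.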

What we don't use (and why)

  • LangChain. An abstraction over LLM providers costs more than it gives. We call the SDKs directly.
  • Dedicated vector DBs (Pinecone, Weaviate). pgvector in per-user databases is faster and simpler for our workload.
  • LLM routers/proxies (OpenRouter, LiteLLM). We want to see exactly which provider we're hitting and why. llm_cognitive_processor.py owns routing.
  • Prompt-chain frameworks. Prompts are produced by cognition modules, not by templating DSLs.

This isn't a flex. It's a consequence: every abstraction we'd add would sit between the cognition layer and the LLM, and that's exactly where Emily's value lives. We keep that seam clean.
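What "owning routing" looks like can be sketched in a few lines. The provider names are real, but the client functions are stubs standing in for direct vendor SDK calls, and the routing table is illustrative, not llm_cognitive_processor.py's actual policy.

```python
def call_anthropic(prompt: str) -> str:
    return f"claude:{prompt}"   # stub for the Anthropic SDK

def call_openai(prompt: str) -> str:
    return f"gpt:{prompt}"      # stub for the OpenAI SDK

def call_google(prompt: str) -> str:
    return f"gemini:{prompt}"   # stub for the Google SDK

PROVIDERS = {
    "anthropic": call_anthropic,
    "openai": call_openai,
    "google": call_google,
}

def route(task_kind: str) -> str:
    # The point of owning routing: the mapping is explicit and auditable,
    # not hidden inside a proxy. This table is hypothetical.
    return {"dialogue": "anthropic", "summarize": "google"}.get(task_kind, "openai")

def generate(task_kind: str, prompt: str) -> tuple[str, str]:
    provider = route(task_kind)
    return provider, PROVIDERS[provider](prompt)
```

With a router or proxy in between, the `route()` decision would be someone else's code; here it is a grep-able function in the cognition layer.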

The shape of the bet

Emily's stack is a bet that the interesting problem is not "better LLM plumbing" but "persistent per-user cognition." The commodity stuff (web, queue, DB) is boring on purpose. The interesting stuff (emily/core/) is where all the thinking goes.

If that bet is right, harnesses will continue to be useful for building workflows and chat-with-docs products, and something like Emily (cognition layers, not harnesses) will be the thing you use when you want an AI that knows you.

We'll find out.