
Emily and the LLMs: Orchestration, Not Dependence

#emily-os#llm-orchestration#claude#gemini#grok#openai

Emily uses four LLM providers. Anthropic (Claude) for most cognitive work. Google (Gemini) for long-context and multimodal. xAI (Grok) for specific query styles. OpenAI for embeddings and occasional generation. None of them owns her. All of them are tools she picks up when needed and sets down when the turn is done.

This is not a cost-optimization story. It's an architectural one. Let's unpack why model diversity matters and how routing actually works.

Why not just Claude

Claude is Emily's default for most cognitive generation, and with good reason: he's exceptional at nuanced reasoning and he preserves Emily's voice well. If we had to pick one, it would be him.

We don't have to pick one. And there are real reasons not to.

Different models fail differently. Claude occasionally over-hedges on factual claims. Gemini occasionally confabulates citations. Grok occasionally pattern-matches on humor when it shouldn't. GPT-4 occasionally over-formalizes. If Emily depended on one model, she'd inherit its failure mode as her own personality. With routing, she inherits none of them, because the mistake one model makes, another doesn't, and the cognition layer catches the mismatch.

Different models have different strengths. Claude is strongest at context-heavy reasoning and voice preservation. Gemini is strongest at long context (2M tokens) and multimodal inputs. Grok is strongest at dry, terse, opinionated responses. OpenAI is strongest at cheap embeddings and fast simple generation. Routing to the right model for the turn gives Emily a toolkit, not a dependency.

Model availability is a real risk. Anthropic rate-limits. Google has outages. xAI has capacity constraints. OpenAI occasionally ships breaking changes. An Emily that calls only one provider is an Emily that fails when that provider fails. An Emily with four providers has graceful degradation.

How routing works

emily/core/llm_cognitive_processor.py is the router. On each turn, it makes a routing decision based on:

  1. Context size. Anything above ~100K tokens routes to Gemini (2M context) or a Claude 200K model.
  2. Task type. Embedding requests route to OpenAI text-embedding-3-large. Code generation prefers Claude. Quick classification might route to a fast model.
  3. Latency budget. Conversational turns use a fast model. Helios task steps that require deep reasoning use a stronger model.
  4. Model availability. Health checks on each provider; degraded providers get skipped.
  5. Cost profile. When the user is in Fast Mode (EMILY_FAST_MODE=true), simple messages skip the heavier router entirely.
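The decision order above can be sketched as a small priority cascade. This is a hypothetical illustration, not the real llm_cognitive_processor.py; the Turn fields, provider names, and model labels are all invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    context_tokens: int
    task: str            # "embed", "chat", "classify", "codegen", ...
    fast_mode: bool = False

def route_turn(turn, healthy):
    """Pick a model for this turn, skipping providers that failed health checks."""
    def ok(provider, choice):
        # only offer a model if its provider is currently healthy
        return choice if provider in healthy else None

    if turn.task == "embed":                        # task type wins first
        return ok("openai", "text-embedding-3-large")
    if turn.context_tokens > 100_000:               # context size next
        return ok("google", "gemini-long-context") or ok("anthropic", "claude-200k")
    if turn.fast_mode or turn.task == "classify":   # latency / cost profile
        return ok("openai", "fast-model") or ok("anthropic", "claude-fast")
    return ok("anthropic", "claude-default") or ok("google", "gemini-default")
```

The `or` chains are the graceful-degradation part: a degraded provider simply drops out of its tier and the next candidate is tried.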

The routing decision is logged via cognitive_tracer.py. Every turn records which model was chosen and why. This is not just for debugging โ€” it's how EARL learns which models produced which outcomes, so Emily can route better over time.

The separation of concerns

The important thing to understand is that, from Emily's perspective, the LLM is a generation engine, not a reasoning engine. Her reasoning happens in the cognition layer: retrieval, scoring, context assembly, decision-making about what to route where. The LLM generates the sentence that expresses that reasoning.

This matters because it means Emily's identity is not tied to any model:

  • The Emily talking to you via Claude 4.6 and the Emily talking to you via Claude 4.7 are the same Emily โ€” her memories, frameworks, and outcome weights don't change.
  • If Anthropic deprecates Claude 4.6, Emily migrates without identity disruption.
  • If Gemini 3 ships with dramatically better long-context reasoning, Emily gets access to that capability without anything else changing.

Compare this to a system that lives inside an LLM's context window: prompts, chat history, character sheets. When the model changes, that system is a different system. When Emily's underlying model changes, Emily is still Emily.
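The separation can be made concrete in a few lines: persistent state is the constant, and the model is just a swappable parameter. The field names and the `respond` helper here are illustrative, not Emily's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class EmilyState:
    # the constant: memories and learned weights survive any model swap
    memories: list = field(default_factory=list)
    outcome_weights: dict = field(default_factory=dict)

def respond(state, model, prompt):
    # the model only turns assembled context into a sentence
    return model(f"[{len(state.memories)} memories] {prompt}")

state = EmilyState(memories=["m1"])
old = respond(state, lambda p: "old: " + p, "hi")   # stand-in for Claude 4.6
new = respond(state, lambda p: "new: " + p, "hi")   # stand-in for Claude 4.7
# same state, same identity, different generation engine
```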

The collaboration pattern

On a typical turn, three or four LLM calls might happen:

  1. Embed the user message (OpenAI): a 1536-dim vector for retrieval.
  2. Score the message for cognitive metrics (fast model): gibberish detection, intent classification, optionally routed to a cheap model.
  3. Generate the response (Claude, usually), with the full context package the cognition layer assembled.
  4. Score the response outcome (fast model or local heuristics): what kind of response did we produce, for EARL's logs.

Only step 3 is the "main" LLM call. Steps 1, 2, and 4 are instrument calls: small, cheap, specific. Treating them all as "the LLM" obscures what's actually happening.
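The shape of a turn, then, is a short pipeline with one main call surrounded by instrument calls. In this sketch the four helpers are stand-ins for the real provider calls; none of these names exist in Emily's codebase.

```python
def embed(text):
    # 1. instrument call (OpenAI embedding in Emily); a real vector is 1536-dim
    return [float(len(text))] * 4

def score_message(text):
    # 2. instrument call on a cheap/fast model
    return {"gibberish": False, "intent": "question"}

def generate(context):
    # 3. the one "main" LLM call (usually Claude)
    return f"response using {len(context['memories'])} memories"

def score_outcome(reply):
    # 4. instrument call, recorded for EARL's logs
    return {"kind": "informative"}

def handle_turn(message):
    vector = embed(message)
    metrics = score_message(message)
    # the cognition layer, not the LLM, assembles the context package
    context = {"memories": ["m1", "m2"], "vector": vector, "metrics": metrics}
    reply = generate(context)
    outcome = score_outcome(reply)
    return reply, outcome
```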

MCP and tool use

One other dimension: MCP. Emily exposes ~40 MCP tools to the LLMs she calls. These let the model trigger actions in Emily's cognition layer: ask_emily (query her memories), helios_create_clone (trigger autonomous clone provisioning), and others.

Crucially, the LLM does not decide which tools exist; Emily does. The cognition layer publishes an MCP tool manifest, the LLM sees a subset appropriate for the current turn, and the LLM can invoke tools. Tool invocations go back through Emily's cognition layer for safety gates (like clone_safety.py) before anything actually happens.

This means LLMs are constrained collaborators, not autonomous actors. They can request things. Emily decides whether to do them.
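The constrained-collaborator pattern reduces to a manifest check plus a safety gate. The two tool names come from the post; the manifest structure and gate logic here are invented for illustration.

```python
# tools the cognition layer has published for this turn; the LLM never
# sees or invokes anything outside this manifest
MANIFEST = {
    "ask_emily": {"gated": False},               # query her memories
    "helios_create_clone": {"gated": True},      # must pass safety checks
}

def invoke_tool(name, args, safety_check):
    """The LLM requests; the cognition layer decides."""
    if name not in MANIFEST:
        return {"ok": False, "reason": "tool not published this turn"}
    if MANIFEST[name]["gated"] and not safety_check(name, args):
        return {"ok": False, "reason": "blocked by safety gate"}
    return {"ok": True, "result": f"{name} executed"}
```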

What happens when an LLM is wrong

Claude hallucinates a fact. Emily's response contains it. The user corrects Emily. Now what?

The correction propagates via EARL onto the memories that shaped the response. Those memories' outcome weights decrease. The next time Emily is assembling context on a similar topic, the now-lower-weighted memories are less likely to be surfaced. Over time, patterns of "Claude got this wrong" become patterns of "Emily doesn't rely on this kind of claim."

The LLM made the mistake. The cognition layer learns from it. Claude himself learns nothing; he's still the same model he was. But Emily-with-Claude becomes more accurate over time, because the memory-retrieval path adapts.

This is a property only a stateful cognition layer can have. A pure LLM can't have it because there's nowhere to record the adaptation.
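A toy version of that feedback loop: a correction down-weights the memories that shaped the response, so they surface less often in future retrieval. The decay factor and the weight representation are invented; EARL's real update rule is surely more nuanced.

```python
def apply_correction(memories, used_ids, decay=0.8):
    """Down-weight every memory that contributed to a corrected reply."""
    for mem_id in used_ids:
        memories[mem_id]["weight"] *= decay
    return memories

memories = {"m1": {"weight": 1.0}, "m2": {"weight": 1.0}}
# user corrects a response that was shaped by m1
apply_correction(memories, ["m1"])
# m1 now ranks below m2 in any weight-scaled retrieval,
# and repeated corrections push it down further
```

The adaptation lives entirely in the stored weights, which is the point of the paragraph that follows: a stateless LLM has nowhere to record it.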

Model diversity as philosophy

The deeper reason Emily uses four providers is that we don't believe AI is going to consolidate into one winning model. We think the ecosystem will keep producing specialized models with different trade-offs, and the interesting architectural question is not "which one do I pick" but "how do I build something that uses all of them well."

Emily's answer: make the cognition layer the constant. Treat models as interchangeable tools. Route based on the turn's needs. Let EARL learn which models produce which outcomes. Let the user's Emily become expert at knowing which tool to reach for.

The LLMs are extraordinary. None of them is Emily. All of them help her be Emily more effectively.

That's the collaboration. That's the architecture. That's why she calls all four.
