
Determinism Where Possible: The Case for a Dumb Planner

#architecture#helios#autonomy#determinism

Open any popular agent framework and you'll find the same architecture: an LLM at the center of a loop, deciding what to do next at each step. This is marketed as intelligence. In production it manifests as brittleness: hallucinated tool calls, compounding errors, unrecoverable states, and reliability curves that nobody wants to publish.

Emily's Project Helios does the opposite. The planner is deterministic code. Task templates are defined at creation time. The LLM is invoked only when language generation is actually required. Verification is deterministic: exit_code, file_contains, pytest, api_response. Not judgment. Not "the model thinks it probably worked."

The measured outcomes: 122/122 tests passing, 357 atomic step claims per second with zero race conditions, a 10,445-memory autonomous correction executed in production with zero human intervention. This is not a demo. This is what reliable autonomy looks like.

Why "smart planner" goes wrong

An LLM-driven planner has a fundamental problem: errors compound across steps. Each turn, the LLM might hallucinate a tool call, misremember the state, or choose a wrong branch. For a 10-step task, even a 5% per-step error rate compounds to a ~40% failure rate. For a 20-step task, ~64%.

You can try to engineer around this with better prompts, self-critique, and retries, but the fundamental issue remains: you're asking a stochastic process to produce reliable plan execution. It's the wrong tool for the job.
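The compounding arithmetic from the previous section is worth making concrete. A minimal sketch (the helper name is mine, not Helios's):

```python
def task_failure_rate(per_step_error: float, steps: int) -> float:
    """Probability that at least one step in a task fails,
    assuming independent per-step errors."""
    return 1 - (1 - per_step_error) ** steps

# A 5% per-step error rate across 10 steps:
print(f"{task_failure_rate(0.05, 10):.0%}")  # 40%
# Across 20 steps:
print(f"{task_failure_rate(0.05, 20):.0%}")  # 64%
```

The independence assumption is generous to the LLM planner; in practice an early wrong branch tends to make later steps worse, not merely equally risky.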

Why "dumb planner" works

A deterministic planner has a different property: each step's behavior is a function of its inputs, not of the LLM's mood. If a step says "run pytest tests/foo.py, verify exit_code 0," then either pytest passes or it doesn't. No ambiguity. No hallucinated success.

Errors don't compound because there's no cognitive process accumulating them. There's just code, executing steps, checking post-conditions.
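As a sketch of what "a function of its inputs" means here, a deterministic step runner can be as small as this (the `Step` shape is illustrative, not the Helios schema):

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Step:
    command: list[str]
    expected_exit_code: int = 0

def run_step(step: Step) -> bool:
    """Execute the command and verify its exit code. The result is a
    hard boolean, not a model's opinion about whether it worked."""
    result = subprocess.run(step.command, capture_output=True)
    return result.returncode == step.expected_exit_code

# Either pytest passes or it doesn't; no hallucinated success:
# run_step(Step(["pytest", "tests/foo.py"]))
```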

Where the LLM lives in this architecture

The LLM is still valuable for generating the content of the work. What it doesn't do is decide the work.

Concretely in Emily:

  • A task to "send a progress update to the user" has the LLM generate the language
  • A task to "determine whether to send an update" is deterministic code checking a condition
  • A task to "format a report from these 50 memories" has the LLM generate the prose
  • Deciding which 50 memories is deterministic: top-N by ECGL score
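The split in the list above can be sketched in a few lines. `llm_generate` stands in for whatever LLM client the system uses; the selection logic is ordinary deterministic code:

```python
def select_memories(memories: list[dict], n: int = 50) -> list[dict]:
    """Deterministic control: top-N by ECGL score. Same inputs,
    same outputs, every time."""
    return sorted(memories, key=lambda m: m["ecgl_score"], reverse=True)[:n]

def format_report(memories: list[dict], llm_generate) -> str:
    """Stochastic content: the LLM only writes the prose.
    It never chooses which memories are in scope."""
    facts = "\n".join(m["text"] for m in memories)
    return llm_generate(f"Summarize these memories:\n{facts}")
```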

LLMs are the commodity; the orchestration structure is the product. This matches the three-layer model Emily uses at the architectural level.

The verification engine

Eight verification types, all deterministic:

  1. exit_code — check command exit status
  2. file_contains — pattern matching in files
  3. file_not_exists — verify file absence
  4. command_output — check stdout
  5. pytest — run a test suite
  6. api_response — HTTP endpoint validation
  7. db_query — database assertions
  8. manual — requires human verification

Notice what's missing: no "LLM judgment" verification type. No "the agent believes the task succeeded." Verification is code that either passes or fails. This is what makes the whole system auditable.
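A verification engine of this shape is essentially a dispatch table over hard-boolean checks. A minimal sketch covering three of the eight types (function names and spec shapes are illustrative, not Helios's actual API):

```python
import subprocess
from pathlib import Path

def verify_exit_code(spec: dict) -> bool:
    result = subprocess.run(spec["command"], capture_output=True)
    return result.returncode == spec.get("expected", 0)

def verify_file_contains(spec: dict) -> bool:
    return spec["pattern"] in Path(spec["path"]).read_text()

def verify_file_not_exists(spec: dict) -> bool:
    return not Path(spec["path"]).exists()

VERIFIERS = {
    "exit_code": verify_exit_code,
    "file_contains": verify_file_contains,
    "file_not_exists": verify_file_not_exists,
    # ... command_output, pytest, api_response, db_query, manual
}

def verify(kind: str, spec: dict) -> bool:
    """Every verifier returns a hard boolean. There is no entry
    for 'ask the model whether it worked'."""
    return VERIFIERS[kind](spec)
```

The auditability claim falls out of the structure: a step's outcome is reproducible from its spec, so a failed run can be replayed and inspected.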

Kill switches and bounded autonomy

Because the planner is deterministic, you can actually reason about what it will and won't do. Three kill switch levels:

  1. Global — AUTONOMOUS_PULSE_ENABLED=false stops all autonomous execution
  2. Task-level — POST /helios/tasks/{task_id}/pause
  3. Emergency — direct DB update with kill_switch_reason

You can only confidently kill a system you can reason about. LLM-driven agent loops are harder to kill because their state is a prompt history that may or may not respect a stop signal.
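Because the gate is code, the three levels compose into a single deterministic check evaluated before every step. A sketch (the env var and column name follow the text; the task-row shape is assumed):

```python
import os

def may_execute(task: dict) -> bool:
    """Deterministic gate checked before each step. A stop signal
    cannot be 'ignored' the way a prompt instruction can."""
    if os.environ.get("AUTONOMOUS_PULSE_ENABLED", "true").lower() == "false":
        return False  # 1. global kill switch
    if task.get("paused"):
        return False  # 2. task-level pause
    if task.get("kill_switch_reason"):
        return False  # 3. emergency DB-level kill
    return True
```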

The atomic claim property

Under load, the planner does 357 atomic step claims per second with zero race conditions. Multiple workers can pick up steps concurrently; the database primitive ensures exactly one worker owns each step at a time.

This is the kind of guarantee you can state because the planner is code. Stating the same guarantee about an LLM-driven loop would require reasoning about the LLM's behavior under concurrent invocation, which is not a tractable problem.
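The "database primitive" can be as simple as a conditional UPDATE: the database serializes writers, so exactly one worker's statement matches the pending row. A minimal sketch with SQLite (the schema is illustrative; a production system would likely use Postgres and patterns like SELECT ... FOR UPDATE SKIP LOCKED):

```python
import sqlite3

def claim_step(conn: sqlite3.Connection, step_id: int, worker: str) -> bool:
    """Atomically claim a step: the UPDATE matches only while the row
    is still 'pending', so at most one worker's rowcount is 1."""
    cur = conn.execute(
        "UPDATE steps SET status = 'claimed', owner = ? "
        "WHERE id = ? AND status = 'pending'",
        (worker, step_id),
    )
    conn.commit()
    return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE steps (id INTEGER PRIMARY KEY, status TEXT, owner TEXT)")
conn.execute("INSERT INTO steps (id, status) VALUES (1, 'pending')")

print(claim_step(conn, 1, "worker-a"))  # True: first claim wins
print(claim_step(conn, 1, "worker-b"))  # False: row already claimed
```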

The philosophical inversion

Most of the industry is pushing in the direction of "smarter planners." Emily pushes in the direction of "dumber planners, richer tools, clearer contracts."

This is not an aesthetic preference. It's a direct response to what we've observed: the reliability of autonomous systems is bounded by the reliability of the planner, and LLM-driven planners have a reliability ceiling that's too low for production.

Reliability comes from the dumb planner. Intelligence comes from the tools the planner invokes. Keep those two responsibilities separate and you get systems that both work and are auditable.

The general principle

"Determinism where possible, stochasticity where necessary" is a good design heuristic for any system that mixes code and LLMs. Put the LLM where its strengths are (language, creativity, open-ended synthesis). Don't put it where its weaknesses are (sequencing, state tracking, verification).

Emily's Helios architecture is this principle, compiled to Python.


Part of the Emily OS architecture philosophy series.