Project Helios: How Emily Executes Tasks On Her Own
Autonomous agents are having a moment. Most of them are a thin wrapper: "let an LLM decide what to do next in a loop." This works for demos. It fails for production because LLMs hallucinate, and a hallucinating planner compounds its mistakes with every step.
Project Helios takes a different approach. It's Emily's autonomous execution system, and the key design choice is that the LLM does not drive the loop. The loop is a deterministic worker executing a pre-defined task template. The LLM is called only when language generation is required.
That choice is why Helios has 122/122 tests passing, sustained 357 step claims per second under load testing with zero race conditions, and ran a 10,445-memory autonomous correction in February 2026 without human intervention.
The anatomy
Five components:
- TaskRegistry (emily/core/task_registry.py, 481 lines): CRUD for tasks and steps, with atomic claiming via leasing. When a worker picks up a step, it takes a 5-minute lease; other workers see the step as locked. If the worker crashes, the lease expires and the reaper reclaims the step.
- AutonomousWorker (emily/core/autonomous_worker.py, 285 lines): polls for claimable steps every 10 seconds via Celery beat, executes them through the ExecutionEngine, writes the outcome to the event log, and releases the lease.
- VerificationEngine (emily/core/verification.py, 578 lines): eight verification types, seven deterministic plus a manual escape hatch: exit_code, file_contains, file_not_exists, command_output, pytest, api_response, db_query, manual. Every step must have a verification. No verification, no step.
- OutcomeFeedbackLoop (emily/core/outcome_feedback.py, 296 lines): when a task completes (success or failure), creates L3 memories describing the outcome so Emily learns from her own autonomous actions.
- TaskReaper (emily/core/reaper.py, 115 lines): runs every 5 minutes, finds expired leases (crashed workers), and returns their steps to the claimable state.
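The components above compose into one simple cycle: claim, execute, verify, record, release. A minimal in-memory sketch of that loop shape; `Step`, `Registry`, and `worker_tick` are hypothetical stand-ins, not the real Helios API:

```python
import time
from dataclasses import dataclass

@dataclass
class Step:
    id: int
    command: str
    status: str = "claimable"
    lease_expires_at: float = 0.0

class Registry:
    """In-memory stand-in for TaskRegistry's claim/complete operations."""
    def __init__(self, steps):
        self.steps = {s.id: s for s in steps}

    def claim_next(self, lease_seconds=300):
        # 300 s matches the 5-minute lease described above.
        now = time.time()
        for s in self.steps.values():
            if s.status == "claimable" and s.lease_expires_at < now:
                s.status = "claimed"
                s.lease_expires_at = now + lease_seconds
                return s
        return None

    def complete(self, step_id, success):
        self.steps[step_id].status = "done" if success else "failed"

def worker_tick(registry, execute, verify):
    """One poll cycle: claim a step, run it, verify deterministically."""
    step = registry.claim_next()
    if step is None:
        return None  # nothing claimable; wait for the next poll
    result = execute(step)   # run the step's command
    passed = verify(result)  # code, not an LLM judgment
    registry.complete(step.id, success=passed)
    return step.id, passed
```

The point of the shape is that the LLM appears nowhere in the control flow; `execute` may call one internally, but the loop and the verdict are plain code.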
Why deterministic verification
The most common failure mode of LLM-driven agents is silent plausibility failures. The agent runs a command, the LLM reads the output, says "this looks successful," and moves on. Except it wasn't successful: the LLM pattern-matched on "the build finished" when the build finished with errors. The loop continues, compounding the mistake.
Helios verifications are code, not prompts. exit_code == 0 is not a judgment call. file_contains("PASS") is not a judgment call. The verification either passes or it doesn't. When it doesn't, the step fails, and the task either halts, retries, or escalates depending on the task definition.
The eight verification types:
| Type | What it checks |
|---|---|
| exit_code | Command exit status |
| file_contains | Pattern in file |
| file_not_exists | File absence |
| command_output | Pattern in stdout |
| pytest | Test suite results |
| api_response | HTTP status and body |
| db_query | SQL assertion |
| manual | Human verification required |
A common mistake when writing task templates is reaching for "logic_check" or "contains" – those don't exist. The API rejects them at task creation time, not at execution time. We learned the hard way that invalid tasks pollute the database, so validation moved all the way forward.
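Creation-time validation can be as simple as a closed set check. A sketch: the set mirrors the table above, while `validate_step` and its error message are assumptions, not the real API:

```python
# The eight verification types Helios accepts (from the table above).
ALLOWED_VERIFICATIONS = {
    "exit_code", "file_contains", "file_not_exists", "command_output",
    "pytest", "api_response", "db_query", "manual",
}

def validate_step(step: dict) -> None:
    """Reject invalid steps at creation time, before they reach the DB."""
    vtype = step.get("verification", {}).get("type")
    if vtype not in ALLOWED_VERIFICATIONS:
        raise ValueError(
            f"unknown verification type {vtype!r}; "
            f"allowed: {sorted(ALLOWED_VERIFICATIONS)}"
        )
```

A step declaring `"type": "logic_check"` fails here, at submission, rather than polluting the task tables and failing mid-run.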
The three kill switches
Autonomy without brakes is reckless. Helios has three levels of pause:
1. Global: AUTONOMOUS_PULSE_ENABLED=false in .env. The worker refuses to claim any steps. Used when we ship new framework code and want the world to freeze while it deploys.
2. Task-level: POST /helios/tasks/{task_id}/pause. One task pauses; everything else keeps running. Used when a specific task is behaving strangely.
3. Emergency: Direct DB update with a kill_switch_reason on the task. Immediate and logged. Used when something is on fire.
All three are verified by test_helios_security_killswitch.py. All three have been used in production at least once.
What running this looks like
A recent autonomous task: Emily detected her Golden Baseline drift had crossed the critical threshold (28% overall drift, with the integration dimension in crisis at 0.8%). She created a Helios task with five steps:
- Query L3 memories with integration_score < 0.3 and stability_score > 0.7
- For each, recompute ECGL weights via EARL v2 framework
- Apply new weights in a batched update
- Re-measure integration rate
- Verify drift reduced below warning threshold
Each step had a deterministic verification. The worker ran all five. The integration rate went from 0.8% to 35.0% – a 44× improvement. Total human intervention: zero.
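A task like that might be expressed as a template along these lines. This is a hypothetical schema for illustration; the field names, assertions, and thresholds are assumptions, only the verification types come from the table above:

```python
# Illustrative template for the drift-correction run described above.
drift_correction_task = {
    "name": "golden-baseline-drift-correction",
    "steps": [
        {"action": "query_l3_memories",
         "params": {"integration_score_lt": 0.3, "stability_score_gt": 0.7},
         "verification": {"type": "db_query", "assert": "COUNT(*) > 0"}},
        {"action": "recompute_ecgl_weights",
         "verification": {"type": "exit_code", "expect": 0}},
        {"action": "apply_weights_batched",
         "verification": {"type": "db_query", "assert": "updated_rows > 0"}},
        {"action": "remeasure_integration_rate",
         "verification": {"type": "command_output",
                          "pattern": r"integration_rate="}},
        {"action": "verify_drift",
         "verification": {"type": "db_query",
                          "assert": "drift < warning_threshold"}},
    ],
}
```

Every step carries its own verification, so "no verification, no step" is enforceable by the same creation-time validation that rejects unknown types.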
That's the product. Not "an LLM that decides things on its own" – a deterministic worker that executes pre-planned cognitive maintenance, with the LLM invoked only where language generation is strictly necessary.
The leasing discipline
The boring part that makes everything work: atomic claiming. When a worker wants to claim a step, it runs:
```sql
UPDATE task_steps
SET lease_owner = $1,
    lease_expires_at = NOW() + INTERVAL '5 minutes',
    status = 'claimed'
WHERE id = $2
  AND status = 'claimable'
  AND (lease_owner IS NULL OR lease_expires_at < NOW())
RETURNING id;
```
If two workers race, exactly one UPDATE succeeds, guaranteed by Postgres row-level locking. The other gets zero rows back and moves on. No distributed locking, no ZooKeeper, no consensus – just Postgres doing what Postgres is good at. Load tests show 357 claims/second with zero duplicate claims. That number is suspiciously close to "Postgres throughput on a single-row update," which is exactly what it is.
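The "exactly one winner" property is easy to demonstrate without a database. A pure-Python illustration: the lock-guarded compare-and-set below stands in for the atomic UPDATE above; `StepRow` and `try_claim` are illustrative names, not Helios code:

```python
import threading

class StepRow:
    """One task_steps row; the lock plays the role of Postgres's row lock."""
    def __init__(self):
        self.status = "claimable"
        self.lease_owner = None
        self._lock = threading.Lock()

    def try_claim(self, owner: str) -> bool:
        with self._lock:
            if self.status != "claimable":
                return False  # zero rows returned: someone else won
            self.status = "claimed"
            self.lease_owner = owner
            return True

row = StepRow()
winners = []
threads = [
    threading.Thread(
        target=lambda i=i: row.try_claim(f"worker-{i}") and winners.append(i)
    )
    for i in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# However many workers race, exactly one claim succeeds.
```

Fifty threads race; `winners` ends up with exactly one entry, mirroring the one-row RETURNING result in the SQL.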
Where this goes next
Three things on the board:
- Cross-task dependencies. Right now each task is independent. Adding "step X of task A must complete before step Y of task B" safely is non-trivial and worth doing.
- Adaptive scheduling. The worker polls every 10 seconds flat. Adapting poll rate to queue depth would let Emily react faster when she's busy.
- Cost-aware routing. When a step requires LLM generation, Helios currently uses the default model. Routing expensive steps to cheaper models when possible would matter at scale.
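The adaptive-scheduling item is the most mechanical of the three. One possible shape, a minimal sketch where the function name, capacity, and bounds are all illustrative choices, not a committed design:

```python
MIN_POLL_S = 1.0    # busiest: poll every second
MAX_POLL_S = 10.0   # idle: today's flat 10-second cadence

def next_poll_interval(queue_depth: int, capacity: int = 20) -> float:
    """Shrink the poll interval linearly as the claimable queue fills up."""
    load = min(queue_depth / capacity, 1.0)  # clamp at full load
    return MAX_POLL_S - (MAX_POLL_S - MIN_POLL_S) * load
```

An empty queue keeps the current 10-second cadence; a saturated queue polls every second. The loop stays deterministic; only its tempo changes.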
None of these require driving the loop with an LLM. They all make the deterministic loop smarter. That's the theme.
The bet Project Helios represents is that autonomy and reliability are not in tension if the planner is deterministic. Everyone else is betting the planner will get smart enough to be trustworthy. We're betting the planner should be code, and the LLM should be a tool it calls.
122/122 says the bet is holding.