How We Track Every Dollar Atlas Spends on LLMs
#atlas #devlog #feature #llm #building-in-public
David Olsson

Running multi-agent simulations burns tokens fast. Graph building, profile generation, per-round agent decisions, final report: each stage hits the LLM a different number of times at different prompt sizes. Without visibility into that spend, it is easy to finish a run and have no idea whether your $0.12 came from the graph pass or from 80 agents arguing for 20 rounds.
We built a monitor layer that records every call, categorizes it by workflow step, persists it to a per-simulation JSONL ledger, and streams live updates to the UI over SSE. Here is how it works.
The record
Every LLM call goes through LLMClient.chat(). After the API response comes back, the client extracts token counts from the response's usage object, computes cost, and calls record_call() before returning to the caller:
```python
usage = response.usage
pt = usage.prompt_tokens if usage else 0
ct = usage.completion_tokens if usage else 0
cost = compute_cost(self.model, pt, ct)
record_call({
    'model': self.model,
    'caller': caller,  # auto-detected via inspect.stack()
    'prompt_tokens': pt,
    'completion_tokens': ct,
    'total_tokens': pt + ct,
    'cost_usd': cost,
    'duration_ms': round(duration_ms, 1),
    'messages_preview': msg_preview,
    'response_preview': resp_preview,
    'error': None,
})
```
The caller field is resolved automatically by walking the call stack and finding the first frame outside llm_client.py. That gives us strings like simulation.run_round or ontology_generator.build_schema without any manual tagging at call sites.
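The stack walk is simple enough to sketch. This is an illustrative version, not the exact implementation; the helper name `detect_caller` and the filename check are assumptions:

```python
import inspect

def detect_caller(skip_file: str = "llm_client.py") -> str:
    """Return 'module.function' for the first stack frame that lives
    outside the LLM client module, so call sites need no manual tags."""
    for frame_info in inspect.stack()[1:]:  # [0] is this function itself
        if skip_file not in frame_info.filename:
            module = inspect.getmodule(frame_info.frame)
            mod_name = module.__name__ if module else "unknown"
            return f"{mod_name}.{frame_info.function}"
    return "unknown"
```

Called from inside `simulation.run_round`, this yields `simulation.run_round` without that module knowing the monitor exists.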
compute_cost() does a longest-prefix match against a pricing table keyed by model family (gpt-4o-mini, gpt-4.1, o3-mini, etc.), so cost is always calculated correctly even when a model string has a version suffix we have not seen before. Unknown models fall back to GPT-4-level pricing as a conservative estimate.
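The longest-prefix lookup can be sketched like this. The per-million-token rates below are placeholders, not real prices, and the table keys are examples:

```python
# Placeholder pricing: USD per 1M (prompt, completion) tokens.
PRICING = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o":      (2.50, 10.00),
    "gpt-4.1":     (2.00, 8.00),
    "o3-mini":     (1.10, 4.40),
}
DEFAULT_RATE = (30.00, 60.00)  # conservative GPT-4-level fallback

def compute_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    # Longest-prefix match: "gpt-4o-mini-2024-07-18" matches
    # "gpt-4o-mini" rather than the shorter "gpt-4o" prefix.
    best = max((p for p in PRICING if model.startswith(p)),
               key=len, default=None)
    in_rate, out_rate = PRICING[best] if best else DEFAULT_RATE
    return (prompt_tokens * in_rate + completion_tokens * out_rate) / 1_000_000
```

Taking the longest matching prefix is what keeps `gpt-4o-mini-<date>` from being billed at `gpt-4o` rates.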
Step classification
record_call() passes the caller string through _caller_to_step(), which maps module prefixes to five workflow steps: graph_building, env_setup, simulation, report, and other. The mapping is a simple ordered list of (prefix, step) pairs; no regex, no magic.
The step label is stored on the record and accumulated into per-step cost/token/call counters in memory. Those counters drive the breakdown chips in the status bar.
Two ledgers
Every record lands in two places simultaneously:
- Global JSONL: data/llm_calls.jsonl, appended unconditionally. All-time totals are bootstrapped from this file at startup and then maintained in memory incrementally.
- Per-simulation JSONL: uploads/simulations/<id>/llm_ledger.jsonl, appended only when a simulation is active. The monitor also keeps an in-memory dict indexed by simulation ID so stats queries never touch disk.
When a simulation finishes, save_simulation_ledger() writes a companion llm_summary.json alongside the ledger: a pre-computed step breakdown useful for post-hoc analysis without replaying the full JSONL.
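The summary roll-up amounts to grouping the per-sim records by step. A sketch under assumed field names (the on-disk schema of llm_summary.json is an assumption):

```python
import json
import os
from collections import defaultdict

def save_simulation_summary(sim_dir: str, records: list) -> dict:
    """Aggregate per-call records into a per-step breakdown and write
    it next to the ledger as llm_summary.json (schema illustrative)."""
    steps = defaultdict(lambda: {"calls": 0, "tokens": 0, "cost_usd": 0.0})
    for r in records:
        s = steps[r.get("step", "other")]
        s["calls"] += 1
        s["tokens"] += r.get("total_tokens", 0)
        s["cost_usd"] += r.get("cost_usd", 0.0)
    summary = {
        "steps": dict(steps),
        "total_cost_usd": sum(v["cost_usd"] for v in steps.values()),
    }
    with open(os.path.join(sim_dir, "llm_summary.json"), "w") as f:
        json.dump(summary, f, indent=2)
    return summary
```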
Budget tracking
Budgets are set via a REST call (POST /api/monitor/simulation/<id>/budget) and persisted to budget.json in the simulation directory. On every record_call(), if a budget is active the monitor recomputes the running total and sets a _budget_exceeded or _budget_warning flag on the record before it goes out over SSE. The frontend uses that flag to flip the budget pill from green to amber to red.
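The flag logic is a few lines. This sketch assumes an 80% warning threshold (the post names only the two flags, not the cutoff):

```python
def apply_budget_flags(record: dict, running_total: float,
                       budget_usd: float, warn_ratio: float = 0.8) -> dict:
    """Annotate a call record before it goes out over SSE.
    warn_ratio is an assumed threshold for the amber state."""
    total = running_total + record["cost_usd"]
    if total >= budget_usd:
        record["_budget_exceeded"] = True
    elif total >= budget_usd * warn_ratio:
        record["_budget_warning"] = True
    return record
```

The frontend only has to check which flag is present to pick green, amber, or red.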
Data flow
The SSE endpoint (GET /api/monitor/stream) is a Flask generator that blocks on a queue.Queue with a 30-second timeout. On timeout it yields a heartbeat comment to keep the connection alive. When record_call() runs, it puts the event onto every subscriber queue before releasing the lock. If a queue is full (capped at 100 events), the subscriber is dropped; a slow client does not block the simulation.
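The queue-per-subscriber pattern, minus the Flask routing, looks roughly like this. Function names here are illustrative:

```python
import queue
import threading

subscribers: list = []
_lock = threading.Lock()

def publish(event: str) -> None:
    """Fan an event out to every subscriber queue. A full queue means
    a slow client, which gets dropped rather than blocking the caller."""
    with _lock:
        for q in list(subscribers):
            try:
                q.put_nowait(event)
            except queue.Full:
                subscribers.remove(q)

def event_stream(heartbeat_s: float = 30.0):
    """Generator body for the SSE route: block on the queue, and on
    timeout emit an SSE comment line as a keep-alive heartbeat."""
    q = queue.Queue(maxsize=100)
    with _lock:
        subscribers.append(q)
    try:
        while True:
            try:
                yield f"data: {q.get(timeout=heartbeat_s)}\n\n"
            except queue.Empty:
                yield ": heartbeat\n\n"
    finally:
        with _lock:
            if q in subscribers:
                subscribers.remove(q)
```

In Flask the generator would be wrapped in a `Response` with `mimetype="text/event-stream"`.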
Frontend: one composable, zero polling
useLLMMonitor.js is a module-level singleton: one EventSource, one reactive state object, shared across every component that imports it. It auto-connects on first import. On reconnect after a drop it waits three seconds before retrying.
The composable handles three event types from the stream:
- init: sent immediately on connect; carries current session and all-time totals
- llm_call: carries the full call record plus a rolled-up session snapshot and, when relevant, per-simulation stats
- system: text messages from the backend (call starting, call complete, errors)
StatusBar.vue reads directly from the composable state. The collapsed bar shows last-call cost, session total, all-time total, token count, per-step pills, and, when a budget is set, a miniature progress bar that animates via CSS transition as the fill width changes. Clicking the bar expands a scrollable log panel that replays the last 80 events with timestamps and type tags.
Provider-agnostic
LLMClient uses the OpenAI Python SDK with a configurable base_url, so the same call path and the same monitor work with OpenAI, Azure OpenAI, Ollama, Qwen, or any other OpenAI-compatible endpoint. The pricing table covers known hosted models; self-hosted models that do not match any prefix fall back to the default rate, which at least keeps the ledger structurally complete even if the dollar figure is an estimate.
What we learned
The caller auto-detection via inspect.stack() was the decision that saved the most implementation effort: zero annotation required at call sites across fifteen modules. The step-prefix table needs updating when we add new modules, but that is a two-line change and easy to audit.
The dual-ledger design (global + per-sim) came from a real need: we wanted all-time cost tracking across all simulations without loading every per-sim ledger into memory. The global file is append-only and the in-memory counter keeps the REST query for all-time totals at O(1).
The one rough edge is that the global JSONL is never pruned. For now that is fine: a busy session might generate a few thousand records, which stays well under a megabyte. We will add rotation if it becomes a problem.