Right-sizing LLMs: Per-Category Model Routing in Atlas
#atlas#devlog#feature#llm#building-in-public
David Olsson

Running a social simulation is expensive if you are not careful. A single 50-agent, 10-round run can generate several hundred LLM calls. If all of those go to a full-capability model, the bill adds up fast — and most of those calls do not need that capability.
We shipped per-category model routing to fix this. Every stage of the Atlas pipeline now has its own independently configurable model, with sensible defaults selected for the cost-to-quality tradeoff at each stage.
The six categories
Atlas breaks its pipeline into six workflow categories, each with a different volume profile and quality requirement.
| Category | Default model | Rationale |
|---|---|---|
| graph_building | gpt-4.1-nano | Highest volume — structured JSON extraction from document chunks |
| profile_generation | gpt-4.1-mini | Creative persona writing, moderate volume |
| config_generation | gpt-4.1-mini | Single call, needs reasoning to set simulation parameters |
| simulation | gpt-4.1-nano | Dominates total cost — agents need basic action selection only |
| report | gpt-4.1 | Low volume, user-facing — worth the quality premium |
| interaction | gpt-4.1-mini | Conversational agent interviews, user-facing |
Graph building and simulation are the two categories that generate the most calls. Graph building runs against every document chunk in parallel. Simulation fires one LLM call per agent per round, so at 50 agents and 10 rounds that is 500 calls — all routed to nano by default.
Report generation is the opposite: typically one or two calls, but the output goes directly to the user as a structured analytical document. That is where we let the full model run.
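The defaults table above maps directly onto the MODEL_CATEGORY_DEFAULTS mapping that the lookup code consults. As a sketch (the surrounding Config class is elided):

```python
# Per-category default models, mirroring the table above.
# Keys are the six workflow categories used throughout the pipeline.
MODEL_CATEGORY_DEFAULTS = {
    "graph_building": "gpt-4.1-nano",
    "profile_generation": "gpt-4.1-mini",
    "config_generation": "gpt-4.1-mini",
    "simulation": "gpt-4.1-nano",
    "report": "gpt-4.1",
    "interaction": "gpt-4.1-mini",
}
```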
How routing works in practice
The lookup is a simple three-level fallback in Config.get_model_for_category:
```python
@classmethod
def get_model_for_category(cls, category: str) -> str:
    """Return the model for a given category: user override > category default > global."""
    model = cls.MODEL_CATEGORIES.get(category)
    if model:
        return model
    return cls.MODEL_CATEGORY_DEFAULTS.get(category, cls.LLM_MODEL_NAME)
```
User overrides (stored in settings.json) take priority. If none is set, the category default applies. If the category is unknown, we fall back to the global model name. The config hot-reloads on save — no restart required.
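An override file might look something like this — the post only says overrides live in settings.json, so the exact key names here are an assumption:

```json
{
  "model_categories": {
    "simulation": "gpt-4.1-mini",
    "report": "gpt-4.1"
  }
}
```

Categories absent from the file fall through to their defaults.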
Each service that makes LLM calls instantiates LLMClient with the model for its category:
```python
client = LLMClient(model=Config.get_model_for_category("simulation"))
```
Model-aware parameter handling
Different model families require different API parameters, which means a naive "just change the model name" approach breaks at the API boundary. We handle this in LLMClient._build_params:
| Model family | Parameters sent |
|---|---|
| GPT-4 / GPT-4.1 | max_tokens, temperature, response_format |
| GPT-4.5 / GPT-5 | max_completion_tokens, temperature, response_format |
| o1 / o1-mini / o1-preview | max_completion_tokens only (no temperature, no JSON mode) |
| o3 / o4+ | max_completion_tokens, temperature, response_format |
The o1 family is the awkward case. It accepts neither temperature nor response_format: json_object. When a category is routed to o1, chat_json falls back to prompt-only JSON enforcement and strips the format parameter entirely. All other families get the full parameter set, with the correct token budget key selected by regex against the model name.
This means users can point any category at any model — including o-series reasoning models or locally-hosted Ollama endpoints — and the client adjusts the API call automatically.
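A minimal sketch of that family dispatch — the real LLMClient._build_params surely differs in names and details, and the function signature here is invented for illustration:

```python
import re


def build_params(model: str, max_tokens: int = 1024,
                 temperature: float = 0.7, json_mode: bool = False) -> dict:
    """Select API parameters based on the model family (sketch)."""
    params: dict = {}
    if re.match(r"^o1", model):
        # o1 family: no temperature, no response_format; JSON is
        # enforced in the prompt instead.
        params["max_completion_tokens"] = max_tokens
        return params
    if re.match(r"^(gpt-4\.5|gpt-5|o3|o4)", model):
        params["max_completion_tokens"] = max_tokens
    else:
        # GPT-4 / GPT-4.1 family uses the legacy token budget key.
        params["max_tokens"] = max_tokens
    params["temperature"] = temperature
    if json_mode:
        params["response_format"] = {"type": "json_object"}
    return params
```

The important property is that the caller never branches on the model name; it asks for a token budget and JSON mode, and the builder translates.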
Configuration UI
The settings panel exposes all six categories as individual dropdowns with per-model pricing labels (e.g., GPT-4.1 Nano — $0.10 / $0.40). Recommended models and the reasoning behind each default are shown inline. Users can override any category independently and save — the change takes effect on the next LLM call with no restart.
The same settings endpoint accepts any OpenAI-compatible base URL, so the routing works equally against OpenAI, Azure OpenAI, Qwen, Ollama, or any other API-compatible backend.
The cost argument
The math is straightforward. At 500 simulation calls per run, the difference between routing simulation to gpt-4.1-nano ($0.10/M input) versus gpt-4.1 ($2.00/M input) is roughly a 20x cost reduction on the dominant line item — without touching quality for the steps where quality matters.
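A back-of-envelope check of the 20x figure, assuming roughly 2,000 input tokens per simulation call (an illustrative number, not measured data):

```python
CALLS = 500              # 50 agents x 10 rounds
TOKENS_PER_CALL = 2_000  # assumed average input tokens per call
NANO_PER_M = 0.10        # $ per 1M input tokens, gpt-4.1-nano
FULL_PER_M = 2.00        # $ per 1M input tokens, gpt-4.1

total_tokens = CALLS * TOKENS_PER_CALL       # 1,000,000 tokens per run
nano_cost = total_tokens / 1e6 * NANO_PER_M  # input cost on nano
full_cost = total_tokens / 1e6 * FULL_PER_M  # input cost on the full model
print(round(full_cost / nano_cost))          # prints 20
```

The ratio is independent of the assumed token count, since it cancels out; only the absolute dollar figures depend on it.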
We are still collecting real cost data across run sizes. The telemetry system records model, token counts, and cost per call to a per-session JSONL file, so we will have numbers to share in a follow-up post.