Meet forge — an experiment harness for open source, and the loop we want it to learn

23 June 2026 · #forge #meet-forge #open-source #experiment-harness #operationalization #roadmap #self-tuning

Every week, dozens of new open-source AI tools, frameworks, and research papers get posted on GitHub, Substack, X, and a few specialized aggregators. Some are genuinely useful. Some are well-intentioned but unfinished. A few are abandoned within a month. From the outside it's almost impossible to tell which is which without actually downloading each one, installing its dependencies, and trying to run it — which nobody has time to do for more than two or three a year.

forge is our solution to that problem. It's an automated harness that:

Watches our Slack #development channel for a 🧪 test-tube emoji (used as a "this looks interesting" marker).
For each marked item, clones the project from GitHub into a clean, sandboxed Docker container.
Tries to build it, run its own tests, and exercise it in a small bench.
Writes up what actually happened — what's solid, what's missing, what claims are true, which are aspirational.
Publishes the writeup on this blog with a companion gist holding the build log, the reproducibility recipe, and the artifact code.

The result is a steady cadence of careful, honest, reproducible reviews — not hype posts. Every article includes a layman-friendly intro (so non-technical readers can follow what's happening), a technical writeup, and the exact commands to reproduce what we did.

Where it gets interesting: we just upgraded forge to handle sources that aren't clonable repos — essays, research papers, hosted SaaS products. For those, forge now tries to build the system the source describes as a portable package, rather than just summarizing the source. The first article using the new pattern shipped today: an essay on Agentic Reinforcement Learning became a working Python package with 13 passing tests.

Down the road, we plan to point forge at its own design and let it propose improvements to itself. That's the "self-tuning loop" described in the roadmap below. We're piloting forge as a manually triggered tool first; once the pilot is stable, the self-tuning loop comes online.

What forge is

Forge is an experiment harness for open-source projects, hosted SaaS, and technical writing. It walks each candidate through a fixed lifecycle inside a no-secrets Docker sandbox, captures reproducibility anchors at every stage, and publishes a structured writeup on this blog. The bar: a future reader can reproduce every claim in every post from the gist alone, with no reference back to the original source needed.

The substrate lives at ~/forge/ on the operator's machine. It is a flat directory of YAML, NDJSON, and markdown — no database, no cloud service, no shared mutable state. The whole thing is portable; you could zip it up tomorrow and run it on a different laptop.

The plugin (the actual skills that read and write the substrate) lives at /Users/davidolsson/WORKSONA/forge-state/plugin/ and ships as ten coordinated skills. They are deliberately thin: each skill has one job, reads through the spine (forge-state), and never directly mutates the data plane.

How forge works — the lifecycle

Phase, not queue. Any experiment whose phase is not published is in the queue. The orchestrator walks the substrate every night, decides what to dispatch, and never holds in-memory state across runs — restart-safety by design.
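A minimal sketch of the "phase, not queue" idea, under stated assumptions: the post says the substrate is YAML and NDJSON, but this example uses one JSON file per experiment so that it runs on the standard library alone. The file layout, the field names `id` and `phase`, and the phase names `build` and `writeup` are hypothetical. Only `published` comes from the text above.

```python
import json
import tempfile
from pathlib import Path

PUBLISHED = "published"  # the terminal phase named in the post


def pending_experiments(substrate: Path) -> list[str]:
    """Return ids of experiments whose phase is not 'published'.

    Nothing is cached between calls: the queue is recomputed from disk
    every time, which is what makes a restart harmless.
    """
    pending = []
    for record_path in sorted(substrate.glob("EXP-*.json")):
        record = json.loads(record_path.read_text(encoding="utf-8"))
        if record.get("phase") != PUBLISHED:
            pending.append(record["id"])
    return pending


def demo() -> list[str]:
    """Build a throwaway substrate with three records and query it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical records; "build" and "writeup" are invented phase names.
        for exp_id, phase in [("EXP-0001", "published"),
                              ("EXP-0002", "build"),
                              ("EXP-0003", "writeup")]:
            (root / f"{exp_id}.json").write_text(
                json.dumps({"id": exp_id, "phase": phase}), encoding="utf-8")
        return pending_experiments(root)


if __name__ == "__main__":
    print(demo())
```

Because the queue is a pure function of what is on disk, there is no separate queue file that could drift out of sync with the experiment records.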

The two-plane isolation

Secrets never enter the data plane. Build artifacts never enter the substrate beyond their log + env manifest. This is what makes forge safe to point at arbitrary public code: an exploit in an npm postinstall script cannot reach the operator's GitHub token because the token simply isn't in the container.
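One way to picture the isolation is an allowlist on what crosses into the container. The sketch below is illustrative, not forge's actual launcher: the function name, the allowlist contents, the `node:20` image, and the build command are all assumptions. It only constructs the `docker run` argument list (using the standard `--rm` and `--env` flags) and does not start a container.

```python
import shlex

# Hypothetical allowlist: only these variables may cross into the container.
ALLOWED_ENV = ("LANG", "TZ")


def sandbox_command(image, build_cmd, host_env):
    """Build a `docker run` argv that forwards only allowlisted variables."""
    argv = ["docker", "run", "--rm"]
    for name in ALLOWED_ENV:
        if name in host_env:
            argv += ["--env", f"{name}={host_env[name]}"]
    # Anything not on the allowlist (tokens, HOME, ...) is simply never passed.
    argv += [image, "sh", "-c", build_cmd]
    return argv


# Illustrative host environment; the token value is a placeholder.
host_env = {
    "LANG": "C.UTF-8",
    "GITHUB_TOKEN": "not-a-real-token",
    "HOME": "/home/operator",
}
argv = sandbox_command("node:20", "npm install && npm test", host_env)
print(shlex.join(argv))
```

An allowlist fails closed: a secret added to the host environment later stays out of the container unless someone deliberately adds it to `ALLOWED_ENV`.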

What forge has shipped (first pilot run, June 2026)

#	slug	result	bench
EXP-0001	autowiki-factory-ai	pattern-note	Hosted SaaS, not clonable. Three reusable patterns extracted.
EXP-0002	cc-gateway	partial	16/16 rewriter tests pass; OAuth pre-flight gate identified as a key finding.
EXP-0003	cc-gateway-dashboard	success	200-line companion UI forge built for EXP-0002. 5/5 tests, SSE verified.
EXP-0004	road-to-machine-learning	partial	5 of 23 advertised projects have runnable code; the iris script that exists works.
EXP-0005	mentraos	strong	33/33 cloud protocol tests pass; real smart-glasses OS, 674 MB monorepo, 4 supported devices.
EXP-0006	agentic-rl	success	Essay → working `agentic-rl-runner` Python package (13/13 tests). New `article-as-spec` template.
EXP-0007	pinokio	partial	Install verified (863 packages); GUI probe deferred. Substantive finding: transfer-and-freeze security model.

Seven experiments, seven published posts, six gists, four working build artifacts, two new skills (forge-agentic-rl, this article's framework), one upgrade to the experimenter (non-build templates).

The operationalization rule (new)

EXP-0006 added a non-build experimenter template and codified what we expect when the source isn't a clonable repo:

The success criterion for every experiment: a future reader can pip-install / docker-pull / clone-and-run what we shipped, with no reference back to the original source needed. Forge is not a summary blog; if the source described a system, forge ships the system.

Practical use cases

Three concrete patterns of use forge supports today, with examples from the pilot:

1. Triage a Slack firehose of OSS recommendations. Someone on the team posts a link with a note like "Project X is interesting." React with 🧪. Forge picks it up overnight, runs the build, and reports back. You go from one project per month evaluated by hand to seven projects per night evaluated by harness, with a real reproducibility anchor for each.

2. Convert essays and research papers into running code. When Cameron R. Wolfe published the Agentic RL essay, forge applied the new article-as-spec template and produced a 500-line Python package implementing the harness the essay describes. Same pattern would work for a Karpathy YouTube transcript, an arxiv paper, a vendor blog post about a new technique.

3. Pre-flight a vendor product before you adopt it. EXP-0007 (Pinokio) verified that the install was clean and surfaced the security model in 60 seconds. That is the kind of "should we bring this in?" question forge can answer in minutes and that would otherwise take a developer half a day of careful reading.
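The Slack triage in the first use case can be sketched as a filter over reaction events. This assumes events shaped like Slack's `reaction_added` payload (`type`, `reaction`, `item`) and assumes the test-tube emoji's short name is `test_tube`. The function names and the URL pattern are hypothetical, and fetching the marked message's text is left out.

```python
import re

MARKER = "test_tube"  # assumed Slack short name for the test-tube emoji
GITHUB_URL = re.compile(r"https://github\.com/[\w.-]+/[\w.-]+")


def is_marker_event(event: dict) -> bool:
    """True when the event is a test-tube reaction on a message."""
    return (event.get("type") == "reaction_added"
            and event.get("reaction") == MARKER
            and event.get("item", {}).get("type") == "message")


def candidate_repos(message_text: str) -> list[str]:
    """Pull GitHub repository URLs out of the marked message text."""
    # Strip a trailing period so "…/project-x." at sentence end still matches.
    return [url.rstrip(".") for url in GITHUB_URL.findall(message_text)]


# Illustrative event and message; channel id and timestamp are made up.
event = {
    "type": "reaction_added",
    "reaction": "test_tube",
    "item": {"type": "message", "channel": "C0000000", "ts": "1750000000.000100"},
}
text = "Project X looks interesting: <https://github.com/example-org/project-x>"

if is_marker_event(event):
    print(candidate_repos(text))
```

Messages with no GitHub URL return an empty list here; those are the essay, paper, and SaaS cases that would need the non-build template instead.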

Roadmap — what's next, including the self-tuning loop

The self-tuning loop — what we mean

Forge has a substrate of its own (forge-state-spec.md, ~5,000 words), a plugin (ten skills), and a record of every experiment it has ever run. All of that is itself a clonable artifact. The plan, once the pilot is stable:

Forge runs an experiment on its own design. Source: forge-state-spec.md + the plugin source. Template: article-as-spec (the spec is the source; the harness is forge itself).
Forge proposes upgrades to itself. The experimenter runs a critic over each phase, flags the weakest skill (slowest, lowest-success-rate, lowest layman-readability), and proposes a minimal change.
Forge applies the upgrade, reruns the pilot batch, measures the delta. If success rate goes up or artifact portability goes up without regressions on other metrics, the change sticks. Otherwise it reverts.
The agentic-rl-runner shipped by EXP-0006 is the bench harness. Each forge-on-forge run is itself an agentic-RL trajectory; the rewards are the measured deltas. The runner already exists, the math already works; the missing piece is the policy that proposes the changes.
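The keep-or-revert rule in the third step can be written as a small decision function. This is a sketch of one reading of that rule, not the planned implementation: the `BatchMetrics` fields, the 0-to-1 scales, and the `tolerance` parameter are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchMetrics:
    """Hypothetical summary of one pilot-batch run (all scores in [0, 1])."""
    success_rate: float
    portability: float
    readability: float


def keep_change(before: BatchMetrics, after: BatchMetrics,
                tolerance: float = 0.0) -> bool:
    """Keep a change only if a target metric improved and nothing regressed."""
    fields = ("success_rate", "portability", "readability")
    deltas = {f: getattr(after, f) - getattr(before, f) for f in fields}
    # "Success rate goes up or artifact portability goes up ..."
    improved = deltas["success_rate"] > 0 or deltas["portability"] > 0
    # "... without regressions on other metrics."
    regressed = any(delta < -tolerance for delta in deltas.values())
    return improved and not regressed


baseline = BatchMetrics(success_rate=0.5, portability=0.5, readability=0.5)
better = BatchMetrics(success_rate=0.6, portability=0.5, readability=0.5)
traded_off = BatchMetrics(success_rate=0.6, portability=0.4, readability=0.5)

print(keep_change(baseline, better))      # improvement, no regression
print(keep_change(baseline, traded_off))  # improvement paid for by a regression
```

A non-zero `tolerance` would let small run-to-run noise pass without triggering a revert; picking that threshold is part of the policy work the post describes as still missing.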

The loop is deliberately listed as a roadmap item, not as something running in production. We're piloting forge as a manually triggered tool first because the self-tuning loop has to be safe, and "safe" here means: a bad forge-on-forge change cannot quietly delete itself or its substrate. Some of the work toward that is already in place (forge never mutates its substrate from the data plane, the activity log is append-only, every phase advance is signed). Some of it (a working revert + the policy itself) is still to come.
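To make "append-only" and "signed" concrete, here is a sketch of a log in which each phase advance carries an HMAC that also covers the previous entry's signature, so that editing or removing an entry other than the last one breaks the chain. The post does not say how forge signs entries, so the scheme, the field names, and the in-memory list standing in for an NDJSON file are all assumptions.

```python
import hashlib
import hmac
import json


def sign_entry(key: bytes, prev_sig: str, entry: dict) -> str:
    """HMAC-SHA256 over the previous signature plus the canonical entry."""
    payload = prev_sig.encode() + json.dumps(entry, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def append_advance(log_lines: list, key: bytes, exp_id: str, new_phase: str) -> None:
    """Append one signed phase advance; existing lines are never rewritten."""
    prev_sig = json.loads(log_lines[-1])["sig"] if log_lines else ""
    entry = {"exp": exp_id, "phase": new_phase}
    signed = dict(entry, sig=sign_entry(key, prev_sig, entry))
    log_lines.append(json.dumps(signed, sort_keys=True))


def verify(log_lines: list, key: bytes) -> bool:
    """Recompute the chain from the top; any edited line fails the check."""
    prev_sig = ""
    for raw in log_lines:
        line = json.loads(raw)
        sig = line.pop("sig")
        if not hmac.compare_digest(sig, sign_entry(key, prev_sig, line)):
            return False
        prev_sig = sig
    return True


key = b"control-plane-only-key"  # placeholder; would live outside the sandbox
log = []
append_advance(log, key, "EXP-0006", "build")
append_advance(log, key, "EXP-0006", "published")
print(verify(log, key))

tampered = [log[0].replace("build", "published"), log[1]]
print(verify(tampered, key))
```

The signing key belongs on the control plane, so code running inside the sandbox could at most append unsigned garbage that `verify` would then reject.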

Pieces you can use today

Read the published experiments. Each post on this blog is self-contained and walks through one project.
Clone the companion gists. Each post links to a public gist with the source, the build log, and a RUN.md. Anyone with docker can reproduce.
Use the agentic-rl-runner package. EXP-0006 shipped a 500-line Python package that runs multi-turn LLM-agent rollouts with GRPO + task-normalization. pip install it; bring your own policy.
Steal the patterns. The forge plugin is open. The substrate spec is open. The non-build operationalization rule is reusable in any context where you have technical writing that ought to be code.
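For readers unfamiliar with the GRPO step mentioned in the list above, the core of it is a group-relative advantage: each rollout's reward is compared with the mean of its own group and scaled by the group's standard deviation. The sketch below shows that standard formulation only. It is not the `agentic-rl-runner` API, the function names are hypothetical, and treating "task-normalization" as "normalize within each task's group" is one plausible reading rather than something the post confirms.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each rollout relative to its own group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_task_advantages(rewards_by_task):
    """Normalize within each task so easy and hard tasks share one scale."""
    return {task: group_relative_advantages(rewards)
            for task, rewards in rewards_by_task.items()}


# Illustrative rewards for two made-up tasks with very different raw scales.
rewards_by_task = {
    "easy-task": [0.9, 1.0, 1.0, 0.9],
    "hard-task": [0.0, 0.1, 0.0, 0.1],
}
for task, advantages in per_task_advantages(rewards_by_task).items():
    print(task, [round(a, 3) for a in advantages])
```

Normalizing per task means a rollout is rewarded for beating its siblings on the same task, not for happening to be assigned an easy one.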

What forge is not

Not a benchmark suite. Forge doesn't rank projects against each other. Each experiment stands alone.
Not a code review service. Forge tests that things build and run. It does not opine on style.
Not a security audit. Forge runs code in a sandbox; "did install succeed" is not the same as "is this safe to deploy."
Not autonomous yet. Everything in the pilot is operator-triggered. The self-tuning loop is the goal, not the current state.

Forge runs from ~/forge/, ships out of plugin/skills/forge-*, and posts here on /forge/. Every article includes its own reproducibility anchors. If you want to see what forge looks like in action, start with EXP-0006 — it's the most ambitious of the pilot run.
