EXP-0023 — chunkr: 13 services of Rust document intelligence, AGPL-fortified
#forge #experiment #rust #document-intelligence #rag #agpl #docker-compose
David Olsson
If your job is "feed PDFs into an AI" (contracts, manuals, research papers, anything with tables and images and headers and footnotes), most of the actual work happens before the AI ever sees the document. You have to figure out where the page ends and the table begins. Pull text out of pictures. Tag headers as headers, body as body, captions as captions. Break the result into chunks small enough for the AI to digest. chunkr is one open-source project that ships all of that as a single Docker stack. It's also one of the largest things forge has ever benched: 13 services in the full deploy, written in Rust, with a commercial-license tier behind it.
Summary
Forge benched chunkr (Lumina AI Inc, 3,747⭐, AGPL-3.0, Rust) on 2026-06-29 via Slack 🧪. The full stack's compose.yaml lists 13 services — Postgres + Redis + MinIO + Keycloak + a layout/segmentation/OCR triplet + the Rust server + the web frontend + admin tooling. The forge sandbox doesn't have budget to spin all that up; the bench was the tpa-pin-and-bench no-spin-up variant.
Verdict: strong-shape (structural). The advertised system matches the tree, the multi-tier license model is real, and the deploy weight is consistent with the enterprise positioning.
Pinned
commit: 1bde59beccf9a429af2c63bccd659316c2b4cf3d, AGPL-3.0 + commercial-license tier.
What it is
A production-grade self-hostable document-intelligence pipeline:
- Layout analysis — find tables, figures, headers, body, captions
- OCR + bounding boxes
- Structured output — HTML and Markdown
- VLM processing — vision-language model for complex regions
PDFs / DOCX / PPTX / images go in. RAG-ready chunks come out.
What's notable
1. Three explicit tiers, three explicit deploy variants.
The README's tier matrix is unusually honest:
| tier | layout | OCR | VLM | Excel |
|---|---|---|---|---|
| Open-source (AGPL) | community models | community OCR | basic open VLM | ❌ |
| Cloud API (chunkr.ai) | proprietary | optimized | enhanced | ✅ |
| Enterprise | proprietary + custom-tuned | optimized + domain-tuned | custom fine-tunes | ✅ |
And then three compose files: compose.yaml (13 services, full Linux GPU), compose.mac.yaml (7 services, Apple Silicon, no NVIDIA), compose.cpu.yaml (3 services, CPU-only overrides). Most projects ship one compose file and tell Mac users "good luck." chunkr ships three. That's a meaningful signal about engineering discipline.
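Service counts like these can be read from a compose file without starting a container. A minimal sketch, assuming Docker Compose v2 is installed and that `docker compose config --services` resolves each file as expected; the `count_services` helper is a hypothetical name, not something from the chunkr repo or the forge tooling:

```shell
# Print the number of services Compose resolves from the given file(s).
# Hypothetical helper for illustration.
# Usage: count_services compose.yaml [override.yaml ...]
count_services() {
  args=""
  for f in "$@"; do
    args="$args -f $f"
  done
  # `docker compose config --services` prints one service name per line.
  # $args is left unquoted on purpose so it splits into separate flags,
  # which means this sketch assumes file names without spaces.
  docker compose $args config --services | wc -l | tr -d ' '
}
```

Calling `count_services compose.yaml` gives the count for the full stack. If compose.cpu.yaml is an override file rather than a standalone one, it would probably need to be layered on the base, as in `count_services compose.yaml compose.cpu.yaml`, and the merged count will then differ from the number of services the override file itself declares.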
2. Rust + 7 in-house Dockerfiles.
One root Cargo.toml, 98 .rs files, 7 Dockerfiles. Lumina is investing in tight control of the deploy surface rather than gluing community images together. Consistent with the "we have a commercial tier" story.
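Tree-shape figures of this kind come from plain file counts on the checked-out repo. A rough sketch of how such counts might be taken; the helper names and the `Dockerfile*` naming pattern are assumptions, and the source does not say exactly how the 98 and 7 were measured:

```shell
# Count Rust source files under a directory (default: current dir),
# skipping anything inside .git.
count_rs() {
  find "${1:-.}" -type f -name '*.rs' ! -path '*/.git/*' | wc -l | tr -d ' '
}

# Count Dockerfiles, assuming they are named Dockerfile or Dockerfile.<suffix>.
count_dockerfiles() {
  find "${1:-.}" -type f -name 'Dockerfile*' ! -path '*/.git/*' | wc -l | tr -d ' '
}
```

Run from the repo root as `count_rs .` and `count_dockerfiles .`. A repo that names its Dockerfiles differently (for example `web.Dockerfile`) would need a different pattern.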
3. AGPL as commercial moat.
This is forge's first AGPL bench. Prior benches were MIT / Apache / GPL-3. AGPL means anyone who runs a modified chunkr as a network service must offer the modified source to that service's users. Lumina is using the license to say "self-host all you want, but you can't out-SaaS me." It is the same pattern Outlines (EXP-0011) and Yuxi (EXP-0021) follow, with different mechanisms.
4. No agent-instruction files.
No AGENTS.md / CLAUDE.md / SKILL.md in the tree. chunkr is a service product, not an agent harness. Clean counter-example: not everything in 2026 OSS adopts the SKILL.md convention forge tracks. Useful data point.
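The absence check is easy to repeat on any checkout. A small sketch, assuming the three file names listed above are the only ones of interest; `find_agent_files` is a hypothetical helper name:

```shell
# List any agent-instruction files under a directory (default: current dir).
# Prints nothing when none are present.
find_agent_files() {
  find "${1:-.}" -type f \
    \( -name 'AGENTS.md' -o -name 'CLAUDE.md' -o -name 'SKILL.md' \) \
    ! -path '*/.git/*'
}
```

Empty output from `find_agent_files .` at the repo root corresponds to the "none in the tree" result reported here.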
Position vs prior benches
| project | role | language | services in compose |
|---|---|---|---|
| Yuxi (EXP-0021) | agent harness | Python | 6+ (incl. Milvus/Neo4j) |
| chunkr (EXP-0023) | document service | Rust | 13 |
| sift-kg (EXP-0020) | KG CLI | Python | 0 |
| graphify (EXP-0018) | KG CLI | Python | 0 |
chunkr is the only Rust project in the doc-pipeline cohort. Forge's first Rust bench at scale.
What I didn't run
A full docker compose up against the 13-service stack. That would have pulled ~5-8 GB of images and taken 15-20 minutes from clone to first request — beyond the per-experiment budget. Verifying the actual throughput claims (which is where Rust matters) requires that run; the bench can verify shape, not speed.
Install
git clone --depth 1 https://github.com/lumina-ai-inc/chunkr.git
cd chunkr
docker compose up # full stack
# or:
docker compose -f compose.mac.yaml up # Apple Silicon
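Note that `git clone --depth 1` fetches whatever the default branch points at today, not the pinned commit. A hedged sketch of one way to land on the exact commit with a shallow fetch, assuming the remote allows fetching a commit by its SHA (GitHub generally does); `checkout_pinned` is a hypothetical helper, not part of chunkr or forge:

```shell
# Shallow-fetch one specific commit into a fresh directory and check it out.
# Usage: checkout_pinned <repo-url> <commit-sha> <dest-dir>
checkout_pinned() {
  repo_url=$1
  sha=$2
  dest=$3
  git init -q "$dest" &&
    git -C "$dest" remote add origin "$repo_url" &&
    git -C "$dest" fetch -q --depth 1 origin "$sha" &&
    git -C "$dest" checkout -q FETCH_HEAD   # leaves a detached HEAD at the pin
}
```

For this bench the call would be `checkout_pinned https://github.com/lumina-ai-inc/chunkr.git 1bde59beccf9a429af2c63bccd659316c2b4cf3d chunkr`, followed by the same `cd chunkr` and `docker compose` commands as above.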
Sources
- https://github.com/lumina-ai-inc/chunkr (pinned 1bde59b) - README + compose.yaml / compose.cpu.yaml / compose.mac.yaml
- https://www.chunkr.ai (Cloud API tier)
- Prior benches: EXP-0018 graphify, EXP-0020 sift-kg, EXP-0021 Yuxi