Harvesting intelligence — pulling signals from Slack, Gmail, Docs, and blog
What you need: A populated project-state/ substrate with surfaces configured in manifest.yaml.
The harvester pulls external signals into the project's inbox. It reads Slack messages, Gmail threads, Google Docs edits, and scsiwyg blog posts, but only the ones relevant to this project. Relevance is derived from the manifest, not from hardcoded rules.
What harvesting means
The pattern: the harvester reads external surfaces, filters for project relevance, and writes structured markdown files into project-state/documents/inbox/. From there, the document curator classifies them, links them to milestones and decisions, and promotes them to references.
The harvester is read-only on external surfaces. It never sends a message, modifies a document, or publishes a post. It reads from those surfaces, and everything it writes is local.
Relevance from the manifest
The harvester builds its filter from two sections of manifest.yaml:
Surfaces — which channels, email addresses, Drive folders, and blog slugs to watch:
surfaces:
  slack:
    enabled: true
    channel: "#project-updates"
    extra_channels: []
  gmail:
    enabled: true
    from_identity: "david@atomic47.co"
    keywords: ["protein extraction", "PCAIS"]
  gdocs:
    enabled: true
    gdocs_root: "1a2b3c4d5e"
  scsiwyg:
    enabled: true
    site_slug: "project-state"
    watch_sites: ["partner-blog"]
Contacts — which people are part of this project (from the consortium or stakeholder list). Messages from or to these contacts are automatically relevant.
The combination means: "watch these channels for anything from these people about these topics." A message in #project-updates from a consortium member about protein extraction hits all three signals. A random message in #general from an unknown person is ignored.
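As a rough illustration of how the channel, contact, and keyword signals might combine, here is a minimal Python sketch. The `Manifest` class, the `relevance_signals` function, and the `keyword_match` signal name are hypothetical; only `contact_match` and `channel_match` appear in the harvester's documented output, and the real matching logic may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Manifest:
    """Hypothetical, simplified view of the relevant manifest.yaml sections."""
    channels: set[str] = field(default_factory=set)
    contacts: set[str] = field(default_factory=set)
    keywords: list[str] = field(default_factory=list)


def relevance_signals(manifest: Manifest, channel: str, author: str, text: str) -> list[dict]:
    """Return the signals an item matched; an empty list means 'ignore'."""
    signals = []
    # Contact signal: the author is a known project contact.
    if author.lower() in {c.lower() for c in manifest.contacts}:
        signals.append({"contact_match": author})
    # Channel signal: the item came from a watched channel.
    if channel in manifest.channels:
        signals.append({"channel_match": channel})
    # Keyword signal (assumed name): the text mentions a configured topic.
    lowered = text.lower()
    for keyword in manifest.keywords:
        if keyword.lower() in lowered:
            signals.append({"keyword_match": keyword})
    return signals


manifest = Manifest(
    channels={"#project-updates"},
    contacts={"jane@acme.com"},
    keywords=["protein extraction", "PCAIS"],
)

# A consortium member posting about a tracked topic in a watched channel.
hit = relevance_signals(manifest, "#project-updates", "jane@acme.com",
                        "Protein extraction yield is up")
# An unknown person in an unwatched channel.
miss = relevance_signals(manifest, "#general", "stranger@example.com", "lunch?")
```

In this sketch `hit` carries three signals and `miss` is empty, mirroring the two cases described above.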
Running the harvester
/project-harvester
Or naturally:
Harvest signals from Slack and Gmail
The harvester runs through four surfaces in order:
- Slack — reads configured channels since the cursor, plus DMs from known contacts
- Gmail — searches threads involving known contacts or matching keywords
- Google Docs — scans the configured Drive folder for modifications since the cursor
- scsiwyg — checks the project's own blog and watched partner blogs for new posts
Each surface can be run independently:
/project-harvester --surface slack
/project-harvester --surface gmail,gdocs
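One way the `--surface` value could be turned into a run list is sketched below. The `select_surfaces` helper is hypothetical, and it assumes that a subset is still processed in the default surface order and that unknown names are rejected; the actual harvester may behave differently.

```python
from typing import List, Optional

# Default processing order, as listed above.
SURFACE_ORDER = ["slack", "gmail", "gdocs", "scsiwyg"]


def select_surfaces(flag_value: Optional[str]) -> List[str]:
    """Turn a comma-separated --surface value into an ordered run list."""
    if not flag_value:
        return list(SURFACE_ORDER)  # no flag: run every surface
    requested = {name.strip().lower() for name in flag_value.split(",") if name.strip()}
    unknown = requested - set(SURFACE_ORDER)
    if unknown:
        raise ValueError(f"unknown surface(s): {', '.join(sorted(unknown))}")
    # Keep the default order regardless of how the flag was written (assumption).
    return [name for name in SURFACE_ORDER if name in requested]
```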
After processing, the harvester reports:
## Harvest complete — 2026-05-28
| Surface | Items found | Written | Skipped (dup) | Errors |
|----------|-------------|---------|---------------|--------|
| Slack | 8 | 6 | 2 | 0 |
| Gmail | 3 | 3 | 0 | 0 |
| GDocs | 1 | 1 | 0 | 0 |
| scsiwyg | 2 | 2 | 0 | 0 |
12 new docs in project-state/documents/inbox/
→ Run /project-document-curator to classify.
What gets written
Each harvested item becomes a markdown file in documents/inbox/:
2026-05-28-slack-project-updates-standup.md
2026-05-28-gmail-terrasense-intro-thread.md
2026-05-28-gdocs-experiment-protocol-v3.md
2026-05-28-scsiwyg-partner-blog-update.md
The file format includes structured frontmatter:
---
source: slack
source_id: "C123/1714389612.123456"
harvested_at: "2026-05-28T12:00:00Z"
surface_timestamp: "2026-05-28T09:30:00Z"
author: "Jane Smith"
author_contact: "jane@acme.com"
channel: "#project-updates"
relevance_signals:
  - contact_match: "jane@acme.com"
  - channel_match: "#project-updates"
status: inbox
---
# Daily standup — Jane Smith
Finished the enzyme characterization runs yesterday.
Results look promising — yield at 42% which is above
our 35% target. Will write up the protocol update today.
---
_Harvested by project-harvester from slack on 2026-05-28._
The relevance_signals array tells the document curator why this item was flagged. The status: inbox field is what the curator looks for when scanning for unclassified documents.
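To make the file format concrete, the following Python sketch renders a document in roughly this shape and checks for the `status: inbox` marker. Both `render_inbox_doc` and `is_unclassified` are hypothetical helpers written with the standard library only; the real harvester and curator presumably use a proper YAML parser, and this sketch quotes every scalar, which is valid YAML but not identical to the sample above.

```python
import json


def render_inbox_doc(meta: dict, signals: list, title: str, body: str) -> str:
    """Render frontmatter plus body for a file in documents/inbox/."""
    lines = ["---"]
    for key, value in meta.items():
        # JSON string quoting is also valid YAML double-quoted scalar syntax.
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("relevance_signals:")
    for signal in signals:
        for name, value in signal.items():
            lines.append(f"  - {name}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("status: inbox")
    lines.append("---")
    lines.append(f"# {title}")
    lines.append(body)
    return "\n".join(lines) + "\n"


def is_unclassified(document: str) -> bool:
    """True if the leading frontmatter block contains the line 'status: inbox'."""
    parts = document.split("---\n", 2)
    if len(parts) < 3 or parts[0] != "":
        return False  # no frontmatter at the top of the file
    return "status: inbox" in parts[1].splitlines()


doc = render_inbox_doc(
    {"source": "slack", "author": "Jane Smith", "channel": "#project-updates"},
    [{"contact_match": "jane@acme.com"}, {"channel_match": "#project-updates"}],
    "Daily standup (Jane Smith)",
    "Finished the enzyme characterization runs yesterday.",
)
```

Because `is_unclassified` only inspects the frontmatter block, a body that happens to contain the text "status: inbox" would not be mistaken for an unclassified document.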
The full loop
Harvesting is step one of a four-step intelligence pipeline:
- Harvest: project-harvester pulls signals into documents/inbox/
- Triage: project-inbox scores relevance, identifies imprint documents, routes items
- Classify: project-document-curator assigns types, links to milestones/decisions, adds metadata
- Promote: curator moves classified documents from inbox/ to references/ or published/
The orchestrator runs this pipeline as the first step of its daily routine — harvest, then triage, then classify — so that fresh intelligence is available before it checks milestones, deadlines, and reports.
Cursor management
Cursors prevent re-harvesting the same content. They're stored in state.json:
{
  "harvest_cursors": {
    "scsiwyg": "2026-05-19T12:15:00.000Z",
    "gmail": "2026-05-19T00:00:00.000Z",
    "slack": "2026-05-19T00:00:00.000Z",
    "gdocs": "2026-05-20T16:10:00.000Z"
  }
}
Each cursor records the timestamp of the last successfully harvested item on that surface. On the next run, the harvester only reads items newer than the cursor. Cursors advance only after a surface is fully harvested without errors: if the Gmail harvest fails partway through, its cursor stays put, so the next run starts again from the same point and any items that were already written are skipped by deduplication (see below).
When a cursor is missing (first run or new surface), the default lookback is 7 days. You can override this:
/project-harvester --since 30d
To re-harvest without moving cursors (useful for testing):
/project-harvester --no-advance-cursor
To preview what would be harvested without writing anything:
/project-harvester --dry-run
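The cursor rules can be sketched in Python as follows. The function names are hypothetical, and two behaviours are assumptions rather than documented facts: that `--since` replaces the 7-day default only when a cursor is missing, and that `--dry-run` leaves cursors untouched in the same way as `--no-advance-cursor`.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

DEFAULT_LOOKBACK = timedelta(days=7)


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is rewritten for older Pythons."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def window_start(cursors: Dict[str, str], surface: str, now: datetime,
                 since: Optional[timedelta] = None) -> datetime:
    """Earliest timestamp to read for a surface on this run."""
    if surface in cursors:
        return parse_ts(cursors[surface])
    # Missing cursor: fall back to --since if given, else the 7-day default.
    return now - (since if since is not None else DEFAULT_LOOKBACK)


def advance_cursor(cursors: Dict[str, str], surface: str, item_timestamps: List[str],
                   had_errors: bool, no_advance: bool = False,
                   dry_run: bool = False) -> Dict[str, str]:
    """Return updated cursors; unchanged on errors, --no-advance-cursor, or --dry-run."""
    if had_errors or no_advance or dry_run or not item_timestamps:
        return cursors
    latest = max(parse_ts(ts) for ts in item_timestamps)
    updated = dict(cursors)
    updated[surface] = latest.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    return updated


now = datetime(2026, 5, 28, 12, 0, tzinfo=timezone.utc)
cursors = {"slack": "2026-05-19T00:00:00.000Z"}
after_ok = advance_cursor(cursors, "slack", ["2026-05-28T09:30:00Z"], had_errors=False)
after_fail = advance_cursor(cursors, "slack", ["2026-05-28T09:30:00Z"], had_errors=True)
```

Returning a new dictionary instead of mutating the input keeps the "stay put on failure" case trivially safe: the caller only persists the result to state.json when it differs from what was loaded.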
Deduplication
The harvester deduplicates by source_id. A hash of each {surface}:{source_id} pair is stored in harvest/seen.json. If the same Slack message or Gmail thread appears in a subsequent run (because of overlapping time windows or cursor resets), it's silently skipped.
The seen set is append-only and stays small — one 12-byte hash per harvested item.
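A minimal Python sketch of this deduplication scheme is shown below. The hash algorithm (SHA-256 truncated to 12 bytes), the hex encoding, and the JSON-array layout of seen.json are all assumptions made for illustration; the source only states that a 12-byte hash of `{surface}:{source_id}` is stored.

```python
import hashlib
import json
from pathlib import Path


def seen_key(surface: str, source_id: str) -> str:
    """12-byte digest of '{surface}:{source_id}', hex-encoded (24 characters)."""
    digest = hashlib.sha256(f"{surface}:{source_id}".encode("utf-8")).digest()
    return digest[:12].hex()


def load_seen(path: Path) -> set[str]:
    """Load the seen set; a missing file means nothing has been harvested yet."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def is_new(seen: set[str], surface: str, source_id: str) -> bool:
    """Record the item and return True the first time it is encountered."""
    key = seen_key(surface, source_id)
    if key in seen:
        return False  # duplicate: skip silently
    seen.add(key)  # append-only: keys are never removed
    return True


def save_seen(path: Path, seen: set[str]) -> None:
    """Persist the seen set as a sorted JSON array (assumed layout)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(sorted(seen)), encoding="utf-8")


seen: set[str] = set()
first = is_new(seen, "slack", "C123/1714389612.123456")
second = is_new(seen, "slack", "C123/1714389612.123456")
```

Including the surface name in the hashed string means identical IDs on two different surfaces do not collide.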
What the harvester does not do
- Does not classify or promote documents: that's project-document-curator
- Does not modify anything on external surfaces: it's read-only
- Does not harvest GitHub: commits are tracked through project-git and the kanban's milestone linking
- Does not replace the work-state harvesters: it's a project-scoped lens on the same surfaces
The harvester is the input side of the system. It answers: "what happened in the outside world that matters to this project?" Everything downstream — triage, classification, promotion, reporting — operates on what the harvester deposited in the inbox.
The full intelligence loop — from raw signal to classified reference to generated report — is what makes project-state a system where reporting is a byproduct of normal work, rather than a separate activity that competes with it.