Harvesting intelligence — pulling signals from Slack, Gmail, Docs, and blog

28 May 2026

What you need: A populated project-state/ substrate with surfaces configured in manifest.yaml.

The harvester pulls external signals into the project's inbox. It reads Slack messages, Gmail threads, Google Docs edits, and scsiwyg blog posts, but only the ones relevant to this project. Relevance is derived from the manifest, not from hardcoded rules.

What harvesting means

The pattern: the harvester reads external surfaces, filters for project relevance, and writes structured markdown files into project-state/documents/inbox/. From there, the document curator classifies them, links them to milestones and decisions, and promotes them to references.

The harvester is read-only on external surfaces. It never sends a message, modifies a document, or publishes a post. It only reads and writes locally.

Relevance from the manifest

The harvester builds its filter from two sections of manifest.yaml:

Surfaces — which channels, email addresses, Drive folders, and blog slugs to watch:

surfaces:
  slack:
    enabled: true
    channel: "#project-updates"
    extra_channels: []
  gmail:
    enabled: true
    from_identity: "david@atomic47.co"
    keywords: ["protein extraction", "PCAIS"]
  gdocs:
    enabled: true
    gdocs_root: "1a2b3c4d5e"
  scsiwyg:
    enabled: true
    site_slug: "project-state"
    watch_sites: ["partner-blog"]

Contacts — which people are part of this project (from the consortium or stakeholder list). Messages from or to these contacts are automatically relevant.

The combination means: "watch these channels for anything from these people about these topics." A message in #project-updates from a consortium member about protein extraction hits all three signals: channel, contact, and keyword. A random message in #general from an unknown person is ignored.
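As an illustration of how those signals might be collected, here is a minimal Python sketch. The `Manifest` and `Message` classes, the `relevance_signals` function, and the `keyword_match` key are all hypothetical names invented for this example; only `contact_match` and `channel_match` appear in the harvester's actual output. How the real harvester weighs the signals against each other is not specified here, so the sketch only gathers them and treats an empty list as "ignore".

```python
from dataclasses import dataclass


@dataclass
class Manifest:
    """Hypothetical in-memory view of the manifest fields used for filtering."""
    channels: set[str]   # surfaces.slack.channel plus extra_channels
    keywords: list[str]  # surfaces.gmail.keywords
    contacts: set[str]   # lowercase email addresses from the stakeholder list


@dataclass
class Message:
    channel: str
    author_contact: str
    text: str


def relevance_signals(msg: Message, manifest: Manifest) -> list[dict[str, str]]:
    """Return one entry per matching signal; an empty list means 'ignore'."""
    signals: list[dict[str, str]] = []
    if msg.author_contact.lower() in manifest.contacts:
        signals.append({"contact_match": msg.author_contact})
    if msg.channel in manifest.channels:
        signals.append({"channel_match": msg.channel})
    lowered = msg.text.lower()
    for keyword in manifest.keywords:
        # Case-insensitive substring match; an assumption, not confirmed behavior.
        if keyword.lower() in lowered:
            signals.append({"keyword_match": keyword})
    return signals
```

A message from a known contact in a watched channel that mentions a configured keyword yields three entries, while a message that matches nothing yields an empty list.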

Running the harvester

/project-harvester

Or in natural language:

Harvest signals from Slack and Gmail

The harvester runs through four surfaces in order:

Slack — reads configured channels since the cursor, plus DMs from known contacts
Gmail — searches threads involving known contacts or matching keywords
Google Docs — scans the configured Drive folder for modifications since the cursor
scsiwyg — checks the project's own blog and watched partner blogs for new posts

Surfaces can also be run individually or in combination:

/project-harvester --surface slack
/project-harvester --surface gmail,gdocs

After processing, the harvester reports:

## Harvest complete — 2026-05-28

| Surface  | Items found | Written | Skipped (dup) | Errors |
|----------|-------------|---------|---------------|--------|
| Slack    | 8           | 6       | 2             | 0      |
| Gmail    | 3           | 3       | 0             | 0      |
| GDocs    | 1           | 1       | 0             | 0      |
| scsiwyg  | 2           | 2       | 0             | 0      |

12 new docs in project-state/documents/inbox/
→ Run /project-document-curator to classify.

What gets written

Each harvested item becomes a markdown file in documents/inbox/:

2026-05-28-slack-project-updates-standup.md
2026-05-28-gmail-terrasense-intro-thread.md
2026-05-28-gdocs-experiment-protocol-v3.md
2026-05-28-scsiwyg-partner-blog-update.md
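The names above appear to follow a date, surface, slug pattern. A small Python sketch of that convention follows; the `inbox_filename` helper and its slug rule (lowercase, runs of non-alphanumerics collapsed to hyphens) are assumptions inferred from the examples, not the harvester's documented behavior.

```python
import re
from datetime import date


def inbox_filename(day: date, surface: str, title: str) -> str:
    """Build a '<date>-<surface>-<slug>.md' name (hypothetical helper)."""
    # Lowercase, collapse anything that is not a-z or 0-9 into single hyphens,
    # then trim hyphens left at either end.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{day.isoformat()}-{surface}-{slug}.md"
```

Under these assumptions, `inbox_filename(date(2026, 5, 28), "gdocs", "Experiment Protocol v3")` produces `2026-05-28-gdocs-experiment-protocol-v3.md`.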

The file format includes structured frontmatter:

---
source: slack
source_id: "C123/1714389612.123456"
harvested_at: "2026-05-28T12:00:00Z"
surface_timestamp: "2026-05-28T09:30:00Z"
author: "Jane Smith"
author_contact: "jane@acme.com"
channel: "#project-updates"
relevance_signals:
  - contact_match: "jane@acme.com"
  - channel_match: "#project-updates"
status: inbox
---

# Daily standup — Jane Smith

Finished the enzyme characterization runs yesterday.
Results look promising — yield at 42% which is above
our 35% target. Will write up the protocol update today.

---
_Harvested by project-harvester from slack on 2026-05-28._

The relevance_signals array tells the document curator why this item was flagged. The status: inbox field is what the curator looks for when scanning for unclassified documents.
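To make the role of the status field concrete, here is a rough Python sketch of how a scan for unclassified documents could work. Both function names are hypothetical, and the frontmatter reader is deliberately naive: it only looks for a top-level `status:` line between the opening and closing `---` markers, where a real curator would more likely use a YAML parser.

```python
from pathlib import Path
from typing import Optional


def frontmatter_status(text: str) -> Optional[str]:
    """Return the top-level 'status' value from a frontmatter block, if any."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # no frontmatter at the top of the file
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter; ignore anything in the body
        if line.startswith("status:"):
            return line.split(":", 1)[1].strip().strip('"')
    return None


def unclassified(inbox: Path) -> list[Path]:
    """List markdown files in the inbox whose frontmatter says 'status: inbox'."""
    return sorted(
        path
        for path in inbox.glob("*.md")
        if frontmatter_status(path.read_text(encoding="utf-8")) == "inbox"
    )
```

Calling `unclassified(Path("project-state/documents/inbox"))` would return the files still waiting for classification, assuming that layout.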

The full loop

Harvesting is step one of a four-step intelligence pipeline:

Harvest — project-harvester pulls signals into documents/inbox/
Triage — project-inbox scores relevance, identifies imprint documents, routes items
Classify — project-document-curator assigns types, links to milestones/decisions, adds metadata
Promote — curator moves classified documents from inbox/ to references/ or published/

The orchestrator runs this pipeline as the first step of its daily routine — harvest, then triage, then classify — so that fresh intelligence is available before it checks milestones, deadlines, and reports.

Cursor management

Cursors prevent re-harvesting the same content. They're stored in state.json:

{
  "harvest_cursors": {
    "scsiwyg": "2026-05-19T12:15:00.000Z",
    "gmail": "2026-05-19T00:00:00.000Z",
    "slack": "2026-05-19T00:00:00.000Z",
    "gdocs": "2026-05-20T16:10:00.000Z"
  }
}

Each cursor records the timestamp of the last successfully harvested item on that surface. On the next run, the harvester only reads items newer than the cursor. Cursors advance only after a surface is fully harvested without errors: if the Gmail harvest fails partway through, its cursor stays put, so the next run retries from the same point and deduplication skips anything that was already written.
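The advance-only-on-success rule can be sketched in a few lines of Python. This is an illustration under assumptions: `harvest_surface` and the `fetch` callback are hypothetical, and the sketch assumes all timestamps share the UTC format shown above, so that comparing them as strings matches chronological order.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def harvest_surface(
    state_path: Path,
    surface: str,
    fetch: Callable[[Optional[str]], list[str]],
) -> bool:
    """Run one surface; move its cursor only if the whole fetch succeeds.

    `fetch(cursor)` is a hypothetical callback that returns the ISO timestamps
    of the items it harvested, or raises if anything goes wrong.
    """
    state = json.loads(state_path.read_text(encoding="utf-8"))
    cursors = state.setdefault("harvest_cursors", {})
    try:
        timestamps = fetch(cursors.get(surface))
    except Exception:
        return False  # failure: state.json is left untouched
    if timestamps:
        # Same-format UTC strings sort chronologically, so max() is the newest.
        cursors[surface] = max(timestamps)
        state_path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return True
```

Writing the state file only on the success path is what keeps a partial failure from skipping items on the next run.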

When a cursor is missing (first run or new surface), the default lookback is 7 days. You can override this:

/project-harvester --since 30d
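One way to picture how the start of the harvest window is chosen is the Python sketch below. The helper names are hypothetical, only the `d` (days) unit shown above is handled, and the sketch reads "override" narrowly: `--since` replaces the 7-day default when no cursor exists, while an existing cursor still wins. Whether the real flag also overrides an existing cursor is not stated here.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_LOOKBACK = timedelta(days=7)


def parse_since(value: str) -> timedelta:
    """Parse a '<N>d' lookback such as '30d' (other units are not assumed)."""
    match = re.fullmatch(r"(\d+)d", value.strip())
    if match is None:
        raise ValueError(f"unsupported --since value: {value!r}")
    return timedelta(days=int(match.group(1)))


def window_start(cursor: Optional[str], now: datetime, since: Optional[str] = None) -> datetime:
    """Pick the earliest timestamp the next harvest should read from."""
    if cursor is not None:
        # Convert a trailing 'Z' so fromisoformat accepts it on older Pythons.
        return datetime.fromisoformat(cursor.replace("Z", "+00:00"))
    lookback = parse_since(since) if since is not None else DEFAULT_LOOKBACK
    return now - lookback


now = datetime(2026, 5, 28, 12, 0, tzinfo=timezone.utc)
print(window_start(None, now))               # no cursor: 7 days back
print(window_start(None, now, since="30d"))  # no cursor: 30 days back
```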

To re-harvest without moving cursors (useful for testing):

/project-harvester --no-advance-cursor

To preview what would be harvested without writing anything:

/project-harvester --dry-run

Deduplication

The harvester deduplicates by source_id. A hash of each {surface}:{source_id} pair is stored in harvest/seen.json. If the same Slack message or Gmail thread appears in a subsequent run (because of overlapping time windows or cursor resets), it's skipped rather than written again, and counted in the report's "Skipped (dup)" column.

The seen set is append-only and stays small — one 12-byte hash per harvested item.
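A minimal Python sketch of such a seen set follows. The hash algorithm is not stated here, so truncating a SHA-256 digest to 12 bytes is purely an assumption, as are the `seen_key` and `is_new` names. The set is kept in memory for brevity; persisting it to harvest/seen.json is left out.

```python
import hashlib


def seen_key(surface: str, source_id: str) -> str:
    """Hash '{surface}:{source_id}' down to 12 bytes (24 hex characters)."""
    digest = hashlib.sha256(f"{surface}:{source_id}".encode("utf-8")).digest()
    return digest[:12].hex()  # truncated SHA-256: an assumed choice of hash


def is_new(seen: set[str], surface: str, source_id: str) -> bool:
    """Record the item and report whether it had not been seen before."""
    key = seen_key(surface, source_id)
    if key in seen:
        return False  # duplicate: the caller skips writing it
    seen.add(key)     # append-only: keys are never removed
    return True


seen: set[str] = set()
print(is_new(seen, "slack", "C123/1714389612.123456"))  # True
print(is_new(seen, "slack", "C123/1714389612.123456"))  # False
```

Including the surface in the hashed string keeps identical IDs from two different surfaces from colliding.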

What the harvester does not do

Does not classify or promote documents — that's project-document-curator
Does not modify anything on external surfaces — it's read-only
Does not harvest GitHub — commits are tracked through project-git and the kanban's milestone linking
Does not replace the work-state harvesters — it's a project-scoped lens on the same surfaces

The harvester is the input side of the system. It answers: "what happened in the outside world that matters to this project?" Everything downstream — triage, classification, promotion, reporting — operates on what the harvester deposited in the inbox.

The full intelligence loop — from raw signal to classified reference to generated report — is what makes project-state a system where reporting is a byproduct of normal work, rather than a separate activity that competes with it.