Signal Harvester: Building a Living Intelligence Platform from Team Discourse
#signal-harvester #intelligence #skills #architecture #knowledge-graph #thought-leadership #cowork
David Olsson

Every team produces more signal than it captures. Slack messages scroll past. LinkedIn posts get a like and vanish. Email threads reach a conclusion nobody indexes. Articles get shared, discussed for an afternoon, and forgotten. The intellectual output of a team — the positions they take, the resources they surface, the debates they have, the themes they circle back to — evaporates unless someone does the work to capture it.
Nobody does that work. It's too tedious, too manual, and too fragmented across too many surfaces.
Signal Harvester exists because we got tired of losing our own thinking.
The Origin
The idea started with a simple frustration: our team was having genuinely good conversations across Slack, LinkedIn, email, and in response to articles we were reading — and none of it was being captured in a way that compounded. Each conversation was ephemeral. The same topic would come up three weeks later and we'd reconstruct the same arguments from scratch, having forgotten who said what and which articles informed the last round.
We'd been building skill-based systems in Cowork for a while — orchestrated pipelines for tarot reading services, code audits, newsletter publishing, narrative state management. We had patterns that worked: the orchestrator-specialist composition from our tarot pipeline, the decomposed state directories from our narrative engine, label-based deduplication from our email triage, append-only indexed records from our history tracker, timestamp-incremental harvesting from our blog publisher.
The question became: what if we applied all of these patterns to the problem of team discourse itself? What if we treated every post, message, article share, and email thread as a raw signal — harvested it, normalized it, analyzed it, extracted entities and relationships, stored it in structured long-term memory, and then surfaced the intelligence back to us as reports, books, and published articles?
That's Signal Harvester. It's not one tool — it's a platform of 13 interconnected skills, organized into 5 architectural layers, connected to 9 external services, running on a daily cadence that gets richer every cycle.
The Architecture: Five Layers
The platform is organized into layers, each with a clear responsibility. The orchestrator conducts the daily pipeline through them in sequence.
Harvest Layer
Four surface-specific harvesters pull raw content and normalize it into a common signal schema:
slack-harvester reads configured channels and threads via the Slack MCP. It captures message text, authors, timestamps, reactions, reply counts, and linked URLs. Threads are treated as single signals — the whole conversation is one unit, with replies captured as discussion structure. Bot messages and automated notifications are filtered by default.
email-harvester follows the label-state-machine pattern we proved in our tarot orchestrator. Gmail search uses -label:Signal/Harvested to find unprocessed threads. After harvest, threads get labeled. Labels cascade: Harvested → Analyzed → Reported. Multi-message threads collapse into one signal with the full conversation captured.
social-harvester reads LinkedIn, X, and Substack using the Chrome MCP for social platforms and WebFetch for Substack's RSS. For each monitored profile or handle, it captures post text, engagement metrics (likes, reshares, comments), and comment highlights.
web-harvester fetches articles and RSS feeds via WebFetch and runs keyword alerts via WebSearch. It supports three modes: RSS feed monitoring, keyword-based discovery, and targeted site watching. It also serves as the secondary research arm for the book-keeper — when a living book identifies a knowledge gap, it triggers the web-harvester with specific queries.
Every harvester outputs the same normalized signal format. This is the key architectural decision that makes everything downstream work without modification as new surfaces are added.
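As a minimal sketch, a normalized schema of this kind might look like the dataclass below. The field names are illustrative, chosen from the details mentioned in this post; the authoritative contract is the platform's own `signal-schema.md`.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Signal:
    """Normalized signal emitted by every harvester (illustrative fields)."""
    id: str                  # stable dedup key, e.g. "slack:C123:1699999999.000100"
    surface: str             # "slack" | "email" | "social" | "web"
    author: str
    timestamp: str           # ISO-8601
    text: str
    urls: list[str] = field(default_factory=list)
    engagement: dict = field(default_factory=dict)  # likes, reactions, reply_count...
    # Enrichment fields, filled in downstream by the analysis layer:
    topics: list[str] = field(default_factory=list)
    importance: Optional[float] = None              # 0-1, set by signal-analyzer
    sentiment: Optional[str] = None
```

Because every layer reads and writes this one shape, a new harvester only has to emit it correctly to light up the whole downstream stack.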
Analysis Layer
Three specialists process the normalized signals in parallel:
signal-analyzer is the core intelligence engine. For each signal, it produces: topics (what is this about), importance score (0–1, based on author role, theme relevance, novelty, engagement), urgency rating, sentiment, and style analysis (register from casual to analytical, depth from shallow to deep, engagement type from commentary to thought leadership). For threaded conversations, it maps the discussion trajectory — convergent, divergent, exploratory, or contentious — identifying key turns where someone shifted the conversation.
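A rough sketch of how the four importance factors could blend into a single 0-1 score; the weights below are invented for illustration and are not the skill's actual tuning.

```python
def importance_score(author_weight: float, theme_relevance: float,
                     novelty: float, engagement: float) -> float:
    """Blend the four factors (each pre-scaled to 0-1) into one score.
    Weights are illustrative, not the analyzer's real calibration."""
    raw = (0.3 * author_weight + 0.3 * theme_relevance
           + 0.2 * novelty + 0.2 * engagement)
    return max(0.0, min(1.0, raw))  # clamp to the documented 0-1 range
```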
entity-mapper performs full gestalt extraction. It identifies people (with aliases — "Jensen Huang," "@jensenhuang," and "NVIDIA CEO" resolve to one entity), organizations (with sector and type), themes (with frequency and velocity tracking), opinions (attributed positions with confidence levels and evolution over time), and relationships (who collaborates with, tracks, endorses, debates, or criticizes whom). When a theme cluster appears in 5+ signals across 3+ days without matching any existing book, the entity-mapper flags it as a discovery candidate.
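The alias-resolution behavior can be sketched as a union of alias maps. The class below is a hypothetical minimal version, not the entity-mapper's real implementation: any alias already seen pulls new mentions into the existing entity.

```python
class EntityResolver:
    """Minimal alias resolution sketch: many names, one canonical entity."""
    def __init__(self):
        self.alias_to_id: dict[str, str] = {}
        self.entities: dict[str, dict] = {}

    def upsert(self, canonical: str, aliases: list[str]) -> str:
        names = aliases + [canonical]
        # Reuse an existing entity if any of these names already resolves.
        eid = next((self.alias_to_id[n.lower()] for n in names
                    if n.lower() in self.alias_to_id), canonical.lower())
        self.entities.setdefault(eid, {"name": canonical, "aliases": set()})
        for n in names:
            self.alias_to_id[n.lower()] = eid
            self.entities[eid]["aliases"].add(n)
        return eid
```

With this shape, "Jensen Huang", "@jensenhuang", and "NVIDIA CEO" collapse to one entity as soon as any mention links two of the names.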
resource-indexer captures every URL mentioned across all signals. URLs are normalized (tracking params stripped, redirects resolved), deduped, and classified by type (article, paper, tool, repo, dataset, video, podcast). Metadata is fetched — title, description, publication date. High-value resources get full-text capture. Resources mentioned by multiple team members or in high-importance signals get higher scores. The result is a living reading list ranked by team engagement.
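URL normalization of the kind described can be sketched with the standard library. The tracking-parameter list is illustrative, and redirect resolution (which the indexer also does) is omitted here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative set of tracking params; the real indexer's list may differ.
TRACKING = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
            "utm_content", "fbclid", "gclid"}

def normalize_url(url: str) -> str:
    """Strip tracking params and fragments so duplicate links share one key."""
    parts = urlsplit(url)
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k.lower() not in TRACKING])
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/"), query, ""))
```

Two team members sharing the same article with different `utm_` tails now count as two mentions of one resource, which is what drives the engagement-based ranking.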
State Layer
signal-state is the sole gatekeeper of the .signal-state/ directory — a decomposed state system inspired by our narrative-state skill. Every other skill reads and writes through it. The directory has 9 domains: config (settings, connectors, interests), registry (per-surface dedup ledgers as append-only JSONL), signals (day-partitioned archives), entities (the knowledge graph), resources (the link index), books (living intelligence documents), discussions (cross-surface thread tracking), reports (generation history), and runs (pipeline execution log).
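Bootstrapping the nine domains might look like this minimal sketch; the per-domain files the real skill creates (default config, empty ledgers, and so on) are omitted.

```python
from pathlib import Path

# The 9 state domains listed above.
DOMAINS = ["config", "registry", "signals", "entities", "resources",
           "books", "discussions", "reports", "runs"]

def init_state(root: str = ".signal-state") -> Path:
    """Create the decomposed state directory, one subdirectory per domain."""
    base = Path(root)
    for domain in DOMAINS:
        (base / domain).mkdir(parents=True, exist_ok=True)
    return base
```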
book-keeper maintains living intelligence books. This is the long-memory layer — where ephemeral signals become durable knowledge. More on this below, because books are where the real magic happens.
Output Layer
signal-reporter generates daily digests and weekly deep reports, output to Google Drive as formatted documents. Daily digests are designed to be read in 5 minutes: top signals by importance, active discussions, entity highlights, new resources, book updates. Weekly reports go deeper: theme velocity analysis, entity relationship evolution, cross-signal pattern recognition, opinion shifts, and recommendations.
signal-publisher drafts articles for Substack (via scsiwyg), X, and LinkedIn. It generates platform-appropriate content from book syntheses, signal clusters, or curated intelligence. It never auto-publishes — every piece goes through human review. Article types include weekly roundups, deep dives from book synthesis, signal spotlights on high-importance items, and trend alerts on auto-discovered themes.
signal-sharer delivers audience-segmented briefings. Team members get full signal access via Slack canvases. Investors get curated market intelligence in GDrive folders with internal attribution stripped. Participants get contextual updates on discussions they're part of. Peers get theme-specific intelligence packages. A redaction protocol sanitizes internal data for external audiences.
Orchestration Layer
signal-orchestrator is the daily conductor. It sequences the pipeline through 8 checkpointed stages: init → harvest (parallel) → analyze (parallel) → store (sequential, write-locked) → books → report → notify → complete. Each stage is idempotent — if a run fails mid-way, restarting picks up from the last checkpoint. The orchestrator runs daily via the scheduled-tasks MCP.
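The checkpoint-and-resume behavior can be sketched as follows. Stage bodies are stubbed out, and the checkpoint file format is invented for illustration; the point is only that a restarted run skips every stage that already completed.

```python
import json
from pathlib import Path

STAGES = ["init", "harvest", "analyze", "store", "books",
          "report", "notify", "complete"]

def run_pipeline(checkpoint: Path, stages=STAGES) -> list[str]:
    """Run stages in order, skipping any already checkpointed.
    Each stage records itself on success, so a crashed run resumes
    from the first incomplete stage. Returns the stages executed."""
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    executed = []
    for stage in stages:
        if stage in done:
            continue              # idempotent: finished in a prior run
        executed.append(stage)    # real harvest/analyze/store work goes here
        done.add(stage)
        checkpoint.write_text(json.dumps(sorted(done)))
    return executed
```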
The Themes That Carry Through
Several themes from our earlier skill-building work evolved and advanced as we built Signal Harvester. These are the ideas that kept proving their worth as the system grew more complex.
Append-Only as Default
Our tarot history tracker taught us this: records should be immutable once written. Signal Harvester applies this everywhere. Registry ledgers are append-only JSONL. Book entries are append-only. Discussion archives are append-only. Signal archives are day-partitioned and never modified after the day closes. The entity graph is the one exception — entities use upsert semantics — but even there, arrays (like an entity's signal_ids or associated_themes) are only appended to, never pruned.
Why this matters: append-only data is trivially safe for concurrent reads, easy to audit, and impossible to accidentally corrupt by overwriting. When you can trust that your data only grows, your dedup logic becomes simple: check if the ID exists, if yes skip, if no append.
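That dedup logic really does fit in a few lines. A sketch, assuming one JSONL ledger per surface with an `id` field on every line:

```python
import json
from pathlib import Path

def record_signal(ledger: Path, signal_id: str, payload: dict) -> bool:
    """Append-only dedup: skip if the ID was ever written, else append.
    Returns True when a new line was actually appended."""
    if ledger.exists():
        seen = {json.loads(line)["id"]
                for line in ledger.read_text().splitlines() if line}
        if signal_id in seen:
            return False          # already harvested: skip
    with ledger.open("a") as f:   # append, never rewrite
        f.write(json.dumps({"id": signal_id, **payload}) + "\n")
    return True
```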
State Decomposition Over Monoliths
Our narrative-state skill (which tracks the full world-state of a fictional universe — characters, geography, timeline, puzzles, audience knowledge) taught us that decomposed state directories massively outperform single-file state blobs. Instead of one giant state.json that every skill reads and writes, .signal-state/ has 30+ files across 9 domains. Each skill reads only the files it needs. Write contention is minimal because skills touch different files.
Signal Harvester took this further with domain isolation: the slack-harvester only touches registry/slack.jsonl. The entity-mapper only touches entities/. The book-keeper only touches books/. The signal-state skill enforces this as the gatekeeper — but the architecture means even without enforcement, accidental cross-domain writes are structurally unlikely.
Label-Based State Machines
Our tarot orchestrator proved that Gmail labels are an excellent state machine for email processing. Signal Harvester's email-harvester uses the same pattern: search excludes already-labeled threads, processing applies labels, labels cascade through states. The elegance is that Gmail itself becomes the state store — no separate database needed for tracking which emails have been processed.
We extended this thinking to Slack harvesting (timestamp-based rather than labels, since Slack's API works differently) and social harvesting (URL-based dedup), but the principle is the same: use the source system's own metadata as your processing state whenever possible.
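The label cascade and exclusion query can be sketched directly. The label names follow the cascade described above, and the query uses standard Gmail search syntax; the function shapes are illustrative.

```python
from typing import Optional

CASCADE = ["Signal/Harvested", "Signal/Analyzed", "Signal/Reported"]

def unprocessed_query(base: str) -> str:
    """Build the Gmail search for threads not yet harvested,
    matching the -label:Signal/Harvested pattern described above."""
    return f"{base} -label:{CASCADE[0]}"

def next_label(current: Optional[str]) -> Optional[str]:
    """Advance one step through the cascade; None input means unprocessed,
    None output means the thread has reached the final state."""
    if current is None:
        return CASCADE[0]
    i = CASCADE.index(current)
    return CASCADE[i + 1] if i + 1 < len(CASCADE) else None
```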
Orchestrator-Specialist Composition
Both our tarot orchestrator and code-audit orchestrator follow the same pattern: a conductor skill that doesn't do the work itself but calls specialist skills in sequence, threading state between them. Signal Harvester's orchestrator is the most complex version yet — 8 stages, parallel execution within stages, checkpoint-based recovery — but the principle is unchanged from the tarot pipeline.
The power of this pattern is that specialists are independently testable, replaceable, and composable. Adding a new harvester surface means writing one new skill and registering it with the orchestrator. The analysis layer, state layer, and output layer don't change.
The Normalized Schema as Contract
This is the theme that advanced the most in Signal Harvester. Every harvester outputs the same signal schema. Every analyzer enriches the same fields. Every output skill reads the same enriched format. The signal schema is the API contract between layers.
This means the social-harvester (which was built in Phase 3) worked with the signal-analyzer (built in Phase 1) without any modifications to the analyzer. The web-harvester feeds into the same entity-mapper, resource-indexer, and reporter. Plug in a new surface, get full-stack intelligence automatically.
Living Books: Where the Magic Compounds
Books are the feature that makes Signal Harvester more than a monitoring tool. They're the long-term memory — living documents that accumulate intelligence over time, synthesize it into narrative understanding, and get richer every cycle.
A book starts with a seed: a title, a set of keyword triggers, and optionally some entity triggers. "AI Infrastructure" triggers on terms like "GPU compute," "training cluster," "inference engine." It also triggers on entities like NVIDIA, AMD, CoreWeave.
Every harvest cycle, the book-keeper routes matching signals into the book's entries.jsonl — an append-only chronological record. Each entry captures: timestamp, signal ID, a book-specific summary, entities mentioned, key claims made, and resources referenced.
After accumulation reaches a threshold (5+ new entries), the book-keeper regenerates synthesis.md — an analytical narrative document. This is not a bullet-point dump of entries. It's a structured analysis: the current state of the theme, key voices and their positions, open questions, trend trajectory, and how the discourse has evolved since the last synthesis.
Auto-Discovery
When the entity-mapper detects a theme cluster appearing in 5+ signals across 3+ days that doesn't match any existing book's triggers, it creates a candidate in books/_discovered/. The daily digest surfaces these: "Emerging theme: Edge Computing — 7 signals in 4 days from 3 team members. Promote to full book?"
On approval, the candidate promotes to a top-level book and begins formal tracking. This means the system discovers themes the team hasn't explicitly decided to track — it surfaces what the team is actually paying attention to, not just what they said they'd watch.
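The discovery rule (5+ signals across 3+ distinct days, matching no existing book) reduces to a small filter. The input shape here is an assumption about how the entity-mapper's theme index might look.

```python
from datetime import date

def discovery_candidates(theme_signals: dict[str, list[tuple[str, date]]],
                         tracked_themes: set[str],
                         min_signals: int = 5, min_days: int = 3) -> list[str]:
    """Flag themes that cleared the signal and day thresholds but are not
    covered by any existing book. Input maps theme -> (signal_id, day)."""
    out = []
    for theme, hits in theme_signals.items():
        days = {day for _, day in hits}
        if (theme not in tracked_themes
                and len(hits) >= min_signals and len(days) >= min_days):
            out.append(theme)
    return out
```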
Secondary Research
Books can trigger targeted web research. When a synthesis identifies a knowledge gap — a question the team keeps asking that no shared resource answers — the book-keeper calls the web-harvester with specific queries. Results come back tagged with the book's slug and get appended as "entry_type": "secondary_research" entries. The next synthesis weaves in both organic team signals and proactive research.
The Compounding Effect
A book seeded in week 1 with 3 entries has a thin synthesis. By week 4 it has 20+ entries and the synthesis has substance — real positions attributed to real people, trend direction visible, resources accumulating. By week 8, with 50+ entries, it's an authoritative document. By week 12, it's institutional memory.
The synthesis at week 12 doesn't just say "the team discusses AI infrastructure." It says: "Over 12 weeks, the team's thesis evolved from 'compute costs are the bottleneck' to 'inference is commoditizing faster than expected.' This shift was driven by 3 key signals: Alice Chen's analysis in week 4, the Sequoia GPU market report shared by 4 team members in week 7, and CoreWeave's pricing announcement in week 10. Current open question: will on-device inference undercut cloud inference before cloud margins compress?"
That's compound intelligence. No individual signal contains that insight. It emerges from accumulation over time.
How to Use It Effectively
Start with One Surface, One Book
Don't try to harvest everything on day one. Install Phase 1 (signal-state, signal-orchestrator, slack-harvester, signal-analyzer), configure 3-5 of your most active Slack channels, and seed one book on a topic your team cares about. Run it daily for two weeks.
By the end of week 2, you'll have: 100+ signals in your archive, an entity graph with your team members and the people/organizations they discuss, a resource index of everything they've shared, and a book with 15-20 entries and a meaningful synthesis. The daily digest becomes something you actually read.
Let Discovery Surprise You
Once the entity-mapper is running (Phase 2), watch the auto-discovery candidates. The themes your team is actually discussing often differ from the themes you think you're tracking. A candidate that surfaces "developer experience" when you seeded books for "AI infrastructure" and "climate tech" tells you something about where your team's attention is really going.
Use Books as Briefing Material
The most immediate high-value application is book syntheses as pre-read material for meetings, ideation sessions, and investor conversations. Instead of "let's brainstorm about AI infrastructure," you start with "here's 6 weeks of structured intelligence on AI infrastructure — the convergent positions, the open debates, the knowledge gaps, and the top resources. Start here."
Publish from Convergence
The signal-publisher works best when a book's discussion trajectory has shifted from exploratory or contentious to convergent. That means the team's thinking has matured on a topic — positions have been tested against evidence, debates have resolved, a thesis has formed. That's the moment to publish a position paper or deep-dive article. It'll have substance because it's backed by weeks of accumulated evidence, not written from a blank page.
Track Divergence for Product Opportunities
Where the signal-analyzer shows contentious or divergent discussions — especially recurring questions without satisfying answers — that's a product opportunity signal. If 3 team members ask the same question in 3 weeks and no authoritative resource exists (the resource-indexer confirms the gap), you've found unmet demand.
Use Sentiment Divergence for Thought Leadership
The entity-mapper can compare your team's internal sentiment against web-harvested external sentiment on the same themes. When your team is bearish on something the market is bullish about (or vice versa), that's a contrarian take waiting to be published. These are the highest-value thought leadership pieces because they're substantiated by real experience, not just provocation.
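A sketch of such a divergence check, assuming sentiment is already scored per theme in [-1, 1] for both the internal and external corpora; the gap threshold is invented for illustration.

```python
def contrarian_themes(internal: dict[str, float], external: dict[str, float],
                      min_gap: float = 1.0) -> list[str]:
    """Flag themes where team sentiment and market sentiment point in
    opposite directions by at least min_gap. Scores are in [-1, 1]."""
    return [t for t in internal.keys() & external.keys()
            if internal[t] * external[t] < 0          # opposite signs
            and abs(internal[t] - external[t]) >= min_gap]
```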
Installation
Signal Harvester is a set of 13 skill directories that install into your Claude skills folder. Each skill is a directory containing a SKILL.md file and optionally a references/ directory with supporting documentation.
Prerequisites
You need the following MCP connectors configured in your Cowork or Claude Code environment:
- Slack MCP — for harvesting Slack channels and posting digests
- Gmail MCP — for harvesting email threads and sending briefings
- Google Drive MCP — for storing reports, books, and entity exports
- Chrome MCP (Claude in Chrome) — for harvesting LinkedIn and X
- scsiwyg MCP — for publishing to Substack and X
- Scheduled Tasks MCP — for the daily harvest cron
Not all connectors are needed from day one. Phase 1 only requires Slack MCP.
Phase 1: Foundation (Start Here)
Copy these 4 skill directories into your skills folder:
cp -r signal-state ~/.claude/skills/
cp -r signal-orchestrator ~/.claude/skills/
cp -r slack-harvester ~/.claude/skills/
cp -r signal-analyzer ~/.claude/skills/
Then initialize the state:
"Initialize Signal Harvester. Set up .signal-state/ with default config.
Configure these Slack channels: #general, #engineering, #ai-research"
The signal-state skill creates the full directory structure, default config, and empty collections. Then run your first harvest:
"Run the harvest"
Phase 2: Intelligence
cp -r entity-mapper ~/.claude/skills/
cp -r resource-indexer ~/.claude/skills/
cp -r signal-reporter ~/.claude/skills/
Now you get the entity graph, resource indexing, and daily digest reports. Seed your first book:
"Create a book on AI infrastructure — track everything the team
says about GPU compute, training, inference, and the companies
building it"
Phase 3: Surface Expansion
cp -r email-harvester ~/.claude/skills/
cp -r social-harvester ~/.claude/skills/
cp -r web-harvester ~/.claude/skills/
Configure additional surfaces:
"Add email harvesting for our team threads. Add LinkedIn monitoring
for @alicechen and @bobkim. Add RSS feeds for Stratechery and
Benedict Evans."
Phase 4: Books and Deep Patterns
cp -r book-keeper ~/.claude/skills/
By now you have 4+ weeks of accumulated signals. The book-keeper starts generating real syntheses and auto-discovering emerging themes.
Phase 5: Publishing and Sharing
cp -r signal-publisher ~/.claude/skills/
cp -r signal-sharer ~/.claude/skills/
Set up publishing:
"Draft a 'This Week in AI Infrastructure' post from the book
for Substack. Also create an investor briefing for this month
and share it to the investor folder in GDrive."
Setting Up the Daily Cron
"Schedule the signal harvester to run every morning at 7am.
Post the daily digest to #signals in Slack."
The orchestrator creates a scheduled task that triggers the full pipeline daily.
The Shared References
Three reference documents live in signal-state/references/ and serve as the source of truth for the entire platform:
signal-schema.md documents the normalized signal JSON format — every field, its type, which skill populates it, and when in the lifecycle it gets populated. This is the contract between layers.
state-schema.md documents the full .signal-state/ directory — every file, its JSON structure, which skills read from it, and which write to it. This is the system's memory architecture.
connector-map.md documents every MCP connector — its tools, which skills use them, data direction, and key operations. This is the integration map.
When building new skills that extend Signal Harvester (a GitHub harvester, a Discord harvester, a Notion exporter), these references define the interfaces to conform to.
What This Really Is
Signal Harvester isn't a monitoring tool. It's not a notification aggregator. It's not an RSS reader with extra steps.
It's a system that treats your team's discourse as raw material for institutional intelligence. The daily harvest is just the input mechanism. The real output is the knowledge that compounds: entity graphs that reveal who influences whom, books that trace how your collective thinking evolves, resource indexes that surface what your team actually finds valuable, and discussion maps that show where ideas converge and where they fracture.
The platform is built to get smarter every day it runs. Week 1 is thin. Week 4 has patterns. Week 12 has authority. Week 24 has institutional memory that no single person carries.
Your team is already producing the signals. Signal Harvester just makes sure you don't lose them.