Teaching AI to Read Pipe Specs: Schema-Driven BOM Extraction
#ai #extraction #bom #building-in-public #aimqc #devlog #oil-and-gas
David Olsson
Getting an AI to extract a Bill of Materials from a piping drawing is not just a prompting problem. The domain is full of conventions that are not in any training dataset: Canadian control numbers, pipe schedule abbreviations that mean something specific in Alberta, material grades written in shorthand only experienced QC engineers recognize. We encoded that knowledge in a schema registry so the AI could use it.
The problem with generic extraction
A general-purpose AI can read a PDF and identify that there is a table with columns that look like material descriptions and quantities. What it cannot do without help is know that "A333 Gr.6" means ASTM A333 Grade 6 low-temperature carbon steel pipe, that "SCH 80" and "XH" refer to the same wall thickness on the common pipe sizes (NPS 8 and smaller), that the "CTRL #" column is the Canadian control number that links this material to its regulatory traceability chain, or that a quantity written as "1 LF" should be normalized to a quantity of 1 with a unit of linear feet.
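To make the target of that normalization concrete, here is a minimal sketch of the kind of interpretation described above, written as plain post-processing helpers. The function names, the regular expressions, and the unit table are illustrative assumptions, not the production code, and they cover only the shorthand quoted in this post.

```typescript
// Hypothetical helpers for illustration only.

type Unit = "LF" | "EA" | "LB";

const UNIT_NAMES: Record<Unit, string> = {
  LF: "linear feet",
  EA: "each",
  LB: "pounds",
};

interface Quantity {
  value: number;
  unit: Unit;
  unitName: string;
}

// Parse strings such as "1 LF" or "12.5 lb" into a number plus a canonical unit.
// Returns null when the text does not match a known unit.
function parseQuantity(raw: string): Quantity | null {
  const match = raw.trim().match(/^(\d+(?:\.\d+)?)\s*([A-Za-z]+)$/);
  if (!match) return null;
  const unit = match[2].toUpperCase();
  if (!Object.prototype.hasOwnProperty.call(UNIT_NAMES, unit)) return null;
  const known = unit as Unit;
  return { value: Number(match[1]), unit: known, unitName: UNIT_NAMES[known] };
}

interface MaterialGrade {
  spec: string;
  grade: string;
}

// Parse shorthand such as "A333 Gr.6" or "A106 Gr.B" into an ASTM
// specification and a grade. Returns null for anything else.
function parseMaterialGrade(raw: string): MaterialGrade | null {
  const match = raw.trim().match(/^(A\d+)\s+Gr\.?\s*([A-Za-z0-9]+)$/i);
  if (!match) return null;
  return { spec: "ASTM " + match[1].toUpperCase(), grade: match[2].toUpperCase() };
}
```

Returning null instead of guessing keeps unknown shorthand visible, so it can be flagged for a reviewer rather than silently rewritten.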
These are not edge cases. They are the standard language of piping materials on Alberta industrial sites. Without encoding this knowledge, the extraction is unreliable, and unreliable extraction requires a human to manually review every output line before it can be trusted.
Schema registry pattern
We built a schema registry that pairs each document type with a structured extraction schema and a set of domain-specific hints. The schema defines the shape of the output (using Zod for runtime validation). The hints tell the AI what to look for and how to interpret it.
// Simplified from the BOM extraction schema
import { z } from "zod";

// Shape of the extracted output; Zod validates it at runtime.
const bomSchema = z.object({
  items: z.array(z.object({
    itemNumber: z.string(),
    description: z.string(),
    materialGrade: z.string().optional(),
    size: z.string().optional(),
    specification: z.string().optional(),
    quantity: z.number(),
    unit: z.string(),
    controlNumber: z.string().optional(),
  })),
});

// Domain hints passed to the model alongside the schema.
const bomHints = `
- Material grades follow ASTM conventions: A106, A333, A53, A105
- "Gr.B" or "Gr.6" suffix indicates grade
- Wall thickness: SCH 40, SCH 80, XH (extra heavy), XXH (double extra heavy)
- Control numbers (CTRL #) are Canadian regulatory identifiers; preserve exactly
- Units: LF = linear feet, EA = each, LB = pounds
- Pipe specifications reference CSA or ASME standards
`;
The registry maps document types to their schemas and hints. When a document is submitted for extraction, the system looks up the schema, attaches the hints, and builds the structured JSON output request.
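The lookup described above could be sketched roughly as follows. This is an assumption-laden illustration, not the actual registry: `RegistryEntry`, `registerDocumentType`, `buildExtractionRequest`, and `validateExtraction` are hypothetical names, the prompt wording is invented, and a hand-written `parseBom` stands in for the validator so the sketch runs without dependencies. With Zod, the `parse` field could simply wrap `bomSchema.parse`.

```typescript
// Hypothetical registry shape for illustration only.

interface RegistryEntry<T> {
  // Runtime validator for the model's JSON output.
  parse: (raw: unknown) => T;
  // Domain hints attached to the extraction request.
  hints: string;
}

interface BomItem {
  itemNumber: string;
  description: string;
  quantity: number;
  unit: string;
  controlNumber?: string;
}

interface Bom {
  items: BomItem[];
}

// Stand-in validator: checks only the required fields used in this sketch.
function parseBom(raw: unknown): Bom {
  if (typeof raw !== "object" || raw === null) throw new Error("BOM must be an object");
  const items = (raw as { items?: unknown }).items;
  if (!Array.isArray(items)) throw new Error("BOM must contain an items array");
  for (const item of items) {
    if (typeof item !== "object" || item === null) throw new Error("BOM item must be an object");
    if (typeof item.itemNumber !== "string") throw new Error("itemNumber must be a string");
    if (typeof item.quantity !== "number") throw new Error("quantity must be a number");
  }
  return { items: items as BomItem[] };
}

const registry: Record<string, RegistryEntry<unknown>> = {};

function registerDocumentType(documentType: string, entry: RegistryEntry<unknown>): void {
  registry[documentType] = entry;
}

function lookup(documentType: string): RegistryEntry<unknown> {
  if (!Object.prototype.hasOwnProperty.call(registry, documentType)) {
    throw new Error('No schema registered for document type "' + documentType + '"');
  }
  return registry[documentType];
}

// Look up the entry and attach its hints to the request text.
function buildExtractionRequest(documentType: string, documentText: string): string {
  const entry = lookup(documentType);
  return [
    "Extract structured JSON from the following " + documentType + " document.",
    "Domain hints:",
    entry.hints.trim(),
    "Document:",
    documentText,
  ].join("\n");
}

// Validate the model's raw JSON against the entry registered for this type.
function validateExtraction(documentType: string, raw: unknown): unknown {
  return lookup(documentType).parse(raw);
}

registerDocumentType("bom", {
  parse: parseBom,
  hints: "- Units: LF = linear feet, EA = each, LB = pounds",
});
```

Keeping the validator and the hints in one entry means a new document type cannot be registered with a schema but no hints, or the reverse.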
The extraction pipeline
Human review before database insertion is intentional. The AI gets the structure right most of the time. It occasionally misreads a smeared scan or misidentifies a continuation row. The review step is not a workaround for a poor model; it is the appropriate QC gate for data that will underpin procurement and material traceability.
Split-view review UI
The review interface uses a split pane: the source document on the left, the extracted BOM table on the right. A user can click a BOM line item and the document viewer scrolls to the source location. This is the same interaction model as code review: the diff on one side, the context on the other.
Line items can be edited inline, reordered, or rejected. Only approved items are written to the database.
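The approval gate can be modelled as a small state machine per line item. The sketch below is one possible shape, with hypothetical type and function names; it leaves out reordering and the actual database write, and shows only that edits and status changes produce new review state and that approved lines alone reach persistence.

```typescript
// Hypothetical review model for illustration only.

type ReviewStatus = "pending" | "approved" | "rejected";

interface BomLine {
  itemNumber: string;
  description: string;
  quantity: number;
  unit: string;
}

interface ReviewLine {
  line: BomLine;
  status: ReviewStatus;
}

// Every extracted line starts as pending; nothing is trusted by default.
function startReview(extracted: BomLine[]): ReviewLine[] {
  return extracted.map((line): ReviewLine => ({ line, status: "pending" }));
}

// Inline edit: return a new list with one line patched, the rest untouched.
function editLine(review: ReviewLine[], index: number, patch: Partial<BomLine>): ReviewLine[] {
  return review.map((entry, i): ReviewLine =>
    i === index ? { line: { ...entry.line, ...patch }, status: entry.status } : entry
  );
}

// Approve or reject a single line.
function setStatus(review: ReviewLine[], index: number, status: ReviewStatus): ReviewLine[] {
  return review.map((entry, i): ReviewLine =>
    i === index ? { line: entry.line, status } : entry
  );
}

// Only approved lines are handed on to be written to the database.
function linesToPersist(review: ReviewLine[]): BomLine[] {
  return review.filter((entry) => entry.status === "approved").map((entry) => entry.line);
}
```

Because pending lines are excluded along with rejected ones, a line nobody looked at never reaches the database.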
Domain knowledge is the moat
The extraction accuracy on our domain is substantially higher than a generic prompt would achieve, because the schema registry encodes years of field experience about how Alberta piping materials are described. A general model can recognize that something is a table. Our extraction knows what the columns mean.
That knowledge lives in the registry, not in a prompt someone might forget to update. As new document formats arrive from different clients, the registry gains new entries. The improvement compounds.
David Olsson is CTO at AIMQC. Contact: dolsson@aimqc.com