Design an AI document processing pipeline
Walk through designing an intelligent document processing system that extracts structured data from PDFs, invoices, and contracts using OCR, layout analysis, and LLM extraction at 10K documents per hour.
TL;DR
- Tiered OCR routing (digital PDF text extraction, Tesseract for clean scans, Google Document AI for complex layouts) cuts OCR costs by 60-70% while maintaining 95%+ accuracy across all document types.
- Multimodal LLMs like GPT-4o with structured output achieve 95%+ field extraction accuracy on invoices and contracts, replacing months of regex and template engineering with a single prompt.
- Multi-layer validation (JSON schema checks, business rule validation, cross-document consistency, LLM logprob confidence scoring) reduces human review volume from 100% to 5-15% of documents while catching 99.5%+ of extraction errors.
- Layout-aware extraction using models like LayoutLM and TableTransformer preserves table structure and spatial relationships that flat OCR text destroys, increasing table extraction accuracy from 40% to 90%+.
- The production lesson: the extraction model is the easy part. The hard part is building a confidence-aware routing system that knows which documents need human review, which can auto-approve, and which should retry with a more expensive model.
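The tiered OCR routing from the first bullet can be sketched as a small decision function. This is a minimal illustration, not the article's implementation: the `PageProfile` fields, the `scan_quality` score, and the 0.8 threshold are assumptions made for the example. Only the three engine names come from the text.

```python
from dataclasses import dataclass


@dataclass
class PageProfile:
    has_text_layer: bool   # digital PDF with embedded text
    scan_quality: float    # 0.0-1.0, hypothetical score from a preprocessing step
    has_tables: bool       # hypothetical flag from a layout check


def choose_ocr_engine(page: PageProfile, clean_scan_threshold: float = 0.8) -> str:
    """Pick the cheapest engine expected to handle the page."""
    if page.has_text_layer:
        # No OCR needed: read the embedded text directly.
        return "native_text"
    if page.scan_quality >= clean_scan_threshold and not page.has_tables:
        # Clean scan with a simple layout: the open-source engine is enough.
        return "tesseract"
    # Noisy scans and complex layouts go to the most capable (and costly) tier.
    return "document_ai"
```

The savings come from the ordering: every page that exits at the first or second branch never reaches the paid tier.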
Requirements
Functional requirements
- The system ingests documents in multiple formats (PDF, scanned images, TIFF, DOCX) and produces structured JSON output with extracted fields mapped to a configurable schema.
- The system classifies each document by type (invoice, contract, receipt, tax form, medical record) and applies the corresponding extraction template.
- The system extracts key-value fields, line items, tables, signatures, and dates with per-field confidence scores attached to every output.
- The system validates extracted data against business rules (totals match line items, dates are plausible, required fields are present) and flags inconsistencies.
- The system routes low-confidence documents to a human review queue with the original image, extracted data, and highlighted areas of uncertainty.
- The system supports schema versioning so new document types or fields can be added without reprocessing the entire backlog.
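The business-rule checks named above (required fields present, totals match line items, dates plausible) might look roughly like the following. It is a sketch under assumptions: the `invoice_date` field, the one-cent tolerance, and the severity labels are hypothetical choices, not part of the original design.

```python
from datetime import date
from decimal import Decimal


def validate_invoice(fields: dict, line_items: list, today: date) -> list:
    """Return a list of {field, rule, message, severity} errors."""
    errors = []

    # Rule 1: required fields are present.
    for name in ("vendor_name", "invoice_number", "total_amount"):
        if fields.get(name) is None:
            errors.append({"field": name, "rule": "required_field",
                           "message": f"{name} is missing", "severity": "error"})

    # Rule 2: the total matches the sum of line items (Decimal avoids float drift).
    total = fields.get("total_amount")
    if total is not None and line_items:
        line_sum = sum((Decimal(str(i["amount"])) for i in line_items), Decimal("0"))
        if abs(line_sum - Decimal(str(total))) > Decimal("0.01"):
            errors.append({"field": "total_amount", "rule": "line_item_sum_match",
                           "message": f"Line items sum to {line_sum} but total reads {total}",
                           "severity": "error"})

    # Rule 3: dates are plausible (hypothetical rule: not in the future).
    invoice_date = fields.get("invoice_date")
    if invoice_date is not None and invoice_date > today:
        errors.append({"field": "invoice_date", "rule": "date_plausible",
                       "message": "Invoice date is in the future", "severity": "warning"})
    return errors
```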
Non-functional requirements
- Throughput: process 10,000 documents per hour (about 2.8 per second sustained).
- P95 latency: under 30 seconds per document end-to-end (intake to structured output).
- Field extraction accuracy: 95%+ across all document types, 99%+ for high-value fields like invoice totals and dates.
- Cost per document: $0.01-0.10 depending on complexity (versus $2-5 for manual data entry).
- Human review rate: under 15% of documents require human intervention.
- Availability: 99.9% uptime with graceful degradation (queue documents during outages, process when recovered).
The hardest engineering problem here: every document is different. An invoice from Vendor A looks nothing like an invoice from Vendor B. Scanned documents vary wildly in quality, rotation, and resolution. The system needs to handle this variance without a custom template per vendor, and it needs to know its own confidence level well enough to route uncertain documents to humans instead of silently producing wrong data.
The core entities
Document
doc_id, source (upload, email, s3_sync), file_type (pdf, tiff, png, docx), file_size_bytes, page_count, status (queued, preprocessing, classifying, extracting, validating, review, completed, failed), uploaded_at, completed_at
DocumentClassification
classification_id, doc_id, doc_type (invoice, contract, receipt, tax_form, medical_record, unknown), confidence, model_used, classified_at
ExtractionResult
extraction_id, doc_id, schema_version, fields (JSON map of field_name to value + confidence), line_items (array of row objects), tables (array of table objects with headers and rows), model_used, tokens_consumed, cost_usd, extracted_at
OCRResult
ocr_id, doc_id, page_number, engine (native_text, tesseract, document_ai), raw_text, word_boxes (array of {text, x, y, width, height, confidence}), avg_confidence, processing_time_ms
ValidationResult
validation_id, extraction_id, checks_passed, checks_failed, errors (array of {field, rule, message, severity}), overall_confidence, routing_decision (auto_approve, human_review, retry_with_upgrade), validated_at
HumanReview
review_id, doc_id, extraction_id, reviewer_id, corrections (JSON diff of changed fields), review_time_seconds, status (pending, in_progress, completed), created_at, completed_at
ExtractionSchema
schema_id, doc_type, version, fields (array of {name, type, required, validation_rules}), active, created_at
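As a rough illustration, two of these entities could be modeled with Python dataclasses and enums. The field and enum values follow the lists above. The `FieldValue` wrapper and the `min_confidence` helper are hypothetical additions for the example, and several columns are omitted for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class RoutingDecision(str, Enum):
    AUTO_APPROVE = "auto_approve"
    HUMAN_REVIEW = "human_review"
    RETRY_WITH_UPGRADE = "retry_with_upgrade"


@dataclass
class FieldValue:
    """Hypothetical wrapper: one extracted value plus its confidence."""
    value: object
    confidence: float


@dataclass
class ExtractionResult:
    extraction_id: str
    doc_id: str
    schema_version: str
    fields: dict          # field_name -> FieldValue
    model_used: str

    def min_confidence(self) -> float:
        """Lowest per-field confidence, or 0.0 when nothing was extracted."""
        return min((f.confidence for f in self.fields.values()), default=0.0)


@dataclass
class ValidationResult:
    validation_id: str
    extraction_id: str
    checks_passed: int
    checks_failed: int
    overall_confidence: float
    routing_decision: RoutingDecision
```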
API design
POST /v1/documents - upload a document for processing
Request: {
  "source": "upload",
  "file": "<binary>",
  "file_type": "pdf",
  "priority": "normal",
  "callback_url": "https://acme.com/webhooks/doc-processed",
  "extraction_schema": "invoice_v3"
}
Response: {
  "doc_id": "doc_abc123",
  "status": "queued",
  "estimated_completion_seconds": 25,
  "queue_position": 12
}
GET /v1/documents/{doc_id} - check processing status and results
Response: {
  "doc_id": "doc_abc123",
  "status": "completed",
  "classification": {
    "doc_type": "invoice",
    "confidence": 0.97
  },
  "extraction": {
    "vendor_name": { "value": "Acme Corp", "confidence": 0.99 },
    "invoice_number": { "value": "INV-2026-0042", "confidence": 0.98 },
    "total_amount": { "value": 15420.00, "confidence": 0.95 },
    "line_items": [
      { "description": "Cloud hosting (March)", "quantity": 1, "unit_price": 12000.00, "amount": 12000.00 }
    ]
  },
  "validation": {
    "checks_passed": 8,
    "checks_failed": 0,
    "overall_confidence": 0.96,
    "routing_decision": "auto_approve"
  }
}
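A client consuming this response will often want the fields whose confidence falls below its own threshold. The helper below is a hedged sketch: `low_confidence_fields` and the 0.9 default are illustrative names and values, and it assumes the response has already been parsed into a Python dict.

```python
def low_confidence_fields(response: dict, threshold: float = 0.9) -> list:
    """Return names of extracted fields whose confidence is below the threshold."""
    flagged = []
    for name, entry in response.get("extraction", {}).items():
        # Skip entries such as line_items, which are lists rather than
        # {value, confidence} objects in the response shape shown above.
        if isinstance(entry, dict) and "confidence" in entry:
            if entry["confidence"] < threshold:
                flagged.append(name)
    return flagged
```

With the response above and a threshold of 0.96, only `total_amount` (confidence 0.95) would be returned.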
POST /v1/documents/batch - submit multiple documents for processing
Request: {
  "documents": [
    { "s3_uri": "s3://acme-docs/invoices/batch-march/*.pdf", "extraction_schema": "invoice_v3" }
  ],
  "priority": "bulk",
  "callback_url": "https://acme.com/webhooks/batch-complete"
}
Response: {
  "batch_id": "batch_xyz789",
  "document_count": 342,
  "estimated_completion_minutes": 8,
  "status": "processing"
}
GET /v1/documents/{doc_id}/review - get human review interface data
Response: {
  "review_id": "rev_def456",
  "doc_id": "doc_abc123",
  "original_image_url": "/v1/documents/doc_abc123/image?page=1",
  "extracted_fields": {
    "vendor_name": { "value": "Acme Corp", "confidence": 0.99, "bounding_box": [120, 45, 380, 72] },
    "total_amount": { "value": 15420.00, "confidence": 0.72, "bounding_box": [400, 890, 520, 915], "flagged": true }
  },
  "validation_errors": [
    { "field": "total_amount", "rule": "line_item_sum_match", "message": "Line items sum to $15,320.00 but total reads $15,420.00" }
  ]
}
PUT /v1/schemas/{doc_type} - create or update an extraction schema
Request: {
  "doc_type": "invoice",
  "version": "v4",
  "fields": [
    { "name": "vendor_name", "type": "string", "required": true },
    { "name": "invoice_number", "type": "string", "required": true },
    { "name": "total_amount", "type": "number", "required": true, "validation": "must_match_line_item_sum" },
    { "name": "due_date", "type": "date", "required": false }
  ]
}
Response: {
  "schema_id": "schema_inv_v4",
  "doc_type": "invoice",
  "version": "v4",
  "active": true,
  "fields_count": 4
}
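One way to apply such a schema is a first-pass structural check before any business rules run. The sketch below assumes extracted values arrive as plain Python values and that dates are ISO 8601 strings. Both are assumptions for illustration, as are the function names.

```python
from datetime import date


def _is_iso_date(value) -> bool:
    """True when value is a string like '2026-03-31' (assumed date encoding)."""
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "date": _is_iso_date,
}


def check_against_schema(schema: dict, extracted: dict) -> list:
    """Return human-readable problems; an empty list means the shape is valid."""
    problems = []
    for spec in schema["fields"]:
        name = spec["name"]
        value = extracted.get(name)
        if value is None:
            if spec.get("required"):
                problems.append(f"{name}: required field is missing")
            continue
        if not TYPE_CHECKS[spec["type"]](value):
            problems.append(f"{name}: expected {spec['type']}")
    return problems
```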
High-level design
The system operates as two pipelines. The real-time pipeline processes individual documents on upload: preprocess, classify, OCR, extract, validate, and route. The batch pipeline handles bulk imports by parallelizing across a worker pool, processing up to 10K documents per hour on a cluster of 20 workers. Both pipelines share the same extraction and validation logic.
I think of the architecture as five layers. The intake layer handles uploads and queuing. The perception layer handles OCR and layout analysis (turning pixels into structured text). The intelligence layer uses LLMs for classification and field extraction. The validation layer checks the output against business rules. The routing layer decides whether to auto-approve, retry, or send to humans. Each layer is stateless and scales independently behind the task queue.
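The five layers can be sketched as a chain of functions over a document record, with the routing layer making the final three-way decision. Everything here is illustrative: the stub layers return canned values, and the 0.95 and 0.80 thresholds and the `already_upgraded` flag are assumptions, not figures from the design.

```python
def route(overall_confidence: float, checks_failed: int, already_upgraded: bool,
          approve_at: float = 0.95, retry_at: float = 0.80) -> str:
    """Routing layer: auto-approve, retry with a stronger model, or escalate."""
    if checks_failed == 0 and overall_confidence >= approve_at:
        return "auto_approve"
    if not already_upgraded and overall_confidence >= retry_at:
        # Borderline result: one retry with a more expensive model.
        return "retry_with_upgrade"
    return "human_review"


# Stub layers: each takes the document record and returns an enriched copy.
def intake(doc):       return {**doc, "status": "queued"}
def perception(doc):   return {**doc, "raw_text": "Total: 15420.00"}
def intelligence(doc): return {**doc, "doc_type": "invoice",
                               "fields": {"total_amount": 15420.00}}
def validation(doc):   return {**doc, "overall_confidence": 0.96, "checks_failed": 0}
def routing(doc):
    decision = route(doc["overall_confidence"], doc["checks_failed"],
                     doc.get("already_upgraded", False))
    return {**doc, "routing_decision": decision}


LAYERS = [intake, perception, intelligence, validation, routing]


def process(doc: dict) -> dict:
    """Run one document through all five layers in sequence."""
    for layer in LAYERS:
        doc = layer(doc)
    return doc
```

Because each layer only reads and returns the record, any layer can be swapped out or scaled separately, which matches the stateless design described above.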
For your interview: draw the five layers and explain that each document flows through all five in sequence. This shows the interviewer you understand the pipeline pattern and can identify where bottlenecks occur (hint: it is almost always the LLM extraction layer, which takes 5-15 seconds per document).
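To size the extraction layer, Little's law gives a back-of-the-envelope estimate: in-flight requests equal arrival rate times time per request. The sketch below uses the 10,000 documents per hour target and the 5-15 second extraction range from the text. The function name is illustrative, and the estimate ignores bursts, retries, and the other layers.

```python
import math


def required_concurrency(docs_per_hour: int, seconds_per_doc: float) -> int:
    """Little's law: concurrent requests = arrival rate x service time."""
    arrival_rate = docs_per_hour / 3600  # documents per second
    return math.ceil(arrival_rate * seconds_per_doc)


best_case = required_concurrency(10_000, 5)    # 14 concurrent LLM calls
worst_case = required_concurrency(10_000, 15)  # 42 concurrent LLM calls
```

Spread over a pool of 20 workers, the worst case works out to roughly two to three concurrent extraction calls per worker.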