Design an AI document processing pipeline

TL;DR

Tiered OCR routing (digital PDF text extraction, Tesseract for clean scans, Google Document AI for complex layouts) cuts OCR costs by 60-70% while maintaining 95%+ accuracy across all document types.
Multimodal LLMs like GPT-4o with structured output achieve 95%+ field extraction accuracy on invoices and contracts, replacing months of regex and template engineering with a single prompt.
Multi-layer validation (JSON schema checks, business rule validation, cross-document consistency, LLM logprob confidence scoring) reduces human review volume from 100% to 5-15% of documents while catching 99.5%+ of extraction errors.
Layout-aware extraction using models like LayoutLM and Table Transformer preserves table structure and spatial relationships that flat OCR text destroys, increasing table extraction accuracy from 40% to 90%+.
The production lesson: the extraction model is the easy part. The hard part is building a confidence-aware routing system that knows which documents need human review, which can auto-approve, and which should retry with a more expensive model.

Requirements

Functional requirements

The system ingests documents in multiple formats (PDF, scanned images, TIFF, DOCX) and produces structured JSON output with extracted fields mapped to a configurable schema.
The system classifies each document by type (invoice, contract, receipt, tax form, medical record) and applies the corresponding extraction template.
The system extracts key-value fields, line items, tables, signatures, and dates with per-field confidence scores attached to every output.
The system validates extracted data against business rules (totals match line items, dates are plausible, required fields are present) and flags inconsistencies.
The system routes low-confidence documents to a human review queue with the original image, extracted data, and highlighted areas of uncertainty.
The system supports schema versioning so new document types or fields can be added without reprocessing the entire backlog.

Non-functional requirements

Throughput: process 10,000 documents per hour (about 2.8 per second sustained).
P95 latency: under 30 seconds per document end-to-end (intake to structured output).
Field extraction accuracy: 95%+ across all document types, 99%+ for high-value fields like invoice totals and dates.
Cost per document: $0.01-0.10 depending on complexity (versus $2-5 for manual data entry).
Human review rate: under 15% of documents require human intervention.
Availability: 99.9% uptime with graceful degradation (queue documents during outages, process when recovered).

The hardest engineering problem here: every document is different. An invoice from Vendor A looks nothing like an invoice from Vendor B. Scanned documents vary wildly in quality, rotation, and resolution. The system needs to handle this variance without a custom template per vendor, and it needs to know its own confidence level well enough to route uncertain documents to humans instead of silently producing wrong data.

The core entities

Document

doc_id, source (upload, email, s3_sync), file_type (pdf, tiff, png, docx), file_size_bytes, page_count, status (queued, preprocessing, classifying, extracting, validating, review, completed, failed), uploaded_at, completed_at
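The status field is effectively a small state machine. The entity lists the states but not the edges between them, so the transition table below is my assumption about a sensible ordering, not something the design specifies. The names DocStatus, ALLOWED, and advance are hypothetical.

```python
from enum import Enum


class DocStatus(str, Enum):
    QUEUED = "queued"
    PREPROCESSING = "preprocessing"
    CLASSIFYING = "classifying"
    EXTRACTING = "extracting"
    VALIDATING = "validating"
    REVIEW = "review"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed edges: the entity definition names the states only.
ALLOWED = {
    DocStatus.QUEUED: {DocStatus.PREPROCESSING, DocStatus.FAILED},
    DocStatus.PREPROCESSING: {DocStatus.CLASSIFYING, DocStatus.FAILED},
    DocStatus.CLASSIFYING: {DocStatus.EXTRACTING, DocStatus.REVIEW, DocStatus.FAILED},
    DocStatus.EXTRACTING: {DocStatus.VALIDATING, DocStatus.FAILED},
    # Validation can finish, escalate to a human, or send the document
    # back to extraction for a retry with a stronger model.
    DocStatus.VALIDATING: {
        DocStatus.COMPLETED,
        DocStatus.REVIEW,
        DocStatus.EXTRACTING,
        DocStatus.FAILED,
    },
    DocStatus.REVIEW: {DocStatus.COMPLETED, DocStatus.FAILED},
    DocStatus.COMPLETED: set(),
    DocStatus.FAILED: set(),
}


def advance(current: DocStatus, target: DocStatus) -> DocStatus:
    """Return the new status, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting illegal transitions in one place makes it harder for a retrying worker to move a completed document back into the queue by accident.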

DocumentClassification

classification_id, doc_id, doc_type (invoice, contract, receipt, tax_form, medical_record, unknown), confidence, model_used, classified_at

ExtractionResult

extraction_id, doc_id, schema_version, fields (JSON map of field_name to value + confidence), line_items (array of row objects), tables (array of table objects with headers and rows), model_used, tokens_consumed, cost_usd, extracted_at
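The per-field confidence in the fields map has to come from somewhere. The TL;DR mentions LLM logprob confidence scoring without saying how the score is computed, so here is one common heuristic as a sketch: take the geometric mean of the token probabilities for the tokens that make up a field's value. The function name and the choice of geometric mean are my assumptions.

```python
import math
from typing import Sequence


def field_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities for one extracted field value.

    token_logprobs holds the natural-log probability of each output token
    that belongs to the field. An empty sequence scores 0.0 so that a field
    with no evidence is never auto-approved.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```

Averaging in log space keeps long values (such as vendor names) from being penalized just for having more tokens. Raw model probabilities are not guaranteed to be well calibrated, so any threshold applied to this score should be tuned against human-reviewed documents.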

OCRResult

ocr_id, doc_id, page_number, engine (native_text, tesseract, document_ai), raw_text, word_boxes (array of {text, x, y, width, height, confidence}), avg_confidence, processing_time_ms
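The engine values map directly onto the tiered OCR routing from the TL;DR. A minimal sketch of the tier selection follows. The input signals (has_text_layer, has_tables, scan_quality) and both thresholds are hypothetical; the source names the three tiers but not the rules for choosing between them.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    has_text_layer: bool   # the PDF page already carries embedded text
    has_tables: bool       # hypothetical layout-detector output
    scan_quality: float    # hypothetical 0..1 score (skew, blur, resolution)


def choose_engine(page: PageSignals, min_quality: float = 0.8) -> str:
    """Pick the cheapest engine that is likely to be good enough."""
    if page.has_text_layer:
        return "native_text"
    if page.scan_quality >= min_quality and not page.has_tables:
        return "tesseract"
    return "document_ai"


def should_escalate(engine: str, avg_confidence: float, floor: float = 0.85) -> bool:
    """Retry on the top tier when a cheaper engine returns a weak result."""
    return engine != "document_ai" and avg_confidence < floor
```

The second function is where avg_confidence earns its place in the entity: a cheap pass that comes back below the floor can be rerun on the expensive engine instead of flowing downstream as bad text.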

ValidationResult

validation_id, extraction_id, checks_passed, checks_failed, errors (array of {field, rule, message, severity}), overall_confidence, routing_decision (auto_approve, human_review, retry_with_upgrade), validated_at
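The routing_decision field is where the confidence-aware routing from the TL;DR becomes concrete. The sketch below is one possible policy, not the one the design prescribes: the thresholds, the use of the weakest field as the overall score, and the already_upgraded flag are all assumptions.

```python
from typing import Mapping, Sequence


def route(
    field_conf: Mapping[str, float],
    errors: Sequence[Mapping[str, str]],
    already_upgraded: bool,
    approve_at: float = 0.90,   # hypothetical threshold
    retry_at: float = 0.60,     # hypothetical threshold
) -> str:
    """Return auto_approve, retry_with_upgrade, or human_review."""
    # Assumption: a document is only as trustworthy as its weakest field.
    overall = min(field_conf.values(), default=0.0)
    blocking = any(e.get("severity") == "error" for e in errors)

    if not blocking and overall >= approve_at:
        return "auto_approve"
    # One retry with a more expensive model, and only when the first
    # attempt was close enough that a retry is likely to help.
    if not already_upgraded and overall >= retry_at:
        return "retry_with_upgrade"
    return "human_review"
```

Capping retries at one keeps cost bounded: a document that still fails after the upgraded model goes to a human rather than looping.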

HumanReview

review_id, doc_id, extraction_id, reviewer_id, corrections (JSON diff of changed fields), review_time_seconds, status (pending, in_progress, completed), created_at, completed_at

ExtractionSchema

schema_id, doc_type, version, fields (array of {name, type, required, validation_rules}), active, created_at
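To make the schema entity concrete, here is a sketch of it as immutable Python dataclasses, populated with the invoice fields used in the API design section, plus the simplest validation it enables: a required-field check. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List, Mapping, Tuple


@dataclass(frozen=True)
class FieldSpec:
    name: str
    type: str
    required: bool = False
    validation_rules: Tuple[str, ...] = ()


@dataclass(frozen=True)
class ExtractionSchema:
    doc_type: str
    version: str
    fields: Tuple[FieldSpec, ...]
    active: bool = True


def missing_required(schema: ExtractionSchema, extracted: Mapping[str, Any]) -> List[str]:
    """Names of required fields that are absent or null in the extraction."""
    return [
        f.name
        for f in schema.fields
        if f.required and extracted.get(f.name) is None
    ]


INVOICE_V4 = ExtractionSchema(
    doc_type="invoice",
    version="v4",
    fields=(
        FieldSpec("vendor_name", "string", required=True),
        FieldSpec("invoice_number", "string", required=True),
        FieldSpec("total_amount", "number", required=True,
                  validation_rules=("must_match_line_item_sum",)),
        FieldSpec("due_date", "date"),
    ),
)
```

Because each ExtractionResult stores the schema_version it was produced with, old results can keep validating against their own version while new documents use the latest one.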

API design

POST /v1/documents - upload a document for processing

Request: {
  "source": "upload",
  "file": "<binary>",
  "file_type": "pdf",
  "priority": "normal",
  "callback_url": "https://acme.com/webhooks/doc-processed",
  "extraction_schema": "invoice_v3"
}
Response: {
  "doc_id": "doc_abc123",
  "status": "queued",
  "estimated_completion_seconds": 25,
  "queue_position": 12
}

GET /v1/documents/{doc_id} - check processing status and results

Response: {
  "doc_id": "doc_abc123",
  "status": "completed",
  "classification": {
    "doc_type": "invoice",
    "confidence": 0.97
  },
  "extraction": {
    "vendor_name": { "value": "Acme Corp", "confidence": 0.99 },
    "invoice_number": { "value": "INV-2026-0042", "confidence": 0.98 },
    "total_amount": { "value": 15420.00, "confidence": 0.95 },
    "line_items": [
      { "description": "Cloud hosting (March)", "quantity": 1, "unit_price": 15420.00, "amount": 15420.00 }
    ]
  },
  "validation": {
    "checks_passed": 8,
    "checks_failed": 0,
    "overall_confidence": 0.96,
    "routing_decision": "auto_approve"
  }
}

POST /v1/documents/batch - submit multiple documents for processing

Request: {
  "documents": [
    { "s3_uri": "s3://acme-docs/invoices/batch-march/*.pdf", "extraction_schema": "invoice_v3" }
  ],
  "priority": "bulk",
  "callback_url": "https://acme.com/webhooks/batch-complete"
}
Response: {
  "batch_id": "batch_xyz789",
  "document_count": 342,
  "estimated_completion_minutes": 8,
  "status": "processing"
}

GET /v1/documents/{doc_id}/review - get human review interface data

Response: {
  "review_id": "rev_def456",
  "doc_id": "doc_abc123",
  "original_image_url": "/v1/documents/doc_abc123/image?page=1",
  "extracted_fields": {
    "vendor_name": { "value": "Acme Corp", "confidence": 0.99, "bounding_box": [120, 45, 380, 72] },
    "total_amount": { "value": 15420.00, "confidence": 0.72, "bounding_box": [400, 890, 520, 915], "flagged": true }
  },
  "validation_errors": [
    { "field": "total_amount", "rule": "line_item_sum_match", "message": "Line items sum to $15,320.00 but total reads $15,420.00" }
  ]
}
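The line_item_sum_match error in that response can be produced by a check along these lines. This is a sketch under assumptions: the function signature and the one-cent tolerance are mine, and a real rule would likely need to account for tax, shipping, and discounts, which this design does not specify.

```python
from decimal import Decimal
from typing import Iterable, Optional


def line_item_sum_match(
    total: Decimal,
    amounts: Iterable[Decimal],
    tolerance: Decimal = Decimal("0.01"),  # hypothetical rounding allowance
) -> Optional[dict]:
    """Return a validation error dict, or None when the amounts add up."""
    items_sum = sum(amounts, Decimal("0"))
    if abs(items_sum - total) <= tolerance:
        return None
    return {
        "field": "total_amount",
        "rule": "line_item_sum_match",
        "message": (
            f"Line items sum to ${items_sum:,.2f} "
            f"but total reads ${total:,.2f}"
        ),
    }
```

Using Decimal rather than float avoids binary rounding noise that would otherwise trigger false mismatches on currency values.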

PUT /v1/schemas/{doc_type} - create or update an extraction schema

Request: {
  "doc_type": "invoice",
  "version": "v4",
  "fields": [
    { "name": "vendor_name", "type": "string", "required": true },
    { "name": "invoice_number", "type": "string", "required": true },
    { "name": "total_amount", "type": "number", "required": true, "validation": "must_match_line_item_sum" },
    { "name": "due_date", "type": "date", "required": false }
  ]
}
Response: {
  "schema_id": "schema_inv_v4",
  "doc_type": "invoice",
  "version": "v4",
  "active": true,
  "fields_count": 4
}

High-level design

The system operates as two pipelines. The real-time pipeline processes individual documents on upload: preprocess, classify, OCR, extract, validate, and route. The batch pipeline handles bulk imports by parallelizing across a worker pool, processing up to 10,000 documents per hour on a cluster of 20 workers. Both pipelines share the same extraction and validation logic.

I think of the architecture as five layers. The intake layer handles uploads and queuing. The perception layer handles OCR and layout analysis (turning pixels into structured text). The intelligence layer uses LLMs for classification and field extraction. The validation layer checks the output against business rules. The routing layer decides whether to auto-approve, retry, or send to humans. Each layer is stateless and scales independently behind the task queue.
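The five layers can be sketched as plain functions composed in sequence. Every stage body here is a stub with made-up return values; the point is only the shape: each layer takes a document record and returns a new one without touching shared state, which is what lets the layers scale independently behind a queue.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]
Stage = Callable[[Doc], Doc]


def intake(doc: Doc) -> Doc:
    return {**doc, "status": "queued"}


def perception(doc: Doc) -> Doc:
    # Stub: a real layer would run OCR and layout analysis here.
    return {**doc, "text": "<ocr text>"}


def intelligence(doc: Doc) -> Doc:
    # Stub: a real layer would call an LLM for classification and extraction.
    return {**doc, "doc_type": "invoice", "fields": {"total_amount": 15420.00}}


def validation(doc: Doc) -> Doc:
    # Stub: a real layer would run schema and business-rule checks.
    return {**doc, "checks_failed": 0}


def routing(doc: Doc) -> Doc:
    decision = "auto_approve" if doc["checks_failed"] == 0 else "human_review"
    return {**doc, "routing_decision": decision}


LAYERS: List[Stage] = [intake, perception, intelligence, validation, routing]


def run(doc: Doc, layers: List[Stage] = LAYERS) -> Doc:
    """Pass the document through every layer in order."""
    for layer in layers:
        doc = layer(doc)
    return doc
```

In production each function would be a worker pool consuming from its own queue rather than a direct call, but the data flow is the same.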

For your interview: draw the five layers and explain that each document flows through all five in sequence. This shows the interviewer you understand the pipeline pattern and can identify where bottlenecks occur (hint: it is almost always the LLM extraction layer, which takes 5-15 seconds per document).
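It is worth checking that bottleneck claim against the throughput target with Little's law (items in flight = arrival rate x time in system). The numbers are the ones quoted in this article; the only assumption is treating the extraction latency as the dominant per-document cost.

```python
import math

DOCS_PER_HOUR = 10_000
WORKERS = 20

arrival_rate = DOCS_PER_HOUR / 3600          # about 2.78 documents per second

# If each worker handles one document at a time, this is the average
# time it can spend per document and still keep up.
per_worker_budget_s = WORKERS / arrival_rate  # 7.2 seconds


def concurrent_needed(latency_s: float) -> int:
    """Little's law: in-flight extractions needed at a given latency."""
    return math.ceil(arrival_rate * latency_s)


low = concurrent_needed(5)    # 14 concurrent extractions
high = concurrent_needed(15)  # 42 concurrent extractions
```

So 20 strictly serial workers keep up only while extraction averages 7.2 seconds or less. At the 15-second end of the range the system needs about 42 extractions in flight, which means either more workers or a little over two concurrent LLM calls per worker.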
