Document Intake in Spring Boot: Multimodal Extraction, Validation, and Review Queues

Verified v1.0.0 on RHEL 8/9, Ubuntu, macOS, and Windows (Docker). Stack: Java 17 · Spring Boot 3.x · PostgreSQL · Object Storage · OpenAI-compatible Multimodal LLM API · Docker Compose

1. Overview

This solution implements a production-ready Intelligent Document Intake pipeline for Spring Boot applications. It is designed for teams that receive PDFs, scanned forms, images, or semi-structured documents from customers, partners, or internal users and need to convert those files into validated business data rather than just searchable text.

The problem it solves is straightforward: in production, document intake is rarely blocked by the inability to read text. It is blocked by the inability to turn messy, inconsistent, real-world documents into structured records that are safe to insert into downstream systems or route into operational workflows.

Typical failure patterns include:

  • OCR or basic parsing extracts text, but the application still cannot identify the document type reliably enough to send it into the correct workflow.
  • Extracted fields are incomplete, inconsistent, or wrongly mapped because documents vary by template, scan quality, language, or layout.
  • Teams can extract candidate values, but they cannot trust them because there is no validation layer and no operator review queue for uncertain results.
  • A document contains just enough information to look processable, but key business fields are missing or contradictory, so downstream systems reject the record later.
  • Intake automation breaks when a supplier, customer, or department changes the form layout slightly and the parsing logic is too brittle.

Existing approaches often fail in production because they stop at “OCR and regex” or “OCR and JSON extraction.” That may work for ideal templates, but it is not sufficient for intake flows where classification, validation, confidence handling, and human review are part of the business process.

This implementation is production-ready because:

  • It separates document ingestion, multimodal classification, structured extraction, validation, confidence scoring, and human review into explicit runtime stages.
  • It persists the original file, extraction attempts, field-level validation results, review decisions, and final routing outcomes for auditability.
  • It supports review queues and low-confidence fallbacks so the system can automate the safe cases while escalating uncertain cases instead of silently writing bad data.

In practice, this matters most in workflows such as invoice intake, onboarding forms, claims intake, complaint handling, application processing, shipment documentation, and case creation from uploaded files. The business value is not that a model can read the document. The business value is that the system can produce structured, validated data that moves work forward.

An Intelligent Document Intake pipeline changes the contract. Instead of treating the uploaded file as a blob of text for search or storage, the application treats it as an operational input that must be classified, extracted, validated, reviewed when necessary, and routed safely.

2. Architecture

Request flow and dependencies:

  • Client → Spring Boot REST API
  • Spring Boot REST API → Request validation and upload policy enforcement
  • Spring Boot service → Object storage for raw document persistence
  • Spring Boot service → Document metadata store for intake record creation
  • Spring Boot service → Multimodal classification service to determine document type and processing route
  • Spring Boot service → Extraction service to produce structured candidate fields
  • Spring Boot service → Validation engine for schema checks, field rules, and cross-field consistency
  • Spring Boot service → Confidence scoring and review policy engine
  • Spring Boot service → Review queue persistence for low-confidence or invalid cases
  • Spring Boot service → Downstream router for accepted documents and structured payload delivery
  • Spring Boot service → PostgreSQL for documents, extraction attempts, validation results, review actions, and routing outcomes
  • Spring Boot REST API → JSON response containing intake status, classification, extraction result summary, and review state

Key components:

  • API Controller: Accepts uploaded files, validates payloads, and returns intake status metadata.
  • Object Storage Adapter: Stores original documents and derived artifacts such as thumbnails or normalized images.
  • Document Registry: Creates and tracks document intake records, versions, and processing state.
  • Classification Service: Determines document type using multimodal analysis and configured document classes.
  • Extraction Service: Produces structured fields from the document according to a schema for the detected class.
  • Validation Engine: Applies required-field checks, format rules, domain rules, and cross-field consistency checks.
  • Confidence Scorer: Aggregates extraction confidence and validation quality into a routing decision.
  • Review Queue Service: Places uncertain or invalid documents into an operator review workflow.
  • Routing Service: Sends accepted records into downstream systems, APIs, or internal tables.
  • Observability Layer: Emits structured logs, metrics, and traces for classification, extraction, validation, review, and routing.

Trust boundaries:

  • Inbound boundary: The REST API validates file size, content type, tenant scope, and upload authorization before processing.
  • Model boundary: Multimodal classification and extraction outputs are treated as untrusted until validated against document schemas and business rules.
  • Storage boundary: Original files and extracted data are persisted by application-controlled services only.
  • Review boundary: Human review actions are authenticated, attributed, and persisted before final acceptance or rejection.
  • Tenant boundary: Uploaded files, extraction results, review queues, and downstream routing all remain tenant-scoped.

The system is deliberately arranged so that document automation is not treated as a single model call:

  • Classification determines what kind of document the application is looking at.
  • Extraction produces candidate structured fields.
  • Validation determines whether those fields are usable.
  • Confidence scoring decides whether automation is safe.
  • Review handles the uncertain cases.
  • Routing sends only accepted data downstream.

That separation matters in production. If a document is processed incorrectly, the team needs to know whether the issue came from file normalization, wrong document classification, bad field extraction, missing validation rules, overly aggressive auto-accept thresholds, or operator review gaps. Without that decomposition, every problem looks like “AI extracted the wrong data,” which is not actionable.
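
A minimal sketch of that decomposition as an explicit lifecycle, assuming the stage names used in this write-up; the enum and its transition map are illustrative, not the shipped implementation:

import java.util.Set;

// Illustrative intake lifecycle; the stage names mirror the decomposition
// above. Persisting every transition makes failures attributable to one stage.
public enum IntakeStatus {
    RECEIVED,          // file accepted and stored
    CLASSIFYING,       // document type being determined
    EXTRACTING,        // structured candidate fields being produced
    VALIDATING,        // field and cross-field rules running
    REVIEW_REQUIRED,   // escalated to the operator queue
    ACCEPTED,          // validated and safe to route
    ROUTED,            // delivered downstream
    REJECTED,          // unprocessable or operator-rejected
    FAILED;            // permanent stage failure

    // Legal transitions keep the pipeline from silently skipping stages.
    public Set<IntakeStatus> next() {
        return switch (this) {
            case RECEIVED        -> Set.of(CLASSIFYING, FAILED);
            case CLASSIFYING     -> Set.of(EXTRACTING, REVIEW_REQUIRED, FAILED);
            case EXTRACTING      -> Set.of(VALIDATING, REVIEW_REQUIRED, FAILED);
            case VALIDATING     -> Set.of(ACCEPTED, REVIEW_REQUIRED, REJECTED);
            case REVIEW_REQUIRED -> Set.of(ACCEPTED, REJECTED);
            case ACCEPTED        -> Set.of(ROUTED, FAILED);
            case ROUTED, REJECTED, FAILED -> Set.of();
        };
    }
}

Recording each transition in document_events (Section 4) means "which stage failed" can be answered directly from the audit trail.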

3. Key Design Decisions

Technology stack

Spring Boot 3.x is used because it provides a strong operational base for file-handling APIs, persistence, validation, security, and workflow orchestration in Java 17 services.

PostgreSQL is selected for document metadata, extraction attempts, validation outcomes, review actions, and routing records because transactional durability matters more than raw throughput for most intake systems.

Object storage is used for original files and derived artifacts because uploaded documents can be large, binary, and versioned independently from relational metadata.

OpenAI-compatible multimodal LLM APIs are used for document classification and structured extraction because they allow one implementation to support cloud or local providers that can process images and PDF-derived content.

Docker Compose is chosen for local execution because it makes the service, database, and storage dependencies easy to reproduce and verify.

Why not build the entire solution around OCR? Because OCR addresses only one part of the problem. A production intake system also needs document typing, structured field mapping, validation rules, confidence handling, review queues, and safe routing. OCR-only pipelines usually fail when form layouts vary or business rules matter more than raw text extraction.

Classification-first design

The pipeline classifies the document before applying extraction logic. That classification can map the file into a schema and processing contract such as:

  • invoice,
  • application form,
  • complaint submission,
  • ID document,
  • purchase order,
  • supporting evidence attachment,
  • or unknown.

This matters because field extraction rules depend on document class. Trying to extract every file with one generic schema leads to brittle mappings, over-extraction, and poor routing decisions.

The pipeline also supports an explicit "unknown" or "needs review" class. This is important because forcing every document into a known template increases the risk of silent failures.
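
A sketch of what that classification contract can look like in Java, assuming the document classes listed above; the record shape and threshold handling are illustrative:

// Classification result with an explicit UNKNOWN fallback. Names are
// assumptions; the key point is that low confidence degrades to UNKNOWN
// instead of being forced into the nearest known template.
public record ClassificationResult(DocumentClass documentClass,
                                   double confidence,
                                   String modelName) {

    public enum DocumentClass {
        INVOICE, APPLICATION_FORM, COMPLAINT_SUBMISSION, ID_DOCUMENT,
        PURCHASE_ORDER, SUPPORTING_EVIDENCE, UNKNOWN
    }

    // Anything below the configured confidence floor is treated as UNKNOWN
    // and takes the needs-review path rather than a class-specific schema.
    public DocumentClass effectiveClass(double minConfidence) {
        return confidence >= minConfidence ? documentClass : DocumentClass.UNKNOWN;
    }
}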

Structured extraction model

Extraction is schema-driven rather than “dump all possible fields.” Each supported document class defines:

  • required fields,
  • optional fields,
  • field types,
  • field constraints,
  • and downstream routing targets.

This design keeps the output operationally useful. A business process usually does not need an unbounded JSON blob. It needs a stable contract such as:

  • invoiceNumber
  • issueDate
  • supplierName
  • currency
  • totalAmount

We do not treat extraction as a free-form summarization task because downstream systems need predictable structure.
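
As a sketch, the invoice contract above can be pinned down as a typed record; the field types are assumptions (BigDecimal amounts, ISO 4217 currency), chosen so validation can normalize values before routing:

import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.Currency;

// Typed extraction contract for the invoice class. Carrying the schema
// version on the payload keeps old extractions auditable against the
// schema that produced them.
public record InvoiceExtraction(String invoiceNumber,
                                LocalDate issueDate,
                                String supplierName,
                                Currency currency,
                                BigDecimal totalAmount,
                                int schemaVersion) {}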

Validation-first acceptance

The runtime never trusts extracted fields just because the model produced them. Every extracted payload goes through a validation layer that can include:

  • schema validation,
  • regex or format checks,
  • date and amount normalization,
  • mandatory-field checks,
  • cross-field checks,
  • domain-specific rules,
  • and duplicate detection if needed.

This is one of the most important decisions in the system. In a document intake workflow, unvalidated automation is usually worse than partial automation because bad data reaches business systems and causes downstream failures or silent corruption.
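
A minimal validation sketch over the invoice contract above; the rule codes are illustrative except REQUIRED_FIELD_MISSING, which matches the API example in Section 5. A real engine would load rules per document class and schema version:

import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

public final class InvoiceValidator {

    public record Issue(String field, String code, String message) {}

    public List<Issue> validate(InvoiceExtraction inv) {
        var issues = new ArrayList<Issue>();
        // Mandatory-field check
        if (inv.invoiceNumber() == null || inv.invoiceNumber().isBlank()) {
            issues.add(new Issue("invoiceNumber", "REQUIRED_FIELD_MISSING",
                    "Invoice number is mandatory"));
        }
        // Domain rule: totals must be positive
        if (inv.totalAmount() != null
                && inv.totalAmount().compareTo(BigDecimal.ZERO) <= 0) {
            issues.add(new Issue("totalAmount", "AMOUNT_NOT_POSITIVE",
                    "Total amount must be greater than zero"));
        }
        // Cross-field check: a currency without an amount is contradictory
        if (inv.currency() != null && inv.totalAmount() == null) {
            issues.add(new Issue("totalAmount", "CROSS_FIELD_INCONSISTENT",
                    "Currency present but total amount missing"));
        }
        return issues;
    }
}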

Confidence and review policy

The pipeline treats low confidence as a routing signal, not as an error. A document can be:

  • automatically accepted,
  • routed to human review,
  • or rejected as unprocessable.

This matters because not every file should be forced through the same path. Safe automation means:

  • auto-accepting high-confidence, valid documents,
  • escalating uncertain or contradictory cases,
  • and rejecting clearly unusable inputs early.

A review queue is therefore a core feature, not an operational afterthought.
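
A sketch of that three-way decision, assuming the validator above; the 0.95 default mirrors APP_INTAKE_AUTO_ACCEPT_THRESHOLD from Section 8, and everything else is illustrative:

import java.util.List;

public final class ReviewPolicy {

    public enum Decision { AUTO_ACCEPT, HUMAN_REVIEW, REJECT }

    private final double autoAcceptThreshold; // e.g. 0.95

    public ReviewPolicy(double autoAcceptThreshold) {
        this.autoAcceptThreshold = autoAcceptThreshold;
    }

    public Decision decide(double confidence,
                           List<InvoiceValidator.Issue> issues,
                           boolean fileUsable) {
        if (!fileUsable) {
            return Decision.REJECT;       // clearly unusable input, fail early
        }
        if (issues.isEmpty() && confidence >= autoAcceptThreshold) {
            return Decision.AUTO_ACCEPT;  // the safe, automatable majority
        }
        return Decision.HUMAN_REVIEW;     // uncertain or invalid: escalate
    }
}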

Synchrony versus asynchrony

Upload acceptance is synchronous, but classification, extraction, validation, review routing, and downstream delivery can run asynchronously. This is important because document processing may take longer than a standard request budget, especially for large files or multipage PDFs.

The API therefore creates a durable intake record and exposes status transitions over time rather than requiring the entire document lifecycle to complete in one request.
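
A sketch of that split in a Spring controller: the request returns 202 Accepted once the durable intake record exists, and an application event triggers the asynchronous stages. DocumentRegistry and DocumentReceivedEvent are illustrative names, not the shipped API:

import org.springframework.context.ApplicationEventPublisher;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;
import java.util.Map;

@RestController
public class IntakeController {

    private final DocumentRegistry registry;        // persists intake records
    private final ApplicationEventPublisher events; // hands off to async stages

    IntakeController(DocumentRegistry registry, ApplicationEventPublisher events) {
        this.registry = registry;
        this.events = events;
    }

    @PostMapping("/api/documents/intake")
    ResponseEntity<Map<String, String>> intake(@RequestParam("file") MultipartFile file,
                                               @RequestParam("tenantId") String tenantId) {
        String documentId = registry.register(tenantId, file);       // store file + metadata
        events.publishEvent(new DocumentReceivedEvent(documentId));  // async pipeline starts here
        return ResponseEntity.status(HttpStatus.ACCEPTED)
                .body(Map.of("documentId", documentId, "status", "PROCESSING"));
    }
}

// Illustrative collaborators
interface DocumentRegistry { String register(String tenantId, MultipartFile file); }
record DocumentReceivedEvent(String documentId) {}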

Error handling and retries

  • Transient (temporary model/API failure, object storage timeout, downstream routing timeout): bounded retry with per-stage retry limits and visibility into pending state.
  • Permanent (invalid file type, corrupted file, unsupported document class, schema failure after extraction): mark as failed or send to review depending on policy.

Retries are intentionally scoped by stage. The system should not re-upload a file or recreate intake metadata just because downstream routing failed. Each stage should retry within its own operational boundary.

Human review as a first-class component

Human review is modeled as a formal system state with:

  • assigned reviewer,
  • field-level corrections,
  • review decision,
  • and audit trail.

This is essential because the purpose of intake automation is not “zero humans forever.” The purpose is to automate the safe majority while giving operators efficient control over uncertain cases.

4. Data Model

Core tables:

  • documents

    • Purpose: Stores the lifecycle and metadata of each uploaded document.
    • Key columns: id, tenant_id, document_id, original_filename, mime_type, storage_uri, status, uploaded_by, created_at, updated_at
  • document_classifications

    • Purpose: Stores classification attempts and final document class.
    • Key columns: id, tenant_id, document_id, classification_label, confidence_score, model_name, status, created_at
  • document_extractions

    • Purpose: Stores extraction attempts and structured candidate payloads.
    • Key columns: id, tenant_id, document_id, document_class, schema_version, extracted_payload, confidence_payload, model_name, created_at
  • field_validation_results

    • Purpose: Stores field-level and document-level validation results.
    • Key columns: id, tenant_id, document_extraction_id, field_name, validation_code, validation_status, normalized_value, message, created_at
  • review_queue_items

    • Purpose: Stores documents that require human review.
    • Key columns: id, tenant_id, document_id, queue_name, reason_code, priority, status, assigned_to, created_at, updated_at
  • review_actions

    • Purpose: Stores reviewer decisions and field corrections.
    • Key columns: id, tenant_id, review_queue_item_id, reviewer_id, action_type, correction_payload, notes, created_at
  • routing_results

    • Purpose: Stores downstream routing attempts and outcomes.
    • Key columns: id, tenant_id, document_id, target_system, request_payload, status, external_reference, created_at
  • document_events

    • Purpose: Stores ordered lifecycle events for auditing and replay diagnostics.
    • Key columns: id, tenant_id, document_id, event_type, event_payload, created_at

Indexing strategy:

  • documents(tenant_id, document_id) for document lookup
  • documents(tenant_id, status, updated_at) for active processing scans
  • document_classifications(document_id, created_at) for classification history
  • document_extractions(document_id, created_at) for extraction history
  • field_validation_results(document_extraction_id, field_name) for review and debugging
  • review_queue_items(tenant_id, queue_name, status, priority) for operator workload management
  • routing_results(document_id, target_system) for downstream result lookup
  • document_events(document_id, created_at) for lifecycle replay
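
As a sketch, the documents table and its two indexes map naturally onto a JPA entity; column names follow the model above, and the remaining lifecycle fields are omitted for brevity:

import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Index;
import jakarta.persistence.Table;
import java.time.Instant;

@Entity
@Table(name = "documents", indexes = {
        @Index(name = "idx_documents_lookup", columnList = "tenant_id, document_id"),
        @Index(name = "idx_documents_active", columnList = "tenant_id, status, updated_at")
})
public class DocumentRecord {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "tenant_id", nullable = false)
    private String tenantId;

    @Column(name = "document_id", nullable = false)
    private String documentId;          // external business identifier

    @Column(name = "original_filename")
    private String originalFilename;

    @Column(name = "mime_type")
    private String mimeType;

    @Column(name = "storage_uri", nullable = false)
    private String storageUri;          // object storage location of the raw file

    @Column(nullable = false)
    private String status;              // e.g. PROCESSING, REVIEW_REQUIRED, ROUTED

    @Column(name = "updated_at", nullable = false)
    private Instant updatedAt = Instant.now();

    protected DocumentRecord() {}       // required by JPA
}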

The structure above supports the core operational goals of intelligent document intake:

  1. Traceability — every document can be traced from upload to routing or rejection.
  2. Recoverability — failed stages can be retried without losing prior state.
  3. Reviewability — low-confidence cases can be inspected and corrected by humans.
  4. Auditability — the system preserves original files, extracted candidates, validation results, and final decisions.

For document modeling, the runtime should distinguish clearly between:

  • the original file,
  • the detected document class,
  • the extracted payload,
  • the normalized payload,
  • and the reviewer-approved payload.

That separation makes it easier to improve extraction over time without losing the raw input or the operational decision trail.

5. API Surface

  • POST /api/documents/intake – Upload a document and create a durable intake record (ROLE_USER)
  • GET /api/documents/{id} – Fetch intake status, classification, validation summary, and routing outcome (ROLE_USER)
  • GET /api/documents/{id}/extractions – Return extraction attempts and candidate payloads (ROLE_ADMIN)
  • GET /api/documents/{id}/validations – Return field-level validation results (ROLE_ADMIN)
  • GET /api/review/queue – Return pending review queue items (ROLE_ADMIN or reviewer role)
  • POST /api/review/queue/{id}/claim – Claim a review task (ROLE_ADMIN or reviewer role)
  • POST /api/review/queue/{id}/submit – Submit review corrections and final decision (ROLE_ADMIN or reviewer role)
  • POST /api/admin/document-classes – Register or update a document class schema and routing contract (ROLE_ADMIN)
  • POST /api/admin/documents/{id}/reprocess – Re-run classification or extraction on a document (ROLE_ADMIN)
  • GET /actuator/health – Health endpoint for service readiness (ROLE_ADMIN / ops network)
  • GET /actuator/prometheus – Metrics scraping endpoint (ROLE_ADMIN / ops network)

Example response from POST /api/documents/intake:

{
  "documentId": "doc_20260426_0001",
  "status": "PROCESSING",
  "review": {
    "required": false
  }
}

Example response from GET /api/documents/{id} after extraction:

{
  "documentId": "doc_20260426_0001",
  "status": "REVIEW_REQUIRED",
  "classification": {
    "label": "invoice",
    "confidenceScore": 0.93
  },
  "validation": {
    "documentValid": false,
    "issues": [
      {
        "field": "invoiceNumber",
        "code": "REQUIRED_FIELD_MISSING"
      }
    ]
  },
  "review": {
    "required": true,
    "queueItemId": "rq_3001"
  }
}

The API surface is intentionally centered on durable intake and review rather than immediate inline extraction responses. A document intake system is an operational workflow, not just a parsing endpoint.

A few implementation notes matter here:

  • Upload endpoints should support idempotent client submission when file retry behavior matters (a minimal sketch follows this list).
  • Review endpoints should keep field-level corrections separate from raw extraction output.
  • Reprocess endpoints should create new extraction attempts rather than overwriting historical results.
  • Admin document-class endpoints should version schemas so old documents remain auditable against the schema that applied at the time.
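
A minimal idempotency sketch, assuming the client sends an Idempotency-Key header on upload retries. A production version would back this with a unique (tenant_id, idempotency_key) constraint in PostgreSQL rather than an in-memory map:

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public final class IdempotentIntakeService {

    private final Map<String, String> seenKeys = new ConcurrentHashMap<>();

    // Returns the existing documentId for a replayed key, or empty if new.
    public Optional<String> existingDocument(String tenantId, String idempotencyKey) {
        return Optional.ofNullable(seenKeys.get(tenantId + ":" + idempotencyKey));
    }

    // Records the key once the intake record has been durably created.
    public void remember(String tenantId, String idempotencyKey, String documentId) {
        seenKeys.putIfAbsent(tenantId + ":" + idempotencyKey, documentId);
    }
}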

6. Security Model

Authentication

Authentication is handled through Spring Security with JWT bearer tokens or opaque access tokens issued by the upstream identity provider. Every request carries user identity and tenant identity claims.

Authorization (roles)

  • ROLE_USER: Can upload documents and read intake status within tenant scope.
  • ROLE_ADMIN: Can inspect extraction details, validations, routing attempts, reprocess documents, and manage document class contracts.
  • ROLE_REVIEWER: Can claim review tasks, correct fields, and submit review outcomes within authorized queues.
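
A sketch of that role mapping as a Spring Security 6 filter chain; the matchers and JWT resource-server wiring are assumptions, the point being that intake, review, admin, and ops surfaces get distinct rules:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class ApiSecurityConfig {

    @Bean
    SecurityFilterChain apiFilterChain(HttpSecurity http) throws Exception {
        http
            .authorizeHttpRequests(auth -> auth
                .requestMatchers("/api/admin/**").hasRole("ADMIN")
                .requestMatchers("/api/review/**").hasAnyRole("ADMIN", "REVIEWER")
                .requestMatchers("/api/documents/**").hasAnyRole("USER", "ADMIN")
                .requestMatchers("/actuator/**").hasRole("ADMIN") // or ops network
                .anyRequest().denyAll())
            .oauth2ResourceServer(oauth -> oauth.jwt(Customizer.withDefaults()));
        return http.build();
    }
}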

Data isolation guarantees

Every document, extraction, validation result, review task, and routing record includes tenant_id, and all lookups are tenant-scoped. Object storage paths and metadata must also respect tenant partitioning.
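
One practical way to keep lookups tenant-scoped, sketched against the entity from Section 4: every repository finder takes tenantId as the leading parameter, so there is no code path that queries across tenants:

import org.springframework.data.jpa.repository.JpaRepository;
import java.util.List;
import java.util.Optional;

public interface DocumentRepository extends JpaRepository<DocumentRecord, Long> {

    Optional<DocumentRecord> findByTenantIdAndDocumentId(String tenantId, String documentId);

    List<DocumentRecord> findByTenantIdAndStatusOrderByUpdatedAtDesc(String tenantId, String status);
}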

This isolation has to be applied consistently:

  • Uploaded files must be stored under tenant-scoped prefixes or buckets.
  • Reviewers must only see queues they are authorized to access.
  • Admin endpoints must not expose another tenant’s raw files or extracted payloads.
  • Sensitive documents may require encryption at rest, signed URL controls, and field-level masking in APIs.

Security also intersects with extraction prompts and review UIs. Personally identifiable information, financial identifiers, and sensitive attachments should be exposed only to the minimum set of services and users required for processing.

7. Operational Behavior

Startup behavior

On startup, the application:

  • Validates required environment variables for database, object storage, and multimodal model providers
  • Runs schema migrations
  • Verifies PostgreSQL connectivity
  • Verifies object storage connectivity and bucket/container readiness
  • Initializes classification and extraction model clients
  • Loads document class schemas, validation rules, and routing policies
  • Registers health indicators for DB, storage, and model dependencies

The service only reports ready after core ingestion, storage, and processing dependencies are available.

For intake systems, startup behavior matters because uploads may arrive continuously. A service that can accept files but cannot persist metadata, classify documents, or route review tasks is only partially operational and should not present itself as ready.
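
A sketch of wiring that rule into readiness with a custom Actuator health indicator; ModelClient and its ping() probe are illustrative names:

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class ModelProviderHealthIndicator implements HealthIndicator {

    private final ModelClient modelClient;

    public ModelProviderHealthIndicator(ModelClient modelClient) {
        this.modelClient = modelClient;
    }

    @Override
    public Health health() {
        try {
            modelClient.ping();   // cheap connectivity probe, not a full inference
            return Health.up().withDetail("provider", "multimodal-llm").build();
        } catch (Exception e) {
            return Health.down(e).withDetail("provider", "multimodal-llm").build();
        }
    }

    interface ModelClient { void ping(); }   // hypothetical client abstraction
}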

Failure modes

  • Object storage unavailable: the service rejects uploads because original-file persistence is mandatory.
  • Model provider unavailable: the service can still accept and store documents, but documents remain queued or unclassified based on policy.
  • Schema or validation rules unavailable: classification or extraction may proceed, but documents should not auto-accept without validation configuration.
  • Downstream routing unavailable: validated documents remain in an accepted-but-unrouted state and can be retried later.
  • Unsupported or corrupted file: the intake record is marked failed or routed to manual handling.
  • Low-confidence extraction: the document is placed into review instead of being auto-routed.

Those behaviors should be deliberate. The most dangerous failure mode in document intake is not “stopping.” It is “sending weak or invalid data into downstream systems.”

Retry and timeout behavior

Recommended defaults:

  • File persistence timeout: 2s to 10s depending on file size
  • Classification timeout: 5s to 20s
  • Extraction timeout: 5s to 30s
  • Downstream routing timeout: 2s to 10s

Retry policy:

  • Object storage write: retry once or twice for transient network failures
  • Classification and extraction: bounded retries for transient model/API failures
  • Routing: bounded retries with backoff, isolated from upload and extraction stages
  • No repeated retries for corrupted file content, unsupported type, or validation-rule failure without reprocessing

Timeouts and retries should be managed per stage so a routing problem does not cause the system to re-run upload and extraction unnecessarily.
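
A sketch of per-stage retry scoping using Resilience4j (an assumption; any bounded-retry mechanism works). Each stage owns its retry budget, so a routing failure never re-runs upload or extraction:

import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import java.io.IOException;
import java.time.Duration;
import java.util.function.Supplier;

public final class StageRetries {

    private static Retry stage(String name, int attempts, Duration backoff) {
        return Retry.of(name, RetryConfig.custom()
                .maxAttempts(attempts)
                .waitDuration(backoff)
                .retryExceptions(IOException.class)   // transient failures only
                .build());
    }

    public static final Retry STORAGE_WRITE = stage("storage-write", 2, Duration.ofMillis(250));
    public static final Retry EXTRACTION    = stage("extraction", 3, Duration.ofSeconds(1));
    public static final Retry ROUTING       = stage("routing", 3, Duration.ofSeconds(2));

    // Usage: wrap only the stage being protected.
    public static <T> T withRetry(Retry retry, Supplier<T> call) {
        return Retry.decorateSupplier(retry, call).get();
    }
}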

Observability hooks

Structured logs: document_id, tenant_id, document_class, classification_confidence, extraction_model, validation_status, review_required, queue_name, routing_target, latency_ms, outcome

OpenTelemetry traces:

  • document_intake.request: root span for upload and intake creation
  • document_intake.store_file: object storage persistence span
  • document_intake.classify: classification span
  • document_intake.extract: structured extraction span
  • document_intake.validate: validation span
  • document_intake.enqueue_review: review queue placement span
  • document_intake.route: downstream routing span
  • document_intake.persist_state: metadata and state transition persistence span
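
A sketch of emitting one of those spans with the OpenTelemetry API; attribute names mirror the structured-log fields above, and the wiring of the OpenTelemetry instance is assumed to come from the Spring configuration:

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import java.util.function.Supplier;

public final class ClassificationTracing {

    private final Tracer tracer;

    public ClassificationTracing(OpenTelemetry otel) {
        this.tracer = otel.getTracer("document-intake");
    }

    public ClassificationResult classify(String documentId, String tenantId,
                                         Supplier<ClassificationResult> call) {
        Span span = tracer.spanBuilder("document_intake.classify").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("document_id", documentId);
            span.setAttribute("tenant_id", tenantId);
            ClassificationResult result = call.get();
            span.setAttribute("classification_confidence", result.confidence());
            return result;
        } catch (RuntimeException e) {
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}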

These hooks are essential because quality tuning in document automation is empirical. Operators need to answer:

  • Which document classes are classified accurately?
  • Which fields fail validation most often?
  • Which suppliers or templates trigger review more frequently?
  • Are review queues growing because extraction quality dropped or because business rules changed?
  • Are downstream routing failures caused by bad extraction or external system instability?

Without structured observability, those questions become difficult to answer.

8. Local Execution

Prerequisites

  • Docker Desktop with Compose v2
  • JDK 17
  • Available ports: 8080, 5432, 9000

Environment variables

SPRING_DATASOURCE_URL=jdbc:postgresql://localhost:5432/documentintakedb
SPRING_DATASOURCE_USERNAME=intake
SPRING_DATASOURCE_PASSWORD=intake

OBJECT_STORAGE_ENDPOINT=http://localhost:9000
OBJECT_STORAGE_ACCESS_KEY=minio
OBJECT_STORAGE_SECRET_KEY=minio123
OBJECT_STORAGE_BUCKET=documents

LLM_API_BASE_URL=http://host.docker.internal:11434/v1
LLM_API_KEY=dummy
LLM_MULTIMODAL_MODEL=gpt-4.1-mini

APP_INTAKE_AUTO_ACCEPT_THRESHOLD=0.95
APP_INTAKE_ENABLE_REVIEW_QUEUE=true
APP_INTAKE_MAX_FILE_MB=20

Docker Compose usage

docker compose up -d --build

Verification steps

  1. Health check:
curl -s http://localhost:8080/actuator/health
  2. Upload a document:
curl -s -X POST http://localhost:8080/api/documents/intake \
  -H "Authorization: Bearer <TOKEN>" \
  -F "file=@./samples/invoice.pdf" \
  -F "tenantId=demo-tenant"
  3. Inspect intake result:
curl -s http://localhost:8080/api/documents/<DOCUMENT_ID> \
  -H "Authorization: Bearer <TOKEN>"
  4. List review queue:
curl -s http://localhost:8080/api/review/queue \
  -H "Authorization: Bearer <REVIEWER_TOKEN>"
  5. Submit a review correction:
curl -s -X POST http://localhost:8080/api/review/queue/<QUEUE_ITEM_ID>/submit \
  -H "Authorization: Bearer <REVIEWER_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "decision": "APPROVE",
    "corrections": {
      "invoiceNumber": "INV-2026-0041"
    }
  }'

A practical local setup should include:

  • at least one clearly classifiable invoice or form,
  • one intentionally low-quality or incomplete document that triggers review,
  • one validation rule failure scenario,
  • and one downstream routing simulation for accepted documents.

A good demo does not only show extraction success. It shows classification, validation, review escalation, correction, and final routing so the operational value is visible.

9. Evidence Pack

Checklist of included evidence artifacts:

  • [ ] Service startup logs showing DB, object storage, and model-provider readiness
  • [ ] Successful document upload with generated intake record
  • [ ] Database record in documents showing persisted storage URI and status transition
  • [ ] Classification proof in document_classifications showing label and confidence
  • [ ] Extraction proof in document_extractions showing structured candidate payload
  • [ ] Validation proof in field_validation_results showing field-level rule outcomes
  • [ ] Review proof in review_queue_items and review_actions showing escalation and correction
  • [ ] Routing proof in routing_results showing accepted payload delivery
  • [ ] Test evidence: integration test output for upload, extraction, review, and routing

A strong evidence pack makes this kind of solution credible. It shows that the pipeline is not only technically possible, but operationally usable.

For a public-facing solution post, the most convincing artifacts are usually:

  • a startup log showing ready state,
  • an uploaded document and resulting intake status,
  • a low-confidence or invalid example routed into review,
  • and a corrected document that is finally routed downstream.

Those artifacts are also useful later when document classes, models, or validation rules change.

10. Known Limitations

  • The solution improves throughput and reduces manual work, but it does not eliminate the need for human review in ambiguous or high-risk cases.
  • Multimodal extraction quality depends on document quality, layout complexity, language, and the strength of the chosen model.
  • Validation rules require domain knowledge. A weak validation layer can allow bad data through even when extraction looks plausible.

There are additional practical limitations worth stating directly.

First, classification errors can cascade. If the document type is wrong, the extraction schema may also be wrong, which makes validation and routing less reliable.

Second, some documents are inherently multi-purpose or mixed-content. A single file may contain several document types or attachments, which may require splitting or page-level classification.

Third, fully automated routing is less suitable for regulated, legal, or financially sensitive flows unless review policy and audit controls are strong.

11. Extension Points

  • Replace OpenAI-compatible multimodal extraction with specialized OCR/Document AI providers when template coverage, handwriting, or table extraction requirements are stronger.
  • Add page-level classification and document splitting for compound uploads.
  • Add duplicate detection using document fingerprints for repeated submissions.
  • Add entity matching against suppliers, customers, or internal records to enrich extracted payloads.
  • Production hardening: add encryption at rest, signed document access, reviewer SLAs, queue prioritization, and routing replay controls for stricter operational guarantees.

Several extensions are especially natural for real deployments.

Template-aware extraction can improve quality for known document families while preserving the generic path for unknown files.

Reviewer-assist UIs can highlight extracted fields directly on the document image to reduce correction time.

Feedback capture can turn reviewer corrections into labeled examples for improving prompts, rules, or model routing over time.

Domain-specific validation packs can package reusable rule sets for invoice, claims, onboarding, or complaint workflows.

Closing Notes

Intelligent Document Intake is not about proving that a model can read a file. It is about making uploaded documents operationally useful with enough structure, validation, and control to move work safely through a business process.

A basic document pipeline extracts text. An intake-aware pipeline asks a stricter question: what kind of document is this, what fields matter, are those fields valid, and is it safe to route them automatically?

That difference is small in a demo, but large in production behavior. It is the difference between:

  • storing parsed text that still requires manual handling,
  • and producing structured, validated, review-aware business data that can enter downstream systems with confidence.

If your current Spring Boot AI service already supports document upload or document search, the next meaningful step is not necessarily a larger model. In many cases, the biggest upgrade is to add the missing workflow controls:

  • document classification,
  • schema-driven extraction,
  • validation rules,
  • review queues,
  • and safe routing.

That is what turns document AI from a parsing experiment into a production-grade intake system.

Changelog

2026-04-29 · v1.0 created
