Secure Document Ingestion for RAG: Chunking, Deduplication, and PII Redaction
A runnable ingestion pipeline that extracts text, deduplicates, redacts PII, generates embeddings, and produces evidence artifacts for compliance and quality.
1. Overview
This solution implements a secure, production-oriented document ingestion pipeline suitable for Retrieval-Augmented Generation (RAG). It accepts documents, extracts text, normalizes and chunks content, performs deduplication, redacts sensitive information (PII), generates embeddings, and persists both the processed text and vector representations for retrieval. It also emits auditable evidence artifacts to support compliance and operational verification.
Common ingestion implementations fail in production for predictable reasons:
- They treat parsing as “best effort” without defensible failure classification, leading to silent partial ingestion and hard-to-debug retrieval defects.
- They lack deterministic chunking and stable identifiers, so re-ingesting the same document creates duplicated vectors and inconsistent retrieval results.
- They do not provide a PII control surface that is testable, explainable, and reproducible, which is essential for regulated environments.
- They mix parsing, redaction, embedding, and persistence into a single synchronous request path, creating latency spikes and reliability issues when upstream dependencies degrade.
This implementation is production-ready because it:
- Separates ingestion into a durable job model with idempotent processing and explicit run states.
- Uses a canonical content hashing strategy to deduplicate at both document and chunk levels.
- Implements configurable, evidence-producing PII redaction with traceable decisions (what was removed, why, and when) without storing raw secrets.
- Persists vectors in PostgreSQL with pgvector using retrieval-appropriate indexes, while keeping lineage metadata for auditability.
- Provides deterministic chunking, stable chunk IDs, and repeatable embeddings workflows.
- Produces an evidence pack (logs + structured run artifacts + database state verification queries) that can be used to prove correct execution.
2. Architecture
Request flow, components, dependencies, and trust boundaries:
- Client → Ingestion API (Spring Boot)
- Ingestion API → PostgreSQL (job state, documents, chunks, embeddings metadata)
- Ingestion Worker (Spring Boot scheduled/async executor) → Tika (text extraction)
- Ingestion Worker → PII Redaction Engine (local library + configurable detectors)
- Ingestion Worker → Embedding Provider (OpenAI-compatible API or mock)
- Ingestion Worker → PostgreSQL + pgvector (chunk storage + embeddings)
- Audit events → PostgreSQL audit_log table (append-only)
- Evidence artifacts → Local filesystem volume mounted via Docker Compose (e.g., ./evidence)
Key components:
- IngestionController: Accepts documents and creates ingestion jobs.
- IngestionOrchestrator: State machine for runs (CREATED → EXTRACTED → REDACTED → EMBEDDED → COMPLETED / FAILED).
- TextExtractor: Tika-backed extraction and normalization.
- Chunker: Deterministic chunking (token/char-based) with stable chunk IDs.
- DedupService: Content-hash deduplication at document and chunk levels.
- PiiRedactor: Applies configured detectors and produces redaction report.
- EmbeddingClient: Calls embedding provider with timeouts/retries and per-chunk idempotency.
- Persistence Layer: Stores documents, chunks, embeddings, and audit events.
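The Chunker's deterministic behavior can be sketched as fixed-size character windows with overlap; a minimal illustration (the class name and the 1000/200 window defaults are assumptions, not the shipped implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of deterministic, character-based chunking with overlap.
// The same input always yields the same chunks, which keeps chunk hashes stable.
public final class Chunker {
    public static List<String> chunk(String text, int maxChars, int overlap) {
        if (maxChars <= overlap) throw new IllegalArgumentException("maxChars must exceed overlap");
        List<String> chunks = new ArrayList<>();
        int step = maxChars - overlap; // each window advances by (size - overlap)
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(text.length(), start + maxChars);
            chunks.add(text.substring(start, end));
            if (end == text.length()) break; // final partial window
        }
        return chunks;
    }
}
```

Token-based chunking would follow the same shape, substituting a tokenizer's offsets for character offsets.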
External dependencies:
- PostgreSQL + pgvector
- Apache Tika (in-process library usage)
- Embedding endpoint (OpenAI-compatible API or local/mock provider)
- Docker Compose for local orchestration
Trust boundaries:
- Untrusted input boundary: Document upload content and metadata.
- Secrets boundary: Embedding provider API keys and DB credentials (env vars).
- Data boundary: Raw extracted text is treated as sensitive; redaction occurs before long-term storage of chunk text intended for retrieval.
- Operator boundary: Admin endpoints and evidence artifacts access are restricted to ROLE_ADMIN.
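Because redaction happens before chunk text crosses the data boundary into long-term storage, the detector-based approach matters. A minimal regex-only redactor sketch (the patterns, class name, and `[REDACTED:TYPE]` marker format are assumptions; the real PiiRedactor is configurable and policy-versioned):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Minimal regex-based PII redaction sketch. Production detectors would be
// configurable, versioned, and emit a structured redaction report.
public final class SimplePiiRedactor {
    private static final Map<String, Pattern> DETECTORS = new LinkedHashMap<>();
    static {
        DETECTORS.put("EMAIL", Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"));
        DETECTORS.put("US_SSN", Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b"));
    }

    public static String redact(String text) {
        String out = text;
        // Apply each detector in insertion order, replacing matches with typed markers.
        for (Map.Entry<String, Pattern> d : DETECTORS.entrySet()) {
            out = d.getValue().matcher(out).replaceAll("[REDACTED:" + d.getKey() + "]");
        }
        return out;
    }
}
```

Typed markers (rather than bare removal) are what make the redaction report explainable: counts by PII type fall directly out of the marker names.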
3. Key Design Decisions
Technology stack
- Java 17 + Spring Boot 3.x: Mature operational model, strong observability ecosystem, and predictable dependency management.
- PostgreSQL + pgvector: Simplifies deployment (single data plane), supports transactional job state + vectors, and allows consistent retrieval semantics without an additional vector DB for this deployment model.
- Tika: Broad file format extraction coverage and stable behavior for common office/PDF/text formats.
Data storage model
- Store ingestion as a first-class run (job) with explicit status transitions, timestamps, and failure reasons.
- Persist documents and chunks with stable identifiers derived from canonical hashes.
- Store embeddings per chunk in a dedicated table with a vector column (pgvector), and maintain a strict lineage relationship: document → chunks → embeddings.
- Maintain an append-only audit log for state transitions and security-relevant events.
Synchrony vs asynchrony
- The upload request is synchronous only to validate input and create a run record; heavy work (extraction, redaction, embedding) is executed asynchronously by a worker executor.
- This avoids coupling user-facing latency to Tika performance and embedding provider availability, and allows controlled retries without client involvement.
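The orchestrator's run lifecycle (CREATED → EXTRACTED → REDACTED → EMBEDDED → COMPLETED / FAILED) can be sketched as an enum with an explicit transition rule; the method name is illustrative:

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch of the ingestion run state machine. Forward transitions follow the
// pipeline order; any non-terminal state may transition to FAILED.
public enum RunState {
    CREATED, EXTRACTED, REDACTED, EMBEDDED, COMPLETED, FAILED;

    private static final Set<RunState> TERMINAL = EnumSet.of(COMPLETED, FAILED);

    public boolean canTransitionTo(RunState next) {
        if (TERMINAL.contains(this)) return false;   // terminal states are final
        if (next == FAILED) return true;             // any active state may fail
        return next.ordinal() == this.ordinal() + 1; // otherwise strictly forward
    }
}
```

Guarding transitions in one place makes illegal worker behavior (e.g., embedding before redaction) fail loudly instead of corrupting run state.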
Error handling and retries
- Failures are classified by domain:
  - Extraction failures (corrupt file, unsupported type) → terminal FAILED with reason EXTRACT_ERROR.
  - Redaction failures (misconfiguration, detector crash) → terminal FAILED with reason REDACTION_ERROR.
  - Embedding failures (timeouts, 429, transient 5xx) → retriable with capped exponential backoff and jitter; terminal if the attempt budget is exceeded.
  - Persistence failures (serialization, constraint violations) → terminal FAILED with reason DB_ERROR, after verifying idempotency safety.
- Retries are bounded (e.g., max attempts per chunk) and recorded in run- and chunk-level attempt counters to prevent unbounded loops.
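Capped exponential backoff with jitter, as described for embedding retries, might be computed like this (the class name and the base/cap defaults are illustrative assumptions):

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of capped exponential backoff with full jitter.
public final class Backoff {
    public static long nextDelayMillis(int attempt, long baseMillis, long capMillis) {
        if (attempt < 1) throw new IllegalArgumentException("attempt starts at 1");
        // Exponential growth: base * 2^(attempt-1); shift is clamped to avoid overflow.
        long exp = baseMillis * (1L << Math.min(attempt - 1, 20));
        long ceiling = Math.min(capMillis, exp);
        // Full jitter: uniform in [0, ceiling] spreads retries across workers,
        // avoiding synchronized retry storms against a degraded provider.
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

The cap bounds worst-case wait per attempt; the attempt budget (not shown) bounds total retries.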
Idempotency strategy
- Idempotency is enforced through content-addressing:
  - Document identity: doc_sha256 = sha256(normalized_extracted_text + source_metadata_fingerprint)
  - Chunk identity: chunk_sha256 = sha256(doc_sha256 + chunk_index + chunk_text_normalized)
- Unique constraints prevent duplicate inserts:
  - unique(document.doc_sha256)
  - unique(chunk.chunk_sha256)
  - unique(embedding.chunk_id, embedding_model)
- Re-ingesting the same document results in either:
  - A new run referencing the existing document record (if configured), or
  - A short-circuit “DEDUPED” completion state (depending on operational preference), without generating duplicate vectors.
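The content-addressing scheme can be sketched with the JDK's MessageDigest; the exact canonicalization (the "\n" separator and normalization rules) is an assumption here, but whatever scheme is chosen must be fixed forever, since the hashes are identities:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Sketch of content-addressed identities for documents and chunks.
// The "\n" field separator is an illustrative assumption.
public final class ContentIds {
    public static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(md.digest(input.getBytes(StandardCharsets.UTF_8)));
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory in every JDK
        }
    }

    public static String docSha256(String normalizedText, String metadataFingerprint) {
        return sha256Hex(normalizedText + "\n" + metadataFingerprint);
    }

    public static String chunkSha256(String docSha256, int chunkIndex, String chunkTextNormalized) {
        return sha256Hex(docSha256 + "\n" + chunkIndex + "\n" + chunkTextNormalized);
    }
}
```

An explicit separator matters: without one, ("ab", "c") and ("a", "bc") would collide to the same hash.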
4. Data Model
Core tables and intent:
- ingestion_run
  - Purpose: Durable orchestration state for each ingestion attempt.
  - Key columns: id (uuid), status, source_filename, content_type, created_at, updated_at, failure_reason, error_detail, requested_by, dedup_mode, embedding_model, pii_policy_version.
  - Indexing: (status, updated_at) for worker polling; (created_at) for operational queries.
- document
  - Purpose: Canonical representation of a unique extracted document.
  - Key columns: id, doc_sha256 (unique), title, source_metadata (jsonb), extracted_text_checksum, created_at.
  - Indexing: unique index on doc_sha256; optional GIN on source_metadata if searching metadata.
- chunk
  - Purpose: Deterministic chunks used for retrieval and embedding.
  - Key columns: id, document_id (fk), chunk_index, chunk_sha256 (unique), text_redacted, redaction_summary (jsonb), token_count_estimate, created_at.
  - Indexing: unique index on chunk_sha256; (document_id, chunk_index) for ordered reconstruction; optional GIN on redaction_summary.
- embedding
  - Purpose: Stores embedding vectors for chunks and model provenance.
  - Key columns: id, chunk_id (fk), embedding_model, dims, vector (pgvector), provider_request_id, created_at.
  - Indexing: vector index (HNSW or IVFFlat depending on pgvector version and configuration) on vector; unique (chunk_id, embedding_model).
- audit_log
  - Purpose: Append-only audit trail for ingestion transitions and security events.
  - Key columns: id, event_type, run_id, actor, at, details (jsonb).
  - Indexing: (run_id, at) and (event_type, at).
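Under these constraints, the chunk and embedding tables might look like the following DDL sketch (column types, the 1536-dim example, and index parameters are assumptions, not the shipped schema):

```sql
-- Sketch only: types and dims are illustrative.
CREATE TABLE chunk (
    id                   uuid PRIMARY KEY,
    document_id          uuid NOT NULL REFERENCES document(id),
    chunk_index          int  NOT NULL,
    chunk_sha256         text NOT NULL UNIQUE,
    text_redacted        text NOT NULL,
    redaction_summary    jsonb,
    token_count_estimate int,
    created_at           timestamptz NOT NULL DEFAULT now(),
    UNIQUE (document_id, chunk_index)
);

CREATE TABLE embedding (
    id                  uuid PRIMARY KEY,
    chunk_id            uuid NOT NULL REFERENCES chunk(id),
    embedding_model     text NOT NULL,
    dims                int  NOT NULL,
    vector              vector(1536), -- pgvector; dims must match the model
    provider_request_id text,
    created_at          timestamptz NOT NULL DEFAULT now(),
    UNIQUE (chunk_id, embedding_model)
);

-- HNSW is available in pgvector >= 0.5.0; use ivfflat on older versions.
CREATE INDEX ON embedding USING hnsw (vector vector_cosine_ops);
```

The `UNIQUE (chunk_id, embedding_model)` constraint is what makes per-chunk embedding idempotent under retries.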
5. API Surface
- POST /api/ingestions
  - Purpose: Create an ingestion run from an uploaded document (multipart) or from a provided URL (optional).
  - Auth: ROLE_USER
- GET /api/ingestions/{runId}
  - Purpose: Retrieve run status, failure reason, and processing summary (counts, dedup outcome).
  - Auth: ROLE_USER (must be owner) or ROLE_ADMIN
- GET /api/documents/{documentId}/chunks
  - Purpose: List chunk metadata (indexes, hashes, redaction stats). Redacted text may be omitted unless explicitly allowed.
  - Auth: ROLE_USER (owner) or ROLE_ADMIN
- POST /internal/worker/poll
  - Purpose: Internal worker endpoint to claim runnable ingestion tasks (if deployed as a separate worker). In single-service mode, the worker uses DB polling directly.
  - Auth: service-to-service token (ROLE_SERVICE)
- GET /admin/runs
  - Purpose: Operational list view of runs and statuses.
  - Auth: ROLE_ADMIN
- GET /admin/runs/{runId}/evidence
  - Purpose: Download/view evidence artifacts for a run (logs and generated reports).
  - Auth: ROLE_ADMIN
- GET /health
  - Purpose: Liveness/readiness check.
  - Auth: none (or protected behind a gateway in production)
6. Security Model
Authentication
- Local execution uses one of:
  - HTTP Basic (dev-only) for simplicity, or
  - JWT bearer tokens (recommended baseline even for local), with a small built-in issuer for development.
- Service-to-service authentication for internal worker operations uses a distinct token/credential with ROLE_SERVICE.
Authorization (roles)
- ROLE_USER: Create ingestion runs, view own runs and documents.
- ROLE_ADMIN: View all runs, access evidence artifacts, inspect redaction reports, manage policies.
- ROLE_SERVICE: Claim and execute jobs if separated into worker processes.
How paid access is enforced (if applicable)
- Enforced at the API boundary via a license/entitlement check middleware:
  - Validate tenant_id or account_id entitlement on ingestion creation.
  - Enforce per-account quotas (documents/day, max size, embedding calls/day).
  - Persist entitlement decisions in audit_log and run metadata for traceability.
- For local/demo deployments, entitlement can be disabled via config, but the enforcement hook remains in the request pipeline.
CSRF considerations
- For session-cookie based auth, CSRF protection is enabled for state-changing endpoints.
- For bearer token/JWT usage, CSRF is not applicable; endpoints require Authorization header and reject cookies for auth to avoid ambiguity.
Data isolation guarantees
- Multi-tenant isolation is implemented by scoping ingestion_run, document, and chunk to a tenant_id and enforcing it:
  - At the query layer (mandatory tenant filter).
  - Via database constraints and indexes that include tenant_id (recommended).
- Evidence artifacts are stored under ./evidence/{tenantId}/{runId}/ to prevent cross-tenant leakage at the filesystem layer.
7. Operational Behavior
Startup behavior
- On startup, the service:
  - Validates database connectivity and pgvector availability.
  - Verifies required configuration (embedding provider URL/model, redaction policy version).
  - Registers scheduled worker loops (if single-process mode) and exposes health endpoints.
  - Emits a startup record to logs and audit_log (event_type=SERVICE_START).
Failure modes
- Database unavailable: service fails readiness; ingestion creation returns 503; worker pauses and logs structured error.
- Embedding provider degraded: worker marks chunk embedding attempts and retries; run remains in EMBEDDING state until success or attempt budget exceeded.
- Extraction failure: run transitions to FAILED with EXTRACT_ERROR and persists extractor diagnostics.
- Redaction misconfiguration: run transitions to FAILED with REDACTION_ERROR and includes policy version and rule set hash.
Retry and timeout behavior
- Embedding calls: strict timeout (connect + read), retry on transient errors (429/5xx/timeouts) with exponential backoff, capped attempts per chunk.
- Job claiming: optimistic locking or “SELECT … FOR UPDATE SKIP LOCKED” to prevent double-processing.
- All retries recorded in DB (attempt_count, last_attempt_at) to support operator diagnosis.
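The job-claiming pattern above is commonly expressed as a single atomic statement (sketch; the intermediate 'CLAIMED' status is a hypothetical name, not one of this document's run states):

```sql
-- Atomically claim one runnable job. SKIP LOCKED lets concurrent workers
-- pass over rows another worker has already locked, so no two workers
-- process the same run.
UPDATE ingestion_run
SET status = 'CLAIMED', updated_at = now()
WHERE id = (
    SELECT id
    FROM ingestion_run
    WHERE status = 'CREATED'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
```

A worker that crashes mid-claim releases its row lock when its transaction aborts, so the job becomes claimable again without manual intervention.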
Observability hooks (logs, metrics, traces)
- Logs: structured JSON logs with runId, documentId, chunkId, and eventType.
- Metrics: counters for runs created/completed/failed, extraction failures, redaction matches by type, embedding call latency, retry counts.
- Traces: span-per-stage (extract, chunk, redact, embed, persist), with runId correlation propagated through worker execution.
8. Local Execution
Prerequisites
- Docker Desktop or Docker Engine + Docker Compose v2
- Java 17 (for running the Spring Boot app locally) or run everything via Docker
- curl
Environment variables (example)
- SPRING_PROFILES_ACTIVE=local
- DB_URL=jdbc:postgresql://localhost:5432/rag
- DB_USER=rag
- DB_PASS=rag
- EMBEDDING_BASE_URL=http://localhost:8089 (mock provider in compose) or your provider URL
- EMBEDDING_API_KEY=changeme (required if using an external provider)
- EMBEDDING_MODEL=text-embedding-3-large (example; configurable)
- PII_POLICY_VERSION=2026-01-01
- EVIDENCE_DIR=/evidence (container path, mapped to ./evidence)
Docker Compose usage
- Start dependencies (PostgreSQL + optional mock embedding service):
docker compose up -d postgres embedding-mock
- Start the application:
- Option A (local JVM):
./mvnw spring-boot:run
- Option B (containerized app):
docker compose up -d app
Verification steps
- Health:
curl -s http://localhost:8080/health
- Create an ingestion run (multipart upload):
curl -s -X POST http://localhost:8080/api/ingestions \
-H "Authorization: Bearer <TOKEN>" \
-F "file=@./samples/sample.pdf"
- Check run status:
curl -s http://localhost:8080/api/ingestions/<RUN_ID> \
-H "Authorization: Bearer <TOKEN>"
- Confirm database state (from host):
docker exec -it rag-postgres psql -U rag -d rag -c "select status, failure_reason from ingestion_run order by created_at desc limit 5;"
docker exec -it rag-postgres psql -U rag -d rag -c "select count(*) as documents from document;"
docker exec -it rag-postgres psql -U rag -d rag -c "select count(*) as chunks from chunk;"
docker exec -it rag-postgres psql -U rag -d rag -c "select count(*) as embeddings from embedding;"
9. Evidence Pack (MANDATORY)
Checklist of included evidence artifacts proving execution and correctness:
- Service startup logs (timestamped, includes DB connectivity and pgvector validation)
- Successful ingestion run creation log (includes runId and actor)
- Extraction stage logs for a real input document (includes detected content type and extracted character count)
- Chunking report artifact (JSON): chunk count, sizes, chunk hashes, deterministic parameters used
- Deduplication report artifact (JSON): document hash, chunk hashes, dedup decisions, unique-constraint outcomes
- PII redaction report artifact (JSON): policy version, detectors triggered, redaction counts by PII type, sample redaction markers
- Embedding call logs (provider request IDs, latency, retry attempts)
- Successful run completion log (includes documentId, chunk count, embedding count)
- Database verification queries and outputs (captured in a run-specific evidence file) showing:
  - ingestion_run row state transitions
  - document/chunk/embedding records created
  - unique constraints preventing duplicates on re-ingestion
- Error handling demonstration evidence:
  - One intentionally failing sample (unsupported/corrupt file) with the run transitioning to FAILED and a recorded failure_reason
  - One simulated embedding provider 429/timeout scenario with retries recorded and eventual success or terminal failure (depending on configuration)
10. Known Limitations
- This solution does not implement a full retrieval/query service; it focuses on ingestion, redaction, and embedding persistence.
- PII detection is rule/detector-based; it is not guaranteed to identify all sensitive data types in all languages or formats.
- OCR for image-only PDFs is not included; Tika extraction will be limited to embedded text unless OCR is added.
- Extremely large documents may require streaming extraction and chunking; the local deployment targets typical office/PDF sizes.
- Vector similarity search tuning (index parameters, recall/latency tradeoffs) is provided as configuration but not auto-optimized.
11. Extension Points
- Replace or augment the redaction engine:
  - Add jurisdiction-specific detectors, custom regexes, or ML-based PII classifiers.
  - Introduce “allow-list” semantics for safe fields and structured documents.
- Add OCR and richer extraction:
  - Integrate an OCR pipeline (e.g., Tesseract) behind a feature flag for scanned PDFs.
- Scale worker execution:
  - Split into separate API and worker services; use DB-backed job claiming and horizontal worker scaling.
  - Add a message queue (Kafka/SQS) for ingestion events if higher throughput is required.
- Improve retrieval readiness:
  - Add a retrieval API that performs vector similarity + metadata filtering, with tenant-aware constraints.
  - Implement chunk-level encryption at rest for redacted text where required.
- Production hardening:
  - Enforce tenant-scoped RLS (Row-Level Security) in PostgreSQL for stronger isolation.
  - Move evidence artifacts to an object store (e.g., S3/MinIO) with signed URLs and retention policies.
  - Integrate full OpenTelemetry export to a tracing backend and add SLO-based alerting.