LLM Eval Harness for Spring Boot: Golden Sets, Regression Tests, and CI Reports
A runnable evaluation harness that tests prompts/RAG outputs against golden datasets, computes metrics, and generates CI-friendly reports and evidence packs.
1. Overview
This solution provides a runnable evaluation harness for LLM-backed features in Spring Boot (prompt-only and RAG-style outputs). It solves the core production problem: changes in prompts, retrieval logic, model versions, or tokenization behavior can silently degrade quality. Without automated evaluation, regressions are typically detected by user complaints, inconsistent support tickets, or ad-hoc manual spot checks that are neither repeatable nor attributable.
Common approaches fail in production for predictable reasons:
- Tests are written as brittle string equality assertions, causing false failures on harmless formatting drift and missing semantic regressions.
- Evaluations are executed manually or on developer machines, producing non-auditable results and no historical trendline.
- Teams do not persist inputs/outputs with immutable run metadata, so it is impossible to reproduce a regression from CI logs.
- Metrics are computed inconsistently across scripts, notebooks, or vendor dashboards, which complicates governance and release gating.
This implementation is production-ready because it:
- Defines a first-class “evaluation run” domain model persisted in PostgreSQL (run metadata, dataset version, per-case results, and diffs).
- Executes evaluations deterministically via a dedicated runner that supports retries, timeouts, and stable normalization rules.
- Generates CI-friendly artifacts (JUnit XML, JSON, and HTML) and an evidence pack that proves execution and supports audit/repro.
- Provides API endpoints to trigger runs, inspect results, and compare regressions across runs (baseline vs candidate).
- Supports provider abstraction (OpenAI-compatible API, local model adapter, or mock) without changing evaluation semantics.
2. Architecture
Request flow and components:
- CI Pipeline / Developer → (HTTP) → Eval Service API
- Eval Service API → PostgreSQL (runs, cases, results, artifacts metadata)
- Eval Service → LLM Provider (OpenAI-compatible API or adapter)
- Eval Service → Report Generator (HTML/JSON/JUnit XML written to mounted volume)
- Eval Service → Evidence Pack Builder (collects logs + diffs + report manifests)
Key components:
- EvalController: Triggers runs, lists runs, exposes run details and comparisons.
- EvalRunner: Orchestrates execution of an evaluation run, case iteration, scoring, retries, persistence.
- DatasetResolver: Loads golden sets (versioned JSONL/CSV) from classpath or mounted directory; validates schema and hashes content.
- ModelClient (provider abstraction): Calls external LLM provider or local adapter; supports request timeout and structured response capture.
- ScoringEngine: Computes metrics per case and aggregates per run (exact match, normalized match, similarity, tool-call assertions, retrieval assertions).
- DiffEngine: Compares candidate run vs baseline run to compute regression diffs and severity.
- ReportService: Produces JSON summary, HTML report, and JUnit XML suitable for CI test reporting.
- EvidenceService: Assembles evidence artifacts and a manifest to prove correctness and execution.
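A minimal sketch of the normalization behind the ScoringEngine's normalized-match metric (class and method names here are illustrative, not the actual implementation): outputs are compared after collapsing whitespace, lowercasing, and stripping trailing punctuation, so harmless formatting drift does not fail a case.

```java
import java.util.Locale;

// Illustrative sketch of a normalized-match scorer. The exact normalization
// rules are configurable in the real ScoringEngine; these are assumptions.
final class NormalizedMatchScorer {

    // Stable normalization: trim, lowercase, collapse internal whitespace,
    // and drop trailing sentence punctuation.
    static String normalize(String text) {
        String s = text.trim().toLowerCase(Locale.ROOT);
        s = s.replaceAll("\\s+", " ");
        s = s.replaceAll("[.!?]+$", "");
        return s;
    }

    static boolean normalizedMatch(String expected, String actual) {
        return normalize(expected).equals(normalize(actual));
    }
}
```

This is why normalized match avoids the brittle-string-equality failure mode called out in the overview: `"Paris."` and `"  paris"` compare equal, while a genuinely different answer still fails.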
External dependencies:
- PostgreSQL (state and reproducibility)
- Docker Compose (local execution)
- Optional: OpenAI-compatible LLM endpoint (or local adapter container)
Trust boundaries:
- Client/CI to Eval Service: authenticated and authorized.
- Eval Service to PostgreSQL: internal network boundary (Compose network).
- Eval Service to LLM Provider: outbound network boundary; treated as untrusted response source; captured verbatim for audit.
3. Key Design Decisions
Technology stack:
- Spring Boot 3.x + Java 17: consistent with enterprise deployment standards, mature dependency ecosystem, straightforward operationalization.
- JUnit-compatible reporting: integrates with common CI systems without custom plugins; enables gating via standard test result ingestion.
- PostgreSQL: ensures evaluation history, run reproducibility, and diff comparisons are durable and queryable; avoids “reports-only” ephemeral runs.
Data storage model:
- Persist immutable run records (inputs, dataset hash/version, model config snapshot, scoring config snapshot) and per-case results (raw output, normalized output, metrics, error details).
- Store report files on a mounted filesystem volume and reference them via DB metadata (path, hash, size). This avoids bloating DB with HTML while retaining auditability.
Synchrony vs asynchrony:
- Default execution is synchronous for small datasets (developer workflows) with a hard runtime bound (`ESTIMATED_RUNTIME_SECONDS: 90`).
- The design supports asynchronous execution by persisting run state transitions (QUEUED/RUNNING/COMPLETED/FAILED) and allowing a background executor to pick up runs. Locally, a single-node executor runs in-process to keep deployment minimal.
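The run state machine above can be sketched as a small enum; the transition set shown is an assumption consistent with the statuses listed, not a verbatim copy of the implementation.

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch of the eval_run state machine: a background executor may only move
// a run along these edges; COMPLETED and FAILED are terminal.
enum RunStatus {
    QUEUED, RUNNING, COMPLETED, FAILED;

    Set<RunStatus> allowedNext() {
        switch (this) {
            case QUEUED:  return EnumSet.of(RUNNING, FAILED);
            case RUNNING: return EnumSet.of(COMPLETED, FAILED);
            default:      return EnumSet.noneOf(RunStatus.class); // terminal
        }
    }

    boolean canTransitionTo(RunStatus next) {
        return allowedNext().contains(next);
    }
}
```

Persisting these transitions is what lets the same model serve both the in-process executor and a future queue-backed one.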
Error handling and retries:
- Per-case LLM invocation uses bounded retries with exponential backoff for transient failures (429/5xx/timeouts).
- A run does not fail-fast on a single case by default; it records the case failure and continues, then marks the run as FAILED only if failure ratio exceeds a configured threshold or if a hard dependency (dataset load, DB) fails.
- Timeouts are enforced at HTTP client level and runner level (overall run budget).
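The retry behavior described above can be sketched as follows, assuming full-jitter exponential backoff (class name and limits are illustrative; the real configuration keys may differ):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Sketch of bounded retries with exponential backoff and jitter for
// transient provider failures (429/5xx/timeouts). Illustrative only.
final class RetryPolicy {
    private final int maxAttempts;
    private final long baseDelayMs;

    RetryPolicy(int maxAttempts, long baseDelayMs) {
        this.maxAttempts = maxAttempts;
        this.baseDelayMs = baseDelayMs;
    }

    // Full-jitter backoff: delay drawn uniformly from [0, base * 2^attempt).
    long delayForAttempt(int attempt) {
        long cap = baseDelayMs * (1L << attempt);
        return ThreadLocalRandom.current().nextLong(cap);
    }

    <T> T execute(Supplier<T> call) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) { // treated as transient here
                last = e;
                if (attempt == maxAttempts) break;
                try {
                    Thread.sleep(delayForAttempt(attempt));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted during backoff", ie);
                }
            }
        }
        throw last;
    }
}
```

In the harness, the attempt count ends up in `eval_result.attempt_count`, so retry behavior is visible in the evidence pack.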
Idempotency strategy:
- Run creation supports an idempotency key (e.g., CI build ID + dataset version + model revision). If the same key is submitted, the API returns the existing run rather than creating duplicates.
- Per-case execution writes results under a unique constraint on `(run_id, case_id)`, so reruns can resume safely if interrupted.
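One way to derive the idempotency key suggested above is to hash the CI build ID, dataset version, and model revision together; the composition below is a convention sketch, not part of the harness API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Illustrative derivation of a stable idempotency key. Submitting the same
// key twice returns the existing run instead of creating a duplicate.
final class IdempotencyKeys {
    static String forRun(String ciBuildId, String datasetVersion, String modelRevision) {
        String material = ciBuildId + "|" + datasetVersion + "|" + modelRevision;
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(material.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest); // 64 hex chars
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

Because the key is a pure function of build, dataset, and model identity, a retried CI step maps onto the same run row via the unique index on `eval_run(idempotency_key)`.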
4. Data Model
Core tables and purpose:
- `eval_dataset`
  - Purpose: catalog datasets and their versions/hashes used for runs.
  - Key columns: `id`, `name`, `version`, `content_hash`, `source_uri`, `created_at`.
- `eval_run`
  - Purpose: immutable record of a single evaluation execution.
  - Key columns: `id`, `status`, `dataset_id`, `dataset_hash`, `model_provider`, `model_name`, `model_params_json`, `scoring_profile`, `idempotency_key`, `started_at`, `finished_at`, `summary_json`, `baseline_run_id` (optional).
- `eval_case`
  - Purpose: normalized representation of dataset test cases (optional if cases are embedded only in dataset files; recommended for indexing and search).
  - Key columns: `id`, `dataset_id`, `external_key`, `input_json`, `expected_json`, `tags`, `created_at`.
- `eval_result`
  - Purpose: per-case output, metrics, and error details.
  - Key columns: `id`, `run_id`, `case_id`, `attempt_count`, `raw_output_text`, `normalized_output_text`, `metrics_json`, `passed`, `error_type`, `error_message`, `latency_ms`, `token_usage_json`, `created_at`.
- `eval_artifact`
  - Purpose: report and evidence artifact inventory for audit.
  - Key columns: `id`, `run_id`, `type` (HTML_REPORT/JSON_SUMMARY/JUNIT_XML/DIFF/LOG_BUNDLE/MANIFEST), `path`, `sha256`, `size_bytes`, `created_at`.

Indexing strategy:
- `eval_run(idempotency_key)` unique index for idempotent run creation.
- `eval_result(run_id, case_id)` unique index to support resume and prevent duplicates.
- `eval_run(status, started_at)` index for operational listing and cleanup jobs.
- `eval_case(dataset_id, external_key)` index to locate cases quickly by stable keys.
- Optional GIN index on `eval_case(tags)` and JSONB columns if tag-based filtering is used in APIs.
5. API Surface
- `POST /api/eval/runs` – Create and execute an evaluation run (ROLE_EVAL)
  - Supports dataset name/version, `baseline_run_id` (optional), `idempotency_key` (optional), and a model config snapshot.
- `GET /api/eval/runs` – List recent evaluation runs with status and summary (ROLE_EVAL)
- `GET /api/eval/runs/{id}` – Get full run details, aggregated metrics, and artifact inventory (ROLE_EVAL)
- `GET /api/eval/runs/{id}/results` – Paginated per-case results with metrics and errors (ROLE_EVAL)
- `GET /api/eval/runs/{id}/compare/{baselineId}` – Regression diff report (ROLE_EVAL)
- `GET /api/eval/runs/{id}/artifacts/{type}` – Download or stream a specific artifact by type (ROLE_EVAL)
- `GET /actuator/health` – Service health (unauthenticated or ROLE_MONITOR, depending on deployment posture)
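For reference, the run-creation payload can be modeled with plain records; the field names below mirror the curl example in the Local Execution section, though the real DTOs may differ.

```java
// Illustrative shape of the POST /api/eval/runs request body.
// Field names are assumptions taken from the example payload, not the
// service's actual DTO classes.
record CreateRunRequest(
        DatasetRef dataset,
        String scoringProfile,
        String idempotencyKey,
        ModelRef model) {

    record DatasetRef(String name, String version) {}

    record ModelRef(String provider, String name) {}
}
```

Keeping the dataset and model references as nested objects makes the config snapshot in `eval_run` a straightforward serialization of the request.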
6. Security Model
Authentication:
- Local/dev: HTTP Basic or a static API key header (`X-API-Key`) configured via environment variables.
- CI: API key with a rotation policy; keys are not stored in source control and are injected via CI secrets.
Authorization (roles):
- `ROLE_EVAL`: create runs, view runs, download artifacts.
- `ROLE_ADMIN`: manage datasets (if dataset management endpoints are enabled), retention policies, and configuration overrides.
- `ROLE_MONITOR`: access health/metrics endpoints (optional separation).
Paid access enforcement (if applicable):
- Enforced via API key tiering: keys mapped to an “entitlement” record that defines allowed dataset namespaces, max cases per run, and concurrency limits.
- Artifact downloads can be restricted by entitlement to prevent bulk export outside the paid plan.
CSRF considerations:
- This service is API-first and expects non-browser clients (CI, CLI). CSRF protection is typically disabled for stateless token/key auth.
- If an admin UI is later added, CSRF must be enabled for browser sessions while keeping API endpoints stateless.
Data isolation guarantees:
- Multi-tenant mode (optional): a `tenant_id` column added to `eval_run`, `eval_dataset`, `eval_case`, `eval_result`, and `eval_artifact`, with row-level filtering in repositories.
- Single-tenant mode (default): isolation by environment boundary (a separate DB per deployment).
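A sketch of the API-key check implied by the `X-API-Key` scheme above, using a constant-time comparison to avoid timing side channels (the class name is illustrative; a real deployment would wire this into a Spring Security filter):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Illustrative API-key authenticator. MessageDigest.isEqual performs a
// time-constant comparison, so key length/prefix is not leaked via timing.
final class ApiKeyAuthenticator {
    private final byte[] expected;

    ApiKeyAuthenticator(String configuredKey) {
        this.expected = configuredKey.getBytes(StandardCharsets.UTF_8);
    }

    boolean authenticate(String presentedKey) {
        if (presentedKey == null) return false;
        return MessageDigest.isEqual(
                expected, presentedKey.getBytes(StandardCharsets.UTF_8));
    }
}
```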
7. Operational Behavior
Startup behavior:
- Validates DB connectivity and runs schema migrations (Flyway/Liquibase).
- Loads dataset catalog (optional) and verifies the dataset directory is accessible if file-backed datasets are enabled.
- Warms up HTTP client pools for the LLM provider if configured.
Failure modes:
- PostgreSQL unavailable: the service fails fast on startup (health down) or rejects run creation with an explicit error if the DB becomes unavailable at runtime.
- LLM provider transient failures: per-case retries; failures recorded with error types; run completes with partial failures based on threshold policy.
- Dataset schema invalid: run is rejected and marked FAILED with validation details; no partial execution.
- Artifact write failures (disk full/permissions): run can complete but is marked FAILED if required artifacts cannot be persisted; logs include artifact write diagnostics.
Retry and timeout behavior:
- LLM calls: bounded retries (configurable), exponential backoff with jitter, request timeout enforced by HTTP client.
- Overall run timeout: runner enforces a hard deadline aligned to the configured run budget; remaining cases are marked skipped/timeout.
Observability hooks:
- Structured logs with `run_id`, `case_id`, `dataset_version`, and `idempotency_key`.
- Metrics: run duration, per-case latency, pass rate, error counts by type, token usage aggregates.
- Traces (optional): spans around run execution, per-case evaluation, and external provider calls.
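The pass-rate and error-count metrics above could be aggregated per run along these lines (class and method names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative per-run metrics accumulator: pass rate plus error counts
// keyed by error type, updated as each case completes.
final class RunMetrics {
    private int total;
    private int passed;
    private final Map<String, Integer> errorsByType = new HashMap<>();

    void record(boolean casePassed, String errorType) {
        total++;
        if (casePassed) passed++;
        else if (errorType != null) errorsByType.merge(errorType, 1, Integer::sum);
    }

    double passRate() { return total == 0 ? 0.0 : (double) passed / total; }

    Map<String, Integer> errorsByType() { return errorsByType; }
}
```

The same aggregate feeds both the JSON summary artifact and the run-finalization gating check.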
8. Local Execution
Prerequisites:
- Docker Desktop or Docker Engine + Docker Compose v2
- Java 17 (only required if running without Docker; recommended to run via Compose)
- An OpenAI-compatible API endpoint (optional; mock mode supported)
Environment variables:
- `SPRING_PROFILES_ACTIVE=local`
- `DB_URL=jdbc:postgresql://postgres:5432/eval`
- `DB_USER=eval`
- `DB_PASSWORD=eval`
- `EVAL_DATASET_DIR=/data/datasets`
- `ARTIFACT_DIR=/data/artifacts`
- `AUTH_MODE=api-key`
- `API_KEY_EVAL=<local-dev-key>`
- LLM (choose one):
  - Real provider: `LLM_BASE_URL=<https endpoint>`, `LLM_API_KEY=<key>`, `LLM_MODEL=<model name>`
  - Mock provider: `LLM_PROVIDER=mock`

Docker Compose usage:
- `docker compose up -d`
- `docker compose logs -f app`
Verification steps:
- Health check:
curl -s http://localhost:8080/actuator/health
- Trigger a run (mock provider example):
curl -s -X POST http://localhost:8080/api/eval/runs \
-H "Content-Type: application/json" \
-H "X-API-Key: <local-dev-key>" \
-d '{
"dataset": { "name": "sample-support-bot", "version": "v1" },
"scoringProfile": "default",
"idempotencyKey": "local-0001",
"model": { "provider": "mock", "name": "mock-1" }
}'
- List runs:
curl -s http://localhost:8080/api/eval/runs \
-H "X-API-Key: <local-dev-key>"
- Fetch artifacts for a run (replace `{id}`):
curl -s http://localhost:8080/api/eval/runs/{id}/artifacts/JSON_SUMMARY \
-H "X-API-Key: <local-dev-key>" > summary.json
9. Evidence Pack (MANDATORY)
Checklist of included artifacts proving execution and correctness:
- Service startup logs showing DB migrations and dataset directory validation
- Successful API invocation logs for run creation, including the returned `run_id`
- Database records after execution:
  - `eval_run` persisted with status, dataset hash, model config snapshot, and summary metrics
  - `eval_result` rows for each evaluated case, including per-case metrics and pass/fail
  - `eval_artifact` rows referencing generated reports and checksums
- Evaluation reports:
  - HTML report summarizing metrics, per-case breakdown, and error taxonomy
  - JSON summary for machine ingestion (aggregates, config snapshot, environment details)
  - JUnit XML report for CI test reporting and gating
- Regression diffs (when baseline comparison is used):
  - Per-case diff output for changed answers
  - Aggregated regression summary with severity counts (new failures, fixed failures, unchanged failures)
- Error handling demonstration:
  - Captured artifacts for a forced transient provider failure (retry behavior visible in logs and per-case attempt counts)
  - Captured artifacts for a forced timeout case (timeout classification and run-level accounting)
- Evidence manifest:
  - A manifest file enumerating all artifacts with SHA-256 checksums and generation timestamps
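A sketch of how a manifest entry with SHA-256 checksum and generation timestamp might be produced (the line format and class name are illustrative; in the real service the bytes come from the artifact file on disk):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Instant;
import java.util.HexFormat;

// Illustrative evidence-manifest helper: one line per artifact with its
// path, SHA-256 checksum, and generation timestamp.
final class EvidenceManifest {
    static String sha256Hex(byte[] content) {
        try {
            return HexFormat.of().formatHex(
                    MessageDigest.getInstance("SHA-256").digest(content));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    // One manifest line; the exact format is an assumption.
    static String entry(String path, byte[] content, Instant generatedAt) {
        return path + " sha256=" + sha256Hex(content)
                + " generated_at=" + generatedAt;
    }
}
```

Recomputing the checksum of a downloaded artifact and comparing it against the manifest (and the `eval_artifact.sha256` column) is what makes a run auditable after the fact.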
10. Known Limitations
- This solution does not guarantee semantic correctness beyond the configured scoring functions; if the scoring profile is weak, regressions can be missed.
- It does not attempt to solve dataset governance (labeling workflows, human review queues) beyond versioned dataset ingestion.
- High-throughput parallel evaluation is not enabled by default; single-node execution is bounded for local/CI use.
- Provider-side nondeterminism (temperature, model drift) can still introduce variance; the harness mitigates this via normalization and controlled parameters but cannot eliminate it entirely.
- If report artifacts are stored only on local disk, artifact retention is limited to the lifetime and storage capacity of the node unless external object storage is configured.
11. Extension Points
- Add an async executor and queue:
  - Replace the in-process runner with a DB-backed job queue or Kafka-backed dispatch; keep the same `eval_run` state machine.
- Expand scoring profiles:
  - Add embedding-based similarity, rubric-based scoring, citation checks for RAG, tool-call schema validation, and structured output contract tests.
- Dataset management APIs:
  - Add endpoints for uploading datasets, signing versions, and enforcing schema and PII checks before activation.
- Artifact storage:
  - Replace filesystem artifacts with object storage (S3/MinIO) and store only URIs + hashes in `eval_artifact`.
- Production hardening:
  - Add rate limiting per API key, tenant isolation with `tenant_id`, and retention/compaction jobs for old runs and artifacts.
- CI gating policy:
  - Implement policy-as-code thresholds (e.g., pass rate >= X, no new critical regressions) enforced during run finalization, emitting a non-zero exit code in a CLI wrapper or failing the CI step via JUnit results.
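The gating policy described above reduces to a small check; the thresholds, field names, and class name below are illustrative assumptions:

```java
// Illustrative policy-as-code gate: fail the CI step if the pass rate
// drops below a threshold or new critical regressions appear.
final class GatingPolicy {
    private final double minPassRate;
    private final int maxNewCriticalRegressions;

    GatingPolicy(double minPassRate, int maxNewCriticalRegressions) {
        this.minPassRate = minPassRate;
        this.maxNewCriticalRegressions = maxNewCriticalRegressions;
    }

    // Exit code a CLI wrapper would emit: 0 = gate passed, 1 = gate failed.
    int exitCode(double passRate, int newCriticalRegressions) {
        boolean ok = passRate >= minPassRate
                && newCriticalRegressions <= maxNewCriticalRegressions;
        return ok ? 0 : 1;
    }
}
```

Evaluating the gate during run finalization keeps the policy in one place, whether the result surfaces as a process exit code or as a failing JUnit test case.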
Version 1.1.0
- Solution write-up + runnable implementation
- Evidence images (when published)
- Code bundle downloads (when enabled)