LLM Eval Harness for Spring Boot: Golden Sets, Regression Tests, and CI Reports
A runnable evaluation harness that tests prompts/RAG outputs against golden datasets, computes metrics, and generates CI-friendly reports and evidence packs.
1. Overview
This solution provides a runnable evaluation harness for LLM-backed features in Spring Boot (prompt-only and RAG-style outputs). It solves the core production problem: changes in prompts, retrieval logic, model versions, or tokenization behavior can silently degrade quality. Without automated evaluation, regressions are typically detected by user complaints, inconsistent support tickets, or ad-hoc manual spot checks that are neither repeatable nor attributable.
Common approaches fail in production for predictable reasons:
- Tests are written as brittle string equality assertions, causing false failures on harmless formatting drift and missing semantic regressions.
- Evaluations are executed manually or on developer machines, producing non-auditable results and no historical trendline.
- Teams do not persist inputs/outputs with immutable run metadata, so it is impossible to reproduce a regression from CI logs.
- Metrics are computed inconsistently across scripts, notebooks, or vendor dashboards, which complicates governance and release gating.
This implementation is production-ready because it:
- Defines a first-class “evaluation run” domain model persisted in PostgreSQL (run metadata, dataset version, per-case results, and diffs).
- Executes evaluations deterministically via a dedicated runner that supports retries, timeouts, and stable normalization rules.
- Generates CI-friendly artifacts (JUnit XML, JSON, and HTML) and an evidence pack that proves execution and supports audit/repro.
- Provides API endpoints to trigger runs, inspect results, and compare regressions across runs (baseline vs candidate).
- Supports provider abstraction (OpenAI-compatible API, local model adapter, or mock) without changing evaluation semantics.
2. Architecture
Request flow and components:
- CI Pipeline / Developer → (HTTP) → Eval Service API
- Eval Service API → PostgreSQL (runs, cases, results, artifacts metadata)
- Eval Service → LLM Provider (OpenAI-compatible API or adapter)
- Eval Service → Report Generator (HTML/JSON/JUnit XML written to mounted volume)
- Eval Service → Evidence Pack Builder (collects logs + diffs + report manifests)
Key components:
- EvalController: Triggers runs, lists runs, exposes run details and comparisons.
- EvalRunner: Orchestrates execution of an evaluation run, case iteration, scoring, retries, persistence.
- DatasetResolver: Loads golden sets (versioned JSONL/CSV) from classpath or mounted directory; validates schema and hashes content.
- ModelClient (provider abstraction): Calls external LLM provider or local adapter; supports request timeout and structured response capture.
- ScoringEngine: Computes metrics per case and aggregates per run (exact match, normalized match, similarity, tool-call assertions, retrieval assertions).
- DiffEngine: Compares candidate run vs baseline run to compute regression diffs and severity.
- ReportService: Produces JSON summary, HTML report, and JUnit XML suitable for CI test reporting.
- EvidenceService: Assembles evidence artifacts and a manifest to prove correctness and execution.
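A minimal sketch of the normalization behind the ScoringEngine's normalized-match metric (class and method names here are illustrative, not the actual implementation): outputs are compared after collapsing whitespace, lowercasing, and stripping trailing punctuation, so harmless formatting drift does not fail a case.

```java
import java.util.Locale;

// Illustrative sketch of a normalized-match scorer. The exact normalization
// rules are configurable in the real ScoringEngine; these are assumptions.
final class NormalizedMatchScorer {

    // Stable normalization: trim, lowercase, collapse internal whitespace,
    // and drop trailing sentence punctuation.
    static String normalize(String text) {
        String s = text.trim().toLowerCase(Locale.ROOT);
        s = s.replaceAll("\\s+", " ");
        s = s.replaceAll("[.!?]+$", "");
        return s;
    }

    static boolean normalizedMatch(String expected, String actual) {
        return normalize(expected).equals(normalize(actual));
    }
}
```

This is why normalized match avoids the brittle-string-equality failure mode called out in the overview: `"Paris."` and `"  paris"` compare equal, while a genuinely different answer still fails.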
External dependencies:
- PostgreSQL (state and reproducibility)
- Docker Compose (local execution)
- Optional: OpenAI-compatible LLM endpoint (or local adapter container)
Trust boundaries:
- Client/CI to Eval Service: authenticated and authorized.
- Eval Service to PostgreSQL: internal network boundary (Compose network).
- Eval Service to LLM Provider: outbound network boundary; treated as untrusted response source; captured verbatim for audit.
3. Key Design Decisions
Technology stack:
- Spring Boot 3.x + Java 17: consistent with enterprise deployment standards, mature dependency ecosystem, straightforward operationalization.
- JUnit-compatible reporting: integrates with common CI systems without custom plugins; enables gating via standard test result ingestion.
- PostgreSQL: ensures evaluation history, run reproducibility, and diff comparisons are durable and queryable; avoids “reports-only” ephemeral runs.
Data storage model:
- Persist immutable run records (inputs, dataset hash/version, model config snapshot, scoring config snapshot) and per-case results (raw output, normalized output, metrics, error details).
- Store report files on a mounted filesystem volume and reference them via DB metadata (path, hash, size). This avoids bloating DB with HTML while retaining auditability.
Synchrony vs asynchrony:
- Default execution is synchronous for small datasets (developer workflows) with a hard runtime bound (`ESTIMATED_RUNTIME_SECONDS: 90`).
- The design supports asynchronous execution by persisting run state transitions (QUEUED/RUNNING/COMPLETED/FAILED) and allowing a background executor to pick up runs. Locally, a single-node executor runs in-process to keep deployment minimal.
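The run state machine above can be sketched as a small enum; the transition set shown is an assumption consistent with the statuses listed, not a verbatim copy of the implementation.

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch of the eval_run state machine: a background executor may only move
// a run along these edges; COMPLETED and FAILED are terminal.
enum RunStatus {
    QUEUED, RUNNING, COMPLETED, FAILED;

    Set<RunStatus> allowedNext() {
        switch (this) {
            case QUEUED:  return EnumSet.of(RUNNING, FAILED);
            case RUNNING: return EnumSet.of(COMPLETED, FAILED);
            default:      return EnumSet.noneOf(RunStatus.class); // terminal
        }
    }

    boolean canTransitionTo(RunStatus next) {
        return allowedNext().contains(next);
    }
}
```

Persisting these transitions is what lets the same model serve both the in-process executor and a future queue-backed one.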
Error handling and retries:
- Per-case LLM invocation uses bounded retries with exponential backoff for transient failures (429/5xx/timeouts).
- A run does not fail-fast on a single case by default; it records the case failure and continues, then marks the run as FAILED only if failure ratio exceeds a configured threshold or if a hard dependency (dataset load, DB) fails.
- Timeouts are enforced at HTTP client level and runner level (overall run budget).
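The retry behavior described above can be sketched as follows, assuming full-jitter exponential backoff (class name and limits are illustrative; the real configuration keys may differ):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Sketch of bounded retries with exponential backoff and jitter for
// transient provider failures (429/5xx/timeouts). Illustrative only.
final class RetryPolicy {
    private final int maxAttempts;
    private final long baseDelayMs;

    RetryPolicy(int maxAttempts, long baseDelayMs) {
        this.maxAttempts = maxAttempts;
        this.baseDelayMs = baseDelayMs;
    }

    // Full-jitter backoff: delay drawn uniformly from [0, base * 2^attempt).
    long delayForAttempt(int attempt) {
        long cap = baseDelayMs * (1L << attempt);
        return ThreadLocalRandom.current().nextLong(cap);
    }

    <T> T execute(Supplier<T> call) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) { // treated as transient here
                last = e;
                if (attempt == maxAttempts) break;
                try {
                    Thread.sleep(delayForAttempt(attempt));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted during backoff", ie);
                }
            }
        }
        throw last;
    }
}
```

In the harness, the attempt count ends up in `eval_result.attempt_count`, so retry behavior is visible in the evidence pack.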
Idempotency strategy:
- Run creation supports an idempotency key (e.g., CI build ID + dataset version + model revision). If the same key is submitted, the API returns the existing run rather than creating duplicates.
- Per-case execution writes results under a unique constraint on `(run_id, case_id)`, so reruns can resume safely if interrupted.
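One way to derive the idempotency key suggested above is to hash the CI build ID, dataset version, and model revision together; the composition below is a convention sketch, not part of the harness API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Illustrative derivation of a stable idempotency key. Submitting the same
// key twice returns the existing run instead of creating a duplicate.
final class IdempotencyKeys {
    static String forRun(String ciBuildId, String datasetVersion, String modelRevision) {
        String material = ciBuildId + "|" + datasetVersion + "|" + modelRevision;
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(material.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest); // 64 hex chars
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

Because the key is a pure function of build, dataset, and model identity, a retried CI step maps onto the same run row via the unique index on `eval_run(idempotency_key)`.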
4. Data Model
Core tables and purpose:
- `eval_dataset`
  - Purpose: catalog datasets and their versions/hashes used for runs.
  - Key columns: `id`, `name`, `version`, `content_hash`, `source_uri`, `created_at`.
- `eval_run`
  - Purpose: immutable record of a single evaluation execution.
  - Key columns: `id`, `status`, `dataset_id`, `dataset_hash`, `model_provider`, `model_name`, `model_params_json`, `scoring_profile`, `idempotency_key`, `started_at`, `finished_at`, `summary_json`, `baseline_run_id` (optional).
- `eval_case`
  - Purpose: normalized representation of dataset test cases (optional if cases are embedded only in dataset files; recommended for indexing and search).
  - Key columns: `id`, `dataset_id`, `external_key`, `input_json`, `expected_json`, `tags`, `created_at`.
- `eval_result`
  - Purpose: per-case output, metrics, and error details.
  - Key columns: `id`, `run_id`, `case_id`, `attempt_count`, `raw_output_text`, `normalized_output_text`, `metrics_json`, `passed`, `error_type`, `error_message`, `latency_ms`, `token_usage_json`, `created_at`.
- `eval_artifact`
  - Purpose: report and evidence artifact inventory for audit.
  - Key columns: `id`, `run_id`, `type` (HTML_REPORT/JSON_SUMMARY/JUNIT_XML/DIFF/LOG_BUNDLE/MANIFEST), `path`, `sha256`, `size_bytes`, `created_at`.

Indexing strategy:
- `eval_run(idempotency_key)` unique index for idempotent run creation.
- `eval_result(run_id, case_id)` unique index to support resume and prevent duplicates.
- `eval_run(status, started_at)` index for operational listing and cleanup jobs.
- `eval_case(dataset_id, external_key)` index to locate cases quickly by stable keys.
- Optional GIN index on `eval_case(tags)` and JSONB columns if tag-based filtering is used in APIs.
5. API Surface
- `POST /api/eval/runs` – Create and execute an evaluation run (ROLE_EVAL)
  - Supports dataset name/version, `baseline_run_id` (optional), `idempotency_key` (optional), and a model config snapshot.
- `GET /api/eval/runs` – List recent evaluation runs with status and summary (ROLE_EVAL)
- `GET /api/eval/runs/{id}` – Get full run details, aggregated metrics, and artifact inventory (ROLE_EVAL)
- `GET /api/eval/runs/{id}/results` – Paginated per-case results with metrics and errors (ROLE_EVAL)
- `GET /api/eval/runs/{id}/compare/{baselineId}` – Regression diff report (ROLE_EVAL)
- `GET /api/eval/runs/{id}/artifacts/{type}` – Download or stream a specific artifact by type (ROLE_EVAL)
- `GET /actuator/health` – Service health (unauthenticated or ROLE_MONITOR, depending on deployment posture)
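For reference, the run-creation payload can be modeled with plain records; the field names below mirror the curl example in the Local Execution section, though the real DTOs may differ.

```java
// Illustrative shape of the POST /api/eval/runs request body.
// Field names are assumptions taken from the example payload, not the
// service's actual DTO classes.
record CreateRunRequest(
        DatasetRef dataset,
        String scoringProfile,
        String idempotencyKey,
        ModelRef model) {

    record DatasetRef(String name, String version) {}

    record ModelRef(String provider, String name) {}
}
```

Keeping the dataset and model references as nested objects makes the config snapshot in `eval_run` a straightforward serialization of the request.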
6. Security Model
Authentication:
- Local/dev: HTTP Basic or a static API key header (`X-API-Key`) configured via environment variables.
- CI: API key with a rotation policy; keys are not stored in source control and are injected via CI secrets.
Authorization (roles):
- `ROLE_EVAL`: create runs, view runs, download artifacts.
- `ROLE_ADMIN`: manage datasets (if dataset management endpoints are enabled), retention policies, and configuration overrides.
- `ROLE_MONITOR`: access health/metrics endpoints (optional separation).
Paid access enforcement (if applicable):
- Enforced via API key tiering: keys mapped to an “entitlement” record that defines allowed dataset namespaces, max cases per run, and concurrency limits.
- Artifact downloads can be restricted by entitlement to prevent bulk export outside the paid plan.
CSRF considerations:
- This service is API-first and expects non-browser clients (CI, CLI). CSRF protection is typically disabled for stateless token/key auth.
- If an admin UI is later added, CSRF must be enabled for browser sessions while keeping API endpoints stateless.
Data isolation guarantees:
- Multi-tenant mode (optional): a `tenant_id` column added to `eval_run`, `eval_dataset`, `eval_case`, `eval_result`, and `eval_artifact`, with row-level filtering in repositories.
- Single-tenant mode (default): isolation by environment boundary (a separate DB per deployment).
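A sketch of the API-key check implied by the `X-API-Key` scheme above, using a constant-time comparison to avoid timing side channels (the class name is illustrative; a real deployment would wire this into a Spring Security filter):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Illustrative API-key authenticator. MessageDigest.isEqual performs a
// time-constant comparison, so key length/prefix is not leaked via timing.
final class ApiKeyAuthenticator {
    private final byte[] expected;

    ApiKeyAuthenticator(String configuredKey) {
        this.expected = configuredKey.getBytes(StandardCharsets.UTF_8);
    }

    boolean authenticate(String presentedKey) {
        if (presentedKey == null) return false;
        return MessageDigest.isEqual(
                expected, presentedKey.getBytes(StandardCharsets.UTF_8));
    }
}
```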
7. Operational Behavior
Startup behavior:
- Validates DB connectivity and runs schema migrations (Flyway/Liquibase).
- Loads dataset catalog (optional) and verifies the dataset directory is accessible if file-backed datasets are enabled.
- Warms up HTTP client pools for the LLM provider if configured.
Failure modes:
- PostgreSQL unavailable: the service fails fast on startup (health down) or rejects run creation with an explicit error if the DB becomes unavailable at runtime.
- LLM provider transient failures: per-case retries; failures recorded with error types; run completes with partial failures based on threshold policy.
- Dataset schema invalid: run is rejected and marked FAILED with validation details; no partial execution.
- Artifact write failures (disk full/permissions): run can complete but is marked FAILED if required artifacts cannot be persisted; logs include artifact write diagnostics.
Retry and timeout behavior:
- LLM calls: bounded retries (configurable), exponential backoff with jitter, request timeout enforced by HTTP client.
- Overall run timeout: runner enforces a hard deadline aligned to the configured run budget; remaining cases are marked skipped/timeout.
Observability hooks:
- Structured logs with `run_id`, `case_id`, `dataset_version`, and `idempotency_key`.
- Metrics: run duration, per-case latency, pass rate, error counts by type, token usage aggregates.
- Traces (optional): spans around run execution, per-case evaluation, and external provider calls.
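The pass-rate and error-count metrics above could be aggregated per run along these lines (class and method names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative per-run metrics accumulator: pass rate plus error counts
// keyed by error type, updated as each case completes.
final class RunMetrics {
    private int total;
    private int passed;
    private final Map<String, Integer> errorsByType = new HashMap<>();

    void record(boolean casePassed, String errorType) {
        total++;
        if (casePassed) passed++;
        else if (errorType != null) errorsByType.merge(errorType, 1, Integer::sum);
    }

    double passRate() { return total == 0 ? 0.0 : (double) passed / total; }

    Map<String, Integer> errorsByType() { return errorsByType; }
}
```

The same aggregate feeds both the JSON summary artifact and the run-finalization gating check.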
8. Local Execution
Prerequisites:
- Docker Desktop or Docker Engine + Docker Compose v2
- Java 17 (only required if running without Docker; recommended to run via Compose)
- An OpenAI-compatible API endpoint (optional; mock mode supported)
Environment variables:
- `SPRING_PROFILES_ACTIVE=local`
- `DB_URL=jdbc:postgresql://postgres:5432/eval`
- `DB_USER=eval`
- `DB_PASSWORD=eval`
- `EVAL_DATASET_DIR=/data/datasets`
- `ARTIFACT_DIR=/data/artifacts`
- `AUTH_MODE=api-key`
- `API_KEY_EVAL=<local-dev-key>`
- LLM (choose one):
  - Real provider: `LLM_BASE_URL=<https endpoint>`, `LLM_API_KEY=<key>`, `LLM_MODEL=<model name>`
  - Mock provider: `LLM_PROVIDER=mock`

Docker Compose usage:
- `docker compose up -d`
- `docker compose logs -f app`
Verification steps:
- Health check:
curl -s http://localhost:8080/actuator/health
- Trigger a run (mock provider example):
curl -s -X POST http://localhost:8080/api/eval/runs \
-H "Content-Type: application/json" \
-H "X-API-Key: <local-dev-key>" \
-d '{
"dataset": { "name": "sample-support-bot", "version": "v1" },
"scoringProfile": "default",
"idempotencyKey": "local-0001",
"model": { "provider": "mock", "name": "mock-1" }
}'
- List runs:
curl -s http://localhost:8080/api/eval/runs \
-H "X-API-Key: <local-dev-key>"
- Fetch artifacts for a run (replace `{id}`):
curl -s http://localhost:8080/api/eval/runs/{id}/artifacts/JSON_SUMMARY \
-H "X-API-Key: <local-dev-key>" > summary.json
9. Evidence Pack (MANDATORY)
Checklist of included artifacts proving execution and correctness:
- Service startup logs showing DB migrations and dataset directory validation
- Successful API invocation logs for run creation, including the returned `run_id`
- Database records after execution:
  - `eval_run` persisted with status, dataset hash, model config snapshot, and summary metrics
  - `eval_result` rows for each evaluated case, including per-case metrics and pass/fail
  - `eval_artifact` rows referencing generated reports and checksums
- Evaluation reports:
  - HTML report summarizing metrics, per-case breakdown, and error taxonomy
  - JSON summary for machine ingestion (aggregates, config snapshot, environment details)
  - JUnit XML report for CI test reporting and gating
- Regression diffs (when baseline comparison is used):
  - Per-case diff output for changed answers
  - Aggregated regression summary with severity counts (new failures, fixed failures, unchanged failures)
- Error handling demonstration:
  - Captured artifacts for a forced transient provider failure (retry behavior visible in logs and per-case attempt counts)
  - Captured artifacts for a forced timeout case (timeout classification and run-level accounting)
- Evidence manifest:
  - A manifest file enumerating all artifacts with SHA-256 checksums and generation timestamps
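A sketch of how a manifest entry with SHA-256 checksum and generation timestamp might be produced (the line format and class name are illustrative; in the real service the bytes come from the artifact file on disk):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Instant;
import java.util.HexFormat;

// Illustrative evidence-manifest helper: one line per artifact with its
// path, SHA-256 checksum, and generation timestamp.
final class EvidenceManifest {
    static String sha256Hex(byte[] content) {
        try {
            return HexFormat.of().formatHex(
                    MessageDigest.getInstance("SHA-256").digest(content));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    // One manifest line; the exact format is an assumption.
    static String entry(String path, byte[] content, Instant generatedAt) {
        return path + " sha256=" + sha256Hex(content)
                + " generated_at=" + generatedAt;
    }
}
```

Recomputing the checksum of a downloaded artifact and comparing it against the manifest (and the `eval_artifact.sha256` column) is what makes a run auditable after the fact.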
10. Known Limitations
- This solution does not guarantee semantic correctness beyond the configured scoring functions; if the scoring profile is weak, regressions can be missed.
- It does not attempt to solve dataset governance (labeling workflows, human review queues) beyond versioned dataset ingestion.
- High-throughput parallel evaluation is not enabled by default; single-node execution is bounded for local/CI use.
- Provider-side nondeterminism (temperature, model drift) can still introduce variance; the harness mitigates this via normalization and controlled parameters but cannot eliminate it entirely.
- If report artifacts are stored only on local disk, artifact retention is limited to the lifetime and storage capacity of the node unless external object storage is configured.
11. Extension Points
- Add an async executor and queue:
  - Replace the in-process runner with a DB-backed job queue or Kafka-backed dispatch; keep the same `eval_run` state machine.
- Expand scoring profiles:
  - Add embedding-based similarity, rubric-based scoring, citation checks for RAG, tool-call schema validation, and structured output contract tests.
- Dataset management APIs:
  - Add endpoints for uploading datasets, signing versions, and enforcing schema and PII checks before activation.
- Artifact storage:
  - Replace filesystem artifacts with object storage (S3/MinIO) and store only URIs + hashes in `eval_artifact`.
- Production hardening:
  - Add rate limiting per API key, tenant isolation with `tenant_id`, and retention/compaction jobs for old runs and artifacts.
- CI gating policy:
  - Implement policy-as-code thresholds (e.g., pass rate >= X, no new critical regressions) enforced during run finalization, emitting a non-zero exit code in a CLI wrapper or failing the CI step via JUnit results.
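The gating policy described above reduces to a small check; the thresholds, field names, and class name below are illustrative assumptions:

```java
// Illustrative policy-as-code gate: fail the CI step if the pass rate
// drops below a threshold or new critical regressions appear.
final class GatingPolicy {
    private final double minPassRate;
    private final int maxNewCriticalRegressions;

    GatingPolicy(double minPassRate, int maxNewCriticalRegressions) {
        this.minPassRate = minPassRate;
        this.maxNewCriticalRegressions = maxNewCriticalRegressions;
    }

    // Exit code a CLI wrapper would emit: 0 = gate passed, 1 = gate failed.
    int exitCode(double passRate, int newCriticalRegressions) {
        boolean ok = passRate >= minPassRate
                && newCriticalRegressions <= maxNewCriticalRegressions;
        return ok ? 0 : 1;
    }
}
```

Evaluating the gate during run finalization keeps the policy in one place, whether the result surfaces as a process exit code or as a failing JUnit test case.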
Version 1.1.0
- Solution write-up + runnable implementation
- Evidence images (when published)
- Code bundle downloads (when enabled)