LLM Eval Harness for Spring Boot: Golden Sets, Regression Tests, and CI Reports

A runnable evaluation harness that tests prompts/RAG outputs against golden datasets, computes metrics, and generates CI-friendly reports and evidence packs.

Verified v1.1.0 · RHEL 8/9 / Ubuntu / macOS / Windows (Docker) · Java 17 · Spring Boot 3.x · JUnit · PostgreSQL · Docker Compose
This solution includes runnable code bundles and full implementation details intended for production use.

1. Overview

This solution provides a runnable evaluation harness for LLM-backed features in Spring Boot (prompt-only and RAG-style outputs). It solves the core production problem: changes in prompts, retrieval logic, model versions, or tokenization behavior can silently degrade quality. Without automated evaluation, regressions are typically detected by user complaints, inconsistent support tickets, or ad-hoc manual spot checks that are neither repeatable nor attributable.

Common approaches fail in production for predictable reasons:

  • Tests are written as brittle string equality assertions, causing false failures on harmless formatting drift and missing semantic regressions.
  • Evaluations are executed manually or on developer machines, producing non-auditable results and no historical trendline.
  • Teams do not persist inputs/outputs with immutable run metadata, so it is impossible to reproduce a regression from CI logs.
  • Metrics are computed inconsistently across scripts, notebooks, or vendor dashboards, which complicates governance and release gating.

This implementation is production-ready because it:

  • Defines a first-class “evaluation run” domain model persisted in PostgreSQL (run metadata, dataset version, per-case results, and diffs).
  • Executes evaluations deterministically via a dedicated runner that supports retries, timeouts, and stable normalization rules.
  • Generates CI-friendly artifacts (JUnit XML, JSON, and HTML) and an evidence pack that proves execution and supports audit/repro.
  • Provides API endpoints to trigger runs, inspect results, and compare regressions across runs (baseline vs candidate).
  • Supports provider abstraction (OpenAI-compatible API, local model adapter, or mock) without changing evaluation semantics.

2. Architecture

Request flow and components:

  • CI Pipeline / Developer → (HTTP) → Eval Service API
  • Eval Service API → PostgreSQL (runs, cases, results, artifacts metadata)
  • Eval Service → LLM Provider (OpenAI-compatible API or adapter)
  • Eval Service → Report Generator (HTML/JSON/JUnit XML written to mounted volume)
  • Eval Service → Evidence Pack Builder (collects logs + diffs + report manifests)

Key components:

  • EvalController: Triggers runs, lists runs, exposes run details and comparisons.
  • EvalRunner: Orchestrates execution of an evaluation run, case iteration, scoring, retries, persistence.
  • DatasetResolver: Loads golden sets (versioned JSONL/CSV) from classpath or mounted directory; validates schema and hashes content.
  • ModelClient (provider abstraction): Calls external LLM provider or local adapter; supports request timeout and structured response capture.
  • ScoringEngine: Computes metrics per case and aggregates per run (exact match, normalized match, similarity, tool-call assertions, retrieval assertions).
  • DiffEngine: Compares candidate run vs baseline run to compute regression diffs and severity.
  • ReportService: Produces JSON summary, HTML report, and JUnit XML suitable for CI test reporting.
  • EvidenceService: Assembles evidence artifacts and a manifest to prove correctness and execution.
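The normalization step in the ScoringEngine is what separates a normalized match from brittle string equality. As a minimal sketch, assuming the normalization rules are trim, whitespace collapse, and case folding (the real profile makes these configurable; the class and method names here are illustrative):

```java
import java.util.Locale;

// Illustrative sketch of a normalized-match scorer. The concrete
// normalization rules (trim, whitespace collapse, case folding) are
// assumptions; a production scoring profile would make them configurable.
public class NormalizedMatchScorer {

    /** Canonicalize output so harmless formatting drift does not fail a case. */
    public static String normalize(String text) {
        return text == null ? "" : text.strip()
                .replaceAll("\\s+", " ")      // collapse runs of whitespace
                .toLowerCase(Locale.ROOT);    // case-insensitive comparison
    }

    /** Strict equality on raw text. */
    public static boolean exactMatch(String expected, String actual) {
        return expected != null && expected.equals(actual);
    }

    /** Equality after normalization: tolerant of formatting drift. */
    public static boolean normalizedMatch(String expected, String actual) {
        return normalize(expected).equals(normalize(actual));
    }
}
```

A case that fails exact match on trailing-newline or casing drift still passes the normalized metric, which is exactly the false-failure class called out in the Overview.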

External dependencies:

  • PostgreSQL (state and reproducibility)
  • Docker Compose (local execution)
  • Optional: OpenAI-compatible LLM endpoint (or local adapter container)

Trust boundaries:

  • Client/CI to Eval Service: authenticated and authorized.
  • Eval Service to PostgreSQL: internal network boundary (Compose network).
  • Eval Service to LLM Provider: outbound network boundary; treated as untrusted response source; captured verbatim for audit.

3. Key Design Decisions

Technology stack:

  • Spring Boot 3.x + Java 17: consistent with enterprise deployment standards, mature dependency ecosystem, straightforward operationalization.
  • JUnit-compatible reporting: integrates with common CI systems without custom plugins; enables gating via standard test result ingestion.
  • PostgreSQL: ensures evaluation history, run reproducibility, and diff comparisons are durable and queryable; avoids “reports-only” ephemeral runs.

Data storage model:

  • Persist immutable run records (inputs, dataset hash/version, model config snapshot, scoring config snapshot) and per-case results (raw output, normalized output, metrics, error details).
  • Store report files on a mounted filesystem volume and reference them via DB metadata (path, hash, size). This avoids bloating DB with HTML while retaining auditability.

Synchrony vs asynchrony:

  • Default execution is synchronous for small datasets (developer workflows) with a hard runtime bound (ESTIMATED_RUNTIME_SECONDS: 90).
  • The design supports asynchronous execution by persisting run state transitions (QUEUED/RUNNING/COMPLETED/FAILED) and allowing a background executor to pick up runs. Locally, a single-node executor runs in-process to keep deployment minimal.
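The run state machine above (QUEUED/RUNNING/COMPLETED/FAILED) can be sketched as an enum with an allowed-transition check; the transition table shown is an assumption derived from the states named in this section:

```java
import java.util.Map;
import java.util.Set;

// Sketch of the eval_run state machine. The allowed-transition table is an
// assumption based on the QUEUED/RUNNING/COMPLETED/FAILED states described
// in the design; terminal states permit no further transitions.
public enum RunState {
    QUEUED, RUNNING, COMPLETED, FAILED;

    private static final Map<RunState, Set<RunState>> ALLOWED = Map.of(
            QUEUED,    Set.of(RUNNING, FAILED),   // executor picks up, or hard failure
            RUNNING,   Set.of(COMPLETED, FAILED), // terminal outcomes
            COMPLETED, Set.of(),
            FAILED,    Set.of());

    public boolean canTransitionTo(RunState next) {
        return ALLOWED.get(this).contains(next);
    }
}
```

Persisting only legal transitions is what lets a background executor later replace the in-process runner without changing run semantics.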

Error handling and retries:

  • Per-case LLM invocation uses bounded retries with exponential backoff for transient failures (429/5xx/timeouts).
  • A run does not fail fast on a single case by default; it records the case failure and continues. The run is marked FAILED only if the failure ratio exceeds a configured threshold or if a hard dependency (dataset load, DB) fails.
  • Timeouts are enforced at HTTP client level and runner level (overall run budget).
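The bounded-retry behavior above can be sketched as follows; the `TransientProviderException` type and method names are illustrative assumptions, and a real client would classify 429/5xx/timeout responses into that transient category:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Sketch of bounded retries with exponential backoff and jitter for
// transient provider failures. TransientProviderException and the method
// signature are illustrative; HTTP 429/5xx/timeouts map to this category.
public class RetryingInvoker {

    public static class TransientProviderException extends RuntimeException {
        public TransientProviderException(String msg) { super(msg); }
    }

    public static <T> T invoke(Supplier<T> call, int maxAttempts, long baseDelayMs) {
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (TransientProviderException e) {
                if (attempt >= maxAttempts) throw e; // retries exhausted
                // Exponential backoff: base * 2^(attempt-1), plus random jitter.
                long backoff = baseDelayMs * (1L << (attempt - 1));
                long jitter = ThreadLocalRandom.current().nextLong(baseDelayMs + 1);
                try {
                    Thread.sleep(backoff + jitter);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new TransientProviderException("interrupted during backoff");
                }
            }
        }
    }
}
```

The attempt count reached here is what gets persisted in `eval_result.attempt_count`, making retry behavior visible in the evidence pack.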

Idempotency strategy:

  • Run creation supports an idempotency key (e.g., CI build ID + dataset version + model revision). If the same key is submitted, the API returns the existing run rather than creating duplicates.
  • Per-case execution writes results with a unique constraint (run_id, case_id) so reruns can resume safely if interrupted.
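A stable idempotency key can be derived from the components named above; the field ordering and SHA-256 hashing shown are illustrative assumptions (any stable, collision-resistant derivation works):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Sketch of deriving a stable idempotency key from CI build ID, dataset
// version, and model revision. The delimiter and SHA-256 choice are
// illustrative assumptions.
public class IdempotencyKeys {

    public static String derive(String ciBuildId, String datasetVersion, String modelRevision) {
        String material = ciBuildId + "|" + datasetVersion + "|" + modelRevision;
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(material.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest); // 64 hex chars, fits a unique index
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e); // never on a standard JVM
        }
    }
}
```

Because the key is deterministic, a retried CI step submits the same key and receives the existing run, satisfying the duplicate-suppression contract.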

4. Data Model

Core tables and purpose:

  • eval_dataset

    • Purpose: catalog datasets and their versions/hashes used for runs.
    • Key columns: id, name, version, content_hash, source_uri, created_at.
  • eval_run

    • Purpose: immutable record of a single evaluation execution.
    • Key columns: id, status, dataset_id, dataset_hash, model_provider, model_name, model_params_json, scoring_profile, idempotency_key, started_at, finished_at, summary_json, baseline_run_id (optional).
  • eval_case

    • Purpose: normalized representation of dataset test cases (optional if cases are embedded only in dataset files; recommended for indexing and search).
    • Key columns: id, dataset_id, external_key, input_json, expected_json, tags, created_at.
  • eval_result

    • Purpose: per-case output, metrics, and error details.
    • Key columns: id, run_id, case_id, attempt_count, raw_output_text, normalized_output_text, metrics_json, passed, error_type, error_message, latency_ms, token_usage_json, created_at.
  • eval_artifact

    • Purpose: report and evidence artifact inventory for audit.
    • Key columns: id, run_id, type (HTML_REPORT/JSON_SUMMARY/JUNIT_XML/DIFF/LOG_BUNDLE/MANIFEST), path, sha256, size_bytes, created_at.

Indexing strategy:

  • eval_run(idempotency_key) unique index for idempotent run creation.
  • eval_result(run_id, case_id) unique index to support resume and prevent duplicates.
  • eval_run(status, started_at) index for operational listing and cleanup jobs.
  • eval_case(dataset_id, external_key) index to locate cases quickly by stable keys.
  • Optional GIN index on eval_case(tags) and JSONB columns if tag-based filtering is used in APIs.

5. API Surface

  • POST /api/eval/runs – Create and execute an evaluation run (ROLE_EVAL)

    • Supports dataset name/version, baseline_run_id (optional), idempotency_key (optional), model config snapshot.
  • GET /api/eval/runs – List recent evaluation runs with status and summary (ROLE_EVAL)

  • GET /api/eval/runs/{id} – Get full run details, aggregated metrics, and artifact inventory (ROLE_EVAL)

  • GET /api/eval/runs/{id}/results – Paginated per-case results with metrics and errors (ROLE_EVAL)

  • GET /api/eval/runs/{id}/compare/{baselineId} – Regression diff report (ROLE_EVAL)

  • GET /api/eval/runs/{id}/artifacts/{type} – Download or stream a specific artifact by type (ROLE_EVAL)

  • GET /actuator/health – Service health (unauthenticated or ROLE_MONITOR depending on deployment posture)
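The compare endpoint's regression diff reduces, per case, to classifying the baseline/candidate pass-fail pair. A minimal sketch, where the category names mirror the "new failures / fixed failures / unchanged failures" taxonomy used elsewhere in this document (the enum itself is illustrative):

```java
// Sketch of per-case regression classification between a baseline and a
// candidate run. Category names mirror the harness's regression summary
// taxonomy; the enum and method are illustrative.
public class RegressionDiff {

    public enum Category { NEW_FAILURE, FIXED, STILL_FAILING, STILL_PASSING }

    public static Category classify(boolean baselinePassed, boolean candidatePassed) {
        if (baselinePassed && !candidatePassed) return Category.NEW_FAILURE;
        if (!baselinePassed && candidatePassed) return Category.FIXED;
        if (!baselinePassed) return Category.STILL_FAILING;
        return Category.STILL_PASSING;
    }
}
```

Aggregating these categories over all cases yields the severity counts reported in the regression summary.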

6. Security Model

Authentication:

  • Local/dev: HTTP Basic or static API key header (X-API-Key) configured via environment variables.
  • CI: API key with rotation policy; keys are not stored in source control and are injected via CI secrets.

Authorization (roles):

  • ROLE_EVAL: create runs, view runs, download artifacts.
  • ROLE_ADMIN: manage datasets (if dataset management endpoints are enabled), retention policies, and configuration overrides.
  • ROLE_MONITOR: access health/metrics endpoints (optional separation).

Paid access enforcement (if applicable):

  • Enforced via API key tiering: keys mapped to an “entitlement” record that defines allowed dataset namespaces, max cases per run, and concurrency limits.
  • Artifact downloads can be restricted by entitlement to prevent bulk export outside the paid plan.

CSRF considerations:

  • This service is API-first and expects non-browser clients (CI, CLI). CSRF protection is typically disabled for stateless token/key auth.
  • If an admin UI is later added, CSRF must be enabled for browser sessions while keeping API endpoints stateless.

Data isolation guarantees:

  • Multi-tenant mode (optional): tenant_id column added to eval_run, eval_dataset, eval_case, eval_result, eval_artifact with row-level filtering in repositories.
  • Single-tenant mode (default): isolation is by environment boundary (separate DB per deployment).

7. Operational Behavior

Startup behavior:

  • Validates DB connectivity and runs schema migrations (Flyway/Liquibase).
  • Loads dataset catalog (optional) and verifies the dataset directory is accessible if file-backed datasets are enabled.
  • Warms up HTTP client pools for the LLM provider if configured.

Failure modes:

  • PostgreSQL unavailable: the service fails fast on startup (health reports DOWN); if the database becomes unavailable at runtime, run creation is rejected with an explicit error.
  • LLM provider transient failures: per-case retries; failures recorded with error types; run completes with partial failures based on threshold policy.
  • Dataset schema invalid: run is rejected and marked FAILED with validation details; no partial execution.
  • Artifact write failures (disk full/permissions): run can complete but is marked FAILED if required artifacts cannot be persisted; logs include artifact write diagnostics.

Retry and timeout behavior:

  • LLM calls: bounded retries (configurable), exponential backoff with jitter, request timeout enforced by HTTP client.
  • Overall run timeout: runner enforces a hard deadline aligned to the configured run budget; remaining cases are marked skipped/timeout.
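The run-level deadline above can be sketched as a budget check before each case; the injected clock, statuses, and method shape are illustrative assumptions:

```java
// Sketch of the runner-level hard deadline: cases are evaluated until the
// run budget is exhausted, then the remainder is marked skipped/timeout.
// The injected clock and status names are illustrative assumptions.
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

public class RunDeadline {

    public enum CaseStatus { EVALUATED, SKIPPED_TIMEOUT }

    /** Walk the case list; once the deadline passes, skip the remainder. */
    public static List<CaseStatus> execute(int caseCount, long deadlineMillis,
                                           LongSupplier clock, Runnable evaluateOneCase) {
        List<CaseStatus> statuses = new ArrayList<>();
        for (int i = 0; i < caseCount; i++) {
            if (clock.getAsLong() >= deadlineMillis) {
                statuses.add(CaseStatus.SKIPPED_TIMEOUT); // budget exhausted
            } else {
                evaluateOneCase.run();
                statuses.add(CaseStatus.EVALUATED);
            }
        }
        return statuses;
    }
}
```

Recording an explicit SKIPPED_TIMEOUT status (rather than silently dropping cases) keeps run-level accounting honest in the reports.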

Observability hooks:

  • Structured logs with run_id, case_id, dataset_version, and idempotency_key.
  • Metrics: run duration, per-case latency, pass rate, error counts by type, token usage aggregates.
  • Traces (optional): spans around run execution, per-case evaluation, and external provider calls.
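Two of the metrics listed above, pass rate and error counts by type, can be sketched as a simple aggregation over per-case results; the `CaseResult` record is an illustrative stand-in for the persisted `eval_result` row:

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Sketch of run-level metric aggregation: pass rate and error counts by
// type. CaseResult is an illustrative stand-in for an eval_result row.
public class RunMetrics {

    public record CaseResult(boolean passed, Optional<String> errorType) {}

    public static double passRate(List<CaseResult> results) {
        if (results.isEmpty()) return 0.0;
        long passed = results.stream().filter(CaseResult::passed).count();
        return (double) passed / results.size();
    }

    public static Map<String, Long> errorCountsByType(List<CaseResult> results) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys for stable reporting
        for (CaseResult r : results) {
            r.errorType().ifPresent(t -> counts.merge(t, 1L, Long::sum));
        }
        return counts;
    }
}
```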

8. Local Execution

Prerequisites:

  • Docker Desktop or Docker Engine + Docker Compose v2
  • Java 17 (only required if running without Docker; recommended to run via Compose)
  • An OpenAI-compatible API endpoint (optional; mock mode supported)

Environment variables:

  • SPRING_PROFILES_ACTIVE=local

  • DB_URL=jdbc:postgresql://postgres:5432/eval

  • DB_USER=eval

  • DB_PASSWORD=eval

  • EVAL_DATASET_DIR=/data/datasets

  • ARTIFACT_DIR=/data/artifacts

  • AUTH_MODE=api-key

  • API_KEY_EVAL=<local-dev-key>

  • LLM (choose one):

    • Real provider:

      • LLM_BASE_URL=<https endpoint>
      • LLM_API_KEY=<key>
      • LLM_MODEL=<model name>
    • Mock provider:

      • LLM_PROVIDER=mock

Docker Compose usage:

docker compose up -d
docker compose logs -f app

Verification steps:

  1. Health check:
curl -s http://localhost:8080/actuator/health
  2. Trigger a run (mock provider example):
curl -s -X POST http://localhost:8080/api/eval/runs \
  -H "Content-Type: application/json" \
  -H "X-API-Key: <local-dev-key>" \
  -d '{
    "dataset": { "name": "sample-support-bot", "version": "v1" },
    "scoringProfile": "default",
    "idempotencyKey": "local-0001",
    "model": { "provider": "mock", "name": "mock-1" }
  }'
  3. List runs:
curl -s http://localhost:8080/api/eval/runs \
  -H "X-API-Key: <local-dev-key>"
  4. Fetch artifacts for a run (replace {id} with the run id returned in step 2):
curl -s http://localhost:8080/api/eval/runs/{id}/artifacts/JSON_SUMMARY \
  -H "X-API-Key: <local-dev-key>" > summary.json

9. Evidence Pack (MANDATORY)

Checklist of included artifacts proving execution and correctness:

  • Service startup logs showing DB migrations and dataset directory validation

  • Successful API invocation logs for run creation including returned run_id

  • Database records after execution:

    • eval_run persisted with status, dataset hash, model config snapshot, and summary metrics
    • eval_result rows for each evaluated case including per-case metrics and pass/fail
    • eval_artifact rows referencing generated reports and checksums
  • Evaluation reports:

    • HTML report summarizing metrics, per-case breakdown, and error taxonomy
    • JSON summary for machine ingestion (aggregates, config snapshot, environment details)
    • JUnit XML report for CI test reporting and gating
  • Regression diffs (when baseline comparison is used):

    • Per-case diff output for changed answers
    • Aggregated regression summary with severity counts (new failures, fixed failures, unchanged failures)
  • Error handling demonstration:

    • Captured artifacts for a forced transient provider failure (retry behavior visible in logs and per-case attempt counts)
    • Captured artifacts for a forced timeout case (timeout classification and run-level accounting)
  • Evidence manifest:

    • A manifest file enumerating all artifacts with SHA-256 checksums and generation timestamps
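The manifest step above can be sketched as follows; artifacts are passed as in-memory name/bytes pairs for illustration, whereas the real builder would read files from ARTIFACT_DIR, and the line format shown is an assumption:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Instant;
import java.util.HexFormat;
import java.util.Map;

// Sketch of the evidence manifest: one line per artifact with SHA-256
// checksum, size in bytes, and a generation timestamp. In-memory
// name->bytes pairs and the line format are illustrative assumptions.
public class EvidenceManifest {

    public static String sha256(byte[] content) {
        try {
            return HexFormat.of().formatHex(
                    MessageDigest.getInstance("SHA-256").digest(content));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    public static String build(Map<String, byte[]> artifacts, Instant generatedAt) {
        StringBuilder sb = new StringBuilder("# generated at ").append(generatedAt).append('\n');
        artifacts.forEach((name, bytes) -> sb.append(sha256(bytes))
                .append("  ").append(bytes.length)
                .append("  ").append(name).append('\n'));
        return sb.toString();
    }
}
```

Checksumming every artifact is what lets an auditor verify that a downloaded report matches the one produced by the recorded run.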

10. Known Limitations

  • This solution does not guarantee semantic correctness beyond the configured scoring functions; if the scoring profile is weak, regressions can be missed.
  • It does not attempt to solve dataset governance (labeling workflows, human review queues) beyond versioned dataset ingestion.
  • High-throughput parallel evaluation is not enabled by default; single-node execution is bounded for local/CI use.
  • Provider-side nondeterminism (temperature, model drift) can still introduce variance; the harness mitigates this via normalization and controlled parameters but cannot eliminate it entirely.
  • If report artifacts are stored only on local disk, artifact retention is limited to the lifetime and storage capacity of the node unless external object storage is configured.

11. Extension Points

  • Add an async executor and queue:

    • Replace in-process runner with a DB-backed job queue or Kafka-backed dispatch; keep the same eval_run state machine.
  • Expand scoring profiles:

    • Add embedding-based similarity, rubric-based scoring, citation checks for RAG, tool-call schema validation, and structured output contract tests.
  • Dataset management APIs:

    • Add endpoints for uploading datasets, signing versions, and enforcing schema and PII checks before activation.
  • Artifact storage:

    • Replace filesystem artifacts with object storage (S3/MinIO) and store only URIs + hashes in eval_artifact.
  • Production hardening:

    • Add rate limiting per API key, tenant isolation with tenant_id, and retention/compaction jobs for old runs and artifacts.
  • CI gating policy:

    • Implement policy-as-code thresholds (e.g., pass rate >= X, no new critical regressions) enforced during run finalization, emitting a non-zero exit code in a CLI wrapper or failing the CI step via JUnit results.
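The gating policy described above can be sketched as a small check evaluated at run finalization; the `Summary` fields and class names are illustrative, and the exit-code convention is the usual CLI contract (0 = gate passed):

```java
// Sketch of a policy-as-code gate evaluated at run finalization. The
// thresholds (minimum pass rate, zero new critical regressions) mirror the
// example in the text; Summary and its fields are illustrative.
public class GatePolicy {

    public record Summary(double passRate, int newCriticalRegressions) {}

    private final double minPassRate;

    public GatePolicy(double minPassRate) { this.minPassRate = minPassRate; }

    /** Exit code a CLI wrapper would emit: 0 = gate passed, 1 = fail the CI step. */
    public int exitCode(Summary summary) {
        boolean ok = summary.passRate() >= minPassRate
                && summary.newCriticalRegressions() == 0;
        return ok ? 0 : 1;
    }
}
```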
Changelog

  • 1.1.0 – current verified release.