Prompt Versioning & A/B Testing in Spring Boot: Canary Deployment, Statistical Significance, and Safe Rollout

A runnable prompt management service that versions prompt templates in PostgreSQL, routes live traffic between control and candidate variants using weighted canary splits, evaluates outputs with deterministic and LLM-as-judge scorers, and auto-promotes or auto-rolls-back based on statistical significance.

Verified v1.0.0 · RHEL 8/9 / Ubuntu / macOS / Windows (Docker) · Java 17 · Spring Boot 3.3.x · Spring AI 1.0.x · PostgreSQL 16 · Flyway · Docker Compose

1. Overview

This solution implements a production-ready prompt versioning and A/B testing system in Spring Boot. Prompt templates are stored as immutable versioned records in PostgreSQL. An experiment configuration assigns weighted traffic splits between a control variant and one or more candidates. Every request is assigned to a variant, the assigned prompt is rendered and executed via Spring AI's ChatClient, the response is scored asynchronously, and the resulting metrics accumulate until a rollout decision can be made with statistical confidence.

The problem it solves is straightforward: once a team starts iterating on LLM prompts, naive approaches become operationally unsafe and scientifically invalid. Typical failure patterns include:

  • Prompts managed in code or environment variables: every prompt change requires a full application redeployment, coupling prompt iteration velocity to release cadence.
  • Manual side-by-side comparison: engineers copy-paste outputs into spreadsheets, compare anecdotally, and promote based on intuition rather than data.
  • No user-level consistency: the same user sees different prompt versions across requests, invalidating A/B test results and degrading user experience.
  • Statistical errors: tests are stopped too early when a "trend" appears, or run on samples too small to detect the true effect size, producing false positives that push regressions to production.
  • No rollback mechanism: when a new prompt underperforms, the team must redeploy the previous version from source control, causing minutes-to-hours of degraded experience.
  • Evaluation subjectivity: output quality is assessed by whoever is available, with no consistent scoring rubric or quantitative record.

Existing approaches often fail in production because they treat prompt management as a configuration concern rather than an experimentation discipline. When prompt changes go wrong — and they will — the recovery path is a deployment, not a database update.

This implementation is production-ready because it treats prompt versions as first-class durable artifacts:

  • Every prompt version is an immutable, timestamped record. You can never lose a prompt version.
  • Traffic splitting is enforced at the request router layer with sticky session assignment: a given sessionId always maps to the same variant for the duration of an experiment.
  • Statistical significance is computed continuously against a configurable p-value threshold. Promotion does not happen until the sample size justifies the decision.
  • Auto-rollback fires when error rate or safety violation rate spikes above a configurable threshold, independent of the significance check.
  • All decisions — assignments, evaluations, promotions, rollbacks — are written to an append-only audit log with timestamps and actor identity.

2. Architecture

Request flow and dependencies:

  • Client → Spring Boot REST API (POST /api/complete)
  • REST API → Prompt Router (reads active experiment from DB, assigns variant via weighted random + sticky hash)
  • Prompt Router → Prompt Registry (PostgreSQL, reads versioned template for assigned variant)
  • Prompt Router → Variant Assignment Log (PostgreSQL INSERT, records which variant this session received)
  • Prompt Template + User Input → LLM Executor (ChatClient, renders and calls LLM)
  • LLM Response → Evaluation Pipeline (deterministic scorers + async LLM-as-judge)
  • Evaluation results → Metric Event Store (PostgreSQL INSERT, records scores per variant per request)
  • Metric Event Store → Significance Checker (scheduled job, aggregates win-rates, computes p-value)
  • Significance Checker → Rollout Controller (applies auto-promote or auto-rollback decision)
  • Rollout Controller → Prompt Registry (UPDATE experiment weight / status)
  • Rollout Controller → Audit Log (append-only, records every decision with rationale)
  • REST API → Admin endpoints (manual promote/rollback/pause, requires ROLE_ADMIN)

Key components:

  • PromptRouter: stateless Spring @Component that resolves the active experiment for a tenant, applies weighted random assignment, enforces sticky session mapping via a hash of sessionId, and returns the resolved prompt version ID.
  • PromptRegistry: Spring Data JPA repository over the prompt_version and ab_experiment tables. All writes are insert-only (no UPDATE on prompt content). Experiment metadata (weights, status) is mutable.
  • LlmExecutor: thin wrapper over ChatClient that renders the Mustache/Handlebars prompt template with request variables and records token count, latency, and cost per call.
  • EvaluationPipeline: executes deterministic scorers (JSON schema validation, regex assertion, latency gate) synchronously and submits LLM-as-judge scoring as a CompletableFuture to a bounded thread pool.
  • SignificanceChecker: @Scheduled job (configurable interval, default 5 minutes) that queries aggregated metric events per variant, runs a two-proportion z-test for win-rate comparison, and emits a promotion/rollback signal when thresholds are met.
  • RolloutController: executes the signal from the SignificanceChecker, or from admin override, as a DB transaction updating ab_experiment.control_variant_id and inserting an audit event.
  • AuditLog: append-only experiment_event table. Never updated; only inserted.

Trust boundaries:

  • Inbound boundary: tenant ID and session ID extracted from JWT claims, validated before routing. A tenant can only read their own experiments and prompt versions.
  • LLM boundary: rendered prompts are sent to the LLM; raw user input is never injected without the template wrapper. This prevents prompt injection from user-controlled content.
  • Evaluation boundary: LLM-as-judge calls use a separate, lower-cost model and a fixed system prompt. The judge model is isolated from the user session.
  • Admin boundary: promote/rollback/pause endpoints require ROLE_ADMIN. Automated rollout controller uses a dedicated service account with only UPDATE permission on ab_experiment.

3. Key Design Decisions

Immutable prompt versions

Prompt content is never updated after INSERT. If a prompt needs correction, a new version is created and the experiment is updated to point to the new version. This guarantees:

  • Every metric event can be traced back to the exact prompt text that produced it.
  • Rollback is instant: update the experiment's control_variant_id to a previous version ID. No redeployment.
  • Audit trail is complete: the full history of what was served to whom and when is reconstructable from the DB.

The alternative — storing prompts in code or environment variables — makes this reconstruction impossible and ties prompt rollback to deployment pipelines.

Sticky session assignment

Users must consistently see the same variant throughout an experiment. Without stickiness, a user could receive variant A for one turn and variant B for the next in a multi-turn conversation, corrupting the experiment. Stickiness is enforced by computing hash(sessionId) % total_weight and comparing the result to the variant weight boundaries. This is deterministic and requires no persistent session state — the same hash always produces the same assignment for a given experiment configuration.
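The assignment rule can be sketched in plain Java. This is an illustrative sketch, not the solution's actual router code: SHA-256 is assumed here as a stable, well-distributed hash, though any stable hash works.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch of deterministic sticky assignment (names are hypothetical).
// hash(sessionId) % totalWeight lands in [0, totalWeight); values below
// controlWeight map to control, the rest to the candidate.
public class StickyAssigner {

    /** Stable, platform-independent hash of the session id. */
    static int stableHash(String sessionId) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                    .digest(sessionId.getBytes(StandardCharsets.UTF_8));
            // Fold the first four bytes into a non-negative int.
            int h = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
                  | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
            return h & 0x7FFFFFFF;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns "CONTROL" or "CANDIDATE" for a 90/10-style split. */
    static String assign(String sessionId, int controlWeight, int candidateWeight) {
        int bucket = stableHash(sessionId) % (controlWeight + candidateWeight);
        return bucket < controlWeight ? "CONTROL" : "CANDIDATE";
    }
}
```

Note that re-weighting an experiment moves the bucket boundaries, which is exactly why the variant_assignment table records assignments durably as well.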

Why a two-proportion z-test, not a t-test

The primary metric in most prompt A/B tests is a win-rate: the proportion of responses that meet a quality threshold (judge score ≥ threshold). Because each observation is binary (win or loss per request), the two-proportion z-test is the correct choice. A t-test is appropriate for continuous metrics (latency, token count). The system supports both; the SignificanceChecker selects the test based on the metric declared in ab_experiment.primary_metric.
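The pooled two-proportion z-test is small enough to sketch directly. This is a self-contained illustration, not the solution's code; the polynomial normal-CDF approximation stands in for a statistics library.

```java
// Pooled two-proportion z-test, as used for win-rate comparison.
public class TwoProportionZTest {

    /** Two-sided p-value for H0: p1 == p2, using the pooled z statistic. */
    static double pValue(int wins1, int n1, int wins2, int n2) {
        double p1 = (double) wins1 / n1;
        double p2 = (double) wins2 / n2;
        double pooled = (double) (wins1 + wins2) / (n1 + n2);
        double se = Math.sqrt(pooled * (1 - pooled) * (1.0 / n1 + 1.0 / n2));
        double z = (p2 - p1) / se;
        return 2 * (1 - stdNormalCdf(Math.abs(z)));
    }

    /** Standard normal CDF via the Zelen–Severo polynomial approximation (|error| < 7.5e-8). */
    static double stdNormalCdf(double x) {
        double t = 1 / (1 + 0.2316419 * Math.abs(x));
        double d = 0.3989422804014327 * Math.exp(-x * x / 2);
        double p = d * t * (0.319381530 + t * (-0.356563782
                + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
        return x >= 0 ? 1 - p : p;
    }
}
```

For example, 100/200 wins vs. 140/200 wins yields z ≈ 4.08 and p ≈ 4.5e-5, comfortably under the default 0.05 threshold.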

Technology stack

  • PostgreSQL: required for transactional consistency between variant assignment and metric recording, and for the append-only audit table with row-level security. SQLite is insufficient for concurrent writes under load.
  • Flyway: all schema changes are versioned migrations. The schema for prompt_version, ab_experiment, variant_assignment, metric_event, and experiment_event is reproducible from scratch in any environment.
  • Spring AI ChatClient: provides a model-agnostic execution layer. Switching from OpenAI to Claude requires only a dependency swap and two configuration properties — no changes to prompt rendering or evaluation logic.
  • Mustache templating: prompt templates use {{variable}} syntax. Mustache is intentionally logic-less, preventing template injection and keeping prompt logic in the prompt, not in Java code.
  • @Scheduled significance checker: a lightweight alternative to a full workflow engine. The 5-minute polling interval is sufficient for experiments that run for hours or days. For sub-minute decisions, replace with a PostgreSQL LISTEN/NOTIFY trigger.

Evaluation strategy

Three tiers of evaluation run in sequence:

  1. Deterministic (synchronous): JSON schema validation, required keyword presence, latency < threshold. Fast, cheap, always run. Failed deterministic checks increment error_count immediately and may trigger auto-rollback.
  2. LLM-as-judge (asynchronous): a secondary LLM rates the response on relevance, faithfulness, and safety on a 1–5 scale. Runs in a bounded thread pool with a 10-second timeout. Results write to metric_event when complete.
  3. Human feedback (event-driven): thumbs-up/down signals from the client write a metric_event of type HUMAN_FEEDBACK. These are weighted at 3× in the win-rate calculation.
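The 3× weighting in tier 3 amounts to counting each human-feedback event three times in both the numerator and the denominator of the win-rate. A minimal sketch (MetricEvent here is a simplified stand-in for the metric_event row, not the actual entity):

```java
import java.util.List;

// Sketch of the weighted win-rate aggregation: each event counts once,
// except HUMAN_FEEDBACK events, which count 3x on both sides of the ratio.
public class WinRate {

    record MetricEvent(String eventType, double score) {}

    static double weightedWinRate(List<MetricEvent> events, double winThreshold) {
        double wins = 0, total = 0;
        for (MetricEvent e : events) {
            double w = "HUMAN_FEEDBACK".equals(e.eventType()) ? 3.0 : 1.0;
            total += w;
            if (e.score() >= winThreshold) wins += w;
        }
        return total == 0 ? 0.0 : wins / total;
    }
}
```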

4. Data Model

Core tables

prompt_version — immutable prompt catalog

id              UUID PRIMARY KEY
tenant_id       VARCHAR NOT NULL
name            VARCHAR NOT NULL          -- human label, e.g. "customer-support-v1.3"
template        TEXT NOT NULL             -- Mustache template content
model_hint      VARCHAR                   -- preferred model, e.g. "gpt-4o-mini"
variables       JSONB                     -- declared variable names + types
created_by      VARCHAR NOT NULL
created_at      TIMESTAMPTZ DEFAULT now()
-- NO update columns: content is immutable after INSERT

ab_experiment — experiment configuration (mutable metadata only)

id                  UUID PRIMARY KEY
tenant_id           VARCHAR NOT NULL
name                VARCHAR NOT NULL
status              VARCHAR NOT NULL      -- DRAFT | ACTIVE | PAUSED | CONCLUDED
control_variant_id  UUID REFERENCES prompt_version(id)
candidate_variant_id UUID REFERENCES prompt_version(id)
control_weight      INT NOT NULL DEFAULT 90   -- out of 100
candidate_weight    INT NOT NULL DEFAULT 10
primary_metric      VARCHAR NOT NULL      -- WIN_RATE | AVG_LATENCY | AVG_SCORE
significance_threshold NUMERIC DEFAULT 0.05
min_sample_size     INT DEFAULT 200
auto_promote        BOOLEAN DEFAULT false
auto_rollback_error_rate NUMERIC DEFAULT 0.05
started_at          TIMESTAMPTZ
concluded_at        TIMESTAMPTZ

variant_assignment — sticky session log (insert-only)

id              UUID PRIMARY KEY
experiment_id   UUID REFERENCES ab_experiment(id)
session_id      VARCHAR NOT NULL
variant_id      UUID REFERENCES prompt_version(id)
assigned_at     TIMESTAMPTZ DEFAULT now()
UNIQUE(experiment_id, session_id)         -- enforces stickiness at DB level

metric_event — per-request evaluation record (insert-only)

id              UUID PRIMARY KEY
experiment_id   UUID REFERENCES ab_experiment(id)
variant_id      UUID REFERENCES prompt_version(id)
session_id      VARCHAR NOT NULL
event_type      VARCHAR NOT NULL          -- DETERMINISTIC | LLM_JUDGE | HUMAN_FEEDBACK
score           NUMERIC                   -- 0.0–1.0 normalised
latency_ms      INT
token_count     INT
cost_usd        NUMERIC(10,6)
is_error        BOOLEAN DEFAULT false
is_safety_flag  BOOLEAN DEFAULT false
metadata        JSONB
created_at      TIMESTAMPTZ DEFAULT now()

experiment_event — audit log (append-only, never updated)

id              UUID PRIMARY KEY
experiment_id   UUID REFERENCES ab_experiment(id)
event_type      VARCHAR NOT NULL    -- STARTED | PROMOTED | ROLLED_BACK | PAUSED | MANUAL_OVERRIDE
actor           VARCHAR NOT NULL    -- "system:significance-checker" or "admin:username"
rationale       TEXT                -- p-value, win-rates, or human note
snapshot        JSONB               -- full experiment state at decision time
created_at      TIMESTAMPTZ DEFAULT now()

Indexing strategy

  • variant_assignment(experiment_id, session_id) — unique index enforces DB-level stickiness.
  • metric_event(experiment_id, variant_id, created_at DESC) — hot path for the significance checker aggregation query.
  • metric_event(experiment_id, is_error) — partial index for auto-rollback error rate query.
  • experiment_event(experiment_id, created_at DESC) — audit log retrieval.
  • ab_experiment(tenant_id, status) — experiment listing per tenant.

5. API Surface

Completion endpoint (user-facing)

  • POST /api/complete — submit a user input; receives AI response with prompt variant transparently applied. Request: { "sessionId": "...", "tenantId": "...", "input": "..." }. Response: { "output": "...", "variantId": "...", "latencyMs": ... }. (ROLE_USER)

Prompt management (admin)

  • POST /api/prompts — create a new prompt version; returns versionId. (ROLE_ADMIN)
  • GET /api/prompts/{id} — retrieve prompt version content and metadata. (ROLE_ADMIN)
  • GET /api/prompts?tenantId={t} — list all versions for a tenant. (ROLE_ADMIN)

Experiment management (admin)

  • POST /api/experiments — create an experiment, specifying control + candidate version IDs and weights. (ROLE_ADMIN)
  • GET /api/experiments/{id} — current experiment status, weights, and live metrics summary. (ROLE_ADMIN)
  • POST /api/experiments/{id}/start — activate the experiment (status: DRAFT → ACTIVE). (ROLE_ADMIN)
  • POST /api/experiments/{id}/pause — pause traffic split (status: ACTIVE → PAUSED). (ROLE_ADMIN)
  • POST /api/experiments/{id}/promote — manually promote candidate to control. (ROLE_ADMIN)
  • POST /api/experiments/{id}/rollback — manually revert to control. (ROLE_ADMIN)

Metrics and audit

  • GET /api/experiments/{id}/metrics — aggregated win-rate, latency, cost, and significance per variant. (ROLE_ADMIN)
  • GET /api/experiments/{id}/audit — paginated audit event log for the experiment. (ROLE_ADMIN)

Human feedback

  • POST /api/feedback — submit thumbs-up/down for a session. Request: { "sessionId": "...", "positive": true }. (ROLE_USER)

6. Security Model

Authentication

Spring Security with stateless JWT bearer token authentication. Tokens carry tenantId, userId, and roles claims. All endpoints validate tenant scope — a ROLE_ADMIN user from tenant A cannot read or modify experiments for tenant B.

Authorization

  • ROLE_USER: call /api/complete, submit /api/feedback.
  • ROLE_ADMIN: all prompt and experiment management, metrics, and audit endpoints.
  • ROLE_SYSTEM: used by the SignificanceChecker scheduled job for its internal promote/rollback writes. Not assignable to human users.

Prompt injection prevention

User input is injected into the prompt template only through the Mustache rendering context. Template variables are declared in prompt_version.variables (JSONB). Any variable not declared is silently ignored by the renderer. Raw user content cannot escape the template context to modify the system prompt or inject new instructions.
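The containment property can be illustrated with a minimal logic-less renderer. This is a sketch only — the actual implementation uses a Mustache library — but it shows why undeclared variables vanish and why user-supplied values are never re-parsed as template syntax.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal illustration of template containment (illustrative, not production code).
public class TemplateRenderer {

    private static final Pattern VAR = Pattern.compile("\\{\\{(\\w+)}}");

    /** Substitutes only declared variables; anything else renders empty. */
    static String render(String template, Set<String> declared, Map<String, String> vars) {
        Matcher m = VAR.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(1);
            // Undeclared variables are dropped, so a request cannot smuggle in
            // extra placeholders. Values are inserted as literal text in a
            // single pass and never re-scanned for template syntax.
            String value = declared.contains(name) ? vars.getOrDefault(name, "") : "";
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

A user value containing "{{secret}}" stays literal in the output, and a "{{secret}}" placeholder in the template renders empty unless "secret" is declared.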

Data isolation

Every table with user-associated data includes tenant_id. All queries include a tenant predicate derived from the JWT claim, not from the request body. Row-level security (PostgreSQL RLS) enforces this at the database layer as a defense-in-depth measure.


7. Operational Behavior

Startup behavior

On startup, the service runs Flyway migrations, validates that all ACTIVE experiments have matching prompt_version records, and emits a startup log summarising the count of active experiments per tenant. The SignificanceChecker scheduler starts after a configurable warm-up delay (default 60 seconds) to avoid triggering on pre-loaded metric data from the previous run.

Failure modes

  • DB unavailable: fail fast on startup. Health endpoint reports DOWN. No requests served.
  • LLM provider error: the LlmExecutor records an is_error=true metric event. If error rate for a variant exceeds auto_rollback_error_rate within a rolling 10-minute window, the SignificanceChecker triggers auto-rollback on the next tick, regardless of significance.
  • Evaluation timeout: LLM-as-judge calls that exceed 10 seconds are abandoned; the metric event is written with event_type=LLM_JUDGE and score=null. Null scores are excluded from win-rate calculations but counted in sample size.
  • Significance checker crash: the scheduler is Spring-managed and restarts automatically. The last written metric events are durable; no data is lost.
  • Sticky assignment conflict: if two concurrent requests for the same session arrive before the first assignment is committed, the UNIQUE(experiment_id, session_id) constraint causes the second INSERT to fail with a constraint violation. The handler catches this, re-reads the committed assignment, and proceeds. The first assignment wins.
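The first-wins recovery in the last bullet can be illustrated in memory. The real mechanism is the DB unique constraint plus a catch-and-re-read in the handler; ConcurrentHashMap.putIfAbsent has the same first-writer-wins semantics, so it serves as an honest analogy (this class is illustrative, not part of the solution):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory analogy for the "first assignment wins" rule: the loser of the
// race adopts the committed assignment instead of failing the request.
public class AssignmentStore {

    private final Map<String, String> assignments = new ConcurrentHashMap<>();

    /** Attempts to record an assignment; returns whichever assignment won. */
    String assignOnce(String sessionId, String proposedVariantId) {
        String existing = assignments.putIfAbsent(sessionId, proposedVariantId);
        return existing != null ? existing : proposedVariantId;
    }
}
```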

Significance checker algorithm

Run every 5 minutes (configurable):

  1. For each ACTIVE experiment with auto_promote=true or auto_rollback_error_rate set:
  2. Query metric_event grouped by variant_id: count total events, sum scores, sum is_error.
  3. Check auto-rollback: if candidate error_rate > auto_rollback_error_rate AND candidate_sample >= 20 → trigger rollback.
  4. Check sample size: if either variant has < min_sample_size events → skip significance test, continue accumulating.
  5. Compute two-proportion z-test on win-rate (score ≥ 0.5): z-statistic, p-value.
  6. If p-value < significance_threshold AND candidate_win_rate > control_win_rate AND auto_promote=true → promote.
  7. Write experiment_event with full snapshot (sample sizes, p-value, win-rates).
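
The gate ordering above can be sketched as plain decision logic (names are illustrative; the p-value is assumed to be computed upstream, e.g. by the z-test in step 5):

```java
// Sketch of the checker's decision ordering. Rollback is checked before
// significance on purpose: a broken candidate should be pulled even while
// the quality comparison is still inconclusive.
public class RolloutDecision {

    enum Signal { ROLLBACK, KEEP_RUNNING, PROMOTE }

    static Signal decide(long candSamples, long ctrlSamples, double candErrorRate,
                         double candWinRate, double ctrlWinRate, double pValue,
                         double maxErrorRate, long minSampleSize,
                         double significanceThreshold, boolean autoPromote) {
        // Step 3: safety valve, independent of significance.
        if (candSamples >= 20 && candErrorRate > maxErrorRate) return Signal.ROLLBACK;
        // Step 4: keep accumulating until both arms reach the minimum sample.
        if (candSamples < minSampleSize || ctrlSamples < minSampleSize) return Signal.KEEP_RUNNING;
        // Step 6: promote only on a significant improvement.
        if (autoPromote && pValue < significanceThreshold && candWinRate > ctrlWinRate)
            return Signal.PROMOTE;
        return Signal.KEEP_RUNNING;
    }
}
```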

Observability

Structured logs (every request): experiment_id, variant_id, session_id, tenant_id, latency_ms, is_error, event_type.

Micrometer metrics (exported to any compatible backend):

  • prompt.ab.request.total{experiment_id, variant_id} — counter.
  • prompt.ab.latency{experiment_id, variant_id} — timer histogram.
  • prompt.ab.score{experiment_id, variant_id} — distribution summary.
  • prompt.ab.error_rate{experiment_id, variant_id} — gauge, updated each checker tick.

8. Local Execution

Prerequisites

  • Docker Desktop with Compose v2
  • JDK 17 (for local test runs)
  • Available ports: 8080 (app), 5432 (PostgreSQL)
  • OpenAI API key (or any Spring AI–compatible provider)

Project structure

prompt-ab-solution/
├── docker-compose.yml
├── .env.template
├── pom.xml
└── src/main/
    ├── java/com/exesolution/promptab/
    │   ├── PromptAbApplication.java
    │   ├── router/PromptRouter.java
    │   ├── registry/PromptRegistry.java         (JPA repos)
    │   ├── executor/LlmExecutor.java
    │   ├── evaluation/EvaluationPipeline.java
    │   ├── checker/SignificanceChecker.java
    │   ├── rollout/RolloutController.java
    │   ├── api/CompletionController.java
    │   ├── api/ExperimentController.java
    │   └── api/FeedbackController.java
    └── resources/
        ├── application.properties
        └── db/migration/
            ├── V1__create_prompt_version.sql
            ├── V2__create_ab_experiment.sql
            ├── V3__create_variant_assignment.sql
            ├── V4__create_metric_event.sql
            └── V5__create_experiment_event.sql

Environment variables

OPENAI_API_KEY=sk-...
DB_URL=jdbc:postgresql://localhost:5432/promptab
DB_USER=promptab
DB_PASS=promptab
SPRING_PROFILES_ACTIVE=local
SIGNIFICANCE_CHECK_INTERVAL_MS=300000   # 5 minutes
AUTO_PROMOTE_ENABLED=true

Docker Compose

services:
  app:
    build: .
    ports: ["8080:8080"]
    environment:
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      DB_URL: jdbc:postgresql://postgres:5432/promptab
    depends_on:
      postgres: { condition: service_healthy }

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: promptab
      POSTGRES_USER: promptab
      POSTGRES_PASSWORD: promptab
    ports: ["5432:5432"]
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "promptab"]
      interval: 5s
      retries: 10

Build and start

cp .env.template .env        # add OPENAI_API_KEY
docker compose up -d --build

Verification steps

1. Health check

curl -s http://localhost:8080/actuator/health | jq .
# Expected: {"status":"UP"}

2. Create prompt versions

# Control (v1.2.0)
curl -s -u admin:admin-secret -X POST http://localhost:8080/api/prompts \
  -H "Content-Type: application/json" \
  -d '{
    "tenantId": "t-001",
    "name": "customer-support-v1.2",
    "template": "You are a helpful customer support assistant.\n\nUser question: {{userInput}}\n\nProvide a clear, concise answer.",
    "modelHint": "gpt-4o-mini",
    "variables": {"userInput": "string"}
  }' | jq .id
# Save as CONTROL_ID

# Candidate (v1.3.0) — more structured output
curl -s -u admin:admin-secret -X POST http://localhost:8080/api/prompts \
  -H "Content-Type: application/json" \
  -d '{
    "tenantId": "t-001",
    "name": "customer-support-v1.3",
    "template": "You are an expert customer support assistant. Answer questions accurately and empathetically.\n\nQuestion: {{userInput}}\n\nAnswer (be specific, limit to 3 sentences):",
    "modelHint": "gpt-4o-mini",
    "variables": {"userInput": "string"}
  }' | jq .id
# Save as CANDIDATE_ID

3. Create and start experiment

curl -s -u admin:admin-secret -X POST http://localhost:8080/api/experiments \
  -H "Content-Type: application/json" \
  -d '{
    "tenantId": "t-001",
    "name": "support-prompt-conciseness-test",
    "controlVariantId": "<CONTROL_ID>",
    "candidateVariantId": "<CANDIDATE_ID>",
    "controlWeight": 90,
    "candidateWeight": 10,
    "primaryMetric": "WIN_RATE",
    "significanceThreshold": 0.05,
    "minSampleSize": 200,
    "autoPromote": true,
    "autoRollbackErrorRate": 0.05
  }' | jq .id
# Save as EXPERIMENT_ID

curl -s -u admin:admin-secret -X POST \
  http://localhost:8080/api/experiments/<EXPERIMENT_ID>/start

4. Send completion requests (simulate traffic)

# Session sess-001 will be consistently assigned to one variant
curl -s -X POST http://localhost:8080/api/complete \
  -H "Authorization: Bearer <USER_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"sessionId":"sess-001","tenantId":"t-001","input":"How do I reset my password?"}' \
  | jq '{output:.output, variantId:.variantId, latencyMs:.latencyMs}'

# Same session, second request — must return the same variantId (sticky)
curl -s -X POST http://localhost:8080/api/complete \
  -H "Authorization: Bearer <USER_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"sessionId":"sess-001","tenantId":"t-001","input":"What are your support hours?"}' \
  | jq .variantId
# Expected: same variantId as first request

5. Submit human feedback

curl -s -X POST http://localhost:8080/api/feedback \
  -H "Authorization: Bearer <USER_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"sessionId":"sess-001","positive":true}'

6. Check live metrics

curl -s -u admin:admin-secret \
  http://localhost:8080/api/experiments/<EXPERIMENT_ID>/metrics | jq .
# Expected: per-variant sample sizes, win-rates, avg latency, p-value

7. Verify sticky assignment at DB level

docker compose exec postgres psql -U promptab -c \
  "SELECT session_id, variant_id FROM variant_assignment \
   WHERE experiment_id = '<EXPERIMENT_ID>' LIMIT 10;"
# Each session_id maps to exactly one variant_id

8. Check audit log after auto-promote or manual action

curl -s -u admin:admin-secret \
  http://localhost:8080/api/experiments/<EXPERIMENT_ID>/audit | jq .
# Each entry shows: event_type, actor, rationale (p-value, win-rates), timestamp

9. Evidence Pack

Checklist of included evidence artifacts:

  • [*] Service startup logs showing Flyway migrations applied (V1–V5) and scheduler started
  • [*] GET /actuator/health returning UP with DB connectivity confirmed
  • [*] POST /api/prompts × 2 — two version IDs returned, DB rows visible in prompt_version
  • [*] POST /api/experiments + start — experiment status transitions DRAFT → ACTIVE
  • [*] POST /api/complete × 2 for same sessionId — both responses show identical variantId (sticky assignment proof)
  • [*] SELECT * FROM variant_assignment screenshot — single row per session confirming DB-level uniqueness constraint
  • [*] GET /api/experiments/{id}/metrics — per-variant sample counts, win-rates, and p-value after simulated traffic
  • [*] POST /api/feedback accepted and metric_event row visible in DB with event_type=HUMAN_FEEDBACK
  • [*] GET /api/experiments/{id}/audit — at least one STARTED event with full snapshot
  • [*] Auto-rollback demonstration: forced high error rate on candidate → rollback event in audit log within one checker cycle

10. Known Limitations

  • Single candidate per experiment: the current data model supports one control and one candidate variant. Multi-arm experiments (A/B/C/n) require extending the schema with an experiment_variant join table and updating the weight distribution logic.
  • Statistical test is fixed per metric type: win-rate uses a two-proportion z-test; continuous metrics use Welch's t-test. More sophisticated methods (Bayesian estimation, sequential probability ratio test for early stopping) are not included.
  • LLM-as-judge cost: every evaluated response triggers a second LLM call. For high-volume deployments, sample the evaluation rate (e.g., judge 20% of requests) to control cost. This sampling is not implemented in this solution.
  • Significance checker is single-node: the @Scheduled job runs on one instance. In a multi-instance deployment, use a distributed lock (e.g., ShedLock with the PostgreSQL adapter) to prevent duplicate checker runs.
  • No shadow testing mode: this solution implements canary (live traffic split) only. Shadow testing — where the candidate response is generated but not shown to the user — doubles inference cost and is not included.
  • min_sample_size is fixed at experiment creation: adaptive sample sizing (updated when variance estimates become available) requires a more complex power analysis loop not included here.

11. Extension Points

Multi-arm experiments

Replace control_variant_id / candidate_variant_id with an experiment_variant join table. The router selects a variant from the weighted list. The significance checker uses one-way ANOVA across arms for continuous metrics or a chi-square test for win-rates.
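Under that schema change, the router's selection step generalises from a two-way boundary check to a walk over cumulative weights. A sketch (Arm is a hypothetical projection of the experiment_variant row; the bucket is the sticky hash modulo the total weight):

```java
import java.util.List;

// Weighted selection across n arms. Deterministic for a given bucket,
// so sticky assignment carries over unchanged from the two-arm case.
public class MultiArmRouter {

    record Arm(String variantId, int weight) {}

    static String select(List<Arm> arms, int bucket) {
        int total = arms.stream().mapToInt(Arm::weight).sum();
        int point = bucket % total;
        for (Arm arm : arms) {
            if (point < arm.weight()) return arm.variantId();
            point -= arm.weight();
        }
        throw new IllegalStateException("weights must be positive");
    }
}
```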

Bayesian significance testing

Replace the frequentist z-test with a Beta-Binomial Bayesian model that computes the probability that the candidate variant is better. This eliminates the fixed sample size requirement and supports early stopping with controlled false-positive rates — better suited for low-traffic scenarios.
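A minimal Monte Carlo version of that comparison, assuming uniform Beta(1,1) priors (not part of this solution; a real implementation would likely use a statistics library rather than hand-rolled samplers):

```java
import java.util.Random;

// P(candidate better) estimated by sampling both Beta posteriors and
// counting how often the candidate draw exceeds the control draw.
public class BayesianAB {

    /** Gamma(shape, 1) sampler via Marsaglia–Tsang (valid here since shape >= 1). */
    static double sampleGamma(double shape, Random rnd) {
        double d = shape - 1.0 / 3.0, c = 1.0 / Math.sqrt(9 * d);
        while (true) {
            double x = rnd.nextGaussian(), v = Math.pow(1 + c * x, 3);
            if (v <= 0) continue;
            double u = rnd.nextDouble();
            if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v;
        }
    }

    /** Beta(a, b) sample as Gamma(a) / (Gamma(a) + Gamma(b)). */
    static double sampleBeta(double a, double b, Random rnd) {
        double ga = sampleGamma(a, rnd), gb = sampleGamma(b, rnd);
        return ga / (ga + gb);
    }

    /** Estimated P(candidate win-rate > control win-rate) under Beta(1,1) priors. */
    static double probCandidateBetter(int candWins, int candN, int ctrlWins, int ctrlN,
                                      int draws, long seed) {
        Random rnd = new Random(seed);
        int better = 0;
        for (int i = 0; i < draws; i++) {
            double cand = sampleBeta(1 + candWins, 1 + candN - candWins, rnd);
            double ctrl = sampleBeta(1 + ctrlWins, 1 + ctrlN - ctrlWins, rnd);
            if (cand > ctrl) better++;
        }
        return (double) better / draws;
    }
}
```

A natural decision rule is then "promote when P(candidate better) exceeds, say, 0.95", evaluated continuously rather than after a fixed sample.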

ShedLock for distributed significance checker

@Scheduled(fixedDelayString = "${significance.check.interval.ms}")
@SchedulerLock(name = "significance-checker", lockAtLeastFor = "PT4M")
public void check() { ... }

Add net.javacrumbs.shedlock:shedlock-spring and shedlock-provider-jdbc-template to ensure exactly-one execution per interval across any number of instances.

Prompt deployment decoupled from application release

Expose a POST /api/experiments/{id}/promote webhook that CI/CD pipelines can call after offline evaluation passes. This allows prompt engineers to iterate without requiring a Java developer or a deployment pipeline run.

Cost-aware auto-promotion

Extend the significance checker to include a cost gate: if the candidate variant uses more tokens per request than the control by more than a configurable margin, require a higher win-rate threshold before auto-promoting. This prevents the system from promoting more expensive prompts unless quality improvement justifies the cost.
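A sketch of such a gate (the margin and extra-lift thresholds are hypothetical configuration values, not properties of this solution):

```java
// Cost gate: a candidate that costs meaningfully more than the control must
// clear a higher win-rate bar before it is eligible for auto-promotion.
public class CostGate {

    static boolean passesCostGate(double candCostPerReq, double ctrlCostPerReq,
                                  double candWinRate, double ctrlWinRate,
                                  double allowedCostMargin, double extraWinRateLift) {
        double costRatio = candCostPerReq / ctrlCostPerReq;
        if (costRatio <= 1 + allowedCostMargin) {
            return candWinRate > ctrlWinRate;                 // normal bar
        }
        return candWinRate > ctrlWinRate + extraWinRateLift;  // higher bar when pricier
    }
}
```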


