LLM Semantic Cache in Spring Boot: Similarity Hits, Freshness Policies, and Safe Fallbacks

A semantic cache changes the serving contract. Instead of sending every request to the model, the application first asks whether a prior answer is similar enough, fresh enough, and safe enough to reuse.

Verified v1.0.0 · Red Hat 8/9 / Ubuntu / macOS / Windows (Docker) · Java 17 · Spring Boot 3.x · PostgreSQL/pgvector · Redis/Valkey · OpenAI-compatible LLM API · Docker Compose

1. Overview

This solution implements a production-ready semantic cache layer for Spring Boot applications that use LLM APIs. It is designed for teams that already have chat, RAG, or assistant-style endpoints in production, but need a safe way to reduce cost and latency for repeated or semantically similar requests without serving stale or misleading answers.

The problem it solves is straightforward: in production, many LLM requests are not identical, but they are similar enough that recomputing the answer is wasteful. A standard request cache misses because users phrase the same intent differently, while a naive semantic cache can return the wrong answer if similarity thresholds, freshness rules, and tenant boundaries are not enforced carefully.

Typical failure patterns include:

  • Users ask the same FAQ, support, or policy question with slightly different wording, so a normal key-based cache never hits and the system calls the model again.
  • A cached answer is returned for a semantically similar question even though the underlying source documents, prompt version, or business policy have changed.
  • Multi-tenant applications accidentally reuse cached answers across tenants because cache keys are partitioned too loosely.
  • A cache stores answers for high-risk prompts that should never be reused without revalidation, such as account-specific or time-sensitive requests.
  • A cache hit hides answer drift because the system has no policy for source-version invalidation, TTL, or safe bypass.

Existing approaches often fail in production because they treat semantic caching as “vector search over previous prompts” and stop there. That is enough for a benchmark demo, but not for a service where a fast answer is only useful if it is still valid, correctly scoped, and safe to reuse.

This implementation is production-ready because:

  • It separates cache lookup, match validation, freshness checks, and fallback execution into explicit runtime stages.
  • It partitions cache entries by tenant, model contract, prompt version, and content scope to prevent unsafe reuse.
  • It supports freshness policies, invalidation hooks, and safe bypass rules so the cache reduces cost and latency without quietly degrading answer quality.

In practice, this matters most in applications where repeated intent is common: customer support assistants, internal knowledge search, policy lookup, onboarding chat, developer copilots, and common workflow help. Public documentation from major platforms now treats semantic caching as a first-class optimization layer for LLM applications because it can reduce repeated model calls and lower latency when implemented with vector similarity and scoped cache policies.

2. Architecture

Request flow and dependencies:

  • Client → Spring Boot REST API
  • Spring Boot REST API → Request validation and cache eligibility policy
  • Spring Boot service → Query normalizer for cache-safe request canonicalization
  • Spring Boot service → Embedding provider for semantic lookup vector
  • Spring Boot service → Redis/Valkey for fast cache metadata and payload retrieval
  • Spring Boot service → PostgreSQL/pgvector or vector-capable cache index for similarity lookup
  • Spring Boot service → Cache match validator for threshold, tenant, prompt version, and freshness checks
  • Spring Boot service → LLM provider for fallback generation on cache miss or cache bypass
  • Spring Boot service → Source fingerprint service for invalidation and freshness policy enforcement
  • Spring Boot service → Persistence store for cache entries, hit/miss records, and invalidation history
  • Spring Boot REST API → JSON response containing answer, cache metadata, and hit/miss status

Key components:

  • API Controller: Accepts requests, validates input, and returns responses with cache outcome metadata.
  • Eligibility Policy Engine: Decides whether a request is allowed to use semantic caching at all.
  • Query Normalizer: Produces a stable normalized representation for embedding and policy checks.
  • Embedding Client: Converts the request into a vector used for similarity lookup.
  • Similarity Lookup Service: Searches prior cache entries for the nearest eligible match.
  • Match Validator: Applies tenant partitioning, threshold rules, prompt-version filters, freshness checks, and risk gating.
  • Cache Store: Stores cached prompt embeddings, response payloads, metadata, TTL state, and source fingerprints.
  • Fallback Generation Service: Calls the LLM when the request misses cache or fails validation.
  • Invalidation Service: Expires or suppresses entries when source documents, prompts, or policies change.
  • Observability Layer: Emits structured logs, metrics, and traces for hits, misses, bypasses, similarity scores, and stale-entry suppression.

Trust boundaries:

  • Inbound boundary: The REST API validates user identity, tenant scope, and whether the request type is cache-eligible.
  • Model boundary: Embeddings and generated answers are treated as untrusted until scoped and validated by cache policy.
  • Cache boundary: Cache payloads are never reused until similarity, freshness, and partition checks all pass.
  • Storage boundary: Only application-controlled services write cache entries, invalidation records, and hit/miss telemetry.
  • Tenant boundary: Cache lookup and reuse are always scoped by tenant and application contract to prevent cross-tenant leakage.

The architecture is deliberately arranged so that cache reuse is not a single decision based only on cosine similarity:

  • Eligibility determines whether the request is allowed to use semantic caching.
  • Similarity lookup finds candidate prior requests.
  • Validation checks whether the candidate is reusable in the current scope.
  • Freshness policy determines whether it is still valid.
  • Fallback execution regenerates the answer when any of those checks fail.

That separation matters in production. If a semantic cache is wrong, the team needs to know whether the problem came from the similarity threshold, tenant partitioning, invalidation policy, model drift, or unsafe reuse of a stale answer. Without that decomposition, all failures collapse into “the cache returned the wrong thing,” which is not actionable.
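
A compact sketch of how those stages can be composed is shown below. The stage interfaces and class names (CacheDecisionPipeline, EligibilityPolicy, and so on) are illustrative assumptions rather than the bundled implementation; the point is that each stage can fail independently and be observed independently.

import java.util.Optional;

// Illustrative stage interfaces; names are assumptions, not the bundled API.
interface EligibilityPolicy { boolean isCacheEligible(ChatRequest request); }
interface SimilarityLookup { Optional<CacheCandidate> findNearest(ChatRequest request); }
interface MatchValidator { boolean isReusable(ChatRequest request, CacheCandidate candidate); }
interface FreshnessPolicy { boolean isFresh(CacheCandidate candidate); }
interface FallbackGenerator { ChatResponse generate(ChatRequest request); }

record ChatRequest(String tenantId, String route, String prompt) {}
record CacheCandidate(String entryId, double similarity, String payload) {}
record ChatResponse(String answer, String cacheStatus) {}

/** Orchestrates the explicit stages: eligibility -> lookup -> validation -> freshness -> fallback. */
final class CacheDecisionPipeline {
    private final EligibilityPolicy eligibility;
    private final SimilarityLookup lookup;
    private final MatchValidator validator;
    private final FreshnessPolicy freshness;
    private final FallbackGenerator fallback;

    CacheDecisionPipeline(EligibilityPolicy e, SimilarityLookup l, MatchValidator v,
                          FreshnessPolicy f, FallbackGenerator g) {
        this.eligibility = e; this.lookup = l; this.validator = v; this.freshness = f; this.fallback = g;
    }

    ChatResponse handle(ChatRequest request) {
        if (!eligibility.isCacheEligible(request)) {
            return fallback.generate(request);                 // BYPASS: never consult the cache
        }
        Optional<CacheCandidate> candidate = lookup.findNearest(request);
        if (candidate.isPresent()
                && validator.isReusable(request, candidate.get())
                && freshness.isFresh(candidate.get())) {
            return new ChatResponse(candidate.get().payload(), "HIT");
        }
        return fallback.generate(request);                     // MISS or REJECTED_MATCH
    }
}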

3. Key Design Decisions

Technology stack

Spring Boot 3.x is used because it provides a strong operational base for REST APIs, policy enforcement, observability, transactions, and configuration management in Java 17 services.

Redis or Valkey is selected for low-latency cache payload access because cache hits must be significantly faster than full LLM execution to justify the additional lookup layer.

PostgreSQL with pgvector is used to store semantic cache embeddings and support similarity lookup with operational simplicity. This keeps the cache index close to the service data plane and works well for moderate workloads where correctness and observability matter more than extreme retrieval scale.

OpenAI-compatible LLM APIs are used for both generation and embeddings so the implementation can work with cloud-hosted or local providers behind the same contract.

Docker Compose is chosen for local execution because it makes the service, database, and cache easy to run and verify in a reproducible environment.

Why not use only Redis vector search or only a normal HTTP cache? Because the problem requires two different properties:

  • low-latency payload access for confirmed hits,
  • and semantic similarity lookup for non-identical prompts.

A simple request cache cannot handle paraphrases, and a pure vector store without cache policy controls does not make reuse safe.

Cache eligibility model

Not every LLM request should be semantically cached. The runtime first determines whether a request is eligible based on:

  • task type,
  • tenant policy,
  • risk classification,
  • personalization level,
  • and time sensitivity.

This is critical because some prompts should bypass the cache entirely, for example:

  • user-specific account questions,
  • requests involving ephemeral operational state,
  • prompts that depend on the current time,
  • and high-risk prompts where reuse without revalidation is unacceptable.

A semantic cache that tries to cover every request usually becomes unsafe.
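
A minimal eligibility sketch follows, assuming request attributes such as task type and risk class are already resolved upstream; the attribute names and rule set are illustrative, not the bundled policy engine.

import java.util.Set;

/**
 * Minimal eligibility sketch. The attribute names (taskType, riskClass, personalized,
 * timeSensitive) are illustrative assumptions about the request model.
 */
final class CacheEligibilityPolicy {

    private static final Set<String> CACHEABLE_TASKS = Set.of("FAQ", "POLICY_LOOKUP", "HOWTO");
    private static final Set<String> BLOCKED_RISK_CLASSES = Set.of("ACCOUNT_SPECIFIC", "REGULATED");

    boolean isCacheEligible(String taskType, String riskClass,
                            boolean personalized, boolean timeSensitive) {
        if (!CACHEABLE_TASKS.contains(taskType)) return false;   // unknown task types bypass by default
        if (BLOCKED_RISK_CLASSES.contains(riskClass)) return false;
        if (personalized || timeSensitive) return false;          // never reuse user- or time-bound answers
        return true;
    }
}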

Similarity threshold strategy

The similarity threshold is treated as an application policy, not just a vector-database parameter. The threshold may differ by route, task class, or content domain.

This matters because a low threshold may improve hit rate while silently degrading correctness, and a high threshold may preserve quality but remove most of the cost benefit. The system should therefore:

  • store the similarity score used for each hit,
  • expose threshold settings by endpoint or policy group,
  • and support offline tuning using known evaluation sets.

We do not hard-code one global threshold for all prompts because the acceptable reuse distance for FAQ-style content is not the same as for policy interpretation or workflow guidance.
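
One way to express route-scoped thresholds is sketched below. The concrete values and route names are placeholders; real values should come from configuration and offline evaluation rather than constants.

import java.util.Map;

/**
 * Route-scoped similarity thresholds. Values here are illustrative; in practice they
 * would be loaded from configuration and tuned against evaluation sets.
 */
final class SimilarityThresholdPolicy {

    private static final double DEFAULT_THRESHOLD = 0.92;

    private final Map<String, Double> routeThresholds = Map.of(
            "faq",               0.90,   // paraphrase-tolerant content
            "policy-lookup",     0.95,   // stricter reuse distance
            "workflow-guidance", 0.94
    );

    double thresholdFor(String routeName) {
        return routeThresholds.getOrDefault(routeName, DEFAULT_THRESHOLD);
    }

    boolean accepts(String routeName, double similarityScore) {
        return similarityScore >= thresholdFor(routeName);
    }
}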

Freshness and invalidation model

Each cache entry carries metadata that determines whether it is still safe to reuse:

  • creation time,
  • TTL,
  • prompt version,
  • model contract,
  • source fingerprint or corpus version,
  • and optional business policy version.

This is one of the most important decisions in the system. A semantic cache is only useful if it can serve a prior answer quickly, but it is only trustworthy if it knows when that answer is no longer valid.

A strong invalidation strategy usually combines:

  • TTL expiration for general age control,
  • source-version invalidation when documents or policy data change,
  • prompt-version invalidation when the application changes the answer contract,
  • and manual flush controls for emergency suppression.
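
A minimal freshness check combining those signals might look like the following sketch; the field and class names are assumptions about the entry metadata, not the bundled code. Any failed check rejects the candidate and forces fallback generation.

import java.time.Instant;

/** Freshness metadata carried by a candidate entry; field names are illustrative. */
record FreshnessContract(Instant ttlExpiresAt, String promptVersion, String sourceFingerprint) {}

/** Combines TTL, prompt-version, and source-fingerprint checks into one reuse decision. */
final class FreshnessCheck {

    boolean isFresh(FreshnessContract entry, String currentPromptVersion,
                    String currentSourceFingerprint, Instant now) {
        if (entry.ttlExpiresAt() != null && now.isAfter(entry.ttlExpiresAt())) {
            return false;                                  // TTL expired
        }
        if (!entry.promptVersion().equals(currentPromptVersion)) {
            return false;                                  // answer contract changed
        }
        if (!entry.sourceFingerprint().equals(currentSourceFingerprint)) {
            return false;                                  // underlying sources changed
        }
        return true;
    }
}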

Multi-tenant partitioning

Cache entries are partitioned by tenant and request contract. A match must share the correct tenant scope, route or feature identity, prompt version, and model family constraints before it is even considered reusable.

This matters because semantic similarity alone says nothing about ownership. Two tenants may ask the same policy question using different underlying documents, branding, access rules, or business policies. A semantic cache that is not partitioned properly can create data leakage and wrong-answer leakage at the same time.
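
A scoped lookup sketch against pgvector is shown below, assuming Spring's JdbcTemplate and cosine distance with the embedding passed as a vector literal; the exact SQL is an assumption. The important property is that tenant, route, and prompt-version filters are applied before ranking, never after.

import java.util.List;
import java.util.Map;
import org.springframework.jdbc.core.JdbcTemplate;

/**
 * Tenant- and contract-scoped similarity lookup against pgvector.
 * Table and column names follow the data model in section 4.
 */
final class ScopedSimilarityLookup {

    private final JdbcTemplate jdbc;

    ScopedSimilarityLookup(JdbcTemplate jdbc) { this.jdbc = jdbc; }

    List<Map<String, Object>> findCandidates(String tenantId, String routeName,
                                             String promptVersion, float[] embedding, int topK) {
        // Scope filters come BEFORE ranking so similarity never crosses tenant or contract boundaries.
        String sql = """
            SELECT id, response_payload, source_fingerprint, ttl_expires_at,
                   1 - (embedding <=> CAST(? AS vector)) AS similarity
            FROM semantic_cache_entries
            WHERE tenant_id = ? AND route_name = ? AND prompt_version = ?
            ORDER BY embedding <=> CAST(? AS vector)
            LIMIT ?
            """;
        String vectorLiteral = toVectorLiteral(embedding);
        return jdbc.queryForList(sql, vectorLiteral, tenantId, routeName, promptVersion,
                vectorLiteral, topK);
    }

    private static String toVectorLiteral(float[] v) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < v.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(v[i]);
        }
        return sb.append(']').toString();
    }
}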

Synchrony versus asynchrony

Cache lookup is synchronous because it must happen inline before the application decides whether to call the LLM. Invalidation, metrics aggregation, and some refresh operations can run asynchronously.

That split keeps the request path simple:

  • lookup inline,
  • validate inline,
  • fallback inline if needed,
  • update supporting telemetry asynchronously where possible.

Error handling and retries

  • Transient (embedding timeout, cache timeout, 5xx from LLM): bounded retry with strict timeout budget, then fallback or fail based on route policy.
  • Permanent (invalid request, schema mismatch, ineligible prompt): fail fast or bypass semantic cache entirely.

Retries are intentionally conservative because cache lookup should not consume more latency budget than the LLM call it is trying to avoid. If the semantic cache path becomes slow or unstable, the service should bypass it rather than amplify tail latency.
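
A minimal sketch of that degrade-to-bypass rule follows, assuming a strict time budget for the whole cache path; the budget value and helper names are illustrative.

import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/**
 * The semantic cache path gets a strict time budget; any timeout or error becomes a
 * bypass (empty result) instead of an exception that would slow or fail the request.
 */
final class BoundedCacheLookup {

    private static final Duration CACHE_PATH_BUDGET = Duration.ofMillis(300);

    <T> Optional<T> lookupOrBypass(Supplier<Optional<T>> cacheLookup) {
        try {
            return CompletableFuture.supplyAsync(cacheLookup)
                    .orTimeout(CACHE_PATH_BUDGET.toMillis(), TimeUnit.MILLISECONDS)
                    .exceptionally(ex -> Optional.empty())   // timeout or error => treat as bypass
                    .join();
        } catch (RuntimeException ex) {
            return Optional.empty();                          // defensive: never fail the request path
        }
    }
}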

Safe fallback behavior

Fallback is not treated as a failure. It is a first-class outcome. If the semantic cache candidate is stale, insufficiently similar, wrong-scope, or policy-ineligible, the system calls the LLM and optionally writes a fresh cache entry afterward.

This is essential because the semantic cache should optimize the answer path, not define correctness. When in doubt, the runtime should regenerate rather than guess.

4. Data Model

Core tables:

  • semantic_cache_entries

    • Purpose: Stores canonical semantic cache entries and metadata.
    • Key columns: id, tenant_id, cache_key, route_name, normalized_prompt, embedding, response_payload, prompt_version, model_name, source_fingerprint, ttl_expires_at, created_at
  • semantic_cache_hits

    • Purpose: Records successful cache hits for observability and tuning.
    • Key columns: id, tenant_id, cache_entry_id, request_id, similarity_score, validation_result, created_at
  • semantic_cache_misses

    • Purpose: Records misses, bypasses, and rejection reasons.
    • Key columns: id, tenant_id, request_id, route_name, miss_reason, top_similarity_score, created_at
  • semantic_cache_invalidations

    • Purpose: Stores invalidation events for entries or scoped cache segments.
    • Key columns: id, tenant_id, scope_type, scope_value, reason, triggered_by, created_at
  • source_versions

    • Purpose: Tracks source corpus or document fingerprints used for freshness checks.
    • Key columns: id, tenant_id, source_scope, source_fingerprint, created_at
  • prompt_contract_versions

    • Purpose: Tracks prompt or answer contract versions used to segment reuse.
    • Key columns: id, route_name, prompt_version, status, created_at

Indexing strategy:

  • semantic_cache_entries(tenant_id, route_name, prompt_version) for scoped cache lookup
  • pgvector index on semantic_cache_entries(embedding) for similarity search
  • semantic_cache_entries(tenant_id, ttl_expires_at) for expiration scanning
  • semantic_cache_entries(tenant_id, source_fingerprint) for targeted invalidation
  • semantic_cache_hits(cache_entry_id, created_at) for hit-rate analytics
  • semantic_cache_misses(tenant_id, route_name, created_at) for miss analysis
  • semantic_cache_invalidations(tenant_id, scope_type, created_at) for invalidation tracing

The structure above supports the core operational goals of semantic caching:

  1. Fast reuse — confirmed hits return quickly with low overhead.
  2. Safe scoping — entries are partitioned by tenant, route, version, and source scope.
  3. Freshness control — invalidation and TTL policies can suppress stale entries deterministically.
  4. Tuning visibility — the team can analyze hit rates, miss reasons, and threshold performance over time.

For entry design, the runtime should clearly distinguish between:

  • prompt text,
  • normalized prompt,
  • embedding vector,
  • response payload,
  • and freshness contract metadata.

That separation makes it easier to change prompt formatting, invalidation policy, or matching rules without breaking the whole cache model.
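
As an illustration of that separation, a cache entry carried inside the service might be modeled as follows; the field names follow the data model above, while the grouping into records is an assumption rather than the bundled persistence mapping.

import java.time.Instant;

/**
 * Sketch of a cache entry as carried inside the service, mirroring semantic_cache_entries.
 * Prompt text, normalized prompt, embedding, payload, and freshness metadata stay separate
 * so each can evolve independently.
 */
record SemanticCacheEntry(
        String id,
        String tenantId,
        String routeName,
        String prompt,              // original user text, kept for diagnostics only
        String normalizedPrompt,    // canonical form actually embedded and matched
        float[] embedding,          // similarity lookup vector
        String responsePayload,     // cached answer returned on a confirmed hit
        FreshnessMetadata freshness) {

    record FreshnessMetadata(
            String promptVersion,
            String modelName,
            String sourceFingerprint,
            Instant ttlExpiresAt,
            Instant createdAt) {}
}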

5. API Surface

  • POST /api/chat – Submit a prompt and receive answer plus cache hit metadata (ROLE_USER)
  • GET /api/cache/entries/{id} – Fetch a cache entry and its metadata for diagnostics (ROLE_ADMIN)
  • GET /api/cache/stats – Return hit rate, miss rate, bypass rate, and top miss reasons (ROLE_ADMIN)
  • POST /api/admin/cache/invalidate – Invalidate cache entries by tenant, route, source fingerprint, or prompt version (ROLE_ADMIN)
  • POST /api/admin/cache/flush – Flush a scoped cache partition (ROLE_ADMIN)
  • GET /actuator/health – Health endpoint for service readiness (ROLE_ADMIN / ops network)
  • GET /actuator/prometheus – Metrics scraping endpoint (ROLE_ADMIN / ops network)

Example response from POST /api/chat:

{
  "requestId": "e1f4f3a1-65aa-4ef8-9bb7-40be15000001",
  "answer": "Refund requests require finance review before entering the refund workflow.",
  "cache": {
    "status": "HIT",
    "semantic": true,
    "similarityScore": 0.94,
    "entryId": "sc_1023",
    "fresh": true
  }
}

Example miss response:

{
  "requestId": "e1f4f3a1-65aa-4ef8-9bb7-40be15000002",
  "answer": "Based on the latest policy documents, refund requests require finance review before entering the refund workflow.",
  "cache": {
    "status": "MISS",
    "semantic": false,
    "missReason": "SOURCE_FINGERPRINT_MISMATCH"
  }
}

The API surface is intentionally small because semantic caching is an optimization and control layer around an existing LLM endpoint, not a separate product by itself.

A few implementation notes matter here:

  • The runtime should expose cache metadata in responses only where it is useful and safe.
  • Admin invalidation endpoints should support scoped invalidation rather than only global flushes.
  • Stats endpoints should distinguish between MISS, BYPASS, and REJECTED_MATCH because those outcomes imply different tuning actions.
  • Feature owners should be able to disable semantic caching per route when quality is uncertain.
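
A response shape matching the examples above might be modeled as follows; the record names are illustrative, not the bundled DTOs.

/** Response shape for POST /api/chat. Exposing cache metadata is optional per route. */
record ChatApiResponse(String requestId, String answer, CacheOutcome cache) {

    record CacheOutcome(
            String status,          // HIT, MISS, BYPASS, or REJECTED_MATCH
            boolean semantic,
            Double similarityScore, // null when no candidate was scored
            String entryId,
            Boolean fresh,
            String missReason) {}   // e.g. SOURCE_FINGERPRINT_MISMATCH
}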

6. Security Model

Authentication

Authentication is handled through Spring Security with JWT bearer tokens or opaque access tokens issued by the upstream identity provider. Every request carries user identity and tenant identity claims.

Authorization (roles)

  • ROLE_USER: Can invoke cache-enabled endpoints within tenant scope.
  • ROLE_ADMIN: Can inspect entries, invalidate scoped cache partitions, and view operational cache statistics.
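
A sketch of how those roles map onto the endpoint list from section 5, using Spring Security 6; the JWT decoder and identity-provider wiring are assumed to exist elsewhere.

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

/** Role scoping for the cache-aware API; admin and ops endpoints require ROLE_ADMIN. */
@Configuration
class CacheSecurityConfig {

    @Bean
    SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http.authorizeHttpRequests(auth -> auth
                        .requestMatchers("/api/admin/**", "/api/cache/**", "/actuator/**").hasRole("ADMIN")
                        .requestMatchers("/api/chat").hasRole("USER")
                        .anyRequest().denyAll())
                .oauth2ResourceServer(oauth2 -> oauth2.jwt(jwt -> {})); // bearer tokens from the upstream IdP
        return http.build();
    }
}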

Data isolation guarantees

Every cache entry, hit record, miss record, and invalidation event includes tenant_id, and all lookup and reuse decisions are tenant-scoped. Route name, prompt version, and source fingerprint checks are also enforced before a candidate entry is reusable.

This isolation has to be applied consistently:

  • Similarity search must filter by tenant and route contract before ranking.
  • Invalidation endpoints must affect only the requested scope.
  • Admin inspection endpoints must never expose another tenant’s prompts or cached responses.
  • Cache payload encryption or field-level redaction may be required if cached outputs contain sensitive content.

Security also intersects with eligibility policy. Requests involving regulated or sensitive data may need to bypass semantic caching entirely, even when a strong semantic match exists.

7. Operational Behavior

Startup behavior

On startup, the application:

  • Validates required environment variables for Redis/Valkey, PostgreSQL, and model providers
  • Runs schema migrations
  • Verifies PostgreSQL connectivity
  • Verifies cache connectivity and health
  • Initializes embedding and generation clients
  • Loads prompt contract versions and cache policy settings
  • Registers health indicators for DB, cache, embedding provider, and LLM provider

The service only reports ready after all core dependencies used in the cache decision path are available or explicitly configured for bypass mode.

Failure modes

  • Cache unavailable: the service bypasses semantic caching and falls back to direct LLM execution if route policy allows.
  • Embedding provider unavailable: the semantic lookup path is skipped and the service falls back to direct LLM execution.
  • Vector store unavailable: the service records a cache-bypass event and falls back.
  • LLM unavailable: cache hits may still succeed, but misses cannot be generated; the route returns dependency failure if no safe cached answer is available.
  • Source fingerprint mismatch: the candidate entry is rejected and treated as a miss.
  • Prompt version mismatch: the candidate entry is ignored and treated as a miss.
  • TTL expired: the candidate entry is not reused and may be refreshed after fallback generation.

Those behaviors should be deliberate. The most dangerous semantic cache failure is not “missing a hit.” It is “reusing an answer that should have been bypassed.”

Retry and timeout behavior

Recommended defaults:

  • Embedding lookup timeout: 1s to 2s
  • Cache payload lookup timeout: 50ms to 150ms
  • Similarity search timeout: 100ms to 300ms
  • LLM generation timeout: route-specific, typically 5s to 15s

Retry policy:

  • Cache or vector lookup: at most one retry for short transient network failures
  • Embedding: at most one retry if the total request budget still makes semantic caching worthwhile
  • No repeated retries on cache path if they would exceed a reasonable fraction of model latency
  • LLM fallback follows its own route-level retry policy

If the semantic cache path becomes slow, the service should degrade toward direct model execution rather than turn optimization into added latency.
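
Those defaults are easiest to keep honest when they are bound to configuration rather than hard-coded; a sketch using Spring Boot constructor-bound properties is shown below, with an assumed property prefix and field names.

import java.time.Duration;
import org.springframework.boot.context.properties.ConfigurationProperties;

/** Binds the recommended timeout and retry defaults so they can be tuned per environment. */
@ConfigurationProperties(prefix = "app.cache.timeouts")
record CacheTimeoutProperties(
        Duration embeddingLookup,   // e.g. 1s to 2s
        Duration payloadLookup,     // e.g. 50ms to 150ms
        Duration similaritySearch,  // e.g. 100ms to 300ms
        Duration llmGeneration,     // route-specific, typically 5s to 15s
        int maxCachePathRetries) {  // keep at most 1; never amplify tail latency

    CacheTimeoutProperties {
        if (embeddingLookup == null) embeddingLookup = Duration.ofSeconds(2);
        if (payloadLookup == null) payloadLookup = Duration.ofMillis(150);
        if (similaritySearch == null) similaritySearch = Duration.ofMillis(300);
        if (llmGeneration == null) llmGeneration = Duration.ofSeconds(15);
    }
}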

Observability hooks

Structured logs: request_id, tenant_id, route_name, cache_status, similarity_score, cache_entry_id, miss_reason, prompt_version, source_fingerprint, latency_ms, model_name

OpenTelemetry traces:

  • semantic_cache.request: root span for the cache-aware request lifecycle
  • semantic_cache.embed: embedding provider call
  • semantic_cache.lookup_vector: similarity search span
  • semantic_cache.validate_match: threshold and freshness validation span
  • semantic_cache.fetch_payload: cache payload retrieval span
  • semantic_cache.fallback_generate: LLM fallback generation span
  • semantic_cache.store_entry: cache write-back span
  • semantic_cache.invalidate: invalidation action span

These hooks are essential because cache tuning is an empirical process. Operators need to answer:

  • Which routes have useful hit rates?
  • Are misses mostly due to threshold, freshness, or policy bypass?
  • Are stale answers being suppressed correctly?
  • Did a prompt rollout invalidate enough of the cache?
  • Is the cache actually reducing latency and LLM spend?

Without that telemetry, the team cannot tune thresholds or trust the optimization.
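
On the metrics side, a minimal Micrometer sketch for the hit/miss/bypass counters and similarity distributions might look like this; metric and tag names are illustrative and chosen to mirror the span names above.

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.time.Duration;

/** Records cache outcomes, latency, and similarity distributions for threshold tuning. */
final class SemanticCacheMetrics {

    private final MeterRegistry registry;

    SemanticCacheMetrics(MeterRegistry registry) { this.registry = registry; }

    void recordOutcome(String routeName, String cacheStatus, String missReason, Duration latency) {
        registry.counter("semantic_cache.requests",
                "route", routeName,
                "status", cacheStatus,                        // HIT, MISS, BYPASS, REJECTED_MATCH
                "miss_reason", missReason == null ? "none" : missReason)
            .increment();

        Timer.builder("semantic_cache.request.latency")
                .tag("route", routeName)
                .tag("status", cacheStatus)
                .register(registry)
                .record(latency);
    }

    void recordSimilarity(String routeName, double score) {
        // Distribution of observed similarity scores supports offline threshold tuning.
        registry.summary("semantic_cache.similarity_score", "route", routeName).record(score);
    }
}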

8. Local Execution

Prerequisites

  • Docker Desktop with Compose v2
  • JDK 17
  • Available ports: 8080, 5432, 6379

Environment variables

SPRING_DATASOURCE_URL=jdbc:postgresql://localhost:5432/semanticcachedb
SPRING_DATASOURCE_USERNAME=semantic
SPRING_DATASOURCE_PASSWORD=semantic

SPRING_DATA_REDIS_HOST=localhost
SPRING_DATA_REDIS_PORT=6379

LLM_API_BASE_URL=http://host.docker.internal:11434/v1
LLM_API_KEY=dummy
LLM_GENERATION_MODEL=gpt-4.1-mini
EMBEDDING_MODEL=text-embedding-3-small

APP_CACHE_TOP_K=3
APP_CACHE_SIMILARITY_THRESHOLD=0.92
APP_CACHE_DEFAULT_TTL_SECONDS=3600
APP_CACHE_ENABLE_SOURCE_FINGERPRINT=true
APP_CACHE_ENABLE_SEMANTIC_CACHE=true

Docker Compose usage

docker compose up -d --build

Verification steps

  1. Health check:
curl -s http://localhost:8080/actuator/health
  2. Send the first request to populate cache:
curl -s -X POST http://localhost:8080/api/chat \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is the refund process?",
    "tenantId": "demo-tenant"
  }'
  3. Send a semantically similar request and inspect hit status:
curl -s -X POST http://localhost:8080/api/chat \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "How do refund requests get handled?",
    "tenantId": "demo-tenant"
  }'
  4. Invalidate a scoped cache segment:
curl -s -X POST http://localhost:8080/api/admin/cache/invalidate \
  -H "Authorization: Bearer <ADMIN_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "tenantId": "demo-tenant",
    "scopeType": "SOURCE_FINGERPRINT",
    "scopeValue": "refund-policy-v4"
  }'

A practical local setup should include:

  • at least one repeated FAQ-style route,
  • one route configured to bypass semantic caching,
  • one source fingerprint change scenario,
  • and one threshold tuning example where a low score is rejected as unsafe.

A good demo does not only show a cache hit. It shows a hit, a bypass, and an invalidation-triggered miss so the safety model is visible.

9. Evidence Pack

Checklist of included evidence artifacts:

  • [ ] Service startup logs showing DB, cache, and provider readiness
  • [ ] Successful first POST /api/chat invocation with cache miss and response generation
  • [ ] Successful second semantically similar POST /api/chat invocation with cache hit metadata
  • [ ] Database record in semantic_cache_entries showing stored embedding, scope, and TTL fields
  • [ ] Cache miss record in semantic_cache_misses showing reason such as threshold rejection or source mismatch
  • [ ] Invalidation proof in semantic_cache_invalidations showing scoped invalidation event
  • [ ] Test evidence: integration test output for hit, miss, bypass, and invalidation behavior

A strong evidence pack makes this kind of solution credible. It shows that the semantic cache is not only theoretically sound, but actually scoped, observable, and safe to tune.

For a public-facing solution post, the most convincing artifacts are usually:

  • a startup log showing ready state,
  • a miss-then-hit request pair,
  • a scoped invalidation example,
  • and a database or cache inspection proving that stale entries are no longer reused after source change.

Those artifacts are also useful later when threshold or invalidation policy changes.

10. Known Limitations

  • The solution improves cost and latency, but it does not make weak answers correct; a cached bad answer is still a bad answer if eligibility and freshness policies are weak.
  • Similarity threshold tuning is domain-specific. A threshold that works for FAQ routes may be unsafe for policy or compliance-oriented questions.
  • Semantic caching adds operational complexity compared with direct model calls. For low-volume or highly personalized workloads, the extra layer may not be justified.

There are additional practical limitations worth stating directly.

First, embeddings and similarity search are an approximation. Strong semantic similarity does not guarantee that two prompts are interchangeable in business meaning.

Second, source-version invalidation only works if the application has a clear notion of what source state the answer depends on. Weak source fingerprinting leads to stale-hit risk.

Third, semantic caching is less suitable for highly personalized, rapidly changing, or stateful requests where reuse is inherently dangerous.

11. Extension Points

  • Replace PostgreSQL/pgvector with Redis vector search, OpenSearch, Elasticsearch, or a dedicated vector engine for larger-scale lookup workloads.
  • Add route-specific threshold tuning and offline evaluation datasets for cache safety calibration.
  • Add response chunk fingerprints so the cache can invalidate by affected source segment instead of broad corpus version.
  • Add gateway-level semantic caching in front of multiple model backends when the application architecture centralizes LLM traffic.
  • Production hardening: add encryption at rest for cached payloads, adaptive TTL policies, rate-aware bypass controls, and cost dashboards for stricter operational guarantees.

Several extensions are especially natural for real deployments.

Tenant-specific cache policies can apply different thresholds and TTLs per customer.

Hybrid cache strategy can combine exact-match caching, semantic caching, and result streaming for the best latency profile.

A/B threshold experiments can measure hit rate versus answer quality tradeoffs before applying policy changes globally.

Cost attribution dashboards can turn hit-rate gains into route-level savings estimates for product and platform owners.

Closing Notes

Semantic caching is not about making the model smarter. It is about making the serving layer more disciplined about when a prior answer can be reused safely.

A basic LLM service sends every request to the model. A semantic-cache-aware service asks a stricter question first: has a sufficiently similar, sufficiently fresh, correctly scoped answer already been computed?

That difference is small at the endpoint level, but large in production behavior. It is the difference between:

  • paying model cost again for every paraphrase,
  • and serving repeated intent with lower latency and lower spend while preserving control over freshness and safety.

If your current Spring Boot AI service already supports chat, RAG, or knowledge-style question answering, the next meaningful step is not necessarily a larger model or more aggressive prompt engineering. In many cases, the biggest upgrade is to add the missing serving controls:

  • semantic similarity lookup,
  • scoped cache eligibility,
  • freshness and invalidation policy,
  • and safe fallback execution.

That is what turns semantic caching from a benchmark trick into a production-grade optimization layer.

Changelog

2026-04-27 v1.0 created
