Problem Statement

Teams want to ship LLM features quickly. Operations need control:

keys must be scoped and rotated
usage must be isolated by tenant and project
spend must be predictable with hard caps
abuse must be blocked (prompt injection / jailbreak attempts / large payloads)
auditability must exist (for incident response + billing + compliance)

When each service calls the provider SDK directly, these controls are reimplemented per service and drift apart. A gateway enforces them in one place.

This solution delivers a Spring Boot LLM Gateway that sits between clients and an OpenAI-compatible provider.

Requirements

Runtime

Java 17 + Spring Boot 3.x
PostgreSQL (policy + audit + usage aggregation)
Redis (optional; recommended for high-QPS counters + caching)
Docker Compose for local runnable stack

Supported Provider

OpenAI-compatible API (configurable base URL + API key)

Architecture

High-Level Flow

Client → Gateway (AuthN/AuthZ) → Policy Engine → Quota/Rate Limit → (Cache?) → Provider → Response Filter → Audit → Client

Core Components

Tenant & Key Management
- tenants table (id, name, status, created_at)
- api_keys table (id, key_hash, tenant_id, scopes, status, created_at, last_used_at, rotated_at)
- key rotation workflow (new key active, old key expires after grace period)
Policy Engine
- input limits: max prompt bytes, max tokens, allowed models
- output clamp: max completion tokens
- deny rules (optional): basic patterns, disallowed content flags, models outside the allowlist
Quota & Cost Controls
- rate limits: requests/sec, concurrent in-flight
- budgets: daily/monthly spend caps per tenant/project
- token budgets: daily/monthly token caps
- enforcement modes:
  - HARD: reject once limit reached
  - SOFT: degrade (fallback model, lower max tokens)
  - OBSERVE: record but do not block (for rollout)
Caching (Optional)
- safe caching only for:
  - temperature=0 calls (treated as deterministic; providers do not strictly guarantee identical output)
  - explicitly cacheable routes
- key includes tenant_id + model + canonicalized prompt + relevant params
- TTL by route/policy
- cache hit/miss recorded
Audit & Evidence Store
- per-request record: tenant_id, key_id, model, tokens_in/out, latency_ms, policy_decision, outcome
- redactable prompt/response storage (policy-driven)
- trace_id for correlation

Data Model (Recommended)

Tables (minimum)

tenants

id (uuid)
name
status (ACTIVE/SUSPENDED)
created_at

api_keys

id
tenant_id
key_hash (never store raw key)
scopes (json)
status
created_at
last_used_at
rotated_at

policies

tenant_id
allowed_models (json)
max_prompt_bytes
max_input_tokens
max_output_tokens
rate_limit_rps
max_inflight
daily_budget_usd
monthly_budget_usd
daily_token_cap
monthly_token_cap
enforcement_mode (HARD/SOFT/OBSERVE)
redact_mode (NONE/BASIC/STRICT)

usage_rollup_daily

tenant_id
date
requests
tokens_in
tokens_out
cost_usd_est
blocked_requests

audit_log

request_id
tenant_id
key_id
model
request_ts
latency_ms
tokens_in
tokens_out
cost_usd_est
decision (ALLOW/BLOCK/DEGRADE)
reason_code
trace_id
cache_hit
prompt_redacted (optional)
response_redacted (optional)

Request Pipeline

1) Authentication

accept Authorization: Bearer <tenant_api_key>
hash the presented key and look it up by key_hash (constant-time compare)
resolve tenant + scopes
reject if tenant/key is suspended
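
The authentication step can be sketched as follows. This is a minimal illustration, not the gateway's actual code: ApiKeyRecord, ApiKeyAuthenticator, and the in-memory map (standing in for a lookup on api_keys.key_hash) are hypothetical names, and SHA-256 is an assumed hash choice. A tenant status check would follow the same pattern.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.Optional;

// Hypothetical row shape; field names follow the api_keys table.
record ApiKeyRecord(String id, String tenantId, String keyHash, String status) {}

final class ApiKeyAuthenticator {
    private static final String PREFIX = "Bearer ";
    // Stand-in for a database lookup by key_hash (assumption for this sketch).
    private final Map<String, ApiKeyRecord> keysByHash;

    ApiKeyAuthenticator(Map<String, ApiKeyRecord> keysByHash) {
        this.keysByHash = keysByHash;
    }

    // Hash the raw key so only the digest is ever stored or compared.
    static String sha256Hex(String rawKey) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(md.digest(rawKey.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Returns the key record for a valid, active key; empty otherwise.
    Optional<ApiKeyRecord> authenticate(String authorizationHeader) {
        if (authorizationHeader == null || !authorizationHeader.startsWith(PREFIX)) {
            return Optional.empty();
        }
        String presentedHash = sha256Hex(authorizationHeader.substring(PREFIX.length()).trim());
        ApiKeyRecord row = keysByHash.get(presentedHash);
        if (row == null) {
            return Optional.empty();
        }
        // MessageDigest.isEqual compares in constant time.
        boolean matches = MessageDigest.isEqual(
                row.keyHash().getBytes(StandardCharsets.UTF_8),
                presentedHash.getBytes(StandardCharsets.UTF_8));
        return matches && "ACTIVE".equals(row.status()) ? Optional.of(row) : Optional.empty();
    }
}
```

The raw key never leaves the authenticate method, which fits the rule of never logging the Authorization header.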

2) Input Normalization

canonicalize request:
- model
- temperature/top_p
- messages/prompt
- max_tokens (requested)
compute:
- prompt_bytes
- estimated tokens_in (approx ok; final tokens from provider response)
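
A rough sketch of normalization is shown below. The default temperature, the default max_tokens, and the 4-bytes-per-token heuristic are assumptions for illustration only; a tokenizer library would give a better estimate, and the provider response remains the source of truth.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;

record ChatMessage(String role, String content) {}

record NormalizedRequest(String model, double temperature, int maxTokens,
                         List<ChatMessage> messages, int promptBytes, int estimatedTokensIn) {}

final class RequestNormalizer {
    // Assumed values, not taken from any provider specification.
    static final double DEFAULT_TEMPERATURE = 1.0;
    static final int DEFAULT_MAX_TOKENS = 256;
    static final int BYTES_PER_TOKEN = 4;

    static NormalizedRequest normalize(String model, Double temperature, Integer maxTokens,
                                       List<ChatMessage> messages) {
        // Canonicalize roles; message content is left untouched.
        List<ChatMessage> canonical = messages.stream()
                .map(m -> new ChatMessage(m.role().trim().toLowerCase(Locale.ROOT), m.content()))
                .toList();
        // prompt_bytes is the UTF-8 size of all message contents.
        int promptBytes = canonical.stream()
                .mapToInt(m -> m.content().getBytes(StandardCharsets.UTF_8).length)
                .sum();
        // Ceiling division gives a conservative token estimate.
        int estimatedTokensIn = (promptBytes + BYTES_PER_TOKEN - 1) / BYTES_PER_TOKEN;
        return new NormalizedRequest(
                model.trim(),
                temperature == null ? DEFAULT_TEMPERATURE : temperature,
                maxTokens == null ? DEFAULT_MAX_TOKENS : maxTokens,
                canonical,
                promptBytes,
                estimatedTokensIn);
    }
}
```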

3) Policy Decision

verify allowed model
clamp max_tokens to policy max_output_tokens
enforce max_prompt_bytes
optional: enforce max_input_tokens (approx or via tokenizer lib)

Decision outcomes:

ALLOW
DEGRADE (rewrite request: cheaper model, lower max_tokens, stricter params)
BLOCK (reject with reason code)
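
One possible shape for the policy decision is sketched below. TenantPolicy, PolicyResult, and the reason codes are hypothetical. Reporting a clamped max_tokens as DEGRADE rather than ALLOW is one reading of the outcomes above, not a requirement.

```java
import java.util.Set;

enum Decision { ALLOW, DEGRADE, BLOCK }

// Subset of the policies table needed for this sketch.
record TenantPolicy(Set<String> allowedModels, int maxPromptBytes, int maxOutputTokens) {}

record PolicyResult(Decision decision, String reasonCode, int effectiveMaxTokens) {}

final class PolicyEngine {
    static PolicyResult evaluate(TenantPolicy policy, String model,
                                 int promptBytes, int requestedMaxTokens) {
        if (!policy.allowedModels().contains(model)) {
            return new PolicyResult(Decision.BLOCK, "MODEL_NOT_ALLOWED", 0);
        }
        if (promptBytes > policy.maxPromptBytes()) {
            return new PolicyResult(Decision.BLOCK, "PROMPT_TOO_LARGE", 0);
        }
        // Clamp the requested completion size to the policy ceiling.
        int clamped = Math.min(requestedMaxTokens, policy.maxOutputTokens());
        if (clamped < requestedMaxTokens) {
            return new PolicyResult(Decision.DEGRADE, "MAX_TOKENS_CLAMPED", clamped);
        }
        return new PolicyResult(Decision.ALLOW, "OK", clamped);
    }
}
```

The reason code is what ends up in audit_log.reason_code, so it helps to keep the set small and stable.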

4) Quota Enforcement

Rate limit

sliding window / token bucket (Redis recommended)
per-tenant and optionally per-key
enforce max in-flight with semaphore counter
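
The sketch below shows a token bucket combined with an in-flight semaphore for a single gateway instance. It is in-memory only. With several instances the counters would live in Redis, as recommended above. The class names and the one-second burst capacity are assumptions.

```java
import java.util.concurrent.Semaphore;
import java.util.function.LongSupplier;

final class TokenBucket {
    private final double capacity;
    private final double refillPerNano;
    private final LongSupplier nanoClock;
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double ratePerSecond, double capacity, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.refillPerNano = ratePerSecond / 1_000_000_000.0;
        this.nanoClock = nanoClock;
        this.tokens = capacity; // start full
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        // Refill in proportion to elapsed time, capped at capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * refillPerNano);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}

final class TenantLimiter {
    private final TokenBucket bucket;
    private final Semaphore inFlight;

    TenantLimiter(double rateLimitRps, int maxInflight, LongSupplier nanoClock) {
        // Assumption: allow a burst of up to one second of traffic (at least 1 request).
        this.bucket = new TokenBucket(rateLimitRps, Math.max(1.0, rateLimitRps), nanoClock);
        this.inFlight = new Semaphore(maxInflight);
    }

    // Returns true if admitted; the caller must call release() when the request ends.
    boolean tryEnter() {
        if (!inFlight.tryAcquire()) {
            return false; // too many concurrent requests
        }
        if (!bucket.tryAcquire()) {
            inFlight.release(); // give the slot back on rate-limit rejection
            return false;
        }
        return true;
    }

    void release() {
        inFlight.release();
    }
}
```

In production the clock would be System::nanoTime; injecting it keeps the limiter testable.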

Budget

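compute estimated cost for the request using configured price table:
- input_tokens * price_in + output_tokens * price_out (before the call, input_tokens is the estimate and output_tokens is at most the clamped max_tokens)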
check daily/monthly rollups
if limit reached:
- HARD: reject
- SOFT: degrade (cheaper model + max_tokens down)
- OBSERVE: allow, but flag in audit
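
The budget check can be sketched as below. ModelPrice, BudgetGuard, and BudgetOutcome are hypothetical names, prices are assumed to be per token, and "limit reached" is interpreted here as spend so far plus the estimate exceeding the cap. BigDecimal avoids floating-point drift in currency math.

```java
import java.math.BigDecimal;
import java.util.Map;

enum EnforcementMode { HARD, SOFT, OBSERVE }

enum BudgetOutcome { ALLOW, DEGRADE, BLOCK, ALLOW_FLAGGED }

// Assumed price table entry: USD per single token.
record ModelPrice(BigDecimal usdPerInputToken, BigDecimal usdPerOutputToken) {}

final class BudgetGuard {
    private final Map<String, ModelPrice> priceTable;

    BudgetGuard(Map<String, ModelPrice> priceTable) {
        this.priceTable = priceTable;
    }

    // input_tokens * price_in + output_tokens * price_out
    BigDecimal estimateCostUsd(String model, long inputTokens, long outputTokens) {
        ModelPrice price = priceTable.get(model);
        if (price == null) {
            throw new IllegalArgumentException("no price configured for model " + model);
        }
        return price.usdPerInputToken().multiply(BigDecimal.valueOf(inputTokens))
                .add(price.usdPerOutputToken().multiply(BigDecimal.valueOf(outputTokens)));
    }

    BudgetOutcome check(BigDecimal spentUsd, BigDecimal estimatedCostUsd,
                        BigDecimal budgetUsd, EnforcementMode mode) {
        boolean overBudget = spentUsd.add(estimatedCostUsd).compareTo(budgetUsd) > 0;
        if (!overBudget) {
            return BudgetOutcome.ALLOW;
        }
        return switch (mode) {
            case HARD -> BudgetOutcome.BLOCK;            // reject
            case SOFT -> BudgetOutcome.DEGRADE;          // cheaper model, lower max_tokens
            case OBSERVE -> BudgetOutcome.ALLOW_FLAGGED; // allow, flag in audit
        };
    }
}
```

The same check works for daily and monthly caps by passing the matching rollup total and budget.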

5) Cache (Optional)

only if policy allows caching
only if temperature=0 (or explicit allow)
record cache hit in audit
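
A sketch of the cacheability check and cache key follows. The key prefix, the choice of SHA-256, and the set of parameters included are assumptions. The important property is that tenant_id is part of the key so tenants never share cache entries.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.List;

final class CacheKeys {
    // Cache only when policy allows it and the call is temperature=0 or explicitly allowed.
    static boolean isCacheable(boolean policyAllowsCaching, double temperature,
                               boolean routeExplicitlyCacheable) {
        return policyAllowsCaching && (temperature == 0.0 || routeExplicitlyCacheable);
    }

    static String cacheKey(String tenantId, String model, List<String> canonicalMessages,
                           int maxTokens, double temperature) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
        update(md, tenantId);
        update(md, model);
        update(md, Integer.toString(maxTokens));
        update(md, Double.toString(temperature));
        for (String message : canonicalMessages) {
            update(md, message);
        }
        return "llmcache:" + tenantId + ":" + HexFormat.of().formatHex(md.digest());
    }

    // Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
    private static void update(MessageDigest md, String field) {
        byte[] bytes = field.getBytes(StandardCharsets.UTF_8);
        md.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
        md.update(bytes);
    }
}
```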

6) Provider Call

bounded timeout + retry only on safe transient failures
propagate trace_id
parse provider usage to get actual tokens/cost
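
Bounded retry on transient failures might look like the sketch below. TransientProviderException is a hypothetical marker for errors judged safe to retry (for example timeouts or 5xx responses). Anything else propagates immediately. The linear backoff is an assumption.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical marker for failures that are safe to retry.
final class TransientProviderException extends RuntimeException {
    TransientProviderException(String message) {
        super(message);
    }
}

final class BoundedRetry {
    // Runs the attempt once, then retries at most maxRetries times on transient failures.
    static <T> T call(Supplier<T> attempt, int maxRetries, Duration baseBackoff) {
        int failures = 0;
        while (true) {
            try {
                return attempt.get();
            } catch (TransientProviderException e) {
                if (failures >= maxRetries) {
                    throw e; // retry budget exhausted
                }
                failures++;
                sleep(baseBackoff.multipliedBy(failures)); // linear backoff
            }
        }
    }

    private static void sleep(Duration d) {
        try {
            Thread.sleep(d.toMillis());
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", ie);
        }
    }
}
```

The per-attempt timeout belongs inside the supplied attempt, so the total worst-case latency is bounded by (1 + maxRetries) timeouts plus the backoff.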

7) Response Filtering

ensure no provider metadata leaks secrets
optionally redact sensitive fields
return to client

8) Persist Audit + Update Usage Rollups

write audit_log
update usage_rollup_daily (transaction or async with idempotency)
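
The idempotency requirement can be illustrated with the in-memory sketch below. In the real gateway the same effect would typically come from a unique constraint on request_id plus an upsert into usage_rollup_daily. The class names here are hypothetical.

```java
import java.time.LocalDate;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

record UsageEvent(String requestId, String tenantId, LocalDate date,
                  long tokensIn, long tokensOut, boolean blocked) {}

record DailyRollup(long requests, long tokensIn, long tokensOut, long blockedRequests) {
    static DailyRollup of(UsageEvent e) {
        return new DailyRollup(1, e.tokensIn(), e.tokensOut(), e.blocked() ? 1 : 0);
    }

    DailyRollup plus(DailyRollup other) {
        return new DailyRollup(requests + other.requests, tokensIn + other.tokensIn,
                tokensOut + other.tokensOut, blockedRequests + other.blockedRequests);
    }
}

final class RollupWriter {
    // Stand-ins for audit_log (unique request_id) and usage_rollup_daily.
    private final Set<String> appliedRequestIds = ConcurrentHashMap.newKeySet();
    private final Map<String, DailyRollup> rollups = new ConcurrentHashMap<>();

    // Returns false when the request_id was already applied (duplicate delivery).
    boolean apply(UsageEvent e) {
        if (!appliedRequestIds.add(e.requestId())) {
            return false;
        }
        rollups.merge(key(e.tenantId(), e.date()), DailyRollup.of(e), DailyRollup::plus);
        return true;
    }

    DailyRollup rollupFor(String tenantId, LocalDate date) {
        return rollups.getOrDefault(key(tenantId, date), new DailyRollup(0, 0, 0, 0));
    }

    private static String key(String tenantId, LocalDate date) {
        return tenantId + "|" + date;
    }
}
```

Because a replayed event is a no-op, an async worker can retry deliveries without double counting usage.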

Failure Modes & Safeguards

Failure Mode: Provider Outage / 5xx

Mitigation

short bounded retries (max 1–2) for transient errors
circuit breaker to shed load
fallback strategy (optional):
- alternative provider endpoint
- lower-capability model
- cached responses if allowed
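
A minimal count-based circuit breaker is sketched below to show the load-shedding idea. In a Spring Boot service a library such as Resilience4j would normally provide this. The class, its thresholds, and the simplified half-open behavior are assumptions.

```java
import java.util.function.LongSupplier;

final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openDurationNanos;
    private final LongSupplier nanoClock;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAtNanos = 0;

    SimpleCircuitBreaker(int failureThreshold, long openDurationNanos, LongSupplier nanoClock) {
        this.failureThreshold = failureThreshold;
        this.openDurationNanos = openDurationNanos;
        this.nanoClock = nanoClock;
    }

    // While OPEN, requests are shed; after the open duration, probes are let through.
    synchronized boolean allowRequest() {
        if (state == State.OPEN && nanoClock.getAsLong() - openedAtNanos >= openDurationNanos) {
            state = State.HALF_OPEN;
        }
        return state != State.OPEN;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    // Opens after failureThreshold consecutive failures, or on any failed probe.
    synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAtNanos = nanoClock.getAsLong();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

When allowRequest() returns false the gateway can go straight to the fallback strategy instead of waiting on a provider timeout.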

Failure Mode: Redis Down (if used)

Mitigation

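policy setting:
- HARD FAIL (fail closed): block requests (strict)
- SOFT FAIL (fail open): allow but log “quota_unavailable” (risky: limits are not enforced)
recommended: degrade throughput (for example, conservative per-instance limits) while the quota store is unavailable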

Failure Mode: Budget Calculation Mismatch

Mitigation

always store provider-reported tokens in audit when available
rollups based on actual usage
price table versioning

Failure Mode: Key Leakage

Mitigation

keys are hashed at rest
rotate keys quickly + revoke old key
per-key rate limits and anomaly alerting (optional)

Security & Compliance

Prompt/Response Storage

Storage modes:

NONE: do not store prompts/responses, only metadata
BASIC: store truncated + redacted
STRICT: store content hashes plus small safe excerpts; never full text

Controls

never log Authorization header
redact secret and PII patterns (API keys, JWTs, emails) before persistence
configurable retention (e.g., audit logs 7–30 days, rollups 90+ days)
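
Pattern-based redaction before persistence could be sketched as follows. The regular expressions are illustrative assumptions (the "sk-" key prefix in particular), not an exhaustive or provider-accurate set, and real deployments usually need broader rules.

```java
import java.util.List;
import java.util.regex.Pattern;

final class SecretRedactor {
    private record Rule(Pattern pattern, String replacement) {}

    // Rules are applied in order; replacements contain no regex group references.
    private static final List<Rule> RULES = List.of(
            // JWT-like: three base64url segments, the first starting with "eyJ".
            new Rule(Pattern.compile("\\beyJ[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+"),
                    "[REDACTED_JWT]"),
            // API-key-like: assumed "sk-" prefix followed by 16 or more key characters.
            new Rule(Pattern.compile("\\bsk-[A-Za-z0-9_-]{16,}"), "[REDACTED_API_KEY]"),
            // Email-like addresses.
            new Rule(Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"),
                    "[REDACTED_EMAIL]"));

    static String redact(String text) {
        String out = text;
        for (Rule rule : RULES) {
            out = rule.pattern().matcher(out).replaceAll(rule.replacement());
        }
        return out;
    }
}
```

Running redaction before truncation avoids persisting half of a secret that straddles the cut-off.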

Cost & Scaling Notes

Scaling Strategy

stateless gateway instances behind load balancer
Redis for fast counters + caching
Postgres for durable audit + rollups
audit writes can be asynchronous if each write is idempotent on request_id

Hot Paths

auth lookup (cache the key_hash → key_id/tenant resolution in Redis)
quota counters (Redis)
audit writes (batch insert or async worker if needed)

Cost Controls Best Practices

per-tenant monthly cap + per-key RPS cap
max output tokens clamp prevents “runaway responses”
degrade mode reduces bill shock during traffic spikes

Verification Checklist (Evidence to Publish)

Publish these as evidence artifacts for buyers:

Quota block
- show tenant hitting daily budget
- gateway returns an HTTP 429 or 402-style error with a reason code
Degrade mode
- show request rewritten to cheaper model + max_tokens reduced
Audit record
- screenshot/log of audit_log entry with tokens, cost estimate, decision, trace_id
Rate limit
- load test script showing 429 after threshold
Cache hit
- demonstrate repeated request served from cache + audit marks cache_hit=true

Run Instructions

docker compose up -d
create tenant + API key (bootstrap script)
call gateway endpoint with Bearer key
observe:
- audit_log updates
- usage_rollup_daily increments
test enforcement:
- set low daily_budget_usd
- run request loop to trigger block/degrade

Configuration Reference

Provider:
- LLM_BASE_URL
- LLM_API_KEY
Gateway:
- GATEWAY_PORT
- AUTH_HEADER=Authorization
Policy defaults:
- DEFAULT_ALLOWED_MODELS
- DEFAULT_MAX_PROMPT_BYTES
- DEFAULT_MAX_OUTPUT_TOKENS
- DEFAULT_RATE_LIMIT_RPS
- DEFAULT_MONTHLY_BUDGET_USD
- DEFAULT_ENFORCEMENT_MODE (HARD/SOFT/OBSERVE)
Redis:
- REDIS_URL (optional but recommended)
Postgres:
- SPRING_DATASOURCE_URL
- SPRING_DATASOURCE_USERNAME
- SPRING_DATASOURCE_PASSWORD

Changelog Guidance

Record changes that affect:

enforcement behavior (HARD/SOFT/OBSERVE)
pricing tables / cost estimation
schema changes (audit and rollups)
caching behavior
security/redaction modes

LLM Gateway for Spring Boot: Multi-tenant API Keys, Quotas, and Cost Controls
