Problem Statement
Teams want to ship LLM features quickly. Operations need control:
- keys must be scoped and rotated
- usage must be isolated by tenant and project
- spend must be predictable with hard caps
- abuse must be blocked (prompt injection / jailbreak attempts / large payloads)
- auditability must exist (for incident response + billing + compliance)
Calling provider SDKs directly from each service leaves these controls scattered and inconsistent. The gateway centralizes them.
This solution delivers a Spring Boot LLM Gateway that sits between clients and an OpenAI-compatible provider.
Requirements
Runtime
- Java 17 + Spring Boot 3.x
- PostgreSQL (policy + audit + usage aggregation)
- Redis (optional; recommended for high-QPS counters + caching)
- Docker Compose for a runnable local stack
Supported Provider
- OpenAI-compatible API (configurable base URL + API key)
Architecture
High-Level Flow
Client → Gateway (AuthN/AuthZ) → Policy Engine → Quota/Rate Limit → (Cache?) → Provider → Response Filter → Audit → Client
Core Components
Tenant & Key Management
- tenants table (id, name, status)
- api_keys table (key_hash, tenant_id, scopes, status, created_at, last_used_at, rotated_at)
- key rotation workflow (new key active, old key expires after grace period)
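The grace-period rule in the rotation workflow can be sketched as a pure function. This is a minimal sketch, not the gateway's actual code: `ApiKeyRow`, `KeyRotation`, the `ACTIVE` status value for keys, and the idea that `rotated_at` marks the start of the grace period are all assumptions made for illustration.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical view of an api_keys row; rotatedAt is null until the key is rotated. */
record ApiKeyRow(String id, String status, Instant rotatedAt) {}

final class KeyRotation {
    /**
     * A key is usable when it is ACTIVE and either was never rotated
     * or is still inside the grace period that started at rotatedAt.
     */
    static boolean isUsable(ApiKeyRow key, Instant now, Duration grace) {
        if (!"ACTIVE".equals(key.status())) {
            return false; // suspended or revoked keys are never usable
        }
        if (key.rotatedAt() == null) {
            return true; // current key, not yet rotated
        }
        return now.isBefore(key.rotatedAt().plus(grace));
    }
}
```

Passing `now` in as a parameter keeps the rule easy to test without a real clock.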
Policy Engine
- input limits: max prompt bytes, max tokens, allowed models
- output clamp: max completion tokens
- deny rules (optional): basic patterns, disallowed content flags, model allowlist
Quota & Cost Controls
- rate limits: requests/sec, concurrent in-flight
- budgets: daily/monthly spend caps per tenant/project
- token budgets: daily/monthly token caps
- enforcement modes:
  - HARD: reject once limit reached
  - SOFT: degrade (fallback model, lower max tokens)
  - OBSERVE: record but do not block (for rollout)
Caching (Optional)
- safe caching only for:
  - deterministic temperature=0 calls
  - explicitly cacheable routes
- key includes tenant_id + model + canonicalized prompt + relevant params
- TTL by route/policy
- cache hit/miss recorded
Audit & Evidence Store
- per-request record: tenant_id, key_id, model, tokens_in/out, latency_ms, policy_decision, outcome
- redactable prompt/response storage (policy-driven)
- trace_id for correlation
Data Model (Recommended)
Tables (minimum)
tenants
- id (uuid)
- name
- status (ACTIVE/SUSPENDED)
- created_at
api_keys
- id
- tenant_id
- key_hash (never store raw key)
- scopes (json)
- status
- created_at
- last_used_at
- rotated_at
policies
- tenant_id
- allowed_models (json)
- max_prompt_bytes
- max_input_tokens
- max_output_tokens
- rate_limit_rps
- max_inflight
- daily_budget_usd
- monthly_budget_usd
- daily_token_cap
- monthly_token_cap
- enforcement_mode (HARD/SOFT/OBSERVE)
- redact_mode (NONE/BASIC/STRICT)
usage_rollup_daily
- tenant_id
- date
- requests
- tokens_in
- tokens_out
- cost_usd_est
- blocked_requests
audit_log
- request_id
- tenant_id
- key_id
- model
- request_ts
- latency_ms
- tokens_in
- tokens_out
- cost_usd_est
- decision (ALLOW/BLOCK/DEGRADE)
- reason_code
- trace_id
- prompt_redacted (optional)
- response_redacted (optional)
Request Pipeline
1) Authentication
- accept Authorization: Bearer <tenant_api_key>
- look up by hash (constant-time compare)
- resolve tenant + scopes
- reject if tenant/key is suspended
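The hash lookup and constant-time compare above might look like the following. This is a sketch under assumptions: SHA-256 as the key hash and the `ApiKeyHasher` helper are illustrative choices, not something the design mandates. `MessageDigest.isEqual` is the JDK's constant-time byte comparison.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class ApiKeyHasher {
    /** Extracts the token from "Bearer <token>", or returns null if the header is malformed. */
    static String bearerToken(String authorizationHeader) {
        if (authorizationHeader == null) return null;
        String prefix = "Bearer ";
        if (!authorizationHeader.regionMatches(true, 0, prefix, 0, prefix.length())) return null;
        String token = authorizationHeader.substring(prefix.length()).trim();
        return token.isEmpty() ? null : token;
    }

    /** SHA-256 of the raw key; only this digest is stored, never the key itself. */
    static byte[] sha256(String rawKey) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(rawKey.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Hex form of the digest, usable as the lookup value for api_keys.key_hash. */
    static String sha256Hex(String rawKey) {
        return HexFormat.of().formatHex(sha256(rawKey));
    }

    /** Constant-time comparison of the presented key's digest against the stored digest. */
    static boolean matches(String presentedKey, byte[] storedHash) {
        return MessageDigest.isEqual(sha256(presentedKey), storedHash);
    }
}
```

A fast unsalted hash is workable here only because API keys are long random strings; it would not be appropriate for user-chosen passwords.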
2) Input Normalization
- canonicalize request:
  - model
  - temperature/top_p
  - messages/prompt
  - max_tokens (requested)
- compute:
  - prompt_bytes
  - estimated tokens_in (an approximation is fine; final counts come from the provider response)
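The two computed values can be sketched as below. The roughly-four-bytes-per-token heuristic is an assumption chosen for illustration (it is a common rule of thumb for English text, not a tokenizer), and `ChatMessage` and `InputMetrics` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical minimal chat message: a role plus text content. */
record ChatMessage(String role, String content) {}

final class InputMetrics {
    /** Total UTF-8 size of all message contents, compared against max_prompt_bytes. */
    static int promptBytes(List<ChatMessage> messages) {
        int total = 0;
        for (ChatMessage m : messages) {
            total += m.content().getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    /** Rough estimate: about 4 bytes per token, rounded up. Assumption, not a tokenizer. */
    static int estimateTokens(int promptBytes) {
        return (promptBytes + 3) / 4;
    }
}
```

Counting bytes rather than characters keeps the limit meaningful for non-ASCII prompts, where one character can occupy several bytes.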
3) Policy Decision
- verify allowed model
- clamp max_tokens to policy max_output_tokens
- enforce max_prompt_bytes
- optional: enforce max_input_tokens (approx or via tokenizer lib)
Decision outcomes:
- ALLOW
- DEGRADE (rewrite request: cheaper model, lower max_tokens, stricter params)
- BLOCK (reject with reason code)
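A minimal sketch of the decision step follows. The reason codes, the type names, and the choice to report a clamped max_tokens as DEGRADE are assumptions for illustration; the design only fixes the three outcomes.

```java
import java.util.Set;

enum Decision { ALLOW, DEGRADE, BLOCK }

/** Hypothetical subset of the policies row used by this step. */
record Policy(Set<String> allowedModels, int maxPromptBytes, int maxOutputTokens) {}

/** Outcome plus the max_tokens value that should be sent to the provider. */
record PolicyResult(Decision decision, String reasonCode, int effectiveMaxTokens) {}

final class PolicyEngine {
    static PolicyResult evaluate(Policy p, String model, int promptBytes, int requestedMaxTokens) {
        if (!p.allowedModels().contains(model)) {
            return new PolicyResult(Decision.BLOCK, "MODEL_NOT_ALLOWED", 0);
        }
        if (promptBytes > p.maxPromptBytes()) {
            return new PolicyResult(Decision.BLOCK, "PROMPT_TOO_LARGE", 0);
        }
        if (requestedMaxTokens <= 0) {
            // Client did not set max_tokens: apply the policy ceiling without flagging it.
            return new PolicyResult(Decision.ALLOW, "OK", p.maxOutputTokens());
        }
        if (requestedMaxTokens > p.maxOutputTokens()) {
            // The request is rewritten, so it is reported as DEGRADE (assumption).
            return new PolicyResult(Decision.DEGRADE, "MAX_TOKENS_CLAMPED", p.maxOutputTokens());
        }
        return new PolicyResult(Decision.ALLOW, "OK", requestedMaxTokens);
    }
}
```

Returning the reason code alongside the decision lets the same value flow into audit_log.reason_code and the client-facing error.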
4) Quota Enforcement
Rate limit
- sliding window / token bucket (Redis recommended)
- per-tenant and optionally per-key
- enforce max in-flight with semaphore counter
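The two rate-limit mechanisms can be illustrated with an in-memory sketch. It only limits a single gateway instance; a Redis-backed counter would be needed to share limits across instances. `TokenBucket` and `InFlightLimiter` are hypothetical names, and the clock is passed in explicitly so the behavior is deterministic.

```java
import java.util.concurrent.Semaphore;

/** Token bucket: holds up to `capacity` tokens, refilled at `refillPerSecond`. */
final class TokenBucket {
    private final double capacity;
    private final double refillPerSecond;
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double capacity, double refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity; // start full so an initial burst is allowed
        this.lastRefillNanos = nowNanos;
    }

    /** Consumes one token if available; returns false when the caller should get a 429. */
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1_000_000_000.0;
        if (elapsedSeconds > 0) {
            tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
            lastRefillNanos = nowNanos;
        }
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}

/** Caps concurrent in-flight requests with a semaphore. */
final class InFlightLimiter {
    private final Semaphore permits;

    InFlightLimiter(int maxInflight) {
        this.permits = new Semaphore(maxInflight);
    }

    boolean tryEnter() { return permits.tryAcquire(); }

    /** Must be called exactly once per successful tryEnter, typically in a finally block. */
    void exit() { permits.release(); }
}
```

In a request handler, `tryAcquire(System.nanoTime())` would run before the provider call and `exit()` in a `finally` block after it.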
Budget
- compute estimated cost for the request using the configured price table:
  - input_tokens * price_in + output_tokens * price_out
- check daily/monthly rollups
- if limit reached:
  - HARD: reject
  - SOFT: degrade (cheaper model + max_tokens down)
  - OBSERVE: allow, but flag in audit
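The cost formula and the three enforcement modes can be sketched as follows. Assumptions: prices are per token (matching the formula as written), the limit counts as reached when spend so far plus the estimate would exceed the cap, and `BudgetCheck` and `BudgetOutcome` are hypothetical names. `BigDecimal` avoids floating-point drift in money values.

```java
import java.math.BigDecimal;

enum EnforcementMode { HARD, SOFT, OBSERVE }

enum BudgetOutcome { ALLOW, DEGRADE, REJECT, ALLOW_FLAGGED }

final class BudgetCheck {
    /** input_tokens * price_in + output_tokens * price_out, with per-token prices. */
    static BigDecimal estimateCost(long inputTokens, long outputTokens,
                                   BigDecimal priceIn, BigDecimal priceOut) {
        return priceIn.multiply(BigDecimal.valueOf(inputTokens))
                .add(priceOut.multiply(BigDecimal.valueOf(outputTokens)));
    }

    /** Maps an over-budget condition to an outcome according to the enforcement mode. */
    static BudgetOutcome check(BigDecimal spentSoFar, BigDecimal estimate,
                               BigDecimal cap, EnforcementMode mode) {
        boolean overBudget = spentSoFar.add(estimate).compareTo(cap) > 0;
        if (!overBudget) {
            return BudgetOutcome.ALLOW;
        }
        return switch (mode) {
            case HARD -> BudgetOutcome.REJECT;           // block the request
            case SOFT -> BudgetOutcome.DEGRADE;          // cheaper model, lower max_tokens
            case OBSERVE -> BudgetOutcome.ALLOW_FLAGGED; // allow, but flag in audit
        };
    }
}
```

For example, 1000 input tokens at 0.000002 and 500 output tokens at 0.000006 give an estimate of 0.005.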
5) Cache (Optional)
- only if policy allows caching
- only if temperature=0 (or explicit allow)
- record cache hit in audit
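One way to build the cache key from tenant_id, model, canonicalized prompt, and relevant params is sketched below. The canonicalization shown (normalize line endings, strip outer whitespace) is a deliberately conservative assumption, and treating max_tokens as the only extra relevant param is also an assumption. `CacheKeys` is a hypothetical helper.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Optional;

final class CacheKeys {
    /** Conservative canonicalization: unify line endings and strip outer whitespace. */
    static String canonicalize(String prompt) {
        return prompt.replace("\r\n", "\n").strip();
    }

    /**
     * Returns a key only when the call is cacheable: temperature is 0,
     * or the route is explicitly marked cacheable.
     */
    static Optional<String> keyFor(String tenantId, String model, String prompt,
                                   double temperature, int maxTokens, boolean routeCacheable) {
        if (temperature != 0.0 && !routeCacheable) {
            return Optional.empty();
        }
        // The tenant id comes first so one tenant can never read another tenant's entry.
        String material = String.join("\u001f", tenantId, model,
                Double.toString(temperature), Integer.toString(maxTokens),
                canonicalize(prompt));
        return Optional.of(sha256Hex(material));
    }

    private static String sha256Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
```

Hashing the key material keeps keys at a fixed length and avoids placing prompt text in the cache's key space.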
6) Provider Call
- bounded timeout + retry only on safe transient failures
- propagate trace_id
- parse provider usage to get actual tokens/cost
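The "retry only on safe transient failures" rule can be sketched as a small helper. Which status codes count as transient (429, 502, 503, 504 here) is an assumption, as are the `ProviderException` and `BoundedRetry` names. Timeouts and backoff delays are left to the HTTP client and caller and are not shown.

```java
import java.util.function.Supplier;

/** Hypothetical exception carrying the provider's HTTP status. */
final class ProviderException extends RuntimeException {
    final int status;

    ProviderException(int status) {
        super("provider returned HTTP " + status);
        this.status = status;
    }
}

final class BoundedRetry {
    /** Assumed set of transient statuses; anything else is returned to the client at once. */
    static boolean isTransient(int status) {
        return status == 429 || status == 502 || status == 503 || status == 504;
    }

    /** Runs the attempt, retrying at most maxRetries times on transient failures. */
    static <T> T call(Supplier<T> attempt, int maxRetries) {
        int retries = 0;
        while (true) {
            try {
                return attempt.get();
            } catch (ProviderException e) {
                if (!isTransient(e.status) || retries >= maxRetries) {
                    throw e; // non-transient, or retry budget exhausted
                }
                retries++;
            }
        }
    }
}
```

With `maxRetries` set to 2, the provider is called at most three times per client request.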
7) Response Filtering
- ensure no provider metadata leaks secrets
- optionally redact sensitive fields
- return to client
8) Persist Audit + Update Usage Rollups
- write audit_log
- update usage_rollup_daily (transaction or async with idempotency)
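Idempotent rollup updates can be illustrated with an in-memory stand-in for the Postgres tables. In the real gateway the "seen request_id" check would more likely be a unique constraint on audit_log.request_id; `UsageEvent` and `RollupStore` are hypothetical names used only for this sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical usage event emitted once per request (possibly delivered more than once). */
record UsageEvent(String requestId, String tenantId, String date,
                  long tokensIn, long tokensOut, boolean blocked) {}

final class RollupStore {
    private final Set<String> seenRequestIds = new HashSet<>();
    // Key "tenant|date" -> [requests, tokens_in, tokens_out, blocked_requests]
    private final Map<String, long[]> rollups = new HashMap<>();

    /** Applies the event once; a redelivered event with the same request_id is ignored. */
    synchronized boolean apply(UsageEvent e) {
        if (!seenRequestIds.add(e.requestId())) {
            return false; // duplicate delivery, rollup unchanged
        }
        long[] row = rollups.computeIfAbsent(e.tenantId() + "|" + e.date(), k -> new long[4]);
        row[0] += 1;
        row[1] += e.tokensIn();
        row[2] += e.tokensOut();
        if (e.blocked()) {
            row[3] += 1;
        }
        return true;
    }

    /** Returns a copy of the rollup row, or all zeros if none exists yet. */
    synchronized long[] get(String tenantId, String date) {
        long[] row = rollups.get(tenantId + "|" + date);
        return row == null ? new long[4] : row.clone();
    }
}
```

Deduplicating on request_id is what makes an async or retried audit writer safe: replaying the same event cannot double-count usage.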
Failure Modes & Safeguards
Failure Mode: Provider Outage / 5xx
Mitigation
- short bounded retries (max 1–2) for transient errors
- circuit breaker to shed load
- fallback strategy (optional):
  - alternative provider endpoint
  - lower-capability model
  - cached responses if allowed
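The circuit breaker mentioned above can be sketched in a few lines. This is a simplified illustration with assumed names and thresholds: it opens after a number of consecutive failures, rejects calls while open, and lets traffic through again once the cool-down has passed. A production breaker would usually limit the number of probe requests in the half-open state.

```java
final class CircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAtMillis = -1; // -1 means the breaker is closed

    CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** False while the breaker is open and the cool-down has not elapsed. */
    synchronized boolean allowRequest(long nowMillis) {
        if (openedAtMillis < 0) {
            return true; // closed
        }
        return nowMillis - openedAtMillis >= openMillis; // half-open after cool-down
    }

    /** A success closes the breaker and resets the failure count. */
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAtMillis = -1;
    }

    /** A failure at or past the threshold opens (or re-opens) the breaker. */
    synchronized void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAtMillis = nowMillis;
        }
    }
}
```

When `allowRequest` returns false, the gateway can fail fast or move to one of the fallback options instead of waiting on a provider that is known to be failing.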
Failure Mode: Redis Down (if used)
Mitigation
- policy setting:
  - HARD FAIL: block requests (strict)
  - SOFT FAIL: allow but log “quota_unavailable” (risky)
- recommended: degrade throughput if the quota store is unavailable
Failure Mode: Budget Calculation Mismatch
Mitigation
- always store provider-reported tokens in audit when available
- rollups based on actual usage
- price table versioning
Failure Mode: Key Leakage
Mitigation
- keys are hashed at rest
- rotate keys quickly + revoke old key
- per-key rate limits and anomaly alerting (optional)
Security & Compliance
Prompt/Response Storage
Storage modes:
- NONE: do not store prompts/responses, only metadata
- BASIC: store truncated + redacted
- STRICT: store hashes only, plus small safe excerpts
Controls
- never log Authorization header
- redact secret patterns (API keys, JWTs, emails) before persistence
- configurable retention (e.g., audit logs 7–30 days, rollups 90+ days)
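Pattern-based redaction before persistence might look like this. The regular expressions are illustrative assumptions: the `sk-` prefix is one common API key shape, the JWT pattern matches three base64url segments starting with `eyJ`, and the email pattern is deliberately loose. Real deployments would tune these to the secrets they actually expect.

```java
import java.util.regex.Pattern;

final class Redactor {
    // Three dot-separated base64url segments; JWT headers start with "eyJ".
    private static final Pattern JWT =
            Pattern.compile("\\beyJ[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+");
    // Assumed key shape: "sk-" followed by at least 16 key characters.
    private static final Pattern API_KEY =
            Pattern.compile("\\bsk-[A-Za-z0-9_-]{16,}");
    // Loose email matcher; favors over-redaction over leaks.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Replaces each matched secret with a fixed placeholder. */
    static String redact(String text) {
        String out = JWT.matcher(text).replaceAll("[REDACTED_JWT]");
        out = API_KEY.matcher(out).replaceAll("[REDACTED_KEY]");
        out = EMAIL.matcher(out).replaceAll("[REDACTED_EMAIL]");
        return out;
    }

    /** Truncates to maxChars, for the BASIC storage mode's "truncated + redacted" rule. */
    static String truncate(String text, int maxChars) {
        return text.length() <= maxChars ? text : text.substring(0, maxChars);
    }
}
```

Redacting before truncating avoids cutting a secret in half and leaving an unmatched fragment in storage.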
Cost & Scaling Notes
Scaling Strategy
- stateless gateway instances behind load balancer
- Redis for fast counters + caching
- Postgres for durable audit + rollups
- async audit writes are possible if each write is idempotent per request_id
Hot Paths
- auth lookup (cache key_id in Redis)
- quota counters (Redis)
- audit writes (batch insert or async worker if needed)
Cost Controls Best Practices
- per-tenant monthly cap + per-key RPS cap
- max output tokens clamp prevents “runaway responses”
- degrade mode reduces bill shock during traffic spikes
Verification Checklist (Evidence to Publish)
Publish these as evidence artifacts for buyers:
Quota block
- show tenant hitting daily budget
- gateway returns 429/402-like error with reason code
Degrade mode
- show request rewritten to cheaper model + max_tokens reduced
Audit record
- screenshot/log of audit_log entry with tokens, cost estimate, decision, trace_id
Rate limit
- load test script showing 429 after threshold
Cache hit
- demonstrate repeated request served from cache + audit marks cache_hit=true
Run Instructions
- docker compose up -d
- create tenant + API key (bootstrap script)
- call gateway endpoint with Bearer key
- observe:
  - audit_log updates
  - usage_rollup_daily increments
- test enforcement:
  - set a low daily_budget_usd
  - run request loop to trigger block/degrade
Configuration Reference
Provider:
- LLM_BASE_URL
- LLM_API_KEY
Gateway:
- GATEWAY_PORT
- AUTH_HEADER=Authorization
Policy defaults:
- DEFAULT_ALLOWED_MODELS
- DEFAULT_MAX_PROMPT_BYTES
- DEFAULT_MAX_OUTPUT_TOKENS
- DEFAULT_RATE_LIMIT_RPS
- DEFAULT_MONTHLY_BUDGET_USD
- DEFAULT_ENFORCEMENT_MODE (HARD/SOFT/OBSERVE)
Redis:
- REDIS_URL (optional but recommended)
Postgres:
- SPRING_DATASOURCE_URL
- SPRING_DATASOURCE_USERNAME
- SPRING_DATASOURCE_PASSWORD
Changelog Guidance
Record changes that affect:
- enforcement behavior (HARD/SOFT/OBSERVE)
- pricing tables / cost estimation
- schema changes (audit and rollups)
- caching behavior
- security/redaction modes