Agentic Workflows in Spring Boot: Tool Calling, Idempotency, and Durable Runs
A runnable workflow engine for LLM tool-calling with durable run state, retries, idempotency keys, and human-in-the-loop checkpoints.
1. Overview
This solution implements a runnable “agentic workflow” execution service in Spring Boot that can orchestrate LLM tool-calling while guaranteeing production-grade properties: durable run state, bounded retries with backoff, idempotency keys to prevent duplicate side effects, and human-in-the-loop checkpoints for sensitive actions.
The problem it solves is straightforward: once an LLM is allowed to call tools (HTTP, database writes, internal service calls), naive implementations become operationally unsafe. Typical failure patterns include:
- Duplicate tool execution when clients retry requests or when the LLM repeats calls after partial failures.
- Lost or inconsistent workflow state when the process restarts mid-run.
- Unbounded loops or runaway costs when the model keeps calling tools without constraints.
- Non-replayable incidents because intermediate decisions, tool inputs/outputs, and retry history are not persisted.
- “Human approval” steps implemented as ad-hoc blocking logic that breaks under restarts and timeouts.
Existing approaches often fail in production because they treat tool-calling as a synchronous chat completion loop with in-memory state. When the node restarts, state is lost; when a request times out, clients retry and side effects duplicate; when a tool fails transiently, there is no consistent retry semantics or visibility into what happened.
This implementation is production-ready because it treats workflows as durable runs:
- Every run is persisted with step-level state transitions (queued → running → waiting_for_approval → completed/failed).
- Tool calls are recorded as first-class events with idempotency protection at the tool boundary.
- Retries are deterministic and observable, with attempt counters, next-at timestamps, and structured error capture.
- Human-in-the-loop is modeled as a durable wait state, not a process-level block.
- End-to-end tracing (OpenTelemetry) correlates API requests, run execution, tool calls, and retries.
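The step-level lifecycle above can be sketched as a small state machine. This is an illustrative sketch, not the shipped engine code; the enum and class names are assumptions that mirror the statuses listed.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class RunStateMachine {
    // Statuses from the lifecycle above (queued → running → waiting_for_approval → completed/failed).
    public enum RunState { QUEUED, RUNNING, WAITING_FOR_APPROVAL, COMPLETED, FAILED }

    // Legal transitions; COMPLETED and FAILED are terminal and allow none.
    private static final Map<RunState, Set<RunState>> ALLOWED = Map.of(
            RunState.QUEUED, EnumSet.of(RunState.RUNNING),
            RunState.RUNNING, EnumSet.of(RunState.WAITING_FOR_APPROVAL, RunState.COMPLETED, RunState.FAILED),
            RunState.WAITING_FOR_APPROVAL, EnumSet.of(RunState.RUNNING, RunState.FAILED),
            RunState.COMPLETED, EnumSet.noneOf(RunState.class),
            RunState.FAILED, EnumSet.noneOf(RunState.class));

    public static boolean canTransition(RunState from, RunState to) {
        return ALLOWED.getOrDefault(from, Set.of()).contains(to);
    }
}
```

Persisting only transitions this table permits is what makes each run restart-safe: a reclaimed run can always be resumed from its last durable state.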
2. Architecture
Request flow and dependencies:
- Client → Spring Boot REST API
- REST API → PostgreSQL (durable run + step state, outbox events, idempotency records)
- REST API → LLM Provider (OpenAI-compatible API) for “next action” planning and tool selection
- Workflow Engine (in-process) → Tool Registry (internal interface) → External Systems (HTTP tools, DB tools, mock tools)
- Scheduler (Spring Scheduling) → Run Dispatcher → Workflow Engine (process pending work, retries, timeouts)
- Core Service → OpenTelemetry SDK → OTLP Collector (optional) → Tracing backend (e.g., Jaeger/Tempo/Langfuse-compatible collector)
Key components:
- Run API: creates runs, queries run status, lists history, handles approval.
- Workflow Engine: executes deterministic state machine over persisted steps.
- Planner Adapter: wraps LLM calls and normalizes “tool call” directives and constraints.
- Tool Registry: maps tool names to strongly-typed implementations; enforces idempotency and timeouts.
- Run Dispatcher: picks eligible runs/steps from DB (using leases) and executes them.
- Idempotency Service: stores and checks idempotency keys per tool execution boundary.
- Outbox/Event Log: optional append-only event records for audits and replay.
- Observability: structured logs + OTel traces (request/run/step/tool spans).
Trust boundaries:
- Inbound boundary: client input to REST API (authz + validation).
- LLM boundary: planner output is untrusted and must be constrained (allowed tools, max steps, schema validation).
- Tool boundary: all side effects must be guarded (idempotency + authorization + allowlists).
- Data boundary: run history contains sensitive data; access controlled by roles and tenant scoping.
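Constraining untrusted planner output at the LLM boundary can be sketched as follows; the directive shape, tool names, and guard API below are illustrative assumptions, not the actual wire format.

```java
import java.util.Set;

// Sketch of enforcing the LLM trust boundary: planner output is validated
// against an allowlist and a step budget before any tool executes.
public class PlannerGuard {
    // Hypothetical normalized form of a planner "tool call" directive.
    public record ToolDirective(String toolName, int stepIndex) {}

    private final Set<String> allowedTools;
    private final int maxSteps;

    public PlannerGuard(Set<String> allowedTools, int maxSteps) {
        this.allowedTools = allowedTools;
        this.maxSteps = maxSteps;
    }

    // Reject anything the model proposes outside the allowlist or step budget.
    public void validate(ToolDirective d) {
        if (!allowedTools.contains(d.toolName())) {
            throw new IllegalArgumentException("tool not allowlisted: " + d.toolName());
        }
        if (d.stepIndex() >= maxSteps) {
            throw new IllegalStateException("max steps exceeded: " + d.stepIndex());
        }
    }
}
```

Failing closed here (exception, permanent step failure) rather than retrying is deliberate: a disallowed tool request is a policy violation, not a transient error.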
3. Key Design Decisions
Technology stack
- Spring Boot 3.x / Java 17: stable operational model, mature observability and security integrations, straightforward packaging for Docker Compose.
- PostgreSQL: required for durable run state, idempotency ledger, leases, and replayable history; transactional consistency is central to correctness.
- Spring Scheduling: lightweight dispatcher mechanism for a single-node Compose deployment; avoids introducing a message broker while still enabling asynchronous progress.
- OpenTelemetry: vendor-neutral tracing, consistent correlation across HTTP, DB, LLM, and tool calls.
Data storage model
The system uses a run/step/event model:
- A run is the durable container (workflow instance).
- A step is the unit of work (LLM plan, tool execution, approval wait, completion).
- An event log (optional but included) captures append-only transitions for audit and replay.
This model supports:
- Restart-safe continuation.
- Precise retry semantics at the step boundary.
- Deterministic replay and debugging using persisted tool inputs/outputs.
Synchrony vs asynchrony
Run creation is synchronous (returns the run id immediately).
Execution is asynchronous via the scheduler-driven dispatcher:
- Prevents client timeouts from forcing duplicate work.
- Allows durable waiting (approvals, backoffs) without tying up threads.
Certain endpoints optionally support “wait-for” semantics (polling) but do not drive execution in-process on the request thread by default.
Error handling and retries
Tool failures are categorized:
- Transient (timeouts, 5xx): retry with bounded exponential backoff and jitter.
- Permanent (4xx, validation, policy): mark the step failed without retry.
Retries are step-local, with columns tracking attempt count, last error, and next eligible execution time.
The dispatcher uses leases to avoid double execution if multiple instances are ever introduced.
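The bounded backoff described above can be sketched as follows; the base delay, cap, and jitter range are assumed values, not the shipped configuration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of bounded exponential backoff with jitter for computing the
// next_attempt_at column; constants here are illustrative assumptions.
public class RetryBackoff {
    private static final Duration BASE = Duration.ofSeconds(2);
    private static final Duration CAP = Duration.ofMinutes(5);

    // next_attempt_at = now + min(base * 2^(attempt-1), cap) * jitter, jitter in [0.5, 1.5)
    public static Instant nextAttemptAt(Instant now, int attempt) {
        long exp = BASE.toMillis() << Math.min(attempt - 1, 20); // clamp shift to avoid overflow
        long cappedMillis = Math.min(exp, CAP.toMillis());
        double jitter = 0.5 + ThreadLocalRandom.current().nextDouble(); // [0.5, 1.5)
        return now.plusMillis((long) (cappedMillis * jitter));
    }
}
```

Because the result is persisted rather than held in a timer, a process restart loses nothing: the dispatcher simply picks up steps whose next_attempt_at has passed.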
Idempotency strategy
Idempotency is enforced at two levels:
- Run creation idempotency (client-supplied idempotency key): prevents duplicate runs on client retry.
- Tool execution idempotency (derived key): prevents duplicate side effects for the same run/step/tool input.
The idempotency ledger stores: key, scope (tenant), tool name, request hash, status, response snapshot, and timestamps.
Tool calls must check and persist idempotency results in the same transaction that advances step state (or with a strict ordering guaranteeing at-most-once semantics per key).
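The derived tool-level key can be sketched as a hash over the run/step/tool identity plus a canonicalized request; the exact "|"-separated layout below is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Sketch of deriving the tool-execution idempotency key. The same
// run/step/tool/input always yields the same key, so a retried call
// hits the existing ledger row instead of re-executing the tool.
public class ToolIdempotencyKey {
    public static String derive(String runId, long stepSeq, String toolName, String canonicalRequestJson) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            String material = runId + "|" + stepSeq + "|" + toolName + "|" + canonicalRequestJson;
            return HexFormat.of().formatHex(sha.digest(material.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

Note the request JSON must be canonicalized (stable key ordering, normalized whitespace) before hashing, or semantically identical retries would produce different keys.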
4. Data Model
Core tables and purpose:

workflow_run
- Purpose: top-level workflow instance and lifecycle.
- Key columns: id (uuid), tenant_id, status, created_at, updated_at, input_json, max_steps, deadline_at, idempotency_key.

workflow_step
- Purpose: durable unit of execution inside a run.
- Key columns: id, run_id, seq, type (PLAN|TOOL|APPROVAL|FINAL), status, attempt, next_attempt_at, lease_owner, lease_expires_at, request_json, result_json, error_code, error_detail, started_at, finished_at.

tool_execution
- Purpose: canonical record of tool calls (inputs/outputs) for replay and audit.
- Key columns: id, run_id, step_id, tool_name, idempotency_key, request_hash, request_json, response_json, status, duration_ms, created_at.

idempotency_record
- Purpose: global ledger to dedupe side effects (run creation and tool calls).
- Key columns: idempotency_key (pk), tenant_id, scope (RUN_CREATE|TOOL_CALL), status, request_hash, response_json, created_at, updated_at.

approval_checkpoint
- Purpose: human-in-the-loop durable wait state.
- Key columns: id, run_id, step_id, status (PENDING|APPROVED|REJECTED), requested_by, decided_by, reason, created_at, decided_at.

run_event (append-only)
- Purpose: replayable history and audit trail of transitions.
- Key columns: id, run_id, step_id, event_type, payload_json, created_at.
Indexing strategy:
- workflow_run(tenant_id, created_at desc) for listing runs per tenant.
- workflow_run(idempotency_key, tenant_id) unique constraint to dedupe run creation.
- workflow_step(run_id, seq) unique to preserve deterministic ordering.
- Dispatcher hot path: workflow_step(status, next_attempt_at) partial index for status in (QUEUED, RETRY_PENDING) ordered by time; workflow_step(lease_expires_at) to reclaim stuck leases.
- Idempotency: idempotency_record(tenant_id, scope, idempotency_key) unique / primary key (depending on chosen schema); tool_execution(idempotency_key) unique for tool-level at-most-once.
5. API Surface
- POST /api/runs – Create a new workflow run; returns run id (ROLE_USER). Supports an Idempotency-Key header for run creation dedupe.
- GET /api/runs/{id} – Get current run state summary (ROLE_USER, tenant-scoped)
- GET /api/runs/{id}/history – Get step-by-step history including tool inputs/outputs (ROLE_USER, tenant-scoped; redaction applied)
- POST /api/runs/{id}/approve – Approve a pending checkpoint and resume execution (ROLE_APPROVER)
- POST /api/runs/{id}/reject – Reject a pending checkpoint and fail/abort the run (ROLE_APPROVER)
- POST /internal/dispatcher/tick – Trigger a dispatcher tick (ROLE_ADMIN; used for deterministic testing)
- GET /actuator/health – Health check (public or ROLE_ADMIN depending on deployment)
- GET /admin/runs – List runs across tenants (ROLE_ADMIN)
- GET /admin/runs/{id} – Admin view including raw events and lease state (ROLE_ADMIN)
6. Security Model
Authentication
- Spring Security with stateless authentication (e.g., JWT bearer tokens) for API endpoints.
- For local Compose, a dev profile can enable a fixed test token issuer or basic auth for simplicity while keeping the security model intact.
Authorization (roles)
- ROLE_USER: create runs, view own tenant runs, read history (with redaction).
- ROLE_APPROVER: approve/reject checkpoints for the tenant.
- ROLE_ADMIN: cross-tenant admin endpoints, dispatcher tick, operational views.
Paid access enforcement (if applicable)
Enforced at the API layer via:
- Tenant subscription status stored in DB or resolved via a billing adapter.
- A request filter that blocks run creation and tool execution when the subscription is inactive.
The enforcement point must be before run creation and before tool execution (to prevent side effects).
CSRF considerations
- APIs are stateless and designed for token-based auth; CSRF is disabled for non-browser clients.
- If browser-based sessions are enabled, restrict cookie-based auth to admin UI endpoints and enable CSRF there only.
Data isolation guarantees
- Every run is tagged with tenant_id.
- All queries include tenant predicates except ROLE_ADMIN endpoints.
- History endpoints apply output redaction policies for tool inputs/outputs (secrets, tokens, PII markers) before returning to ROLE_USER.
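The redaction policy mentioned above can be sketched as a simple field-masking pass; the field names, regex, and replacement token are assumptions — real policies are configured per tool schema and would not rely on regex alone.

```java
import java.util.regex.Pattern;

// Illustrative sketch of output redaction applied to history payloads
// before they are returned to ROLE_USER. Field names are assumed.
public class Redactor {
    private static final Pattern SECRET_FIELDS = Pattern.compile(
            "(?i)(\"(?:api_key|token|password|secret)\"\\s*:\\s*\")[^\"]*(\")");

    // Masks the values of known-sensitive JSON string fields in place.
    public static String redact(String json) {
        return SECRET_FIELDS.matcher(json).replaceAll("$1[REDACTED]$2");
    }
}
```

A production policy would typically walk the parsed JSON against per-tool schemas rather than pattern-match raw text, so that nested and renamed fields are covered.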
7. Operational Behavior
Startup behavior
On startup, the service:
- Runs DB migrations (Flyway/Liquibase).
- Initializes tool registry and validates allowed tool schemas.
- Starts the dispatcher scheduler (unless disabled via profile).
- Emits a startup log line including version, active profiles, DB connectivity, and OTel exporter mode.
Failure modes
- DB unavailable: fail fast on startup; health becomes unhealthy; no runs executed.
- LLM provider unavailable: PLAN steps fail transiently and retry until max attempts; run transitions to FAILED when exhausted.
- Tool failure: step transitions to RETRY_PENDING (transient) or FAILED (permanent).
- Process restart during execution: leases expire; dispatcher reclaims and resumes from the last durable state.
Retry and timeout behavior
- Planner (LLM) timeout is bounded; failures are retried with backoff up to planner.maxAttempts.
- Tool calls have per-tool timeouts and max attempts.
- Backoff is stored in next_attempt_at to ensure restart-safe scheduling.
- A run-level deadline_at prevents indefinite execution; once exceeded, remaining steps fail with TIMEOUT.
Observability hooks
- Structured logs: run_id, step_id, tenant_id, tool_name, attempt, lease_owner, idempotency_key.
- OpenTelemetry traces:
  - HTTP span for inbound requests.
  - Run execution span (per step) linked via trace/span attributes.
  - Nested spans for LLM calls and tool calls with result status and latency.
- Metrics (via Micrometer + OTel bridge if desired):
  - Runs created/completed/failed.
  - Step retries, tool error rates, dispatcher lag.
  - Idempotency hits vs misses.
8. Local Execution
Prerequisites
- Docker Desktop (or Docker Engine) with Compose v2
- JDK 17 (for running tests locally; container build uses JDK image)
- Available ports: 8080 (app), 5432 (postgres), 4317 (optional OTLP)
Environment variables
- SPRING_PROFILES_ACTIVE=local
- DB_URL=jdbc:postgresql://localhost:5432/workflows
- DB_USER=workflows
- DB_PASS=workflows
- LLM_BASE_URL=http://mock-llm:8081 (or your provider)
- LLM_API_KEY=... (if using a real provider)
- OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 (optional)
- WORKFLOWS_MAX_STEPS=25
- WORKFLOWS_DEFAULT_DEADLINE_SECONDS=120
Docker Compose usage
docker compose up -d --build
Verification steps
- Health:
curl -s http://localhost:8080/actuator/health
- Create a run (with idempotency):
curl -s -X POST http://localhost:8080/api/runs \
-H "Authorization: Bearer <TOKEN>" \
-H "Idempotency-Key: run-123" \
-H "Content-Type: application/json" \
-d '{
"tenantId": "t-001",
"input": { "goal": "Create a ticket via tool, requires approval" }
}'
- Poll status:
curl -s http://localhost:8080/api/runs/<RUN_ID> \
-H "Authorization: Bearer <TOKEN>"
- If run is waiting for approval:
curl -s -X POST http://localhost:8080/api/runs/<RUN_ID>/approve \
-H "Authorization: Bearer <APPROVER_TOKEN>" \
-H "Content-Type: application/json" \
-d '{ "reason": "Approved for demo" }'
- Confirm history is replayable:
curl -s http://localhost:8080/api/runs/<RUN_ID>/history \
-H "Authorization: Bearer <TOKEN>"
- Idempotency validation (repeat create with same key; should return same run id or a conflict-safe response):
curl -i -X POST http://localhost:8080/api/runs \
-H "Authorization: Bearer <TOKEN>" \
-H "Idempotency-Key: run-123" \
-H "Content-Type: application/json" \
-d '{ "tenantId": "t-001", "input": { "goal": "Create a ticket via tool, requires approval" } }'
9. Evidence Pack
Checklist of included evidence artifacts proving execution and correctness:
- Service startup logs showing DB migration completion and dispatcher start
- Successful POST /api/runs invocation logs including the returned run id and Idempotency-Key handling
- Database records after run creation:
  - workflow_run row
  - initial workflow_step rows
- Replayable run history output from GET /api/runs/{id}/history demonstrating persisted step transitions
- Idempotency proof:
  - repeated POST /api/runs with the same idempotency key returning the same run reference
  - idempotency_record row showing the stored response snapshot
- Retry behavior demonstration:
  - forced transient tool failure logs showing attempts incrementing and next_attempt_at scheduling
  - run eventually completes after retry or fails after max attempts
- Human-in-the-loop checkpoint proof:
  - run enters WAITING_FOR_APPROVAL
  - approval action transitions the run back to a runnable state and continues execution
  - approval_checkpoint row showing approver and timestamp
- OpenTelemetry trace export proof:
  - trace showing correlated spans: inbound request → run execution → planner call → tool call → retry span (if triggered)
- Test evidence:
  - idempotency unit/integration test output
  - deterministic replay test verifying identical step history from persisted events
10. Known Limitations
- Single-node dispatcher design by default; horizontal scaling requires distributed locking/leases across multiple instances and careful tuning of lease durations and concurrency.
- Tool result redaction is policy-based and requires explicit configuration per tool schema; it does not automatically detect all sensitive data.
- This solution does not provide a full UI for approvals; it exposes API endpoints and an optional minimal admin view only.
- Exactly-once semantics are scoped to idempotency keys; if external tools are non-idempotent and keys are not enforced at the boundary, side effects can still duplicate.
- No built-in long-term artifact storage for large tool outputs; payloads are stored as JSON and should be capped or externalized for large binaries.
11. Extension Points
- Replace the scheduler-based dispatcher with a queue-driven model (Kafka/RabbitMQ) for higher throughput, while preserving the same run/step persistence and idempotency ledger.
- Add a “tool gateway” service: isolate high-risk tools behind an internal API with separate authorization and auditing.
- Introduce multi-tenant scaling: per-tenant concurrency limits, per-tenant rate limiting for LLM calls, and quota enforcement integrated with subscription status.
- Add stronger determinism and replay: treat planner outputs as immutable events and support “replay with fixed plan” without re-calling the LLM.
- Production hardening changes:
  - run/step partitioning and retention policies
  - outbox forwarding to an external audit system
  - dedicated tracing backend and sampling strategy
  - secret management (Vault/KMS) and per-tool credential isolation
Version 1.1.0
- Solution write-up + runnable implementation
- Evidence images (when published)
- Code bundle downloads (when enabled)