Agentic Workflows in Spring Boot: Tool Calling, Idempotency, and Durable Runs
A runnable workflow engine for LLM tool-calling with durable run state, retries, idempotency keys, and human-in-the-loop checkpoints.
1. Overview
This solution implements a runnable “agentic workflow” execution service in Spring Boot that can orchestrate LLM tool-calling while guaranteeing production-grade properties: durable run state, bounded retries with backoff, idempotency keys to prevent duplicate side effects, and human-in-the-loop checkpoints for sensitive actions.
The problem it solves is straightforward: once an LLM is allowed to call tools (HTTP, database writes, internal service calls), naive implementations become operationally unsafe. Typical failure patterns include:
- Duplicate tool execution when clients retry requests or when the LLM repeats calls after partial failures.
- Lost or inconsistent workflow state when the process restarts mid-run.
- Unbounded loops or runaway costs when the model keeps calling tools without constraints.
- Non-replayable incidents because intermediate decisions, tool inputs/outputs, and retry history are not persisted.
- “Human approval” steps implemented as ad-hoc blocking logic that breaks under restarts and timeouts.
Existing approaches often fail in production because they treat tool-calling as a synchronous chat completion loop with in-memory state. When the node restarts, state is lost; when a request times out, clients retry and side effects duplicate; when a tool fails transiently, there is no consistent retry semantics or visibility into what happened.
This implementation is production-ready because it treats workflows as durable runs:
- Every run is persisted with step-level state transitions (queued → running → waiting_for_approval → completed/failed).
- Tool calls are recorded as first-class events with idempotency protection at the tool boundary.
- Retries are deterministic and observable, with attempt counters, next-at timestamps, and structured error capture.
- Human-in-the-loop is modeled as a durable wait state, not a process-level block.
- End-to-end tracing (OpenTelemetry) correlates API requests, run execution, tool calls, and retries.
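The step-level lifecycle above can be sketched as a small state machine. This is an illustrative sketch, not the shipped engine code; the enum and class names are assumptions that mirror the statuses listed.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class RunStateMachine {
    // Statuses from the lifecycle above (queued → running → waiting_for_approval → completed/failed).
    public enum RunState { QUEUED, RUNNING, WAITING_FOR_APPROVAL, COMPLETED, FAILED }

    // Legal transitions; COMPLETED and FAILED are terminal and allow none.
    private static final Map<RunState, Set<RunState>> ALLOWED = Map.of(
            RunState.QUEUED, EnumSet.of(RunState.RUNNING),
            RunState.RUNNING, EnumSet.of(RunState.WAITING_FOR_APPROVAL, RunState.COMPLETED, RunState.FAILED),
            RunState.WAITING_FOR_APPROVAL, EnumSet.of(RunState.RUNNING, RunState.FAILED),
            RunState.COMPLETED, EnumSet.noneOf(RunState.class),
            RunState.FAILED, EnumSet.noneOf(RunState.class));

    public static boolean canTransition(RunState from, RunState to) {
        return ALLOWED.getOrDefault(from, Set.of()).contains(to);
    }
}
```

Persisting only transitions this table permits is what makes each run restart-safe: a reclaimed run can always be resumed from its last durable state.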
2. Architecture
Request flow and dependencies:
- Client → Spring Boot REST API
- REST API → PostgreSQL (durable run + step state, outbox events, idempotency records)
- REST API → LLM Provider (OpenAI-compatible API) for “next action” planning and tool selection
- Workflow Engine (in-process) → Tool Registry (internal interface) → External Systems (HTTP tools, DB tools, mock tools)
- Scheduler (Spring Scheduling) → Run Dispatcher → Workflow Engine (process pending work, retries, timeouts)
- Core Service → OpenTelemetry SDK → OTLP Collector (optional) → Tracing backend (e.g., Jaeger/Tempo/Langfuse-compatible collector)
Key components:
- Run API: creates runs, queries run status, lists history, handles approval.
- Workflow Engine: executes deterministic state machine over persisted steps.
- Planner Adapter: wraps LLM calls and normalizes “tool call” directives and constraints.
- Tool Registry: maps tool names to strongly-typed implementations; enforces idempotency and timeouts.
- Run Dispatcher: picks eligible runs/steps from DB (using leases) and executes them.
- Idempotency Service: stores and checks idempotency keys per tool execution boundary.
- Outbox/Event Log: optional append-only event records for audits and replay.
- Observability: structured logs + OTel traces (request/run/step/tool spans).
Trust boundaries:
- Inbound boundary: client input to REST API (authz + validation).
- LLM boundary: planner output is untrusted and must be constrained (allowed tools, max steps, schema validation).
- Tool boundary: all side effects must be guarded (idempotency + authorization + allowlists).
- Data boundary: run history contains sensitive data; access controlled by roles and tenant scoping.
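Constraining untrusted planner output at the LLM boundary can be sketched as follows; the directive shape, tool names, and guard API below are illustrative assumptions, not the actual wire format.

```java
import java.util.Set;

// Sketch of enforcing the LLM trust boundary: planner output is validated
// against an allowlist and a step budget before any tool executes.
public class PlannerGuard {
    // Hypothetical normalized form of a planner "tool call" directive.
    public record ToolDirective(String toolName, int stepIndex) {}

    private final Set<String> allowedTools;
    private final int maxSteps;

    public PlannerGuard(Set<String> allowedTools, int maxSteps) {
        this.allowedTools = allowedTools;
        this.maxSteps = maxSteps;
    }

    // Reject anything the model proposes outside the allowlist or step budget.
    public void validate(ToolDirective d) {
        if (!allowedTools.contains(d.toolName())) {
            throw new IllegalArgumentException("tool not allowlisted: " + d.toolName());
        }
        if (d.stepIndex() >= maxSteps) {
            throw new IllegalStateException("max steps exceeded: " + d.stepIndex());
        }
    }
}
```

Failing closed here (exception, permanent step failure) rather than retrying is deliberate: a disallowed tool request is a policy violation, not a transient error.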
3. Key Design Decisions
Technology stack
- Spring Boot 3.x / Java 17: stable operational model, mature observability and security integrations, straightforward packaging for Docker Compose.
- PostgreSQL: required for durable run state, idempotency ledger, leases, and replayable history; transactional consistency is central to correctness.
- Spring Scheduling: lightweight dispatcher mechanism for a single-node Compose deployment; avoids introducing a message broker while still enabling asynchronous progress.
- OpenTelemetry: vendor-neutral tracing, consistent correlation across HTTP, DB, LLM, and tool calls.
Data storage model
The system uses a run/step/event model:
- A run is the durable container (workflow instance).
- A step is the unit of work (LLM plan, tool execution, approval wait, completion).
- An event log (optional but included) captures append-only transitions for audit and replay.
This model supports:
- Restart-safe continuation.
- Precise retry semantics at the step boundary.
- Deterministic replay and debugging using persisted tool inputs/outputs.
Synchrony vs asynchrony
Run creation is synchronous (returns the run id immediately).
Execution is asynchronous via the scheduler-driven dispatcher:
- Prevents client timeouts from forcing duplicate work.
- Allows durable waiting (approvals, backoffs) without tying up threads.
Certain endpoints optionally support “wait-for” semantics (polling) but do not drive execution in-process on the request thread by default.
Error handling and retries
Tool failures are categorized:
- Transient (timeouts, 5xx): retry with bounded exponential backoff and jitter.
- Permanent (4xx, validation, policy): mark the step failed without retry.
Retries are step-local, with columns tracking attempt count, last error, and next eligible execution time.
The dispatcher uses leases to avoid double execution if multiple instances are ever introduced.
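The bounded backoff described above can be sketched as follows; the base delay, cap, and jitter range are assumed values, not the shipped configuration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of bounded exponential backoff with jitter for computing the
// next_attempt_at column; constants here are illustrative assumptions.
public class RetryBackoff {
    private static final Duration BASE = Duration.ofSeconds(2);
    private static final Duration CAP = Duration.ofMinutes(5);

    // next_attempt_at = now + min(base * 2^(attempt-1), cap) * jitter, jitter in [0.5, 1.5)
    public static Instant nextAttemptAt(Instant now, int attempt) {
        long exp = BASE.toMillis() << Math.min(attempt - 1, 20); // clamp shift to avoid overflow
        long cappedMillis = Math.min(exp, CAP.toMillis());
        double jitter = 0.5 + ThreadLocalRandom.current().nextDouble(); // [0.5, 1.5)
        return now.plusMillis((long) (cappedMillis * jitter));
    }
}
```

Because the result is persisted rather than held in a timer, a process restart loses nothing: the dispatcher simply picks up steps whose next_attempt_at has passed.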
Idempotency strategy
Idempotency is enforced at two levels:
- Run creation idempotency (client-supplied idempotency key): prevents duplicate runs on client retry.
- Tool execution idempotency (derived key): prevents duplicate side effects for the same run/step/tool input.
The idempotency ledger stores: key, scope (tenant), tool name, request hash, status, response snapshot, and timestamps.
Tool calls must check and persist idempotency results in the same transaction that advances step state (or with a strict ordering guaranteeing at-most-once semantics per key).
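The derived tool-level key can be sketched as a hash over the run/step/tool identity plus a canonicalized request; the exact "|"-separated layout below is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Sketch of deriving the tool-execution idempotency key. The same
// run/step/tool/input always yields the same key, so a retried call
// hits the existing ledger row instead of re-executing the tool.
public class ToolIdempotencyKey {
    public static String derive(String runId, long stepSeq, String toolName, String canonicalRequestJson) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            String material = runId + "|" + stepSeq + "|" + toolName + "|" + canonicalRequestJson;
            return HexFormat.of().formatHex(sha.digest(material.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

Note the request JSON must be canonicalized (stable key ordering, normalized whitespace) before hashing, or semantically identical retries would produce different keys.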
4. Data Model
Core tables and purpose:

workflow_run
- Purpose: top-level workflow instance and lifecycle.
- Key columns: id (uuid), tenant_id, status, created_at, updated_at, input_json, max_steps, deadline_at, idempotency_key.

workflow_step
- Purpose: durable unit of execution inside a run.
- Key columns: id, run_id, seq, type (PLAN|TOOL|APPROVAL|FINAL), status, attempt, next_attempt_at, lease_owner, lease_expires_at, request_json, result_json, error_code, error_detail, started_at, finished_at.

tool_execution
- Purpose: canonical record of tool calls (inputs/outputs) for replay and audit.
- Key columns: id, run_id, step_id, tool_name, idempotency_key, request_hash, request_json, response_json, status, duration_ms, created_at.

idempotency_record
- Purpose: global ledger to dedupe side effects (run creation and tool calls).
- Key columns: idempotency_key (pk), tenant_id, scope (RUN_CREATE|TOOL_CALL), status, request_hash, response_json, created_at, updated_at.

approval_checkpoint
- Purpose: human-in-the-loop durable wait state.
- Key columns: id, run_id, step_id, status (PENDING|APPROVED|REJECTED), requested_by, decided_by, reason, created_at, decided_at.

run_event (append-only)
- Purpose: replayable history and audit trail of transitions.
- Key columns: id, run_id, step_id, event_type, payload_json, created_at.
Indexing strategy:
- workflow_run(tenant_id, created_at desc) for listing runs per tenant.
- workflow_run(idempotency_key, tenant_id) unique constraint to dedupe run creation.
- workflow_step(run_id, seq) unique to preserve deterministic ordering.
- Dispatcher hot path: workflow_step(status, next_attempt_at) partial index for status in (QUEUED, RETRY_PENDING) ordered by time; workflow_step(lease_expires_at) to reclaim stuck leases.
- Idempotency: idempotency_record(tenant_id, scope, idempotency_key) unique / primary key (depending on chosen schema); tool_execution(idempotency_key) unique for tool-level at-most-once.
5. API Surface
- POST /api/runs – Create a new workflow run; returns run id (ROLE_USER). Supports an Idempotency-Key header for run creation dedupe.
- GET /api/runs/{id} – Get current run state summary (ROLE_USER, tenant-scoped)
- GET /api/runs/{id}/history – Get step-by-step history including tool inputs/outputs (ROLE_USER, tenant-scoped; redaction applied)
- POST /api/runs/{id}/approve – Approve a pending checkpoint and resume execution (ROLE_APPROVER)
- POST /api/runs/{id}/reject – Reject a pending checkpoint and fail/abort the run (ROLE_APPROVER)
- POST /internal/dispatcher/tick – Trigger a dispatcher tick (ROLE_ADMIN; used for deterministic testing)
- GET /actuator/health – Health check (public or ROLE_ADMIN depending on deployment)
- GET /admin/runs – List runs across tenants (ROLE_ADMIN)
- GET /admin/runs/{id} – Admin view including raw events and lease state (ROLE_ADMIN)
6. Security Model
Authentication
- Spring Security with stateless authentication (e.g., JWT bearer tokens) for API endpoints.
- For local Compose, a dev profile can enable a fixed test token issuer or basic auth for simplicity while keeping the security model intact.
Authorization (roles)
- ROLE_USER: create runs, view own tenant runs, read history (with redaction).
- ROLE_APPROVER: approve/reject checkpoints for the tenant.
- ROLE_ADMIN: cross-tenant admin endpoints, dispatcher tick, operational views.
Paid access enforcement (if applicable)
Enforced at the API layer via:
- Tenant subscription status stored in DB or resolved via a billing adapter.
- A request filter that blocks run creation and tool execution when the subscription is inactive.
The enforcement point must be before run creation and before tool execution (to prevent side effects).
CSRF considerations
- APIs are stateless and designed for token-based auth; CSRF is disabled for non-browser clients.
- If browser-based sessions are enabled, restrict cookie-based auth to admin UI endpoints and enable CSRF there only.
Data isolation guarantees
- Every run is tagged with tenant_id.
- All queries include tenant predicates except ROLE_ADMIN endpoints.
- History endpoints apply output redaction policies for tool inputs/outputs (secrets, tokens, PII markers) before returning to ROLE_USER.
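The redaction policy mentioned above can be sketched as a simple field-masking pass; the field names, regex, and replacement token are assumptions — real policies are configured per tool schema and would not rely on regex alone.

```java
import java.util.regex.Pattern;

// Illustrative sketch of output redaction applied to history payloads
// before they are returned to ROLE_USER. Field names are assumed.
public class Redactor {
    private static final Pattern SECRET_FIELDS = Pattern.compile(
            "(?i)(\"(?:api_key|token|password|secret)\"\\s*:\\s*\")[^\"]*(\")");

    // Masks the values of known-sensitive JSON string fields in place.
    public static String redact(String json) {
        return SECRET_FIELDS.matcher(json).replaceAll("$1[REDACTED]$2");
    }
}
```

A production policy would typically walk the parsed JSON against per-tool schemas rather than pattern-match raw text, so that nested and renamed fields are covered.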
7. Operational Behavior
Startup behavior
On startup, the service:
- Runs DB migrations (Flyway/Liquibase).
- Initializes tool registry and validates allowed tool schemas.
- Starts the dispatcher scheduler (unless disabled via profile).
- Emits a startup log line including version, active profiles, DB connectivity, and OTel exporter mode.
Failure modes
- DB unavailable: fail fast on startup; health becomes unhealthy; no runs executed.
- LLM provider unavailable: PLAN steps fail transiently and retry until max attempts; run transitions to FAILED when exhausted.
- Tool failure: step transitions to RETRY_PENDING (transient) or FAILED (permanent).
- Process restart during execution: leases expire; dispatcher reclaims and resumes from the last durable state.
Retry and timeout behavior
- Planner (LLM) timeout is bounded; failures are retried with backoff up to planner.maxAttempts.
- Tool calls have per-tool timeouts and max attempts.
- Backoff is stored in next_attempt_at to ensure restart-safe scheduling.
- A run-level deadline_at prevents indefinite execution; once exceeded, remaining steps fail with TIMEOUT.
Observability hooks
- Structured logs: run_id, step_id, tenant_id, tool_name, attempt, lease_owner, idempotency_key.
- OpenTelemetry traces:
  - HTTP span for inbound requests.
  - Run execution span (per step) linked via trace/span attributes.
  - Nested spans for LLM calls and tool calls with result status and latency.
- Metrics (via Micrometer + OTel bridge if desired):
  - Runs created/completed/failed.
  - Step retries, tool error rates, dispatcher lag.
  - Idempotency hits vs misses.
8. Local Execution
Prerequisites
- Docker Desktop (or Docker Engine) with Compose v2
- JDK 17 (for running tests locally; container build uses JDK image)
- Available ports: 8080 (app), 5432 (postgres), 4317 (optional OTLP)
Environment variables
- SPRING_PROFILES_ACTIVE=local
- DB_URL=jdbc:postgresql://localhost:5432/workflows
- DB_USER=workflows
- DB_PASS=workflows
- LLM_BASE_URL=http://mock-llm:8081 (or your provider)
- LLM_API_KEY=... (if using a real provider)
- OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 (optional)
- WORKFLOWS_MAX_STEPS=25
- WORKFLOWS_DEFAULT_DEADLINE_SECONDS=120
Docker Compose usage
docker compose up -d --build
Verification steps
- Health:
curl -s http://localhost:8080/actuator/health
- Create a run (with idempotency):
curl -s -X POST http://localhost:8080/api/runs \
-H "Authorization: Bearer <TOKEN>" \
-H "Idempotency-Key: run-123" \
-H "Content-Type: application/json" \
-d '{
"tenantId": "t-001",
"input": { "goal": "Create a ticket via tool, requires approval" }
}'
- Poll status:
curl -s http://localhost:8080/api/runs/<RUN_ID> \
-H "Authorization: Bearer <TOKEN>"
- If run is waiting for approval:
curl -s -X POST http://localhost:8080/api/runs/<RUN_ID>/approve \
-H "Authorization: Bearer <APPROVER_TOKEN>" \
-H "Content-Type: application/json" \
-d '{ "reason": "Approved for demo" }'
- Confirm history is replayable:
curl -s http://localhost:8080/api/runs/<RUN_ID>/history \
-H "Authorization: Bearer <TOKEN>"
- Idempotency validation (repeat create with same key; should return same run id or a conflict-safe response):
curl -i -X POST http://localhost:8080/api/runs \
-H "Authorization: Bearer <TOKEN>" \
-H "Idempotency-Key: run-123" \
-H "Content-Type: application/json" \
-d '{ "tenantId": "t-001", "input": { "goal": "Create a ticket via tool, requires approval" } }'
9. Evidence Pack
Checklist of included evidence artifacts proving execution and correctness:
- Service startup logs showing DB migration completion and dispatcher start
- Successful POST /api/runs invocation logs including the returned run id and Idempotency-Key handling
- Database records after run creation:
  - workflow_run row
  - initial workflow_step rows
- Replayable run history output from GET /api/runs/{id}/history demonstrating persisted step transitions
- Idempotency proof:
  - repeated POST /api/runs with the same idempotency key returning the same run reference
  - idempotency_record row showing the stored response snapshot
- Retry behavior demonstration:
  - forced transient tool failure logs showing attempts incrementing and next_attempt_at scheduling
  - run eventually completes after retry or fails after max attempts
- Human-in-the-loop checkpoint proof:
  - run enters WAITING_FOR_APPROVAL
  - approval action transitions the run back to a runnable state and continues execution
  - approval_checkpoint row showing approver and timestamp
- OpenTelemetry trace export proof:
  - trace showing correlated spans: inbound request → run execution → planner call → tool call → retry span (if triggered)
- Test evidence:
  - idempotency unit/integration test output
  - deterministic replay test verifying identical step history from persisted events
10. Known Limitations
- Single-node dispatcher design by default; horizontal scaling requires distributed locking/leases across multiple instances and careful tuning of lease durations and concurrency.
- Tool result redaction is policy-based and requires explicit configuration per tool schema; it does not automatically detect all sensitive data.
- This solution does not provide a full UI for approvals; it exposes API endpoints and an optional minimal admin view only.
- Exactly-once semantics are scoped to idempotency keys; if external tools are non-idempotent and keys are not enforced at the boundary, side effects can still duplicate.
- No built-in long-term artifact storage for large tool outputs; payloads are stored as JSON and should be capped or externalized for large binaries.
11. Extension Points
- Replace the scheduler-based dispatcher with a queue-driven model (Kafka/RabbitMQ) for higher throughput, while preserving the same run/step persistence and idempotency ledger.
- Add a “tool gateway” service: isolate high-risk tools behind an internal API with separate authorization and auditing.
- Introduce multi-tenant scaling: per-tenant concurrency limits, per-tenant rate limiting for LLM calls, and quota enforcement integrated with subscription status.
- Add stronger determinism and replay: treat planner outputs as immutable events and support “replay with fixed plan” without re-calling the LLM.
- Production hardening changes:
  - run/step partitioning and retention policies
  - outbox forwarding to an external audit system
  - dedicated tracing backend and sampling strategy
  - secret management (Vault/KMS) and per-tool credential isolation
Version 1.1.0
- Solution write-up + runnable implementation
- Evidence images (when published)
- Code bundle downloads (when enabled)