Self-Healing Agents in Spring Boot: Failure Detection, Recovery Policies, and Durable Execution

Verified v1.0.0 · Red Hat 8/9 / Ubuntu / macOS / Windows (Docker) · Java 17 · Spring Boot 3.x · PostgreSQL · OpenAI-compatible LLM API · Docker Compose

1. Overview

This solution implements a production-ready Self-Healing Agent runtime for Spring Boot applications. It is designed for teams building tool-calling or workflow-driven AI agents that must continue operating even when individual steps fail due to malformed tool arguments, transient API outages, inconsistent intermediate state, or partial execution across multiple dependencies.

The problem it solves is straightforward: in production, an agent rarely fails because the whole task is impossible. It usually fails because one step in the chain breaks and the runtime has no structured way to detect the failure, classify it, decide whether it is recoverable, and continue from a safe checkpoint.

Typical failure patterns include:

  • The model emits tool arguments that are syntactically invalid or semantically incompatible with the downstream tool schema.
  • A third-party API returns a timeout, rate limit, or intermittent 5xx after the agent has already completed earlier steps.
  • A multi-step task partially succeeds, but the agent has no durable state and cannot safely resume without replaying already-completed work.
  • The model calls the wrong tool, or selects the correct tool with incomplete parameters, and the runtime treats the failure as terminal instead of recoverable.
  • A retry repeats a side-effecting operation because the system has no idempotency controls around external execution.

Existing approaches often fail in production because they treat agents as single-pass prompt executions with optimistic tool calling. That can work for demos, but it is not sufficient for real systems where failures are normal, external dependencies are unstable, and recovery behavior must be explicit and observable.

This implementation is production-ready because:

  • It separates planning, execution, failure detection, recovery decisioning, and durable state persistence into explicit runtime stages.
  • It persists task runs, step attempts, failure reasons, recovery actions, and final terminal outcomes for auditability and replay-safe behavior.
  • It supports controlled retries, alternative recovery policies, and idempotent execution guards so the runtime can recover without duplicating side effects.

In practice, this matters most when agents move beyond toy chat tasks and start performing real work: calling business APIs, orchestrating multi-step flows, writing records, generating reports, or coordinating several tools with interdependencies. In those environments, failure is not an exception. It is a normal operating condition.

A Self-Healing Agent runtime changes the execution contract. Instead of assuming the model will “get it right in one shot,” the application assumes that failures will occur and equips the runtime to detect, classify, recover, and continue where safe.

2. Architecture

Request flow and dependencies:

  • Client → Spring Boot REST API
  • Spring Boot REST API → Request validation and task submission
  • Spring Boot service → Planner to produce initial step graph or ordered plan
  • Spring Boot service → Durable task store for run creation
  • Spring Boot service → Executor for the current step
  • Executor → Tool registry for schema lookup and execution dispatch
  • Executor → External tools, HTTP APIs, internal business services, or function adapters
  • Spring Boot service → Failure detector for exception capture, response validation, and output inspection
  • Spring Boot service → Recovery engine for retry, re-plan, step-skip, tool-switch, or fail-closed decisioning
  • Spring Boot service → Durable task store for step attempt, recovery action, and run state persistence
  • Spring Boot service → LLM provider for planning, argument repair, or re-planning when needed
  • Spring Boot REST API → JSON response containing run state, current step status, recovery history, and final outcome when complete

Key components:

  • API Controller: Accepts task execution requests, validates input, and exposes run lifecycle endpoints.
  • Planner Service: Produces an initial executable plan or ordered sequence of steps from the user task.
  • Task Run Manager: Creates and updates durable run records, step states, and terminal outcomes.
  • Executor Service: Executes the current step by dispatching to a registered tool or business operation.
  • Tool Registry: Resolves tool names, schemas, adapters, and execution policies.
  • Failure Detector: Classifies exceptions, malformed outputs, invalid tool arguments, and policy violations.
  • Recovery Engine: Selects a recovery strategy based on error type, retry budget, step type, and side-effect risk.
  • Argument Repair Service: Uses deterministic normalization or LLM-assisted repair to fix tool inputs when allowed.
  • Idempotency Guard: Prevents duplicate execution of side-effecting operations during retries or resumes.
  • Observability Layer: Emits structured logs, metrics, and traces for run state, tool failures, recovery decisions, and latency.

Trust boundaries:

  • Inbound boundary: The REST API validates payload size, user identity, tenant scope, and allowed task types before creating a run.
  • Model boundary: Planner and repair outputs from the LLM are treated as untrusted until validated against task and tool schemas.
  • Tool boundary: External tools and APIs are considered unreliable; all responses and side effects are validated before state transition.
  • Storage boundary: Only application-controlled services update run, attempt, and recovery records.
  • Tenant boundary: Task runs, step payloads, tool calls, and execution history are all tenant-scoped to prevent cross-tenant leakage.

The runtime is deliberately organized so that execution is stateful and inspectable:

  • Planning determines what the agent intends to do.
  • Execution attempts one step at a time.
  • Failure detection determines what went wrong.
  • Recovery selects what happens next.
  • Durable state makes retries and resumes safe.

That separation is critical in production. If an agent fails, operators need to know whether the problem was bad planning, wrong tool selection, malformed arguments, a transient dependency error, a non-retryable business rejection, or a duplicated side effect. Without explicit runtime stages, all of those collapse into “the agent failed,” which is not actionable.
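
To make the separation concrete, the runtime can be expressed as a thin orchestration loop over explicit collaborators. The sketch below is illustrative only: the collaborator and value types (Planner, StepExecutor, FailureDetector, RecoveryEngine, RunStore, TaskRequest, Step, StepResult, FailureClass, RecoveryDecision, RunState, RunOutcome) are assumed application types, not a published API, and their definitions are elided.

// Illustrative orchestration loop: each stage is an explicit collaborator so a
// failure can be attributed to planning, execution, detection, or recovery.
public final class AgentRunLoop {

    private final Planner planner;          // produces the initial plan
    private final StepExecutor executor;    // runs exactly one step per iteration
    private final FailureDetector detector; // classifies what went wrong
    private final RecoveryEngine recovery;  // decides what happens next
    private final RunStore store;           // persists every transition

    public AgentRunLoop(Planner planner, StepExecutor executor, FailureDetector detector,
                        RecoveryEngine recovery, RunStore store) {
        this.planner = planner;
        this.executor = executor;
        this.detector = detector;
        this.recovery = recovery;
        this.store = store;
    }

    public RunOutcome run(TaskRequest request) {
        // Planning determines intent; the plan is persisted before any execution.
        RunState run = store.createRun(request, planner.plan(request));

        while (run.hasPendingSteps()) {
            Step step = run.nextStep();
            StepResult result = executor.execute(run, step);

            if (result.succeeded()) {
                run = store.recordSuccess(run, step, result);   // durable checkpoint
                continue;
            }

            FailureClass failure = detector.classify(result);
            RecoveryDecision decision = recovery.decide(run, step, failure);
            run = store.recordFailure(run, step, result, failure, decision);

            if (decision.failClosed()) {
                return store.completeRun(run, RunOutcome.failed(decision.reason()));
            }
            // Retry, repair, tool-switch, re-plan, and skip all mutate durable state;
            // the next iteration resumes from that checkpoint.
        }
        return store.completeRun(run, RunOutcome.succeeded());
    }
}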

3. Key Design Decisions

Technology stack

Spring Boot 3.x is used because it provides a strong operational foundation for stateful service workflows, REST APIs, security, transaction management, validation, and observability. For agent runtimes that interact with business systems, those conventional service qualities matter as much as model quality.

PostgreSQL is selected as the durable execution store because it provides transactional guarantees for task runs, step attempts, checkpoints, and idempotency keys. A self-healing agent needs durable state before it needs distributed orchestration.

OpenAI-compatible LLM APIs are used for planning, optional argument repair, and re-planning because provider flexibility is valuable. The runtime should not depend on one model vendor or one agent SDK.

Docker Compose is chosen for local execution because it makes the service, database, and supporting components easy to reproduce and test.

Why not start with a full workflow engine or a generic agent framework? Because those tools do not automatically solve the core production problem. The real problem is not that the system lacks a graph abstraction. It is that recovery policy, failure classification, idempotency, and durable state are usually left implicit. A self-healing agent runtime requires those controls to be first-class.

Execution model

The runtime uses stepwise durable execution rather than a single monolithic prompt loop. Each task run consists of:

  • a stable run identifier,
  • a current state,
  • one or more ordered or graph-linked steps,
  • step attempts,
  • recovery actions,
  • and a terminal outcome.

This matters because the runtime must be able to answer:

  • What has already completed?
  • What is currently failing?
  • What can be retried safely?
  • What must never be replayed?
  • What recovery action was selected, and why?

A purely in-memory agent loop cannot answer those questions reliably after a crash, restart, or partial external success.
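
A small sketch of the status vocabulary such a durable run model implies follows; the names and states are illustrative and simply mirror the statuses used elsewhere in this write-up.

// Illustrative lifecycle states; terminal states are explicit so that "what can be
// retried" and "what must never be replayed" remain queryable after a restart.
enum RunStatus { PENDING, RUNNING, PAUSED_FOR_REVIEW, COMPLETED, FAILED, CANCELLED }

enum StepStatus { PENDING, IN_PROGRESS, SUCCEEDED, FAILED_RETRYABLE, FAILED_PERMANENT, SKIPPED }

// A durable snapshot that can answer the questions above without replaying work.
record RunSnapshot(String runId, RunStatus status, String currentStepKey,
                   int attemptsForCurrentStep, String lastRecoveryAction) { }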

Planning versus execution

Planning is separated from execution. The planner decides what the task should do; the executor decides how to carry out each step; the recovery engine decides how to proceed when execution fails.

This is important because the right recovery action depends on where the failure occurred:

  • If the arguments are malformed, repair may be enough.
  • If the tool is wrong, re-plan may be needed.
  • If the dependency timed out, retry may be correct.
  • If the action already succeeded externally, resume must avoid replay.

Collapsing planning, execution, and recovery into one LLM loop makes those distinctions much harder to enforce and audit.

Durable state and checkpoints

Every run persists:

  • the original task request,
  • the generated plan,
  • step definitions,
  • each execution attempt,
  • error classifications,
  • chosen recovery actions,
  • idempotency keys where relevant,
  • and the final outcome.

This design makes the runtime restart-safe and replay-aware. After a process restart, the service can resume incomplete runs from the last known safe checkpoint instead of rerunning the entire task.

A self-healing system without durable checkpoints is not actually self-healing. It is only retrying optimistically.

Synchrony versus asynchrony

The runtime supports both synchronous submission and asynchronous execution, but the core design assumes that complex agent tasks are long-running relative to a typical HTTP request budget. For that reason, the API creates a run and tracks progress independently from client connection lifetime.

We do not force the entire runtime into a synchronous request path because:

  • external tools may be slow,
  • recovery may involve multiple retries or re-planning steps,
  • and long-running tasks need resumable execution anyway.

A short task may still complete inline, but the durable run model is the default.

Recovery policy model

Recovery behavior is explicit and policy-driven. Common policies include:

  • retry the same step,
  • repair arguments and retry,
  • switch to an alternative tool,
  • re-plan from the current state,
  • skip an optional step,
  • or fail closed.

This matters because not all failures are alike. A transient timeout and a validation rejection should not follow the same recovery path. Recovery policy is therefore driven by:

  • failure class,
  • step criticality,
  • tool type,
  • side-effect risk,
  • retry budget,
  • and business rules.
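
A minimal sketch of how such a policy might be expressed in code follows. The enum values mirror the policies listed above; the decision logic and thresholds are illustrative assumptions, not a prescribed rule set.

// Sketch: policy-driven recovery selection. All names and thresholds are illustrative.
enum RecoveryAction { RETRY, REPAIR_AND_RETRY, SWITCH_TOOL, REPLAN, SKIP_STEP, PAUSE_FOR_REVIEW, FAIL_CLOSED }

enum FailureClass { TRANSIENT_DEPENDENCY, MALFORMED_ARGUMENTS, WRONG_TOOL, BUSINESS_REJECTION, UNKNOWN_SIDE_EFFECT }

final class RecoveryPolicy {

    RecoveryAction decide(FailureClass failure, boolean stepIsOptional, boolean sideEffecting,
                          boolean idempotent, int attemptsSoFar, int retryBudget) {
        // Never retry automatically when it is unclear whether the side effect already happened.
        if (failure == FailureClass.UNKNOWN_SIDE_EFFECT) return RecoveryAction.PAUSE_FOR_REVIEW;

        // Business rejections are permanent: skip optional work, otherwise fail closed.
        if (failure == FailureClass.BUSINESS_REJECTION) {
            return stepIsOptional ? RecoveryAction.SKIP_STEP : RecoveryAction.FAIL_CLOSED;
        }

        // Malformed arguments get a bounded repair attempt before re-planning.
        if (failure == FailureClass.MALFORMED_ARGUMENTS) {
            return attemptsSoFar <= 1 ? RecoveryAction.REPAIR_AND_RETRY : RecoveryAction.REPLAN;
        }

        if (failure == FailureClass.WRONG_TOOL) return RecoveryAction.REPLAN;

        // Transient failures: retry within budget, but only if replay is safe.
        if (attemptsSoFar < retryBudget && (!sideEffecting || idempotent)) {
            return RecoveryAction.RETRY;
        }
        return stepIsOptional ? RecoveryAction.SKIP_STEP : RecoveryAction.FAIL_CLOSED;
    }
}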

Error handling and retries

  • Transient (timeouts, 429, 5xx): retry with bounded backoff, subject to per-step retry budget and overall run budget.
  • Permanent (4xx validation, schema mismatch after repair attempts, business rejection): mark the step failed and either re-plan, skip, or terminate based on policy.

Retries are controlled by both policy and idempotency rules. The runtime must never retry a side-effecting step unless it can prove that replay is safe or deduplicated.
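
To make the transient/permanent split concrete, a classifier might map dependency responses and exceptions roughly as follows. The boundaries shown are the usual HTTP conventions, and the enum is a local stand-in for the runtime's failure taxonomy, not a fixed rule.

import java.io.IOException;
import java.net.SocketTimeoutException;

// Sketch: decide whether a failed tool or dependency call is worth retrying.
final class RetryabilityClassifier {

    enum Retryability { TRANSIENT, PERMANENT }

    // Only failure responses should reach this method; 2xx results are handled upstream.
    Retryability classifyHttpStatus(int status) {
        if (status == 429) return Retryability.TRANSIENT;   // rate limited: back off and retry
        if (status >= 500) return Retryability.TRANSIENT;   // dependency fault: likely intermittent
        if (status >= 400) return Retryability.PERMANENT;   // validation, authz, or business rejection
        return Retryability.PERMANENT;                      // anything unexpected: do not loop blindly
    }

    Retryability classifyException(Throwable t) {
        if (t instanceof SocketTimeoutException) return Retryability.TRANSIENT;
        if (t instanceof IOException) return Retryability.TRANSIENT;  // connection reset, DNS, etc.
        return Retryability.PERMANENT;                                // schema, serialization, logic errors
    }
}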

Idempotency and side effects

Side-effecting tools are executed behind an idempotency guard. Each step attempt can carry a deterministic operation key so that retries, resumes, or duplicate submissions do not repeat irreversible work such as sending emails, creating tickets, or writing payment-related records.

This is one of the most important design decisions in the whole system. Many agent runtimes fail not because they stop too early, but because they retry unsafely.
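
One common way to implement such a guard is an insert-first claim against a unique (tenant_id, operation_key) constraint, so that concurrent retries race on the database rather than in application memory. The sketch below assumes the idempotency_records table from Section 4 and uses Spring's JdbcTemplate; the status values and surrounding types are illustrative.

import java.util.function.Supplier;
import org.springframework.dao.DuplicateKeyException;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Component;

// Sketch: claim the operation key before executing the side effect. If the key is
// already present, the effect is treated as already performed and is not repeated.
@Component
class IdempotencyGuard {

    private final JdbcTemplate jdbc;

    IdempotencyGuard(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    <T> T executeOnce(String tenantId, String operationKey, String toolName,
                      Supplier<T> sideEffect, Supplier<T> alreadyExecutedResult) {
        try {
            jdbc.update("""
                INSERT INTO idempotency_records (tenant_id, operation_key, tool_name, status, created_at)
                VALUES (?, ?, ?, 'CLAIMED', now())
                """, tenantId, operationKey, toolName);
        } catch (DuplicateKeyException alreadyClaimed) {
            // A previous attempt owns this key: return the recorded outcome instead
            // of repeating the irreversible work.
            return alreadyExecutedResult.get();
        }
        T result = sideEffect.get();
        jdbc.update("UPDATE idempotency_records SET status = 'COMPLETED' WHERE tenant_id = ? AND operation_key = ?",
                tenantId, operationKey);
        return result;
    }
}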

4. Data Model

Core tables:

  • agent_runs

    • Purpose: Stores the lifecycle of each submitted task run.
    • Key columns: id, tenant_id, run_id, task_type, input_payload, status, current_step_id, created_by, created_at, updated_at, completed_at
  • agent_plans

    • Purpose: Stores the generated plan or step graph for a run.
    • Key columns: id, tenant_id, agent_run_id, plan_version, plan_json, planner_model, created_at
  • agent_steps

    • Purpose: Stores step definitions and execution state for each run.
    • Key columns: id, tenant_id, agent_run_id, step_key, step_type, tool_name, sequence_no, status, input_payload, output_payload, is_optional, created_at, updated_at
  • step_attempts

    • Purpose: Stores each concrete execution attempt for a step.
    • Key columns: id, tenant_id, agent_step_id, attempt_no, status, request_payload, response_payload, error_code, error_class, started_at, ended_at
  • recovery_actions

    • Purpose: Stores recovery decisions taken after a failed attempt.
    • Key columns: id, tenant_id, agent_step_id, step_attempt_id, action_type, action_reason, action_payload, created_at
  • tool_definitions

    • Purpose: Stores registered tool metadata, schemas, and execution policy flags.
    • Key columns: id, tenant_id, tool_name, input_schema_json, is_side_effecting, supports_idempotency, timeout_ms, retry_policy_json, status
  • idempotency_records

    • Purpose: Stores deduplication keys and final effect markers for side-effecting operations.
    • Key columns: id, tenant_id, operation_key, tool_name, agent_step_id, external_reference, status, created_at
  • run_events

    • Purpose: Stores ordered lifecycle events for auditing and replay diagnostics.
    • Key columns: id, tenant_id, agent_run_id, event_type, event_payload, created_at

Indexing strategy:

  • agent_runs(tenant_id, run_id) for run lookup and audit
  • agent_runs(tenant_id, status, updated_at) for active run scanning
  • agent_steps(agent_run_id, sequence_no) for step reconstruction
  • agent_steps(agent_run_id, status) for pending and failed step lookup
  • step_attempts(agent_step_id, attempt_no) for step execution history
  • recovery_actions(agent_step_id, created_at) for recovery reconstruction
  • idempotency_records(tenant_id, operation_key) unique for replay prevention
  • run_events(agent_run_id, created_at) for ordered lifecycle replay
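
As a concrete illustration of the schema, a minimal JPA mapping for agent_runs might look like the following. Column types, the nested status enum, and the use of plain text for the JSON payload are assumptions made for brevity; a real mapping would typically use jsonb with a converter and keep each type in its own file.

import java.time.Instant;
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.EnumType;
import jakarta.persistence.Enumerated;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Table;

// Sketch of the agent_runs mapping; accessors are omitted.
@Entity
@Table(name = "agent_runs")
class AgentRunEntity {

    enum Status { PENDING, RUNNING, PAUSED_FOR_REVIEW, COMPLETED, FAILED, CANCELLED }

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "tenant_id", nullable = false)
    private String tenantId;

    @Column(name = "run_id", nullable = false)
    private String runId;

    @Column(name = "task_type", nullable = false)
    private String taskType;

    @Column(name = "input_payload", columnDefinition = "text")
    private String inputPayload;

    @Enumerated(EnumType.STRING)
    @Column(nullable = false)
    private Status status;

    @Column(name = "current_step_id")
    private Long currentStepId;

    @Column(name = "created_by")
    private String createdBy;

    @Column(name = "created_at", nullable = false)
    private Instant createdAt;

    @Column(name = "updated_at")
    private Instant updatedAt;

    @Column(name = "completed_at")
    private Instant completedAt;

    protected AgentRunEntity() { } // required by JPA
}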

The structure above supports the core operational goals of a self-healing runtime:

  1. Durability — task state survives restarts and dependency failures.
  2. Replay safety — retries and resumes do not blindly repeat side effects.
  3. Diagnosability — every failure and recovery decision is reconstructable.

For step modeling, the runtime should distinguish clearly between:

  • read-only steps,
  • side-effecting steps,
  • compensatable steps,
  • and optional enrichment steps.

Those classifications affect retry policy, recovery policy, and terminal failure behavior. A report-fetching step can often be retried or skipped. A ticket-creation step usually cannot be replayed safely without idempotency protections.
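
Expressed in code, the classification can be a simple enum that retry and recovery policy consult; the names and the retry rule below are illustrative.

// Sketch: step classification drives whether automatic retry is ever allowed.
enum StepClass {
    READ_ONLY,        // safe to retry or re-run freely
    SIDE_EFFECTING,   // retry only behind an idempotency key
    COMPENSATABLE,    // replayable if a compensating action is registered
    OPTIONAL          // may be skipped without failing the run
}

final class StepRetryRule {
    boolean allowsAutomaticRetry(StepClass stepClass, boolean hasIdempotencyKey) {
        return switch (stepClass) {
            case READ_ONLY, OPTIONAL, COMPENSATABLE -> true;
            case SIDE_EFFECTING -> hasIdempotencyKey; // never replay side effects blindly
        };
    }
}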

5. API Surface

  • POST /api/agent/runs – Submit a new agent task and create a durable run (ROLE_USER)
  • GET /api/agent/runs/{id} – Fetch run summary, current status, and final output if complete (ROLE_USER)
  • GET /api/agent/runs/{id}/steps – Return ordered steps and their current states (ROLE_USER)
  • GET /api/agent/runs/{id}/attempts – Return step attempt history for diagnostics (ROLE_ADMIN)
  • GET /api/agent/runs/{id}/events – Return ordered lifecycle and recovery events (ROLE_ADMIN)
  • POST /api/agent/runs/{id}/resume – Resume a paused or recoverable run (ROLE_ADMIN)
  • POST /api/agent/runs/{id}/cancel – Cancel an active run (ROLE_USER or ROLE_ADMIN based on ownership policy)
  • POST /api/admin/tools/register – Register or update a tool schema and execution policy (ROLE_ADMIN)
  • POST /api/admin/tools/{toolName}/disable – Disable a tool from future execution (ROLE_ADMIN)
  • GET /actuator/health – Health endpoint for service readiness (ROLE_ADMIN / ops network)
  • GET /actuator/prometheus – Metrics scraping endpoint (ROLE_ADMIN / ops network)

Example response from POST /api/agent/runs:

{
  "runId": "a4bd0f4a-7f1f-46c4-b7a8-f27dfe430001",
  "status": "RUNNING",
  "taskType": "report_generation",
  "currentStep": {
    "stepKey": "fetch_source_data",
    "status": "IN_PROGRESS",
    "attemptNo": 1
  },
  "recovery": {
    "lastAction": null,
    "recoverable": true
  }
}

The API surface is intentionally centered on durable execution rather than chat-style interactivity. The system is not just returning text. It is managing long-running work with retries, checkpoints, and recovery state.

A few implementation notes matter here:

  • POST /api/agent/runs should support client-supplied idempotency keys when task submission itself must be deduplicated.
  • GET /api/agent/runs/{id} should provide a compact summary suitable for product UIs.
  • Attempt and event endpoints should remain admin-facing because they expose internal reasoning, tool payloads, and diagnostic data.
  • Resume and cancel operations should be guarded carefully because they alter execution state rather than simply reading it.
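
A minimal controller shape for the submission and summary endpoints might look like this. The Idempotency-Key header name, the AgentRunService interface, and the request/response records are assumptions chosen for illustration.

import java.util.Map;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestHeader;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// Sketch of the run submission and lookup endpoints.
@RestController
@RequestMapping("/api/agent/runs")
class AgentRunController {

    record SubmitRunRequest(String taskType, Map<String, Object> input) { }
    record RunSummary(String runId, String status, String taskType) { }

    // Assumed application service behind the controller.
    interface AgentRunService {
        RunSummary submit(SubmitRunRequest request, String idempotencyKey);
        RunSummary getSummary(String runId);
    }

    private final AgentRunService runs;

    AgentRunController(AgentRunService runs) {
        this.runs = runs;
    }

    @PostMapping
    ResponseEntity<RunSummary> submit(@RequestBody SubmitRunRequest request,
                                      @RequestHeader(value = "Idempotency-Key", required = false) String idempotencyKey) {
        // If the same key was already used, the service returns the existing run
        // instead of creating a duplicate.
        RunSummary summary = runs.submit(request, idempotencyKey);
        return ResponseEntity.status(HttpStatus.ACCEPTED).body(summary);
    }

    @GetMapping("/{id}")
    RunSummary get(@PathVariable String id) {
        return runs.getSummary(id);
    }
}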

In production, teams often add two more API shapes:

  • A dead-letter review API for permanently failed runs that need manual intervention.
  • A policy simulation API that evaluates how the runtime would react to different failure classes without executing tools.

Those are useful extensions, but they are not required for the initial solution pattern.

6. Security Model

Authentication

Authentication is handled through Spring Security with JWT bearer tokens or opaque access tokens issued by the upstream identity provider. Every request carries user identity and tenant identity claims.

For service-to-service execution, the same model works with client credentials or internal signed tokens. The key requirement is that every run, step, and tool call remains attributable to a caller identity and tenant boundary.

Authorization (roles)

  • ROLE_USER: Can submit runs, read their run state, and cancel runs they own within tenant scope.
  • ROLE_ADMIN: Can inspect step attempts, view recovery history, register tools, resume recoverable runs, and perform administrative control actions.

Role design should stay operationally simple. The runtime should not become a second identity platform. It should rely on upstream identity and enforce only the roles required for execution, diagnostics, and administration.

Data isolation guarantees

Every persisted record includes tenant_id, and all run, step, attempt, and tool-configuration lookups are tenant-scoped. This prevents one tenant’s tasks, recovery history, or execution data from being exposed to another tenant.

This isolation has to be applied consistently:

  • Run queries must filter by tenant before returning state.
  • Tool registration must be tenant-aware unless tools are explicitly platform-global.
  • Recovery and attempt history must not leak internal payloads across tenant boundaries.
  • Idempotency records must be scoped so that operation keys are unique only within the intended boundary.
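
In Spring Data terms, consistent tenant scoping usually means the tenant identifier is part of every query method rather than an afterthought in the service layer. A minimal sketch, assuming the AgentRunEntity mapping from Section 4:

import java.util.Optional;
import org.springframework.data.jpa.repository.JpaRepository;

// Sketch: a run can never be fetched by runId alone; tenantId is always required.
interface AgentRunRepository extends JpaRepository<AgentRunEntity, Long> {

    Optional<AgentRunEntity> findByTenantIdAndRunId(String tenantId, String runId);

    long countByTenantIdAndStatus(String tenantId, AgentRunEntity.Status status);
}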

Security also intersects with model prompts and tool payloads. Sensitive input data, API secrets, and internal execution metadata should never be exposed to the model unless necessary for the step being planned or repaired.

7. Operational Behavior

Startup behavior

On startup, the application:

  • Validates required environment variables for database and model providers
  • Runs schema migrations
  • Verifies PostgreSQL connectivity
  • Initializes planner and repair model clients
  • Loads active tool definitions and execution policies
  • Registers health indicators for DB and model dependencies
  • Optionally scans for recoverable runs left incomplete after a previous restart

The service only reports ready after persistence and core execution dependencies are initialized.

For agent runtimes, startup behavior matters because process restarts are part of normal operations. A self-healing runtime should be able to identify unfinished work, decide whether it is resumable, and avoid duplicating already-completed side effects.
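
A sketch of the optional resume-on-startup hook follows, using a Spring Boot ApplicationRunner. The property name and the RunResumeService type are assumptions about how the APP_AGENT_RESUME_ON_STARTUP flag from Section 8 would be wired.

import org.springframework.boot.ApplicationArguments;
import org.springframework.boot.ApplicationRunner;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.stereotype.Component;

// Assumed application interface: finds incomplete runs and resumes the safe ones.
interface RunResumeService {
    void resumeRecoverableRuns();
}

// Sketch: after startup, hand runs left incomplete by a previous process to the
// recovery engine. Ambiguous side-effect state stays paused for operator review.
@Component
@ConditionalOnProperty(name = "app.agent.resume-on-startup", havingValue = "true")
class ResumeIncompleteRunsOnStartup implements ApplicationRunner {

    private final RunResumeService resumeService;

    ResumeIncompleteRunsOnStartup(RunResumeService resumeService) {
        this.resumeService = resumeService;
    }

    @Override
    public void run(ApplicationArguments args) {
        resumeService.resumeRecoverableRuns();
    }
}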

Failure modes

  • DB unavailable: the service fails readiness, rejects new runs, and cannot safely resume existing ones.
  • LLM planner unavailable: new runs that require planning fail or queue based on policy; existing runs may still resume if no re-planning is needed.
  • Tool dependency unavailable: the current step fails and is routed through recovery policy, typically retry or pause-for-resume depending on criticality.
  • Schema validation failure: the runtime attempts deterministic repair or LLM-assisted repair if allowed; otherwise the step fails as a permanent input error.
  • Unknown side-effect state: if the runtime cannot determine whether an external action already happened, the step is paused for operator review rather than blindly retried.
  • Process restart mid-run: incomplete runs are loaded from durable state and evaluated for safe resumption.

Those behaviors should be deliberate. The most dangerous failure mode in an agent runtime is not “stopping.” It is “continuing unsafely.”

Retry and timeout behavior

Recommended defaults:

  • Planner call timeout: 5s to 12s
  • Argument repair timeout: 3s to 8s
  • Tool call timeout: policy-based per tool, typically 2s to 30s
  • DB transaction timeout: 1s to 3s for state transitions

Retry policy:

  • Planner: retry once for transient provider failure if the run has not yet started execution
  • Tool step: bounded retries based on tool policy and side-effect classification
  • Argument repair: at most one deterministic repair and one model-assisted repair attempt
  • No automatic retry for business-rule rejections, authorization failures, or duplicate-effect ambiguity

Timeouts should be enforced per step and per dependency, not only at the run level. That keeps slow or degraded dependencies from consuming the entire run budget without triggering recovery.

Circuit breaking should be applied to unstable external tools so the runtime can fail quickly and move into a recovery or pause state.
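
For completeness, a plain-Java sketch of bounded exponential backoff is shown below. The attempt count and delay are illustrative and mirror the defaults above; in practice a library such as Spring Retry or Resilience4j would typically provide this, together with the circuit breaking just mentioned.

import java.time.Duration;
import java.util.concurrent.Callable;

// Sketch: bounded retries with exponential backoff, intended only for failures
// already classified as transient. Side-effecting steps should reach this path
// only behind an idempotency key.
final class BoundedRetry {

    private final int maxAttempts;
    private final Duration initialDelay;

    BoundedRetry(int maxAttempts, Duration initialDelay) {
        this.maxAttempts = maxAttempts;
        this.initialDelay = initialDelay;
    }

    <T> T call(Callable<T> operation) throws Exception {
        Duration delay = initialDelay;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) break;
                Thread.sleep(delay.toMillis());   // back off before the next attempt
                delay = delay.multipliedBy(2);    // exponential growth; jitter omitted for brevity
            }
        }
        throw last;
    }
}

// Example (illustrative values): new BoundedRetry(3, Duration.ofMillis(500)).call(() -> toolClient.invoke(args));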

Observability hooks

Structured logs: run_id, tenant_id, user_id, task_type, step_key, tool_name, attempt_no, error_class, error_code, recovery_action, idempotency_key, latency_ms, outcome

OpenTelemetry traces:

  • agent.run: root span for the end-to-end task lifecycle
  • agent.plan: plan generation span
  • agent.execute_step: current step execution span
  • agent.call_tool: external tool dispatch span
  • agent.detect_failure: failure classification span
  • agent.repair_args: deterministic or model-assisted argument repair span
  • agent.recover_step: retry, re-plan, skip, or pause decision span
  • agent.persist_state: durable state transition span
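
Using the OpenTelemetry API directly, the tool-dispatch span might be emitted along these lines. The span and attribute names follow the list above; the tracer wiring and the Callable-based signature are assumptions.

import java.util.concurrent.Callable;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

// Sketch: wrap external tool dispatch in an agent.call_tool span so latency and
// failures remain attributable to a specific run, step, and tool.
final class ToolCallTracing {

    private final Tracer tracer;

    ToolCallTracing(Tracer tracer) {
        this.tracer = tracer;
    }

    <T> T traceToolCall(String runId, String stepKey, String toolName, Callable<T> call) throws Exception {
        Span span = tracer.spanBuilder("agent.call_tool")
                .setAttribute("run_id", runId)
                .setAttribute("step_key", stepKey)
                .setAttribute("tool_name", toolName)
                .startSpan();
        try (Scope ignored = span.makeCurrent()) {
            return call.call();
        } catch (Exception e) {
            span.setStatus(StatusCode.ERROR, e.getMessage());
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}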

These hooks are essential. When a self-healing agent misbehaves, operators need to answer:

  • Did the planner create a bad step?
  • Did the tool fail transiently or permanently?
  • Did argument repair help or make things worse?
  • Was the recovery action appropriate?
  • Did idempotency protections prevent duplicate effects?
  • Why did the run stop where it did?

Without structured observability, those questions are nearly impossible to answer.

8. Local Execution

Prerequisites

  • Docker Desktop with Compose v2
  • JDK 17
  • Available ports: 8080, 5432

Environment variables

SPRING_DATASOURCE_URL=jdbc:postgresql://localhost:5432/agentdb
SPRING_DATASOURCE_USERNAME=agent
SPRING_DATASOURCE_PASSWORD=agent

LLM_API_BASE_URL=http://host.docker.internal:11434/v1
LLM_API_KEY=dummy
LLM_PLANNER_MODEL=gpt-4.1-mini
LLM_REPAIR_MODEL=gpt-4.1-mini

APP_AGENT_MAX_RUN_RETRIES=3
APP_AGENT_DEFAULT_STEP_TIMEOUT_MS=10000
APP_AGENT_RESUME_ON_STARTUP=true

Docker Compose usage

docker compose up -d --build

Verification steps

  1. Health check:
curl -s http://localhost:8080/actuator/health
  2. Submit a run:
curl -s -X POST http://localhost:8080/api/agent/runs \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "taskType": "report_generation",
    "input": {
      "customerId": "C-1001",
      "period": "2026-04"
    }
  }'
  3. Inspect run status:
curl -s http://localhost:8080/api/agent/runs/<RUN_ID> \
  -H "Authorization: Bearer <TOKEN>"
  4. Inspect recovery history:
curl -s http://localhost:8080/api/agent/runs/<RUN_ID>/events \
  -H "Authorization: Bearer <ADMIN_TOKEN>"
  5. Resume a recoverable run:
curl -s -X POST http://localhost:8080/api/agent/runs/<RUN_ID>/resume \
  -H "Authorization: Bearer <ADMIN_TOKEN>"

A practical local setup should include:

  • at least one read-only tool,
  • at least one side-effecting tool protected by idempotency,
  • a seeded failure mode such as transient timeout or malformed arguments,
  • and a sample run that demonstrates recovery rather than simple success.

A good demo does not only show that the agent can complete a task. It shows that the runtime can survive a failure, choose a recovery path, and finish without duplicating side effects.

9. Evidence Pack

Checklist of included evidence artifacts:

  • [ ] Service startup logs showing schema migration success, tool registry load, and readiness transition
  • [ ] Successful POST /api/agent/runs invocation with run identifier and initial state
  • [ ] Database records after execution: agent_runs row showing terminal status and timestamps
  • [ ] Step execution proof: step_attempts rows showing attempt history and error classification
  • [ ] Recovery proof: recovery_actions rows showing retry, repair, re-plan, or pause decisions
  • [ ] Idempotency proof: idempotency_records row showing deduplicated side-effect execution
  • [ ] Test evidence: integration test output for retry, resume, and duplicate-effect prevention

A strong evidence pack makes this type of solution credible. It shows that the runtime is not only architecturally plausible, but actually durable and replay-safe.

For a public-facing solution post, the most convincing artifacts are usually:

  • a startup log showing ready state,
  • a run submission and final completion example,
  • a failure-and-recovery example for one step,
  • and a database query proving that the side-effecting step was executed once even when retried.

Those proof points are also useful for regression control when the runtime evolves.

10. Known Limitations

  • The solution improves recovery behavior, but it does not make bad tools or weak plans inherently safe; poor tool definitions and ambiguous business rules still limit runtime quality.
  • Recovery policy is only as good as failure classification. A misclassified permanent failure can waste retry budget, and a misclassified transient failure can terminate work too early.
  • Durable execution adds operational complexity compared with a simple prompt loop. For trivial tasks, that overhead may not be justified.

There are additional practical limitations worth stating directly.

First, idempotency is only as strong as the external systems the runtime integrates with. If a dependency cannot provide a durable external reference or deduplication key, the runtime may need to pause for operator review instead of retrying automatically.

Second, re-planning can recover from some classes of failure, but it can also introduce plan drift if not bounded carefully. The runtime should constrain when re-planning is permitted and what state must be preserved.

Third, self-healing does not mean fully autonomous. High-risk steps may still require human approval, especially when external side effects are financially or legally significant.

11. Extension Points

  • Replace PostgreSQL with a workflow engine or event-driven orchestration layer when you need very high concurrency, distributed scheduling, or native timer semantics.
  • Add graph-based planning for more complex dependencies, branching, and conditional execution paths.
  • Add human approval checkpoints for high-risk side-effecting steps.
  • Add policy simulation and chaos testing for failure injection and recovery validation.
  • Production hardening: add circuit breakers, dead-letter queues, compensating actions, run TTL policies, and offline recovery analytics for stricter operational guarantees.

Several extensions are especially natural for real deployments.

Alternative tool routing can choose a backup provider when the primary tool is degraded.

Compensation workflows can undo or neutralize partial side effects when a later critical step fails.

Run priority and fairness controls can prevent one tenant or task class from starving the runtime.

Offline replay analysis can identify repeated failure clusters and turn them into better repair prompts or tool schemas.

Closing Notes

Self-Healing Agents are not about making the model more optimistic. They are about making the runtime more disciplined when optimistic execution fails.

A basic agent implementation assumes the model will select the right tools, produce valid arguments, and complete the task in one pass. A self-healing runtime assumes the opposite: failures are expected, tools are unreliable, external dependencies degrade, and side effects must be guarded.

That difference is small in a demo, but large in production behavior. It is the difference between:

  • an agent that looks impressive until the first real failure,
  • and an agent runtime that can survive partial failure, recover deliberately, and complete work safely.

If your current Spring Boot agent already supports task planning and tool calling, the next meaningful step is not necessarily to add more tools or a larger model. In many cases, the biggest upgrade is to add the missing execution controls:

  • durable run state,
  • explicit failure classification,
  • policy-driven recovery,
  • and idempotent side-effect handling.

That is what turns an agent from a clever prompt loop into a production-grade execution system.

Changelog

2026/04/26 v1.0 created
