Retry Policy & Failure Recovery in OpenClaw: What Should Retry, What Should Fail Fast, and How to Recover Without Duplicates

Every automation looks smart on a good day. The real test is what it does on a bad one.

That is where retry policy earns its keep. Not as a magic “try again” button, but as the rulebook for when OpenClaw should wait, retry, switch lanes, or stop before it makes a mess.

The official OpenClaw retry docs keep the core principle refreshingly simple: retry per HTTP request, not per multi-step flow. That one line saves people from a classic mistake: rerunning an entire workflow just because one request near the end got unlucky.

What retry policy is really for

A retry policy exists to absorb temporary failures without pretending all failures are temporary.

Think of it like suspension in a car. It is there to smooth out potholes, not to make driving into a wall a valid strategy. Timeouts, rate limits, and short transport hiccups are potholes. Bad input, broken permissions, or a malformed request are the wall.

That distinction matters because retries are not free. Every retry adds delay, load, and the risk of duplicate side effects if you retry the wrong thing in the wrong place.

What should retry, and what should fail fast

OpenClaw's current docs point to a clear pattern.

Good retry candidates

HTTP 429 rate limits where the provider tells you to slow down
Timeouts and short-lived network failures
HTTP 5xx responses from unstable upstream services
Transient transport issues like connection resets or fetch failures

For model providers, the docs say OpenClaw lets provider SDKs handle normal short retries. For Discord and Telegram, the retry layer explicitly covers transient failures and uses provider-specific retry_after values when available.

Fail-fast candidates

Malformed requests that will keep failing unchanged
Permission or auth mistakes that need operator action
Parse or formatting errors that are deterministic
Non-idempotent side effects that would be dangerous to duplicate blindly
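
To make the two lists concrete, here is a minimal sketch of a classifier that sorts a failure into "retry" or "fail fast". The function name and the error-kind labels are hypothetical and are not part of OpenClaw's API; only the categories come from the lists above. It does not judge idempotency, which still needs a human decision per step.

```python
from typing import Optional

# Illustrative labels only; a real client would map its own exceptions here.
RETRYABLE_ERROR_KINDS = {"timeout", "connection_reset", "fetch_failed"}
FAIL_FAST_ERROR_KINDS = {"parse_error", "malformed_request", "permission_denied"}


def should_retry(status: Optional[int] = None, error_kind: Optional[str] = None) -> bool:
    """Return True only for failures that look temporary."""
    if error_kind in FAIL_FAST_ERROR_KINDS:
        return False  # deterministic: the same payload will fail again
    if error_kind in RETRYABLE_ERROR_KINDS:
        return True   # transport hiccup: worth another attempt
    if status is None:
        return False  # unknown failure: fail fast rather than guess
    if status == 429:
        return True   # rate limit: the provider asked us to slow down
    if 500 <= status <= 599:
        return True   # unstable upstream
    return False      # other 4xx: the request itself needs fixing
```

The default is deliberately pessimistic: anything the classifier does not recognize fails fast, so an unfamiliar error never turns into a silent retry loop.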

The Telegram docs call out one good example: Markdown parse errors are not retried. They fall back to plain text instead. That is a healthy pattern. Do not hammer the same bad payload and hope the universe gets more forgiving.
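
The fallback pattern can be sketched in a few lines. This is an illustration under assumptions, not OpenClaw's implementation: the `send` callable, its `(text, parse_mode)` signature, and the `MarkdownParseError` class are all hypothetical stand-ins.

```python
from typing import Callable, List, Optional


class MarkdownParseError(Exception):
    """Stands in for a provider rejecting malformed Markdown."""


def send_with_plain_fallback(
    send: Callable[[str, Optional[str]], str], text: str
) -> str:
    """Try Markdown once; on a parse error, resend the text as plain text."""
    try:
        return send(text, "Markdown")
    except MarkdownParseError:
        # Deterministic failure: retrying the same payload cannot help,
        # so change the payload instead of repeating it.
        return send(text, None)


# Usage with a fake sender that rejects unbalanced asterisks in Markdown mode.
modes_tried: List[Optional[str]] = []


def fake_send(text: str, parse_mode: Optional[str]) -> str:
    modes_tried.append(parse_mode)
    if parse_mode == "Markdown" and text.count("*") % 2 == 1:
        raise MarkdownParseError("unbalanced *")
    return "sent as " + (parse_mode or "plain text")


result = send_with_plain_fallback(fake_send, "price: 5 * 3")
```

The point of the design is that the second attempt is a different request, not a repeat of the first.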

Where retries belong

This is where many automations quietly go off the rails.

The official docs say retries apply per request, such as a message send, media upload, reaction, poll, or sticker. Composite flows do not retry completed steps. In plain English, OpenClaw tries to recover the request that failed, not rewind the whole story.

That is the right boundary because failure recovery should preserve ordering while avoiding duplicate work. If step four fails, the answer is usually “recover step four,” not “pretend steps one through three never happened.”

A useful rule of thumb:

Provider layer: short retries for transient request failures
Workflow layer: branch, notify, or escalate when the request still fails
Operator layer: inspect, audit, and decide whether manual recovery is safer
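
The per-request boundary and the provider/workflow split can be sketched like this. It is a simplified model with hypothetical names (`run_flow`, `TransientError`), not OpenClaw code: each step gets its own retry budget, completed steps are never replayed, and an exhausted budget is handed upward instead of restarting the flow.

```python
from typing import Callable, Dict, List, Tuple


class TransientError(Exception):
    """Stands in for a timeout, 429, or 5xx on a single request."""


def run_flow(
    steps: List[Tuple[str, Callable[[], None]]], attempts: int = 3
) -> Tuple[List[str], str]:
    """Run steps in order, retrying only the step that failed.

    Returns (completed step names, outcome), where outcome is
    "succeeded" or "failed:<step name>".
    """
    completed: List[str] = []
    for name, request in steps:
        for attempt in range(1, attempts + 1):
            try:
                request()
                completed.append(name)
                break
            except TransientError:
                if attempt == attempts:
                    # Budget exhausted: the workflow layer decides what next.
                    return completed, f"failed:{name}"
                # A real client would wait here (backoff plus jitter).
    return completed, "succeeded"


# Usage: the upload fails twice, then succeeds; the send is never repeated.
calls: Dict[str, int] = {"send": 0, "upload": 0}


def send_message() -> None:
    calls["send"] += 1


def upload_media() -> None:
    calls["upload"] += 1
    if calls["upload"] < 3:
        raise TransientError("upstream hiccup")


completed, outcome = run_flow([("send", send_message), ("upload", upload_media)])
```

Because the list of completed steps comes back with the outcome, the workflow or operator layer can see exactly where recovery has to start.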

Default cooldown patterns in the current docs

OpenClaw does not treat every retry as immediate spam. The current retry page lists these defaults:

Attempts: 3
Max delay cap: 30,000 ms
Jitter: 10 percent
Telegram minimum delay: 400 ms
Discord minimum delay: 500 ms

That combination matters. A cap prevents endless waiting, and jitter keeps many clients from retrying in lockstep like a crowd all trying the same locked door at once.
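
Here is one way those defaults could combine into a delay. The docs list the numbers but this article does not quote the growth formula, so the exponential doubling below is an assumption, as are the function name and the way a provider hint is merged in (take the larger of the hint and the computed delay, then cap).

```python
import random
from typing import Optional


def retry_delay_ms(
    attempt: int,
    min_delay_ms: int = 400,
    max_delay_ms: int = 30_000,
    jitter: float = 0.1,
    retry_after_ms: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> int:
    """Delay in milliseconds before retry number `attempt` (1-based)."""
    rng = rng or random.Random()
    # Assumed growth: double the minimum delay on each attempt.
    base = min_delay_ms * (2 ** (attempt - 1))
    if retry_after_ms is not None:
        base = max(base, retry_after_ms)  # never wait less than the hint
    base = min(base, max_delay_ms)        # cap prevents endless waiting
    # Spread by +/- jitter so many clients do not retry in lockstep.
    spread = base * jitter
    delayed = base + rng.uniform(-spread, spread)
    return int(min(max(delayed, 0), max_delay_ms))
```

With the Telegram defaults and jitter switched off, the first three delays under this assumed formula would be 400, 800, and 1,600 ms. With 10 percent jitter, each lands somewhere within 10 percent of those values.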

There is also a sharper rule on the model side. For Stainless-based SDKs such as Anthropic and OpenAI, the docs say that if a retry-after wait exceeds 60 seconds, OpenClaw injects x-should-retry: false so the SDK surfaces the error immediately and model failover can take over. That is a strong design choice: stop sleeping forever inside one provider when another route may recover faster.
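
The decision itself is small enough to sketch. The header name and the 60-second threshold come from the passage above; the function name and the idea of returning a header dictionary are hypothetical, since the article does not show how OpenClaw wires this into the SDK.

```python
from typing import Dict, Optional

FAILOVER_THRESHOLD_S = 60  # threshold described in the passage above


def retry_override_headers(retry_after_s: Optional[float]) -> Dict[str, str]:
    """Return headers that tell the SDK to stop retrying on long waits."""
    if retry_after_s is not None and retry_after_s > FAILOVER_THRESHOLD_S:
        # Waiting this long is worse than trying another route.
        return {"x-should-retry": "false"}
    return {}  # short or missing hint: let the SDK retry as usual
```

A wait of exactly 60 seconds does not trip the override here, because the rule as described applies when the wait exceeds 60 seconds.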

Failure recovery is bigger than retries

A real recovery strategy does more than say “try three times.” It decides what happens after the retries are gone.

This is where the background-tasks model helps. OpenClaw task records move through states like queued, running, succeeded, failed, timed_out, cancelled, and lost. That means detached work can fail visibly instead of just disappearing into log fog.

The docs also emphasize push-based completion. Detached work should notify directly or wake the requester session when it finishes. In other words, the healthy shape is usually this:

retry the current request if the failure looks temporary
finalize the task state honestly if recovery still fails
surface the result to the right chat, session, or operator

That is much better than nervous polling loops and much better than silent failure.
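
A small state machine shows the shape. The state names come from the task model described above; the allowed transitions, the `TaskRecord` class, and the `notify` callback are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

TERMINAL = {"succeeded", "failed", "timed_out", "cancelled", "lost"}
# Assumed transitions: the source names the states, not the allowed moves.
ALLOWED = {
    "queued": {"running", "cancelled", "lost"},
    "running": TERMINAL,
}


@dataclass
class TaskRecord:
    task_id: str
    notify: Callable[[str, str], None]  # push-based completion callback
    state: str = "queued"
    history: List[str] = field(default_factory=lambda: ["queued"])

    def transition(self, new_state: str) -> None:
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)
        if new_state in TERMINAL:
            # Push the outcome to the requester instead of waiting for a poll.
            self.notify(self.task_id, new_state)


# Usage: a publish task that runs and then fails visibly.
events: List[Tuple[str, str]] = []
task = TaskRecord("publish-1", notify=lambda tid, state: events.append((tid, state)))
task.transition("running")
task.transition("failed")
```

Terminal states have no outgoing transitions in this sketch, so a failed task cannot quietly be flipped back to running; recovery would create a new record.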

Common recovery mistakes

Retrying the whole flow instead of the broken request

This is how duplicates happen. A message gets posted twice. A file uploads again. A long chain of work reruns because one endpoint coughed near the end. OpenClaw's per-request retry model exists to stop that.

Ignoring idempotency

If a step has side effects, ask whether repeating it is safe. A health check is usually safe. A purchase, publish step, or notification blast may not be.
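
One common guard is an idempotency key: record that a side effect already happened, and skip it on a repeat. The sketch below is a generic in-memory version with hypothetical names, not an OpenClaw feature described in the docs quoted here. A production version would need a durable store, and a crash between the action and the record could still allow a duplicate.

```python
from typing import Callable, Dict, List


class IdempotentExecutor:
    """Runs a side effect at most once per idempotency key (in memory)."""

    def __init__(self) -> None:
        self._results: Dict[str, object] = {}

    def run(self, key: str, action: Callable[[], object]) -> object:
        if key in self._results:
            return self._results[key]  # already done: reuse the stored result
        result = action()
        self._results[key] = result    # record only after the action succeeds
        return result


# Usage: the same notification is requested twice but sent once.
sent: List[str] = []
executor = IdempotentExecutor()


def notify_customer() -> str:
    sent.append("order 42 shipped")
    return "sent"


first = executor.run("order-42:notify", notify_customer)
second = executor.run("order-42:notify", notify_customer)
```

If the action raises, nothing is recorded, so a later retry with the same key is still allowed to run it.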

Sleeping too long in the wrong layer

If a provider tells you to wait a long time, that can be a signal to fail over, not just to sit there. The model failover docs make this explicit for long retry-after waits on supported SDKs.

Hiding failures behind “best effort” optimism

Best effort is fine for optional nice-to-have actions. It is terrible for pretending a critical path succeeded when it did not. A failed publish should look failed. A broken delivery should surface clearly. Operators need truth, not vibes.

A practical config example

The retry docs show policy configured per provider in ~/.openclaw/openclaw.json:

{
  channels: {
    telegram: {
      retry: {
        attempts: 3,
        minDelayMs: 400,
        maxDelayMs: 30000,
        jitter: 0.1,
      },
    },
    discord: {
      retry: {
        attempts: 3,
        minDelayMs: 500,
        maxDelayMs: 30000,
        jitter: 0.1,
      },
    },
  },
}

The interesting part is not the syntax. It is the intent. Keep retries close to the request surface that knows the failure type, then keep workflow recovery honest above it.

How to design calmer automations

Retry transient request failures, not whole stories
Fail fast on bad inputs, permissions, and deterministic formatting errors
Respect provider retry hints and add jitter
Use failover when long waits are worse than switching paths
Record final task outcomes so humans can inspect reality later
Report failures clearly instead of hiding them behind silent retries

Most flaky automation is not broken because the API had one bad minute. It is broken because nobody decided, in advance, what kind of failure they were looking at.

The practical takeaway

Use retries for turbulence, not denial. Put them at the request layer, cap them, jitter them, and keep them away from completed workflow steps. Then make failure recovery explicit with failover, task states, and honest reporting.

That is not glamorous. It is just how reliable systems avoid becoming expensive slot machines.

Need help from people who already use this stuff?

Building OpenClaw automations that need to survive real-world API mess?

Join My AI Agent Profit Lab if you want help choosing between retries, failover, task notifications, and manual recovery before duplicate side effects bite you.


FAQ

Does OpenClaw retry an entire workflow when one step fails?

No. The official retry docs are explicit: retries happen per HTTP request, not per multi-step flow. OpenClaw retries the current request or step instead of replaying completed work.

What kinds of failures usually deserve a retry?

Short-lived failures such as rate limits, timeouts, HTTP 5xx responses, and transient transport problems are the usual retry candidates. Bad inputs, parse mistakes, permission mistakes, and other deterministic failures should usually fail fast.

Why is retrying non-idempotent work risky?

Because a retry can create duplicates. If a step sends a message, uploads media, or triggers a side effect, blindly replaying the whole flow can do the same action twice.

What cooldown behavior does OpenClaw use by default?

The current docs list three attempts, a 30,000 ms max delay cap, and 10 percent jitter. Telegram defaults to a 400 ms minimum delay and Discord defaults to 500 ms.

How should I report failures in durable automation?

Use the background-task layer and push-based completion path. Detached work should surface a real status record and notify or wake the requester session instead of relying on someone manually watching logs forever.
