A good failover setup is like the backup power system in a hospital. Nobody brags about it on a normal day. Then the lights flicker, something upstream breaks, and suddenly the whole point is obvious.
That is the job of model failover in OpenClaw. Your agent should not freeze just because one provider times out, one auth profile hits a quota wall, or one model starts returning junk. A resilient setup keeps moving.
This guide shows how OpenClaw handles failover today, what gets retried first, when it switches models, and how to configure a fallback chain that helps instead of making a mess.
What OpenClaw does before it gives up
Current OpenClaw docs describe failover as a two-stage process:
- Auth profile rotation inside the current provider
- Model fallback to the next model in your configured chain
That order matters. If your Anthropic setup has multiple auth profiles, OpenClaw tries the healthier path there first before it jumps to a different provider or model. This is cleaner, cheaper, and often faster than abandoning the provider immediately.
Why this design is smarter than brute-force retrying
The early internet became resilient because packets could route around broken nodes instead of insisting on one perfect path. OpenClaw applies the same idea to model execution. It does not assume your first route will always be healthy. It keeps an ordered escape path ready.
That is a better pattern than blind retry loops. Blind retries waste time, burn rate limits, and leave users staring at a stalled session. Ordered failover makes a judgment call: try the nearby safe option first, then move on.
The runtime flow in plain English
For a normal text run, the current docs describe this sequence:
- Resolve the session's active model and auth preference
- Build the candidate chain from your primary model plus configured fallbacks
- Try the current provider with auth-profile rotation rules
- Advance to the next model only when the provider path is exhausted by a failover-worthy error
- Persist the chosen fallback override before retrying, so the rest of the session sees the same safe model
- Roll back only the fallback-owned fields if that fallback candidate also fails
The subtle bit is the persistence. OpenClaw does not just switch for one reply and forget. It can mark the fallback as an automatic override for the session, which avoids repeatedly touching a known-bad primary on the next turn.
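The sequence above can be sketched as a small simulation. This is only an illustration of the ordering (auth rotation first, then model fallback, with a persisted override), not OpenClaw's actual implementation. `FailoverError`, `Session`, `run_with_failover`, and the `call` callback are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


class FailoverError(Exception):
    """Hypothetical marker for errors worth rotating or falling back on."""


@dataclass
class Session:
    primary: str
    fallbacks: list[str] = field(default_factory=list)
    # Fallback-owned field: set when the run moves to a fallback model.
    auto_override: Optional[str] = None


def provider_of(model: str) -> str:
    # "anthropic/claude-sonnet-4-6" -> "anthropic"
    return model.split("/", 1)[0]


def run_with_failover(
    session: Session,
    profiles: dict[str, list[str]],
    call: Callable[[str, str], str],
) -> str:
    chain = [session.primary, *session.fallbacks]
    # Resume from a previously persisted fallback instead of a known-bad primary.
    start = session.auto_override or session.primary
    for model in chain[chain.index(start):]:
        is_fallback = model != session.primary
        before = session.auto_override
        if is_fallback:
            # Persist the override before retrying on the fallback.
            session.auto_override = model
        # Stage 1: rotate auth profiles inside this model's provider.
        for profile in profiles.get(provider_of(model), []):
            try:
                return call(model, profile)
            except FailoverError:
                continue
        # Stage 2: provider path exhausted, so undo only the fallback-owned
        # field and advance to the next model in the chain.
        if is_fallback:
            session.auto_override = before
    raise FailoverError("all candidates exhausted")
```

After a successful fallback, `session.auto_override` holds the fallback model, so the next call starts there rather than retrying the primary.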
Basic fallback configuration
If you want failover, define it on purpose. The core pattern is simple:
{
  agents: {
    defaults: {
      model: {
        primary: "anthropic/claude-sonnet-4-6",
        fallbacks: [
          "openai/gpt-5.5",
          "openrouter/moonshotai/kimi-k2"
        ],
      },
    },
  },
}
Think of the order as a business decision, not a decoration.
- Fallback 1: the model you trust most when the primary goes sideways
- Fallback 2: the model that keeps the service alive if both premium paths fail
- Later entries: only if you have a real reason and have tested them
More options are not automatically better. Long chains can hide problems and make behavior harder to predict.
Auth rotation comes first
OpenClaw separates provider auth from model selection. That sounds boring. It is not. If you run multiple API keys or OAuth-backed profiles for the same provider, OpenClaw can rotate between them before it abandons the provider entirely.
The docs also note a session-stickiness rule: once OpenClaw picks an auth profile for a session, it tends to keep using that profile until the session resets, a compaction changes the state, or the profile enters cooldown. That keeps provider-side caches warmer and avoids pointless churn.
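A minimal sketch of that stickiness rule, assuming cooldowns are tracked as a "usable again at" timestamp per profile. The function name `pick_profile` and the data shapes are hypothetical, and session resets and compaction are modeled simply by passing `sticky=None`.

```python
from typing import Optional


def pick_profile(
    order: list[str],
    sticky: Optional[str],
    cooldown_until: dict[str, float],
    now: float,
) -> Optional[str]:
    """Prefer the session's sticky profile; otherwise take the first usable one."""

    def usable(profile: str) -> bool:
        # A profile with no recorded cooldown is always usable.
        return cooldown_until.get(profile, 0.0) <= now

    if sticky is not None and sticky in order and usable(sticky):
        return sticky
    for profile in order:
        if usable(profile):
            return profile
    # Every profile is cooling down: the provider path is exhausted.
    return None
```

Returning `None` here corresponds to the point where model fallback would take over.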
When to care about auth order
If one profile is your paid production path and another is your backup, make that explicit. Do not rely on wishful thinking.
{
  auth: {
    order: {
      anthropic: [
        "anthropic:team-primary",
        "anthropic:backup-key"
      ],
    },
  },
}
What usually triggers failover
Based on the current official docs, failover-worthy errors are broader than plain HTTP 429s. OpenClaw can rotate or fall back on:
- Rate limits: including concurrency caps and temporary usage windows
- Transient timeouts: when the provider path looks overloaded or unstable
- Auth failures: expired or unusable credentials
- Some format or stop-reason errors: when the provider path is clearly unhealthy for the current request
- Billing disables: when a profile is effectively out of service
OpenClaw then applies exponential cooldowns. The current docs state 1 minute, 5 minutes, 25 minutes, and then a 1 hour cap. That is exactly what you want. A bad profile should cool off. It should not be punched in the face every turn.
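The documented steps fit a simple pattern: each step is five times the previous one, capped at 60 minutes. The sketch below reproduces that schedule. The multiplier of 5 is inferred from the listed values, and keying the schedule on a consecutive-failure count is an assumption; `cooldown_minutes` is a hypothetical helper, not an OpenClaw function.

```python
COOLDOWN_CAP_MINUTES = 60  # the documented 1 hour cap


def cooldown_minutes(consecutive_failures: int) -> int:
    """Cooldown length in minutes after the given number of failures in a row."""
    if consecutive_failures < 1:
        return 0
    # 5**0 = 1, 5**1 = 5, 5**2 = 25, then the cap takes over.
    return min(COOLDOWN_CAP_MINUTES, 5 ** (consecutive_failures - 1))


schedule = [cooldown_minutes(n) for n in range(1, 6)]
# schedule == [1, 5, 25, 60, 60]
```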
Strict overrides are intentionally strict
This catches people. If you manually select a model for the session with /model, the current docs say OpenClaw treats that as a user override, not a polite suggestion.
In other words:
- Configured default: can walk the fallback chain
- Automatic runtime fallback: can keep walking the configured chain
- User-picked session model: fails visibly if that exact model is unavailable
That is the right behavior. If you explicitly asked for one model, silent substitution would be confusing and sometimes dangerous.
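One way to picture the three cases is as a function that builds the candidate list from the source of the active model. This is a sketch under assumptions: the `Source` enum and `candidate_chain` are hypothetical names, and the real resolution logic may differ.

```python
from enum import Enum


class Source(Enum):
    CONFIG_DEFAULT = "config_default"  # model came from configuration
    AUTO_FALLBACK = "auto_fallback"    # runtime switched after a failure
    USER_OVERRIDE = "user_override"    # user picked it with /model


def candidate_chain(
    active: str, source: Source, primary: str, fallbacks: list[str]
) -> list[str]:
    if source is Source.USER_OVERRIDE:
        # Strict: only the exact model the user asked for, no substitutes.
        return [active]
    chain = [primary, *fallbacks]
    if active in chain:
        # Defaults and automatic fallbacks keep walking from their position.
        return chain[chain.index(active):]
    # Assumption: an unknown active model falls back to the full chain.
    return chain
```

A single-element list means any failure surfaces to the user instead of being absorbed by another model.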
Cron jobs follow a slightly different rule
Cron model selection is treated more like a job primary than a manual user override. The current docs say a cron payload model still uses configured fallbacks unless you explicitly make the run strict.
{
  model: "openai/gpt-5.5",
  fallbacks: [],
}
That tiny empty array matters. It tells OpenClaw, "do not rescue this run with another model." That is useful for tests, audits, and jobs where exact reproducibility matters more than continuity.
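The difference between a missing `fallbacks` key and an empty one is easy to get wrong, so here is a small sketch of the distinction. It assumes the payload is handled as a plain dictionary; `cron_chain` is a hypothetical helper used only to illustrate the rule.

```python
from typing import Any


def cron_chain(payload: dict[str, Any], configured_fallbacks: list[str]) -> list[str]:
    """Resolve the models a cron run may use, in order."""
    fallbacks = payload.get("fallbacks")
    if fallbacks is None:
        # Key absent: inherit the configured fallback chain.
        fallbacks = configured_fallbacks
    # An explicit empty list is kept as-is, which makes the run strict.
    return [payload["model"], *fallbacks]
```

The check uses `is None` rather than truthiness on purpose: an empty list is falsy in Python, and treating it like a missing key would silently re-enable fallbacks.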
How to inspect your failover state
If you are not sure what OpenClaw will do, check instead of guessing:
openclaw models status
openclaw models fallbacks list
openclaw models list
The current CLI docs describe openclaw models status as the place to inspect the resolved default model, fallback chain, and auth overview. Add probing only when you genuinely need live checks, because probes are real requests and may consume tokens.
Practical design rules for better fallback chains
1. Do not make your fallback weaker in the wrong way
Cheaper is fine. Too weak for tool use is not. A fallback that cannot handle your normal prompt shape is not resilience. It is a delayed failure.
2. Cross providers when uptime matters
If your primary and first fallback share the same provider failure domain, you have built a backup that lives in the same burning building.
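A quick way to catch this in review is to compare provider prefixes in the chain. The sketch below assumes model references follow the `provider/model` shape used in the config examples; `provider_of` and `same_failure_domain` are hypothetical helpers. A prefix check is only a rough proxy, since a routing provider may still depend on the same upstream.

```python
def provider_of(model_ref: str) -> str:
    # "openrouter/moonshotai/kimi-k2" -> "openrouter"
    return model_ref.split("/", 1)[0]


def same_failure_domain(primary: str, fallbacks: list[str]) -> bool:
    """True when the first fallback uses the same provider as the primary."""
    return bool(fallbacks) and provider_of(fallbacks[0]) == provider_of(primary)
```

For the chain shown earlier, the primary is on `anthropic` and the first fallback is on `openai`, so the check returns `False`.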
3. Keep the prompt contract compatible
If one model in the chain handles tools, long context, or images very differently, test it with real workloads. Fallback is not just about availability. It is about surviving the same job with tolerable output quality.
4. Decide when strict is better
For evaluations, regulated workflows, or side-by-side tests, strict runs are cleaner. For support channels and user-facing agents, continuity usually wins.
Common mistakes
- No fallbacks at all: one provider wobble becomes a full outage
- Too many fallbacks: debugging turns into archaeology
- Only same-provider fallbacks: better than nothing, weaker than it looks
- Forgetting auth rotation: sometimes the provider is fine and one credential is the real problem
- Assuming manual model picks will auto-recover: they usually will not, by design
FAQ
What is the difference between auth rotation and model fallback?
Auth rotation happens first inside the current provider. OpenClaw tries another auth profile for that provider before it moves to the next model in your fallback chain. Model fallback is the wider escape hatch when the whole provider path is no longer usable.
Does OpenClaw always fall back automatically?
No. Configured defaults and cron job primaries can use fallbacks. A manual session choice through /model or the model picker is treated as strict. If that exact model fails, OpenClaw reports the failure instead of quietly switching to something else.
Which failures usually trigger failover?
Rate limits, auth failures, transient timeouts, some provider-side format errors, and billing-style disables can all trigger rotation or fallback. The exact classification is provider-aware, but the pattern is simple: retry what is likely transient, then move on before the session stalls.
How long do cooldowns last?
Current OpenClaw docs describe exponential cooldowns of 1 minute, 5 minutes, 25 minutes, and then a 1 hour cap. That lets unhealthy profiles cool off instead of getting hammered on every turn.
Can I make a cron job strict instead of using fallbacks?
Yes. A cron job model still uses configured fallbacks by default, but you can make that run strict by sending an empty fallbacks array in the job payload.
Summary
Model failover is not there to make your setup look sophisticated. It is there so your agent keeps functioning when real systems behave like real systems.
Set a strong primary. Add a short, deliberate fallback chain. Test it with real workloads. And follow the old proverb often attributed to China: dig the well before you are thirsty.