# Timeout-Driven Auth Rotation Prematurely Triggers Provider Fallback

Generic request timeouts are incorrectly treated as rate-limit signals, triggering aggressive auth-profile cooldown and rotation that cascades into provider/model fallback even when the provider is only temporarily slow.
## Symptoms

### Primary Error Messages

When a request timeout occurs on a provider supporting `auth.profiles`, the embedded runner emits cascading failure messages:

```
Profile openai-codex:default timed out (possible rate limit). Trying next account...
No available auth profile for openai-codex (all in cooldown or unavailable).
... provider=openai model=gpt-5.2 ... # fallback triggered
```
### Observable Behavior

- Premature profile exhaustion: a single timeout on one profile causes immediate rotation to the next available profile
- Cooldown state accumulation: each timeout writes a cooldown entry with exponential backoff (~1m → 5m → 25m → 1h cap)
- Unnecessary model fallback: when all profiles enter cooldown, the system proceeds to configured model fallbacks even if the original provider is operational
- Log noise: repeated `timed out (possible rate limit)` messages create confusion about the actual rate-limiting status
### Reproduction Scenario

```bash
# Trigger: a single request exceeds the timeoutSeconds threshold
openclaw run --agent ./my-agent.ts --timeout-seconds 30

# Observed: immediate auth profile rotation without retry
# Expected: at least one retry with backoff before rotation
```
### Affected Components

| Component | File Path | Failure Point |
|---|---|---|
| Embedded Runner | `src/agents/pi-embedded-runner/run.ts` | Timeout → `markAuthProfileFailure()` → `advanceAuthProfile()` |
| Auth Profiles | `src/agents/auth-profiles/usage.ts` | Uniform cooldown schedule for timeout and rate-limit reasons |
## Root Cause

### Architectural Analysis

The auth-profile failover loop in the embedded runner conflates two distinct failure modes:

- Strong rate-limit signals: HTTP 429, provider-specific error codes (e.g., `error.code === "rate_limit_exceeded"`)
- Weak transient signals: generic request timeouts (network blip, slow streaming, SDK latency spike)
### Code Path Breakdown

File: `src/agents/pi-embedded-runner/run.ts`

The timeout handler executes without a retry gate:

```typescript
// Simplified flow (line numbers approximate)
async function executeWithAuthProfile(provider, profile, request) {
  try {
    const result = await executeRequest(request, { timeout: timeoutMs });
    return result;
  } catch (error) {
    if (isTimeout(error)) {
      // ❌ No retry gate - immediate failure marking
      markAuthProfileFailure(profile, { reason: "timeout" });
      advanceAuthProfile(provider); // ❌ triggers rotation
      throw new NoAvailableAuthProfileError(provider);
    }
    if (isRateLimit(error)) {
      // ✅ Correct: strong signal warrants immediate cooldown
      markAuthProfileFailure(profile, { reason: "rate_limit" });
      advanceAuthProfile(provider);
      throw new NoAvailableAuthProfileError(provider);
    }
  }
}
```
File: `src/agents/auth-profiles/usage.ts`

The cooldown calculation applies an identical exponential schedule for all failure reasons:

```typescript
function calculateAuthProfileCooldownMs(errorCount: number): number {
  // ~1m → 5m → 25m → 1h cap
  const baseMs = 60_000;
  const cooldown = baseMs * Math.pow(5, Math.min(errorCount - 1, 3));
  return Math.min(cooldown, 3_600_000); // 1-hour cap
}

// Called identically for "timeout" and "rate_limit" reasons
```
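To see why a uniform schedule is punishing for transient timeouts, the formula above can be tabulated directly. This is a self-contained sketch that reproduces the schedule from `usage.ts`:

```typescript
// Standalone reproduction of the uniform cooldown schedule above.
function calculateAuthProfileCooldownMs(errorCount: number): number {
  const baseMs = 60_000; // 1 minute
  const cooldown = baseMs * Math.pow(5, Math.min(errorCount - 1, 3));
  return Math.min(cooldown, 3_600_000); // 1-hour cap
}

// A single timeout already benches the profile for a full minute;
// by the fourth consecutive error the profile is out for an hour.
const schedule = [1, 2, 3, 4].map(calculateAuthProfileCooldownMs);
console.log(schedule); // [ 60000, 300000, 1500000, 3600000 ]
```

One transient network blip therefore costs the same as a confirmed HTTP 429, which is exactly the asymmetry the fix below addresses.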
### Failure Cascade Sequence

1. A request timeout occurs
2. `markAuthProfileFailure(reason: "timeout")` writes a cooldown entry
3. `advanceAuthProfile()` rotates to the next profile
4. If all profiles are unavailable:
   a. `NoAvailableAuthProfileError` is raised
   b. `agents.defaults.modelFailover.fallbacks` is checked
   c. The request proceeds to the fallback model/provider (prematurely)
5. If no fallbacks are configured, the request fails entirely
### Why This Is Incorrect
| Signal Type | Reliability | Appropriate Response |
|---|---|---|
| HTTP 429 | High | Immediate cooldown + rotate |
| Provider error code | High | Immediate cooldown + rotate |
| Generic timeout | Low (transient) | Retry with backoff before cooldown |
Generic timeouts are indistinguishable from:
- Temporary network latency spikes
- Slow streaming response initiation
- SDK connection overhead
- Temporary provider-side load
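One way to make this distinction explicit in code is a small classifier that maps errors to signal strength. The sketch below is illustrative: the helper name `classifyFailureSignal` and the error shape are assumptions, not existing OpenClaw APIs.

```typescript
type FailureSignal = "strong_rate_limit" | "weak_transient" | "other";

// Assumed minimal error shape for illustration.
interface ProviderError {
  status?: number;
  code?: string;
  isTimeout?: boolean;
}

// Hypothetical classifier: only strong signals should trigger immediate
// cooldown + rotation; weak signals deserve a same-profile retry first.
function classifyFailureSignal(error: ProviderError): FailureSignal {
  if (error.status === 429) return "strong_rate_limit";
  if (error.code === "rate_limit_exceeded") return "strong_rate_limit";
  if (error.code === "insufficient_quota") return "strong_rate_limit";
  if (error.isTimeout) return "weak_transient";
  return "other";
}

console.log(classifyFailureSignal({ status: 429 }));     // "strong_rate_limit"
console.log(classifyFailureSignal({ isTimeout: true })); // "weak_transient"
```

Centralizing the classification also keeps the runner's failover loop free of scattered status-code checks.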
### Configuration Gap

No configuration exists to control per-reason retry behavior:

```typescript
// Current: no retrySameProfileOnTimeout config exists
agents: {
  defaults: {
    timeoutSeconds: 30,
    modelFailover: {
      // Missing: retrySameProfileOnTimeout, retryBackoffMs
    }
  }
}
```
## Step-by-Step Fix

### Recommended: Minimal Retry Gate Addition

This fix adds a per-reason retry gate for timeout failures before triggering cooldown and rotation.
### Step 1: Extend the Configuration Schema

File: `src/config/schema.ts`

Add new fields to the model failover configuration:

```typescript
// Before
interface ModelFailoverConfig {
  fallbacks: string[];
}

// After
interface ModelFailoverConfig {
  fallbacks: string[];
  retrySameProfileOnTimeout: number; // Default: 1
  retryBackoffMs: [number, number];  // Default: [300, 1200] ms (min, max jitter)
}
```
### Step 2: Implement the Retry Gate in the Embedded Runner

File: `src/agents/pi-embedded-runner/run.ts`

Modify the timeout handling to include retry logic. Note that the retry counter is also cleared on success, so a profile that recovers after a retry starts fresh:

```typescript
// Before
async function executeWithAuthProfile(provider, profile, request) {
  try {
    return await executeRequest(request, { timeout: timeoutMs });
  } catch (error) {
    if (isTimeout(error)) {
      markAuthProfileFailure(profile, { reason: "timeout" });
      advanceAuthProfile(provider);
      throw new NoAvailableAuthProfileError(provider);
    }
    // ... rate limit handling
  }
}

// After
async function executeWithAuthProfile(provider, profile, request, options = {}) {
  const config = getConfig();
  const {
    retrySameProfileOnTimeout = 1,
    retryBackoffMs = [300, 1200]
  } = config.agents?.defaults?.modelFailover ?? {};

  // Track retries per-profile per-session
  const retryState = getOrCreateRetryState(profile.id);

  try {
    const result = await executeRequest(request, { timeout: timeoutMs });
    clearRetryState(profile.id); // Success: reset the timeout counter
    return result;
  } catch (error) {
    if (isTimeout(error)) {
      const maxRetries = retrySameProfileOnTimeout;
      const currentRetries = retryState.consecutiveTimeouts;

      if (currentRetries < maxRetries) {
        // Retry the same profile with jittered backoff
        const [minDelay, maxDelay] = retryBackoffMs;
        const delay = minDelay + Math.random() * (maxDelay - minDelay);
        console.log(
          `Profile ${profile.id} timed out. ` +
          `Retry ${currentRetries + 1}/${maxRetries} in ${Math.round(delay)}ms...`
        );
        retryState.consecutiveTimeouts++;
        await sleep(delay);
        // Re-execute on the same profile (no cooldown written)
        return await executeWithAuthProfile(
          provider, profile, request,
          { ...options, isRetry: true }
        );
      }

      // Retries exhausted: apply cooldown + rotate
      console.log(
        `Profile ${profile.id} timed out (${maxRetries} retries exhausted). ` +
        `Trying next account...`
      );
      markAuthProfileFailure(profile, { reason: "timeout" });
      clearRetryState(profile.id); // Reset retry counter
      advanceAuthProfile(provider);
      throw new NoAvailableAuthProfileError(provider);
    }

    // Rate-limit handling unchanged (immediate cooldown)
    if (isRateLimit(error)) {
      markAuthProfileFailure(profile, { reason: "rate_limit" });
      clearRetryState(profile.id);
      advanceAuthProfile(provider);
      throw new NoAvailableAuthProfileError(provider);
    }

    throw error;
  }
}
```
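The jittered delay used in the retry gate above can be factored into a small helper, which also makes its bounds testable. A sketch (the name `jitteredDelayMs` is illustrative, not an existing OpenClaw function):

```typescript
// Uniform jitter in [minMs, maxMs): spreads retries out so that many
// concurrent requests hitting a slow provider do not retry in lockstep.
function jitteredDelayMs([minMs, maxMs]: [number, number]): number {
  return minMs + Math.random() * (maxMs - minMs);
}

// With the default [300, 1200], every delay stays inside the band.
for (let i = 0; i < 1000; i++) {
  const d = jitteredDelayMs([300, 1200]);
  if (d < 300 || d >= 1200) throw new Error("delay out of bounds");
}
console.log("all delays within [300, 1200)");
```

Keeping the jitter uniform rather than exponential is deliberate here: the gate allows only a handful of same-profile retries, so there is little to gain from growth between attempts.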
### Step 3: Add Retry State Management

File: `src/agents/auth-profiles/retry-state.ts` (new file)

```typescript
interface RetryState {
  consecutiveTimeouts: number;
  lastRetryTimestamp: number;
}

const retryStateMap = new Map<string, RetryState>();

export function getOrCreateRetryState(profileId: string): RetryState {
  if (!retryStateMap.has(profileId)) {
    retryStateMap.set(profileId, { consecutiveTimeouts: 0, lastRetryTimestamp: 0 });
  }
  return retryStateMap.get(profileId)!;
}

export function clearRetryState(profileId: string): void {
  retryStateMap.delete(profileId);
}

export function clearAllRetryStates(): void {
  retryStateMap.clear();
}
```
### Step 4: Update the Default Configuration

File: `src/config/defaults.ts`

```typescript
// Before
export const defaultAgentsConfig = {
  defaults: {
    timeoutSeconds: 30,
    modelFailover: { fallbacks: [] }
  }
};

// After
export const defaultAgentsConfig = {
  defaults: {
    timeoutSeconds: 30,
    modelFailover: {
      fallbacks: [],
      retrySameProfileOnTimeout: 1,
      retryBackoffMs: [300, 1200]
    }
  }
};
```
### Configuration After Fix

```json5
{
  "agents": {
    "defaults": {
      "timeoutSeconds": 30,
      "modelFailover": {
        "fallbacks": ["gpt-4-turbo", "claude-3-opus"],
        "retrySameProfileOnTimeout": 1, // Retries before cooldown (0 = disabled)
        "retryBackoffMs": [300, 1200]   // [min, max] jittered delay in ms
      }
    }
  }
}
```
### Optional: Per-Reason Cooldown Schedules

For a more thorough fix, differentiate the cooldown schedules by reason:

File: `src/agents/auth-profiles/usage.ts`

```typescript
const COOLDOWN_SCHEDULES = {
  timeout: {
    baseMs: 10_000,  // 10 seconds (vs 60s for rate-limit)
    multiplier: 2,   // 10s → 20s → 40s → 80s
    capMs: 300_000   // 5-minute cap (vs 1 hour)
  },
  rate_limit: {
    baseMs: 60_000,
    multiplier: 5,   // 60s → 5m → 25m → 1h
    capMs: 3_600_000 // 1-hour cap
  }
};

export function calculateAuthProfileCooldownMs(
  errorCount: number,
  reason: 'timeout' | 'rate_limit'
): number {
  const schedule = COOLDOWN_SCHEDULES[reason];
  const cooldown =
    schedule.baseMs * Math.pow(schedule.multiplier, Math.min(errorCount - 1, 3));
  return Math.min(cooldown, schedule.capMs);
}
```
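Under a per-reason schedule, the first few cooldowns diverge sharply by reason. A quick self-contained check (the function is renamed `cooldownMs` here only to keep the sketch standalone; it mirrors the logic above):

```typescript
const COOLDOWN_SCHEDULES = {
  timeout:    { baseMs: 10_000, multiplier: 2, capMs: 300_000 },
  rate_limit: { baseMs: 60_000, multiplier: 5, capMs: 3_600_000 }
} as const;

function cooldownMs(errorCount: number, reason: keyof typeof COOLDOWN_SCHEDULES): number {
  const s = COOLDOWN_SCHEDULES[reason];
  return Math.min(
    s.baseMs * Math.pow(s.multiplier, Math.min(errorCount - 1, 3)),
    s.capMs
  );
}

// First failure: a 10s bench for a timeout vs a full minute for a 429.
console.log(cooldownMs(1, "timeout"), cooldownMs(1, "rate_limit")); // 10000 60000
// Fourth failure: 80s vs the 1-hour cap.
console.log(cooldownMs(4, "timeout"), cooldownMs(4, "rate_limit")); // 80000 3600000
```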
## Verification

### Unit Tests: Timeout Retry Behavior

File: `src/agents/pi-embedded-runner/__tests__/timeout-retry.test.ts`

```typescript
describe('Timeout retry behavior', () => {
  const mockProfile = { id: 'test-profile', provider: 'openai-codex' };

  beforeEach(() => {
    clearAllRetryStates();
  });

  test('single timeout retries same profile without cooldown', async () => {
    const executeRequest = jest.fn()
      .mockRejectedValueOnce(new TimeoutError())
      .mockResolvedValueOnce({ data: 'success' });
    const markAuthProfileFailure = jest.fn();
    const advanceAuthProfile = jest.fn();

    await executeWithAuthProfile('openai-codex', mockProfile, mockRequest, {
      executeRequest,
      markAuthProfileFailure,
      advanceAuthProfile,
      config: { retrySameProfileOnTimeout: 1, retryBackoffMs: [0, 10] }
    });

    // Verify the retry occurred
    expect(executeRequest).toHaveBeenCalledTimes(2);
    // Verify NO cooldown was written
    expect(markAuthProfileFailure).not.toHaveBeenCalled();
    // Verify NO rotation occurred
    expect(advanceAuthProfile).not.toHaveBeenCalled();
  });

  test('exhausted retries trigger cooldown and rotation', async () => {
    const executeRequest = jest.fn().mockRejectedValue(new TimeoutError());
    const markAuthProfileFailure = jest.fn();
    const advanceAuthProfile = jest.fn();

    await expect(
      executeWithAuthProfile('openai-codex', mockProfile, mockRequest, {
        executeRequest,
        markAuthProfileFailure,
        advanceAuthProfile,
        config: { retrySameProfileOnTimeout: 1, retryBackoffMs: [0, 10] }
      })
    ).rejects.toThrow(NoAvailableAuthProfileError);

    // Verify retries were exhausted
    expect(executeRequest).toHaveBeenCalledTimes(2);
    // Verify cooldown WAS written
    expect(markAuthProfileFailure).toHaveBeenCalledWith(
      mockProfile,
      { reason: 'timeout' }
    );
    // Verify rotation occurred
    expect(advanceAuthProfile).toHaveBeenCalledWith('openai-codex');
  });

  test('rate-limit triggers immediate cooldown (no retry)', async () => {
    const executeRequest = jest.fn().mockRejectedValue({
      status: 429,
      code: 'rate_limit_exceeded'
    });
    const markAuthProfileFailure = jest.fn();
    const advanceAuthProfile = jest.fn();

    await expect(
      executeWithAuthProfile('openai-codex', mockProfile, mockRequest, {
        executeRequest,
        markAuthProfileFailure,
        advanceAuthProfile,
        config: { retrySameProfileOnTimeout: 1, retryBackoffMs: [0, 10] }
      })
    ).rejects.toThrow(NoAvailableAuthProfileError);

    // Verify NO retry for rate-limit
    expect(executeRequest).toHaveBeenCalledTimes(1);
    expect(markAuthProfileFailure).toHaveBeenCalledWith(
      mockProfile,
      { reason: 'rate_limit' }
    );
  });
});
```
### Integration Test: Multiple Profiles + Intermittent Timeouts

```typescript
test('intermittent timeouts do not exhaust all profiles', async () => {
  const profiles = [
    { id: 'profile-1', provider: 'openai-codex' },
    { id: 'profile-2', provider: 'openai-codex' },
    { id: 'profile-3', provider: 'openai-codex' }
  ];

  // profile-1: times out on every attempt → retries exhausted → cooldown + rotate
  // profile-2: succeeds after the rotation
  // profile-3: never reached
  const executeRequest = jest.fn().mockImplementation(({ profile }) =>
    profile.id === 'profile-1'
      ? Promise.reject(new TimeoutError())
      : Promise.resolve({ data: 'ok' })
  );

  const result = await runWithAuthProfiles(profiles, mockRequest, {
    executeRequest,
    config: { retrySameProfileOnTimeout: 1, retryBackoffMs: [0, 10] }
  });

  // Succeeds via profile-2; only two profiles were touched
  expect(result).toBeDefined();
  expect(executeRequest).toHaveBeenCalledTimes(3); // 2 attempts on profile-1 + 1 on profile-2

  // Only profile-1 should have a cooldown recorded
  expect(getProfileCooldown('profile-1')).toBeDefined();
  expect(getProfileCooldown('profile-2')).toBeUndefined();
  expect(getProfileCooldown('profile-3')).toBeUndefined();
});
```
### Manual Verification Steps

```bash
# 1. Enable debug logging
export OPENCLAW_LOG_LEVEL=debug

# 2. Run an agent with a known timeout-prone scenario
openclaw run --agent ./test-agent.ts --timeout-seconds 5

# 3. Expected log output with the fix:
#    [DEBUG] Profile openai-codex:default timed out. Retry 1/1 in 847ms...
#    [DEBUG] Request succeeded on retry
#    NOT "Trying next account..." on the first timeout

# 4. After the fix, when retries are exhausted:
#    [INFO] Profile openai-codex:default timed out (1 retries exhausted). Trying next account...
#    [INFO] Cooldown applied: 10000ms for timeout reason
```
### Verification Checklist

| Criterion | Test Method | Expected Result |
|---|---|---|
| Single timeout retries same profile | Unit test | 2 `executeRequest` calls, 0 cooldown writes |
| Retries exhausted → cooldown | Unit test | `markAuthProfileFailure` called with `reason: "timeout"` |
| Rate-limit bypasses retry | Unit test | 1 `executeRequest` call, immediate cooldown |
| Log output correct | Manual test | Retry count + delay shown before cooldown |
| Profile exhaustion prevention | Integration test | Intermittent timeouts touch at most 2 profiles |
## Common Pitfalls

### Edge Cases and Environment-Specific Traps

- Jitter range too narrow: if `retryBackoffMs` is too small (e.g., `[1, 10]`), retries may hit the same transient issue immediately. Recommended minimum: `[300, 1200]`
- Infinite retry loop risk: if `retrySameProfileOnTimeout` is set very high without a global timeout, requests may hang for a long time. Always pair it with `timeoutSeconds`
- Retry state leakage between sessions: ensure `clearRetryState()` is called on success and on profile rotation to prevent stale retry counts
- Memory pressure in long-running processes: the retry state map is keyed by profile-id strings, so a `WeakMap` will not work here; use explicit cleanup or age-based eviction for stale entries
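The leakage and memory-pressure points can be addressed with age-based eviction on the retry-state map. The sketch below assumes that retry state older than a few minutes is no longer meaningful; `pruneRetryStates` is illustrative, not an existing OpenClaw API:

```typescript
interface RetryState {
  consecutiveTimeouts: number;
  lastRetryTimestamp: number; // ms since epoch
}

const retryStateMap = new Map<string, RetryState>();

// Evict entries whose last retry is older than maxAgeMs. A WeakMap cannot
// be used here because the keys are profile-id strings, not objects.
function pruneRetryStates(maxAgeMs: number, now: number = Date.now()): number {
  let evicted = 0;
  retryStateMap.forEach((state, profileId) => {
    if (now - state.lastRetryTimestamp > maxAgeMs) {
      retryStateMap.delete(profileId);
      evicted++;
    }
  });
  return evicted;
}

retryStateMap.set("stale", { consecutiveTimeouts: 1, lastRetryTimestamp: 0 });
retryStateMap.set("fresh", { consecutiveTimeouts: 1, lastRetryTimestamp: Date.now() });
console.log(pruneRetryStates(5 * 60_000)); // 1
```

Calling such a pruner on a timer (or lazily inside `getOrCreateRetryState`) bounds the map's size without needing per-profile lifecycle hooks.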
### macOS-Specific Considerations

macOS has no `tc(8)`-style traffic shaping. To simulate the network latency that triggers these timeouts, use dummynet (`dnctl`/`pfctl`) or the Network Link Conditioner from Xcode's Additional Tools.
### Docker-Specific Considerations

```bash
# Container network timeouts may vary by resource constraints.
# Ensure the container has adequate resources for timeout handling:
docker run --memory=512m --cpus=1 ...
```
### Windows-Specific Considerations

```powershell
# PowerShell sleep precision differs from Unix.
# For precise backoff measurement, use a monotonic timestamp:
[System.Diagnostics.Stopwatch]::GetTimestamp()
```
### Configuration Pitfalls

```typescript
// ❌ WRONG: retryBackoffMs reversed (min > max)
{ retryBackoffMs: [1200, 300] }

// ✅ CORRECT: [min, max]
{ retryBackoffMs: [300, 1200] }

// Note: retrySameProfileOnTimeout: 0 disables timeout retries and
// restores the old rotate-on-first-timeout behavior; it does NOT
// mean "infinite retries".
{ retrySameProfileOnTimeout: 0 }
```
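A defensive loader can reject the reversed-range mistake at startup rather than producing negative jitter at runtime. An illustrative sketch; `validateModelFailoverConfig` is not an existing OpenClaw function:

```typescript
interface ModelFailoverConfig {
  fallbacks: string[];
  retrySameProfileOnTimeout: number;
  retryBackoffMs: [number, number];
}

// Throws on configurations that would otherwise misbehave silently.
function validateModelFailoverConfig(cfg: ModelFailoverConfig): void {
  const [min, max] = cfg.retryBackoffMs;
  if (min < 0 || max < 0) throw new Error("retryBackoffMs values must be >= 0");
  if (min > max) throw new Error("retryBackoffMs must be [min, max] with min <= max");
  if (!Number.isInteger(cfg.retrySameProfileOnTimeout) || cfg.retrySameProfileOnTimeout < 0) {
    throw new Error("retrySameProfileOnTimeout must be a non-negative integer");
  }
}

validateModelFailoverConfig({
  fallbacks: [], retrySameProfileOnTimeout: 1, retryBackoffMs: [300, 1200]
}); // ok

try {
  validateModelFailoverConfig({
    fallbacks: [], retrySameProfileOnTimeout: 1, retryBackoffMs: [1200, 300]
  });
} catch {
  console.log("rejected reversed range");
}
```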
### Interaction with Existing Fallback Behavior

When `agents.defaults.modelFailover.fallbacks` is configured, the retry behavior applies per provider:

```json5
{
  "agents": {
    "defaults": {
      "modelFailover": {
        "fallbacks": ["gpt-4", "claude-3"],
        "retrySameProfileOnTimeout": 1,
        "retryBackoffMs": [500, 2000]
      }
    }
  }
}
```

Sequence with the fix:

1. A request to `gpt-4-turbo` with `openai-codex:profile-1` times out
2. The same `profile-1` is retried (no cooldown written)
3. The retry fails → cooldown is applied and rotation moves to `profile-2`
4. If `profile-2` is also exhausted → fall back to `gpt-4` (fresh profiles)
## Related Errors

### Contextually Connected Error Codes and Historical Issues

| Error / Issue | Description | Relationship |
|---|---|---|
| `NoAvailableAuthProfileError` | Thrown when all profiles are in cooldown | Primary symptom of aggressive timeout handling |
| `Profile ${id} timed out (possible rate limit)` | Misleading log message | Implies a rate limit where only a timeout occurred |
| `MARK_AUTH_PROFILE_FAILURE` | Auth profile failure tracking | Core mechanism that needs the retry gate |
| HTTP 429 | Explicit rate-limit signal | Correct trigger for cooldown (should remain unchanged) |
| `error.code === "insufficient_quota"` | Provider-specific quota error | Strong signal; should bypass retry |
### Related Configuration Parameters

| Parameter | Current Behavior | Issue |
|---|---|---|
| `agents.defaults.timeoutSeconds` | Triggers profile rotation | Too aggressive for transient timeouts |
| `agents.defaults.modelFailover.fallbacks` | Triggered when all profiles are exhausted | Unnecessarily triggered by a single timeout |
| `agents.defaults.maxConcurrentRequests` | May compound timeout issues | High concurrency + timeouts = faster profile exhaustion |
### Historical Context
This issue manifests differently based on configuration:
- High traffic deployments: Multiple simultaneous timeouts can exhaust all profiles quickly
- Low traffic deployments: Single timeout may be the only signal, yet still causes fallback
- Shared infrastructure: One team's timeout affects other teams' profile availability
### References to Related OpenClaw Components

- `src/agents/pi-embedded-runner/run.ts` - Embedded runner auth loop
- `src/agents/auth-profiles/usage.ts` - Cooldown calculation
- `src/agents/auth-profiles/cooldown-store.ts` - Persistent cooldown state
- `src/config/schema.ts` - Configuration type definitions
- `src/errors/auth-profile-errors.ts` - Error class definitions