April 20, 2026 • Version: v2.4.x

Timeout-Driven Auth Rotation Prematurely Triggers Provider Fallback

Generic request timeouts are incorrectly treated as rate-limit signals, triggering aggressive auth-profile cooldown and rotation that cascades into provider/model fallback even when the provider is merely slow, not rate limited.

๐Ÿ” Symptoms

Primary Error Messages

When a request timeout occurs on a provider supporting auth.profiles, the embedded runner emits cascading failure messages:

```
Profile openai-codex:default timed out (possible rate limit). Trying next account...
No available auth profile for openai-codex (all in cooldown or unavailable).
... provider=openai model=gpt-5.2 ...   # fallback triggered
```

Observable Behavior

  • Premature profile exhaustion: A single timeout on one profile causes immediate rotation to the next available profile
  • Cooldown state accumulation: Each timeout writes a cooldown entry with exponential backoff (~1m → 5m → 25m → 1h cap)
  • Unnecessary model fallback: When all profiles enter cooldown, the system proceeds to configured model fallbacks even if the original provider is operational
  • Log noise: Repeated `timed out (possible rate limit)` messages create confusion about actual rate-limiting status

Reproduction Scenario

```bash
# Trigger: a single request exceeds the timeoutSeconds threshold
openclaw run --agent ./my-agent.ts --timeout-seconds 30

# Observed: immediate auth profile rotation without retry
# Expected: at least one retry with backoff before rotation
```

Affected Components

| Component | File Path | Failure Point |
| --- | --- | --- |
| Embedded Runner | `src/agents/pi-embedded-runner/run.ts` | Timeout → `markAuthProfileFailure()` → `advanceAuthProfile()` |
| Auth Profiles | `src/agents/auth-profiles/usage.ts` | Uniform cooldown schedule for timeout and rate-limit reasons |

🧠 Root Cause

Architectural Analysis

The auth-profile failover loop in the embedded runner conflates two distinct failure modes:

  1. Strong rate-limit signals: HTTP 429, provider-specific error codes (e.g., error.code === "rate_limit_exceeded")
  2. Weak transient signals: Generic request timeouts (network blip, slow streaming, SDK latency spike)
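
The distinction between these two failure modes can be made explicit with a small classifier. A minimal sketch, not the actual codebase API — the `classifyFailure` helper and its error shape are illustrative assumptions:

```typescript
// Hypothetical signal classifier: separates strong rate-limit evidence
// from weak, possibly-transient timeout evidence.
type FailureSignal = "rate_limit" | "timeout" | "other";

function classifyFailure(error: {
  status?: number;
  code?: string;
  name?: string;
}): FailureSignal {
  // Strong signals: explicit HTTP 429 or a provider rate-limit error code
  if (error.status === 429 || error.code === "rate_limit_exceeded") {
    return "rate_limit";
  }
  // Weak signal: a generic timeout carries no evidence of rate limiting
  if (error.name === "TimeoutError") {
    return "timeout";
  }
  return "other";
}

console.log(classifyFailure({ status: 429 }));          // "rate_limit"
console.log(classifyFailure({ name: "TimeoutError" })); // "timeout"
```

Classifying first makes the policy decision (cooldown vs. retry) a separate, testable step from signal detection.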

Code Path Breakdown

File: src/agents/pi-embedded-runner/run.ts

The timeout handler executes without a retry gate:

```typescript
// Simplified flow (line numbers approximate)
async function executeWithAuthProfile(provider, profile, request) {
  try {
    const result = await executeRequest(request, { timeout: timeoutMs });
    return result;
  } catch (error) {
    if (isTimeout(error)) {
      // ❌ No retry gate - immediate failure marking
      markAuthProfileFailure(profile, { reason: "timeout" });
      advanceAuthProfile(provider); // ← Triggers rotation
      throw new NoAvailableAuthProfileError(provider);
    }

    if (isRateLimit(error)) {
      // ✓ Correct: strong signal warrants immediate cooldown
      markAuthProfileFailure(profile, { reason: "rate_limit" });
      advanceAuthProfile(provider);
      throw new NoAvailableAuthProfileError(provider);
    }

    throw error;
  }
}
```

File: src/agents/auth-profiles/usage.ts

Cooldown calculation applies identical exponential schedule for all failure reasons:

```typescript
function calculateAuthProfileCooldownMs(errorCount: number): number {
  // ~1m → 5m → 25m → 1h cap
  const baseMs = 60_000;
  const cooldown = baseMs * Math.pow(5, Math.min(errorCount - 1, 3));
  return Math.min(cooldown, 3_600_000); // 1-hour cap
}

// Called identically for "timeout" and "rate_limit" reasons
```
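
Evaluating this schedule makes the uniform treatment concrete: after four consecutive failures a profile is benched for a full hour, regardless of whether the trigger was a hard 429 or a single slow response. A standalone sketch reproducing the formula above:

```typescript
// Reproduces the uniform cooldown schedule from usage.ts
function calculateAuthProfileCooldownMs(errorCount: number): number {
  const baseMs = 60_000;
  const cooldown = baseMs * Math.pow(5, Math.min(errorCount - 1, 3));
  return Math.min(cooldown, 3_600_000); // 1-hour cap
}

for (let n = 1; n <= 4; n++) {
  console.log(`errorCount=${n}: ${calculateAuthProfileCooldownMs(n) / 60_000} min`);
}
// errorCount=1: 1 min
// errorCount=2: 5 min
// errorCount=3: 25 min
// errorCount=4: 60 min  (7,500,000 ms capped at the 1-hour limit)
```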

Failure Cascade Sequence

  1. Request timeout occurs
  2. `markAuthProfileFailure(reason: "timeout")` writes a cooldown entry
  3. `advanceAuthProfile()` rotates to the next profile
  4. If all profiles are unavailable:
     a. `NoAvailableAuthProfileError` is raised
     b. `agents.defaults.model.fallbacks` is checked
     c. Execution proceeds to the fallback model/provider ← Premature!
  5. If no fallbacks are configured: the request fails entirely

Why This Is Incorrect

| Signal Type | Reliability | Appropriate Response |
| --- | --- | --- |
| HTTP 429 | High | Immediate cooldown + rotate |
| Provider error code | High | Immediate cooldown + rotate |
| Generic timeout | Low (transient) | Retry with backoff before cooldown |

Generic timeouts are indistinguishable from:

  • Temporary network latency spikes
  • Slow streaming response initiation
  • SDK connection overhead
  • Temporary provider-side load

Configuration Gap

No configuration exists to control per-reason retry behavior:

```typescript
// Current: No retrySameProfileOnTimeout config exists
agents: {
  defaults: {
    timeoutSeconds: 30,
    modelFailover: {
      // Missing: retrySameProfileOnTimeout, retryBackoffMs
    }
  }
}
```

๐Ÿ› ๏ธ Step-by-Step Fix

This fix adds a per-reason retry gate for timeout failures before triggering cooldown and rotation.

Step 1: Extend Configuration Schema

File: src/config/schema.ts

Add new fields to the model failover configuration:

```typescript
// Before
interface ModelFailoverConfig {
  fallbacks: string[];
}

// After
interface ModelFailoverConfig {
  fallbacks: string[];
  retrySameProfileOnTimeout: number;  // Default: 1
  retryBackoffMs: [number, number];   // Default: [300, 1200] ms (min, max jitter)
}
```
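
Since the `[min, max]` ordering of `retryBackoffMs` is easy to get wrong, a runtime guard at config-load time is worth considering. A hypothetical sketch — `validateModelFailoverConfig` is not an existing function in the codebase:

```typescript
interface ModelFailoverConfig {
  fallbacks: string[];
  retrySameProfileOnTimeout: number;
  retryBackoffMs: [number, number];
}

// Hypothetical guard: rejects obviously invalid retry settings early
function validateModelFailoverConfig(cfg: ModelFailoverConfig): string[] {
  const errors: string[] = [];
  const [min, max] = cfg.retryBackoffMs;
  if (min < 0 || max < 0) errors.push("retryBackoffMs values must be non-negative");
  if (min > max) errors.push("retryBackoffMs must be [min, max] with min <= max");
  if (!Number.isInteger(cfg.retrySameProfileOnTimeout) || cfg.retrySameProfileOnTimeout < 0) {
    errors.push("retrySameProfileOnTimeout must be a non-negative integer");
  }
  return errors;
}

console.log(validateModelFailoverConfig({
  fallbacks: [],
  retrySameProfileOnTimeout: 1,
  retryBackoffMs: [1200, 300], // reversed on purpose
}));
```

Failing fast here turns a silent negative-jitter bug into an explicit startup error.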

Step 2: Implement Retry Gate in Embedded Runner

File: src/agents/pi-embedded-runner/run.ts

Modify the timeout handling to include retry logic:

```typescript
// Before
async function executeWithAuthProfile(provider, profile, request) {
  try {
    return await executeRequest(request, { timeout: timeoutMs });
  } catch (error) {
    if (isTimeout(error)) {
      markAuthProfileFailure(profile, { reason: "timeout" });
      advanceAuthProfile(provider);
      throw new NoAvailableAuthProfileError(provider);
    }
    // ... rate limit handling
  }
}

// After
async function executeWithAuthProfile(provider, profile, request, options = {}) {
  const config = getConfig();
  const {
    retrySameProfileOnTimeout = 1,
    retryBackoffMs = [300, 1200]
  } = config.agents?.defaults?.modelFailover ?? {};

  // Track retries per-profile per-session
  const retryState = getOrCreateRetryState(profile.id);

  try {
    return await executeRequest(request, { timeout: timeoutMs });
  } catch (error) {
    if (isTimeout(error)) {
      const maxRetries = retrySameProfileOnTimeout;
      const currentRetries = retryState.consecutiveTimeouts;

      if (currentRetries < maxRetries) {
        // Retry same profile with jittered backoff
        const [minDelay, maxDelay] = retryBackoffMs;
        const delay = minDelay + Math.random() * (maxDelay - minDelay);

        console.log(
          `Profile ${profile.id} timed out. ` +
          `Retry ${currentRetries + 1}/${maxRetries} in ${Math.round(delay)}ms...`
        );

        retryState.consecutiveTimeouts++;
        await sleep(delay);

        // Re-execute on same profile (no cooldown written)
        return await executeWithAuthProfile(
          provider, profile, request,
          { ...options, isRetry: true }
        );
      }

      // Retries exhausted: apply cooldown + rotate
      console.log(
        `Profile ${profile.id} timed out (${maxRetries} retries exhausted). ` +
        `Trying next account...`
      );

      markAuthProfileFailure(profile, { reason: "timeout" });
      clearRetryState(profile.id); // Reset retry counter
      advanceAuthProfile(provider);
      throw new NoAvailableAuthProfileError(provider);
    }

    // Rate-limit handling unchanged (immediate cooldown)
    if (isRateLimit(error)) {
      markAuthProfileFailure(profile, { reason: "rate_limit" });
      clearRetryState(profile.id);
      advanceAuthProfile(provider);
      throw new NoAvailableAuthProfileError(provider);
    }

    throw error;
  }
}
```

Step 3: Add Retry State Management

File: src/agents/auth-profiles/retry-state.ts (new file)

```typescript
interface RetryState {
  consecutiveTimeouts: number;
  lastRetryTimestamp: number;
}

const retryStateMap = new Map<string, RetryState>();

export function getOrCreateRetryState(profileId: string): RetryState {
  if (!retryStateMap.has(profileId)) {
    retryStateMap.set(profileId, { consecutiveTimeouts: 0, lastRetryTimestamp: 0 });
  }
  return retryStateMap.get(profileId)!;
}

export function clearRetryState(profileId: string): void {
  retryStateMap.delete(profileId);
}

export function clearAllRetryStates(): void {
  retryStateMap.clear();
}
```
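
The intended lifecycle of this state can be exercised standalone. A self-contained sketch (assuming profile IDs are stable strings; the module is inlined here so the snippet runs on its own):

```typescript
// Inlined copy of the retry-state module for illustration
interface RetryState {
  consecutiveTimeouts: number;
  lastRetryTimestamp: number;
}

const retryStateMap = new Map<string, RetryState>();

function getOrCreateRetryState(profileId: string): RetryState {
  if (!retryStateMap.has(profileId)) {
    retryStateMap.set(profileId, { consecutiveTimeouts: 0, lastRetryTimestamp: 0 });
  }
  return retryStateMap.get(profileId)!;
}

function clearRetryState(profileId: string): void {
  retryStateMap.delete(profileId);
}

// Lifecycle: a timeout increments; success or rotation clears
const state = getOrCreateRetryState("openai-codex:default");
state.consecutiveTimeouts++;            // first timeout observed
console.log(state.consecutiveTimeouts); // 1
clearRetryState("openai-codex:default"); // request later succeeded
console.log(getOrCreateRetryState("openai-codex:default").consecutiveTimeouts); // 0
```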

Step 4: Update Default Configuration

File: src/config/defaults.ts

```typescript
// Before
export const defaultAgentsConfig = {
  defaults: {
    timeoutSeconds: 30,
    modelFailover: { fallbacks: [] }
  }
};

// After
export const defaultAgentsConfig = {
  defaults: {
    timeoutSeconds: 30,
    modelFailover: {
      fallbacks: [],
      retrySameProfileOnTimeout: 1,
      retryBackoffMs: [300, 1200]
    }
  }
};
```

Configuration After Fix

```json5
{
  "agents": {
    "defaults": {
      "timeoutSeconds": 30,
      "modelFailover": {
        "fallbacks": ["gpt-4-turbo", "claude-3-opus"],
        "retrySameProfileOnTimeout": 1,  // Retries before cooldown (0 = disabled)
        "retryBackoffMs": [300, 1200]    // [min, max] jittered delay in ms
      }
    }
  }
}
```

Optional: Per-Reason Cooldown Schedules

For a more sophisticated fix, differentiate cooldown schedules by reason:

File: src/agents/auth-profiles/usage.ts

```typescript
const COOLDOWN_SCHEDULES = {
  timeout: {
    baseMs: 10_000,  // 10 seconds (vs 60s for rate-limit)
    multiplier: 2,   // 10s → 20s → 40s → 80s
    capMs: 300_000   // 5 minutes cap (vs 1 hour)
  },
  rate_limit: {
    baseMs: 60_000,
    multiplier: 5,   // 60s → 5m → 25m → 1h
    capMs: 3_600_000 // 1 hour cap
  }
};

export function calculateAuthProfileCooldownMs(
  errorCount: number,
  reason: 'timeout' | 'rate_limit'
): number {
  const schedule = COOLDOWN_SCHEDULES[reason];
  const cooldown =
    schedule.baseMs * Math.pow(schedule.multiplier, Math.min(errorCount - 1, 3));
  return Math.min(cooldown, schedule.capMs);
}
```
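
Comparing the two schedules side by side shows how much gentler the timeout path becomes. A runnable sketch that reproduces the schedule constants above:

```typescript
const COOLDOWN_SCHEDULES = {
  timeout:    { baseMs: 10_000, multiplier: 2, capMs: 300_000 },
  rate_limit: { baseMs: 60_000, multiplier: 5, capMs: 3_600_000 },
} as const;

function cooldownMs(errorCount: number, reason: keyof typeof COOLDOWN_SCHEDULES): number {
  const s = COOLDOWN_SCHEDULES[reason];
  // Exponent is clamped at 3, matching the four-step schedule
  return Math.min(s.baseMs * Math.pow(s.multiplier, Math.min(errorCount - 1, 3)), s.capMs);
}

for (let n = 1; n <= 4; n++) {
  console.log(`n=${n}: timeout=${cooldownMs(n, "timeout") / 1000}s, ` +
              `rate_limit=${cooldownMs(n, "rate_limit") / 1000}s`);
}
// n=1: timeout=10s, rate_limit=60s
// n=2: timeout=20s, rate_limit=300s
// n=3: timeout=40s, rate_limit=1500s
// n=4: timeout=80s, rate_limit=3600s
```

A fourth consecutive timeout benches a profile for 80 seconds instead of an hour, so transient slowness can no longer exhaust the whole pool for long.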

🧪 Verification

Unit Test: Single Timeout Retries Same Profile

File: src/agents/pi-embedded-runner/__tests__/timeout-retry.test.ts

```typescript
describe('Timeout retry behavior', () => {
  const mockProfile = { id: 'test-profile', provider: 'openai-codex' };

  beforeEach(() => {
    clearAllRetryStates();
  });

  test('single timeout retries same profile without cooldown', async () => {
    const executeRequest = jest.fn()
      .mockRejectedValueOnce(new TimeoutError())
      .mockResolvedValueOnce({ data: 'success' });

    const markAuthProfileFailure = jest.fn();
    const advanceAuthProfile = jest.fn();

    await executeWithAuthProfile('openai-codex', mockProfile, mockRequest, {
      executeRequest,
      markAuthProfileFailure,
      advanceAuthProfile,
      config: { retrySameProfileOnTimeout: 1, retryBackoffMs: [0, 10] }
    });

    // Verify retry occurred
    expect(executeRequest).toHaveBeenCalledTimes(2);

    // Verify NO cooldown was written
    expect(markAuthProfileFailure).not.toHaveBeenCalled();

    // Verify NO rotation occurred
    expect(advanceAuthProfile).not.toHaveBeenCalled();
  });

  test('exhausted retries trigger cooldown and rotation', async () => {
    const executeRequest = jest.fn().mockRejectedValue(new TimeoutError());

    const markAuthProfileFailure = jest.fn();
    const advanceAuthProfile = jest.fn();

    await expect(
      executeWithAuthProfile('openai-codex', mockProfile, mockRequest, {
        executeRequest,
        markAuthProfileFailure,
        advanceAuthProfile,
        config: { retrySameProfileOnTimeout: 1, retryBackoffMs: [0, 10] }
      })
    ).rejects.toThrow(NoAvailableAuthProfileError);

    // Verify retries were exhausted
    expect(executeRequest).toHaveBeenCalledTimes(2);

    // Verify cooldown WAS written
    expect(markAuthProfileFailure).toHaveBeenCalledWith(
      mockProfile,
      { reason: 'timeout' }
    );

    // Verify rotation occurred
    expect(advanceAuthProfile).toHaveBeenCalledWith('openai-codex');
  });

  test('rate-limit triggers immediate cooldown (no retry)', async () => {
    const executeRequest = jest.fn().mockRejectedValue({
      status: 429,
      code: 'rate_limit_exceeded'
    });

    const markAuthProfileFailure = jest.fn();
    const advanceAuthProfile = jest.fn();

    await expect(
      executeWithAuthProfile('openai-codex', mockProfile, mockRequest, {
        executeRequest,
        markAuthProfileFailure,
        advanceAuthProfile,
        config: { retrySameProfileOnTimeout: 1, retryBackoffMs: [0, 10] }
      })
    ).rejects.toThrow(NoAvailableAuthProfileError);

    // Verify NO retry for rate-limit
    expect(executeRequest).toHaveBeenCalledTimes(1);
    expect(markAuthProfileFailure).toHaveBeenCalledWith(
      mockProfile,
      { reason: 'rate_limit' }
    );
  });
});
```

Integration Test: Multiple Profiles + Intermittent Timeouts

```typescript
test('intermittent timeouts do not exhaust all profiles', async () => {
  const profiles = [
    { id: 'profile-1', provider: 'openai-codex' },
    { id: 'profile-2', provider: 'openai-codex' },
    { id: 'profile-3', provider: 'openai-codex' }
  ];

  // profile-1: succeeds immediately
  // profile-2: times out on every attempt → retry → timeout → cooldown
  // profile-3: succeeds immediately
  const executeRequest = jest.fn().mockImplementation(({ profile }) => {
    if (profile.id === 'profile-2') return Promise.reject(new TimeoutError());
    return Promise.resolve({ data: 'ok' });
  });

  const result = await runWithAuthProfiles(profiles, mockRequest, {
    executeRequest,
    config: { retrySameProfileOnTimeout: 1, retryBackoffMs: [0, 10] }
  });

  // Should succeed using profile-1 or profile-3
  expect(result).toBeDefined();

  // Only profile-2's cooldown should be recorded
  expect(getProfileCooldown('profile-2')).toBeDefined();
  expect(getProfileCooldown('profile-3')).toBeUndefined();
});
```

Manual Verification Steps

```bash
# 1. Enable debug logging
export OPENCLAW_LOG_LEVEL=debug

# 2. Run agent with a known timeout-prone scenario
openclaw run --agent ./test-agent.ts --timeout-seconds 5

# 3. Expected log output with fix:
#    [DEBUG] Profile openai-codex:default timed out. Retry 1/1 in 847ms...
#    [DEBUG] Request succeeded on retry
#    NOT: "Trying next account..." on the first timeout

# 4. After fix, when retries are exhausted:
#    [INFO] Profile openai-codex:default timed out (1 retries exhausted). Trying next account...
#    [INFO] Cooldown applied: 10000ms for timeout reason
```

Verification Checklist

| Criterion | Test Method | Expected Result |
| --- | --- | --- |
| Single timeout retries same profile | Unit test | 2 `executeRequest` calls, 0 cooldown writes |
| Retries exhausted → cooldown | Unit test | `markAuthProfileFailure` called with `reason: "timeout"` |
| Rate-limit bypasses retry | Unit test | 1 `executeRequest` call, immediate cooldown |
| Log output correct | Manual test | Retry count + delay shown before cooldown |
| Profile exhaustion prevention | Integration test | 3 intermittent timeouts use 2 profiles max |

โš ๏ธ Common Pitfalls

Edge Cases and Environment-Specific Traps

  • Jitter range too narrow: If retryBackoffMs is too small (e.g., [1, 10]), retries may hit the same transient issue immediately. Recommended minimum: [300, 1200]
  • Infinite retry loop risk: If retrySameProfileOnTimeout is set very high without a global timeout, requests may hang indefinitely. Always pair with timeoutSeconds
  • Retry state leakage between sessions: Ensure clearRetryState() is called on profile rotation success to prevent stale retry counts
  • Memory pressure in long-running processes: The retry-state map is keyed by profile ID strings, so entries must be cleaned up explicitly (clearRetryState on success/rotation, clearAllRetryStates on shutdown); a WeakMap only helps if profile objects themselves are used as keys
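
The jitter pitfall above can be checked mechanically. A sketch of the delay computation with a guard against a reversed range (the `jitteredDelayMs` helper is illustrative, not a function in the codebase):

```typescript
// Jittered delay in [min, max]; throws on a reversed range rather than
// silently producing negative jitter.
function jitteredDelayMs([min, max]: [number, number]): number {
  if (min > max) throw new Error(`retryBackoffMs reversed: [${min}, ${max}]`);
  return min + Math.random() * (max - min);
}

for (let i = 0; i < 5; i++) {
  const d = jitteredDelayMs([300, 1200]);
  console.log(Math.round(d)); // always within [300, 1200]
}
```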

macOS-Specific Considerations

```bash
# Network latency characteristics differ on macOS. To simulate latency
# for testing, the built-in dummynet traffic shaper (dnctl/pfctl) can
# inject artificial delay; exact setup depends on the macOS version.
```

Docker-Specific Considerations

```bash
# Container network timeouts may vary with resource constraints.
# Ensure the container has adequate resources for timeout handling:
docker run --memory=512m --cpus=1 ...
```

Windows-Specific Considerations

```powershell
# PowerShell sleep precision differs from Unix.
# Ensure the sleep implementation uses a monotonic clock:
[System.Diagnostics.Stopwatch]::GetTimestamp()
```

Configuration Pitfalls

typescript // โŒ WRONG: retryBackoffMs reversed (min > max) { retryBackoffMs: [1200, 300] }

// โœ… CORRECT: [min, max] { retryBackoffMs: [300, 1200] }

// โŒ WRONG: retrySameProfileOnTimeout = 0 disables all timeout handling // (should be “retry on timeout disabled”, not “infinite retries”) { retrySameProfileOnTimeout: 0 }

// โœ… CORRECT: To disable, use a large backoff or separate config { retrySameProfileOnTimeout: 0, timeoutRetriesEnabled: false }

Interaction with Existing Fallback Behavior

When agents.defaults.model.fallbacks is configured, the retry behavior applies per-provider:

```json5
{
  "agents": {
    "defaults": {
      "modelFailover": {
        "fallbacks": ["gpt-4", "claude-3"],
        "retrySameProfileOnTimeout": 1,
        "retryBackoffMs": [500, 2000]
      }
    }
  }
}
```

Sequence with fix:

  1. Request to gpt-4-turbo with openai-codex:profile-1 times out
  2. Retry on the same profile-1 (no cooldown written)
  3. Retry fails → cooldown + rotate to profile-2
  4. If profile-2 is also exhausted → fall back to gpt-4 (fresh profiles)

Contextually Connected Error Codes and Historical Issues

| Error / Issue | Description | Relationship |
| --- | --- | --- |
| `NoAvailableAuthProfileError` | Thrown when all profiles are in cooldown | Primary symptom of aggressive timeout handling |
| `Profile ${id} timed out (possible rate limit)` | Misleading log message | Implies rate limiting where only a timeout occurred |
| `MARK_AUTH_PROFILE_FAILURE` | Auth profile failure tracking | Core mechanism that needs a retry gate |
| HTTP 429 | Explicit rate-limit signal | Correct trigger for cooldown (should remain unchanged) |
| `error.code === "insufficient_quota"` | Provider-specific quota error | Strong signal, should bypass retry |

Related Configuration Parameters

| Parameter | Current Behavior | Issue |
| --- | --- | --- |
| `agents.defaults.timeoutSeconds` | Triggers profile rotation | Too aggressive for transient timeouts |
| `agents.defaults.modelFailover.fallbacks` | Triggered when all profiles exhausted | Unnecessarily triggered by a single timeout |
| `agents.defaults.maxConcurrentRequests` | May compound timeout issues | High concurrency + timeouts = faster profile exhaustion |

Historical Context

This issue manifests differently based on configuration:

  • High traffic deployments: Multiple simultaneous timeouts can exhaust all profiles quickly
  • Low traffic deployments: Single timeout may be the only signal, yet still causes fallback
  • Shared infrastructure: One team's timeout affects other teams' profile availability

Related Files

  • src/agents/pi-embedded-runner/run.ts - Embedded runner auth loop
  • src/agents/auth-profiles/usage.ts - Cooldown calculation
  • src/agents/auth-profiles/cooldown-store.ts - Persistent cooldown state
  • src/config/schema.ts - Configuration type definitions
  • src/errors/auth-profile-errors.ts - Error class definitions

Evidence & Sources

This troubleshooting guide was automatically synthesized by the FixClaw Intelligence Pipeline from community discussions.