# Timeout-Driven Auth Rotation Prematurely Triggers Provider Fallback

Generic request timeouts are incorrectly treated as rate-limit signals, triggering aggressive auth-profile cooldown and rotation that cascades into provider/model fallback even when the provider is only temporarily slow.
## Symptoms

### Primary Error Messages

When a request timeout occurs on a provider supporting `auth.profiles`, the embedded runner emits cascading failure messages:

```
Profile openai-codex:default timed out (possible rate limit). Trying next account...
No available auth profile for openai-codex (all in cooldown or unavailable).
... provider=openai model=gpt-5.2 ... # fallback triggered
```
### Observable Behavior

- Premature profile exhaustion: a single timeout on one profile causes immediate rotation to the next available profile
- Cooldown state accumulation: each timeout writes a cooldown entry with exponential backoff (~1m → 5m → 25m → 1h cap)
- Unnecessary model fallback: when all profiles enter cooldown, the system proceeds to configured model fallbacks even if the original provider is operational
- Log noise: repeated `timed out (possible rate limit)` messages create confusion about the actual rate-limiting status
### Reproduction Scenario

```bash
# Trigger: a single request exceeds the timeoutSeconds threshold
openclaw run --agent ./my-agent.ts --timeout-seconds 30

# Observed: immediate auth profile rotation without retry
# Expected: at least one retry with backoff before rotation
```
### Affected Components

| Component | File Path | Failure Point |
|---|---|---|
| Embedded Runner | `src/agents/pi-embedded-runner/run.ts` | Timeout → `markAuthProfileFailure()` → `advanceAuthProfile()` |
| Auth Profiles | `src/agents/auth-profiles/usage.ts` | Uniform cooldown schedule for timeout and rate-limit reasons |
## Root Cause

### Architectural Analysis

The auth-profile failover loop in the embedded runner conflates two distinct failure modes:

- Strong rate-limit signals: HTTP 429, provider-specific error codes (e.g., `error.code === "rate_limit_exceeded"`)
- Weak transient signals: generic request timeouts (network blip, slow streaming, SDK latency spike)
### Code Path Breakdown

File: `src/agents/pi-embedded-runner/run.ts`

The timeout handler executes without a retry gate:

```typescript
// Simplified flow (line numbers approximate)
async function executeWithAuthProfile(provider, profile, request) {
  try {
    const result = await executeRequest(request, { timeout: timeoutMs });
    return result;
  } catch (error) {
    if (isTimeout(error)) {
      // ❌ No retry gate - immediate failure marking
      markAuthProfileFailure(profile, { reason: "timeout" });
      advanceAuthProfile(provider); // ❌ triggers rotation
      throw new NoAvailableAuthProfileError(provider);
    }
    if (isRateLimit(error)) {
      // ✅ Correct: strong signal warrants immediate cooldown
      markAuthProfileFailure(profile, { reason: "rate_limit" });
      advanceAuthProfile(provider);
      throw new NoAvailableAuthProfileError(provider);
    }
  }
}
```
File: `src/agents/auth-profiles/usage.ts`

The cooldown calculation applies an identical exponential schedule for all failure reasons:

```typescript
function calculateAuthProfileCooldownMs(errorCount: number): number {
  // ~1m → 5m → 25m → 1h cap
  const baseMs = 60_000;
  const cooldown = baseMs * Math.pow(5, Math.min(errorCount - 1, 3));
  return Math.min(cooldown, 3_600_000); // 1-hour cap
}

// Called identically for "timeout" and "rate_limit" reasons
```
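To see why a uniform schedule is punishing for transient timeouts, the formula above can be tabulated directly. This is a self-contained sketch that reproduces the schedule from `usage.ts`:

```typescript
// Standalone reproduction of the uniform cooldown schedule above.
function calculateAuthProfileCooldownMs(errorCount: number): number {
  const baseMs = 60_000; // 1 minute
  const cooldown = baseMs * Math.pow(5, Math.min(errorCount - 1, 3));
  return Math.min(cooldown, 3_600_000); // 1-hour cap
}

// A single timeout already benches the profile for a full minute;
// by the fourth consecutive error the profile is out for an hour.
const schedule = [1, 2, 3, 4].map(calculateAuthProfileCooldownMs);
console.log(schedule); // [ 60000, 300000, 1500000, 3600000 ]
```

One transient network blip therefore costs the same as a confirmed HTTP 429, which is exactly the asymmetry the fix below addresses.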
### Failure Cascade Sequence

1. A request timeout occurs
2. `markAuthProfileFailure(reason: "timeout")` writes a cooldown entry
3. `advanceAuthProfile()` rotates to the next profile
4. If all profiles are unavailable:
   a. `NoAvailableAuthProfileError` is raised
   b. `agents.defaults.modelFailover.fallbacks` is checked
   c. The request proceeds to the fallback model/provider (prematurely)
5. If no fallbacks are configured, the request fails entirely
### Why This Is Incorrect
| Signal Type | Reliability | Appropriate Response |
|---|---|---|
| HTTP 429 | High | Immediate cooldown + rotate |
| Provider error code | High | Immediate cooldown + rotate |
| Generic timeout | Low (transient) | Retry with backoff before cooldown |
Generic timeouts are indistinguishable from:
- Temporary network latency spikes
- Slow streaming response initiation
- SDK connection overhead
- Temporary provider-side load
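One way to make this distinction explicit in code is a small classifier that maps errors to signal strength. The sketch below is illustrative: the helper name `classifyFailureSignal` and the error shape are assumptions, not existing OpenClaw APIs.

```typescript
type FailureSignal = "strong_rate_limit" | "weak_transient" | "other";

// Assumed minimal error shape for illustration.
interface ProviderError {
  status?: number;
  code?: string;
  isTimeout?: boolean;
}

// Hypothetical classifier: only strong signals should trigger immediate
// cooldown + rotation; weak signals deserve a same-profile retry first.
function classifyFailureSignal(error: ProviderError): FailureSignal {
  if (error.status === 429) return "strong_rate_limit";
  if (error.code === "rate_limit_exceeded") return "strong_rate_limit";
  if (error.code === "insufficient_quota") return "strong_rate_limit";
  if (error.isTimeout) return "weak_transient";
  return "other";
}

console.log(classifyFailureSignal({ status: 429 }));     // "strong_rate_limit"
console.log(classifyFailureSignal({ isTimeout: true })); // "weak_transient"
```

Centralizing the classification also keeps the runner's failover loop free of scattered status-code checks.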
### Configuration Gap

No configuration exists to control per-reason retry behavior:

```typescript
// Current: no retrySameProfileOnTimeout config exists
agents: {
  defaults: {
    timeoutSeconds: 30,
    modelFailover: {
      // Missing: retrySameProfileOnTimeout, retryBackoffMs
    }
  }
}
```
## Step-by-Step Fix

### Recommended: Minimal Retry Gate Addition

This fix adds a per-reason retry gate for timeout failures before triggering cooldown and rotation.
### Step 1: Extend the Configuration Schema

File: `src/config/schema.ts`

Add new fields to the model failover configuration:

```typescript
// Before
interface ModelFailoverConfig {
  fallbacks: string[];
}

// After
interface ModelFailoverConfig {
  fallbacks: string[];
  retrySameProfileOnTimeout: number; // Default: 1
  retryBackoffMs: [number, number];  // Default: [300, 1200] ms (min, max jitter)
}
```
### Step 2: Implement the Retry Gate in the Embedded Runner

File: `src/agents/pi-embedded-runner/run.ts`

Modify the timeout handling to include retry logic. Note that the retry counter is also cleared on success, so a profile that recovers after a retry starts fresh:

```typescript
// Before
async function executeWithAuthProfile(provider, profile, request) {
  try {
    return await executeRequest(request, { timeout: timeoutMs });
  } catch (error) {
    if (isTimeout(error)) {
      markAuthProfileFailure(profile, { reason: "timeout" });
      advanceAuthProfile(provider);
      throw new NoAvailableAuthProfileError(provider);
    }
    // ... rate limit handling
  }
}

// After
async function executeWithAuthProfile(provider, profile, request, options = {}) {
  const config = getConfig();
  const {
    retrySameProfileOnTimeout = 1,
    retryBackoffMs = [300, 1200]
  } = config.agents?.defaults?.modelFailover ?? {};

  // Track retries per-profile per-session
  const retryState = getOrCreateRetryState(profile.id);

  try {
    const result = await executeRequest(request, { timeout: timeoutMs });
    clearRetryState(profile.id); // Success: reset the timeout counter
    return result;
  } catch (error) {
    if (isTimeout(error)) {
      const maxRetries = retrySameProfileOnTimeout;
      const currentRetries = retryState.consecutiveTimeouts;

      if (currentRetries < maxRetries) {
        // Retry the same profile with jittered backoff
        const [minDelay, maxDelay] = retryBackoffMs;
        const delay = minDelay + Math.random() * (maxDelay - minDelay);
        console.log(
          `Profile ${profile.id} timed out. ` +
          `Retry ${currentRetries + 1}/${maxRetries} in ${Math.round(delay)}ms...`
        );
        retryState.consecutiveTimeouts++;
        await sleep(delay);
        // Re-execute on the same profile (no cooldown written)
        return await executeWithAuthProfile(
          provider, profile, request,
          { ...options, isRetry: true }
        );
      }

      // Retries exhausted: apply cooldown + rotate
      console.log(
        `Profile ${profile.id} timed out (${maxRetries} retries exhausted). ` +
        `Trying next account...`
      );
      markAuthProfileFailure(profile, { reason: "timeout" });
      clearRetryState(profile.id); // Reset retry counter
      advanceAuthProfile(provider);
      throw new NoAvailableAuthProfileError(provider);
    }

    // Rate-limit handling unchanged (immediate cooldown)
    if (isRateLimit(error)) {
      markAuthProfileFailure(profile, { reason: "rate_limit" });
      clearRetryState(profile.id);
      advanceAuthProfile(provider);
      throw new NoAvailableAuthProfileError(provider);
    }

    throw error;
  }
}
```
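The jittered delay used in the retry gate above can be factored into a small helper, which also makes its bounds testable. A sketch (the name `jitteredDelayMs` is illustrative, not an existing OpenClaw function):

```typescript
// Uniform jitter in [minMs, maxMs): spreads retries out so that many
// concurrent requests hitting a slow provider do not retry in lockstep.
function jitteredDelayMs([minMs, maxMs]: [number, number]): number {
  return minMs + Math.random() * (maxMs - minMs);
}

// With the default [300, 1200], every delay stays inside the band.
for (let i = 0; i < 1000; i++) {
  const d = jitteredDelayMs([300, 1200]);
  if (d < 300 || d >= 1200) throw new Error("delay out of bounds");
}
console.log("all delays within [300, 1200)");
```

Keeping the jitter uniform rather than exponential is deliberate here: the gate allows only a handful of same-profile retries, so there is little to gain from growth between attempts.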
### Step 3: Add Retry State Management

File: `src/agents/auth-profiles/retry-state.ts` (new file)

```typescript
interface RetryState {
  consecutiveTimeouts: number;
  lastRetryTimestamp: number;
}

const retryStateMap = new Map<string, RetryState>();

export function getOrCreateRetryState(profileId: string): RetryState {
  if (!retryStateMap.has(profileId)) {
    retryStateMap.set(profileId, { consecutiveTimeouts: 0, lastRetryTimestamp: 0 });
  }
  return retryStateMap.get(profileId)!;
}

export function clearRetryState(profileId: string): void {
  retryStateMap.delete(profileId);
}

export function clearAllRetryStates(): void {
  retryStateMap.clear();
}
```
### Step 4: Update the Default Configuration

File: `src/config/defaults.ts`

```typescript
// Before
export const defaultAgentsConfig = {
  defaults: {
    timeoutSeconds: 30,
    modelFailover: { fallbacks: [] }
  }
};

// After
export const defaultAgentsConfig = {
  defaults: {
    timeoutSeconds: 30,
    modelFailover: {
      fallbacks: [],
      retrySameProfileOnTimeout: 1,
      retryBackoffMs: [300, 1200]
    }
  }
};
```
### Configuration After Fix

```json5
{
  "agents": {
    "defaults": {
      "timeoutSeconds": 30,
      "modelFailover": {
        "fallbacks": ["gpt-4-turbo", "claude-3-opus"],
        "retrySameProfileOnTimeout": 1, // Retries before cooldown (0 = disabled)
        "retryBackoffMs": [300, 1200]   // [min, max] jittered delay in ms
      }
    }
  }
}
```
### Optional: Per-Reason Cooldown Schedules

For a more thorough fix, differentiate the cooldown schedules by reason:

File: `src/agents/auth-profiles/usage.ts`

```typescript
const COOLDOWN_SCHEDULES = {
  timeout: {
    baseMs: 10_000,  // 10 seconds (vs 60s for rate-limit)
    multiplier: 2,   // 10s → 20s → 40s → 80s
    capMs: 300_000   // 5-minute cap (vs 1 hour)
  },
  rate_limit: {
    baseMs: 60_000,
    multiplier: 5,   // 60s → 5m → 25m → 1h
    capMs: 3_600_000 // 1-hour cap
  }
};

export function calculateAuthProfileCooldownMs(
  errorCount: number,
  reason: 'timeout' | 'rate_limit'
): number {
  const schedule = COOLDOWN_SCHEDULES[reason];
  const cooldown =
    schedule.baseMs * Math.pow(schedule.multiplier, Math.min(errorCount - 1, 3));
  return Math.min(cooldown, schedule.capMs);
}
```
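Under a per-reason schedule, the first few cooldowns diverge sharply by reason. A quick self-contained check (the function is renamed `cooldownMs` here only to keep the sketch standalone; it mirrors the logic above):

```typescript
const COOLDOWN_SCHEDULES = {
  timeout:    { baseMs: 10_000, multiplier: 2, capMs: 300_000 },
  rate_limit: { baseMs: 60_000, multiplier: 5, capMs: 3_600_000 }
} as const;

function cooldownMs(errorCount: number, reason: keyof typeof COOLDOWN_SCHEDULES): number {
  const s = COOLDOWN_SCHEDULES[reason];
  return Math.min(
    s.baseMs * Math.pow(s.multiplier, Math.min(errorCount - 1, 3)),
    s.capMs
  );
}

// First failure: a 10s bench for a timeout vs a full minute for a 429.
console.log(cooldownMs(1, "timeout"), cooldownMs(1, "rate_limit")); // 10000 60000
// Fourth failure: 80s vs the 1-hour cap.
console.log(cooldownMs(4, "timeout"), cooldownMs(4, "rate_limit")); // 80000 3600000
```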
## Verification

### Unit Tests: Timeout Retry Behavior

File: `src/agents/pi-embedded-runner/__tests__/timeout-retry.test.ts`

```typescript
describe('Timeout retry behavior', () => {
  const mockProfile = { id: 'test-profile', provider: 'openai-codex' };

  beforeEach(() => {
    clearAllRetryStates();
  });

  test('single timeout retries same profile without cooldown', async () => {
    const executeRequest = jest.fn()
      .mockRejectedValueOnce(new TimeoutError())
      .mockResolvedValueOnce({ data: 'success' });
    const markAuthProfileFailure = jest.fn();
    const advanceAuthProfile = jest.fn();

    await executeWithAuthProfile('openai-codex', mockProfile, mockRequest, {
      executeRequest,
      markAuthProfileFailure,
      advanceAuthProfile,
      config: { retrySameProfileOnTimeout: 1, retryBackoffMs: [0, 10] }
    });

    // Verify the retry occurred
    expect(executeRequest).toHaveBeenCalledTimes(2);
    // Verify NO cooldown was written
    expect(markAuthProfileFailure).not.toHaveBeenCalled();
    // Verify NO rotation occurred
    expect(advanceAuthProfile).not.toHaveBeenCalled();
  });

  test('exhausted retries trigger cooldown and rotation', async () => {
    const executeRequest = jest.fn().mockRejectedValue(new TimeoutError());
    const markAuthProfileFailure = jest.fn();
    const advanceAuthProfile = jest.fn();

    await expect(
      executeWithAuthProfile('openai-codex', mockProfile, mockRequest, {
        executeRequest,
        markAuthProfileFailure,
        advanceAuthProfile,
        config: { retrySameProfileOnTimeout: 1, retryBackoffMs: [0, 10] }
      })
    ).rejects.toThrow(NoAvailableAuthProfileError);

    // Verify retries were exhausted
    expect(executeRequest).toHaveBeenCalledTimes(2);
    // Verify cooldown WAS written
    expect(markAuthProfileFailure).toHaveBeenCalledWith(
      mockProfile,
      { reason: 'timeout' }
    );
    // Verify rotation occurred
    expect(advanceAuthProfile).toHaveBeenCalledWith('openai-codex');
  });

  test('rate-limit triggers immediate cooldown (no retry)', async () => {
    const executeRequest = jest.fn().mockRejectedValue({
      status: 429,
      code: 'rate_limit_exceeded'
    });
    const markAuthProfileFailure = jest.fn();
    const advanceAuthProfile = jest.fn();

    await expect(
      executeWithAuthProfile('openai-codex', mockProfile, mockRequest, {
        executeRequest,
        markAuthProfileFailure,
        advanceAuthProfile,
        config: { retrySameProfileOnTimeout: 1, retryBackoffMs: [0, 10] }
      })
    ).rejects.toThrow(NoAvailableAuthProfileError);

    // Verify NO retry for rate-limit
    expect(executeRequest).toHaveBeenCalledTimes(1);
    expect(markAuthProfileFailure).toHaveBeenCalledWith(
      mockProfile,
      { reason: 'rate_limit' }
    );
  });
});
```
### Integration Test: Multiple Profiles + Intermittent Timeouts

```typescript
test('intermittent timeouts do not exhaust all profiles', async () => {
  const profiles = [
    { id: 'profile-1', provider: 'openai-codex' },
    { id: 'profile-2', provider: 'openai-codex' },
    { id: 'profile-3', provider: 'openai-codex' }
  ];

  // profile-1: times out on every attempt → retries exhausted → cooldown + rotate
  // profile-2: succeeds after the rotation
  // profile-3: never reached
  const executeRequest = jest.fn().mockImplementation(({ profile }) =>
    profile.id === 'profile-1'
      ? Promise.reject(new TimeoutError())
      : Promise.resolve({ data: 'ok' })
  );

  const result = await runWithAuthProfiles(profiles, mockRequest, {
    executeRequest,
    config: { retrySameProfileOnTimeout: 1, retryBackoffMs: [0, 10] }
  });

  // Succeeds via profile-2; only two profiles were touched
  expect(result).toBeDefined();
  expect(executeRequest).toHaveBeenCalledTimes(3); // 2 attempts on profile-1 + 1 on profile-2

  // Only profile-1 should have a cooldown recorded
  expect(getProfileCooldown('profile-1')).toBeDefined();
  expect(getProfileCooldown('profile-2')).toBeUndefined();
  expect(getProfileCooldown('profile-3')).toBeUndefined();
});
```
### Manual Verification Steps

```bash
# 1. Enable debug logging
export OPENCLAW_LOG_LEVEL=debug

# 2. Run an agent with a known timeout-prone scenario
openclaw run --agent ./test-agent.ts --timeout-seconds 5

# 3. Expected log output with the fix:
#    [DEBUG] Profile openai-codex:default timed out. Retry 1/1 in 847ms...
#    [DEBUG] Request succeeded on retry
#    NOT "Trying next account..." on the first timeout

# 4. After the fix, when retries are exhausted:
#    [INFO] Profile openai-codex:default timed out (1 retries exhausted). Trying next account...
#    [INFO] Cooldown applied: 10000ms for timeout reason
```
### Verification Checklist

| Criterion | Test Method | Expected Result |
|---|---|---|
| Single timeout retries same profile | Unit test | 2 `executeRequest` calls, 0 cooldown writes |
| Retries exhausted → cooldown | Unit test | `markAuthProfileFailure` called with `reason: "timeout"` |
| Rate-limit bypasses retry | Unit test | 1 `executeRequest` call, immediate cooldown |
| Log output correct | Manual test | Retry count + delay shown before cooldown |
| Profile exhaustion prevention | Integration test | Intermittent timeouts touch at most 2 profiles |
## Common Pitfalls

### Edge Cases and Environment-Specific Traps

- Jitter range too narrow: if `retryBackoffMs` is too small (e.g., `[1, 10]`), retries may hit the same transient issue immediately. Recommended minimum: `[300, 1200]`
- Infinite retry loop risk: if `retrySameProfileOnTimeout` is set very high without a global timeout, requests may hang for a long time. Always pair it with `timeoutSeconds`
- Retry state leakage between sessions: ensure `clearRetryState()` is called on success and on profile rotation to prevent stale retry counts
- Memory pressure in long-running processes: the retry state map is keyed by profile-id strings, so a `WeakMap` will not work here; use explicit cleanup or age-based eviction for stale entries
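The leakage and memory-pressure points can be addressed with age-based eviction on the retry-state map. The sketch below assumes that retry state older than a few minutes is no longer meaningful; `pruneRetryStates` is illustrative, not an existing OpenClaw API:

```typescript
interface RetryState {
  consecutiveTimeouts: number;
  lastRetryTimestamp: number; // ms since epoch
}

const retryStateMap = new Map<string, RetryState>();

// Evict entries whose last retry is older than maxAgeMs. A WeakMap cannot
// be used here because the keys are profile-id strings, not objects.
function pruneRetryStates(maxAgeMs: number, now: number = Date.now()): number {
  let evicted = 0;
  retryStateMap.forEach((state, profileId) => {
    if (now - state.lastRetryTimestamp > maxAgeMs) {
      retryStateMap.delete(profileId);
      evicted++;
    }
  });
  return evicted;
}

retryStateMap.set("stale", { consecutiveTimeouts: 1, lastRetryTimestamp: 0 });
retryStateMap.set("fresh", { consecutiveTimeouts: 1, lastRetryTimestamp: Date.now() });
console.log(pruneRetryStates(5 * 60_000)); // 1
```

Calling such a pruner on a timer (or lazily inside `getOrCreateRetryState`) bounds the map's size without needing per-profile lifecycle hooks.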
### macOS-Specific Considerations

macOS has no `tc(8)`-style traffic shaping. To simulate the network latency that triggers these timeouts, use dummynet (`dnctl`/`pfctl`) or the Network Link Conditioner from Xcode's Additional Tools.
### Docker-Specific Considerations

```bash
# Container network timeouts may vary by resource constraints.
# Ensure the container has adequate resources for timeout handling:
docker run --memory=512m --cpus=1 ...
```
### Windows-Specific Considerations

```powershell
# PowerShell sleep precision differs from Unix.
# For precise backoff measurement, use a monotonic timestamp:
[System.Diagnostics.Stopwatch]::GetTimestamp()
```
### Configuration Pitfalls

```typescript
// ❌ WRONG: retryBackoffMs reversed (min > max)
{ retryBackoffMs: [1200, 300] }

// ✅ CORRECT: [min, max]
{ retryBackoffMs: [300, 1200] }

// Note: retrySameProfileOnTimeout: 0 disables timeout retries and
// restores the old rotate-on-first-timeout behavior; it does NOT
// mean "infinite retries".
{ retrySameProfileOnTimeout: 0 }
```
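A defensive loader can reject the reversed-range mistake at startup rather than producing negative jitter at runtime. An illustrative sketch; `validateModelFailoverConfig` is not an existing OpenClaw function:

```typescript
interface ModelFailoverConfig {
  fallbacks: string[];
  retrySameProfileOnTimeout: number;
  retryBackoffMs: [number, number];
}

// Throws on configurations that would otherwise misbehave silently.
function validateModelFailoverConfig(cfg: ModelFailoverConfig): void {
  const [min, max] = cfg.retryBackoffMs;
  if (min < 0 || max < 0) throw new Error("retryBackoffMs values must be >= 0");
  if (min > max) throw new Error("retryBackoffMs must be [min, max] with min <= max");
  if (!Number.isInteger(cfg.retrySameProfileOnTimeout) || cfg.retrySameProfileOnTimeout < 0) {
    throw new Error("retrySameProfileOnTimeout must be a non-negative integer");
  }
}

validateModelFailoverConfig({
  fallbacks: [], retrySameProfileOnTimeout: 1, retryBackoffMs: [300, 1200]
}); // ok

try {
  validateModelFailoverConfig({
    fallbacks: [], retrySameProfileOnTimeout: 1, retryBackoffMs: [1200, 300]
  });
} catch {
  console.log("rejected reversed range");
}
```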
### Interaction with Existing Fallback Behavior

When `agents.defaults.modelFailover.fallbacks` is configured, the retry behavior applies per provider:

```json5
{
  "agents": {
    "defaults": {
      "modelFailover": {
        "fallbacks": ["gpt-4", "claude-3"],
        "retrySameProfileOnTimeout": 1,
        "retryBackoffMs": [500, 2000]
      }
    }
  }
}
```

Sequence with the fix:

1. A request to `gpt-4-turbo` with `openai-codex:profile-1` times out
2. The same `profile-1` is retried (no cooldown written)
3. The retry fails → cooldown is applied and rotation moves to `profile-2`
4. If `profile-2` is also exhausted → fall back to `gpt-4` (fresh profiles)
## Related Errors

### Contextually Connected Error Codes and Historical Issues

| Error / Issue | Description | Relationship |
|---|---|---|
| `NoAvailableAuthProfileError` | Thrown when all profiles are in cooldown | Primary symptom of aggressive timeout handling |
| `Profile ${id} timed out (possible rate limit)` | Misleading log message | Implies a rate limit where only a timeout occurred |
| `MARK_AUTH_PROFILE_FAILURE` | Auth profile failure tracking | Core mechanism that needs the retry gate |
| HTTP 429 | Explicit rate-limit signal | Correct trigger for cooldown (should remain unchanged) |
| `error.code === "insufficient_quota"` | Provider-specific quota error | Strong signal; should bypass retry |
### Related Configuration Parameters

| Parameter | Current Behavior | Issue |
|---|---|---|
| `agents.defaults.timeoutSeconds` | Triggers profile rotation | Too aggressive for transient timeouts |
| `agents.defaults.modelFailover.fallbacks` | Triggered when all profiles are exhausted | Unnecessarily triggered by a single timeout |
| `agents.defaults.maxConcurrentRequests` | May compound timeout issues | High concurrency + timeouts = faster profile exhaustion |
### Historical Context
This issue manifests differently based on configuration:
- High traffic deployments: Multiple simultaneous timeouts can exhaust all profiles quickly
- Low traffic deployments: Single timeout may be the only signal, yet still causes fallback
- Shared infrastructure: One team's timeout affects other teams' profile availability
### References to Related OpenClaw Components

- `src/agents/pi-embedded-runner/run.ts` - Embedded runner auth loop
- `src/agents/auth-profiles/usage.ts` - Cooldown calculation
- `src/agents/auth-profiles/cooldown-store.ts` - Persistent cooldown state
- `src/config/schema.ts` - Configuration type definitions
- `src/errors/auth-profile-errors.ts` - Error class definitions