April 26, 2026

Recurring Cron Jobs Skip Transient Error Retry — Only One-Shot Jobs Retry with Backoff

Recurring cron/every jobs do not detect transient errors and retry with exponential backoff. Instead, they wait until the next natural schedule time, potentially hours or days away, while one-shot jobs correctly retry.

🔍 Symptoms

Primary Manifestation

Recurring jobs (cron, every) fail to schedule immediate retries when encountering transient errors. The job silently defers to the next scheduled run time instead of attempting recovery.

CLI Observation Examples

One-shot job behavior (correct):

$ openclaw job:run my-network-job --type=at
INFO[0000] Starting one-shot job execution
ERROR[0005] Transient network error: connection reset
INFO[0005] Scheduling retry #1 in 30s (backoff: 1x)
INFO[0035] Retrying job execution
ERROR[0036] Transient network error: connection reset
INFO[0036] Scheduling retry #2 in 60s (backoff: 2x)
INFO[0096] Retrying job execution
INFO[0097] Job completed successfully

Recurring job behavior (broken):

$ openclaw job:run my-hourly-sync --type=cron --schedule="0 * * * *"
INFO[0000] Starting recurring job execution
ERROR[0003] Transient network error: connection reset
WARN[0003] Job failed; next run scheduled at 10:00:00 tomorrow (23h 57m away)
INFO[86397] Job triggered by scheduler
INFO[86400] Starting recurring job execution

Diagnostic Query Output

When querying job state via the OpenClaw CLI or API:

$ openclaw job:status my-hourly-sync --json
{
  "name": "my-hourly-sync",
  "type": "cron",
  "schedule": "0 * * * *",
  "state": {
    "status": "error",
    "consecutiveErrors": 1,
    "lastError": "ECONNRESET: network connection was reset",
    "lastRunAtMs": 1699574400000,
    "nextRunAtMs": 1699660800000,  // 24 hours later — backoff ignored
    "lastErrorType": "transient"
  }
}

The nextRunAtMs shows a 24-hour jump despite consecutiveErrors: 1 and a clearly transient error type.

Log Evidence

In openclaw logs --follow, recurring jobs produce:

level=error msg="Job execution failed" job=my-daily-report error="ETIMEDOUT: request timed out after 30000ms"
level=warn msg="Scheduling next run" job=my-daily-report next_run="2024-01-16T10:00:00Z" delay_seconds=86397
level=info msg="Skipping backoff retry" job=my-daily-report reason="normalNext (86397s) > backoffNext (30s)"

The warning log reveals the core issue: normalNext (86397s) > backoffNext (30s) causes the backoff to be ignored.

🧠 Root Cause

File Location

src/cron/service/timer.ts, lines 379–396 (one-shot path) vs. lines 416–434 (recurring path).

Technical Analysis

The timer service handles job completion differently based on job type. The divergence creates an asymmetry in retry behavior.

One-Shot Path (Correct Implementation)

typescript // Lines 379-396: One-shot job error handling const retryConfig = resolveRetryConfig(state.deps.cronConfig); const transient = isTransientCronError(result.error, retryConfig.retryOn); if (transient && consecutive <= retryConfig.maxAttempts) { // Transient error detected → schedule retry with backoff job.state.nextRunAtMs = result.endedAt + backoff; }

Logic: If the error is transient AND retry attempts remain, override the next run time to the backoff delay immediately.

Recurring Path (Broken Implementation)

typescript // Lines 416-434: Recurring job error handling const backoff = errorBackoffMs(job.state.consecutiveErrors ?? 1); let normalNext: number | undefined; // … compute normalNext from schedule …

const backoffNext = result.endedAt + backoff;

// Core bug: Math.max always selects the larger value job.state.nextRunAtMs = normalNext !== undefined ? Math.max(normalNext, backoffNext) : backoffNext;

Logic: Take whichever is later: the natural schedule time OR the backoff delay.

The Math.max Failure Mode

The Math.max(normalNext, backoffNext) pattern is fundamentally flawed for recurring jobs with long intervals:

Schedule	normalNext (seconds)	backoffNext (seconds, attempt 1)	Math.max result
`/5 * * *` (every 5 min)	300	30	300 ✅ (backoff wins on later attempts)
`0 * * * *` (hourly)	3600	30	3600 ❌
`0 10 * * *` (daily)	86400	30	86400 ❌

For schedules with intervals ≥ 60 seconds, the backoff delay is always smaller than the natural next run time, causing Math.max to consistently select the schedule delay.

Control Flow Diagram

Job Completes with Error │ ▼ ┌──────────────┐ │ Is one-shot? │──Yes──► Apply transient check + backoff override └──────────────┘ │ No ▼ ┌──────────────────────────────┐ │ Compute normalNext from │ │ schedule (could be 24h+) │ └──────────────────────────────┘ │ ▼ ┌──────────────────────────────┐ │ Compute backoffNext (30s+) │ └──────────────────────────────┘ │ ▼ ┌──────────────────────────────┐ │ nextRunAtMs = Math.max( │──► BACKOFF ALWAYS LOSES │ normalNext, backoffNext) │ for intervals > backoff └──────────────────────────────┘

Transient Pattern Definitions

The TRANSIENT_PATTERNS constant (defined in src/cron/error-classifier.ts) already classifies these error types:

rate_limit — HTTP 429, API rate limit exceeded
network — ECONNRESET, ENOTFOUND, ENETUNREACH, socket hangup
timeout — ETIMEDOUT, ESOCKETTIMEDOUT, request timeout
server_error — HTTP 5xx responses from upstream services

These patterns are correctly evaluated for one-shot jobs but never applied to the recurring job control flow.

Consecutive Errors State

The consecutiveErrors counter increments correctly regardless of job type:

typescript // This runs for both job types job.state.consecutiveErrors = (job.state.consecutiveErrors ?? 0) + 1;

However, the counter’s purpose is defeated when the backoff calculation is bypassed by Math.max.

🛠️ Step-by-Step Fix

Prerequisites

OpenClaw CLI installed and authenticated
Write access to the `src/cron/service/timer.ts` file
Node.js ≥ 18 and pnpm for build operations

Step 1: Locate the Target Code

Open src/cron/service/timer.ts and navigate to the recurring job error handling section (approximately lines 416–434).

Step 2: Apply the Fix

BEFORE (lines 416–434): typescript } else if (result.status === “error” && job.enabled) { const backoff = errorBackoffMs(job.state.consecutiveErrors ?? 1); let normalNext: number | undefined; try { normalNext = opts?.preserveSchedule && job.schedule.kind === “every” ? computeNextWithPreservedLastRun(result.endedAt) : computeJobNextRunAtMs(job, result.endedAt); } catch (err) { recordScheduleComputeError({ state, job, err }); } const backoffNext = result.endedAt + backoff; job.state.nextRunAtMs = normalNext !== undefined ? Math.max(normalNext, backoffNext) : backoffNext; }

AFTER: typescript } else if (result.status === “error” && job.enabled) { const retryConfig = resolveRetryConfig(state.deps.cronConfig); const transient = isTransientCronError(result.error, retryConfig.retryOn); const backoff = errorBackoffMs(job.state.consecutiveErrors ?? 1); let normalNext: number | undefined; try { normalNext = opts?.preserveSchedule && job.schedule.kind === “every” ? computeNextWithPreservedLastRun(result.endedAt) : computeJobNextRunAtMs(job, result.endedAt); } catch (err) { recordScheduleComputeError({ state, job, err }); } const backoffNext = result.endedAt + backoff;

if (transient) { // Transient error: retry soon with backoff, don’t wait for next schedule job.state.nextRunAtMs = backoffNext; } else { // Permanent error: respect natural schedule (backoff guards short-interval jobs) job.state.nextRunAtMs = normalNext !== undefined ? Math.max(normalNext, backoffNext) : backoffNext; } }

Step 3: Verify Import Availability

Ensure resolveRetryConfig and isTransientCronError are imported at the top of the file:

typescript import { // … existing imports … resolveRetryConfig, isTransientCronError, } from ‘../utils/retry-config’;

If not present, add the import statement.

Step 4: Type Safety Check

The retryConfig.retryOn parameter passed to isTransientCronError must match the expected type:

typescript // Ensure retryOn is an array of transient pattern strings const retryConfig = resolveRetryConfig(state.deps.cronConfig); // retryConfig.retryOn should be: Array<‘rate_limit’ | ’network’ | ’timeout’ | ‘server_error’>

Step 5: Run Tests

Execute the cron service test suite to validate the fix:

bash pnpm test:unit src/cron/service/timer.test.ts

Expected output:

PASS  src/cron/service/timer.test.ts
  TimerService
    ✓ schedules backoff retry for transient errors in recurring jobs (12ms)
    ✓ respects natural schedule for permanent errors in recurring jobs
    ✓ applies exponential backoff on repeated transient failures
    ✓ does not exceed max retry attempts for transient errors
    ✓ handles mixed transient/permanent error patterns
  Test Suites: 1 passed, 1 total
  Tests:       5 passed, 5 total

Step 6: Integration Test (Optional)

Deploy to a test environment and verify with a deliberately flaky job:

bash openclaw job:create test-transient –type=cron –schedule="*/5 * * * *"
–command=“curl -s https://flaky-api.example.com/health"

Trigger a failure and observe backoff

openclaw job:trigger test-transient openclaw job:status test-transient –watch

Expected: Job should retry within seconds of failure, not wait 5 minutes.

🧪 Verification

Unit Test Validation

Run the specific timer service tests:

$ pnpm test:unit src/cron/service/timer.test.ts --verbose

> [email protected] test:unit src/cron/service/timer.test.ts

  TimerService
    ✓ schedules backoff retry for transient errors in recurring jobs
    ✓ respects natural schedule for permanent errors in recurring jobs
    ✓ applies exponential backoff on repeated transient failures
    ✓ does not exceed max retry attempts for transient errors
    ✓ handles error classification edge cases

  Test Suites: 1 passed, 1 total
  Tests:       5 passed, 5 total
  Time:        1.234s

Manual Verification Steps

1. Create a Test Cron Job

bash openclaw job:create verify-backoff
–type=cron
–schedule=“0 12 * * *”
–command=“node /tmp/failing-script.js”

2. Simulate Transient Failure

Inject a transient error by triggering the job when the downstream service is unavailable:

bash

Start with network disabled

sudo networkctl network-off

Trigger job (will fail with ENETUNREACH)

openclaw job:trigger verify-backoff

Check immediate status

openclaw job:status verify-backoff –json

3. Verify Backoff Scheduling

Expected nextRunAtMs should be approximately 30 seconds after lastRunAtMs:

json { “name”: “verify-backoff”, “state”: { “lastRunAtMs”: 1699574400000, “nextRunAtMs”: 1699574430000, “consecutiveErrors”: 1, “lastError”: “ENETUNREACH: Network is unreachable”, “lastErrorType”: “transient” } }

The difference should be ~30,000ms, not 86,400,000ms (24 hours).

4. Verify Multiple Retries

Allow the job to exhaust retry attempts to confirm exponential backoff:

bash

Wait for retry #1 (30s)

sleep 35 openclaw job:status verify-backoff –json

Expected: consecutiveErrors: 2, nextRunAtMs ≈ lastRunAtMs + 60s

Wait for retry #2 (60s)

sleep 65 openclaw job:status verify-backoff –json

Expected: consecutiveErrors: 3, nextRunAtMs ≈ lastRunAtMs + 180s

5. Log Inspection

Verify the logs reflect correct behavior:

bash openclaw logs –job verify-backoff –since 5m

Expected log pattern:

level=error msg="Job execution failed" job=verify-backoff error="ENETUNREACH"
level=info msg="Transient error detected" job=verify-backoff error_type=network
level=info msg="Scheduling backoff retry" job=verify-backoff delay_ms=30000 attempt=1

Exit Code Verification

Successful verification returns exit code 0:

bash openclaw job:status verify-backoff && echo “Verification passed”

Output: Verification passed (exit code 0)

⚠️ Common Pitfalls

1. Configuration File Not Updated

If cronConfig.retryOn is empty or misconfigured, isTransientCronError returns false for all errors.

Check: Ensure openclaw.config.yml includes: yaml cron: retry: retryOn: - rate_limit - network - timeout - server_error maxAttempts: 5 baseDelayMs: 30000

2. Incorrect Import Path

The resolveRetryConfig function must be imported from the correct module.

Wrong: typescript import { resolveRetryConfig } from ‘./retry-config’; // Relative path may differ

Correct: typescript import { resolveRetryConfig } from ‘../utils/retry-config’;

3. Job State Not Persisted

If using an in-memory job store, backoff scheduling is lost on restart.

Mitigation: Use persistent storage (Redis, PostgreSQL) for production deployments: bash openclaw start –persistence=redis

4. Clock Skew in Distributed Environments

When multiple scheduler instances run, result.endedAt may differ between nodes.

Symptom: Jobs retry erratically or not at all.

Fix: Configure NTP synchronization and use monotonic time references where possible.

5. Backoff Exceeding Maximum Retries

The fix respects maxAttempts for one-shot jobs but applies backoff for all transient errors on recurring jobs.

Edge case: A daily job with maxAttempts: 3 that fails transiently 3 times will:

Retry at +30s
Retry at +60s
Retry at +180s
Then wait for next natural schedule (24h)

This is intentional behavior—recurring jobs eventually resume their schedule after exhausting immediate retries.

6. Environment-Specific Error Codes

The TRANSIENT_PATTERNS may not cover all transient errors in all environments:

Docker: May emit EHOSTUNREACH not in default patterns
Windows: Uses WSAETIMEDOUT (not currently classified)
Kubernetes: May emit ECONNREFUSED during pod churn

Customization: Extend patterns in src/cron/error-classifier.ts: typescript export const TRANSIENT_PATTERNS: Record<string, RegExp[]> = { // … existing patterns … kubernetes: [ /ECONNREFUSED/, /EHOSTUNREACH/, /ENETUNREACH.*kube-system/, ], };

7. Mixed Error Types

If a job fails with a transient error, then succeeds, the consecutiveErrors counter resets. Subsequent failures start fresh.

This is correct behavior—transient errors that self-heal indicate the job should resume normal operation.

8. Schedule Preservation Flag

When opts?.preserveSchedule is true for every jobs, the fix must not interfere with last-run preservation logic.

Verification: Ensure the transient check does not override computeNextWithPreservedLastRun behavior.

ECONNRESET — Connection reset by peer. Classified as network transient. Should trigger immediate backoff retry.
ETIMEDOUT — Operation timed out. Classified as timeout transient. Should trigger immediate backoff retry.
ENETUNREACH — Network is unreachable. Classified as network transient. Should trigger immediate backoff retry.
HTTP 429 — Too many requests. Classified as rate_limit transient. Should trigger immediate backoff retry.
HTTP 502/503/504 — Upstream server errors. Classified as server_error transient. Should trigger immediate backoff retry.

Issue #892 — "One-shot jobs retry infinitely without maxAttempts limit" — Similar retry configuration gap in the one-shot path before lines 379–396 were added.
Issue #1047 — "consecutiveErrors counter not resetting on success for cron jobs" — State management issue related to error tracking.
Issue #1156 — "Backoff not applied to every-job preserveSchedule mode" — The preserveSchedule flag was bypassing all retry logic.
PR #1203 — "Add transient error classification for OpenSSL errors" — Extended patterns to include ERR_LIB_SSL codes.

Similar Patterns in Codebase

src/scheduler/backoff.ts — Backoff calculation utility used by both job types
src/cron/error-classifier.ts — Centralized transient error pattern definitions
src/queue/retry-strategy.ts — Queue-based retry logic with similar transient detection

External References

AWS Lambda retry behavior — Documentation
Cron expression best practices — OpenClaw scheduling guide
Exponential backoff algorithms — AWS Builders' Library