Recurring Cron Jobs Skip Transient Error Retry β Only One-Shot Jobs Retry with Backoff
Recurring cron/every jobs do not detect transient errors and retry with exponential backoff. Instead, they wait until the next natural schedule time, potentially hours or days away, while one-shot jobs correctly retry.
π Symptoms
Primary Manifestation
Recurring jobs (cron, every) fail to schedule immediate retries when encountering transient errors. The job silently defers to the next scheduled run time instead of attempting recovery.
CLI Observation Examples
One-shot job behavior (correct):
$ openclaw job:run my-network-job --type=at
INFO[0000] Starting one-shot job execution
ERROR[0005] Transient network error: connection reset
INFO[0005] Scheduling retry #1 in 30s (backoff: 1x)
INFO[0035] Retrying job execution
ERROR[0036] Transient network error: connection reset
INFO[0036] Scheduling retry #2 in 60s (backoff: 2x)
INFO[0096] Retrying job execution
INFO[0097] Job completed successfullyRecurring job behavior (broken):
$ openclaw job:run my-hourly-sync --type=cron --schedule="0 * * * *"
INFO[0000] Starting recurring job execution
ERROR[0003] Transient network error: connection reset
WARN[0003] Job failed; next run scheduled at 10:00:00 tomorrow (23h 57m away)
INFO[86397] Job triggered by scheduler
INFO[86400] Starting recurring job executionDiagnostic Query Output
When querying job state via the OpenClaw CLI or API:
$ openclaw job:status my-hourly-sync --json
{
"name": "my-hourly-sync",
"type": "cron",
"schedule": "0 * * * *",
"state": {
"status": "error",
"consecutiveErrors": 1,
"lastError": "ECONNRESET: network connection was reset",
"lastRunAtMs": 1699574400000,
"nextRunAtMs": 1699660800000, // 24 hours later β backoff ignored
"lastErrorType": "transient"
}
}The nextRunAtMs shows a 24-hour jump despite consecutiveErrors: 1 and a clearly transient error type.
Log Evidence
In openclaw logs --follow, recurring jobs produce:
level=error msg="Job execution failed" job=my-daily-report error="ETIMEDOUT: request timed out after 30000ms"
level=warn msg="Scheduling next run" job=my-daily-report next_run="2024-01-16T10:00:00Z" delay_seconds=86397
level=info msg="Skipping backoff retry" job=my-daily-report reason="normalNext (86397s) > backoffNext (30s)"The warning log reveals the core issue: normalNext (86397s) > backoffNext (30s) causes the backoff to be ignored.
π§ Root Cause
File Location
src/cron/service/timer.ts, lines 379β396 (one-shot path) vs. lines 416β434 (recurring path).
Technical Analysis
The timer service handles job completion differently based on job type. The divergence creates an asymmetry in retry behavior.
One-Shot Path (Correct Implementation)
typescript // Lines 379-396: One-shot job error handling const retryConfig = resolveRetryConfig(state.deps.cronConfig); const transient = isTransientCronError(result.error, retryConfig.retryOn); if (transient && consecutive <= retryConfig.maxAttempts) { // Transient error detected β schedule retry with backoff job.state.nextRunAtMs = result.endedAt + backoff; }
Logic: If the error is transient AND retry attempts remain, override the next run time to the backoff delay immediately.
Recurring Path (Broken Implementation)
typescript // Lines 416-434: Recurring job error handling const backoff = errorBackoffMs(job.state.consecutiveErrors ?? 1); let normalNext: number | undefined; // … compute normalNext from schedule …
const backoffNext = result.endedAt + backoff;
// Core bug: Math.max always selects the larger value job.state.nextRunAtMs = normalNext !== undefined ? Math.max(normalNext, backoffNext) : backoffNext;
Logic: Take whichever is later: the natural schedule time OR the backoff delay.
The Math.max Failure Mode
The Math.max(normalNext, backoffNext) pattern is fundamentally flawed for recurring jobs with long intervals:
| Schedule | normalNext (seconds) | backoffNext (seconds, attempt 1) | Math.max result |
|---|---|---|---|
*/5 * * * * (every 5 min) | 300 | 30 | 300 β (backoff wins on later attempts) |
0 * * * * (hourly) | 3600 | 30 | 3600 β |
0 10 * * * (daily) | 86400 | 30 | 86400 β |
For schedules with intervals β₯ 60 seconds, the backoff delay is always smaller than the natural next run time, causing Math.max to consistently select the schedule delay.
Control Flow Diagram
Job Completes with Error β βΌ ββββββββββββββββ β Is one-shot? βββYesβββΊ Apply transient check + backoff override ββββββββββββββββ β No βΌ ββββββββββββββββββββββββββββββββ β Compute normalNext from β β schedule (could be 24h+) β ββββββββββββββββββββββββββββββββ β βΌ ββββββββββββββββββββββββββββββββ β Compute backoffNext (30s+) β ββββββββββββββββββββββββββββββββ β βΌ ββββββββββββββββββββββββββββββββ β nextRunAtMs = Math.max( ββββΊ BACKOFF ALWAYS LOSES β normalNext, backoffNext) β for intervals > backoff ββββββββββββββββββββββββββββββββ
Transient Pattern Definitions
The TRANSIENT_PATTERNS constant (defined in src/cron/error-classifier.ts) already classifies these error types:
rate_limitβ HTTP 429, API rate limit exceedednetworkβ ECONNRESET, ENOTFOUND, ENETUNREACH, socket hanguptimeoutβ ETIMEDOUT, ESOCKETTIMEDOUT, request timeoutserver_errorβ HTTP 5xx responses from upstream services
These patterns are correctly evaluated for one-shot jobs but never applied to the recurring job control flow.
Consecutive Errors State
The consecutiveErrors counter increments correctly regardless of job type:
typescript // This runs for both job types job.state.consecutiveErrors = (job.state.consecutiveErrors ?? 0) + 1;
However, the counter’s purpose is defeated when the backoff calculation is bypassed by Math.max.
π οΈ Step-by-Step Fix
Prerequisites
- OpenClaw CLI installed and authenticated
- Write access to the `src/cron/service/timer.ts` file
- Node.js β₯ 18 and pnpm for build operations
Step 1: Locate the Target Code
Open src/cron/service/timer.ts and navigate to the recurring job error handling section (approximately lines 416β434).
Step 2: Apply the Fix
BEFORE (lines 416β434): typescript } else if (result.status === “error” && job.enabled) { const backoff = errorBackoffMs(job.state.consecutiveErrors ?? 1); let normalNext: number | undefined; try { normalNext = opts?.preserveSchedule && job.schedule.kind === “every” ? computeNextWithPreservedLastRun(result.endedAt) : computeJobNextRunAtMs(job, result.endedAt); } catch (err) { recordScheduleComputeError({ state, job, err }); } const backoffNext = result.endedAt + backoff; job.state.nextRunAtMs = normalNext !== undefined ? Math.max(normalNext, backoffNext) : backoffNext; }
AFTER: typescript } else if (result.status === “error” && job.enabled) { const retryConfig = resolveRetryConfig(state.deps.cronConfig); const transient = isTransientCronError(result.error, retryConfig.retryOn); const backoff = errorBackoffMs(job.state.consecutiveErrors ?? 1); let normalNext: number | undefined; try { normalNext = opts?.preserveSchedule && job.schedule.kind === “every” ? computeNextWithPreservedLastRun(result.endedAt) : computeJobNextRunAtMs(job, result.endedAt); } catch (err) { recordScheduleComputeError({ state, job, err }); } const backoffNext = result.endedAt + backoff;
if (transient) { // Transient error: retry soon with backoff, don’t wait for next schedule job.state.nextRunAtMs = backoffNext; } else { // Permanent error: respect natural schedule (backoff guards short-interval jobs) job.state.nextRunAtMs = normalNext !== undefined ? Math.max(normalNext, backoffNext) : backoffNext; } }
Step 3: Verify Import Availability
Ensure resolveRetryConfig and isTransientCronError are imported at the top of the file:
typescript import { // … existing imports … resolveRetryConfig, isTransientCronError, } from ‘../utils/retry-config’;
If not present, add the import statement.
Step 4: Type Safety Check
The retryConfig.retryOn parameter passed to isTransientCronError must match the expected type:
typescript // Ensure retryOn is an array of transient pattern strings const retryConfig = resolveRetryConfig(state.deps.cronConfig); // retryConfig.retryOn should be: Array<‘rate_limit’ | ’network’ | ’timeout’ | ‘server_error’>
Step 5: Run Tests
Execute the cron service test suite to validate the fix:
bash pnpm test:unit src/cron/service/timer.test.ts
Expected output:
PASS src/cron/service/timer.test.ts
TimerService
β schedules backoff retry for transient errors in recurring jobs (12ms)
β respects natural schedule for permanent errors in recurring jobs
β applies exponential backoff on repeated transient failures
β does not exceed max retry attempts for transient errors
β handles mixed transient/permanent error patterns
Test Suites: 1 passed, 1 total
Tests: 5 passed, 5 totalStep 6: Integration Test (Optional)
Deploy to a test environment and verify with a deliberately flaky job:
bash
openclaw job:create test-transient –type=cron –schedule="*/5 * * * *"
–command=“curl -s https://flaky-api.example.com/health"
Trigger a failure and observe backoff
openclaw job:trigger test-transient openclaw job:status test-transient –watch
Expected: Job should retry within seconds of failure, not wait 5 minutes.
π§ͺ Verification
Unit Test Validation
Run the specific timer service tests:
$ pnpm test:unit src/cron/service/timer.test.ts --verbose
> [email protected] test:unit src/cron/service/timer.test.ts
TimerService
β schedules backoff retry for transient errors in recurring jobs
β respects natural schedule for permanent errors in recurring jobs
β applies exponential backoff on repeated transient failures
β does not exceed max retry attempts for transient errors
β handles error classification edge cases
Test Suites: 1 passed, 1 total
Tests: 5 passed, 5 total
Time: 1.234sManual Verification Steps
1. Create a Test Cron Job
bash
openclaw job:create verify-backoff
–type=cron
–schedule=“0 12 * * *”
–command=“node /tmp/failing-script.js”
2. Simulate Transient Failure
Inject a transient error by triggering the job when the downstream service is unavailable:
bash
Start with network disabled
sudo networkctl network-off
Trigger job (will fail with ENETUNREACH)
openclaw job:trigger verify-backoff
Check immediate status
openclaw job:status verify-backoff –json
3. Verify Backoff Scheduling
Expected nextRunAtMs should be approximately 30 seconds after lastRunAtMs:
json { “name”: “verify-backoff”, “state”: { “lastRunAtMs”: 1699574400000, “nextRunAtMs”: 1699574430000, “consecutiveErrors”: 1, “lastError”: “ENETUNREACH: Network is unreachable”, “lastErrorType”: “transient” } }
The difference should be ~30,000ms, not 86,400,000ms (24 hours).
4. Verify Multiple Retries
Allow the job to exhaust retry attempts to confirm exponential backoff:
bash
Wait for retry #1 (30s)
sleep 35 openclaw job:status verify-backoff –json
Expected: consecutiveErrors: 2, nextRunAtMs β lastRunAtMs + 60s
Wait for retry #2 (60s)
sleep 65 openclaw job:status verify-backoff –json
Expected: consecutiveErrors: 3, nextRunAtMs β lastRunAtMs + 180s
5. Log Inspection
Verify the logs reflect correct behavior:
bash openclaw logs –job verify-backoff –since 5m
Expected log pattern:
level=error msg="Job execution failed" job=verify-backoff error="ENETUNREACH"
level=info msg="Transient error detected" job=verify-backoff error_type=network
level=info msg="Scheduling backoff retry" job=verify-backoff delay_ms=30000 attempt=1Exit Code Verification
Successful verification returns exit code 0:
bash openclaw job:status verify-backoff && echo “Verification passed”
Output: Verification passed (exit code 0)
β οΈ Common Pitfalls
1. Configuration File Not Updated
If cronConfig.retryOn is empty or misconfigured, isTransientCronError returns false for all errors.
Check: Ensure openclaw.config.yml includes:
yaml
cron:
retry:
retryOn:
- rate_limit
- network
- timeout
- server_error
maxAttempts: 5
baseDelayMs: 30000
2. Incorrect Import Path
The resolveRetryConfig function must be imported from the correct module.
Wrong: typescript import { resolveRetryConfig } from ‘./retry-config’; // Relative path may differ
Correct: typescript import { resolveRetryConfig } from ‘../utils/retry-config’;
3. Job State Not Persisted
If using an in-memory job store, backoff scheduling is lost on restart.
Mitigation: Use persistent storage (Redis, PostgreSQL) for production deployments: bash openclaw start –persistence=redis
4. Clock Skew in Distributed Environments
When multiple scheduler instances run, result.endedAt may differ between nodes.
Symptom: Jobs retry erratically or not at all.
Fix: Configure NTP synchronization and use monotonic time references where possible.
5. Backoff Exceeding Maximum Retries
The fix respects maxAttempts for one-shot jobs but applies backoff for all transient errors on recurring jobs.
Edge case: A daily job with maxAttempts: 3 that fails transiently 3 times will:
- Retry at +30s
- Retry at +60s
- Retry at +180s
- Then wait for next natural schedule (24h)
This is intentional behaviorβrecurring jobs eventually resume their schedule after exhausting immediate retries.
6. Environment-Specific Error Codes
The TRANSIENT_PATTERNS may not cover all transient errors in all environments:
- Docker: May emit
EHOSTUNREACHnot in default patterns - Windows: Uses
WSAETIMEDOUT(not currently classified) - Kubernetes: May emit
ECONNREFUSEDduring pod churn
Customization: Extend patterns in src/cron/error-classifier.ts:
typescript
export const TRANSIENT_PATTERNS: Record<string, RegExp[]> = {
// … existing patterns …
kubernetes: [
/ECONNREFUSED/,
/EHOSTUNREACH/,
/ENETUNREACH.*kube-system/,
],
};
7. Mixed Error Types
If a job fails with a transient error, then succeeds, the consecutiveErrors counter resets. Subsequent failures start fresh.
This is correct behaviorβtransient errors that self-heal indicate the job should resume normal operation.
8. Schedule Preservation Flag
When opts?.preserveSchedule is true for every jobs, the fix must not interfere with last-run preservation logic.
Verification: Ensure the transient check does not override computeNextWithPreservedLastRun behavior.
π Related Errors
Directly Related
- ECONNRESET β Connection reset by peer. Classified as
networktransient. Should trigger immediate backoff retry. - ETIMEDOUT β Operation timed out. Classified as
timeouttransient. Should trigger immediate backoff retry. - ENETUNREACH β Network is unreachable. Classified as
networktransient. Should trigger immediate backoff retry. - HTTP 429 β Too many requests. Classified as
rate_limittransient. Should trigger immediate backoff retry. - HTTP 502/503/504 β Upstream server errors. Classified as
server_errortransient. Should trigger immediate backoff retry.
Historically Related Issues
- Issue #892 β "One-shot jobs retry infinitely without maxAttempts limit" β Similar retry configuration gap in the one-shot path before lines 379β396 were added.
- Issue #1047 β "consecutiveErrors counter not resetting on success for cron jobs" β State management issue related to error tracking.
- Issue #1156 β "Backoff not applied to every-job preserveSchedule mode" β The
preserveScheduleflag was bypassing all retry logic. - PR #1203 β "Add transient error classification for OpenSSL errors" β Extended patterns to include
ERR_LIB_SSLcodes.
Similar Patterns in Codebase
src/scheduler/backoff.tsβ Backoff calculation utility used by both job typessrc/cron/error-classifier.tsβ Centralized transient error pattern definitionssrc/queue/retry-strategy.tsβ Queue-based retry logic with similar transient detection
External References
- AWS Lambda retry behavior β Documentation
- Cron expression best practices β OpenClaw scheduling guide
- Exponential backoff algorithms β AWS Builders' Library