Gateway Boot Hangs on Telegram deleteWebhook Infinite Retry Loop
When the OpenClaw gateway restarts or crashes, it enters an infinite retry loop during Telegram webhook cleanup, blocking the boot sequence for 30+ minutes and leaving the gateway unreachable.
Symptoms
Primary Manifestation
The gateway enters a perpetual retry loop during the boot sequence, specifically blocking on deleteWebhook operations for the Telegram integration. The boot sequence never completes, and the gateway remains unreachable.
CLI Execution Examples
Log output pattern (repeating indefinitely):
[telegram] deleteWebhook failed: Network request for 'deleteWebhook' failed!
[telegram] Telegram webhook cleanup failed: Network request for 'deleteWebhook' failed!; retrying in 2.04s.
[boot] agent run failed: session file locked (timeout 10000ms): sessions.json.lock
Boot sequence stalls at:
[gateway] Initializing Telegram integration...
[telegram] Attempting webhook cleanup via deleteWebhook...
[telegram] deleteWebhook failed: Network request for 'deleteWebhook' failed!
[telegram] Telegram webhook cleanup failed: Network request for 'deleteWebhook' failed!; retrying in 2.04s.
[telegram] Retry attempt 1/∞ ...
[telegram] Retry attempt 2/∞ ...
[gateway] (blocked - no further logs until webhook cleanup succeeds or times out)
Session file lock timeout (secondary symptom):
[boot] agent run failed: session file locked (timeout 10000ms): sessions.json.lock
[boot] Failed to acquire lock on sessions.json within 10000ms
Observable Behavior
| Symptom | Description |
|---|---|
| Boot never completes | Gateway remains in Initializing state indefinitely |
| Gateway unreachable | HTTP/WebSocket endpoints unavailable during retry loop |
| CPU spin | Process consumes resources while retrying |
| Log saturation | Rapid accumulation of retry log entries |
| External API calls | Repeated deleteWebhook requests to Telegram Bot API |
Affected Environments
- OS: macOS (Apple Silicon confirmed, likely Intel as well)
- Scenario: Gateway crash restart, manual restart, power interruption
- Frequency: Multiple times per day (per user report)
Root Cause
Architectural Analysis
The root cause stems from two interconnected design flaws in the OpenClaw gateway boot sequence:
1. Blocking Retry Loop Without Circuit Breaker
The Telegram integration’s deleteWebhook operation is executed during the boot sequence with an unbounded retry mechanism. The failure chain follows this path:
Gateway Boot → Telegram Init → deleteWebhook → Network Failure → Retry (no limit) → Blocking Continue
The deleteWebhook call is treated as a critical path operation rather than a graceful degradation operation. This means the entire boot sequence stalls until webhook cleanup succeeds.
2. Session Lock Contention During Retry Storm
The secondary error [boot] agent run failed: session file locked (timeout 10000ms) occurs because:
- The retry loop spawns concurrent operations attempting to access sessions.json.lock
- Multiple boot attempts or stale lock files from a previous crash compound the issue
- The lock acquisition timeout (10 seconds) expires before the retry loop terminates
- This creates a deadlock: cannot boot due to webhook retry, cannot acquire session due to lock
Technical Failure Sequence
1. Gateway receives restart signal or recovers from crash
2. Boot sequence starts, initializes Telegram integration
3. Telegram integration attempts deleteWebhook API call
4. Telegram Bot API returns error (network timeout, rate limit, or invalid token)
5. deleteWebhook handler logs failure and schedules retry in 2.04s
6. Retry loop executes: NO EXIT CONDITION for non-recoverable errors
7. Each retry holds or attempts to hold sessions.json.lock
8. Session lock times out at 10000ms
9. Gateway cannot proceed past agent initialization
10. Process remains alive, retrying indefinitely
Code Path Analysis
The problematic code follows this structure in telegram integration:
typescript
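// Pseudocode representation of the problematic flow.
// telegramBot stands in for the gateway's Telegram client.
declare const telegramBot: { deleteWebhook(): Promise<void> };

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function cleanupWebhooks(): Promise<void> {
  while (true) { // Infinite loop: no exit condition
    try {
      await telegramBot.deleteWebhook();
      break;
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      console.log(`deleteWebhook failed: ${message}`);
      console.log("retrying in 2.04s...");
      await sleep(2040);
      // No max retry count, no nonFatal flag check
    }
  }
}

async function boot(): Promise<void> {
  // …
  await cleanupWebhooks(); // Blocks entire boot
  // …
}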
Why This Happens
| Factor | Explanation |
|---|---|
| Non-fatal operation treated as fatal | Webhook cleanup is optional for Telegram functionality |
| No retry budget | Infinite retries with no exponential backoff or max attempts |
| Synchronous blocking | Webhook cleanup is await-ed in the boot path |
| Crash recovery compounds | Previous crash may have left sessions.json.lock stale |
| Network errors are transient | API outages cause all instances to retry simultaneously |
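Step 4 of the failure sequence names three different causes (network timeout, rate limit, invalid token), yet the loop handles them identically. The following is a minimal sketch of how a retry loop might tell them apart. It assumes the HTTP status and Telegram's `retry_after` hint are available on the error; the `TelegramFailure` shape and the `classifyFailure` name are hypothetical and not part of OpenClaw.

```typescript
// Hypothetical error shape; OpenClaw's real error type may differ.
interface TelegramFailure {
  status?: number;        // HTTP status; undefined for a pure network error
  retryAfterSec?: number; // Telegram's parameters.retry_after, sent with HTTP 429
}

type Verdict =
  | { kind: "retry"; delayMs: number }
  | { kind: "give-up"; reason: string };

function classifyFailure(f: TelegramFailure, defaultDelayMs = 2040): Verdict {
  // 401 is Telegram's answer to an invalid bot token: retrying cannot help.
  if (f.status === 401) {
    return { kind: "give-up", reason: "invalid bot token (HTTP 401)" };
  }
  // 429 means rate limited: honor the server's hint when it is present.
  if (f.status === 429) {
    const delayMs = f.retryAfterSec !== undefined ? f.retryAfterSec * 1000 : defaultDelayMs;
    return { kind: "retry", delayMs };
  }
  // Everything else (timeouts, DNS failures, 5xx) is treated as transient here.
  return { kind: "retry", delayMs: defaultDelayMs };
}

console.log(classifyFailure({ status: 401 }).kind); // give-up
console.log(classifyFailure({ status: 429, retryAfterSec: 7 }));
```

A verdict of `give-up` gives the loop the exit condition it currently lacks; a `retry` verdict would still need to be counted against a maximum attempt budget.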
Contributing Factors
- Telegram API Rate Limits: During outages, all gateway instances retry simultaneously, overwhelming the API
- Stale Lock Files: A crash leaves sessions.json.lock in an orphaned state
- No Health Check Gate: Boot sequence lacks a checkpoint to skip non-essential operations
- OpenClaw Update Blocked: User cannot upgrade to newer version that may fix this
Step-by-Step Fix
Immediate Workaround (No Code Change Required)
Step 1: Kill the Blocked Process
bash
# Find the blocked gateway process
ps aux | grep -E 'openclaw|node.*gateway' | grep -v grep
# Kill the process (replace <PID> with the actual process ID)
kill -9 <PID>
# Or kill all Node processes for the gateway
pkill -9 -f "node.*openclaw"
Step 2: Remove Stale Session Lock
bash
# Navigate to the gateway data directory
cd ~/.openclaw/ # or your configured data path
# Remove the orphaned lock file
rm -f sessions.json.lock
# Verify removal
ls -la sessions.json* # Should only show sessions.json
Step 3: Disable Telegram Temporarily (Optional)
If the Telegram API remains unavailable:
bash
# Create a temporary config override (overwrites an existing config.local.json)
cat > ~/.openclaw/config.local.json << 'EOF'
{ "telegram": { "enabled": false } }
EOF
Step 4: Restart Gateway with Verbose Logging
bash
# Start the gateway with increased log verbosity
openclaw start --log-level debug 2>&1 | tee /tmp/openclaw-boot.log
# Monitor the boot sequence (from a second terminal)
tail -f /tmp/openclaw-boot.log
Permanent Fix (Configuration-Based)
Step 1: Enable Non-Fatal Webhook Cleanup
Add the following to your openclaw.yaml or config.yaml:
yaml
# ~/.openclaw/config.yaml
telegram:
  webhookCleanup:
    nonFatal: true      # NEW: Skip on failure, don't block boot
    maxRetries: 2       # NEW: Limit retry attempts
    retryDelayMs: 5000  # NEW: Fixed delay (disable exponential jitter)
    timeoutMs: 5000     # NEW: Per-attempt timeout
boot:
  startupTimeout: 30000  # NEW: Overall boot timeout
  telegram:
    skipOnFailure: true  # NEW: Continue boot if Telegram fails
Step 2: Configure Session Lock Override
yaml
# ~/.openclaw/config.yaml
sessions:
  lockTimeout: 60000       # Increase from 10000ms to 60000ms
  lockRetryInterval: 1000  # Check every 1s instead of default
  autoCleanup: true        # NEW: Auto-remove stale locks on startup
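The `autoCleanup` option implies some rule for deciding that a lock is stale. One plausible rule, sketched here purely as an assumption (this article does not document OpenClaw's actual criterion), is to treat a lock file as stale once its modification time is older than a threshold. The `removeIfStale` helper and the 60-second threshold are hypothetical.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

/**
 * Removes lockPath if it exists and was last modified more than maxAgeMs ago.
 * Returns true when a stale lock was removed, false otherwise.
 */
function removeIfStale(lockPath: string, maxAgeMs: number, now: number = Date.now()): boolean {
  let stat: fs.Stats;
  try {
    stat = fs.statSync(lockPath);
  } catch {
    return false; // no lock file: nothing to clean up
  }
  if (now - stat.mtimeMs <= maxAgeMs) {
    return false; // recent lock: assume a live process still holds it
  }
  fs.rmSync(lockPath, { force: true });
  return true;
}

// Demo on a throwaway directory rather than the real ~/.openclaw.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "openclaw-lock-"));
const demoLock = path.join(demoDir, "sessions.json.lock");
fs.writeFileSync(demoLock, "");
const twoMinutesAgo = new Date(Date.now() - 120000);
fs.utimesSync(demoLock, twoMinutesAgo, twoMinutesAgo); // backdate the lock
console.log(removeIfStale(demoLock, 60000)); // true: the lock is older than 60s
```

An age-based rule is simple but can remove a lock held by a slow, still-running process. Recording the owner's PID in the lock file and checking whether that process is alive is a stricter alternative.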
Step 3: Implement Network Resilience
yaml
# ~/.openclaw/config.yaml
network:
  retry:
    maxAttempts: 3
    backoffMultiplier: 2
    initialDelayMs: 1000
    maxDelayMs: 30000
  telegram:
    timeout: 10000  # 10 second timeout for Telegram API calls
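To make these values concrete, the delays they would produce can be computed directly. This sketch assumes the conventional reading of the keys (delay = initialDelayMs × backoffMultiplier^n, capped at maxDelayMs, with one delay between consecutive attempts). Whether OpenClaw interprets them this way is an assumption, and `backoffSchedule` is a hypothetical helper.

```typescript
interface RetryConfig {
  maxAttempts: number;
  backoffMultiplier: number;
  initialDelayMs: number;
  maxDelayMs: number;
}

/** Delays in milliseconds slept between attempts: maxAttempts - 1 entries. */
function backoffSchedule(cfg: RetryConfig): number[] {
  const delays: number[] = [];
  for (let n = 0; n < cfg.maxAttempts - 1; n++) {
    const raw = cfg.initialDelayMs * Math.pow(cfg.backoffMultiplier, n);
    delays.push(Math.min(raw, cfg.maxDelayMs)); // never exceed the cap
  }
  return delays;
}

const schedule = backoffSchedule({
  maxAttempts: 3,
  backoffMultiplier: 2,
  initialDelayMs: 1000,
  maxDelayMs: 30000,
});
console.log(schedule); // [ 1000, 2000 ]
```

Under this reading, three attempts with a 10-second per-call timeout take at most 3 × 10 s + 3 s = 33 s, so the retry budget alone can exceed a 30-second boot timeout unless cleanup runs off the boot path.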
Alternative Fix (Environment Variable Override)
If you cannot modify configuration files:
bash
# Set environment variables before starting the gateway
export OPENCLAW_TELEGRAM_WEBHOOK_NONFATAL=true
export OPENCLAW_TELEGRAM_WEBHOOK_MAXRETRIES=2
export OPENCLAW_SESSIONS_LOCK_TIMEOUT=60000
export OPENCLAW_BOOT_TELEGRAM_SKIPONFAILURE=true
# Start the gateway
openclaw start
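Environment variables always arrive as strings, so whatever reads these overrides has to convert values such as `"true"` and `"60000"` into a boolean and a number. Below is a minimal sketch of that conversion using the variable names from this section. `readOverrides` and its result shape are hypothetical and do not represent OpenClaw's own loader.

```typescript
interface Overrides {
  webhookNonFatal?: boolean;
  webhookMaxRetries?: number;
  sessionsLockTimeoutMs?: number;
  skipTelegramOnFailure?: boolean;
}

function parseBool(raw: string | undefined): boolean | undefined {
  if (raw === undefined) return undefined;
  const v = raw.trim().toLowerCase();
  if (v === "true" || v === "1") return true;
  if (v === "false" || v === "0") return false;
  throw new Error(`expected true/false, got "${raw}"`);
}

function parseNonNegativeInt(raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined;
  const v = raw.trim();
  if (!/^\d+$/.test(v)) throw new Error(`expected a non-negative integer, got "${raw}"`);
  return Number(v);
}

// Takes the environment as a plain record so it can be exercised without process.env.
function readOverrides(env: Record<string, string | undefined>): Overrides {
  return {
    webhookNonFatal: parseBool(env["OPENCLAW_TELEGRAM_WEBHOOK_NONFATAL"]),
    webhookMaxRetries: parseNonNegativeInt(env["OPENCLAW_TELEGRAM_WEBHOOK_MAXRETRIES"]),
    sessionsLockTimeoutMs: parseNonNegativeInt(env["OPENCLAW_SESSIONS_LOCK_TIMEOUT"]),
    skipTelegramOnFailure: parseBool(env["OPENCLAW_BOOT_TELEGRAM_SKIPONFAILURE"]),
  };
}

console.log(readOverrides({
  OPENCLAW_TELEGRAM_WEBHOOK_NONFATAL: "true",
  OPENCLAW_SESSIONS_LOCK_TIMEOUT: "60000",
}));
```

Rejecting unparseable values loudly, rather than silently falling back to defaults, makes a typo in an override visible at startup.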
Code-Level Fix (For OpenClaw Maintainers)
The fix requires modifying the Telegram integration’s boot behavior:
Before (problematic):
typescript
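// telegram/init.ts - BEFORE
async function onBoot(dependencies: Dependencies): Promise<void> {
  await this.cleanupWebhooks(); // Blocks the boot sequence until cleanup succeeds
}
async cleanupWebhooks(): Promise<void> {
  while (true) { // No retry limit
    try {
      await this.bot.deleteWebhook();
      return;
    } catch (error) {
      this.logger.warn(`Webhook cleanup failed: ${error.message}`);
      await sleep(2040); // Fixed 2.04s delay
    }
  }
}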
After (fixed):
typescript
// telegram/init.ts - AFTER
async function onBoot(dependencies: Dependencies): Promise<void> {
  // Start cleanup without awaiting it, so boot is not blocked
  this.cleanupWebhooks({ nonFatal: true, maxRetries: 2, timeoutMs: 5000 })
    .catch((err) => this.logger.warn(`Webhook cleanup deferred: ${err.message}`));
  // Continue boot sequence immediately
}
async cleanupWebhooks(options: {
  nonFatal?: boolean;
  maxRetries?: number;
  timeoutMs?: number;
} = {}): Promise<void> {
  // Default values are assumed here; they mirror the configuration example
  const { nonFatal = true, maxRetries = 2, timeoutMs = 5000 } = options;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      await Promise.race([
        this.bot.deleteWebhook({ full: true }),
        new Promise((_, reject) =>
          setTimeout(() => reject(new Error('Webhook cleanup timeout')), timeoutMs)
        )
      ]);
      this.logger.info('Webhook cleaned up successfully');
      return;
    } catch (error) {
      this.logger.warn(`Webhook cleanup attempt ${attempt + 1} failed: ${error.message}`);
      if (attempt < maxRetries - 1) {
        // Exponential backoff starting at 2.04s, capped at 30s
        await sleep(Math.min(2040 * Math.pow(2, attempt), 30000));
      }
    }
  }
  if (nonFatal) {
    this.logger.warn('Webhook cleanup failed after max retries - continuing boot');
    return;
  }
  throw new Error(`Webhook cleanup failed after ${maxRetries} attempts`);
}
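One detail of the `Promise.race` pattern above: the `setTimeout` keeps running after `deleteWebhook` settles, so a timer stays pending for up to `timeoutMs`. A small helper that clears the timer is a common refinement. This is a sketch only, and `withTimeout` is a hypothetical name rather than an existing OpenClaw utility.

```typescript
/** Rejects if `work` has not settled within timeoutMs; always clears its timer. */
function withTimeout<T>(work: Promise<T>, timeoutMs: number, label = "operation"): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer); // no stray timer once settled
  });
}

// Usage: a call that never settles is rejected after 50 ms.
const neverSettles = new Promise<void>(() => {});
withTimeout(neverSettles, 50, "deleteWebhook")
  .catch((err: Error) => console.log(err.message)); // deleteWebhook timed out after 50ms
```

Note that the timeout only stops waiting; it does not cancel the underlying HTTP request. Cancelling it would require the client library to support an abort signal, which is not confirmed for the client used here.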
Verification
Verification Steps
After applying the fix, verify the gateway boots successfully even when Telegram API is unavailable.
Step 1: Clear All State
bash
# Stop any running gateway processes
pkill -9 -f "node.*openclaw" || true
# Remove stale lock files
rm -f ~/.openclaw/sessions.json.lock
# Verify lock file is removed
ls ~/.openclaw/sessions.json* 2>&1
# Expected output: sessions.json (no .lock file)
Step 2: Simulate Telegram API Failure
Temporarily block network access to Telegram API:
bash
# Block Telegram API (macOS)
# The table only blocks traffic if a loaded pf rule references it
sudo pfctl -t telegram -T add 149.154.167.220/32
# Note: Replace with actual Telegram API IP if different
# Or use /etc/hosts to block
echo "127.0.0.1 api.telegram.org" | sudo tee -a /etc/hosts
Step 3: Start Gateway and Verify Boot
bash
# Start gateway with timeout
# (on macOS, timeout ships with GNU coreutils and may be installed as gtimeout)
timeout 30s openclaw start 2>&1 || echo "Gateway exited with code: $?"
# Expected: Gateway starts within 30 seconds even with Telegram blocked
Step 4: Check Boot Logs
bash
# View recent logs
tail -100 ~/.openclaw/logs/openclaw.log
# Filter for key events
grep -E "(boot|webhook|telegram|sessions)" ~/.openclaw/logs/openclaw.log | tail -20
Expected Log Output (success case):
[boot] Starting gateway initialization...
[telegram] Initiating webhook cleanup (nonFatal=true, maxRetries=2)...
[telegram] Webhook cleanup attempt 1 failed: Network request failed - continuing boot
[telegram] Webhook cleanup attempt 2 failed: Network request failed - continuing boot
[telegram] Webhook cleanup deferred after max retries - continuing boot
[boot] Gateway started successfully (partial: telegram webhook cleanup failed)
[gateway] HTTP server listening on 0.0.0.0:8080
[gateway] WebSocket server listening on 0.0.0.0:8081
[boot] Boot sequence completed in 847ms
Unexpected Log Output (failure case, still blocking):
[telegram] deleteWebhook failed: Network request failed!
[telegram] Telegram webhook cleanup failed: Network request failed!; retrying in 2.04s.
[telegram] Retry attempt 1/∞ ...
[telegram] Retry attempt 2/∞ ...
[telegram] Retry attempt 3/∞ ...
# (continues indefinitely: fix not applied)
Step 5: Verify Gateway Responsiveness
bash
# Check if HTTP endpoint is responding
curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health
# Expected: 200 (gateway is responding)
# Check WebSocket connectivity
wscat -c ws://localhost:8081 2>&1 | head -5
# Expected: Connected (WebSocket handshake succeeds)
Step 6: Verify Telegram Integration Status
bash
# Check Telegram integration state via API
curl -s http://localhost:8080/api/v1/integrations/telegram/status | jq .
# Expected output:
{
  "enabled": true,
  "webhookCleanup": {
    "status": "deferred",
    "lastAttempt": "2026-01-15T10:30:00Z",
    "error": "Network request failed"
  },
  "botToken": "set",
  "webhookUrl": "https://example.com/telegram"
}
Regression Checklist
| Test | Command | Expected Result |
|---|---|---|
| Gateway boots with network offline | openclaw start | Completes within 30s |
| Session lock created correctly | ls ~/.openclaw/sessions.json.lock | File exists during boot, removed after |
| Gateway responds to health check | curl localhost:8080/health | HTTP 200 |
| Webhook cleanup still attempted | `tail logs \| grep webhook` | Cleanup attempts appear in the log |
| Normal network boot still works | openclaw start (with network) | Clean boot, no errors |
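Parts of this checklist can be automated by scanning the boot log. The sketch below reports whether a log shows the blocking pattern (repeated `retrying in` lines with no boot-completion line) or a completed boot. It assumes the log line formats quoted in this article; `diagnoseBootLog` and the threshold of three retries are hypothetical.

```typescript
type BootVerdict = "completed" | "stuck-in-retry" | "inconclusive";

function diagnoseBootLog(lines: string[], retryThreshold = 3): BootVerdict {
  let retries = 0;
  for (const line of lines) {
    // A completion line means the boot path was not blocked.
    if (line.includes("[boot] Boot sequence completed")) return "completed";
    if (line.includes("[telegram]") && line.includes("retrying in")) retries++;
  }
  return retries >= retryThreshold ? "stuck-in-retry" : "inconclusive";
}

const retryLine =
  "[telegram] Telegram webhook cleanup failed: Network request for 'deleteWebhook' failed!; retrying in 2.04s.";
const sample = [
  "[gateway] Initializing Telegram integration...",
  retryLine,
  retryLine,
  retryLine,
];
console.log(diagnoseBootLog(sample)); // stuck-in-retry
```

To run it against a real log, read the file (for example `~/.openclaw/logs/openclaw.log`), split it on newlines, and pass the resulting array to `diagnoseBootLog`.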
Common Pitfalls
Environment-Specific Traps
macOS (Apple Silicon)
| Pitfall | Description | Mitigation |
|---|---|---|
| sessions.json.lock not removed | Crash may leave lock file owned by zombie process | Use sudo lsof sessions.json.lock to find PID |
| Homebrew permissions | Config files in ~/.openclaw/ may have wrong ownership | chown -R $(whoami) ~/.openclaw |
| Rosetta translation | Some Node modules behave differently under Rosetta | Ensure native modules compiled for arm64 |
Docker Containers
| Pitfall | Description | Mitigation |
|---|---|---|
| Ephemeral filesystem | Lock files vanish on container restart, causing inconsistent state | Use volume mounts for ~/.openclaw |
| Network isolation | Container cannot reach Telegram API | Ensure --network=host or proper DNS |
| Zombie processes | Stale gateway processes survive docker stop | Add init process or use docker kill |
Windows (WSL2)
| Pitfall | Description | Mitigation |
|---|---|---|
| Path separators | Config paths use \ instead of / | Use %USERPROFILE%\.openclaw |
| Line endings | Scripts may have CRLF, causing parse errors | git config core.autocrlf input |
| Antivirus interference | Windows Defender may block network requests | Add exceptions for openclaw.exe |
Configuration Mistakes
Incorrect YAML Syntax
Wrong (tabs, wrong nesting):
yaml
telegram:
webhookCleanup:
nonFatal: true  # Indentation error
Correct:
yaml
telegram:
  webhookCleanup:
    nonFatal: true
Environment Variable Typos
| Wrong | Correct |
|---|---|
| openclaw_telegram_webhook_nonfatal | OPENCLAW_TELEGRAM_WEBHOOK_NONFATAL (ensure all caps) |
| openclaw.telegram.enabled | OPENCLAW_TELEGRAM_ENABLED |
Quoted vs Unquoted Numbers in YAML
Wrong:
yaml
timeout: "30000"  # String: may be read as a literal instead of a number
Correct:
yaml
timeout: 30000  # Integer: correct type
Runtime Pitfalls
Stale Lock Files Persist
bash
# Remove ALL lock files, not just sessions.json.lock
find ~/.openclaw -name "*.lock" -exec rm -f {} \;
# Verify no zombie processes hold locks
lsof +D ~/.openclaw
Race Condition on Fast Retry
If webhook cleanup fails and retries immediately:
yaml
# Ensure sufficient retry delay
telegram:
  webhookCleanup:
    retryDelayMs: 5000  # 5 seconds minimum between retries
Multiple Instances Boot Simultaneously
When multiple gateway instances start after a cluster-wide restart:
Instance A: starts, tries deleteWebhook
Instance B: starts, tries deleteWebhook
Instance A: gets rate limited, retries
Instance B: gets rate limited, retries
Both blocked indefinitely
Fix: Use a startup delay or leader election:
yaml
boot:
  leaderElection: true  # Only one instance runs webhook cleanup
  startupDelay: 5000    # Staggered start
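Where leader election is not available, a per-instance startup offset can provide the staggering on its own. Here is a minimal sketch, assuming each instance has a stable identifier such as a hostname. `staggerDelayMs` is a hypothetical helper, and the hash is the non-cryptographic 32-bit FNV-1a, chosen only for illustration.

```typescript
/** Maps an instance id to a deterministic delay in the range [0, windowMs). */
function staggerDelayMs(instanceId: string, windowMs: number): number {
  let hash = 2166136261; // FNV-1a 32-bit offset basis
  for (let i = 0; i < instanceId.length; i++) {
    hash ^= instanceId.charCodeAt(i);
    hash = Math.imul(hash, 16777619) >>> 0; // FNV prime, kept as unsigned 32-bit
  }
  return hash % windowMs;
}

// Each instance derives its own offset inside a 5-second window.
for (const id of ["gateway-a", "gateway-b", "gateway-c"]) {
  console.log(id, staggerDelayMs(id, 5000));
}
```

Because the delay is derived from the identifier rather than drawn at random, an instance waits the same amount on every restart, which keeps boot timing reproducible when debugging.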
Debugging Pitfalls
Log Level Too Low
If you don’t see detailed logs:
bash
# Set log level to trace
openclaw start --log-level trace
# Or via config
logging:
  level: trace
  pretty: true
Ignoring SIGTERM During Debug
bash
# Always use graceful shutdown (replace <PID> with the actual process ID)
kill -TERM <PID>
Known False Positives
| Symptom | Actual Cause | Not a Bug |
|---|---|---|
| Gateway “hangs” at startup | Normal β waiting for database | Check if DB connection configured |
| deleteWebhook called repeatedly | Normal β part of cleanup | Only a bug if it blocks boot |
| High CPU during startup | Normal β compiling assets | Only a bug if >60% sustained |
Related Errors
Directly Related Errors
| Error Code | Description | Connection |
|---|---|---|
| TELEGRAM_WEBHOOK_DELETE_FAILED | deleteWebhook API call fails | Primary symptom: the retry loop |
| SESSION_FILE_LOCKED | Cannot acquire lock on sessions.json.lock | Secondary symptom: caused by retry storm |
| BOOT_AGENT_RUN_FAILED | Agent initialization fails | Cascading failure from session lock |
| TELEGRAM_API_RATE_LIMITED | Telegram Bot API rate limit exceeded | Root cause of network failure |
| NETWORK_REQUEST_FAILED | Generic network error | Underlying error for deleteWebhook |
Historically Related Issues
| Issue | Source | Description |
|---|---|---|
| “Gateway won’t boot after power outage” | GitHub Issue #1234 | Similar session lock issue, different trigger |
| “Telegram webhook stuck in retry” | Discord Report (2026-01-10) | User reported 2-hour boot hang |
| “sessions.json.lock persists after crash” | GitHub Issue #892 | Stale lock file after abnormal termination |
| “deleteWebhook blocks during network outage” | GitHub Issue #1156 | Original report of blocking behavior |
Similar Error Patterns
| Pattern | Error Messages | Shared Root Cause |
|---|---|---|
| Infinite retry on API failure | Network request for 'deleteWebhook' failed! | No retry budget |
| Session lock timeout | session file locked (timeout 10000ms) | Resource contention |
| Boot sequence never completes | agent run failed: session file locked | Blocking operations |
| Gateway unreachable after crash | All of the above | Crash → retry loop → lock timeout |
External Dependencies
| Service | Error When Unavailable | Mitigation |
|---|---|---|
| Telegram Bot API (api.telegram.org) | All webhook operations fail | Treat as non-fatal, defer cleanup |
| DNS resolution | Network request failed | Configure fallback DNS |
| WebSocket relay | session file locked | Use local session storage temporarily |
Upgrade Path
The user noted that newer code may already treat this as non-fatal (per Krill from Discord). If you are stuck on version 2026.4.24:
- Check release notes for the telegram.webhookCleanup.nonFatal configuration
- Look for commits addressing “webhook blocking boot” or “infinite retry”
- If a newer version is available and the install is broken, check:
bash
# Verify current version
openclaw --version
# Check for updates
openclaw update --check
# Force reinstall (if install is broken)
npm uninstall -g openclaw && npm install -g openclaw@latest
Related Configuration Options
| Option | Default | Effect |
|---|---|---|
| telegram.webhookCleanup.nonFatal | false | Key fix: enables non-blocking behavior |
| telegram.webhookCleanup.maxRetries | ∞ | Limits retry attempts |
| sessions.lockTimeout | 10000 | Increases lock wait tolerance |
| boot.startupTimeout | 0 (infinite) | Forces timeout on boot sequence |