Task State Loss After OpenClaw Service Restarts
OpenClaw does not persist task state across service restarts. Pending configuration tasks and their results are lost, so users must follow up manually to find out whether a task completed.
Symptoms
Direct User Experience Symptoms
After instructing OpenClaw to perform configuration tasks that trigger a system reboot or require the agent to restart, users observe the following behavioral symptoms:
- Silent task termination: The agent completes its assigned work and initiates a reboot/restart, but provides no automated follow-up communication after coming back online.
- Forced manual status inquiry: Users must explicitly ask the agent to check the status of previous tasks after a restart, disrupting the continuous automation flow.
- State amnesia: Upon restart, the agent has no memory of pending tasks, expected outcomes, or the context of work that was in progress.
Technical Manifestations
From a systems perspective, the absence of state persistence manifests as:
# User initiates a task requiring reboot
$ openclaw execute --task "update-kernel-parameter" --params '{"param": "net.ipv4.tcp_timestamps", "value": "1"}'
# Agent acknowledges and begins execution
[OpenClaw Agent] Task accepted. Applying kernel parameter and initiating system reboot...
# After reboot, user must manually inquire
$ openclaw status
[OpenClaw Agent] No active tasks. Ready for new instructions.
# Agent has no record of the previous task or its pending state
Affected Use Cases
- Kernel parameter modifications requiring a system reboot to take effect
- Core component updates that restart the OpenClaw service mid-task
- Multi-stage provisioning workflows interrupted by host restarts
- Firewall rule changes that trigger SSH connection drops and reconnections
Intended vs. Actual Behavior
| Scenario | Expected Behavior | Actual Behavior |
|---|---|---|
| Task initiates reboot | Persist task context before reboot | Task context lost |
| Agent restarts | Resume pending tasks automatically | Agent starts with clean state |
| Task completion | Proactive result notification | User must query manually |
| System rollback needed | Detect and report partial failure | No automatic verification |
Root Cause
Architectural Root Cause: Stateless Task Execution Model
The underlying issue stems from OpenClaw’s current execution model, which operates as a stateless request-response system rather than a stateful workflow engine. This architectural decision, while simplifying initial implementation, creates a fundamental gap in handling long-running or restart-dependent tasks.
Failure Sequence Analysis
The following sequence diagram illustrates the point of failure:
```
+----------------------------------------------------------------------+
|                   TASK EXECUTION FAILURE SEQUENCE                    |
+----------------------------------------------------------------------+

 [Client]               [OpenClaw Agent]             [System/Systemd]
    |                          |                            |
    |  submit task             |                            |
    |------------------------->|                            |
    |                          |  execute_task()            |
    |                          |--------------------------->|
    |                          |  modify_config()           |
    |                          |--------------------------->|
    |                          |  initiate_reboot()         |
    |                          |--------------------------->|
    |                          |                            |  [SYSTEM REBOOTS]
    |                          X  *** AGENT PROCESS         |
    |                             TERMINATED ***            |
    |                                                       |
    |  NO STATE                |  NO STATE                  |  [SYSTEM ONLINE]
    |  PRESERVED               |  PRESERVED                 |
    |                          |  clean_start()             |
    |  manual_query            |                            |
    |------------------------->|                            |
    |  "No active tasks"       |                            |
    |<-------------------------|                            |
```
Technical Debt: Missing State Persistence Layer
The absence of the following components constitutes the root technical debt:
1. Absence of Checkpoint Mechanism
```python
# Current (simplified) task execution flow
def execute_task(task):
    # No checkpoint before potentially terminating operations
    apply_configuration(task.config)
    if task.requires_reboot:
        system_reboot()  # Agent process terminates here - state lost
    return SUCCESS  # Never reached after reboot
```
2. No Persistent State Store
The agent lacks a mechanism to serialize and persist:
- Current task state (`PENDING`, `IN_PROGRESS`, `AWAITING_REBOOT`, `VERIFICATION`)
- Expected outcomes and acceptance criteria
- Task metadata (timestamps, retry counts, dependency chains)
- User notification flags
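
The fields listed above could be captured in one serializable record. The sketch below is only one possible shape and not OpenClaw's actual data model: `TaskRecord`, its field names, and the `to_json`/`from_json` helpers are hypothetical, while the state names come from the list above.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any, Dict, List


class TaskState(str, Enum):
    PENDING = "PENDING"
    IN_PROGRESS = "IN_PROGRESS"
    AWAITING_REBOOT = "AWAITING_REBOOT"
    VERIFICATION = "VERIFICATION"


@dataclass
class TaskRecord:
    """Hypothetical record holding everything needed to resume a task."""
    task_id: str
    state: TaskState
    expected_outcomes: List[Dict[str, Any]] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)  # timestamps, retry counts, ...
    notify_user: bool = True

    def to_json(self) -> str:
        payload = asdict(self)
        payload["state"] = self.state.value  # store the plain string, not the enum
        return json.dumps(payload, sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "TaskRecord":
        payload = json.loads(raw)
        payload["state"] = TaskState(payload["state"])  # rejects unknown states
        return cls(**payload)
```

Round-tripping through `to_json` and `from_json` yields an equal record, which is the property a state store relies on when it reloads tasks after a restart.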
3. Missing Recovery/Resume Logic
Upon restart, the agent initializes with:
- No awareness of previously submitted tasks
- No verification of whether changes took effect
- No proactive notification capability
```python
# Current (simplified) agent startup
def on_agent_start():
    # Fresh initialization - no recovery logic
    initialize_extensions()
    register_command_handlers()
    enter_idle_loop()  # Previous tasks unknown
```
Environmental Dependencies
The state persistence failure is exacerbated by:
- Containerized environments: Docker containers with restart policies lose all in-memory state on restart
- Systemd-managed services: Standard service units do not provide application-level state awareness
- Cloud-init scenarios: VMs that snapshot/restore without agent coordination
- Network interruptions: Prolonged disconnections that trigger timeout-based restarts
Step-by-Step Fix
Implementation Overview
The recommended fix implements a checkpoint-based state persistence system that saves task context before any potentially terminating operation and automatically recovers and verifies pending tasks upon restart.
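
As a rough illustration of the idea before the detailed phases, the sketch below checkpoints a task to a JSON file and resolves it on the next start. It is a simplified stand-in under stated assumptions: the function names `checkpoint` and `recover`, the single-file layout, and the `verify` callback are all hypothetical and are not part of OpenClaw.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Dict, Optional


def checkpoint(path: Path, state: Dict) -> None:
    """Atomically write state so a crash never leaves a half-written file."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_name, path)  # atomic because both files share a directory


def recover(path: Path, verify: Callable[[Dict], bool]) -> Optional[Dict]:
    """On startup: load the checkpoint, verify the change, record the outcome."""
    if not path.exists():
        return None  # nothing was in flight before the restart
    state = json.loads(path.read_text())
    if state["state"] == "AWAITING_VERIFICATION":
        state["state"] = "COMPLETED" if verify(state) else "FAILED"
        checkpoint(path, state)
    return state
```

The agent would call `checkpoint(...)` immediately before triggering the reboot and `recover(...)` once during startup; the phases below replace the single file with a queryable store and add notifications.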
Phase 1: Define State Persistence Schema
Create a structured state store schema for task persistence:
```json
{
  "task_id": "uuid-v4-string",
  "task_type": "CONFIG_UPDATE | PACKAGE_INSTALL | KERNEL_PARAM | …",
  "state": "PENDING | IN_PROGRESS | AWAITING_VERIFICATION | COMPLETED | FAILED",
  "created_at": "ISO-8601-timestamp",
  "checkpoint_at": "ISO-8601-timestamp",
  "config_snapshot": {
    "intended_changes": {},
    "rollback_plan": {}
  },
  "verification_criteria": [
    {
      "check_type": "COMMAND_OUTPUT",
      "command": "sysctl net.ipv4.tcp_timestamps",
      "expected": "net.ipv4.tcp_timestamps = 1"
    },
    {
      "check_type": "FILE_EXISTS",
      "path": "/etc/sysctl.d/99-custom.conf"
    }
  ],
  "notification_flags": {
    "user_id": "user-handle",
    "channel": "slack | email | webhook",
    "pending": true
  },
  "retry_policy": {
    "max_attempts": 3,
    "current_attempt": 1,
    "backoff_seconds": 30
  },
  "metadata": {
    "parent_task_id": null,
    "correlation_id": "correlation-uuid",
    "tags": ["kernel", "network", "reboot-required"]
  }
}
```
Phase 2: Implement Persistence Layer
Step 2.1: Create State Store Abstraction
```python
# openclaw/state/persistence.py
import json
import sqlite3
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, List, Optional


class StateStore(ABC):
    """Abstract base for state persistence backends."""

    @abstractmethod
    async def save_task(self, task_id: str, state: Dict[str, Any]) -> None:
        pass

    @abstractmethod
    async def load_task(self, task_id: str) -> Optional[Dict[str, Any]]:
        pass

    @abstractmethod
    async def load_pending_tasks(self) -> List[Dict[str, Any]]:
        pass

    @abstractmethod
    async def delete_task(self, task_id: str) -> None:
        pass


class SQLiteStateStore(StateStore):
    """SQLite-based state persistence for production use.

    The sqlite3 calls are synchronous and briefly block the event loop;
    each statement is small, so they are kept simple here.
    """

    def __init__(self, db_path: Path):
        self.db_path = db_path
        self._init_database()

    def _init_database(self):
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS task_state (
                    task_id TEXT PRIMARY KEY,
                    state_json TEXT NOT NULL,
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                )
            """)
            # Index the extracted state value rather than the whole JSON blob,
            # so the pending-task query below can use it
            conn.execute("""
                CREATE INDEX IF NOT EXISTS idx_task_state_status
                ON task_state(json_extract(state_json, '$.state'))
            """)

    async def save_task(self, task_id: str, state: Dict[str, Any]) -> None:
        state_json = json.dumps(state)
        with sqlite3.connect(self.db_path) as conn:
            # Upsert keeps the original created_at; INSERT OR REPLACE would
            # delete the row and reset it on every checkpoint
            conn.execute("""
                INSERT INTO task_state (task_id, state_json, updated_at)
                VALUES (?, ?, CURRENT_TIMESTAMP)
                ON CONFLICT(task_id) DO UPDATE SET
                    state_json = excluded.state_json,
                    updated_at = CURRENT_TIMESTAMP
            """, (task_id, state_json))

    async def load_task(self, task_id: str) -> Optional[Dict[str, Any]]:
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.execute(
                "SELECT state_json FROM task_state WHERE task_id = ?",
                (task_id,)
            )
            row = cursor.fetchone()
            return json.loads(row[0]) if row else None

    async def load_pending_tasks(self) -> List[Dict[str, Any]]:
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.execute("""
                SELECT state_json FROM task_state
                WHERE json_extract(state_json, '$.state')
                    IN ('PENDING', 'IN_PROGRESS', 'AWAITING_VERIFICATION')
                ORDER BY created_at ASC
            """)
            return [json.loads(row[0]) for row in cursor.fetchall()]

    async def delete_task(self, task_id: str) -> None:
        # Required by StateStore; without it the class cannot be instantiated
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("DELETE FROM task_state WHERE task_id = ?", (task_id,))
```
Step 2.2: Integrate Checkpoint into Task Execution
```python
# openclaw/task/executor.py
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, List

from openclaw.state.persistence import StateStore


class TaskState(Enum):
    PENDING = "PENDING"
    IN_PROGRESS = "IN_PROGRESS"
    AWAITING_VERIFICATION = "AWAITING_VERIFICATION"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


class CheckpointableExecutor:
    """Task executor with automatic checkpointing.

    _apply_configuration, _initiate_reboot, _run_verification and
    _send_proactive_notification are assumed to be implemented elsewhere
    in the agent and are not shown here.
    """

    def __init__(self, state_store: StateStore):
        self.state_store = state_store

    @staticmethod
    def _timestamp() -> str:
        return datetime.now(timezone.utc).isoformat()

    async def execute_with_checkpoint(
        self,
        task_id: str,
        task_config: Dict[str, Any],
        verification_criteria: List[Dict],
        requires_reboot: bool = False
    ) -> Dict[str, Any]:
        # INITIAL CHECKPOINT: Persist initial task state
        initial_state = {
            "task_id": task_id,
            "state": TaskState.IN_PROGRESS.value,
            "config_snapshot": task_config,
            "verification_criteria": verification_criteria,
            "requires_reboot": requires_reboot,
            "checkpoint_at": self._timestamp()
        }
        await self.state_store.save_task(task_id, initial_state)

        try:
            # EXECUTE: Apply configuration (result must be JSON-serializable)
            result = await self._apply_configuration(task_config)

            # CHECKPOINT BEFORE VERIFICATION: written before any reboot so the
            # restarted agent knows what to verify
            awaiting_state = {
                **initial_state,
                "state": TaskState.AWAITING_VERIFICATION.value,
                "pre_reboot_result": result,
                "checkpoint_at": self._timestamp()
            }
            await self.state_store.save_task(task_id, awaiting_state)

            if requires_reboot:
                # INITIATE REBOOT: the agent process is expected to terminate
                # here. Nothing after this call survives the restart, so
                # verification runs from the startup recovery path instead.
                await self._initiate_reboot()
                return awaiting_state

            # No reboot needed: verify immediately in the same process
            return await self._post_restart_verification(task_id)

        except Exception as e:
            await self.state_store.save_task(task_id, {
                **initial_state,
                "state": TaskState.FAILED.value,
                "error": str(e),
                "checkpoint_at": self._timestamp()
            })
            raise

    async def _post_restart_verification(self, task_id: str) -> Dict[str, Any]:
        """Called after agent restart to verify task completion."""
        task_state = await self.state_store.load_task(task_id)

        if not task_state:
            raise ValueError(f"No persisted state found for task {task_id}")

        if task_state.get("state") != TaskState.AWAITING_VERIFICATION.value:
            return task_state

        # Run verification checks
        verification_results = await self._run_verification(
            task_state["verification_criteria"]
        )

        all_passed = all(r["passed"] for r in verification_results)

        final_state = {
            **task_state,
            "state": TaskState.COMPLETED.value if all_passed else TaskState.FAILED.value,
            "verification_results": verification_results,
            "completed_at": self._timestamp()
        }

        await self.state_store.save_task(task_id, final_state)

        # Trigger proactive notification
        await self._send_proactive_notification(final_state)

        return final_state
```
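
The executor refers to a `_run_verification` helper that this guide does not define. One way it might look, assuming the two `check_type` values from the Phase 1 schema and exact matching on trimmed command output, is sketched below. The function names, the result keys other than `passed`, and the 10-second timeout are illustrative assumptions.

```python
import asyncio
import shlex
from pathlib import Path
from typing import Any, Dict, List


async def run_check(check: Dict[str, Any], timeout: float = 10.0) -> Dict[str, Any]:
    """Evaluate one verification criterion and report whether it passed."""
    check_type = check.get("check_type")
    if check_type == "FILE_EXISTS":
        return {"check": check, "passed": Path(check["path"]).exists()}
    if check_type == "COMMAND_OUTPUT":
        try:
            # No shell involved: the command string is split POSIX-style
            proc = await asyncio.create_subprocess_exec(
                *shlex.split(check["command"]),
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.DEVNULL,
            )
        except OSError as exc:  # e.g. the binary does not exist
            return {"check": check, "passed": False, "error": str(exc)}
        try:
            stdout, _ = await asyncio.wait_for(proc.communicate(), timeout)
        except asyncio.TimeoutError:
            proc.kill()
            await proc.wait()
            return {"check": check, "passed": False, "error": "timeout"}
        actual = stdout.decode().strip()
        return {"check": check, "passed": actual == check["expected"], "actual": actual}
    return {"check": check, "passed": False, "error": f"unknown check_type: {check_type}"}


async def run_verification(criteria: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Run all checks sequentially, preserving their order."""
    return [await run_check(check) for check in criteria]
```

Each result carries a `passed` flag, which is the only key the executor's `all(r["passed"] ...)` aggregation depends on. Unknown check types fail closed rather than being skipped.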
Step 2.3: Implement Recovery Logic on Agent Startup
```python
# openclaw/agent/startup.py
from typing import Any, Dict, List

from openclaw.state.persistence import StateStore
from openclaw.task.executor import CheckpointableExecutor


class AgentRecoveryManager:
    """Handles automatic recovery of pending tasks on agent startup."""

    def __init__(self, state_store: StateStore, executor: CheckpointableExecutor):
        self.state_store = state_store
        self.executor = executor

    async def on_agent_startup(self) -> List[Dict[str, Any]]:
        """
        Called when the OpenClaw agent starts.
        Recovers and processes all pending tasks.
        """
        pending_tasks = await self.state_store.load_pending_tasks()
        recovery_results = []

        for task in pending_tasks:
            try:
                # Only AWAITING_VERIFICATION tasks are verified; tasks in other
                # pending states are returned unchanged by the executor
                result = await self.executor._post_restart_verification(task["task_id"])
                recovery_results.append({
                    "task_id": task["task_id"],
                    "status": "recovered",
                    "result": result
                })
            except Exception as e:
                # One failing task must not block recovery of the others
                recovery_results.append({
                    "task_id": task["task_id"],
                    "status": "recovery_failed",
                    "error": str(e)
                })

        return recovery_results
```
Step 2.4: Implement Proactive Notification Service
```python
# openclaw/notifications/proactive.py
import logging
from abc import ABC, abstractmethod
from typing import Any, Dict, List

import httpx

logger = logging.getLogger(__name__)


class NotificationChannel(ABC):
    @abstractmethod
    async def send(self, message: str, metadata: Dict[str, Any]) -> bool:
        pass


class WebhookNotification(NotificationChannel):
    def __init__(self, webhook_url: str):
        self.webhook_url = webhook_url

    async def send(self, message: str, metadata: Dict[str, Any]) -> bool:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                self.webhook_url,
                json={
                    "text": message,
                    "attachments": [{
                        "color": "#36a64f" if metadata.get("success") else "#ff0000",
                        "fields": [
                            {"title": k, "value": str(v), "short": True}
                            for k, v in metadata.items()
                        ]
                    }]
                },
                timeout=10.0
            )
            return response.status_code == 200


class ProactiveNotificationService:
    """Sends proactive notifications after task completion or verification."""

    def __init__(self, channels: List[NotificationChannel]):
        self.channels = channels

    async def notify_task_completion(
        self,
        task_state: Dict[str, Any]
    ) -> None:
        success = task_state.get("state") == "COMPLETED"
        verification = task_state.get("verification_results", [])

        message = (
            f"✅ Task `{task_state['task_id']}` completed successfully."
            if success else
            f"❌ Task `{task_state['task_id']}` verification failed."
        )

        metadata = {
            "success": success,
            "task_type": task_state.get("task_type"),
            "verification_checks": len(verification),
            "checks_passed": sum(1 for v in verification if v.get("passed")),
            "completed_at": task_state.get("completed_at")
        }

        for channel in self.channels:
            try:
                await channel.send(message, metadata)
            except Exception as e:
                # Log but don't fail - notification is best-effort
                logger.warning(f"Failed to notify via {channel}: {e}")
```
Phase 3: Configuration Integration
Add persistence configuration to openclaw.yaml:
```yaml
# openclaw.yaml (partial)
persistence:
  enabled: true
  backend: "sqlite"  # or "file", "etcd", "postgres"
  path: "/var/lib/openclaw/state.db"

recovery:
  auto_recover_pending_tasks: true
  max_recovery_attempts: 3
  recovery_delay_seconds: 5

notifications:
  proactive:
    enabled: true
    channels:
      - type: "webhook"
        url: "${OPENCLAW_NOTIFICATION_WEBHOOK}"
      - type: "log"
        level: "info"
    include_verification_details: true
```
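
How the agent reads these settings is not specified here. The sketch below shows one plausible approach, assuming the YAML has already been parsed into a plain mapping and that `${VAR}` placeholders are resolved from the environment. `load_settings`, `PersistenceSettings`, and this `ConfigurationError` class are hypothetical names used only for illustration.

```python
import os
import re
from dataclasses import dataclass
from typing import Any, List, Mapping

SUPPORTED_BACKENDS = {"sqlite", "file", "etcd", "postgres"}
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


class ConfigurationError(ValueError):
    """Raised when the persistence or notification settings are unusable."""


@dataclass(frozen=True)
class PersistenceSettings:
    enabled: bool
    backend: str
    path: str
    webhook_urls: List[str]


def load_settings(raw: Mapping[str, Any],
                  env: Mapping[str, str] = os.environ) -> PersistenceSettings:
    """Validate the parsed config and expand ${VAR} placeholders in webhook URLs."""
    persistence = raw.get("persistence", {})
    backend = persistence.get("backend", "sqlite")
    if backend not in SUPPORTED_BACKENDS:
        raise ConfigurationError(f"unsupported persistence backend: {backend}")

    proactive = raw.get("notifications", {}).get("proactive", {})
    urls: List[str] = []
    if proactive.get("enabled", False):
        for channel in proactive.get("channels", []):
            if channel.get("type") != "webhook":
                continue
            # Unset variables expand to an empty string and are rejected below
            url = _PLACEHOLDER.sub(lambda m: env.get(m.group(1), ""),
                                   channel.get("url", ""))
            if not url:
                raise ConfigurationError("webhook channel has no usable URL")
            urls.append(url)

    return PersistenceSettings(
        enabled=bool(persistence.get("enabled", False)),
        backend=backend,
        path=persistence.get("path", ""),
        webhook_urls=urls,
    )
```

Failing at load time when `OPENCLAW_NOTIFICATION_WEBHOOK` is unset surfaces the problem at startup instead of at the first notification attempt.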
Before vs. After Comparison
| Aspect | Before Implementation | After Implementation |
|---|---|---|
| Pre-reboot state | Lost immediately | Persisted to SQLite with full context |
| Post-reboot awareness | Zero - fresh start | Loads and processes pending tasks |
| Verification | Manual query required | Automatic verification on restart |
| User notification | Reactive (user asks) | Proactive (agent reports) |
| Task continuity | Broken across restarts | Seamless continuation |
| Failure detection | Delayed, manual | Immediate, automated |
Verification
Unit Test Verification
Test 1: State Persistence Across Simulated Restart
```python
# tests/unit/test_state_persistence.py
import pytest

from openclaw.state.persistence import SQLiteStateStore
from openclaw.task.executor import CheckpointableExecutor, TaskState


@pytest.fixture
def temp_db(tmp_path):
    return tmp_path / "test_state.db"


@pytest.fixture
def state_store(temp_db):
    return SQLiteStateStore(temp_db)


@pytest.fixture
def executor(state_store, monkeypatch):
    executor = CheckpointableExecutor(state_store)

    # Stub the side-effecting steps so the test never touches the host
    async def fake_apply(config):
        return {"applied": True}

    async def fake_reboot():
        return None

    monkeypatch.setattr(executor, "_apply_configuration", fake_apply, raising=False)
    monkeypatch.setattr(executor, "_initiate_reboot", fake_reboot, raising=False)
    return executor


@pytest.mark.asyncio
async def test_task_state_persisted_before_reboot(executor, state_store):
    """Verify task state is correctly saved before reboot simulation."""
    task_id = "test-task-001"
    config = {"param": "net.ipv4.tcp_timestamps", "value": "1"}
    verification = [
        {"check_type": "COMMAND_OUTPUT", "command": "echo test", "expected": "test"}
    ]

    # Execute with checkpoint
    await executor.execute_with_checkpoint(
        task_id=task_id,
        task_config=config,
        verification_criteria=verification,
        requires_reboot=True
    )

    # Simulate restart by creating new store instance
    restarted_store = SQLiteStateStore(state_store.db_path)
    loaded_state = await restarted_store.load_task(task_id)

    assert loaded_state is not None
    assert loaded_state["task_id"] == task_id
    assert loaded_state["state"] == TaskState.AWAITING_VERIFICATION.value
    assert loaded_state["config_snapshot"] == config


@pytest.mark.asyncio
async def test_pending_tasks_loaded_on_restart(state_store, executor):
    """Verify all pending tasks are recovered after restart."""
    # Create multiple pending tasks
    for i in range(3):
        await state_store.save_task(f"task-{i}", {
            "task_id": f"task-{i}",
            "state": TaskState.AWAITING_VERIFICATION.value,
            "created_at": "2024-01-01T00:00:00Z"
        })

    # Simulate restart
    restarted_store = SQLiteStateStore(state_store.db_path)
    pending = await restarted_store.load_pending_tasks()

    assert len(pending) == 3
    assert all(t["task_id"].startswith("task-") for t in pending)
```
Test 2: Verification Execution on Recovery
```python
@pytest.mark.asyncio
async def test_verification_runs_on_recovery(executor):
    """Verify that verification criteria are executed after restart."""
    task_id = "verify-task-001"

    # Manually set task to awaiting verification
    await executor.state_store.save_task(task_id, {
        "task_id": task_id,
        "state": TaskState.AWAITING_VERIFICATION.value,
        "verification_criteria": [
            {
                "check_type": "COMMAND_OUTPUT",
                "command": "echo 'success'",
                "expected": "success"
            }
        ],
        "created_at": "2024-01-01T00:00:00Z"
    })

    # Simulate recovery
    result = await executor._post_restart_verification(task_id)

    assert result["state"] == TaskState.COMPLETED.value
    assert len(result["verification_results"]) == 1
    assert result["verification_results"][0]["passed"] is True
```
Integration Test Verification
Test 3: Full Reboot Cycle Simulation
```bash
#!/bin/bash
# integration-tests/test_reboot_persistence.sh
set -e

# Minimal helper: fail the script when expected ($1) differs from actual ($2)
assert_equal() {
  if [ "$1" != "$2" ]; then
    echo "Assertion failed: expected '$1', got '$2'" >&2
    exit 1
  fi
}

TASK_ID="integration-test-$(date +%s)"
OPENCLAW_ENDPOINT="${OPENCLAW_ENDPOINT:-http://localhost:8080}"

echo "=== Step 1: Submit reboot-requiring task ==="
RESPONSE=$(curl -s -X POST "${OPENCLAW_ENDPOINT}/api/v1/tasks" \
  -H "Content-Type: application/json" \
  -d "{
    \"task_id\": \"${TASK_ID}\",
    \"type\": \"kernel_param\",
    \"config\": {
      \"param\": \"fs.file-max\",
      \"value\": \"65536\"
    },
    \"requires_reboot\": true,
    \"verification\": [
      {
        \"check_type\": \"command\",
        \"command\": \"sysctl -n fs.file-max\",
        \"expected\": \"65536\"
      }
    ]
  }")

echo "Response: $RESPONSE"
TASK_STATE=$(echo "$RESPONSE" | jq -r '.state')
assert_equal "IN_PROGRESS" "$TASK_STATE"

echo "=== Step 2: Verify state persisted to database ==="
SQLITE_DB="/var/lib/openclaw/state.db"
PERSISTED_STATE=$(sqlite3 "$SQLITE_DB" "SELECT state_json FROM task_state WHERE task_id='${TASK_ID}'")
echo "Persisted: $PERSISTED_STATE"

echo "=== Step 3: Simulate agent restart ==="
systemctl restart openclaw-agent

echo "=== Step 4: Verify agent recovered task on startup ==="
sleep 2
RECOVERY_LOG=$(journalctl -u openclaw-agent --since "1 minute ago" | grep "recovered task" || true)
echo "Recovery log: $RECOVERY_LOG"

echo "=== Step 5: Verify task completed after recovery ==="
FINAL_STATE=$(curl -s "${OPENCLAW_ENDPOINT}/api/v1/tasks/${TASK_ID}" | jq -r '.state')
assert_equal "COMPLETED" "$FINAL_STATE"

echo "=== Step 6: Verify proactive notification sent ==="
NOTIFICATION_LOG=$(grep "notification sent" /var/log/openclaw/notifications.log | tail -1)
echo "Notification: $NOTIFICATION_LOG"

echo "=== All integration tests passed ==="
```
Manual Verification Checklist
Execute this checklist to verify the implementation in a live environment:
```
VERIFICATION CHECKLIST
----------------------

[ ] 1. State Store Initialization
    $ ls -la /var/lib/openclaw/state.db
    Expected: File exists with correct permissions (0600)

[ ] 2. Task State Checkpointing
    $ sqlite3 /var/lib/openclaw/state.db \
      "SELECT task_id, json_extract(state_json, '$.state') FROM task_state"
    Before reboot: Should show task in AWAITING_VERIFICATION state
    After reboot: Should show task in COMPLETED or FAILED state

[ ] 3. Agent Startup Recovery
    $ journalctl -u openclaw-agent -n 50 | grep -i "recovery\|pending"
    Expected: Log lines showing pending tasks being processed

[ ] 4. Verification Execution
    $ sqlite3 /var/lib/openclaw/state.db \
      "SELECT json_extract(state_json, '$.verification_results') FROM task_state"
    Expected: JSON array of verification check results

[ ] 5. Notification Dispatch
    $ tail -f /var/log/openclaw/notifications.log
    Expected: Outgoing webhook calls to configured notification endpoint

[ ] 6. End-to-End Latency
    Reboot cycle should complete verification within 30 seconds of restart
```
Common Pitfalls
Implementation Pitfalls
- Race Condition During Checkpoint Write
Symptom: Task state saved incompletely, leaving corrupted or partial records in the state store.
Mitigation: Use atomic write operations (write to temp file, fsync, then rename) or database transactions with WAL mode.
```python
# INCORRECT - susceptible to corruption
async def save_task(task_id, state):
    with open(f"/tmp/{task_id}.json", "w") as f:
        json.dump(state, f)  # Crash here = corrupted state

# CORRECT - atomic write. The temp file sits next to the target so the
# rename stays on one filesystem, which is what makes it atomic.
async def save_task(task_id, state):
    final_path = f"/var/lib/openclaw/{task_id}.json"
    temp_path = f"{final_path}.tmp"
    async with aiofiles.open(temp_path, mode="w") as f:
        await f.write(json.dumps(state))
        await f.flush()
        os.fsync(f.fileno())
    os.replace(temp_path, final_path)
```
- State Store Lock Contention
Symptom: Agent hangs or times out when accessing state store under high concurrency.
Mitigation: Configure appropriate SQLite busy timeout and use connection pooling for production backends.
```python
# Configure SQLite for concurrent access
conn.execute("PRAGMA busy_timeout = 5000")  # 5 second timeout
conn.execute("PRAGMA journal_mode = WAL")   # Write-Ahead Logging
```
- Verification Heisenbugs
Symptom: Verification passes in test but fails intermittently in production due to timing or system state.
Mitigation: Implement retry logic for transient verification failures and add jitter to avoid thundering herd.
```python
async def verify_with_retry(criteria, max_attempts=3, base_delay=1):
    for attempt in range(max_attempts):
        try:
            if await run_verification(criteria):
                return True
        except TransientError:
            pass
        if attempt < max_attempts - 1:
            # Exponential backoff plus jitter; no sleep after the final attempt
            await asyncio.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    return False
```
Configuration Pitfalls
- Missing Notification Channel Configuration
Symptom: Proactive notifications silently fail because webhook URL is not set.
Mitigation: Validate channel configuration at startup and fail fast if required channels are misconfigured.
```python
# Validate at startup (attribute path follows the openclaw.yaml layout)
if config.notifications.proactive.enabled:
    for channel in config.notifications.proactive.channels:
        if channel.type == "webhook" and not channel.url:
            raise ConfigurationError(
                "Webhook URL required for proactive notifications"
            )
```
- State Store Path Permissions
Symptom: Agent cannot write state store, tasks lost after restart.
Mitigation: Document required permissions and create directories with correct ownership during installation.
```bash
# Installation script snippet
mkdir -p /var/lib/openclaw
chown openclaw:openclaw /var/lib/openclaw
chmod 0700 /var/lib/openclaw
```
Environment-Specific Pitfalls
- Docker Volume Persistence
Symptom: State lost in Docker Compose environment when containers restart with default volume behavior.
Fix: Explicitly mount state directory to persistent volume.
```yaml
# docker-compose.yml
services:
  openclaw-agent:
    image: openclaw/agent:latest
    volumes:
      - openclaw-state:/var/lib/openclaw
      - /var/run:/var/run  # For systemd socket access if needed

volumes:
  openclaw-state:
    driver: local
```
- Kubernetes Pod Disruption
Symptom: Task state lost during pod eviction or node drain.
Fix: Use PersistentVolumeClaim for state store or external database backend (etcd, PostgreSQL).
```yaml
# kubernetes deployment with PVC
spec:
  volumes:
    - name: openclaw-state
      persistentVolumeClaim:
        claimName: openclaw-state-pvc
  containers:
    - name: agent
      volumeMounts:
        - name: openclaw-state
          mountPath: /var/lib/openclaw
```
- macOS Sandbox Restrictions
Symptom: State file write operations fail due to Application Sandbox entitlement restrictions.
Fix: Request explicit file access entitlements or use user defaults for state storage.
```
# If using macOS app bundle, add to entitlements:
com.apple.security.files.user-selected.read-write = true
com.apple.security.files.bookmarks.app-scope = true
```
Operational Pitfalls
- State Store Growth (Unbounded)
Symptom: State database grows indefinitely, consuming disk space.
Fix: Implement TTL-based cleanup and archival policy.
```sql
-- Cleanup job - run daily
DELETE FROM task_state
WHERE json_extract(state_json, '$.state') IN ('COMPLETED', 'FAILED')
  AND updated_at < datetime('now', '-7 days');

-- Vacuum to reclaim space (VACUUM is a statement, not a PRAGMA)
VACUUM;
```
- Stuck Tasks (No Timeout)
Symptom: Tasks stuck in AWAITING_VERIFICATION indefinitely on systems that never reboot.
Fix: Implement maximum wait time and automatic resolution or escalation.
```python
MAX_AWAIT_SECONDS = 3600  # 1 hour

async def check_stuck_tasks():
    for task in await store.load_pending_tasks():
        # checkpoint_at is stored as an ISO-8601 string; this assumes it
        # carries a UTC offset so the subtraction is timezone-aware
        checkpointed = datetime.fromisoformat(task["checkpoint_at"])
        elapsed = (datetime.now(timezone.utc) - checkpointed).total_seconds()
        if elapsed > MAX_AWAIT_SECONDS:
            await escalate_task(task)
```
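
The TTL cleanup described under State Store Growth could also be driven from the agent itself rather than an external job. The following is a sketch under assumptions: `cleanup_finished_tasks` and its `retention_days` parameter are hypothetical, and the table layout is the `task_state` schema from Step 2.1.

```python
import sqlite3


def cleanup_finished_tasks(conn: sqlite3.Connection, retention_days: int = 7) -> int:
    """Delete COMPLETED/FAILED rows older than the retention window.

    Returns the number of rows removed. Pending tasks are never touched.
    """
    cursor = conn.execute(
        """
        DELETE FROM task_state
        WHERE json_extract(state_json, '$.state') IN ('COMPLETED', 'FAILED')
          AND updated_at < datetime('now', ?)
        """,
        (f"-{retention_days} days",),
    )
    removed = cursor.rowcount
    conn.commit()
    # VACUUM cannot run inside a transaction, so it follows the commit
    conn.execute("VACUUM")
    return removed
```

Calling it once per day from a scheduled task keeps the database bounded; rows still in a pending state survive regardless of age, so stuck tasks remain visible to the timeout check.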
Related Errors
Logically Connected Error Patterns
E_OPENCLAW_TASK_NOT_FOUND
Description: Agent cannot locate task state in persistence store during recovery. Indicates checkpoint failure or manual database manipulation.
Related: State store corruption, disk full during checkpoint write.

E_OPENCLAW_VERIFICATION_TIMEOUT
Description: Verification criteria check exceeded configured timeout. Common when verifying network-dependent configurations.
Related: Network interruption, firewall blocking required ports, service not yet started.

E_OPENCLAW_STATE_STORE_LOCKED
Description: Concurrent access to state store results in SQLITE_BUSY errors. Requires busy timeout configuration or connection pooling.
Related: High concurrent task submission, SQLite misconfiguration.

E_OPENCLAW_RECOVERY_FAILED
Description: Agent startup recovery process encountered unrecoverable error. Requires manual intervention.
Related: Schema migration failure, incompatible state format, corrupted verification criteria.

E_OPENCLAW_NOTIFICATION_DELIVERY_FAILED
Description: Proactive notification dispatch failed. Task completed but user not informed.
Related: Webhook endpoint unreachable, invalid credentials, rate limiting.

E_OPENCLAW_CHECKPOINT_INCOMPLETE
Description: Partial state written before crash. Detected during recovery validation.
Related: System crash during checkpoint, insufficient fsync, disk I/O errors.
Related GitHub Issues and Feature Requests
| Issue/PR | Title | Relationship |
|---|---|---|
| #142 | Support for long-running tasks with progress reporting | Parent feature request |
| #187 | Add checkpoint/resume capability to task executor | Direct implementation of this guide |
| #203 | Proactive notifications via webhook | Notification component |
| #156 | SQLite backend for state persistence | Persistence backend |
| #198 | etcd/KV store support for distributed agents | Alternative persistence |
| #215 | Kubernetes operator for OpenClaw agent lifecycle | K8s integration concern |
| #178 | Task deduplication across agent restarts | Related recovery concern |
External Dependencies
- SQLite 3.35+ built with the JSON1 functions (included by default from 3.38.0): Required for the `json_extract` calls used in state queries
- aiofiles: Async file I/O for non-blocking state operations
- httpx: Async HTTP client for webhook notifications
- systemd: For service restart handling and watchdog integration