May 03, 2026 • Version: unspecified

Task State Loss After OpenClaw Service Restarts

OpenClaw does not persist task state across service restarts. Pending configuration tasks and their results are lost, and users must follow up manually to find out whether a task completed.

🔍 Symptoms

Direct User Experience Symptoms

After instructing OpenClaw to perform configuration tasks that trigger a system reboot or require the agent to restart, users observe the following behavioral symptoms:

Silent task termination: The agent completes its assigned work and initiates a reboot/restart, but provides no automated follow-up communication after coming back online.
Forced manual status inquiry: Users must explicitly ask the agent to check the status of previous tasks after a restart, disrupting the continuous automation flow.
State amnesia: Upon restart, the agent has no memory of pending tasks, expected outcomes, or the context of work that was in progress.

Technical Manifestations

From a systems perspective, the absence of state persistence manifests as:


# User initiates a task requiring reboot
$ openclaw execute --task "update-kernel-parameter" --params '{"param": "net.ipv4.tcp_timestamps", "value": "1"}'

# Agent acknowledges and begins execution
[OpenClaw Agent] Task accepted. Applying kernel parameter and initiating system reboot...

# After reboot, user must manually inquire
$ openclaw status
[OpenClaw Agent] No active tasks. Ready for new instructions.

# Agent has no record of the previous task or its pending state

Affected Use Cases

Kernel parameter modifications requiring a system reboot to take effect
Core component updates that restart the OpenClaw service mid-task
Multi-stage provisioning workflows interrupted by host restarts
Firewall rule changes that trigger SSH connection drops and reconnections

Intended vs. Actual Behavior

Scenario	Expected Behavior	Actual Behavior
Task initiates reboot	Persist task context before reboot	Task context lost
Agent restarts	Resume pending tasks automatically	Agent starts with clean state
Task completion	Proactive result notification	User must query manually
System rollback needed	Detect and report partial failure	No automatic verification

🧠 Root Cause

Architectural Root Cause: Stateless Task Execution Model

The underlying issue stems from OpenClaw’s current execution model, which operates as a stateless request-response system rather than a stateful workflow engine. This architectural decision, while simplifying initial implementation, creates a fundamental gap in handling long-running or restart-dependent tasks.

Failure Sequence Analysis

The following sequence diagram illustrates the point of failure:

┌──────────────────────────────────────────────────────────────────────┐
│                   TASK EXECUTION FAILURE SEQUENCE                    │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│ [Client] ──────>  [OpenClaw Agent] ─────> [System/Systemd]           │
│     │                     │                       │                  │
│     │ submit task         │                       │                  │
│     │────────────────────>│                       │                  │
│     │                     │ execute_task()        │                  │
│     │                     │──────────────────────>│                  │
│     │                     │                       │ modify_config()  │
│     │                     │                       │──────────────>   │
│     │                     │                       │                  │
│     │                     │ initiate_reboot()     │                  │
│     │                     │──────────────────────>│                  │
│     │                     │                       │ [SYSTEM REBOOTS] │
│     │                     │ *** AGENT PROCESS     │                  │
│     │                     │ TERMINATED ***        │                  │
│     │                     │                       │                  │
│     │ NO STATE            │ NO STATE              │ [SYSTEM ONLINE]  │
│     │ PRESERVED           │ PRESERVED             │                  │
│     │                     │                       │                  │
│     │ manual_query        │ clean_start()         │                  │
│     │────────────────────>│                       │                  │
│     │                     │ "No active tasks"     │                  │
│     │                     │<───────────────       │                  │
│     │                     │                       │                  │
└──────────────────────────────────────────────────────────────────────┘

Technical Debt: Missing State Persistence Layer

The absence of the following components constitutes the root technical debt:

1. Absence of Checkpoint Mechanism

python

# Current (simplified) task execution flow

def execute_task(task):
    # No checkpoint before potentially terminating operations
    apply_configuration(task.config)
    if task.requires_reboot:
        system_reboot()  # Agent process terminates here - state lost
    return SUCCESS  # Never reached after reboot

2. No Persistent State Store

The agent lacks a mechanism to serialize and persist:

Current task state (PENDING, IN_PROGRESS, AWAITING_REBOOT, VERIFICATION)
Expected outcomes and acceptance criteria
Task metadata (timestamps, retry counts, dependency chains)
User notification flags

3. Missing Recovery/Resume Logic

Upon restart, the agent initializes with:

No awareness of previously submitted tasks
No verification of whether changes took effect
No proactive notification capability

python

# Current (simplified) agent startup

def on_agent_start():
    # Fresh initialization - no recovery logic
    initialize_extensions()
    register_command_handlers()
    enter_idle_loop()  # Previous tasks unknown

Environmental Dependencies

The state persistence failure is exacerbated by:

Containerized environments: Docker containers with restart policies lose all in-memory state on restart
Systemd-managed services: Standard service units do not provide application-level state awareness
Cloud-init scenarios: VMs that snapshot/restore without agent coordination
Network interruptions: Prolonged disconnections that trigger timeout-based restarts

🛠️ Step-by-Step Fix

Implementation Overview

The recommended fix implements a checkpoint-based state persistence system that saves task context before any potentially terminating operation and automatically recovers and verifies pending tasks upon restart.
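As a rough illustration of the checkpoint-then-recover idea, the sketch below writes a task record to disk before a (simulated) terminating operation and reads it back the way a freshly started process would. The one-file-per-task layout and the `checkpoint` / `recover` helper names are assumptions made for this example, not OpenClaw APIs.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def checkpoint(state_dir: Path, task: Dict[str, Any]) -> Path:
    """Durably write one task record (hypothetical helper, one file per task)."""
    final_path = state_dir / f"{task['task_id']}.json"
    # Write to a temp file in the same directory, then atomically swap it in.
    fd, tmp_name = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(task, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, final_path)
    return final_path


def recover(state_dir: Path) -> List[Dict[str, Any]]:
    """Return every task that was still awaiting verification at shutdown."""
    tasks = [json.loads(p.read_text()) for p in sorted(state_dir.glob("*.json"))]
    return [t for t in tasks if t["state"] == "AWAITING_VERIFICATION"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        state_dir = Path(d)
        checkpoint(state_dir, {"task_id": "demo-1", "state": "AWAITING_VERIFICATION"})
        checkpoint(state_dir, {"task_id": "demo-2", "state": "COMPLETED"})
        # A restarted process has nothing in memory; it only sees the files.
        print([t["task_id"] for t in recover(state_dir)])  # ['demo-1']
```

The phases that follow use SQLite rather than flat files, but the ordering is the same: persist first, then perform the operation that may kill the process.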

Phase 1: Define State Persistence Schema

Create a structured state store schema for task persistence:

json

{
  "task_id": "uuid-v4-string",
  "task_type": "CONFIG_UPDATE | PACKAGE_INSTALL | KERNEL_PARAM | …",
  "state": "PENDING | IN_PROGRESS | AWAITING_VERIFICATION | COMPLETED | FAILED",
  "created_at": "ISO-8601-timestamp",
  "checkpoint_at": "ISO-8601-timestamp",
  "config_snapshot": {
    "intended_changes": {},
    "rollback_plan": {}
  },
  "verification_criteria": [
    {
      "check_type": "COMMAND_OUTPUT",
      "command": "sysctl net.ipv4.tcp_timestamps",
      "expected": "net.ipv4.tcp_timestamps = 1"
    },
    {
      "check_type": "FILE_EXISTS",
      "path": "/etc/sysctl.d/99-custom.conf"
    }
  ],
  "notification_flags": {
    "user_id": "user-handle",
    "channel": "slack | email | webhook",
    "pending": true
  },
  "retry_policy": {
    "max_attempts": 3,
    "current_attempt": 1,
    "backoff_seconds": 30
  },
  "metadata": {
    "parent_task_id": null,
    "correlation_id": "correlation-uuid",
    "tags": ["kernel", "network", "reboot-required"]
  }
}
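One way to keep records consistent with this schema is to build and check them in a single place. The sketch below is illustrative only: `new_task_record` and `validate_task_record` are hypothetical helpers, and the set of required fields is an assumption drawn from the schema above rather than a documented OpenClaw contract.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, List

ALLOWED_STATES = {"PENDING", "IN_PROGRESS", "AWAITING_VERIFICATION", "COMPLETED", "FAILED"}


def new_task_record(task_type: str, intended_changes: Dict[str, Any],
                    verification_criteria: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a JSON-serializable record shaped like the schema above."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "task_id": str(uuid.uuid4()),
        "task_type": task_type,
        "state": "PENDING",
        "created_at": now,
        "checkpoint_at": now,
        "config_snapshot": {"intended_changes": intended_changes, "rollback_plan": {}},
        "verification_criteria": verification_criteria,
        "retry_policy": {"max_attempts": 3, "current_attempt": 1, "backoff_seconds": 30},
    }


def validate_task_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    for key in ("task_id", "task_type", "state", "created_at", "verification_criteria"):
        if key not in record:
            problems.append(f"missing field: {key}")
    if record.get("state") not in ALLOWED_STATES:
        problems.append(f"unknown state: {record.get('state')!r}")
    return problems
```

Validating on load as well as on save lets recovery reject records written by an older or incompatible agent version instead of acting on them.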

Phase 2: Implement Persistence Layer

Step 2.1: Create State Store Abstraction

python

# openclaw/state/persistence.py

import json
import sqlite3
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, List, Optional


class StateStore(ABC):
    """Abstract base for state persistence backends."""

    @abstractmethod
    async def save_task(self, task_id: str, state: Dict[str, Any]) -> None:
        pass

    @abstractmethod
    async def load_task(self, task_id: str) -> Optional[Dict[str, Any]]:
        pass

    @abstractmethod
    async def load_pending_tasks(self) -> List[Dict[str, Any]]:
        pass

    @abstractmethod
    async def delete_task(self, task_id: str) -> None:
        pass


class SQLiteStateStore(StateStore):
    """SQLite-based state persistence for production use.

    The sqlite3 calls below are synchronous and briefly block the event loop;
    that is acceptable for small checkpoint writes.
    """

    def __init__(self, db_path: Path):
        self.db_path = db_path
        self._init_database()

    def _init_database(self) -> None:
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS task_state (
                    task_id TEXT PRIMARY KEY,
                    state_json TEXT NOT NULL,
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                )
            """)
            # Index the extracted state, which is what load_pending_tasks filters on
            conn.execute("""
                CREATE INDEX IF NOT EXISTS idx_task_state_status
                ON task_state(json_extract(state_json, '$.state'))
            """)

    async def save_task(self, task_id: str, state: Dict[str, Any]) -> None:
        state_json = json.dumps(state)
        # Upsert so created_at survives later checkpoints of the same task
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                INSERT INTO task_state (task_id, state_json)
                VALUES (?, ?)
                ON CONFLICT(task_id) DO UPDATE SET
                    state_json = excluded.state_json,
                    updated_at = CURRENT_TIMESTAMP
            """, (task_id, state_json))

    async def load_task(self, task_id: str) -> Optional[Dict[str, Any]]:
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.execute(
                "SELECT state_json FROM task_state WHERE task_id = ?",
                (task_id,)
            )
            row = cursor.fetchone()
            return json.loads(row[0]) if row else None

    async def load_pending_tasks(self) -> List[Dict[str, Any]]:
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.execute("""
                SELECT state_json FROM task_state
                WHERE json_extract(state_json, '$.state')
                IN ('PENDING', 'IN_PROGRESS', 'AWAITING_VERIFICATION')
                ORDER BY created_at ASC
            """)
            return [json.loads(row[0]) for row in cursor.fetchall()]

    async def delete_task(self, task_id: str) -> None:
        # Required by the StateStore interface; without it the class cannot be instantiated
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("DELETE FROM task_state WHERE task_id = ?", (task_id,))

Step 2.2: Integrate Checkpoint into Task Execution

python

# openclaw/task/executor.py

from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, List

from openclaw.state.persistence import StateStore


class TaskState(Enum):
    PENDING = "PENDING"
    IN_PROGRESS = "IN_PROGRESS"
    AWAITING_VERIFICATION = "AWAITING_VERIFICATION"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


class CheckpointableExecutor:
    """Task executor with automatic checkpointing.

    _apply_configuration, _initiate_reboot, _run_verification and
    _send_proactive_notification are assumed to be implemented elsewhere.
    """

    def __init__(self, state_store: StateStore):
        self.state_store = state_store

    @staticmethod
    def _timestamp() -> str:
        return datetime.now(timezone.utc).isoformat()

    async def execute_with_checkpoint(
        self,
        task_id: str,
        task_config: Dict[str, Any],
        verification_criteria: List[Dict],
        requires_reboot: bool = False
    ) -> Dict[str, Any]:
        # INITIAL CHECKPOINT: Persist initial task state
        initial_state = {
            "task_id": task_id,
            "state": TaskState.IN_PROGRESS.value,
            "config_snapshot": task_config,
            "verification_criteria": verification_criteria,
            "requires_reboot": requires_reboot,
            "checkpoint_at": self._timestamp()
        }
        await self.state_store.save_task(task_id, initial_state)

        try:
            # EXECUTE: Apply configuration
            result = await self._apply_configuration(task_config)

            # CHECKPOINT: Mark as awaiting verification before anything that
            # may terminate the process. This also applies to tasks without a
            # reboot; otherwise verification below would skip them.
            awaiting_state = {
                **initial_state,
                "state": TaskState.AWAITING_VERIFICATION.value,
                "pre_reboot_result": result,
                "checkpoint_at": self._timestamp()
            }
            await self.state_store.save_task(task_id, awaiting_state)

            if requires_reboot:
                # INITIATE REBOOT: the agent process normally terminates here.
                # Verification then runs from the startup recovery path.
                await self._initiate_reboot()
                return awaiting_state

            # No reboot needed: verify immediately in this process
            return await self._post_restart_verification(task_id)

        except Exception as e:
            await self.state_store.save_task(task_id, {
                **initial_state,
                "state": TaskState.FAILED.value,
                "error": str(e),
                "checkpoint_at": self._timestamp()
            })
            raise

    async def _post_restart_verification(self, task_id: str) -> Dict[str, Any]:
        """Verify a checkpointed task; called after restart or directly when no reboot is needed."""
        task_state = await self.state_store.load_task(task_id)

        if not task_state:
            raise ValueError(f"No persisted state found for task {task_id}")

        # Tasks in any other state are returned unchanged
        if task_state.get("state") != TaskState.AWAITING_VERIFICATION.value:
            return task_state

        # Run verification checks
        verification_results = await self._run_verification(
            task_state["verification_criteria"]
        )

        all_passed = all(r["passed"] for r in verification_results)
        final_state = {
            **task_state,
            "state": TaskState.COMPLETED.value if all_passed else TaskState.FAILED.value,
            "verification_results": verification_results,
            "completed_at": self._timestamp()
        }

        await self.state_store.save_task(task_id, final_state)

        # Trigger proactive notification
        await self._send_proactive_notification(final_state)

        return final_state
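The executor calls `_run_verification` without showing it. A minimal sketch of what such a routine could look like for the two check types in the schema (`COMMAND_OUTPUT` and `FILE_EXISTS`) follows. The function names `run_check` and `run_all_checks`, the exact-match comparison, and the 10-second default timeout are assumptions for illustration, not the project's implementation.

```python
import asyncio
import shlex
from pathlib import Path
from typing import Any, Dict, List


async def run_check(criterion: Dict[str, Any], timeout: float = 10.0) -> Dict[str, Any]:
    """Evaluate one verification criterion and report whether it passed."""
    check_type = criterion.get("check_type")
    if check_type == "FILE_EXISTS":
        return {"check": criterion, "passed": Path(criterion["path"]).exists()}
    if check_type == "COMMAND_OUTPUT":
        try:
            # Run without a shell; the command string is split POSIX-style.
            proc = await asyncio.create_subprocess_exec(
                *shlex.split(criterion["command"]),
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.DEVNULL,
            )
        except OSError as exc:  # e.g. command not found
            return {"check": criterion, "passed": False, "error": str(exc)}
        try:
            stdout, _ = await asyncio.wait_for(proc.communicate(), timeout)
        except asyncio.TimeoutError:
            proc.kill()
            await proc.wait()
            return {"check": criterion, "passed": False, "error": "timeout"}
        actual = stdout.decode().strip()
        # Assumption: a check passes only on an exact match of trimmed stdout.
        return {"check": criterion, "passed": actual == criterion["expected"], "actual": actual}
    return {"check": criterion, "passed": False, "error": f"unknown check_type {check_type!r}"}


async def run_all_checks(criteria: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Run checks sequentially so results keep the order of the criteria."""
    return [await run_check(c) for c in criteria]
```

Each result carries a `passed` key, which is the only field the executor's `all(r["passed"] ...)` aggregation relies on.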

Step 2.3: Implement Recovery Logic on Agent Startup

python

# openclaw/agent/startup.py

from typing import Any, Dict, List

from openclaw.state.persistence import StateStore
from openclaw.task.executor import CheckpointableExecutor


class AgentRecoveryManager:
    """Handles automatic recovery of pending tasks on agent startup."""

    def __init__(self, state_store: StateStore, executor: CheckpointableExecutor):
        self.state_store = state_store
        self.executor = executor

    async def on_agent_startup(self) -> List[Dict[str, Any]]:
        """
        Called when the OpenClaw agent starts.
        Recovers and processes all pending tasks.
        """
        pending_tasks = await self.state_store.load_pending_tasks()
        recovery_results = []

        for task in pending_tasks:
            try:
                # Only AWAITING_VERIFICATION tasks are actually verified here;
                # PENDING and IN_PROGRESS tasks come back unchanged.
                result = await self.executor._post_restart_verification(task["task_id"])
                recovery_results.append({
                    "task_id": task["task_id"],
                    "status": "recovered",
                    "result": result
                })
            except Exception as e:
                recovery_results.append({
                    "task_id": task["task_id"],
                    "status": "recovery_failed",
                    "error": str(e)
                })

        return recovery_results
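The recovery loop above hands every pending task to the verification step, yet a task interrupted in `PENDING` or `IN_PROGRESS` was never confirmed as applied. One possible refinement is to decide per state what a restarted agent should do. The sketch below is a hypothetical policy: the action names (`verify`, `execute`, `retry`, `fail`, `skip`) and the use of `retry_policy.current_attempt` are assumptions, not existing OpenClaw behavior.

```python
from typing import Any, Dict


def plan_recovery(task: Dict[str, Any], max_attempts: int = 3) -> str:
    """Decide what a restarted agent should do with one persisted task."""
    state = task.get("state")
    attempt = task.get("retry_policy", {}).get("current_attempt", 1)
    if state == "AWAITING_VERIFICATION":
        return "verify"   # changes were applied; confirm they took effect
    if state == "PENDING":
        return "execute"  # never started, so it can run from the beginning
    if state == "IN_PROGRESS":
        # Interrupted mid-apply: outcome unknown, so retry only within budget.
        return "retry" if attempt < max_attempts else "fail"
    return "skip"         # COMPLETED / FAILED need no recovery
```

Keeping the decision in a pure function makes the policy easy to unit test separately from the state store and the executor.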

Step 2.4: Implement Proactive Notification Service

python

# openclaw/notifications/proactive.py

import logging
from abc import ABC, abstractmethod
from typing import Any, Dict, List

import httpx

logger = logging.getLogger(__name__)


class NotificationChannel(ABC):
    @abstractmethod
    async def send(self, message: str, metadata: Dict[str, Any]) -> bool:
        pass


class WebhookNotification(NotificationChannel):
    def __init__(self, webhook_url: str):
        self.webhook_url = webhook_url

    async def send(self, message: str, metadata: Dict[str, Any]) -> bool:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                self.webhook_url,
                json={
                    "text": message,
                    "attachments": [{
                        "color": "#36a64f" if metadata.get("success") else "#ff0000",
                        "fields": [
                            {"title": k, "value": str(v), "short": True}
                            for k, v in metadata.items()
                        ]
                    }]
                },
                timeout=10.0
            )
            # Treat any 2xx response as delivered
            return response.is_success


class ProactiveNotificationService:
    """Sends proactive notifications after task completion or verification."""

    def __init__(self, channels: List[NotificationChannel]):
        self.channels = channels

    async def notify_task_completion(
        self,
        task_state: Dict[str, Any]
    ) -> None:
        success = task_state.get("state") == "COMPLETED"
        verification = task_state.get("verification_results", [])

        message = (
            f"✅ Task `{task_state['task_id']}` completed successfully."
            if success else
            f"❌ Task `{task_state['task_id']}` verification failed."
        )

        metadata = {
            "success": success,
            "task_type": task_state.get("task_type"),
            "verification_checks": len(verification),
            "checks_passed": sum(1 for v in verification if v.get("passed")),
            "completed_at": task_state.get("completed_at")
        }

        for channel in self.channels:
            try:
                await channel.send(message, metadata)
            except Exception as e:
                # Log but don't fail - notification is best-effort
                logger.warning(f"Failed to notify via {channel}: {e}")

Phase 3: Configuration Integration

Add persistence configuration to openclaw.yaml:

yaml

# openclaw.yaml (partial)

persistence:
  enabled: true
  backend: "sqlite"  # or "file", "etcd", "postgres"
  path: "/var/lib/openclaw/state.db"

recovery:
  auto_recover_pending_tasks: true
  max_recovery_attempts: 3
  recovery_delay_seconds: 5

notifications:
  proactive:
    enabled: true
    channels:
      - type: "webhook"
        url: "${OPENCLAW_NOTIFICATION_WEBHOOK}"
      - type: "log"
        level: "info"
    include_verification_details: true
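The webhook URL above is written as an environment placeholder. How OpenClaw resolves such placeholders is not stated here, so the following is only a sketch of one approach: expand `${VAR}` references in an already parsed config and report any that stayed unresolved. `expand_env` and `find_unresolved` are hypothetical helper names.

```python
import os
from typing import Any, List


def expand_env(value: Any) -> Any:
    """Recursively expand ${VAR} references in strings inside a parsed config."""
    if isinstance(value, str):
        # Unknown variables are left unchanged by os.path.expandvars.
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {k: expand_env(v) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v) for v in value]
    return value


def find_unresolved(value: Any, path: str = "") -> List[str]:
    """Return config paths whose string value still contains a ${...} placeholder."""
    if isinstance(value, str):
        return [path] if "${" in value else []
    if isinstance(value, dict):
        return [p for k, v in value.items()
                for p in find_unresolved(v, f"{path}.{k}" if path else k)]
    if isinstance(value, list):
        return [p for i, v in enumerate(value)
                for p in find_unresolved(v, f"{path}[{i}]")]
    return []
```

Running `find_unresolved` at startup turns a forgotten `OPENCLAW_NOTIFICATION_WEBHOOK` into an explicit configuration error instead of a notification that silently never arrives.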

Before vs. After Comparison

Aspect	Before Implementation	After Implementation
Pre-reboot state	Lost immediately	Persisted to SQLite with full context
Post-reboot awareness	Zero - fresh start	Loads and processes pending tasks
Verification	Manual query required	Automatic verification on restart
User notification	Reactive (user asks)	Proactive (agent reports)
Task continuity	Broken across restarts	Seamless continuation
Failure detection	Delayed, manual	Immediate, automated

🧪 Verification

Unit Test Verification

Test 1: State Persistence Across Simulated Restart

python

# tests/unit/test_state_persistence.py

import pytest

from openclaw.state.persistence import SQLiteStateStore
from openclaw.task.executor import CheckpointableExecutor, TaskState


class SimulatedReboot(BaseException):
    """Stands in for the agent process being killed by the reboot."""


class RebootStubExecutor(CheckpointableExecutor):
    """Test double: applies nothing and 'reboots' by raising SimulatedReboot."""

    async def _apply_configuration(self, task_config):
        return {"applied": True}

    async def _initiate_reboot(self):
        raise SimulatedReboot()


@pytest.fixture
def temp_db(tmp_path):
    return tmp_path / "test_state.db"


@pytest.fixture
def state_store(temp_db):
    return SQLiteStateStore(temp_db)


@pytest.fixture
def executor(state_store):
    return RebootStubExecutor(state_store)


@pytest.mark.asyncio
async def test_task_state_persisted_before_reboot(executor, state_store):
    """Verify task state is correctly saved before reboot simulation."""
    task_id = "test-task-001"
    config = {"param": "net.ipv4.tcp_timestamps", "value": "1"}
    verification = [
        {"check_type": "COMMAND_OUTPUT", "command": "echo test", "expected": "test"}
    ]

    # Execute with checkpoint; the simulated reboot interrupts execution
    with pytest.raises(SimulatedReboot):
        await executor.execute_with_checkpoint(
            task_id=task_id,
            task_config=config,
            verification_criteria=verification,
            requires_reboot=True
        )

    # Simulate restart by creating new store instance
    restarted_store = SQLiteStateStore(state_store.db_path)
    loaded_state = await restarted_store.load_task(task_id)

    assert loaded_state is not None
    assert loaded_state["task_id"] == task_id
    assert loaded_state["state"] == TaskState.AWAITING_VERIFICATION.value
    assert loaded_state["config_snapshot"] == config


@pytest.mark.asyncio
async def test_pending_tasks_loaded_on_restart(state_store, executor):
    """Verify all pending tasks are recovered after restart."""
    # Create multiple pending tasks
    for i in range(3):
        await state_store.save_task(f"task-{i}", {
            "task_id": f"task-{i}",
            "state": TaskState.AWAITING_VERIFICATION.value,
            "created_at": "2024-01-01T00:00:00Z"
        })

    # Simulate restart
    restarted_store = SQLiteStateStore(state_store.db_path)
    pending = await restarted_store.load_pending_tasks()

    assert len(pending) == 3
    assert all(t["task_id"].startswith("task-") for t in pending)

Test 2: Verification Execution on Recovery

python

@pytest.mark.asyncio
async def test_verification_runs_on_recovery(executor):
    """Verify that verification criteria are executed after restart."""
    task_id = "verify-task-001"

    # Manually set task to awaiting verification
    await executor.state_store.save_task(task_id, {
        "task_id": task_id,
        "state": TaskState.AWAITING_VERIFICATION.value,
        "verification_criteria": [
            {
                "check_type": "COMMAND_OUTPUT",
                "command": "echo 'success'",
                "expected": "success"
            }
        ],
        "created_at": "2024-01-01T00:00:00Z"
    })

    # Simulate recovery
    result = await executor._post_restart_verification(task_id)

    assert result["state"] == TaskState.COMPLETED.value
    assert len(result["verification_results"]) == 1
    assert result["verification_results"][0]["passed"] is True

Integration Test Verification

Test 3: Full Reboot Cycle Simulation

bash

#!/bin/bash
# integration-tests/test_reboot_persistence.sh
set -e

# Minimal assertion helper: abort the script when the two values differ
assert_equal() {
    if [ "$1" != "$2" ]; then
        echo "Assertion failed: expected '$1', got '$2'" >&2
        exit 1
    fi
}

TASK_ID="integration-test-$(date +%s)"
OPENCLAW_ENDPOINT="${OPENCLAW_ENDPOINT:-http://localhost:8080}"

echo "=== Step 1: Submit reboot-requiring task ==="
RESPONSE=$(curl -s -X POST "${OPENCLAW_ENDPOINT}/api/v1/tasks" \
    -H "Content-Type: application/json" \
    -d "{
        \"task_id\": \"${TASK_ID}\",
        \"type\": \"kernel_param\",
        \"config\": { \"param\": \"fs.file-max\", \"value\": \"65536\" },
        \"requires_reboot\": true,
        \"verification\": [
            { \"check_type\": \"command\", \"command\": \"sysctl fs.file-max\", \"expected\": \"65536\" }
        ]
    }")

echo "Response: $RESPONSE"
TASK_STATE=$(echo "$RESPONSE" | jq -r '.state')
assert_equal "IN_PROGRESS" "$TASK_STATE"

echo "=== Step 2: Verify state persisted to database ==="
SQLITE_DB="/var/lib/openclaw/state.db"
PERSISTED_STATE=$(sqlite3 "$SQLITE_DB" "SELECT state_json FROM task_state WHERE task_id='${TASK_ID}'")
echo "Persisted: $PERSISTED_STATE"

echo "=== Step 3: Simulate agent restart ==="
systemctl restart openclaw-agent

echo "=== Step 4: Verify agent recovered task on startup ==="
sleep 2
RECOVERY_LOG=$(journalctl -u openclaw-agent --since "1 minute ago" | grep "recovered task" || true)
echo "Recovery log: $RECOVERY_LOG"

echo "=== Step 5: Verify task completed after recovery ==="
FINAL_STATE=$(curl -s "${OPENCLAW_ENDPOINT}/api/v1/tasks/${TASK_ID}" | jq -r '.state')
assert_equal "COMPLETED" "$FINAL_STATE"

echo "=== Step 6: Verify proactive notification sent ==="
NOTIFICATION_LOG=$(grep "notification sent" /var/log/openclaw/notifications.log | tail -1)
echo "Notification: $NOTIFICATION_LOG"

echo "=== All integration tests passed ==="

Manual Verification Checklist

Execute this checklist to verify the implementation in a live environment:


VERIFICATION CHECKLIST
═════════════════════

□ 1. State Store Initialization
   $ ls -la /var/lib/openclaw/state.db
   Expected: File exists with correct permissions (0600)

□ 2. Task State Checkpointing
   $ sqlite3 /var/lib/openclaw/state.db \
     "SELECT task_id, state FROM task_state"
   Before reboot: Should show task in AWAITING_VERIFICATION state
   After reboot: Should show task in COMPLETED or FAILED state

□ 3. Agent Startup Recovery
   $ journalctl -u openclaw-agent -n 50 | grep -i "recovery\|pending"
   Expected: Log lines showing pending tasks being processed

□ 4. Verification Execution
   $ sqlite3 /var/lib/openclaw/state.db \
     "SELECT json_extract(state_json, '$.verification_results') FROM task_state"
   Expected: JSON array of verification check results

□ 5. Notification Dispatch
   $ tail -f /var/log/openclaw/notifications.log
   Expected: Outgoing webhook calls to configured notification endpoint

□ 6. End-to-End Latency
   Reboot cycle should complete verification within 30 seconds of restart

⚠️ Common Pitfalls

Implementation Pitfalls

Race Condition During Checkpoint Write
Symptom: Task state saved incompletely, leaving corrupted or partial records in the state store.
Mitigation: Use atomic write operations (write to temp file, fsync, then rename) or database transactions with WAL mode.

# INCORRECT - susceptible to corruption
async def save_task(task_id, state):
    with open(f"/tmp/{task_id}.json", "w") as f:
        json.dump(state, f)  # Crash here = corrupted state

# CORRECT - atomic write
async def save_task(task_id, state):
    # Keep the temp file on the same filesystem as the target so the rename is atomic
    final_path = f"/var/lib/openclaw/{task_id}.json"
    temp_path = f"{final_path}.tmp"
    async with aiofiles.open(temp_path, mode="w") as f:
        await f.write(json.dumps(state))
        await f.flush()
        os.fsync(f.fileno())
    os.replace(temp_path, final_path)

State Store Lock Contention
Symptom: Agent hangs or times out when accessing state store under high concurrency.
Mitigation: Configure appropriate SQLite busy timeout and use connection pooling for production backends.
```
# Configure SQLite for concurrent access
conn.execute("PRAGMA busy_timeout = 5000")  # 5 second timeout
conn.execute("PRAGMA journal_mode = WAL")  # Write-Ahead Logging
```

Verification Heisenbugs
Symptom: Verification passes in test but fails intermittently in production due to timing or system state.
Mitigation: Implement retry logic for transient verification failures and add jitter to avoid thundering herd.

import asyncio
import random

async def verify_with_retry(criteria, max_attempts=3, base_delay=1):
    for attempt in range(max_attempts):
        try:
            if await run_verification(criteria):
                return True
        except TransientError:
            pass
        if attempt < max_attempts - 1:
            # Exponential backoff plus jitter; no sleep after the final attempt
            await asyncio.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    return False

Configuration Pitfalls

Missing Notification Channel Configuration
Symptom: Proactive notifications silently fail because webhook URL is not set.
Mitigation: Validate channel configuration at startup and fail fast if required channels are misconfigured.

# Validate at startup (keys follow the notifications.proactive section of openclaw.yaml)
if config.notifications.proactive.enabled:
    for channel in config.notifications.proactive.channels:
        if channel.type == "webhook" and not channel.url:
            raise ConfigurationError(
                "Webhook URL required for proactive notifications"
            )

State Store Path Permissions
Symptom: Agent cannot write state store, tasks lost after restart.
Mitigation: Document required permissions and create directories with correct ownership during installation.
```
# Installation script snippet
mkdir -p /var/lib/openclaw
chown openclaw:openclaw /var/lib/openclaw
chmod 0700 /var/lib/openclaw
```
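The agent can also check these permissions itself at startup rather than discovering the problem at the first checkpoint. The sketch below assumes a POSIX host and the `0700` mode from the installation snippet; `check_state_dir` is a hypothetical helper, not part of OpenClaw.

```python
import os
import stat
from pathlib import Path
from typing import List


def check_state_dir(path: Path) -> List[str]:
    """Return a list of problems with the state directory; empty means usable."""
    if not path.is_dir():
        return [f"{path} does not exist or is not a directory"]
    problems = []
    # Writing files into a directory needs both write and search permission.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path} is not writable by uid {os.getuid()}")
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        # Group/other bits are set; task state may be readable by other users.
        problems.append(f"{path} has mode {mode:04o}; expected 0700")
    return problems
```

Failing fast on a non-empty result keeps the agent from accepting tasks it cannot checkpoint.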

Environment-Specific Pitfalls

Docker Volume Persistence
Symptom: State lost in Docker Compose environment when containers restart with default volume behavior.
Fix: Explicitly mount state directory to persistent volume.

# docker-compose.yml
services:
  openclaw-agent:
    image: openclaw/agent:latest
    volumes:
      - openclaw-state:/var/lib/openclaw
      - /var/run:/var/run  # For systemd socket access if needed

volumes:
  openclaw-state:
    driver: local

Kubernetes Pod Disruption
Symptom: Task state lost during pod eviction or node drain.
Fix: Use PersistentVolumeClaim for state store or external database backend (etcd, PostgreSQL).

# kubernetes deployment with PVC
spec:
  volumes:
    - name: openclaw-state
      persistentVolumeClaim:
        claimName: openclaw-state-pvc
  containers:
    - name: agent
      volumeMounts:
        - name: openclaw-state
          mountPath: /var/lib/openclaw

macOS Sandbox Restrictions
Symptom: State file write operations fail due to Application Sandbox entitlement restrictions.
Fix: Request explicit file access entitlements or use user defaults for state storage.
```
# If using macOS app bundle, add to entitlements:
com.apple.security.files.user-selected.read-write = true
com.apple.security.files.bookmarks.app-scope = true
```

Operational Pitfalls

State Store Growth (Unbounded)
Symptom: State database grows indefinitely, consuming disk space.
Fix: Implement TTL-based cleanup and archival policy.

-- Cleanup job - run daily
DELETE FROM task_state
WHERE json_extract(state_json, '$.state') IN ('COMPLETED', 'FAILED')
AND updated_at < datetime('now', '-7 days');

-- Vacuum to reclaim space
VACUUM;
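If the cleanup runs from the agent rather than from an external scheduler, the same statements can be issued through Python's `sqlite3` module. This is a sketch under assumptions: `purge_finished_tasks` is a hypothetical name, the 7-day default mirrors the query above, and the SQLite build must provide `json_extract`.

```python
import sqlite3
from contextlib import closing


def purge_finished_tasks(db_path: str, retention_days: int = 7) -> int:
    """Delete COMPLETED/FAILED rows older than the retention window; return the count."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits the DELETE as one transaction
            cur = conn.execute(
                """
                DELETE FROM task_state
                WHERE json_extract(state_json, '$.state') IN ('COMPLETED', 'FAILED')
                  AND updated_at < datetime('now', ?)
                """,
                (f"-{retention_days} days",),
            )
            deleted = cur.rowcount
        # VACUUM cannot run inside a transaction, so issue it after the commit.
        conn.execute("VACUUM")
    return deleted
```

Tasks that are still pending are never touched, whatever their age, so this job does not interfere with recovery.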

Stuck Tasks (No Timeout)
Symptom: Tasks stuck in AWAITING_VERIFICATION indefinitely on systems that never reboot.
Fix: Implement maximum wait time and automatic resolution or escalation.

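from datetime import datetime, timezone

MAX_AWAIT_SECONDS = 3600  # 1 hour

async def check_stuck_tasks():
    for task in await store.load_pending_tasks():
        # checkpoint_at is an ISO-8601 string with a UTC offset, so parse it first
        checkpointed = datetime.fromisoformat(task["checkpoint_at"])
        elapsed = (datetime.now(timezone.utc) - checkpointed).total_seconds()
        if elapsed > MAX_AWAIT_SECONDS:
            await escalate_task(task)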

🔗 Related Errors

Logically Connected Error Patterns

E_OPENCLAW_TASK_NOT_FOUND
Description: Agent cannot locate task state in persistence store during recovery. Indicates checkpoint failure or manual database manipulation.
Related: State store corruption, disk full during checkpoint write.
E_OPENCLAW_VERIFICATION_TIMEOUT
Description: Verification criteria check exceeded configured timeout. Common when verifying network-dependent configurations.
Related: Network interruption, firewall blocking required ports, service not yet started.
E_OPENCLAW_STATE_STORE_LOCKED
Description: Concurrent access to state store results in SQLITE_BUSY errors. Requires busy timeout configuration or connection pooling.
Related: High concurrent task submission, SQLite misconfiguration.
E_OPENCLAW_RECOVERY_FAILED
Description: Agent startup recovery process encountered unrecoverable error. Requires manual intervention.
Related: Schema migration failure, incompatible state format, corrupted verification criteria.
E_OPENCLAW_NOTIFICATION_DELIVERY_FAILED
Description: Proactive notification dispatch failed. Task completed but user not informed.
Related: Webhook endpoint unreachable, invalid credentials, rate limiting.
E_OPENCLAW_CHECKPOINT_INCOMPLETE
Description: Partial state written before crash. Detected during recovery validation.
Related: System crash during checkpoint, insufficient fsync, disk I/O errors.

Related GitHub Issues and Feature Requests

Issue/PR	Title	Relationship
#142	Support for long-running tasks with progress reporting	Parent feature request
#187	Add checkpoint/resume capability to task executor	Direct implementation of this guide
#203	Proactive notifications via webhook	Notification component
#156	SQLite backend for state persistence	Persistence backend
#198	etcd/KV store support for distributed agents	Alternative persistence
#215	Kubernetes operator for OpenClaw agent lifecycle	K8s integration concern
#178	Task deduplication across agent restarts	Related recovery concern

External Dependencies

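SQLite 3.35+: Required for the JSON functions (json_extract) used in state queries; these are built in by default only from 3.38.0, so older builds need JSON1 enabled at compile time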
aiofiles: Async file I/O for non-blocking state operations
httpx: Async HTTP client for webhook notifications
systemd: For service restart handling and watchdog integration
