May 03, 2026 • Version: unspecified

Task State Loss After OpenClaw Service Restarts

OpenClaw does not persist task state across service restarts, causing pending configuration tasks and their results to be lost, requiring manual follow-up to determine task completion status.

🔍 Symptoms

Direct User Experience Symptoms

After instructing OpenClaw to perform configuration tasks that trigger a system reboot or require the agent to restart, users observe the following behavioral symptoms:

  • Silent task termination: The agent completes its assigned work and initiates a reboot/restart, but provides no automated follow-up communication after coming back online.
  • Forced manual status inquiry: Users must explicitly ask the agent to check the status of previous tasks after a restart, disrupting the continuous automation flow.
  • State amnesia: Upon restart, the agent has no memory of pending tasks, expected outcomes, or the context of work that was in progress.

Technical Manifestations

From a systems perspective, the absence of state persistence manifests as:


# User initiates a task requiring reboot
$ openclaw execute --task "update-kernel-parameter" --params '{"param": "net.ipv4.tcp_timestamps", "value": "1"}'

# Agent acknowledges and begins execution
[OpenClaw Agent] Task accepted. Applying kernel parameter and initiating system reboot...

# After reboot, user must manually inquire
$ openclaw status
[OpenClaw Agent] No active tasks. Ready for new instructions.

# Agent has no record of the previous task or its pending state

Affected Use Cases

  • Kernel parameter modifications requiring a system reboot to take effect
  • Core component updates that restart the OpenClaw service mid-task
  • Multi-stage provisioning workflows interrupted by host restarts
  • Firewall rule changes that trigger SSH connection drops and reconnections

Intended vs. Actual Behavior

| Scenario | Expected Behavior | Actual Behavior |
| --- | --- | --- |
| Task initiates reboot | Persist task context before reboot | Task context lost |
| Agent restarts | Resume pending tasks automatically | Agent starts with clean state |
| Task completion | Proactive result notification | User must query manually |
| System rollback needed | Detect and report partial failure | No automatic verification |

🧠 Root Cause

Architectural Root Cause: Stateless Task Execution Model

The underlying issue stems from OpenClaw’s current execution model, which operates as a stateless request-response system rather than a stateful workflow engine. This architectural decision, while simplifying initial implementation, creates a fundamental gap in handling long-running or restart-dependent tasks.

Failure Sequence Analysis

The following sequence diagram illustrates the point of failure:

┌──────────────────────────────────────────────────────────────────┐
│                 TASK EXECUTION FAILURE SEQUENCE                  │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  [Client] ──────> [OpenClaw Agent] ──────> [System/Systemd]      │
│  │                 │                     │                       │
│  │ submit task     │                     │                       │
│  │────────────────>│                     │                       │
│  │                 │ execute_task()      │                       │
│  │                 │────────────────────>│                       │
│  │                 │                     │ modify_config()       │
│  │                 │                     │──────────────>        │
│  │                 │                     │                       │
│  │                 │ initiate_reboot()   │                       │
│  │                 │────────────────────>│                       │
│  │                 │                     │ [SYSTEM REBOOTS]      │
│  │                 │ *** AGENT PROCESS   │                       │
│  │                 │     TERMINATED ***  │                       │
│  │                 │                     │                       │
│  │ NO STATE        │ NO STATE            │ [SYSTEM ONLINE]       │
│  │ PRESERVED       │ PRESERVED           │                       │
│  │                 │                     │                       │
│  │ manual_query    │ clean_start()       │                       │
│  │────────────────>│                     │                       │
│  │                 │ "No active tasks"   │                       │
│  │                 │<────────────────────│                       │
│  │                 │                     │                       │
└──────────────────────────────────────────────────────────────────┘

Technical Debt: Missing State Persistence Layer

The absence of the following components constitutes the root technical debt:

1. Absence of Checkpoint Mechanism

python

# Current (simplified) task execution flow

def execute_task(task):
    # No checkpoint before potentially terminating operations
    apply_configuration(task.config)
    if task.requires_reboot:
        system_reboot()  # Agent process terminates here - state lost
    return SUCCESS  # Never reached after reboot

2. No Persistent State Store

The agent lacks a mechanism to serialize and persist:

  • Current task state (PENDING, IN_PROGRESS, AWAITING_VERIFICATION, COMPLETED, FAILED)
  • Expected outcomes and acceptance criteria
  • Task metadata (timestamps, retry counts, dependency chains)
  • User notification flags

3. Missing Recovery/Resume Logic

Upon restart, the agent initializes with:

  • No awareness of previously submitted tasks
  • No verification of whether changes took effect
  • No proactive notification capability

python

# Current (simplified) agent startup

def on_agent_start():
    # Fresh initialization - no recovery logic
    initialize_extensions()
    register_command_handlers()
    enter_idle_loop()  # Previous tasks unknown

Environmental Dependencies

The state persistence failure is exacerbated by:

  • Containerized environments: Docker containers with restart policies lose all in-memory state on restart
  • Systemd-managed services: Standard service units do not provide application-level state awareness
  • Cloud-init scenarios: VMs that snapshot/restore without agent coordination
  • Network interruptions: Prolonged disconnections that trigger timeout-based restarts

🛠️ Step-by-Step Fix

Implementation Overview

The recommended fix implements a checkpoint-based state persistence system that saves task context before any potentially terminating operation and automatically recovers and verifies pending tasks upon restart.
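
The checkpoint-then-recover cycle described above can be sketched in a few lines before looking at the full implementation. This is a minimal illustration using only the standard library, not OpenClaw's actual code: the function names `checkpoint` and `recover`, the two-column table, and the temporary database path are all assumptions made for the example.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def checkpoint(db_path: Path, task_id: str, state: dict) -> None:
    """Persist the task record before a potentially terminating operation."""
    with sqlite3.connect(db_path) as conn:  # commits on successful exit
        conn.execute(
            "CREATE TABLE IF NOT EXISTS task_state "
            "(task_id TEXT PRIMARY KEY, state_json TEXT NOT NULL)"
        )
        conn.execute(
            "INSERT OR REPLACE INTO task_state (task_id, state_json) VALUES (?, ?)",
            (task_id, json.dumps(state)),
        )


def recover(db_path: Path) -> list:
    """Return every task that was still awaiting verification at shutdown."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute("SELECT state_json FROM task_state").fetchall()
    tasks = [json.loads(row[0]) for row in rows]
    return [t for t in tasks if t["state"] == "AWAITING_VERIFICATION"]


# Simulate the cycle: checkpoint, "restart" (nothing is kept in memory),
# then recover from disk only.
db = Path(tempfile.mkdtemp()) / "state.db"
checkpoint(db, "demo-1", {"task_id": "demo-1", "state": "AWAITING_VERIFICATION"})
checkpoint(db, "demo-2", {"task_id": "demo-2", "state": "COMPLETED"})
pending = recover(db)
print([t["task_id"] for t in pending])
```

The important property is ordering: the record is committed to disk before the reboot is requested, so the recovery path never depends on anything held in the terminated process.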

Phase 1: Define State Persistence Schema

Create a structured state store schema for task persistence:

json

{
  "task_id": "uuid-v4-string",
  "task_type": "CONFIG_UPDATE | PACKAGE_INSTALL | KERNEL_PARAM | …",
  "state": "PENDING | IN_PROGRESS | AWAITING_VERIFICATION | COMPLETED | FAILED",
  "created_at": "ISO-8601-timestamp",
  "checkpoint_at": "ISO-8601-timestamp",
  "config_snapshot": {
    "intended_changes": {},
    "rollback_plan": {}
  },
  "verification_criteria": [
    {
      "check_type": "COMMAND_OUTPUT",
      "command": "sysctl net.ipv4.tcp_timestamps",
      "expected": "net.ipv4.tcp_timestamps = 1"
    },
    {
      "check_type": "FILE_EXISTS",
      "path": "/etc/sysctl.d/99-custom.conf"
    }
  ],
  "notification_flags": {
    "user_id": "user-handle",
    "channel": "slack | email | webhook",
    "pending": true
  },
  "retry_policy": {
    "max_attempts": 3,
    "current_attempt": 1,
    "backoff_seconds": 30
  },
  "metadata": {
    "parent_task_id": null,
    "correlation_id": "correlation-uuid",
    "tags": ["kernel", "network", "reboot-required"]
  }
}
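
To make the schema concrete, the following sketch builds one record with a real UUID and ISO-8601 timestamps. The helper name `new_task_record` is hypothetical, and only a subset of the schema's fields is filled in; the defaults for `retry_policy` are simply copied from the schema above.

```python
import json
import uuid
from datetime import datetime, timezone


def new_task_record(task_type: str, intended_changes: dict, criteria: list) -> dict:
    """Build a task record containing the core fields of the schema."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "task_id": str(uuid.uuid4()),
        "task_type": task_type,
        "state": "PENDING",
        "created_at": now,
        "checkpoint_at": now,
        "config_snapshot": {"intended_changes": intended_changes, "rollback_plan": {}},
        "verification_criteria": criteria,
        "retry_policy": {"max_attempts": 3, "current_attempt": 1, "backoff_seconds": 30},
    }


record = new_task_record(
    "KERNEL_PARAM",
    {"net.ipv4.tcp_timestamps": "1"},
    [{"check_type": "FILE_EXISTS", "path": "/etc/sysctl.d/99-custom.conf"}],
)
# A record must survive a JSON round trip unchanged to be safely persisted.
round_tripped = json.loads(json.dumps(record))
```

Keeping every value JSON-serializable (strings for timestamps and enum states) is what allows the record to be stored as a single `state_json` column later on.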

Phase 2: Implement Persistence Layer

Step 2.1: Create State Store Abstraction

python

# openclaw/state/persistence.py

import json
import sqlite3
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, List, Optional


class StateStore(ABC):
    """Abstract base for state persistence backends."""

    @abstractmethod
    async def save_task(self, task_id: str, state: Dict[str, Any]) -> None:
        pass

    @abstractmethod
    async def load_task(self, task_id: str) -> Optional[Dict[str, Any]]:
        pass

    @abstractmethod
    async def load_pending_tasks(self) -> List[Dict[str, Any]]:
        pass

    @abstractmethod
    async def delete_task(self, task_id: str) -> None:
        pass


class SQLiteStateStore(StateStore):
    """SQLite-based state persistence for production use."""

    def __init__(self, db_path: Path):
        self.db_path = db_path
        self._init_database()

    def _init_database(self) -> None:
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS task_state (
                    task_id TEXT PRIMARY KEY,
                    state_json TEXT NOT NULL,
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                )
            """)
            # Index the extracted state value; an index on the raw JSON
            # text cannot serve the json_extract() filter used below.
            conn.execute("""
                CREATE INDEX IF NOT EXISTS idx_task_state_status
                ON task_state(json_extract(state_json, '$.state'))
            """)

    async def save_task(self, task_id: str, state: Dict[str, Any]) -> None:
        state_json = json.dumps(state)
        # Upsert so created_at keeps its original value across checkpoints.
        # The connection context manager commits on success.
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                INSERT INTO task_state (task_id, state_json, updated_at)
                VALUES (?, ?, CURRENT_TIMESTAMP)
                ON CONFLICT(task_id) DO UPDATE SET
                    state_json = excluded.state_json,
                    updated_at = CURRENT_TIMESTAMP
            """, (task_id, state_json))

    async def load_task(self, task_id: str) -> Optional[Dict[str, Any]]:
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.execute(
                "SELECT state_json FROM task_state WHERE task_id = ?",
                (task_id,)
            )
            row = cursor.fetchone()
            return json.loads(row[0]) if row else None

    async def load_pending_tasks(self) -> List[Dict[str, Any]]:
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.execute("""
                SELECT state_json FROM task_state
                WHERE json_extract(state_json, '$.state')
                IN ('PENDING', 'IN_PROGRESS', 'AWAITING_VERIFICATION')
                ORDER BY created_at ASC
            """)
            return [json.loads(row[0]) for row in cursor.fetchall()]

    async def delete_task(self, task_id: str) -> None:
        # Required by the abstract base; without it the class cannot be instantiated
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("DELETE FROM task_state WHERE task_id = ?", (task_id,))

Step 2.2: Integrate Checkpoint into Task Execution

python

# openclaw/task/executor.py

from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, List

from openclaw.state.persistence import StateStore


class TaskState(Enum):
    PENDING = "PENDING"
    IN_PROGRESS = "IN_PROGRESS"
    AWAITING_VERIFICATION = "AWAITING_VERIFICATION"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


class CheckpointableExecutor:
    """Task executor with automatic checkpointing.

    The helpers _apply_configuration, _initiate_reboot, _run_verification
    and _send_proactive_notification are not shown in this guide.
    """

    def __init__(self, state_store: StateStore):
        self.state_store = state_store

    @staticmethod
    def _timestamp() -> str:
        return datetime.now(timezone.utc).isoformat()

    async def execute_with_checkpoint(
        self,
        task_id: str,
        task_config: Dict[str, Any],
        verification_criteria: List[Dict[str, Any]],
        requires_reboot: bool = False,
    ) -> Dict[str, Any]:
        # INITIAL CHECKPOINT: Persist initial task state
        initial_state = {
            "task_id": task_id,
            "state": TaskState.IN_PROGRESS.value,
            "config_snapshot": task_config,
            "verification_criteria": verification_criteria,
            "requires_reboot": requires_reboot,
            "checkpoint_at": self._timestamp(),
        }
        await self.state_store.save_task(task_id, initial_state)

        try:
            # EXECUTE: Apply configuration
            result = await self._apply_configuration(task_config)

            # CHECKPOINT: Changes applied, verification still outstanding
            awaiting_state = {
                **initial_state,
                "state": TaskState.AWAITING_VERIFICATION.value,
                "pre_reboot_result": result,
                "checkpoint_at": self._timestamp(),
            }
            await self.state_store.save_task(task_id, awaiting_state)

            if requires_reboot:
                # INITIATE REBOOT (agent process terminates). Verification
                # then runs from the startup recovery path, not from here.
                await self._initiate_reboot()
                return awaiting_state

            # No reboot required: verify immediately
            return await self._post_restart_verification(task_id)

        except Exception as e:
            await self.state_store.save_task(task_id, {
                **initial_state,
                "state": TaskState.FAILED.value,
                "error": str(e),
                "checkpoint_at": self._timestamp(),
            })
            raise

    async def _post_restart_verification(self, task_id: str) -> Dict[str, Any]:
        """Called after agent restart to verify task completion."""
        task_state = await self.state_store.load_task(task_id)

        if not task_state:
            raise ValueError(f"No persisted state found for task {task_id}")

        if task_state.get("state") != TaskState.AWAITING_VERIFICATION.value:
            return task_state

        # Run verification checks
        verification_results = await self._run_verification(
            task_state["verification_criteria"]
        )

        all_passed = all(r["passed"] for r in verification_results)
        final_state = {
            **task_state,
            "state": TaskState.COMPLETED.value if all_passed else TaskState.FAILED.value,
            "verification_results": verification_results,
            "completed_at": self._timestamp(),
        }

        await self.state_store.save_task(task_id, final_state)

        # Trigger proactive notification
        await self._send_proactive_notification(final_state)

        return final_state
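
The executor calls `_run_verification`, which this guide does not define. One possible shape for it, covering the two `check_type` values that appear in the schema, is sketched below as a standalone function. The result keys `check`, `actual`, and `error` are assumptions; only `passed` is relied on by the executor code. Commands are run through a shell, so this sketch assumes the persisted criteria come from a trusted source.

```python
import asyncio
from pathlib import Path
from typing import Any, Dict, List


async def run_verification(criteria: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Evaluate each criterion and return one result dict per check."""
    results = []
    for check in criteria:
        if check["check_type"] == "COMMAND_OUTPUT":
            proc = await asyncio.create_subprocess_shell(
                check["command"],
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
            stdout, _ = await proc.communicate()
            actual = stdout.decode().strip()
            # Pass only on a zero exit code and an exact output match
            passed = proc.returncode == 0 and actual == check["expected"]
            results.append({"check": check, "passed": passed, "actual": actual})
        elif check["check_type"] == "FILE_EXISTS":
            results.append({"check": check, "passed": Path(check["path"]).exists()})
        else:
            # Unknown check types fail closed rather than being skipped
            results.append({"check": check, "passed": False, "error": "unknown check_type"})
    return results


results = asyncio.run(run_verification([
    {"check_type": "COMMAND_OUTPUT", "command": "echo test", "expected": "test"},
    {"check_type": "FILE_EXISTS", "path": "/nonexistent/openclaw-demo"},
]))
print([r["passed"] for r in results])
```

Failing closed on unknown check types matters here: a task whose criteria cannot be evaluated should end up FAILED and visible, not silently COMPLETED.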

Step 2.3: Implement Recovery Logic on Agent Startup

python

# openclaw/agent/startup.py

from typing import Any, Dict, List

from openclaw.state.persistence import StateStore
from openclaw.task.executor import CheckpointableExecutor


class AgentRecoveryManager:
    """Handles automatic recovery of pending tasks on agent startup."""

    def __init__(self, state_store: StateStore, executor: CheckpointableExecutor):
        self.state_store = state_store
        self.executor = executor

    async def on_agent_startup(self) -> List[Dict[str, Any]]:
        """
        Called when the OpenClaw agent starts.
        Recovers and processes all pending tasks.
        """
        pending_tasks = await self.state_store.load_pending_tasks()
        recovery_results = []

        for task in pending_tasks:
            try:
                result = await self.executor._post_restart_verification(task["task_id"])
                recovery_results.append({
                    "task_id": task["task_id"],
                    "status": "recovered",
                    "result": result
                })
            except Exception as e:
                # One failed task must not block recovery of the others
                recovery_results.append({
                    "task_id": task["task_id"],
                    "status": "recovery_failed",
                    "error": str(e)
                })

        return recovery_results

Step 2.4: Implement Proactive Notification Service

python

# openclaw/notifications/proactive.py

import logging
from abc import ABC, abstractmethod
from typing import Any, Dict, List

import httpx

logger = logging.getLogger(__name__)


class NotificationChannel(ABC):
    @abstractmethod
    async def send(self, message: str, metadata: Dict[str, Any]) -> bool:
        pass


class WebhookNotification(NotificationChannel):
    def __init__(self, webhook_url: str):
        self.webhook_url = webhook_url

    async def send(self, message: str, metadata: Dict[str, Any]) -> bool:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                self.webhook_url,
                json={
                    "text": message,
                    "attachments": [{
                        "color": "#36a64f" if metadata.get("success") else "#ff0000",
                        "fields": [
                            {"title": k, "value": str(v), "short": True}
                            for k, v in metadata.items()
                        ]
                    }]
                },
                timeout=10.0
            )
            # Treat any 2xx response as delivered, not only 200
            return 200 <= response.status_code < 300


class ProactiveNotificationService:
    """Sends proactive notifications after task completion or verification."""

    def __init__(self, channels: List[NotificationChannel]):
        self.channels = channels

    async def notify_task_completion(
        self,
        task_state: Dict[str, Any]
    ) -> None:
        success = task_state.get("state") == "COMPLETED"
        verification = task_state.get("verification_results", [])

        message = (
            f"✅ Task `{task_state['task_id']}` completed successfully. "
            if success else
            f"❌ Task `{task_state['task_id']}` verification failed."
        )

        metadata = {
            "success": success,
            "task_type": task_state.get("task_type"),
            "verification_checks": len(verification),
            "checks_passed": sum(1 for v in verification if v.get("passed")),
            "completed_at": task_state.get("completed_at")
        }

        for channel in self.channels:
            try:
                await channel.send(message, metadata)
            except Exception as e:
                # Log but don't fail - notification is best-effort
                logger.warning(f"Failed to notify via {channel}: {e}")

Phase 3: Configuration Integration

Add persistence configuration to openclaw.yaml:

yaml

# openclaw.yaml (partial)

persistence:
  enabled: true
  backend: "sqlite"  # or "file", "etcd", "postgres"
  path: "/var/lib/openclaw/state.db"

recovery:
  auto_recover_pending_tasks: true
  max_recovery_attempts: 3
  recovery_delay_seconds: 5

notifications:
  proactive:
    enabled: true
    channels:
      - type: "webhook"
        url: "${OPENCLAW_NOTIFICATION_WEBHOOK}"
      - type: "log"
        level: "info"
    include_verification_details: true
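
The webhook URL in the configuration above is an environment variable reference, so it has to be expanded and checked when the agent loads its settings. A rough sketch of that step follows. It assumes the YAML has already been parsed into a plain dict and that the channel list sits under `notifications.proactive`; the function name `resolve_channels` is hypothetical.

```python
import os
from typing import Any, Dict, List


def resolve_channels(config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Expand ${VAR} references and reject webhook channels without a URL."""
    proactive = config.get("notifications", {}).get("proactive", {})
    if not proactive.get("enabled", False):
        return []
    resolved = []
    for channel in proactive.get("channels", []):
        channel = dict(channel)  # copy so the caller's config is not mutated
        if channel.get("type") == "webhook":
            url = os.path.expandvars(channel.get("url", ""))
            # expandvars leaves unknown variables untouched, so a remaining
            # "${" means the environment variable was not set.
            if not url or "${" in url:
                raise ValueError("webhook channel has no resolvable URL")
            channel["url"] = url
        resolved.append(channel)
    return resolved


os.environ["OPENCLAW_NOTIFICATION_WEBHOOK"] = "https://example.invalid/hook"
config = {"notifications": {"proactive": {"enabled": True, "channels": [
    {"type": "webhook", "url": "${OPENCLAW_NOTIFICATION_WEBHOOK}"},
    {"type": "log", "level": "info"},
]}}}
channels = resolve_channels(config)
```

Raising at load time means a missing variable surfaces when the agent starts, not later as a notification that silently never arrives.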

Before vs. After Comparison

| Aspect | Before Implementation | After Implementation |
| --- | --- | --- |
| Pre-reboot state | Lost immediately | Persisted to SQLite with full context |
| Post-reboot awareness | Zero - fresh start | Loads and processes pending tasks |
| Verification | Manual query required | Automatic verification on restart |
| User notification | Reactive (user asks) | Proactive (agent reports) |
| Task continuity | Broken across restarts | Seamless continuation |
| Failure detection | Delayed, manual | Immediate, automated |

🧪 Verification

Unit Test Verification

Test 1: State Persistence Across Simulated Restart

python

# tests/unit/test_state_persistence.py

from unittest.mock import AsyncMock

import pytest

from openclaw.state.persistence import SQLiteStateStore
from openclaw.task.executor import CheckpointableExecutor, TaskState


@pytest.fixture
def temp_db(tmp_path):
    return tmp_path / "test_state.db"


@pytest.fixture
def state_store(temp_db):
    return SQLiteStateStore(temp_db)


@pytest.fixture
def executor(state_store):
    executor = CheckpointableExecutor(state_store)
    # Stub the side-effecting steps so the test never touches the host
    executor._apply_configuration = AsyncMock(return_value={"applied": True})
    executor._initiate_reboot = AsyncMock()
    return executor


@pytest.mark.asyncio
async def test_task_state_persisted_before_reboot(executor, state_store):
    """Verify task state is correctly saved before reboot simulation."""
    task_id = "test-task-001"
    config = {"param": "net.ipv4.tcp_timestamps", "value": "1"}
    verification = [
        {"check_type": "COMMAND_OUTPUT", "command": "echo test", "expected": "test"}
    ]

    # Execute with checkpoint
    await executor.execute_with_checkpoint(
        task_id=task_id,
        task_config=config,
        verification_criteria=verification,
        requires_reboot=True
    )

    # Simulate restart by creating new store instance
    restarted_store = SQLiteStateStore(state_store.db_path)
    loaded_state = await restarted_store.load_task(task_id)

    assert loaded_state is not None
    assert loaded_state["task_id"] == task_id
    assert loaded_state["state"] == TaskState.AWAITING_VERIFICATION.value
    assert loaded_state["config_snapshot"] == config


@pytest.mark.asyncio
async def test_pending_tasks_loaded_on_restart(state_store, executor):
    """Verify all pending tasks are recovered after restart."""
    # Create multiple pending tasks
    for i in range(3):
        await state_store.save_task(f"task-{i}", {
            "task_id": f"task-{i}",
            "state": TaskState.AWAITING_VERIFICATION.value,
            "created_at": "2024-01-01T00:00:00Z"
        })

    # Simulate restart
    restarted_store = SQLiteStateStore(state_store.db_path)
    pending = await restarted_store.load_pending_tasks()

    assert len(pending) == 3
    assert all(t["task_id"].startswith("task-") for t in pending)

Test 2: Verification Execution on Recovery

python

@pytest.mark.asyncio
async def test_verification_runs_on_recovery(executor):
    """Verify that verification criteria are executed after restart."""
    task_id = "verify-task-001"

    # Manually set task to awaiting verification
    await executor.state_store.save_task(task_id, {
        "task_id": task_id,
        "state": TaskState.AWAITING_VERIFICATION.value,
        "verification_criteria": [
            {
                "check_type": "COMMAND_OUTPUT",
                "command": "echo 'success'",
                "expected": "success"
            }
        ],
        "created_at": "2024-01-01T00:00:00Z"
    })

    # Simulate recovery
    result = await executor._post_restart_verification(task_id)

    assert result["state"] == TaskState.COMPLETED.value
    assert len(result["verification_results"]) == 1
    assert result["verification_results"][0]["passed"] is True

Integration Test Verification

Test 3: Full Reboot Cycle Simulation

bash

#!/bin/bash
# integration-tests/test_reboot_persistence.sh
set -e

TASK_ID="integration-test-$(date +%s)"
OPENCLAW_ENDPOINT="${OPENCLAW_ENDPOINT:-http://localhost:8080}"

# Minimal assertion helper: assert_equal EXPECTED ACTUAL
assert_equal() {
  if [ "$1" != "$2" ]; then
    echo "Assertion failed: expected '$1', got '$2'" >&2
    exit 1
  fi
}

echo "=== Step 1: Submit reboot-requiring task ==="
# "sysctl -n" prints only the value, so it can be compared to "65536" directly
RESPONSE=$(curl -s -X POST "${OPENCLAW_ENDPOINT}/api/v1/tasks" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "task_id": "${TASK_ID}",
  "type": "kernel_param",
  "config": { "param": "fs.file-max", "value": "65536" },
  "requires_reboot": true,
  "verification": [
    {
      "check_type": "COMMAND_OUTPUT",
      "command": "sysctl -n fs.file-max",
      "expected": "65536"
    }
  ]
}
EOF
)

echo "Response: $RESPONSE"
TASK_STATE=$(echo "$RESPONSE" | jq -r '.state')
assert_equal "IN_PROGRESS" "$TASK_STATE"

echo "=== Step 2: Verify state persisted to database ==="
SQLITE_DB="/var/lib/openclaw/state.db"
PERSISTED_STATE=$(sqlite3 "$SQLITE_DB" "SELECT state_json FROM task_state WHERE task_id='${TASK_ID}'")
echo "Persisted: $PERSISTED_STATE"

echo "=== Step 3: Simulate agent restart ==="
systemctl restart openclaw-agent

echo "=== Step 4: Verify agent recovered task on startup ==="
sleep 2
RECOVERY_LOG=$(journalctl -u openclaw-agent --since "1 minute ago" | grep "recovered task" || true)
echo "Recovery log: $RECOVERY_LOG"

echo "=== Step 5: Verify task completed after recovery ==="
FINAL_STATE=$(curl -s "${OPENCLAW_ENDPOINT}/api/v1/tasks/${TASK_ID}" | jq -r '.state')
assert_equal "COMPLETED" "$FINAL_STATE"

echo "=== Step 6: Verify proactive notification sent ==="
NOTIFICATION_LOG=$(grep "notification sent" /var/log/openclaw/notifications.log | tail -1)
echo "Notification: $NOTIFICATION_LOG"

echo "=== All integration tests passed ==="

Manual Verification Checklist

Execute this checklist to verify the implementation in a live environment:


VERIFICATION CHECKLIST
═════════════════════

□ 1. State Store Initialization
   $ ls -la /var/lib/openclaw/state.db
   Expected: File exists with correct permissions (0600)

□ 2. Task State Checkpointing
   $ sqlite3 /var/lib/openclaw/state.db \
     "SELECT task_id, json_extract(state_json, '$.state') FROM task_state"
   Before reboot: Should show task in AWAITING_VERIFICATION state
   After reboot: Should show task in COMPLETED or FAILED state

□ 3. Agent Startup Recovery
   $ journalctl -u openclaw-agent -n 50 | grep -i "recovery\|pending"
   Expected: Log lines showing pending tasks being processed

□ 4. Verification Execution
   $ sqlite3 /var/lib/openclaw/state.db \
     "SELECT json_extract(state_json, '$.verification_results') FROM task_state"
   Expected: JSON array of verification check results

□ 5. Notification Dispatch
   $ tail -f /var/log/openclaw/notifications.log
   Expected: Outgoing webhook calls to configured notification endpoint

□ 6. End-to-End Latency
   Reboot cycle should complete verification within 30 seconds of restart

⚠️ Common Pitfalls

Implementation Pitfalls

  • Race Condition During Checkpoint Write
    Symptom: Task state saved incompletely, leaving corrupted or partial records in the state store.
    Mitigation: Use atomic write operations (write to temp file, fsync, then rename) or database transactions with WAL mode.
    # INCORRECT - susceptible to corruption
    async def save_task(task_id, state):
        with open(f"/tmp/{task_id}.json", "w") as f:
            json.dump(state, f)  # Crash here = corrupted state
    

    # CORRECT - atomic write
    async def save_task(task_id, state):
        final_path = f"/var/lib/openclaw/{task_id}.json"
        # Keep the temp file on the same filesystem so the rename is atomic
        temp_path = f"{final_path}.tmp"
        async with aiofiles.open(temp_path, mode="w") as f:
            await f.write(json.dumps(state))
            await f.flush()
            os.fsync(f.fileno())
        os.replace(temp_path, final_path)

  • State Store Lock Contention
    Symptom: Agent hangs or times out when accessing state store under high concurrency.
    Mitigation: Configure appropriate SQLite busy timeout and use connection pooling for production backends.
    # Configure SQLite for concurrent access
    conn.execute("PRAGMA busy_timeout = 5000")  # 5 second timeout
    conn.execute("PRAGMA journal_mode = WAL")  # Write-Ahead Logging
    
  • Verification Heisenbugs
    Symptom: Verification passes in test but fails intermittently in production due to timing or system state.
    Mitigation: Implement retry logic for transient verification failures and add jitter to avoid thundering herd.
    async def verify_with_retry(criteria, max_attempts=3, base_delay=1):
        for attempt in range(max_attempts):
            try:
                if await run_verification(criteria):
                    return True
            except TransientError:
                pass
            await asyncio.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
        return False
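
For the lock contention pitfall above, the two PRAGMA statements are easiest to apply in one place where connections are opened. The helper below is a sketch using only the standard `sqlite3` module; the name `open_state_db` and the decision to raise when WAL is unavailable are assumptions, not part of OpenClaw.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_state_db(db_path: Path, busy_timeout_ms: int = 5000) -> sqlite3.Connection:
    """Open the state database with WAL journaling and a busy timeout."""
    conn = sqlite3.connect(db_path)
    conn.execute(f"PRAGMA busy_timeout = {int(busy_timeout_ms)}")
    # journal_mode reports the mode actually in effect. WAL can be refused
    # (for example on some network filesystems), so check the answer.
    mode = conn.execute("PRAGMA journal_mode = WAL").fetchone()[0]
    if mode.lower() != "wal":
        conn.close()
        raise RuntimeError(f"WAL mode not available, got {mode!r}")
    return conn


conn = open_state_db(Path(tempfile.mkdtemp()) / "state.db")
timeout = conn.execute("PRAGMA busy_timeout").fetchone()[0]
journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
conn.close()
```

Note that `busy_timeout` is a per-connection setting, so it has to be applied every time a connection is opened, whereas WAL mode is stored in the database file and persists.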
    

Configuration Pitfalls

  • Missing Notification Channel Configuration
    Symptom: Proactive notifications silently fail because webhook URL is not set.
    Mitigation: Validate channel configuration at startup and fail fast if required channels are misconfigured.
    # Validate at startup
    if config.notifications.enabled:
        for channel in config.notifications.channels:
            if channel.type == "webhook" and not channel.url:
                raise ConfigurationError(
                    "Webhook URL required for proactive notifications"
                )
    
  • State Store Path Permissions
    Symptom: Agent cannot write state store, tasks lost after restart.
    Mitigation: Document required permissions and create directories with correct ownership during installation.
    # Installation script snippet
    mkdir -p /var/lib/openclaw
    chown openclaw:openclaw /var/lib/openclaw
    chmod 0700 /var/lib/openclaw
    

Environment-Specific Pitfalls

  • Docker Volume Persistence
    Symptom: State lost in Docker Compose environment when containers restart with default volume behavior.
    Fix: Explicitly mount state directory to persistent volume.
    # docker-compose.yml
    services:
      openclaw-agent:
        image: openclaw/agent:latest
        volumes:
          - openclaw-state:/var/lib/openclaw
          - /var/run:/var/run  # For systemd socket access if needed
    

    volumes:
      openclaw-state:
        driver: local

  • Kubernetes Pod Disruption
    Symptom: Task state lost during pod eviction or node drain.
    Fix: Use PersistentVolumeClaim for state store or external database backend (etcd, PostgreSQL).
    # kubernetes deployment with PVC
    spec:
      volumes:
        - name: openclaw-state
          persistentVolumeClaim:
            claimName: openclaw-state-pvc
      containers:
        - name: agent
          volumeMounts:
            - name: openclaw-state
              mountPath: /var/lib/openclaw
    
  • macOS Sandbox Restrictions
    Symptom: State file write operations fail due to Application Sandbox entitlement restrictions.
    Fix: Request explicit file access entitlements or use user defaults for state storage.
    
    # If using macOS app bundle, add to entitlements:
    com.apple.security.files.user-selected.read-write = true
    com.apple.security.files.bookmarks.app-scope = true
    

Operational Pitfalls

  • State Store Growth (Unbounded)
    Symptom: State database grows indefinitely, consuming disk space.
    Fix: Implement TTL-based cleanup and archival policy.
    -- Cleanup job - run daily
    DELETE FROM task_state
    WHERE json_extract(state_json, '$.state') IN ('COMPLETED', 'FAILED')
    AND updated_at < datetime('now', '-7 days');

    -- Vacuum to reclaim space
    VACUUM;

  • Stuck Tasks (No Timeout)
    Symptom: Tasks stuck in AWAITING_VERIFICATION indefinitely on systems that never reboot.
    Fix: Implement maximum wait time and automatic resolution or escalation.
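    from datetime import datetime, timezone

    MAX_AWAIT_SECONDS = 3600  # 1 hour

    async def check_stuck_tasks():
        now = datetime.now(timezone.utc)
        for task in await store.load_pending_tasks():
            # checkpoint_at is stored as a timezone-aware ISO-8601 string
            checkpointed = datetime.fromisoformat(task["checkpoint_at"])
            if (now - checkpointed).total_seconds() > MAX_AWAIT_SECONDS:
                await escalate_task(task)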

Logically Connected Error Patterns

  • E_OPENCLAW_TASK_NOT_FOUND
    Description: Agent cannot locate task state in persistence store during recovery. Indicates checkpoint failure or manual database manipulation.
    Related: State store corruption, disk full during checkpoint write.
  • E_OPENCLAW_VERIFICATION_TIMEOUT
    Description: Verification criteria check exceeded configured timeout. Common when verifying network-dependent configurations.
    Related: Network interruption, firewall blocking required ports, service not yet started.
  • E_OPENCLAW_STATE_STORE_LOCKED
    Description: Concurrent access to state store results in SQLITE_BUSY errors. Requires busy timeout configuration or connection pooling.
    Related: High concurrent task submission, SQLite misconfiguration.
  • E_OPENCLAW_RECOVERY_FAILED
    Description: Agent startup recovery process encountered unrecoverable error. Requires manual intervention.
    Related: Schema migration failure, incompatible state format, corrupted verification criteria.
  • E_OPENCLAW_NOTIFICATION_DELIVERY_FAILED
    Description: Proactive notification dispatch failed. Task completed but user not informed.
    Related: Webhook endpoint unreachable, invalid credentials, rate limiting.
  • E_OPENCLAW_CHECKPOINT_INCOMPLETE
    Description: Partial state written before crash. Detected during recovery validation.
    Related: System crash during checkpoint, insufficient fsync, disk I/O errors.

| Issue/PR | Title | Relationship |
| --- | --- | --- |
| #142 | Support for long-running tasks with progress reporting | Parent feature request |
| #187 | Add checkpoint/resume capability to task executor | Direct implementation of this guide |
| #203 | Proactive notifications via webhook | Notification component |
| #156 | SQLite backend for state persistence | Persistence backend |
| #198 | etcd/KV store support for distributed agents | Alternative persistence |
| #215 | Kubernetes operator for OpenClaw agent lifecycle | K8s integration concern |
| #178 | Task deduplication across agent restarts | Related recovery concern |

External Dependencies

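  • SQLite with JSON functions: The state queries rely on json_extract, which is built in by default from SQLite 3.38.0 and available in earlier releases only when the JSON1 extension is compiled in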
  • aiofiles: Async file I/O for non-blocking state operations
  • httpx: Async HTTP client for webhook notifications
  • systemd: For service restart handling and watchdog integration

Evidence & Sources

This troubleshooting guide was automatically synthesized by the FixClaw Intelligence Pipeline from community discussions.