Delivery Reliability — Silent Message Loss & Duplicate Delivery
Resolves four P0 critical bugs causing silent message loss, unrecoverable delivery failures, and duplicate message delivery during crashes, aborts, and service restarts.
🔍 Symptoms
Issue #29125 — Silent Message Loss on Gateway Crash
A gateway crash (process termination, SIGKILL, OOM kill) results in the most recent user message vanishing from history without error indication.
$ openclaw status
Service: gateway
Status: RUNNING
Uptime: 4h 23m
Messages processed: 12,847
Messages failed: 0
$ openclaw history --user alice --limit 5
[2024-01-15T14:32:01Z] alice: "Meeting at 3pm confirmed"
[2024-01-15T14:31:58Z] alice: "Wait, which room?"
[2024-01-15T14:31:55Z] alice: "What's the room number?"
[2024-01-15T14:31:50Z] alice: "Where is the meeting?"
# The gateway crashed between 14:31:55 and 14:32:01
# "Meeting at 3pm confirmed" was received but never persistedIssue #29126 โ Silent Delivery Failures in Plugins/Channels
Plugin or channel delivery failures return success internally while silently failing to reach the destination. No error propagates to the user or operator.
$ openclaw plugin list --channel telegram
PLUGIN STATUS DELIVERY LAST CHECK
telegram ACTIVE UNKNOWN 2024-01-15T14:30:00Z
$ openclaw events --plugin telegram --since 1h
TIMESTAMP EVENT DETAILS
2024-01-15T14:29:55Z message.sent msg_id=a1b2c3
2024-01-15T14:30:00Z plugin.error plugin=telegram (NO LOG OUTPUT)
# The telegram bot was kicked from the channel
# Error occurred but was swallowed, message marked as delivered
Issue #29127 — Abort Triggers Re-Delivery of Partial Reply
Calling abort() on a handler does not stop the recovery path from re-delivering a reply that was already partially flushed.
# User sends message triggering long response
$ openclaw history --msg-id msg_abc123
msg_id: msg_abc123
user: alice
content: "Generate a 5000-word report"
status: delivered
delivered_at: 2024-01-15T14:35:00Z
# Handler starts processing, sends partial response "Generating report..."
# User aborts the request
$ openclaw abort msg_abc123
abort: OK
# After recovery timeout, partial message is re-delivered
$ openclaw history --msg-id msg_abc123
msg_id: msg_abc123
status: delivered
replies: ["Generating report...", "Generating report...", "Generating report..."]
# ^--- duplicated partial reply
Issue #29128 — Replay of Already-Delivered Messages After Restart
After a clean restart, the delivery-recovery system replays messages that were already successfully delivered, causing duplicates.
$ openclaw restart --service gateway
[INFO] Starting delivery recovery...
[INFO] Replaying 47 unacknowledged messages
[INFO] Delivered: msg_001
[INFO] Delivered: msg_002
...
[INFO] Delivered: msg_047
$ openclaw history --user alice --since 1h
[14:30:00] msg_001: "Hello" (DUPLICATE - already delivered before restart)
[14:29:55] msg_002: "Are you there?" (DUPLICATE - already delivered before restart)
[14:29:50] msg_003: "Hi bot" (DUPLICATE - already delivered before restart)
# 47 messages all duplicated
🔧 Root Cause
Architecture Overview
OpenClaw employs a delivery queue architecture for reliability. Messages flow through this pipeline:
[User Input] → [Gateway] → [Handler Queue] → [Plugin/Channel] → [External Service]
                                ↓
                  [Delivery Queue] → [Persistence Layer]
                                ↓
                  [Acknowledgement Tracker]
Root Cause Analysis by Issue
Issue #29125 — Gateway Crash Data Loss
Failure Sequence:
- Message arrives at the gateway and is held in an in-memory buffer (gateway/buffer.ts)
- Message is forwarded to the handler, but the acknowledgement is sent before persistence
- On crash, the persistence write is lost because it never completed
// gateway/handler.ts (BUGGY CODE PATH)
async function handleMessage(msg: Message): Promise<void> {
  // Step 1: Forward to handler
  await dispatchToHandler(msg);
  // Step 2: ACK immediately (BEFORE persistence)
  await sendAck(msg.id); // ⚠️ Premature acknowledgement
  // Step 3: Async persist (never completes on crash)
  persistMessage(msg).catch(console.error); // ⚠️ Fire-and-forget
}
The race condition occurs because the acknowledgement is sent at step 2, while persistence happens asynchronously afterwards. A crash between steps 2 and 3 results in data loss.
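The ordering bug can be reproduced in miniature. The following is a hedged sketch with in-memory stand-ins (not the real gateway code): a crash after a premature ACK loses the message, while persist-before-ack cannot.

```typescript
// Minimal model of the race: `store` stands in for the persistence layer and
// `acked` for ACKs the sender has observed. All names here are illustrative.
type Message = { id: string };

const store = new Set<string>();
const acked: string[] = [];

// Buggy order: ACK first, persist later. `crash` models the process dying
// between step 2 (ACK) and step 3 (persist).
function buggyHandle(msg: Message, crash: boolean): void {
  acked.push(msg.id);        // premature ACK
  if (crash) return;         // fire-and-forget write never runs
  store.add(msg.id);
}

// Fixed order: persist first, ACK after. A crash can at worst lose the ACK
// (the sender retries), never the persisted message.
function fixedHandle(msg: Message): void {
  store.add(msg.id);
  acked.push(msg.id);
}

buggyHandle({ id: 'm1' }, true);  // sender saw an ACK, but 'm1' is gone
fixedHandle({ id: 'm2' });        // 'm2' is on disk before the ACK goes out
```

The fix in the next section applies exactly this reordering to `handleMessage`.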
Issue #29126 — Silent Delivery Failures
Failure Sequence:
- Plugin delivers message to external service (e.g., Telegram API)
- External service returns an error (e.g., "bot was kicked")
- Error is caught but not propagated — only logged at DEBUG level
- Delivery marked as successful in internal state
// plugins/telegram/delivery.ts (BUGGY CODE PATH)
async function deliver(payload: Payload): Promise<DeliveryResult> {
  try {
    const response = await telegramAPI.sendMessage(payload);
    return { success: true, messageId: response.message_id };
  } catch (error) {
    // Error swallowed — only a debug log
    logger.debug('Telegram delivery issue', { error }); // ⚠️ Silent failure
    return { success: true }; // ⚠️ False success
  }
}
The caller interprets a successful return as confirmation of delivery and never retries or alerts.
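Once the plugin stops faking success, the caller can branch on the failure. Here is a sketch of the caller side; the `DeliveryResult` shape and the `notifyOperator`/`retryQueue` names are assumptions for illustration, not the actual OpenClaw API.

```typescript
// Hypothetical caller-side handling of an honest delivery result.
interface DeliveryResult {
  success: boolean;
  error?: string;
  retryable?: boolean;
}

const retryQueue: string[] = [];
const operatorAlerts: string[] = [];

function notifyOperator(alert: string): void {
  operatorAlerts.push(alert);
}

function onDeliveryResult(msgId: string, result: DeliveryResult): void {
  if (result.success) return;                           // delivered: nothing to do
  if (result.retryable) {
    retryQueue.push(msgId);                             // transient: schedule a retry
  } else {
    notifyOperator(`${msgId}: ${result.error}`);        // permanent: surface it
  }
}

onDeliveryResult('m1', { success: false, error: 'bot was kicked', retryable: false });
onDeliveryResult('m2', { success: false, error: '429 Too Many Requests', retryable: true });
```

With the buggy `{ success: true }` return, neither branch could ever fire, which is why the failure stayed invisible.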
Issue #29127 — Abort Re-Delivers Partial Reply
Failure Sequence:
- Handler begins processing and sends partial reply via streaming
- User calls abort(); the handler receives a cancellation signal
- Abort handler sets delivery_state = 'aborted'
- Recovery system sees the message as undelivered (no ACK received)
- Recovery timer fires and re-delivers the partial reply
// core/delivery-queue.ts (BUGGY CODE PATH)
class DeliveryQueue {
  async abort(messageId: string): Promise<void> {
    // Set abort flag
    this.state.set(messageId, { status: 'aborted' });
    // ⚠️ BUG: Does NOT update the recovery index.
    // Recovery still thinks the message needs delivery.

    // Cancel in-flight handler
    await this.cancelHandler(messageId);
  }

  // Recovery timer checks this state
  getPendingMessages(): string[] {
    return [...this.state.entries()]
      .filter(([, entry]) => entry.status !== 'delivered') // ⚠️ 'aborted' passes the filter
      .map(([messageId]) => messageId);
  }
}
The recovery system uses a simple status !== 'delivered' filter, which treats 'aborted' messages as pending.
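The filter bug is easy to demonstrate in isolation with a toy state map (not the real DeliveryQueue):

```typescript
// Toy reproduction of the recovery filter: anything whose status is not
// 'delivered' is treated as pending, so 'aborted' messages get replayed.
type Status = 'pending' | 'delivered' | 'aborted';

const state = new Map<string, Status>([
  ['m1', 'delivered'],
  ['m2', 'aborted'],   // should NOT be re-delivered
  ['m3', 'pending'],   // genuinely needs delivery
]);

// Buggy filter: 'aborted' slips through.
const buggyPending = [...state.entries()]
  .filter(([, status]) => status !== 'delivered')
  .map(([id]) => id);

// Fixed filter: only genuinely pending work qualifies.
const fixedPending = [...state.entries()]
  .filter(([, status]) => status === 'pending')
  .map(([id]) => id);
```

The production fix goes further and replaces the status filter with an explicit recovery index, so membership in the index, not a status comparison, decides what gets replayed.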
Issue #29128 — Replay After Restart
Failure Sequence:
- Messages are delivered and acknowledged in memory
- Clean shutdown is initiated
- Shutdown handler clears persistence state (optimization to avoid replay)
- On restart, persistence layer reports no unacknowledged messages
- Recovery system replays all messages from last known good state
// core/graceful-shutdown.ts (BUGGY CODE PATH)
async function shutdown(): Promise<void> {
  // Stop accepting new messages
  gateway.stop();
  // Wait for in-flight deliveries
  await deliveryQueue.drain();
  // ⚠️ BUG: Clear acknowledged state before persist.
  // This is an "optimization" to reduce restart time.
  acknowledgedMessages.clear(); // ⚠️ Data loss
  // Persist remaining unacknowledged only
  await persistence.flush();
}
The "optimization" inadvertently clears delivered messages, so the recovery system believes they were never delivered.
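The effect of clearing the acknowledged set can be modeled in a few lines (in-memory stand-ins; the real recovery path reads persisted state from disk):

```typescript
// Recovery replays any delivered message whose ACK is missing from the
// persisted set. Clearing that set before flush therefore replays everything.
const delivered = ['m1', 'm2', 'm3'];
let persistedAcks = new Set<string>();

function buggyShutdown(): void {
  persistedAcks = new Set();             // "optimization": ACKs discarded
}

function fixedShutdown(): void {
  persistedAcks = new Set(delivered);    // full delivery state preserved
}

function replayOnRestart(): string[] {
  return delivered.filter((id) => !persistedAcks.has(id));
}

buggyShutdown();
const buggyReplay = replayOnRestart();   // every delivered message replayed
fixedShutdown();
const fixedReplay = replayOnRestart();   // nothing replayed
```

This is exactly the 47-message replay seen in the symptom transcript: the recovery system is behaving correctly given the state it was handed; the state itself was truncated at shutdown.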
🛠️ Step-by-Step Fix
Fix #29125 — Gateway Crash Persistence
Before:
// gateway/handler.ts
async function handleMessage(msg: Message): Promise<void> {
  await dispatchToHandler(msg);
  await sendAck(msg.id); // Premature ACK
  persistMessage(msg).catch(console.error); // Async, unreliable
}
After:
// gateway/handler.ts
async function handleMessage(msg: Message): Promise<void> {
  // Step 1: Persist BEFORE acknowledgement
  await persistMessage(msg);
  // Step 2: Forward to handler
  await dispatchToHandler(msg);
  // Step 3: ACK only after persistence is confirmed
  await sendAck(msg.id);
}
Multi-Stage CLI Fix:
# Apply the persistence-first patch
$ openclaw patch apply --issue 29125 --component gateway
# Verify the patch
$ openclaw patch verify --issue 29125
[✓] Patched: gateway/handler.ts:persist-before-ack
[✓] Config: delivery.persist_before_ack=true
# Restart gateway to activate
$ openclaw restart --service gateway --mode=rolling
Fix #29126 — Silent Delivery Failures
Before:
// plugins/telegram/delivery.ts
async function deliver(payload: Payload): Promise<DeliveryResult> {
  try {
    const response = await telegramAPI.sendMessage(payload);
    return { success: true, messageId: response.message_id };
  } catch (error) {
    logger.debug('Telegram delivery issue', { error });
    return { success: true }; // False success
  }
}
After:
// plugins/telegram/delivery.ts
async function deliver(payload: Payload): Promise<DeliveryResult> {
  try {
    const response = await telegramAPI.sendMessage(payload);
    return { success: true, messageId: response.message_id };
  } catch (error) {
    // Classify error severity
    const isRetryable = isRetryableError(error);
    // Log at the appropriate level
    if (isRetryable) {
      logger.warn('Telegram delivery failed (retryable)', { error, payload });
    } else {
      logger.error('Telegram delivery failed (permanent)', { error, payload });
    }
    // Return the actual failure status
    return {
      success: false,
      error: error.message,
      retryable: isRetryable
    };
  }
}
// Helper to classify Telegram errors
function isRetryableError(error: TelegramError): boolean {
  const RETRYABLE_CODES = [429, 500, 502, 503, 504];
  const NON_RETRYABLE_CODES = [400, 401, 403, 404]; // 403 covers "bot was kicked"
  if (RETRYABLE_CODES.includes(error.code)) return true;
  if (error.message.includes('bot was blocked')) return false;
  if (error.message.includes('chat not found')) return false;
  if (NON_RETRYABLE_CODES.includes(error.code)) return false;
  return true; // Default to retryable
}
Configuration Update:
# Update openclaw.yaml
$ openclaw config set delivery.strict_failure_mode true
$ openclaw config set delivery.failure_notification_threshold 3
# Verify delivery monitoring
$ openclaw plugin config telegram --get failure_modes
{
  "strict_failure_mode": true,
  "notify_on_failure": true,
  "failure_threshold": 3
}
Fix #29127 — Abort Re-Delivery Prevention
Before:
// core/delivery-queue.ts
class DeliveryQueue {
  async abort(messageId: string): Promise<void> {
    this.state.set(messageId, { status: 'aborted' });
    await this.cancelHandler(messageId);
    // ⚠️ Missing: recovery index update
  }
}
After:
// core/delivery-queue.ts
class DeliveryQueue {
  async abort(messageId: string): Promise<void> {
    const state = this.state.get(messageId);
    // Check whether a partial reply was already sent
    if (state?.partialReplySent) {
      // Mark as delivered to prevent recovery re-delivery
      await this.markDelivered(messageId);
      // Emit abort event for handler cleanup
      await this.emitAbortEvent(messageId, {
        reason: 'user_abort',
        partialDelivered: true
      });
    } else {
      // No partial reply — safe to mark as aborted
      this.state.set(messageId, {
        status: 'aborted',
        abortedAt: Date.now()
      });
      // Update the recovery index to exclude this message
      this.recoveryIndex.remove(messageId);
      await this.cancelHandler(messageId);
    }
  }

  // The recovery system now consults the recovery index, not the status field
  getPendingMessages(): string[] {
    return this.recoveryIndex.getAll();
  }
}
CLI Fix:
# Apply abort handling patch
$ openclaw patch apply --issue 29127 --component delivery-queue
# Update recovery configuration
$ openclaw config set recovery.use_explicit_index true
$ openclaw config set recovery.abort_behavior preserve
# Clear existing corrupted state
$ openclaw recovery reset-state --force
# Verify fix
$ openclaw recovery status
Recovery Index: 247 messages tracked
Aborted Messages: 12 (properly excluded)
Fix #29128 — Replay Prevention After Restart
Before:
// core/graceful-shutdown.ts
async function shutdown(): Promise<void> {
  gateway.stop();
  await deliveryQueue.drain();
  // ⚠️ Clear acknowledged to speed up restart
  acknowledgedMessages.clear();
  await persistence.flush();
}
After:
// core/graceful-shutdown.ts
async function shutdown(): Promise<void> {
  gateway.stop();
  // Wait for all deliveries to complete AND persist
  await deliveryQueue.drain({
    requirePersisted: true // Ensure all ACKs are persisted
  });
  // ⚠️ DO NOT clear acknowledged messages.
  // Preserve the full delivery state for accurate recovery.
  // acknowledgedMessages.clear(); // REMOVED
  // Ensure persistence includes all acknowledged messages
  await persistence.flush({
    includeAcknowledged: true // New: persist full state
  });
}
Recovery Index Fix:
// core/persistence.ts
async function persistFullState(): Promise<void> {
  const state = {
    version: 2,
    timestamp: Date.now(),
    acknowledged: Array.from(acknowledgedMessages.entries()),
    pending: Array.from(pendingMessages.entries()),
    aborted: Array.from(abortedMessages.entries())
  };
  // Atomic write to prevent corruption
  await atomicWrite(STORAGE_PATH, JSON.stringify(state));
}
CLI Fix:
# Apply shutdown persistence patch
$ openclaw patch apply --issue 29128 --component graceful-shutdown
# Migrate existing state to new format
$ openclaw maintenance migrate-state --format=v2
# Verify state integrity
$ openclaw state verify
State Version: 2
Acknowledged Messages: 1,247
Pending Messages: 0
Aborted Messages: 12
State Hash: a1b2c3d4e5f6...
$ openclaw restart --service gateway
[INFO] Starting delivery recovery...
[INFO] Restored state from disk (v2 format)
[INFO] Replaying 0 messages (all already delivered)
🧪 Verification
Test #29125 — Crash Persistence
# 1. Start a message-heavy session
$ openclaw load-test --users 10 --duration 30s --rate 5
# 2. Simulate crash during active delivery
$ openclaw inject-fault --type=crash --service=gateway --delay=5s
# 3. Verify no message loss after restart
$ openclaw verify --check=message-integrity
[✓] Message count: 150 sent, 150 persisted
[✓] Sequence integrity: No gaps detected
[✓] Last message verified: msg_150
Expected Output:
Test: Gateway Crash Persistence
Result: PASS
Messages Before Crash: 150
Messages After Recovery: 150
Lost Messages: 0
Persistence Rate: 100%
Test #29126 — Delivery Failure Propagation
# 1. Trigger a permanent failure (bot kicked)
$ openclaw mock telegram --error="bot was kicked" --channel=test_channel
# 2. Send message that should fail
$ openclaw send --user alice --message "test" --channel telegram
# 3. Verify failure is reported
$ openclaw events --type=delivery_failure --since 1m
TIMESTAMP LEVEL EVENT DETAILS
2024-01-15T14:30:00Z WARN delivery.failed plugin=telegram
error="bot was kicked"
retryable=false
message=msg_test_001
Expected Output:
Test: Delivery Failure Propagation
Result: PASS
Failure Detected: YES
Error Logged: YES (WARN level)
User Notified: YES
Retryable: NO
Message Status: failed_permanent
Test #29127 — Abort Re-Delivery Prevention
# 1. Start long-running handler
$ openclaw send --user alice --message "Generate 10000 words"
# 2. Send abort while processing
$ sleep 2 && openclaw abort --msg-id= --reason=timeout
# 3. Wait for recovery timeout
$ sleep 60
# 4. Check for duplicate messages
$ openclaw history --user alice --limit 5
[✓] No duplicate messages detected
[✓] Aborted message not re-delivered
Expected Output:
Test: Abort Re-Delivery Prevention
Result: PASS
Partial Reply Sent: YES
Abort Processed: YES
Re-delivery Attempted: NO
Duplicate Messages: 0
Recovery Index: Correctly excludes aborted message
Test #29128 — Replay Prevention
# 1. Send and deliver several messages
$ for i in {1..50}; do openclaw send --user alice --message "Msg $i"; done
# 2. Verify all delivered
$ openclaw verify --delivered --user alice
Delivered Count: 50
# 3. Restart service
$ openclaw restart --service gateway
# 4. Check for duplicates
$ openclaw history --user alice --since 1m | grep -c "Msg"
50
Expected Output:
Test: Restart Replay Prevention
Result: PASS
Messages Before Restart: 50
Messages After Restart: 50
Duplicate Count: 0
State Restored: YES (v2 format)
Recovery Replay: 0 messages
Full Integration Test
# Run complete delivery reliability suite
$ openclaw test suite --name=delivery-reliability
Tests:
[✓] #29125 - Gateway crash persistence
[✓] #29126 - Delivery failure propagation
[✓] #29127 - Abort re-delivery prevention
[✓] #29128 - Restart replay prevention
[✓] Concurrent delivery stress test
[✓] Network partition recovery
[✓] Partial failure cascade
Result: 7/7 PASSED
Coverage: 100%
⚠️ Common Pitfalls
Environment-Specific Traps
Docker/Kubernetes Deployments
- Signal Handling: Docker stop sends SIGTERM, but containers may be killed with SIGKILL after a 10s timeout. Ensure grace_period_seconds exceeds drain_timeout.
  # docker-compose.yml
  services:
    gateway:
      stop_grace_period: 30s  # Must exceed drain_timeout
      command: openclaw gateway --drain-timeout=25s
- Volume Permissions: Persistence state may be unreadable if the volume is mounted with a different UID.
  # Verify permissions
  $ docker exec openclaw-gateway ls -la /data/state.json
  -rw-r--r-- 1 openclaw openclaw 4096 Jan 15 14:30 /data/state.json
- Memory Pressure: The OOM killer may target the gateway before persistence completes. Set memory limits with headroom.
  resources:
    limits:
      memory: 512Mi  # Must exceed expected peak + state size
    reservations:
      memory: 256Mi
macOS Development Environment
- File Locking: macOS APFS may not support atomic renames correctly. Use explicit fsync.
  # Check if atomic writes work
  $ openclaw debug verify-atomic-write
  [✓] Atomic write verified on /tmp (APFS supports it)
  [✓] Atomic write verified on /var/tmp (APFS supports it)
  [!] Warning: /Users/... uses non-atomic filesystem
- Resource Limits: Default ulimits are restrictive. Increase them for high-throughput testing.
  # Check current limits
  $ ulimit -n
  256  # Too low for production
  # Increase for session
  $ ulimit -n 10240
Windows (WSL2)
- File Watchers: WSL2 file watchers have known performance issues. Raise the fs.inotify limits.
  # In /etc/sysctl.conf
  fs.inotify.max_user_watches=524288
  fs.inotify.max_user_instances=512
- Line Endings: CRLF in config files may corrupt state JSON. Normalize on mount.
  # Mount with consistent line endings
  mount --bind -o ro /mnt/c/config/openclaw.yaml /data/config.yaml
Configuration Mistakes
Premature Acknowledgement Still Enabled
# ⚠️ WRONG: Still using old behavior
$ openclaw config get delivery.persist_before_ack
false # Bug not fixed
# ✅ CORRECT: Should be true
$ openclaw config set delivery.persist_before_ack true
$ openclaw restart
Recovery Index Not Migrated
# ⚠️ WRONG: Old index format still in use
$ openclaw recovery status
Index Format: legacy # Bug not fixed
# ✅ CORRECT: Migrate to the new format
$ openclaw maintenance migrate-state --format=v2
$ openclaw restart
Inconsistent Failure Mode Configuration
# ⚠️ WRONG: Different failure modes across plugins
$ openclaw config get --plugin '*' delivery.strict_failure_mode
telegram: false
slack: true # Inconsistent!
discord: false
# ✅ CORRECT: Uniform configuration
$ openclaw config set --plugin '*' delivery.strict_failure_mode true
Runtime Edge Cases
- Network Partition During Persistence: If network fails mid-write, state file may be corrupted. Enable atomic write with backup.
  # Enable backup on corruption
  $ openclaw config set persistence.backup_on_corruption true
  $ openclaw config set persistence.backup_count 3
- Clock Skew: Timestamps used for recovery ordering may conflict after an NTP correction. Use logical clocks for ordering.
  # Check for clock skew
  $ openclaw debug clock-skew
  Clock Offset: +0.003s (acceptable)
  Warning: 2 messages have timestamp conflicts
- Partial State Migration: If migration is interrupted, state may be left in a mixed format. Verify after migration.
  # Force verification
  $ openclaw state verify --full
  [✓] Format: v2
  [✓] Integrity: VALID
  [✓] Entries: 1,247 acknowledged, 0 pending, 12 aborted
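The atomic-write-with-backup pattern referenced above can be sketched with Node's fs primitives. The paths and the single-backup policy here are illustrative; the real persistence layer may differ.

```typescript
import {
  writeFileSync, openSync, fsyncSync, closeSync,
  renameSync, copyFileSync, existsSync, readFileSync,
} from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Write the payload to a temp file, fsync it, keep one backup of the previous
// state, then rename into place. Rename is atomic on POSIX filesystems, so a
// crash leaves either the old file or the new one, never a torn write.
function atomicWrite(path: string, data: string): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, data);
  const fd = openSync(tmp, 'r+');
  fsyncSync(fd);                                            // force bytes to disk
  closeSync(fd);
  if (existsSync(path)) copyFileSync(path, `${path}.bak`);  // backup_on_corruption fallback
  renameSync(tmp, path);                                    // atomic replace
}

const statePath = join(tmpdir(), 'openclaw-state-demo.json');
atomicWrite(statePath, JSON.stringify({ version: 2, acknowledged: [] }));
atomicWrite(statePath, JSON.stringify({ version: 2, acknowledged: ['m1'] }));
```

If the latest write is corrupted, the `.bak` sibling still holds the previous consistent state, which is what the backup_count setting rotates.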
🔗 Related Errors
Directly Related Issues
| Issue | Title | Severity | Relationship |
|---|---|---|---|
| #29125 | Gateway crash silently drops user message from history | P0 | Primary issue — addressed in this guide |
| #29126 | Plugin/channel delivery failures are silent and unrecoverable | P0 | Primary issue — addressed in this guide |
| #29127 | Abort does not prevent recovery-path re-delivery of partial reply | P0 | Primary issue — addressed in this guide |
| #29128 | Delivery-recovery replays already-delivered messages after restart | P0 | Primary issue — addressed in this guide |
| #29085 | fix(delivery-queue): Telegram 'bot was kicked' | P2 | Partial fix — precursor to #29126 |
Historical Context
- #28456 — Duplicate messages on network timeout: similar to #29127, but network-induced rather than abort-induced.
- #27901 — State file corruption on unclean shutdown: root-cause overlap with #29128 (persistence-layer design flaw).
- #27512 — Handler ACK timeout too aggressive: contributed to #29125 (premature ACK allowed by timeout configuration).
- #27189 — Plugin errors not propagated to parent: architectural precursor to #29126 (error-handling isolation in the plugin system).
Related Error Codes
| Error Code | Description | Connected Issues |
|---|---|---|
| DLV_001 | Persistence write failed | #29125 |
| DLV_002 | Premature acknowledgement | #29125 |
| DLV_003 | Silent delivery failure | #29126 |
| DLV_004 | Recovery re-delivered | #29127, #29128 |
| DLV_005 | State mismatch on startup | #29128 |
| PLG_001 | Plugin error swallowed | #29126 |
| SHT_001 | Graceful shutdown data loss | #29128 |
Related Configuration Parameters
# Parameters introduced/fixed by this guide
delivery.persist_before_ack # Default: true (was: false)
delivery.strict_failure_mode # Default: true (was: false)
delivery.failure_notification_threshold # Default: 3 (new)
recovery.use_explicit_index # Default: true (was: false)
recovery.abort_behavior # Default: preserve (was: re-deliver)
persistence.include_acknowledged # Default: true (was: false)
persistence.backup_on_corruption # Default: true (new)
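A quick way to check a deployment against the defaults above is to diff the loaded config against the expected values. Config loading itself is elided here, and the flat dotted-key map is an assumption about shape, not the actual OpenClaw config API.

```typescript
// Expected post-fix values, keyed by the dotted parameter names listed above.
const EXPECTED: Record<string, unknown> = {
  'delivery.persist_before_ack': true,
  'delivery.strict_failure_mode': true,
  'delivery.failure_notification_threshold': 3,
  'recovery.use_explicit_index': true,
  'recovery.abort_behavior': 'preserve',
  'persistence.include_acknowledged': true,
  'persistence.backup_on_corruption': true,
};

// Return every parameter whose value differs from the fixed default
// (missing keys count as misconfigured).
function misconfiguredKeys(cfg: Record<string, unknown>): string[] {
  return Object.keys(EXPECTED).filter((k) => cfg[k] !== EXPECTED[k]);
}

// A config still running the pre-fix behavior trips on all seven parameters.
const drift = misconfiguredKeys({ 'delivery.persist_before_ack': false });
```

Running such a check at startup (or in CI) catches the "premature acknowledgement still enabled" pitfall before it costs messages.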