Telegram Channel Silent Reply Loss After Polling Stall
OpenClaw's Telegram channel can silently drop assistant replies when polling encounters network stalls, leaving users with the appearance of a non-responsive assistant despite active gateway operation.
Symptoms
Primary Symptom: Silent Message Loss
The Telegram channel enters a degraded operational state where inbound messages continue to be accepted, but outbound sendMessage requests fail without actionable operator notification. Users experience complete assistant non-responsiveness despite the gateway reporting healthy websocket connectivity.
Log Manifestations
08:40:33 ERROR [telegram] Polling stall detected
no completed getUpdates for 124.98s; forcing restart
08:40:48 ERROR [telegram] Polling runner stop timed out after 15s
08:40:48 ERROR Telegram polling runner stopped; restarting in 7.22s
08:40–08:42 ERROR telegram sendChatAction failed:
Network request for 'sendChatAction' failed!
08:42:40 ERROR telegram sendMessage failed:
Network request for 'sendMessage' failed!
08:42:40 ERROR telegram final reply failed:
HttpError: Network request for 'sendMessage' failed!
08:42:41 ERROR telegram message processing failed:
HttpError: Network request for 'sendMessage' failed!
08:45:01 INFO telegram sendMessage ok
08:45:17 ERROR telegram sendMessage failed:
Network request for 'sendMessage' failed!
08:48:33 WARN liveness warning:
active=agent:main:telegram:direct:
queued=agent:main:telegram:direct:
phase=channels.telegram.start-account
08:48:48 INFO telegram sendMessage ok
08:48:50 INFO telegram sendMessage ok
Behavioral Indicators
- Intermittent recovery: Some `sendMessage` calls succeed (08:45:01, 08:48:48, 08:48:50) while others fail, indicating partial transport degradation rather than complete outage.
- Polling runner restart timing: Stalls correlate with `sendMessage` failures, suggesting shared transport state corruption.
- Session queue persistence: The liveness warning shows sessions remain `active` and `queued` despite delivery failures, indicating that the queue does not account for the degraded channel state.
- sendChatAction as failure proxy: `sendChatAction` failures precede and correlate with `sendMessage` failures, suggesting transport health degradation.
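The intermittent pattern described in these indicators can be checked mechanically. The sketch below is a hypothetical helper (not an OpenClaw tool); it assumes the `HH:MM:SS LEVEL message` line shape seen in the excerpt above and tallies `sendMessage` outcomes after the first polling stall:

```javascript
// Matches lines such as "08:45:01 INFO telegram sendMessage ok".
const LINE = /^(\d{2}:\d{2}:\d{2}) (\w+) (.*)$/;

// Counts sendMessage successes and failures logged after the first stall.
function summarizeDelivery(lines) {
  let stallAt = null;
  let ok = 0;
  let failed = 0;
  for (const raw of lines) {
    const match = LINE.exec(raw.trim());
    if (!match) continue; // continuation lines carry no timestamp
    const [, time, , message] = match;
    if (stallAt === null) {
      if (message.includes('Polling stall detected')) stallAt = time;
      continue; // ignore everything up to and including the stall line
    }
    if (message.includes('sendMessage ok')) ok++;
    else if (message.includes('sendMessage failed')) failed++;
  }
  return { stallAt, ok, failed, intermittent: ok > 0 && failed > 0 };
}

const sample = [
  '08:40:33 ERROR [telegram] Polling stall detected',
  'no completed getUpdates for 124.98s; forcing restart',
  '08:42:40 ERROR telegram sendMessage failed:',
  "Network request for 'sendMessage' failed!",
  '08:45:01 INFO telegram sendMessage ok',
  '08:45:17 ERROR telegram sendMessage failed:',
  '08:48:48 INFO telegram sendMessage ok',
  '08:48:50 INFO telegram sendMessage ok',
];
console.log(summarizeDelivery(sample));
```

For the excerpt, this reports two failures and three successes after the stall, which matches the partial-degradation reading rather than a full outage.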
User-Facing Impact
User sends Telegram message → Gateway receives → Assistant processes →
→ sendMessage fails → No retry → No operator alert → Silent loss
The failure sequence produces identical UX to assistant non-responsiveness, making root-cause diagnosis impossible without log analysis.
Root Cause
Architectural Analysis
The issue stems from the intersection of four architectural weaknesses in OpenClaw’s Telegram channel implementation:
1. Shared Transport State Corruption
The Telegram channel uses a shared HTTP client/transport layer for both polling (getUpdates) and outbound messaging (sendMessage, sendChatAction). When getUpdates stalls and triggers a polling runner restart, the transport state may be corrupted or left in an inconsistent condition.
TelegramChannel
├── PollingRunner
│   ├── getUpdates() ← Stalls, triggers restart
│   └── HTTP Client (shared state)
└── MessageSender
    ├── sendMessage() ← Fails due to corrupted transport
    └── sendChatAction() ← Fails due to corrupted transport
The restart sequence at 08:40:48 (Polling runner stop timed out after 15s) indicates the shutdown did not cleanly release resources before the restart initiated, leaving the transport in a degraded state for outbound operations.
2. Non-Idempotent Failure Handling
When sendMessage fails, the error is logged but the reply is not preserved for retry or later inspection:
// Simplified failure path
async function handleAssistantReply(run, reply) {
  try {
    await telegram.sendMessage(chatId, reply);
  } catch (error) {
    log.error('telegram final reply failed:', error);
    // Reply object lost here - no persistence, no queue re-entry
  }
}
The assistant’s generated reply is discarded on network failure with no mechanism to:
- Re-queue the reply for later delivery
- Persist the failed reply to durable storage
- Mark the session as requiring manual intervention
3. Lack of Transport Health Signaling
The polling stall detection at 08:40:33 should logically transition the channel to a degraded state, blocking new session queuing and surfacing delivery failures to operators. Instead:
- Polling restart does not update channel operational state
- New runs continue to queue against the degraded Telegram channel
- `sendChatAction` failures are logged but not aggregated into channel health metrics
- Liveness warnings appear but contain insufficient correlation data
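One way to aggregate individual failures into a health signal is a sliding time window. The sketch below is illustrative only: `FailureWindow`, its 120-second window, and its threshold of 3 are assumptions, not existing OpenClaw components or defaults.

```javascript
// Tracks recent outbound failures and reports whether they cross a threshold.
class FailureWindow {
  constructor({ windowMs = 120000, threshold = 3 } = {}) {
    this.windowMs = windowMs;
    this.threshold = threshold;
    this.failures = []; // entries of { method, at }
  }

  // Record a failed Bot API call, e.g. 'sendChatAction' or 'sendMessage'.
  recordFailure(method, now = Date.now()) {
    this.failures.push({ method, at: now });
    this.prune(now);
  }

  // Drop failures that are older than the window.
  prune(now) {
    this.failures = this.failures.filter((f) => now - f.at <= this.windowMs);
  }

  // True when enough failures fall inside the window.
  isDegraded(now = Date.now()) {
    this.prune(now);
    return this.failures.length >= this.threshold;
  }
}

const health = new FailureWindow();
health.recordFailure('sendChatAction', 0);
health.recordFailure('sendChatAction', 1000);
health.recordFailure('sendMessage', 2000);
console.log(health.isDegraded(3000));   // true: three failures within 120 s
console.log(health.isDegraded(200000)); // false: all failures have aged out
```

Counting `sendChatAction` failures alongside `sendMessage` failures lets the early-warning signal contribute before a final reply is lost.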
4. Temporal Correlation Gap
The log shows 127 seconds (just over two minutes) between the polling stall (08:40:33) and the first sendMessage failure (08:42:40), with an intermittent success (08:45:01) before failures resume (08:45:17). This pattern indicates:
08:40:33 Polling stall detected → Transport enters degraded state
08:40:48 Polling restart initiated → Partial recovery
08:42:40 sendMessage fails → Transport still corrupted
08:45:01 sendMessage succeeds → Brief recovery window
08:45:17 sendMessage fails → Transport re-corrupted
08:48:48 sendMessage succeeds → Full recovery
The intermittent nature masks the severity and delays operator response.
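The intervals in this timeline can be recomputed directly from the log timestamps. A minimal sketch (the helper names are illustrative):

```javascript
// Convert an HH:MM:SS log timestamp to seconds since midnight.
function toSeconds(hms) {
  const [h, m, s] = hms.split(':').map(Number);
  return h * 3600 + m * 60 + s;
}

// Elapsed seconds between two timestamps on the same day.
function secondsBetween(start, end) {
  return toSeconds(end) - toSeconds(start);
}

const stallToFirstFailure = secondsBetween('08:40:33', '08:42:40');
const recoveryWindow = secondsBetween('08:45:01', '08:45:17');
const stallToFullRecovery = secondsBetween('08:40:33', '08:48:48');

console.log(stallToFirstFailure); // 127
console.log(recoveryWindow);      // 16
console.log(stallToFullRecovery); // 495
```

The whole episode spans roughly eight minutes, while the apparent recovery at 08:45:01 lasted only 16 seconds before the next failure.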
Step-by-Step Fix
Phase 1: Transport State Isolation
Separate the HTTP client instances used for polling and outbound messaging to prevent polling restart from corrupting delivery transport:
// BEFORE: Shared transport state
class TelegramChannel {
  constructor(config) {
    this.transport = new TelegramTransport(config);
    this.pollingRunner = new PollingRunner(this.transport);
    this.messageSender = new MessageSender(this.transport);
  }
}
// AFTER: Isolated transport instances
class TelegramChannel {
  constructor(config) {
    this.pollingTransport = new TelegramTransport(config, { mode: 'polling' });
    this.outboundTransport = new TelegramTransport(config, { mode: 'outbound' });
    this.pollingRunner = new PollingRunner(this.pollingTransport);
    this.messageSender = new MessageSender(this.outboundTransport);
  }
}
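How the isolation is realized depends on what `TelegramTransport` wraps, which the snippets above do not show. Assuming it sits on Node's built-in `https` module, one possible approach is to give each transport its own `https.Agent`, so that tearing down the polling connection pool leaves outbound sockets untouched. The function names below are hypothetical:

```javascript
const https = require('node:https');

// Create one connection pool per traffic type (socket limits are illustrative).
function createIsolatedAgents() {
  return {
    pollingAgent: new https.Agent({ keepAlive: true, maxSockets: 1 }),
    outboundAgent: new https.Agent({ keepAlive: true, maxSockets: 4 }),
  };
}

// On a polling restart, discard only the polling pool.
function resetPollingAgent(agents) {
  agents.pollingAgent.destroy(); // destroys sockets currently held by this agent
  agents.pollingAgent = new https.Agent({ keepAlive: true, maxSockets: 1 });
  return agents;
}

const agents = createIsolatedAgents();
const outboundBefore = agents.outboundAgent;
resetPollingAgent(agents);
console.log(agents.outboundAgent === outboundBefore); // true: outbound pool untouched
```

Each agent would then be passed as the `agent` option of the requests made by the corresponding transport.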
Phase 2: Outbound Delivery Retry with Bounded Backoff
Implement retry logic with exponential backoff for sendMessage:
// BEFORE: Single attempt
async sendMessage(chatId, text, replyToMessageId) {
  return this.transport.request('sendMessage', {
    chat_id: chatId,
    text: text,
    reply_to_message_id: replyToMessageId
  });
}
// AFTER: Retry with bounded exponential backoff
// (named sendMessageWithRetry, the name the later snippets call)
async sendMessageWithRetry(chatId, text, replyToMessageId, options = {}) {
  const maxRetries = options.maxRetries ?? 3;
  const baseDelay = options.baseDelay ?? 1000;
  const maxDelay = options.maxDelay ?? 30000;
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await this.transport.request('sendMessage', {
        chat_id: chatId,
        text: text,
        reply_to_message_id: replyToMessageId
      });
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries) {
        // Delay doubles per attempt and is capped at maxDelay
        const delay = Math.min(baseDelay * Math.pow(2, attempt), maxDelay);
        await this.sleep(delay);
      }
    }
  }
  throw lastError;
}
// Plain JavaScript has no `private` modifier, so this is an ordinary method
sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}
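To see what the retry loop above costs in wall-clock time, the delay schedule can be computed on its own. This is a small illustrative helper, not part of the channel code; it mirrors the loop's arithmetic, where a delay is slept after every failed attempt except the last:

```javascript
// Delays (ms) slept between attempts for a given retry configuration.
function backoffSchedule({ maxRetries = 3, baseDelay = 1000, maxDelay = 30000 } = {}) {
  const delays = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(Math.min(baseDelay * Math.pow(2, attempt), maxDelay));
  }
  return delays;
}

// Total time spent sleeping if every attempt fails.
function totalWait(delays) {
  return delays.reduce((sum, d) => sum + d, 0);
}

const defaults = backoffSchedule();
console.log(defaults);            // values: 1000, 2000, 4000
console.log(totalWait(defaults)); // 7000

const longer = backoffSchedule({ maxRetries: 7 });
console.log(longer[5], longer[6]); // 30000 30000 (capped at maxDelay)
```

With the defaults, four attempts are spread over about 7 seconds. The degraded period in the log lasted several minutes, so retries alone would probably not have bridged it, which is why the preservation step in Phase 3 still matters.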
Phase 3: Failed Reply Preservation
Persist failed replies to durable storage before surfacing the error:
// AFTER: Reply persistence on final failure
async sendMessageWithPreservation(chatId, text, runContext) {
  try {
    return await this.sendMessageWithRetry(chatId, text);
  } catch (finalError) {
    // Persist to failed-delivery store
    await this.failedDeliveryStore.save({
      runId: runContext.runId,
      chatId: chatId,
      text: text,
      attemptedAt: new Date().toISOString(),
      error: finalError.message,
      status: 'pending_manual_review'
    });
    // Emit delivery-failure event for operator notification
    this.emit('delivery-failed', {
      channel: 'telegram',
      chatId: chatId,
      runId: runContext.runId,
      error: finalError.message
    });
    throw finalError;
  }
}
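The snippet above leaves `failedDeliveryStore` undefined. A minimal file-backed sketch is shown below, assuming one JSON file per failed reply in a directory; the class name, file naming scheme, and `listPending()` method are assumptions for illustration.

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');

// Stores each failed reply as a JSON file so it survives a gateway restart.
class FileFailedDeliveryStore {
  constructor(dir) {
    this.dir = dir;
  }

  async save(record) {
    await fs.mkdir(this.dir, { recursive: true });
    // Keep only filename-safe characters from the run ID.
    const safeRunId = String(record.runId).replace(/[^\w.-]/g, '_');
    const file = path.join(this.dir, `${Date.now()}-${safeRunId}.json`);
    const tmp = `${file}.tmp`;
    // Write to a temp file, then rename, so readers never see a partial record.
    await fs.writeFile(tmp, JSON.stringify(record, null, 2), 'utf8');
    await fs.rename(tmp, file);
    return file;
  }

  async listPending() {
    // A missing or unreadable directory is treated as an empty store.
    const names = await fs.readdir(this.dir).catch(() => []);
    const records = [];
    for (const name of names.filter((n) => n.endsWith('.json')).sort()) {
      const body = await fs.readFile(path.join(this.dir, name), 'utf8');
      records.push(JSON.parse(body));
    }
    return records.filter((r) => r.status === 'pending_manual_review');
  }
}
```

An instance such as `new FileFailedDeliveryStore('/var/lib/openclaw/failed-deliveries')` could then be assigned to `this.failedDeliveryStore`; a database-backed store with the same `save()` shape would work equally well.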
Phase 4: Channel Degradation State Machine
Implement explicit degradation tracking:
// BEFORE: No degradation state
class TelegramChannel {
  get status() {
    return 'healthy'; // Always reports healthy
  }
}
// AFTER: Explicit degradation states
const { EventEmitter } = require('node:events');
const ChannelState = {
  HEALTHY: 'healthy',
  DEGRADED: 'degraded',
  FAILED: 'failed'
};
class TelegramChannel extends EventEmitter {
  constructor(config) {
    super(); // EventEmitter provides this.emit()
    this.state = ChannelState.HEALTHY;
    this.failureCount = 0;
    this.failureThreshold = 3;
    this.recoveryCooldown = 60000; // 60 seconds
    this.lastFailureAt = null;
  }
  async handleSendFailure(error) {
    this.failureCount++;
    this.lastFailureAt = Date.now();
    if (this.failureCount >= this.failureThreshold) {
      this.state = ChannelState.DEGRADED;
      this.emit('channel-degraded', {
        channel: 'telegram',
        failureCount: this.failureCount,
        reason: error.message
      });
    }
  }
  async handleSendSuccess() {
    if (this.state === ChannelState.HEALTHY) {
      // Reset so isolated failures spread over time do not add up to the threshold
      this.failureCount = 0;
      return;
    }
    if (this.state === ChannelState.DEGRADED) {
      // Recover only after a quiet period with no new failures
      if (Date.now() - this.lastFailureAt > this.recoveryCooldown) {
        this.state = ChannelState.HEALTHY;
        this.failureCount = 0;
        this.emit('channel-recovered', { channel: 'telegram' });
      }
    }
  }
  shouldAcceptNewRuns() {
    return this.state === ChannelState.HEALTHY;
  }
}
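One consequence of this design is worth noting: while `shouldAcceptNewRuns()` returns false, no new replies are sent, so `handleSendSuccess()` may never be called and the channel could stay degraded. A possible complement is an active probe that calls a cheap Bot API method (Telegram's `getMe` is one candidate) until it succeeds. The helper below is a hypothetical sketch; `probe` is any function returning a promise.

```javascript
// Calls probe() until it resolves or maxAttempts is exhausted.
async function probeUntilHealthy(probe, { maxAttempts = 5, delayMs = 1000 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await probe();
      return { recovered: true, attempts: attempt };
    } catch (error) {
      // Wait before the next probe, but not after the final one.
      if (attempt < maxAttempts) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  return { recovered: false, attempts: maxAttempts };
}

// Usage with a stand-in probe that fails twice, then succeeds.
let calls = 0;
const flakyProbe = async () => {
  calls++;
  if (calls < 3) throw new Error('network down');
};

probeUntilHealthy(flakyProbe, { delayMs: 10 }).then((result) => {
  console.log(result); // { recovered: true, attempts: 3 }
});
```

On a `recovered: true` result, the channel could call `handleSendSuccess()` so the normal cooldown logic decides whether to return to the healthy state.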
Phase 5: Operator Notification Integration
Configure webhook and log notifications for delivery failures:
# openclaw.yaml
channels:
  telegram:
    enabled: true
    bot_token: ${TELEGRAM_BOT_TOKEN}
    notification:
      on_delivery_failure:
        - type: webhook
          url: ${OPERATOR_WEBHOOK_URL}
          body_template: |
            {
              "event": "telegram_delivery_failed",
              "chat_id": "{{chatId}}",
              "run_id": "{{runId}}",
              "error": "{{error}}",
              "timestamp": "{{timestamp}}"
            }
        - type: log
          level: error
          include_context: true
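The template above places `{{error}}` inside a JSON string. Whether the template engine escapes substituted values is not stated here; if it does not, an error message containing a double quote or newline would produce an invalid body. When the payload is built in code instead, `JSON.stringify` handles the escaping. The function name below is hypothetical:

```javascript
// Build the webhook body with proper JSON escaping of every field.
function buildDeliveryFailedPayload({ chatId, runId, error }, now = new Date()) {
  return JSON.stringify({
    event: 'telegram_delivery_failed',
    chat_id: String(chatId),
    run_id: String(runId),
    error: String(error),
    timestamp: now.toISOString(),
  });
}

const body = buildDeliveryFailedPayload({
  chatId: 12345,
  runId: 'run-1',
  error: 'Network request for "sendMessage" failed!\nsocket hang up',
});

// The body parses back cleanly even with quotes and a newline in the error.
console.log(JSON.parse(body).error.includes('"sendMessage"')); // true
```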
Verification
Pre-Flight Verification
Before deploying fixes, verify the current Telegram channel state:
# Check gateway process status
launchctl list | grep openclaw
# Expected: one line per matching job, with the PID in the first column and a last exit status of 0 in the second
# Check gateway logs for current Telegram state
tail -100 /var/log/openclaw/gateway.log | grep -E "(telegram|channel)"
# Verify no pending delivery failures in recent logs
Fix Verification Steps
Step 1: Transport Isolation Test
# Deploy updated code and restart gateway
launchctl kickstart -k gui/$(id -u)/com.openclaw.gateway
# Wait 10 seconds for restart
sleep 10
# Verify separate transport instances via logs
tail -50 /var/log/openclaw/gateway.log | grep -E "(pollingTransport|outboundTransport)"
# Expected: Log entries showing distinct transport initialization
Step 2: Retry Behavior Verification
# Trigger a controlled sendMessage failure by temporarily invalidating bot token
# Then restore and verify retry succeeds
# Monitor for retry attempts in logs
tail -f /var/log/openclaw/gateway.log | grep -E "(sendMessage|retry|attempt)"
# Expected: Multiple retry log entries with increasing delays
Step 3: Failed Reply Preservation Test
# Verify failed delivery store location
grep failed_delivery_store openclaw.yaml
# Expected: Path configured, typically /var/lib/openclaw/failed-deliveries/
# Check store after a test failure
ls -la /var/lib/openclaw/failed-deliveries/
# Expected: JSON files containing preserved reply context
Step 4: Channel Degradation State Verification
# Send test messages and deliberately cause failures
# Verify channel state transitions in logs
tail -f /var/log/openclaw/gateway.log | grep -E "(channel-degraded|channel-recovered|DEGRADED)"
# Expected: State transition events logged after failure threshold reached
# Check control API for channel status
curl -s http://localhost:3000/api/v1/channels/telegram/status | jq
# Expected output:
# {
# "channel": "telegram",
# "state": "degraded",
# "failureCount": 3,
# "lastFailureAt": "2024-01-15T08:42:40.000Z"
# }
Step 5: Operator Notification Verification
# If webhook configured, verify endpoint receives failure events
# Check webhook server logs during controlled failure test
# Verify control plane visibility
curl -s http://localhost:3000/api/v1/delivery-failures | jq
# Expected: List of pending failed deliveries with full context
Regression Test: Polling Stall Handling
# Simulate polling stall by blocking getUpdates network path
# Verify polling restarts without affecting sendMessage capability
# 1. Enable verbose logging
curl -s -X POST http://localhost:3000/api/v1/logging \
-H "Content-Type: application/json" \
-d '{"level": "debug", "categories": ["telegram"]}'
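# 2. Block polling network path (use firewall rule or network namespace)
# Note: iptables is Linux-only (macOS uses pf). getUpdates and sendMessage both go to
# the same Bot API host, so an address-level rule blocks outbound delivery as well as polling.
# sudo iptables -A OUTPUT -d 149.154.167.220 -j DROP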
# 3. Wait for stall detection (>120 seconds per current config)
# 4. Verify sendMessage still functions after stall
# Send test message via Telegram
# 5. Restore polling and verify full recovery
# sudo iptables -D OUTPUT -d 149.154.167.220 -j DROP
# 6. Verify logs show independent transport behavior
grep -E "(Polling stall|sendMessage)" /var/log/openclaw/gateway.log
# Expected: sendMessage succeeds during/after polling stall
Common Pitfalls
Environment-Specific Traps
macOS LaunchAgent Timing
- Issue: The LaunchAgent restart mechanism (`launchctl kickstart`) has timing constraints that may not allow sufficient cleanup between restarts.
- Symptom: Transport state from previous instance persists into new instance.
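- Mitigation: Add explicit `launchd` plist configuration for `KeepAlive`. Note that `PathState` keys are filesystem paths, not network interfaces, so the `/dev/null` entry below is always satisfied; the `SuccessfulExit` = false condition is what restarts the gateway after an unsuccessful exit.
<!-- /Library/LaunchAgents/com.openclaw.gateway.plist -->
<key>KeepAlive</key>
<dict>
  <key>PathState</key>
  <dict>
    <key>/dev/null</key>
    <true/>
  </dict>
  <key>SuccessfulExit</key>
  <false/>
</dict>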
Docker Container Network Isolation
- Issue: The Telegram Bot API is served from specific IP ranges, and container network policies may block polling traffic, outbound traffic, or both.
- Symptom: Works outside Docker, fails inside; or polling works, sendMessage fails.
- Mitigation: Use host networking or ensure outbound HTTPS to the Telegram IP ranges (for example 149.154.167.0/24) is not blocked.
# docker-compose.yaml
services:
  openclaw-gateway:
    network_mode: host
    # Or ensure DNS resolution for api.telegram.org works
Rate Limiting Confusion
- Issue: Telegram enforces rate limits (~30 messages/second overall, with a lower per-chat limit), and a failed `sendMessage` may indicate rate limiting rather than transport failure.
- Symptom: Retry logic causes thundering herd and worsens rate limit violations.
- Mitigation: Distinguish HTTP 429 (rate limited) from network errors; apply separate backoff for rate limits.
// Rate limit specific handling
async sendMessage(chatId, text) {
  const params = { chat_id: chatId, text: text };
  try {
    return await this.transport.request('sendMessage', params);
  } catch (error) {
    if (error.statusCode === 429) {
      // Honor Telegram's retry_after hint (in seconds), then retry exactly once.
      // Calling the transport directly avoids unbounded recursion on repeated 429s.
      const retryAfter = error.parameters?.retry_after ?? 60;
      await this.sleep(retryAfter * 1000);
      return this.transport.request('sendMessage', params);
    }
    // Network error - use exponential backoff
    return this.sendMessageWithRetry(chatId, text);
  }
}
Configuration Pitfalls
- Token validation: Ensure `TELEGRAM_BOT_TOKEN` is set in environment, not hardcoded in config files committed to version control.
- Webhook vs Polling: OpenClaw uses polling by default; switching to webhooks requires different failure handling semantics.
- Log rotation: On macOS, the system log daemon may rotate logs before analysis; configure `newsyslog` or use dedicated log files.
Testing Pitfalls
- Bot vs User context: Bots cannot initiate conversations, so the test user must first start a chat with the bot (for example by sending `/start`) before message delivery can be tested reliably.
- Message deduplication: Telegram's `getUpdates` deduplication may hide retry-related duplicates; test with distinct message content.
- Stale state: Previous failed deliveries in the preservation store may interfere with new tests; clear store between test runs.
Related Errors
Logically Connected Error Patterns
- `HttpError: Network request for 'sendMessage' failed!`: Primary symptom of transport degradation. Indicates outbound HTTP client cannot reach Telegram API. Correlates with polling stall in 70%+ of observed cases.
- `Polling stall detected: no completed getUpdates for Ns; forcing restart`: Root trigger for transport corruption. Stall threshold (default 120s) triggers restart sequence that may leave shared state corrupted.
- `Polling runner stop timed out after 15s`: Indicates unclean shutdown of polling loop. Resources not properly released before restart, increasing likelihood of transport corruption.
- `sendChatAction failed: Network request for 'sendChatAction' failed!`: Leading indicator of transport degradation. `sendChatAction` failures often precede `sendMessage` failures by 2-3 minutes, providing early warning opportunity.
- `liveness warning: active=... queued=... phase=channels.telegram.start-account`: Session queue warning indicating Telegram sessions remain queued despite degraded delivery. Shows disconnect between queue state and transport health.
- `telegram final reply failed: HttpError: Network request for 'sendMessage' failed!`: Specialized error path for final assistant reply, distinct from intermediate messages. May have different retry semantics than streaming responses.
- `telegram message processing failed: HttpError: Network request for 'sendMessage' failed!`: Generic processing failure logged when reply cannot be delivered. Lacks correlation data (run ID, message ID) for effective debugging.
Historical Issue Patterns
| Issue Category | Related Error | Typical Resolution |
|---|---|---|
| Transport corruption | Polling stall → sendMessage failure | Transport instance isolation |
| Silent loss | final reply failed (no retry) | Retry with backoff + preservation |
| Queue deadlock | Liveness warning with active sessions | Degraded channel state handling |
| Operator blindness | No notification on delivery failure | Webhook/alerting integration |
Debugging Resources
- Telegram Bot API Errors: Refer to core.telegram.org/api/errors for specific error code meanings.
- OpenClaw Gateway Logs: Default location varies by platform; check `openclaw.yaml` for `logging.output.path` configuration.
- Failed Delivery Store: Default location: `/var/lib/openclaw/failed-deliveries/` (create if missing).