May 07, 2026 • Version: 0.9.x

Telegram Channel Silent Reply Loss After Polling Stall

OpenClaw's Telegram channel can silently drop assistant replies when polling encounters network stalls, leaving users with the appearance of a non-responsive assistant despite active gateway operation.

🔍 Symptoms

Primary Symptom: Silent Message Loss

The Telegram channel enters a degraded operational state where inbound messages continue to be accepted, but outbound sendMessage requests fail without actionable operator notification. Users experience complete assistant non-responsiveness despite the gateway reporting healthy websocket connectivity.

Log Manifestations

08:40:33 ERROR [telegram] Polling stall detected
  no completed getUpdates for 124.98s; forcing restart

08:40:48 ERROR [telegram] Polling runner stop timed out after 15s
08:40:48 ERROR Telegram polling runner stopped; restarting in 7.22s

08:40–08:42 ERROR telegram sendChatAction failed:
  Network request for 'sendChatAction' failed!

08:42:40 ERROR telegram sendMessage failed:
  Network request for 'sendMessage' failed!

08:42:40 ERROR telegram final reply failed:
  HttpError: Network request for 'sendMessage' failed!

08:42:41 ERROR telegram message processing failed:
  HttpError: Network request for 'sendMessage' failed!

08:45:01 INFO telegram sendMessage ok

08:45:17 ERROR telegram sendMessage failed:
  Network request for 'sendMessage' failed!

08:48:33 WARN liveness warning:
  active=agent:main:telegram:direct:
  queued=agent:main:telegram:direct:
  phase=channels.telegram.start-account

08:48:48 INFO telegram sendMessage ok
08:48:50 INFO telegram sendMessage ok

Behavioral Indicators

  • Intermittent recovery: Some `sendMessage` calls succeed (08:45:01, 08:48:48, 08:48:50) while others fail, indicating partial transport degradation rather than complete outage.
  • Polling runner restart timing: Stalls correlate with `sendMessage` failures, suggesting shared transport state corruption.
  • Session queue persistence: The liveness warning shows sessions remaining `active` and `queued` despite delivery failures, indicating that the queue does not account for degraded channel state.
  • sendChatAction as failure proxy: `sendChatAction` failures precede and correlate with `sendMessage` failures, suggesting transport health degradation.
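
The intermittent-recovery indicator can be checked mechanically. The following is a minimal sketch, assuming the log line shapes shown in the excerpt above; `tallySendMessage` is a hypothetical helper, not part of OpenClaw.

```javascript
// Hypothetical helper: count sendMessage successes and failures in log lines.
function tallySendMessage(lines) {
  const tally = { ok: 0, failed: 0 };
  for (const line of lines) {
    if (/telegram sendMessage ok/.test(line)) {
      tally.ok += 1;
    } else if (/telegram sendMessage failed/.test(line)) {
      tally.failed += 1;
    }
  }
  return tally;
}

// sendMessage lines copied from the log excerpt in this guide.
const excerpt = [
  '08:42:40 ERROR telegram sendMessage failed:',
  '08:45:01 INFO telegram sendMessage ok',
  '08:45:17 ERROR telegram sendMessage failed:',
  '08:48:48 INFO telegram sendMessage ok',
  '08:48:50 INFO telegram sendMessage ok',
];

console.log(tallySendMessage(excerpt)); // { ok: 3, failed: 2 }
```

A mix of successes and failures in the same window points toward partial degradation; a tally with zero successes would suggest a full outage instead.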

User-Facing Impact

User sends Telegram message → Gateway receives → Assistant processes →
  → sendMessage fails → No retry → No operator alert → Silent loss

The failure sequence produces identical UX to assistant non-responsiveness, making root-cause diagnosis impossible without log analysis.

🧠 Root Cause

Architectural Analysis

The issue stems from the intersection of four architectural weaknesses in OpenClaw's Telegram channel implementation:

1. Shared Transport State Corruption

The Telegram channel uses a shared HTTP client/transport layer for both polling (getUpdates) and outbound messaging (sendMessage, sendChatAction). When getUpdates stalls and triggers a polling runner restart, the transport state may be corrupted or left in an inconsistent condition.

TelegramChannel
├── PollingRunner
│   ├── getUpdates() ← Stalls, triggers restart
│   └── HTTP Client (shared state)
└── MessageSender
    ├── sendMessage() ← Fails due to corrupted transport
    └── sendChatAction() ← Fails due to corrupted transport

The restart sequence at 08:40:48 (Polling runner stop timed out after 15s) indicates the shutdown did not cleanly release resources before the restart initiated, leaving the transport in a degraded state for outbound operations.

2. Non-Idempotent Failure Handling

When sendMessage fails, the error is logged but the reply is not preserved for retry or later inspection:

// Simplified failure path
async function handleAssistantReply(run, reply) {
  try {
    await telegram.sendMessage(chatId, reply);
  } catch (error) {
    log.error('telegram final reply failed:', error);
    // Reply object lost here - no persistence, no queue re-entry
  }
}

The assistant's generated reply is discarded on network failure with no mechanism to:

  • Re-queue the reply for later delivery
  • Persist the failed reply to durable storage
  • Mark the session as requiring manual intervention

3. Lack of Transport Health Signaling

The polling stall detection at 08:40:33 should logically transition the channel to a degraded state, blocking new session queuing and surfacing delivery failures to operators. Instead:

  • Polling restart does not update channel operational state
  • New runs continue to queue against the degraded Telegram channel
  • `sendChatAction` failures are logged but not aggregated into channel health metrics
  • Liveness warnings appear but contain insufficient correlation data
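
One way to aggregate individual `sendChatAction` and `sendMessage` outcomes into a single health signal is a sliding-window failure rate. This is a sketch only: `TransportHealthWindow` and its default window size are assumptions for illustration, not existing OpenClaw code.

```javascript
// Hypothetical sliding-window aggregator for outbound call outcomes.
class TransportHealthWindow {
  constructor(windowMs = 120000) {
    this.windowMs = windowMs;
    this.samples = []; // each sample: { at, method, ok }
  }

  // Record one outbound call result; `at` is a millisecond timestamp.
  record(method, ok, at = Date.now()) {
    this.samples.push({ at, method, ok });
    this.prune(at);
  }

  // Drop samples that are older than the window.
  prune(now) {
    const cutoff = now - this.windowMs;
    this.samples = this.samples.filter((sample) => sample.at >= cutoff);
  }

  // Fraction of calls in the window that failed (0 when there are no samples).
  failureRate(now = Date.now()) {
    this.prune(now);
    if (this.samples.length === 0) return 0;
    const failed = this.samples.filter((sample) => !sample.ok).length;
    return failed / this.samples.length;
  }
}
```

A channel could compare `failureRate()` against a threshold before accepting new runs, so that repeated `sendChatAction` failures count toward degradation instead of only appearing in the log.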

4. Temporal Correlation Gap

The log shows 127 seconds (just over two minutes) between the polling stall (08:40:33) and the first sendMessage failure (08:42:40), with intermittent success (08:45:01) before failures resume (08:45:17). This pattern indicates:

08:40:33  Polling stall detected → Transport enters degraded state
08:40:48  Polling restart initiated → Partial recovery
08:42:40  sendMessage fails → Transport still corrupted
08:45:01  sendMessage succeeds → Brief recovery window
08:45:17  sendMessage fails → Transport re-corrupted
08:48:48  sendMessage succeeds → Full recovery

The intermittent nature masks the severity and delays operator response.
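
The gaps between these events can be computed directly from the timestamps. The sketch below assumes all events fall on the same day and uses the HH:MM:SS values from the log; the helper names are hypothetical.

```javascript
// Convert an HH:MM:SS log timestamp to seconds since midnight.
function toSeconds(timestamp) {
  const [hours, minutes, seconds] = timestamp.split(':').map(Number);
  return hours * 3600 + minutes * 60 + seconds;
}

// Elapsed seconds between two same-day timestamps.
function secondsBetween(start, end) {
  return toSeconds(end) - toSeconds(start);
}

// Events taken from the timeline above.
const timeline = [
  ['08:40:33', 'polling stall detected'],
  ['08:40:48', 'polling restart initiated'],
  ['08:42:40', 'sendMessage fails'],
  ['08:45:01', 'sendMessage succeeds'],
  ['08:45:17', 'sendMessage fails'],
  ['08:48:48', 'sendMessage succeeds'],
];

// Print each event with the gap since the previous one.
for (let i = 1; i < timeline.length; i++) {
  const gap = secondsBetween(timeline[i - 1][0], timeline[i][0]);
  console.log(`${timeline[i][0]} ${timeline[i][1]} (+${gap}s)`);
}
```

From the first stall to the last listed success the episode spans 495 seconds, a little over eight minutes, which is long enough for several user messages to go unanswered.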

🛠️ Step-by-Step Fix

Phase 1: Transport State Isolation

Separate the HTTP client instances used for polling and outbound messaging to prevent polling restart from corrupting delivery transport:

// BEFORE: Shared transport state
class TelegramChannel {
  constructor(config) {
    this.transport = new TelegramTransport(config);
    this.pollingRunner = new PollingRunner(this.transport);
    this.messageSender = new MessageSender(this.transport);
  }
}

// AFTER: Isolated transport instances
class TelegramChannel {
  constructor(config) {
    this.pollingTransport = new TelegramTransport(config, { mode: 'polling' });
    this.outboundTransport = new TelegramTransport(config, { mode: 'outbound' });
    this.pollingRunner = new PollingRunner(this.pollingTransport);
    this.messageSender = new MessageSender(this.outboundTransport);
  }
}
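
Separate transport objects only help if they do not share a connection pool underneath. The sketch below assumes the transport is built on Node's `https` module and gives each mode its own keep-alive agent; `createTransportAgents` and `resetPollingAgent` are hypothetical names, and the socket limits are illustrative.

```javascript
const https = require('node:https');

// Hypothetical factory: one keep-alive agent (socket pool) per transport mode.
function createTransportAgents() {
  return {
    // Long polling keeps a single request open at a time.
    polling: new https.Agent({ keepAlive: true, maxSockets: 1 }),
    // Outbound calls are short and may overlap.
    outbound: new https.Agent({ keepAlive: true, maxSockets: 8 }),
  };
}

// On a polling restart, discard only the polling sockets.
// The outbound agent and its connections are left untouched.
function resetPollingAgent(agents) {
  agents.polling.destroy();
  agents.polling = new https.Agent({ keepAlive: true, maxSockets: 1 });
  return agents;
}

const agents = createTransportAgents();
resetPollingAgent(agents);
console.log(agents.polling !== agents.outbound); // true
```

Each agent would be passed as the `agent` option of the requests made by its transport, so a forced polling restart cannot tear down sockets that `sendMessage` is using.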

Phase 2: Outbound Delivery Retry with Bounded Backoff

Implement retry logic with exponential backoff for sendMessage:

// BEFORE: Single attempt
async sendMessage(chatId, text, replyToMessageId) {
  return this.transport.request('sendMessage', {
    chat_id: chatId,
    text: text,
    reply_to_message_id: replyToMessageId
  });
}

// AFTER: Retry with bounded exponential backoff
// (named sendMessageWithRetry to match the calls in Phase 3 and the rate-limit example)
async sendMessageWithRetry(chatId, text, replyToMessageId, options = {}) {
  const maxRetries = options.maxRetries ?? 3;
  const baseDelay = options.baseDelay ?? 1000;
  const maxDelay = options.maxDelay ?? 30000;
  
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await this.transport.request('sendMessage', {
        chat_id: chatId,
        text: text,
        reply_to_message_id: replyToMessageId
      });
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries) {
        const delay = Math.min(baseDelay * Math.pow(2, attempt), maxDelay);
        await this.sleep(delay);
      }
    }
  }
  throw lastError;
}

sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}
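
To see how much waiting the defaults above imply, the delay formula can be evaluated on its own. This is a worked sketch of `Math.min(baseDelay * 2^attempt, maxDelay)`; `backoffDelays` is a hypothetical helper that lists the sleep before each retry.

```javascript
// List the delay (in ms) applied before each retry attempt.
function backoffDelays(maxRetries = 3, baseDelay = 1000, maxDelay = 30000) {
  const delays = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(Math.min(baseDelay * Math.pow(2, attempt), maxDelay));
  }
  return delays;
}

const defaults = backoffDelays();
const totalWait = defaults.reduce((sum, delay) => sum + delay, 0);

console.log(defaults);  // [ 1000, 2000, 4000 ]
console.log(totalWait); // 7000
```

With the defaults, the retry budget is about 7 seconds of waiting plus request time. The degraded windows in the log last minutes, so retries alone may not be enough and the preservation step in Phase 3 still matters. Raising `maxRetries` to 6 extends the schedule to 1, 2, 4, 8, 16, and 30 seconds, the last value being capped by `maxDelay`.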

Phase 3: Failed Reply Preservation

Persist failed replies to durable storage before surfacing the error:

// AFTER: Reply persistence on final failure
async sendMessageWithPreservation(chatId, text, runContext) {
  try {
    return await this.sendMessageWithRetry(chatId, text);
  } catch (finalError) {
    // Persist to failed-delivery store
    await this.failedDeliveryStore.save({
      runId: runContext.runId,
      chatId: chatId,
      text: text,
      attemptedAt: new Date().toISOString(),
      error: finalError.message,
      status: 'pending_manual_review'
    });
    
    // Emit delivery-failure event for operator notification
    this.emit('delivery-failed', {
      channel: 'telegram',
      chatId: chatId,
      runId: runContext.runId,
      error: finalError.message
    });
    
    throw finalError;
  }
}
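
The `failedDeliveryStore` used above is not defined in this guide. A minimal file-backed sketch, assuming one JSON file per failed reply in a configurable directory, could look like the following; the class name, file naming scheme, and `listPending` method are all hypothetical.

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');

// Hypothetical file-backed store: one JSON file per failed delivery.
class FileFailedDeliveryStore {
  constructor(directory) {
    this.directory = directory;
  }

  // Write the record to a temp file, then rename it so readers never see a partial file.
  async save(record) {
    await fs.mkdir(this.directory, { recursive: true });
    const safeRunId = String(record.runId).replace(/[^A-Za-z0-9_-]/g, '_');
    const filePath = path.join(this.directory, `${safeRunId}-${Date.now()}.json`);
    const tempPath = `${filePath}.tmp`;
    await fs.writeFile(tempPath, JSON.stringify(record, null, 2), 'utf8');
    await fs.rename(tempPath, filePath);
    return filePath;
  }

  // Return every stored record still awaiting manual review.
  async listPending() {
    await fs.mkdir(this.directory, { recursive: true });
    const names = await fs.readdir(this.directory);
    const pending = [];
    for (const name of names.filter((entry) => entry.endsWith('.json'))) {
      const raw = await fs.readFile(path.join(this.directory, name), 'utf8');
      const record = JSON.parse(raw);
      if (record.status === 'pending_manual_review') pending.push(record);
    }
    return pending;
  }
}
```

The file name combines the run ID with a millisecond timestamp, so two failures for the same run within the same millisecond would collide; a production store would need a stronger unique key.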

Phase 4: Channel Degradation State Machine

Implement explicit degradation tracking:

// BEFORE: No degradation state
class TelegramChannel {
  get status() {
    return 'healthy'; // Always reports healthy
  }
}

// AFTER: Explicit degradation states
const ChannelState = {
  HEALTHY: 'healthy',
  DEGRADED: 'degraded',
  FAILED: 'failed'
};

class TelegramChannel {
  constructor(config) {
    this.state = ChannelState.HEALTHY;
    this.failureCount = 0;
    this.failureThreshold = 3;
    this.recoveryCooldown = 60000; // 60 seconds
    this.lastFailureAt = null;
  }
  
  async handleSendFailure(error) {
    this.failureCount++;
    this.lastFailureAt = Date.now();
    
    if (this.failureCount >= this.failureThreshold) {
      this.state = ChannelState.DEGRADED;
      this.emit('channel-degraded', {
        channel: 'telegram',
        failureCount: this.failureCount,
        reason: error.message
      });
    }
  }
  
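  async handleSendSuccess() {
    if (this.state === ChannelState.HEALTHY) {
      // Count consecutive failures only: a success while healthy resets the counter
      this.failureCount = 0;
      return;
    }
    if (this.state === ChannelState.DEGRADED) {
      // Recover only after the cooldown has passed since the last failure
      if (Date.now() - this.lastFailureAt > this.recoveryCooldown) {
        this.state = ChannelState.HEALTHY;
        this.failureCount = 0;
        this.emit('channel-recovered', { channel: 'telegram' });
      }
    }
  }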
  
  shouldAcceptNewRuns() {
    return this.state === ChannelState.HEALTHY;
  }
}
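
The transitions are easier to follow when replayed against a controllable clock. The sketch below is a self-contained variant of the tracker above, assuming consecutive-failure counting and an injectable `now` function; `DegradationTracker` is a hypothetical name and the timestamps are synthetic.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical stand-alone tracker with an injectable clock for testing.
class DegradationTracker extends EventEmitter {
  constructor({ failureThreshold = 3, recoveryCooldown = 60000, now = Date.now } = {}) {
    super();
    this.state = 'healthy';
    this.failureCount = 0;
    this.failureThreshold = failureThreshold;
    this.recoveryCooldown = recoveryCooldown;
    this.lastFailureAt = null;
    this.now = now;
  }

  recordFailure(error) {
    this.failureCount += 1;
    this.lastFailureAt = this.now();
    if (this.state === 'healthy' && this.failureCount >= this.failureThreshold) {
      this.state = 'degraded';
      this.emit('channel-degraded', { failureCount: this.failureCount, reason: error.message });
    }
  }

  recordSuccess() {
    if (this.state !== 'degraded') {
      this.failureCount = 0; // only consecutive failures count
      return;
    }
    // Stay degraded until the cooldown since the last failure has elapsed.
    if (this.now() - this.lastFailureAt > this.recoveryCooldown) {
      this.state = 'healthy';
      this.failureCount = 0;
      this.emit('channel-recovered', {});
    }
  }
}

// Replay: three failures, an early success, then a success after the cooldown.
let clock = 0;
const tracker = new DegradationTracker({ now: () => clock });
for (const at of [0, 1000, 2000]) {
  clock = at;
  tracker.recordFailure(new Error('Network request failed'));
}
console.log(tracker.state); // degraded
clock = 12000;
tracker.recordSuccess();
console.log(tracker.state); // degraded (cooldown not yet elapsed)
clock = 62001;
tracker.recordSuccess();
console.log(tracker.state); // healthy
```

Requiring a quiet period before recovery avoids flapping between states during the kind of intermittent success seen in the log.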

Phase 5: Operator Notification Integration

Configure webhook or log notification for delivery failures:

# openclaw.yaml
channels:
  telegram:
    enabled: true
    bot_token: ${TELEGRAM_BOT_TOKEN}
    notification:
      on_delivery_failure:
        - type: webhook
          url: ${OPERATOR_WEBHOOK_URL}
          body_template: |
            {
              "event": "telegram_delivery_failed",
              "chat_id": "{{chatId}}",
              "run_id": "{{runId}}",
              "error": "{{error}}",
              "timestamp": "{{timestamp}}"
            }
        - type: log
          level: error
          include_context: true
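
Because `body_template` is a JSON document with `{{placeholder}}` slots, substituted values need JSON escaping or an error message containing a quote or newline will produce an invalid payload. How OpenClaw renders the template is not documented here; the following is a sketch of one safe approach, with `renderBodyTemplate` as a hypothetical helper.

```javascript
// Hypothetical renderer: replace {{key}} slots with JSON-escaped values.
function renderBodyTemplate(template, context) {
  return template.replace(/\{\{(\w+)\}\}/g, (match, key) => {
    // Leave unknown placeholders untouched so they are easy to spot.
    if (!Object.hasOwn(context, key)) return match;
    // JSON.stringify escapes quotes and control characters; the template
    // already supplies the surrounding quotes, so strip the outer pair.
    return JSON.stringify(String(context[key])).slice(1, -1);
  });
}

const template = '{"event": "telegram_delivery_failed", "chat_id": "{{chatId}}", "error": "{{error}}"}';
const body = renderBodyTemplate(template, {
  chatId: 12345,
  error: 'HttpError: Network request for "sendMessage" failed!',
});

// The rendered body parses cleanly even though the error contains double quotes.
console.log(JSON.parse(body).error);
```

The same escaping concern applies to any template engine used for webhook bodies: values should be encoded for the target format, not pasted in verbatim.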

🧪 Verification

Pre-Flight Verification

Before deploying fixes, verify the current Telegram channel state:

# Check gateway process status
launchctl list | grep openclaw
# Expected: a line for the gateway label with a PID in the first column and 0 in the status column

# Check gateway logs for current Telegram state
tail -100 /var/log/openclaw/gateway.log | grep -E "(telegram|channel)"
# Verify no pending delivery failures in recent logs

Fix Verification Steps

Step 1: Transport Isolation Test

# Deploy updated code and restart gateway
launchctl kickstart -k gui/$(id -u)/com.openclaw.gateway
# Wait 10 seconds for restart
sleep 10

# Verify separate transport instances via logs
tail -50 /var/log/openclaw/gateway.log | grep -E "(pollingTransport|outboundTransport)"
# Expected: Log entries showing distinct transport initialization

Step 2: Retry Behavior Verification

# Trigger a controlled sendMessage failure by temporarily invalidating bot token
# Then restore and verify retry succeeds

# Monitor for retry attempts in logs
tail -f /var/log/openclaw/gateway.log | grep -E "(sendMessage|retry|attempt)"
# Expected: Multiple retry log entries with increasing delays

Step 3: Failed Reply Preservation Test

# Verify failed delivery store location
grep failed_delivery_store openclaw.yaml
# Expected: Path configured, typically /var/lib/openclaw/failed-deliveries/

# Check store after a test failure
ls -la /var/lib/openclaw/failed-deliveries/
# Expected: JSON files containing preserved reply context

Step 4: Channel Degradation State Verification

# Send test messages and deliberately cause failures
# Verify channel state transitions in logs

tail -f /var/log/openclaw/gateway.log | grep -E "(channel-degraded|channel-recovered|DEGRADED)"
# Expected: State transition events logged after failure threshold reached

# Check control API for channel status
curl -s http://localhost:3000/api/v1/channels/telegram/status | jq
# Expected output:
# {
#   "channel": "telegram",
#   "state": "degraded",
#   "failureCount": 3,
#   "lastFailureAt": "2024-01-15T08:42:40.000Z"
# }

Step 5: Operator Notification Verification

# If webhook configured, verify endpoint receives failure events
# Check webhook server logs during controlled failure test

# Verify control plane visibility
curl -s http://localhost:3000/api/v1/delivery-failures | jq
# Expected: List of pending failed deliveries with full context

Regression Test: Polling Stall Handling

# Simulate polling stall by blocking getUpdates network path
# Verify polling restarts without affecting sendMessage capability

# 1. Enable verbose logging
curl -s -X POST http://localhost:3000/api/v1/logging \
  -H "Content-Type: application/json" \
  -d '{"level": "debug", "categories": ["telegram"]}'

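# 2. Block polling network path (use firewall rule or network namespace)
#    iptables is Linux-only; on macOS use an equivalent pf rule. getUpdates and
#    sendMessage both go to api.telegram.org, so an IP-level block affects both;
#    block only the polling connection if outbound delivery should keep working.
# sudo iptables -A OUTPUT -d 149.154.167.220 -j DROP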

# 3. Wait for stall detection (>120 seconds per current config)

# 4. Verify sendMessage still functions after stall
# Send test message via Telegram

# 5. Restore polling and verify full recovery
# sudo iptables -D OUTPUT -d 149.154.167.220 -j DROP

# 6. Verify logs show independent transport behavior
grep -E "(Polling stall|sendMessage)" /var/log/openclaw/gateway.log
# Expected: sendMessage succeeds during/after polling stall

⚠️ Common Pitfalls

Environment-Specific Traps

macOS LaunchAgent Timing

  • Issue: The LaunchAgent restart mechanism (`launchctl kickstart`) has timing constraints that may not allow sufficient cleanup between restarts.
  • Symptom: Transport state from previous instance persists into new instance.
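  • Mitigation: Add explicit `KeepAlive` configuration to the `launchd` plist. Note that `PathState` watches filesystem paths rather than network interfaces, so the `/dev/null` entry below is always satisfied and serves only as a placeholder.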
# /Library/LaunchAgents/com.openclaw.gateway.plist
<key>KeepAlive</key>
<dict>
  <key>PathState</key>
  <dict>
    <key>/dev/null</key>
    <true/>
  </dict>
  <key>SuccessfulExit</key>
  <false/>
</dict>

Docker Container Network Isolation

  • Issue: The Telegram API is served from specific IP ranges, and container network policies may treat polling and outbound connections differently.
  • Symptom: Works outside Docker, fails inside; or polling works, sendMessage fails.
  • Mitigation: Use host networking or ensure Telegram IP ranges (149.154.167.0/24) are not blocked.
# docker-compose.yaml
services:
  openclaw-gateway:
    network_mode: host
    # Or ensure DNS resolution for api.telegram.org works

Rate Limiting Confusion

  • Issue: Telegram enforces rate limits (~30 messages/second), and failed `sendMessage` may indicate rate limiting rather than transport failure.
  • Symptom: Retry logic causes thundering herd and worsens rate limit violations.
  • Mitigation: Distinguish HTTP 429 (rate limited) from network errors; apply separate backoff for rate limits.
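// Rate limit specific handling
async sendMessage(chatId, text, rateLimitRetried = false) {
  try {
    return await this.transport.request('sendMessage', { chat_id: chatId, text });
  } catch (error) {
    if (error.statusCode === 429) {
      // Already retried once after a 429: surface the error instead of looping
      if (rateLimitRetried) throw error;
      // Honor the server-provided wait time (seconds), defaulting to 60
      const retryAfter = error.parameters?.retry_after ?? 60;
      await this.sleep(retryAfter * 1000);
      return this.sendMessage(chatId, text, true); // Single retry for rate limits
    }
    // Network error - use exponential backoff
    return this.sendMessageWithRetry(chatId, text);
  }
}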

Configuration Pitfalls

  • Token validation: Ensure `TELEGRAM_BOT_TOKEN` is set in environment, not hardcoded in config files committed to version control.
  • Webhook vs Polling: OpenClaw uses polling by default; switching to webhooks requires different failure handling semantics.
  • Log rotation: On macOS, the system log daemon may rotate logs before analysis; configure `newsyslog` or use dedicated log files.

Testing Pitfalls

  • Bot vs User context: Bots cannot initiate conversations, so the test user must first message the bot (for example with /start) before the bot can deliver messages to that chat.
  • Message deduplication: Telegram's `getUpdates` deduplication may hide retry-related duplicates; test with distinct message content.
  • Stale state: Previous failed deliveries in the preservation store may interfere with new tests; clear store between test runs.

Logically Connected Error Patterns

  • `HttpError: Network request for 'sendMessage' failed!`
    Primary symptom of transport degradation. Indicates outbound HTTP client cannot reach Telegram API. Correlates with polling stall in 70%+ of observed cases.
  • `Polling stall detected: no completed getUpdates for Ns; forcing restart`
    Root trigger for transport corruption. Stall threshold (default 120s) triggers restart sequence that may leave shared state corrupted.
  • `Polling runner stop timed out after 15s`
    Indicates unclean shutdown of polling loop. Resources not properly released before restart, increasing likelihood of transport corruption.
  • `sendChatAction failed: Network request for 'sendChatAction' failed!`
    Leading indicator of transport degradation. `sendChatAction` failures often precede `sendMessage` failures by 2-3 minutes, providing early warning opportunity.
  • `liveness warning: active=... queued=... phase=channels.telegram.start-account`
    Session-queue warning indicating that Telegram sessions remain queued despite degraded delivery. Shows the disconnect between queue state and transport health.
  • `telegram final reply failed: HttpError: Network request for 'sendMessage' failed!`
    Specialized error path for final assistant reply, distinct from intermediate messages. May have different retry semantics than streaming responses.
  • `telegram message processing failed: HttpError: Network request for 'sendMessage' failed!`
    Generic processing failure logged when reply cannot be delivered. Lacks correlation data (run ID, message ID) for effective debugging.
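
The missing correlation data noted in the last pattern could be addressed by logging delivery failures as structured entries. This is a sketch under assumptions: the field names and the `describeDeliveryFailure` helper are hypothetical and do not reflect OpenClaw's actual log format.

```javascript
// Hypothetical formatter: one JSON log line per failed delivery,
// carrying the identifiers needed to correlate it with a run.
function describeDeliveryFailure({ runId, chatId, messageId, method, error }) {
  return JSON.stringify({
    event: 'telegram_delivery_failed',
    method,
    runId,
    chatId,
    messageId: messageId ?? null, // null when no reply target is known
    error: error.message,
  });
}

console.log(
  describeDeliveryFailure({
    runId: 'run-001',
    chatId: 12345,
    method: 'sendMessage',
    error: new Error("Network request for 'sendMessage' failed!"),
  })
);
```

With run and chat identifiers in every failure entry, an operator can match a lost reply to the run that produced it without reconstructing the sequence from timestamps.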

Historical Issue Patterns

Issue Category        | Related Error                          | Typical Resolution
Transport corruption  | Polling stall → sendMessage failure    | Transport instance isolation
Silent loss           | final reply failed (no retry)          | Retry with backoff + preservation
Queue deadlock        | Liveness warning with active sessions  | Degraded channel state handling
Operator blindness    | No notification on delivery failure    | Webhook/alerting integration

Debugging Resources

  • Telegram Bot API Errors: Refer to core.telegram.org/api/errors for specific error code meanings.
  • OpenClaw Gateway Logs: Default location varies by platform; check openclaw.yaml for logging.output.path configuration.
  • Failed Delivery Store: Default location: /var/lib/openclaw/failed-deliveries/ (create if missing).

Evidence & Sources

This troubleshooting guide was automatically synthesized by the FixClaw Intelligence Pipeline from community discussions.