April 22, 2026

Telegram Forum: Message Sent to Wrong Topic After Anthropic 529 Retry

When the Anthropic API returns 529 (overloaded) and OpenClaw retries the request, the reply is sent without the correct message_thread_id, so the message disappears from the forum topic.

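🔍 Symptoms

Primary symptom

After an Anthropic API 529 retry cycle, the Telegram reply is sent successfully (the Telegram API returns ok and a valid message_id), but the message does not appear in the expected forum topic.

Log evidence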

2026-03-03T11:19:05.208Z [agent/embedded] embedded run agent end: runId=561c9fa1 isError=true error=The AI service is temporarily overloaded.
2026-03-03T11:19:05.685Z [agent/embedded] embedded run agent end: runId=81dab484 isError=true error=The AI service is temporarily overloaded.
2026-03-03T11:24:34.955Z [telegram] sendMessage ok chat=-1003885638534 message=13832

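Diagnostic signs

  • No thread not found error: Telegram did not reject a thread ID
  • No message_thread_id in the logs: the debug output omits the thread parameter, which hampered diagnosis
  • A gap of roughly 5 minutes between the last 529 error and sendMessage (suggesting retries with backoff)
  • Missing session record: no session record for topic:562 exists for the day of the incident
  • stale-socket restart: occurred 7 minutes after sendMessage, by which time the message was already lost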
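
The size of that gap can be checked directly from the log timestamps. A minimal sketch, assuming each log line starts with an ISO 8601 timestamp as in the excerpt above; the helper name gapMs is illustrative and not part of OpenClaw:

```javascript
// Hypothetical helper: milliseconds between the timestamps of two log lines.
// Assumes every line begins with an ISO 8601 timestamp followed by a space.
function gapMs(earlierLine, laterLine) {
  const ts = (line) => Date.parse(line.split(' ')[0]);
  return ts(laterLine) - ts(earlierLine);
}

const last529 =
  '2026-03-03T11:19:05.685Z [agent/embedded] embedded run agent end: runId=81dab484 isError=true error=The AI service is temporarily overloaded.';
const send =
  '2026-03-03T11:24:34.955Z [telegram] sendMessage ok chat=-1003885638534 message=13832';

const gap = gapMs(last529, send);
console.log(`${(gap / 1000).toFixed(2)} s`); // 329.27 s, about 5.5 minutes
```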

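User-visible behavior

  • The original message in topic 562 received no reply
  • The reply's message ID exists on Telegram's side (confirmed by the API response)
  • The message is visible in neither the target topic nor the "General" topic
  • The message appears "sent" but is in fact orphaned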

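🧠 Root cause analysis

Primary failure: thread context lost during retry

The root cause is a context-propagation failure in the retry pipeline. When Anthropic returns HTTP 529, the following sequence occurs:

  1. Message received: OpenClaw receives a Telegram update containing message.chat.id, message.message_thread_id: 562, and the session context
  2. API call issued: OpenClaw calls Anthropic's messages API with the session context
  3. 529 received: Anthropic returns HTTP 529: The AI service is temporarily overloaded
  4. Retry triggered: OpenClaw's retry mechanism (with backoff) reattempts the API call
  5. Context lost: during the retry cycle, the original Telegram update's message_thread_id is not carried through to the sendMessage call

Architectural issue: session state vs. inline context

OpenClaw uses a session-based architecture in which session context lives in a session store. The critical bug occurs here: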

// Simplified flow showing the failure point
async function handleUpdate(update) {
  const chatId = update.message.chat.id; // chat lives under update.message
  const threadId = update.message.message_thread_id; // 562 - captured here

  // On first attempt, session is created/loaded
  const session = await sessionStore.get(chatId);
  session.threadId = threadId;
  await sessionStore.set(chatId, session);

  // ... API call made, 529 received, retried until `response` arrives ...

  // On retry, session state may be stale or overwritten
  const retrySession = await sessionStore.get(chatId);
  // retrySession.threadId could be undefined, null, or wrong value

  // sendMessage called without correct thread_id
  await telegram.sendMessage({
    chat_id: chatId,
    text: response,
    message_thread_id: retrySession.threadId // BUG: undefined!
  });
}

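Contributing factors

  • The retry delay opens a race window: the roughly 5 minutes between the 529 and the eventual send leave time for session state to be cleared, corrupted, or overwritten
  • No thread_id in the sendMessage log: the debug statement omits message_thread_id, which prevented early detection:
    // Current (broken) log format
    console.log(`sendMessage ok chat=${chatId} message=${messageId}`);

    // Missing: message_thread_id=${threadId || 'undefined'}

  • Session store TTL/expiry: if the session expires during the retry window, the thread context is lost
  • Concurrent message handling: if another message arrives in a different topic during the retry, the session state may be overwritten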
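
One way to remove the concurrent-overwrite factor is to key session state by chat and topic rather than by chat alone, so that an update in another topic cannot clobber the stored thread ID. A minimal sketch, assuming a key-value session store; the sessionKey helper, the key format, and the second topic ID 777 are illustrative, not OpenClaw's actual scheme:

```javascript
// Hypothetical helper: build a session key that is unique per forum topic.
// Messages outside any topic (no message_thread_id) share the "main" slot.
function sessionKey(message) {
  const threadPart = message.message_thread_id ?? 'main';
  return `session:${message.chat.id}:${threadPart}`;
}

// Two updates in the same chat but different topics get separate sessions.
// Topic 777 is a made-up second topic for the example.
const store = new Map();
const inTopic562 = { chat: { id: -1003885638534 }, message_thread_id: 562 };
const inTopic777 = { chat: { id: -1003885638534 }, message_thread_id: 777 };

store.set(sessionKey(inTopic562), { threadId: 562 });
store.set(sessionKey(inTopic777), { threadId: 777 });

console.log(store.get(sessionKey(inTopic562)).threadId); // 562
```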

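Why no error was raised

Telegram accepts a sendMessage call without message_thread_id because the parameter is optional; the message is then routed to the forum's main ("General") topic instead of being rejected. How the main topic is displayed in forum groups reportedly varies by client and Telegram version, and some clients may hide such messages entirely when the original context came from a different thread.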
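
Because Telegram will not reject the request, the check has to live in the bot. A minimal sketch of a guard that fails loudly when a reply to a topic message is about to be sent without its thread ID; assertThreadContext is a hypothetical helper, not an OpenClaw or Telegram API:

```javascript
// Hypothetical guard: throw if the incoming message belonged to a forum topic
// but the outgoing payload does not target that same topic.
function assertThreadContext(incomingMessage, payload) {
  const expected = incomingMessage.message_thread_id;
  if (expected === undefined || expected === null) {
    return payload; // not a topic message, nothing to enforce
  }
  if (payload.message_thread_id !== expected) {
    throw new Error(
      `thread context lost: expected message_thread_id=${expected}, ` +
        `got ${payload.message_thread_id}`
    );
  }
  return payload;
}

const incoming = { chat: { id: -1003885638534 }, message_thread_id: 562 };

// Passes: the payload targets the original topic.
assertThreadContext(incoming, {
  chat_id: incoming.chat.id,
  text: 'reply',
  message_thread_id: 562,
});

// Throws: this is the payload shape produced by the bug.
try {
  assertThreadContext(incoming, { chat_id: incoming.chat.id, text: 'reply' });
} catch (err) {
  console.log(err.message);
}
```

Calling the guard immediately before every send turns a silently misplaced message into an error that shows up in the logs.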

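🛠️ Step-by-step fix

Step 1: Make sure the thread ID is passed to sendMessage

Modify the Telegram adapter so that message_thread_id always reaches the sendMessage payload and appears in the log. The caller supplies it through options, taking the value from the incoming message rather than relying on session state: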

// BEFORE (broken implementation)
async sendMessage(chatId, text, options = {}) {
  const payload = {
    chat_id: chatId,
    text: text,
    // message_thread_id not included - defaults to 0/undefined
    ...options
  };
  
  const result = await this.telegram.sendMessage(payload);
  console.log(`sendMessage ok chat=${chatId} message=${result.message_id}`);
  return result;
}

// AFTER (fixed implementation)
async sendMessage(chatId, text, options = {}) {
  const payload = {
    chat_id: chatId,
    text: text,
    parse_mode: 'Markdown',
    ...options
    // message_thread_id MUST be passed explicitly in options
    // No defaulting to undefined - caller is responsible
  };

  const result = await this.telegram.sendMessage(payload);

  // Enhanced logging with thread_id
  console.log(`sendMessage ok chat=${chatId} thread=${payload.message_thread_id ?? 'main'} message=${result.message_id}`);
  return result;
}

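Step 2: Preserve the thread ID across the retry cycle

Make sure the thread_id from the incoming message is passed to the sendMessage call without depending on session state: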

// BEFORE (session-dependent)
async handleMessage(ctx, messageText) {
  const session = await this.getSession(ctx.chat.id);
  const response = await this.callAIWithRetry(messageText, session.context);
  
  // Thread ID from session - may be stale after retry
  await this.telegram.sendMessage(ctx.chat.id, response, {
    message_thread_id: session.threadId
  });
}

// AFTER (incoming message context preserved)
async handleMessage(ctx, messageText) {
  // Capture thread_id from the ACTUAL incoming message, not session
  const originalThreadId = ctx.message.message_thread_id;
  
  const session = await this.getSession(ctx.chat.id);
  const response = await this.callAIWithRetry(messageText, session.context);
  
  // Always use the original message's thread_id
  await this.telegram.sendMessage(ctx.chat.id, response, {
    message_thread_id: originalThreadId
  });
}
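
The effect of capturing the thread ID up front can be exercised without Telegram or Anthropic. The sketch below uses stand-ins: a fake model call that fails twice with status 529, a session object that loses its threadId during the retry, and a recording sendMessage. All of these names are test doubles invented for illustration, and the backoff sleep is omitted to keep the run instant:

```javascript
// Test double: fails `failures` times with status 529, then succeeds.
function makeFlakyModel(failures) {
  let calls = 0;
  return async () => {
    calls += 1;
    if (calls <= failures) {
      const err = new Error('overloaded');
      err.status = 529;
      throw err;
    }
    return 'Response';
  };
}

// Minimal retry loop without backoff.
async function retry(fn, maxAttempts) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

async function handleMessage(message, session, callModel, sendMessage) {
  // Captured once from the incoming message; immune to session changes.
  const originalThreadId = message.message_thread_id;

  const text = await retry(async () => {
    try {
      return await callModel();
    } catch (err) {
      delete session.threadId; // simulate the session going stale mid-retry
      throw err;
    }
  }, 3);

  await sendMessage(message.chat.id, text, { message_thread_id: originalThreadId });
}

const sent = [];
const session = { threadId: 562 };
const done = handleMessage(
  { chat: { id: -1003885638534 }, message_thread_id: 562 },
  session,
  makeFlakyModel(2),
  async (chatId, text, options) => sent.push({ chatId, text, options })
).then(() => {
  console.log(session.threadId, sent[0].options.message_thread_id); // undefined 562
});
```

Even though the session has lost its threadId by the time the reply is ready, the send still targets topic 562.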

Step 3: Add the thread ID to all send operations

Make sure every Telegram send method includes the thread_id when operating in a forum context:

// Helper to build send options with thread context
function buildSendOptions(originalMessage, overrides = {}) {
  const options = { ...overrides };
  
  // Always include thread_id if original message had one
  if (originalMessage.message_thread_id) {
    options.message_thread_id = originalMessage.message_thread_id;
  }
  
  return options;
}

// Usage
const sendOptions = buildSendOptions(ctx.message);
await this.telegram.sendMessage(ctx.chat.id, text, sendOptions);
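// Other send* methods (sendPhoto, sendChatAction, ...) accept the same option;
// edit* methods act on an existing message and take no thread ID
await this.telegram.sendChatAction(ctx.chat.id, 'typing', sendOptions);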

Step 4: Improve retry logging

Log the thread_id on every retry attempt to aid debugging:

// Small helper used for backoff delays
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async callAIWithRetry(message, context, threadId) {
  const maxRetries = 3;
  let lastError;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    console.log(`[retry] attempt=${attempt} thread=${threadId} maxRetries=${maxRetries}`);

    try {
      return await this.anthropic.messages.create({
        model: 'claude-3-5-sonnet-20241022',
        max_tokens: 1024,
        messages: [{ role: 'user', content: message }]
      });
    } catch (error) {
      lastError = error;

      if (error.status !== 529) {
        throw error; // Non-retryable error
      }

      console.log(`[retry] received 529 (overloaded) thread=${threadId}`);
      if (attempt < maxRetries) {
        // Exponential backoff, capped at 30 s; skipped after the final attempt
        const backoffMs = Math.min(1000 * Math.pow(2, attempt), 30000);
        console.log(`[retry] backing off for ${backoffMs}ms thread=${threadId}`);
        await sleep(backoffMs);
      }
    }
  }

  throw lastError; // Every attempt returned 529
}
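
To see what this backoff formula produces, the delays can be tabulated. A small worked example using the same expression as the code above; backoffMs and totalBackoffMs are illustrative helper names:

```javascript
// Same expression as in callAIWithRetry: exponential backoff capped at 30 s.
function backoffMs(attempt) {
  return Math.min(1000 * Math.pow(2, attempt), 30000);
}

// Total time spent sleeping if each of `failedAttempts` attempts is
// followed by a backoff.
function totalBackoffMs(failedAttempts) {
  let total = 0;
  for (let attempt = 1; attempt <= failedAttempts; attempt++) {
    total += backoffMs(attempt);
  }
  return total;
}

console.log([1, 2, 3, 4, 5, 6].map(backoffMs));
// [ 2000, 4000, 8000, 16000, 30000, 30000 ]
console.log(totalBackoffMs(3)); // 14000
```

With maxRetries = 3 these sleeps total at most 14 seconds, well short of the roughly 5-minute gap seen in the logs, so this particular configuration would not by itself account for that gap.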

Step 5: Session state locking (advanced)

Prevent session state corruption during long retry cycles:

// Use optimistic locking for session updates
async updateSession(chatId, updater, threadId) {
  const maxAttempts = 3;
  
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const session = await this.sessionStore.get(chatId);
    const updated = updater(session);
    
    // Preserve thread_id across session updates
    updated.threadId = session.threadId || threadId;
    
    try {
      await this.sessionStore.set(chatId, updated);
      return updated;
    } catch (conflictError) {
      if (attempt === maxAttempts) throw conflictError;
      await sleep(50 * attempt); // Brief backoff
    }
  }
}
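
The snippet above retries failed writes but relies on the store itself to detect conflicts. A minimal sketch of what version-based optimistic locking could look like, assuming an in-memory store with a compare-and-set operation; VersionedStore and its methods are hypothetical and stand in for whatever the real session store offers:

```javascript
// Hypothetical in-memory store: each record carries a version counter, and a
// write succeeds only if the caller saw the latest version.
class VersionedStore {
  constructor() {
    this.records = new Map();
  }
  get(key) {
    const rec = this.records.get(key) ?? { version: 0, value: {} };
    return { version: rec.version, value: { ...rec.value } };
  }
  compareAndSet(key, expectedVersion, value) {
    const current = this.records.get(key)?.version ?? 0;
    if (current !== expectedVersion) return false; // someone else wrote first
    this.records.set(key, { version: current + 1, value });
    return true;
  }
}

// Re-read and re-apply the updater until the write wins or attempts run out.
function updateSession(store, key, updater, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { version, value } = store.get(key);
    const updated = updater(value);
    if (store.compareAndSet(key, version, updated)) return updated;
  }
  throw new Error(`session update conflict after ${maxAttempts} attempts`);
}

const store = new VersionedStore();
updateSession(store, 'chat:-1003885638534', (s) => ({ ...s, threadId: 562 }));

// A writer holding a stale version is rejected instead of overwriting.
const stale = store.get('chat:-1003885638534'); // version 1
updateSession(store, 'chat:-1003885638534', (s) => ({ ...s, lastSeen: 'later' }));
console.log(store.compareAndSet('chat:-1003885638534', stale.version, {})); // false
console.log(store.get('chat:-1003885638534').value.threadId); // 562
```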

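🧪 Verification

Step 1: Reproduce the 529 scenario

With the Anthropic client stubbed or otherwise forced to return 529, post a forum-topic update to the webhook so that the retry path runs: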

# Using curl to simulate the Telegram update webhook
curl -X POST http://localhost:3000/webhook/telegram \
  -H "Content-Type: application/json" \
  -d '{
    "update_id": 123456789,
    "message": {
      "message_id": 100,
      "chat": { "id": -1003885638534, "type": "supergroup" },
      "message_thread_id": 562,
      "text": "Test message for 529 retry scenario"
    }
  }'

Step 2: Verify the sendMessage log output

After applying the fix, confirm that the log line includes the thread field:

# Expected log output AFTER fix
2026-03-03T11:24:34.955Z [telegram] sendMessage ok chat=-1003885638534 thread=562 message=13832

# Should NOT see (before fix):
2026-03-03T11:24:34.955Z [telegram] sendMessage ok chat=-1003885638534 message=13832

Step 3: Verify the message appears in the correct topic

# The Bot API has no method for fetching a message by ID, so inspect the
# Message object that the bot's own sendMessage call returned (log the
# full response while testing).

# Expected response includes:
{
  "ok": true,
  "result": {
    "message_id": 13832,
    "chat": { "id": -1003885638534, "type": "supergroup" },
    "message_thread_id": 562,  // <-- Must match original
    "is_topic_message": true,
    "text": "..."
  }
}

Step 4: Verify the session contains the thread ID

# Check session store for correct thread_id
# (depends on session store implementation)

# If using Redis:
redis-cli GET "session:-1003885638534"
# Should contain: {"threadId": 562, "..."}

# If using file-based:
cat sessions/-1003885638534.json
# Should contain: {"threadId": 562, "..."}

Step 5: Unit test for thread context preservation

describe('Telegram forum thread context', () => {
  it('should preserve message_thread_id through 529 retry', async () => {
    const ctx = createMockContext({
      chatId: -1003885638534,
      messageId: 100,
      threadId: 562,
      text: 'Test message'
    });
    
    // Mock Anthropic to return 529 twice, then success
    aiClient.messages.create
      .mockRejectedValueOnce({ status: 529, message: 'overloaded' })
      .mockRejectedValueOnce({ status: 529, message: 'overloaded' })
      .mockResolvedValueOnce({ content: [{ type: 'text', text: 'Response' }] });
    
    await handler.handleUpdate(ctx);
    
    // Verify sendMessage was called with correct thread_id
    expect(telegramAdapter.sendMessage).toHaveBeenCalledWith(
      -1003885638534,
      expect.any(String),
      expect.objectContaining({ message_thread_id: 562 })
    );
  });
});
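
The test relies on a createMockContext helper that is not shown. A possible shape, assuming the handler reads ctx.chat.id and ctx.message.message_thread_id as in the earlier snippets; the field layout is an assumption modeled on those snippets rather than OpenClaw's real context type:

```javascript
// Hypothetical test helper: build the minimal context object the handler needs.
function createMockContext({ chatId, messageId, threadId, text }) {
  const chat = { id: chatId, type: 'supergroup' };
  const message = { message_id: messageId, chat, text };
  if (threadId !== undefined) {
    // Only topic messages carry these fields.
    message.message_thread_id = threadId;
    message.is_topic_message = true;
  }
  return { chat, message };
}

const ctx = createMockContext({
  chatId: -1003885638534,
  messageId: 100,
  threadId: 562,
  text: 'Test message',
});
console.log(ctx.message.message_thread_id); // 562
```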

Step 6: Integration test against the Telegram test environment

# Use Telegram's test environment or a private bot
# Send message in a forum topic, trigger 529 error, verify reply location

# 1. Set BOT_TOKEN to test bot
export BOT_TOKEN="test_bot_token"

# 2. Run openclaw with logging
OPENCLAW_LOG_LEVEL=debug npm start

# 3. Monitor for:
# - sendMessage logs with thread=562
# - Message appears in correct topic
# - No "lost" messages

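⚠️ Common pitfalls

Environment-specific pitfalls

  • Docker container restarts wipe session state

    If OpenClaw runs in Docker and the container restarts during a long retry cycle, session state (including thread_id) is lost. Make sure the session store is externalized (Redis) rather than held in memory.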

    # Docker Compose configuration - externalize session storage
    services:
      openclaw:
        image: openclaw:latest
        environment:
          - SESSION_STORE=redis
          - REDIS_URL=redis://redis:6379
      redis:
        image: redis:7-alpine
        volumes:
          - redis-data:/data
    volumes:
      redis-data:
    
  • macOS file descriptor limits

    With file-based sessions on macOS, the default ulimit can cause session writes to fail under heavy load:

    # Check current limit
    ulimit -n
    # Increase if below 1024
    ulimit -n 65535
    
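  • Windows path separators and session keys

    Session store file paths may run into problems on Windows because of special characters in chat IDs (the leading hyphen). Note that encodeURIComponent leaves "-" unchanged, so a fixed prefix is needed to keep the file name from starting with a hyphen:

    // Encode the chat ID and prefix it so the file name never starts with "-"
    const sessionPath = path.join(
      sessionDir,
      `chat_${encodeURIComponent(String(chatId))}.json`
    );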
    

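Configuration pitfalls

  • Forgetting to enable forum support in BotFather

    A Telegram bot needs explicit group membership permissions to work with forum topics:

    # Required BotFather commands:
    # /setprivacy -> Disable (for forum access)
    # /setjoingroups -> Enable
    # /setforums -> Enable (if available)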
    
  • Session TTL out of step with retry backoff

    If the session TTL is shorter than the retry backoff period, the thread context expires:

    # Example: 5-minute TTL but 5-minute backoff = guaranteed context loss
    SESSION_TTL=300000  # 5 minutes in ms
    MAX_RETRY_BACKOFF=300000  # Should be less than TTL
    
  • Using reply_to_message_id without message_thread_id

    Even with reply_to_message_id set correctly, omitting message_thread_id can cause forum messages to go missing:

    // BROKEN: reply without thread context
    {
      chat_id: -1003885638534,
      text: "Reply text",
      reply_to_message_id: 100
      // Missing: message_thread_id: 562
    }

    // CORRECT: include both
    {
      chat_id: -1003885638534,
      text: "Reply text",
      reply_to_message_id: 100,
      message_thread_id: 562
    }
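
The TTL mismatch described in the pitfalls above can be caught at startup rather than in production. A minimal sketch of such a check, assuming the settings are available as millisecond values and that the worst case is every attempt timing out followed by a maximum-length backoff; validateRetryWindow, requestTimeoutMs, and the formula itself are assumptions for illustration:

```javascript
// Hypothetical upper bound on one retry cycle: every attempt times out and
// every backoff hits the cap.
function worstCaseRetryMs({ maxRetries, maxBackoffMs, requestTimeoutMs }) {
  return maxRetries * (requestTimeoutMs + maxBackoffMs);
}

// Hypothetical startup check: the session must outlive the longest retry cycle.
function validateRetryWindow(config) {
  const worstCase = worstCaseRetryMs(config);
  if (config.sessionTtlMs <= worstCase) {
    throw new Error(
      `SESSION_TTL (${config.sessionTtlMs} ms) must exceed the worst-case ` +
        `retry window (${worstCase} ms)`
    );
  }
  return worstCase;
}

// The mismatched values from the TTL pitfall: 5-minute TTL, 5-minute backoff.
try {
  validateRetryWindow({
    sessionTtlMs: 300000,
    maxRetries: 3,
    maxBackoffMs: 300000,
    requestTimeoutMs: 60000, // assumed per-request timeout
  });
} catch (err) {
  console.log(err.message);
}
```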

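Code-level pitfalls

  • Storing thread_id as a string in some places and a number in others

    The Telegram API accepts both, but mixing types causes problems (for example, 562 === "562" is false):

    // Pick one representation and use it everywhere.
    // Option A: integer, which some clients expect
    const threadId = parseInt(message.message_thread_id, 10);
    // Option B: string, if the session store only handles strings
    // const threadId = String(message.message_thread_id);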
    
  • Overwriting the session in concurrent handlers

    If several messages arrive in the same chat at once, session writes can race:

    // PROBLEMATIC: Read-modify-write without atomicity
    const session = await getSession(chatId);  // Read
    session.threadId = threadId;  // Modify
    await saveSession(chatId, session);  // Write - another request may overwrite

    // FIXED: Use atomic operations or locking
    await updateSessionAtomic(chatId, (s) => {
      s.threadId = threadId;
      return s;
    });

  • Async/await race conditions in retry handlers

    Callback/Promise chains can lose context:

    // PROBLEMATIC: thread ID kept in state shared across messages
    let threadId;

    function handleMessage(ctx) {
      threadId = ctx.message.message_thread_id;

      retry(3, () => ai.call()).then(response => {
        // 'threadId' may be stale: if a new message arrived during the
        // retry, it was overwritten with that message's thread
        sendMessage(ctx.chat.id, response, { message_thread_id: threadId });
      });
    }

    // FIXED: Capture context in closure
    function handleMessage(ctx) {
      const threadId = ctx.message.message_thread_id; // Capture immediately

      retry(3, () => ai.call()).then(response => {
        sendMessage(ctx.chat.id, response, { message_thread_id: threadId });
      });
    }
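
The updateSessionAtomic call used in the concurrent-handler pitfall is not defined in this guide. One possible in-process implementation serializes updates per chat by chaining promises. This is a sketch under the assumption of a single Node.js process and an async get/set store (passed in here as an extra first argument); it does not protect against writers in other processes:

```javascript
// Hypothetical implementation: a per-chat promise chain acts as a mutex, so
// read-modify-write cycles for the same chat never interleave.
const chains = new Map();

function updateSessionAtomic(store, chatId, updater) {
  const previous = chains.get(chatId) ?? Promise.resolve();
  const next = previous
    .catch(() => {}) // a failed earlier update must not block later ones
    .then(async () => {
      const session = (await store.get(chatId)) ?? {};
      const updated = updater(session);
      await store.set(chatId, updated);
      return updated;
    });
  chains.set(chatId, next);
  return next;
}

// In-memory async store standing in for Redis or files.
const data = new Map();
const store = {
  get: async (key) => data.get(key),
  set: async (key, value) => {
    data.set(key, value);
  },
};

// Two concurrent updates: both changes survive because they run in sequence.
const done = Promise.all([
  updateSessionAtomic(store, -1003885638534, (s) => ({ ...s, threadId: 562 })),
  updateSessionAtomic(store, -1003885638534, (s) => ({ ...s, lastText: 'hi' })),
]).then(() => {
  console.log(data.get(-1003885638534)); // { threadId: 562, lastText: 'hi' }
});
```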

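🔗 Related errors

  • HTTP 529: The AI service is temporarily overloaded

    Anthropic's overloaded error (distinct from the 429 rate-limit error), which triggers the retry sequence. The 529 is the triggering event for this bug.

  • stale-socket

    A gateway health-monitor restart that occurred 7 minutes after the message was lost. Not directly related to this issue, but it points to unstable underlying connections that may aggravate retry problems.

  • thread not found (not seen in this case)

    The Telegram API error returned when message_thread_id refers to a nonexistent topic. Its absence confirms that Telegram received either a valid thread ID or no thread ID at all.

  • Session expiry during long-running operations

    Related to GitHub issues in which a session TTL shorter than the operation's duration causes context loss. The root cause is similar to that of the thread_id loss described here.

  • Webhook delivery failures with missing thread context

    A related issue in which Telegram webhook updates for forum messages arrive without message_thread_id, causing routing failures.

  • context.lengthExceeded Anthropic error

    Returned by Anthropic when the session context grows too large during retries. If the error handling drops state, it can compound the thread-context problem.

  • Race conditions in concurrent Telegram updates

    When several updates arrive in the same chat at once, session state can be overwritten and the thread_id lost. This is the same architectural weakness as this bug.

  • Message sent to the wrong chat after a bot token refresh

    A related session/context-loss scenario in which the bot configuration changes mid-operation and messages are misrouted.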

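Evidence and sources

This troubleshooting guide was automatically synthesized from community discussions by the FixClaw intelligent pipeline.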