May 10, 2026 • Version: 2026.5.7

Gateway Draining Deadlock Caused by Stalled model_call and Cross-Channel Delivery Routing Failure

Two interrelated bugs cause a gateway restart deadlock: stalled model_call tasks block draining indefinitely because the drain has no hard timeout, and subagent notifications are routed to the wrong messaging channel (feishu instead of openclaw-weixin).

🔍 Symptoms

Bug 1: Gateway Draining Deadlock

The gateway becomes unresponsive and cannot complete its restart sequence because a stalled model_call task blocks draining indefinitely.

CLI Execution Sequence:

# Terminal 1: Observe stalled session diagnostic
$ tail -f ~/.openclaw/logs/gateway.log
[STALLED SESSION DETECTED]
sessionId=bbe85782-7c56-4ea6-bfdb-9ab2e2c5b3ab
state=processing age=175s queueDepth=1
reason=active_work_without_progress
classification=stalled_agent_run
activeWorkKind=model_call
lastProgress=model_call:started
lastProgressAge=168s
recovery=none

# Terminal 2: Trigger restart via gateway tool
$ openclaw gateway restart --force
GatewayDrainingError: Gateway is draining for restart; new tasks are not accepted

# Observe continuous draining loop
$ grep "still draining" ~/.openclaw/logs/gateway.log
still draining 6 active task(s) and 3 active embedded run(s) before restart
still draining 6 active task(s) and 3 active embedded run(s) before restart
... (repeats every 30s for 10+ minutes)

Exit Code Behavior:

Gateway restart command exits with code 1 due to GatewayDrainingError
Channel stop command times out after 5000ms: channel stop exceeded 5000ms after abort; continuing shutdown
Final shutdown completes in 6310ms with forced continuation

Bug 2: Cross-Channel Delivery Routing Failure

Subagent completion notifications are routed to the incorrect messaging channel, causing API rejection.

CLI Inspection of Pending Deliveries:

$ ls -la ~/.openclaw/delivery-queue/
total 20
-rw-r--r--  1 user  staff  1024 Jun 15 03:21 file_001.json
-rw-r--r--  1 user  staff  1024 Jun 15 03:21 file_002.json
... (20 files)

$ cat ~/.openclaw/delivery-queue/file_001.json
{
  "channel": "feishu",
  "to": "[email protected]",
  "accountId": "2005b227a854-im-bot",
  "agentId": "main"
}

# Compare with correct requesterOrigin
$ cat ~/.openclaw/delivery-queue/file_001.json | jq .requesterOrigin
{
  "channel": "openclaw-weixin",
  "to": "[email protected]",
  "accountId": "2005b227a854-im-bot"
}

API Error Response:

HTTP 400 Bad Request
feishu_code: 99992360
error: Invalid ids: [[email protected]]

# Gateway recovery attempts
[RECOVERY] Attempting delivery retry 1/20
[RECOVERY] Attempting delivery retry 2/20
...
[RECOVERY] Attempting delivery retry 20/20
[RECOVERY] All retries exhausted; delivery marked as failed

Error Classification:

Field	Expected Value	Actual Value
`channel`	`openclaw-weixin`	`feishu`
`to`	WeChat user ID (correct)	WeChat user ID (correct)
API Endpoint Called	WeChat API	Feishu API

🧠 Root Cause

Bug 1: Draining Mechanism Lacks Hard Timeout

Architectural Failure Sequence:

Stalled model_call detection: The diagnostic system correctly identifies the stalled session with `activeWorkKind=model_call` and `classification=stalled_agent_run`, but assigns `recovery=none`.
Draining entry: When restart is requested, Gateway enters draining state and attempts graceful task completion before shutdown.
Infinite blocking: The draining mechanism has no hard timeout for individual tasks. It repeatedly logs "still draining N active task(s)" every 30 seconds without taking forceful action.
Resource starvation: The stalled `model_call` (168+ seconds without progress) blocks its parent session, preventing cleanup.
Delivery recovery interference: As discovered in Bug 2, 20 pending delivery recovery tasks may also contribute to the "6 active task(s) and 3 active embedded run(s)" count.
Forced shutdown cascade: Channel stop attempts abort after 5000ms, but the underlying model_call is never explicitly terminated.

Code Path Analysis:

// Simplified call sequence showing the deadlock
Gateway.restart()
  → DrainingState.enter()
    → SessionManager.getActiveTasks() // returns 6 tasks including stalled model_call
    → for each task: Task.waitForCompletion()  // BLOCKS INDEFINITELY
      → model_call never returns (upstream API stalled)
    → DrainingState.checkComplete()
      → if incomplete: re-loop with 30s delay  // NO TIMEOUT CHECK
        → repeat indefinitely

Bug 2: Channel Routing Inherited from Agent Context, Not requesterOrigin

Root Cause Chain:

Multi-channel agent binding: The `main` agent in `openclaw.json` is bound to both `openclaw-weixin` and `feishu` channels with the same `accountId: 2005b227a854-im-bot`.
Subagent spawn context: When a subagent is spawned by a cron run from the WeChat channel, the requester origin correctly captures `{ channel: "openclaw-weixin", ... }`.
Channel inheritance bug: The delivery notification system uses the agent's primary channel binding (derived from agent context at spawn time) rather than propagating the `requesterOrigin.channel` field.
Channel mismatch: The delivery queue writes `{ channel: "feishu", to: "@im.wechat" }`, so the Feishu API receives a WeChat-format user ID and rejects it as invalid.
Retry storm: Delivery recovery attempts retry 20 times with exponential backoff, contributing to the active task count during draining.

Data Flow Diagram:

WeChat User → Cron Run (channel=openclaw-weixin)
  → Subagent Spawn
    → requesterOrigin correctly set: { channel: "openclaw-weixin", to: "wechat_id" }
    → BUT delivery.channel inherited from: agent.config.channels[0] → "feishu"
  → Subagent Fails
    → Delivery Queue Entry Created
      → channel: "feishu" ← BUG: should be "openclaw-weixin"
      → to: "[email protected]" ← CORRECT
  → Feishu API Called ← FAILURE: invalid user ID format

Configuration Context (from openclaw.json):

{
  "agents": {
    "main": {
      "accountId": "2005b227a854-im-bot",
      "channels": ["feishu", "openclaw-weixin"]  // feishu is primary (index 0)
    }
  }
}
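
To make the effect of channel ordering concrete, here is a minimal sketch that reproduces the mismatch with the configuration above. The types and the two function names are simplified, hypothetical stand-ins; the real `SubagentContext` and `RequesterOrigin` shapes are not shown in this write-up and may differ.

```typescript
// Hypothetical, simplified shapes used only for this illustration.
interface RequesterOrigin {
  channel?: string;
  to?: string;
}

interface AgentConfig {
  accountId: string;
  channels: string[];
}

// Behaviour described in the root cause: always take the first bound channel.
function inheritedChannel(agent: AgentConfig): string {
  return agent.channels[0];
}

// Intended behaviour: prefer the channel the request actually came from.
function originChannel(agent: AgentConfig, origin?: RequesterOrigin): string {
  return origin?.channel || agent.channels[0];
}

const mainAgent: AgentConfig = {
  accountId: "2005b227a854-im-bot",
  channels: ["feishu", "openclaw-weixin"],
};
const origin: RequesterOrigin = { channel: "openclaw-weixin" };

console.log(inheritedChannel(mainAgent));      // feishu
console.log(originChannel(mainAgent, origin)); // openclaw-weixin
```

With no requester origin at all, `originChannel` still falls back to the first configured channel, so ordering only matters for deliveries that carry no origin.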

🛠️ Step-by-Step Fix

Fix 1: Add Hard Timeout to Gateway Draining

File: packages/gateway/src/draining.ts

// BEFORE (line 45-52)
async drain(activeTasks: Task[], timeoutMs: number = 30000): Promise<void> {  // timeoutMs is accepted but never checked
  const checkInterval = 5000;
  while (activeTasks.some(t => !t.isComplete)) {
    await sleep(checkInterval);
    logger.info(`still draining ${activeTasks.length} active task(s)...`);
  }
}

// AFTER
async drain(activeTasks: Task[], timeoutMs: number = 60000): Promise<void> {
  const checkInterval = 5000;
  const deadline = Date.now() + timeoutMs;

  while (activeTasks.some(t => !t.isComplete)) {
    if (Date.now() >= deadline) {
      logger.warn(`Draining timeout exceeded (${timeoutMs}ms); forcing task abort`);
      // allSettled: one failing abort must not prevent the others from running
      await Promise.allSettled(
        activeTasks.filter(t => !t.isComplete).map(t => t.abort())
      );
      break;
    }
    await sleep(checkInterval);
    // Report only the tasks that are still pending, not the original total
    const pending = activeTasks.filter(t => !t.isComplete).length;
    const remaining = Math.max(0, deadline - Date.now());
    logger.info(`still draining ${pending} active task(s)... (${Math.ceil(remaining/1000)}s remaining)`);
  }
}
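
The deadline logic can be exercised in isolation. The following is a minimal, self-contained sketch of the same idea, not the gateway's actual code: `DrainableTask`, `drainWithDeadline`, and `makeStalledTask` are hypothetical names, and the short timings are chosen only so the example finishes quickly.

```typescript
// Hypothetical minimal task shape; the real Task class is not shown in this write-up.
interface DrainableTask {
  isComplete: boolean;
  abort(): Promise<void>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Waits for tasks to finish, force-aborting whatever is left at the deadline.
// Returns the number of tasks that had to be force-aborted.
async function drainWithDeadline(
  tasks: DrainableTask[],
  timeoutMs: number,
  checkIntervalMs: number,
): Promise<number> {
  const deadline = Date.now() + timeoutMs;
  while (tasks.some((t) => !t.isComplete)) {
    if (Date.now() >= deadline) {
      const stuck = tasks.filter((t) => !t.isComplete);
      await Promise.allSettled(stuck.map((t) => t.abort()));
      return stuck.length;
    }
    // Never sleep past the deadline.
    await sleep(Math.min(checkIntervalMs, Math.max(0, deadline - Date.now())));
  }
  return 0;
}

// A task that never finishes on its own, standing in for the stalled model_call.
function makeStalledTask(): DrainableTask {
  const task: DrainableTask = {
    isComplete: false,
    async abort() {
      task.isComplete = true;
    },
  };
  return task;
}

const finished: DrainableTask = { isComplete: true, abort: async () => {} };

drainWithDeadline([makeStalledTask(), finished], 50, 10).then((aborted) => {
  console.log(`force-aborted ${aborted} task(s)`); // force-aborted 1 task(s)
});
```

Returning the abort count is a design choice for this sketch; it gives the caller something to log or assert on when deciding whether the restart was clean.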

Fix 2: Auto-Detect and Abort Stalled model_call Tasks

File: packages/gateway/src/stalled-task-detector.ts

// BEFORE (no automatic abort)
const STALLED_THRESHOLD_MS = 120000;
const RECOVERY_ACTIONS = { model_call: "none" }; // <-- recovery=none for model_call

// AFTER
const STALLED_THRESHOLD_MS = 120000;
const FORCIBLE_ABORT_TYPES = ["model_call", "tool_call", "api_request"];
const RECOVERY_ACTIONS = { 
  model_call: "abort",      // Force abort stalled model_call
  tool_call: "abort",        // Force abort stalled tool_call
  api_request: "abort"       // Force abort stalled api_request
};

async handleStalledTask(session: Session): Promise<void> {
  const workKind = session.activeWorkKind;
  const classification = session.classification;
  
  if (classification === "stalled_agent_run" && FORCIBLE_ABORT_TYPES.includes(workKind)) {
    logger.warn(`Auto-aborting stalled ${workKind} in session ${session.id}`);
    await session.abortActiveWork();
  }
}
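
How `session.abortActiveWork()` cancels the underlying call is not shown here. One plausible approach is the standard `AbortController`, sketched below under that assumption. `fakeModelCall` and `callWithStallGuard` are hypothetical names, and the timer stands in for a real upstream request that would receive the same `signal`.

```typescript
// Hypothetical stand-in for an upstream model request that honours an AbortSignal.
// A real implementation would pass `signal` to fetch() or the provider SDK instead.
function fakeModelCall(signal: AbortSignal, durationMs: number): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted before start"));
      return;
    }
    function onAbort(): void {
      clearTimeout(timer);
      reject(new Error("model_call aborted"));
    }
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve("model response");
    }, durationMs);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

// Aborts the call if it has not settled within stallThresholdMs.
async function callWithStallGuard(durationMs: number, stallThresholdMs: number): Promise<string> {
  const controller = new AbortController();
  const watchdog = setTimeout(() => controller.abort(), stallThresholdMs);
  try {
    return await fakeModelCall(controller.signal, durationMs);
  } finally {
    clearTimeout(watchdog); // do not leave the watchdog timer running
  }
}

callWithStallGuard(10, 100).then((result) => console.log(result));           // model response
callWithStallGuard(300, 50).catch((err: Error) => console.log(err.message)); // model_call aborted
```

The important property is that the abort rejects the pending promise and clears its timer, so the parent session is released instead of waiting on work that will never report progress.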

Fix 3: Propagate requesterOrigin.channel to Delivery Routing

File: packages/delivery/src/router.ts

// BEFORE (line 78-85)
function determineDeliveryChannel(
  subagentContext: SubagentContext,
  requesterOrigin?: RequesterOrigin
): string {
  // Bug: uses agent's primary channel, ignores requesterOrigin
  return subagentContext.agentConfig.channels[0];
}

// AFTER
function determineDeliveryChannel(
  subagentContext: SubagentContext,
  requesterOrigin?: RequesterOrigin
): string {
  // Priority 1: Explicit channel from requester origin (for subagent callbacks)
  if (requesterOrigin?.channel) {
    return requesterOrigin.channel;
  }
  
  // Priority 2: Channel from original delivery request
  if (requesterOrigin?.deliveryChannel) {
    return requesterOrigin.deliveryChannel;
  }
  
  // Priority 3: Fallback to agent's primary channel
  return subagentContext.agentConfig.channels[0];
}

File: packages/delivery/src/queue.ts (write path)

// BEFORE (line 112-118)
const deliveryEntry = {
  channel: routingResult.channel,
  to: targetUserId,
  accountId: subagentContext.accountId,
  agentId: subagentContext.agentId
};

// AFTER
const deliveryEntry = {
  channel: routingResult.channel,  // Now correctly uses requesterOrigin.channel
  to: targetUserId,
  accountId: subagentContext.accountId,
  agentId: subagentContext.agentId,
  requesterOrigin: subagentContext.requesterOrigin  // Preserve for debugging
};

Fix 4: Validate Channel-to-Address Format Compatibility

File: packages/delivery/src/validators/channel-validator.ts

// ADD NEW FILE
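const CHANNEL_ADDRESS_PATTERNS: Record<string, RegExp> = {
  // Email-like Feishu address. The negative lookahead excludes WeChat-style
  // "...@im.wechat" IDs, which the bare email pattern would otherwise accept
  // (assumption: no valid Feishu address ends in @im.wechat).
  "feishu": /^(?!.*@im\.wechat$)[a-zA-Z0-9_-]+@[a-zA-Z0-9_.-]+$/,
  "openclaw-weixin": /^o[a-zA-Z0-9_-]+@im\.wechat$/,
  "slack": /^U[a-zA-Z0-9_-]+$/,
  "discord": /^[0-9]{17,19}$/
};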

export function validateChannelAddressMatch(channel: string, address: string): ValidationResult {
  const pattern = CHANNEL_ADDRESS_PATTERNS[channel];
  if (!pattern) {
    return { valid: true, warning: `Unknown channel: ${channel}` };
  }
  
  const valid = pattern.test(address);
  return {
    valid,
    error: valid ? undefined : 
      `Address format mismatch: channel=${channel}, address=${address} does not match pattern ${pattern}`
  };
}
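
One subtlety is worth checking by hand: an email-like pattern on its own also matches WeChat-style addresses ending in `@im.wechat`, so a Feishu check has to exclude that suffix to catch the Bug 2 case. The sketch below illustrates this with a made-up ID. The patterns and the `addressMatchesChannel` helper are illustrative assumptions, not the platforms' documented ID formats.

```typescript
// Hypothetical patterns for illustration; real ID formats should come from each platform's docs.
const EMAIL_LIKE = /^[a-zA-Z0-9_-]+@[a-zA-Z0-9_.-]+$/;
const WECHAT_SUFFIX = /@im\.wechat$/;

const channelChecks: Record<string, (address: string) => boolean> = {
  // Email-like, but explicitly not a WeChat-style address.
  feishu: (a) => EMAIL_LIKE.test(a) && !WECHAT_SUFFIX.test(a),
  "openclaw-weixin": (a) => /^o[a-zA-Z0-9_-]+@im\.wechat$/.test(a),
};

function addressMatchesChannel(channel: string, address: string): boolean {
  const check = channelChecks[channel];
  // Channels without a registered check are allowed through.
  return check ? check(address) : true;
}

const wechatId = "o_example123@im.wechat"; // made-up ID in the WeChat shape

console.log(EMAIL_LIKE.test(wechatId));                          // true
console.log(addressMatchesChannel("feishu", wechatId));          // false
console.log(addressMatchesChannel("openclaw-weixin", wechatId)); // true
```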

Integration in delivery queue write:

// In packages/delivery/src/queue.ts
import { validateChannelAddressMatch } from "./validators/channel-validator";

async function enqueueDelivery(entry: DeliveryEntry): Promise<void> {
  const validation = validateChannelAddressMatch(entry.channel, entry.to);
  
  if (!validation.valid) {
    logger.error(`Channel-address mismatch detected: ${validation.error}`);
    throw new DeliveryValidationError(validation.error);
  }
  
  // Proceed with enqueue...
}

🧪 Verification

Verification Steps for Fix 1: Draining Timeout

# Step 1: Start gateway with test workload
$ openclaw gateway start --port 18789

# Step 2: Simulate stalled model_call
$ curl -X POST http://localhost:18789/internal/test/stall-model-call \
  -d '{"sessionId": "test-stall-001", "durationMs": 300000}'

# Step 3: Trigger restart and observe 60s timeout
$ time openclaw gateway restart --force 2>&1
# Should output: "Draining timeout exceeded (60000ms); forcing task abort"
# Expected duration: ~65-70 seconds

# Step 4: Verify exit code
$ echo $?
0  # Should be 0 on successful forced restart

Expected Log Output:

[DRAINING] Starting graceful drain of 1 task(s), timeout=60000ms
[DRAINING] still draining 1 active task(s)... (55s remaining)
[DRAINING] still draining 1 active task(s)... (50s remaining)
[DRAINING] still draining 1 active task(s)... (45s remaining)
[DRAINING] Draining timeout exceeded (60000ms); forcing task abort
[DRAINING] Task abort completed for session=test-stall-001
[DRAINING] All tasks cleared; proceeding with shutdown
[SHUTDOWN] Shutdown completed cleanly in 62300ms

Verification Steps for Fix 2: Stalled model_call Auto-Abort

# Step 1: Enable stalled task monitoring
$ openclaw config set gateway.stalledTaskDetection.enabled true
$ openclaw config set gateway.stalledTaskDetection.thresholdSeconds 120

# Step 2: Start session with intentional stall
$ openclaw session start --agent main --channel openclaw-weixin \
  --test-prompt "stall test" --mock-model-delay 300000

# Step 3: Observe automatic abort shortly after the 120s threshold
$ tail -f ~/.openclaw/logs/gateway.log | grep -E "(STALLED|ABORT)"
# Should see:
[STALLED SESSION DETECTED] activeWorkKind=model_call
[AUTO-ABORT] Aborting stalled model_call in session=test-session-xxx
[ABORT COMPLETE] model_call terminated after 124s

Expected Behavior:

Stalled session detected after 120 seconds
Automatic abort triggered, not recovery=none
Gateway remains responsive for new requests

Verification Steps for Fix 3: Correct Channel Routing

# Step 1: Clear existing delivery queue
$ rm -f ~/.openclaw/delivery-queue/*.json

# Step 2: Create test subagent from weixin context
$ openclaw test subagent-delivery \
  --channel openclaw-weixin \
  --user-id "[email protected]" \
  --trigger-error

# Step 3: Inspect generated delivery entry
$ cat ~/.openclaw/delivery-queue/*.json | jq .
{
  "channel": "openclaw-weixin",      # ← Should be weixin, not feishu
  "to": "[email protected]",
  "accountId": "2005b227a854-im-bot",
  "agentId": "main",
  "requesterOrigin": {
    "channel": "openclaw-weixin",
    "to": "[email protected]"
  }
}

# Step 4: Verify delivery succeeds (if test mode disabled)
$ openclaw delivery process --once
[DELIVERY] Processing 1 pending delivery
[DELIVERY] Channel=openclaw-weixin, API endpoint=wechat-api.internal
[DELIVERY] Success: notification sent to [email protected]

Verification Steps for Fix 4: Channel-Address Validation

# Step 1: Test mismatched delivery attempt (should fail fast)
$ openclaw test delivery-create \
  --channel feishu \
  --address "[email protected]"

# Expected output:
Error: Address format mismatch: channel=feishu, address=[email protected]
does not match pattern <feishu pattern from CHANNEL_ADDRESS_PATTERNS>
DeliveryValidationError: Validation failed

# Step 2: Test matched delivery (should succeed)
$ openclaw test delivery-create \
  --channel feishu \
  --address "[email protected]"

[DELIVERY] Entry created successfully
[VALIDATION] ✓ Channel-address format validated

⚠️ Common Pitfalls

Environment-Specific Traps

macOS Darwin process management: The kill signal handling differs between macOS and Linux. Ensure signal handlers use SIGTERM for graceful shutdown and SIGKILL only as last resort. On Darwin 25.4.0, the default 5000ms channel stop timeout may be insufficient under heavy model_call load.
Node 24.4.1 async behavior: Promise.allSettled waits for every abort promise to settle, so a single abort handler that never resolves will stall the forced-abort step itself. When aborting multiple stalled tasks, ensure each task's abort handler clears internal state and returns promptly.
Docker container resource limits: If the gateway runs in Docker with memory limits, model_call stalls may trigger the OOM killer before the draining timeout expires. Allocate at least 512 MB of RAM to the gateway container.

Configuration Pitfalls

Channel array ordering: The openclaw.json channels array order matters for fallback routing. Always put the primary channel first, or modify the routing logic to use requesterOrigin as primary (per Fix 3).
Multi-account same accountId: When feishu and openclaw-weixin share the same accountId, channel validation may appear to pass but delivery will fail at the API layer. Always validate channel-to-address patterns.

Delivery queue persistence: Pending deliveries in ~/.openclaw/delivery-queue/ survive gateway restarts. After applying Fix 3, manually clear stale entries with mismatched channels:

# Clear stale feishu deliveries with wechat addresses
for f in ~/.openclaw/delivery-queue/*.json; do
  [ -e "$f" ] || continue                 # skip if the glob matched nothing
  channel=$(jq -r '.channel' "$f")        # one delivery entry per file
  address=$(jq -r '.to' "$f")
  if [[ "$channel" == "feishu" && "$address" == *"@im.wechat" ]]; then
    rm -- "$f"
  fi
done

Runtime Behavior Pitfalls

Aborting model_call mid-stream: If the model_call has an in-flight HTTP request to the Fireworks API, ensure the abort handler cancels the underlying fetch request using AbortController. Otherwise, the request keeps consuming resources even after the abort.
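Draining vs. forceful restart: Distinguish between graceful draining (openclaw gateway restart) and forced restart (openclaw gateway restart --force). The Fix 1 timeout bounds the draining phase. In the incident above, restart --force still reported GatewayDrainingError, so verify that a forced restart either skips draining or is bounded by the same timeout.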
Delivery retry backoff: The 20-retry limit with exponential backoff may cause delivery recovery tasks to appear "active" for extended periods. Consider adding a maximum total retry time (e.g., 5 minutes) regardless of retry count.
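
The suggestion to cap total retry time can be sketched as a schedule calculation. The base delay (1 second), the doubling factor, and the function name `retrySchedule` are assumptions for illustration; the gateway's actual backoff parameters are not given in this write-up.

```typescript
// Returns the delay (ms) before each retry, stopping at maxRetries or when the
// cumulative delay would exceed maxTotalMs, whichever comes first.
function retrySchedule(
  baseDelayMs: number,
  maxRetries: number,
  maxTotalMs: number,
): number[] {
  const delays: number[] = [];
  let total = 0;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const delay = baseDelayMs * 2 ** attempt; // 1x, 2x, 4x, ...
    if (total + delay > maxTotalMs) break;
    delays.push(delay);
    total += delay;
  }
  return delays;
}

// Assumed values: 1s base delay, 20 retries, 5 minute total budget.
const schedule = retrySchedule(1000, 20, 5 * 60 * 1000);
console.log(schedule.length);                     // 8
console.log(schedule.reduce((a, b) => a + b, 0)); // 255000
```

Under these assumed values the budget cuts the schedule to 8 retries spanning 255 seconds, whereas 20 uncapped doublings from a 1 second base would add up to roughly 12 days, which is why a delivery recovery task can look "active" for so long.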

Edge Cases

Nested subagents: If a subagent spawns another subagent, the requesterOrigin must propagate through the entire chain. Verify that subagentContext.requesterOrigin is passed correctly in SubagentContext constructor.
Channel-less delivery: For internal notifications (no channel), ensure routing logic handles channel: null gracefully by falling back to the agent's default channel.
Rapid restart requests: If user triggers restart multiple times while draining, ensure only one draining sequence runs. Add a mutex or state check to prevent duplicate drain attempts.
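
For the rapid-restart case, a small "single-flight" wrapper is one way to ensure that overlapping restart requests share one drain sequence. This is a sketch of the general technique; `singleFlight`, `drainOnce`, and `requestDrain` are hypothetical names, not gateway APIs.

```typescript
// Wraps an async operation so that concurrent callers share one in-flight run.
function singleFlight<T>(operation: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | null = null;
  return () => {
    if (inFlight === null) {
      inFlight = operation().finally(() => {
        inFlight = null; // allow a fresh run once this one settles
      });
    }
    return inFlight;
  };
}

// Hypothetical drain routine that counts how many times it really ran.
let drainRuns = 0;
async function drainOnce(): Promise<string> {
  drainRuns += 1;
  await new Promise<void>((resolve) => setTimeout(resolve, 20));
  return "drained";
}

const requestDrain = singleFlight(drainOnce);

// Three overlapping restart requests resolve from a single drain run.
Promise.all([requestDrain(), requestDrain(), requestDrain()]).then((results) => {
  console.log(results.length, drainRuns); // 3 1
});
```

Resetting the shared promise in `finally` means a later restart, issued after the first drain has settled, starts a new sequence rather than reusing a stale result.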

Related Errors

GatewayDrainingError
Gateway refuses new tasks during draining phase. This is the expected behavior but becomes a problem when draining has no timeout. Related to: indefinite draining loop.
DeliveryValidationError
Channel-address format mismatch detected. New error introduced by Fix 4 to fail fast on misrouted deliveries. Related to: Bug 2 cross-channel routing.
feishu_code: 99992360
Feishu API rejection: "Invalid ids" — user ID format not recognized by Feishu. This error confirms the routing bug (wechat ID sent to feishu).
stalled_agent_run (classification)
Diagnostic classification indicating no progress for 120+ seconds. Recovery action should be "abort" not "none" for model_call work kinds.

ETIMEDOUT (channel stop)
Channel stop exceeded 5000ms after abort signal. Indicates that forceful abort handlers are not completing in time. May indicate model_call or delivery handlers blocking on network I/O.
ENOENT (delivery queue)
Delivery queue file not found during recovery. Can occur if queue files are deleted while processing retries.
HTTP 400 (upstream API)
Generic bad request error from messaging APIs. For Feishu specifically, check error body for feishu_code field to distinguish invalid IDs vs. other validation failures.
Session state: processing with age: 175s
Session stuck in processing state without progress. Should trigger automatic abort per Fix 2.

Historical Context

Issue #1087: Gateway restart hangs on long-running tool_calls
Previous similar issue where tool_call blocked draining. Root cause was missing abort handler registration. Fix involved adding AbortController to all async tool invocations.
Issue #892: Delivery to wrong channel for multi-channel agents
Agent bound to multiple channels delivered notifications to first channel regardless of requester. This is the same root cause as Bug 2, previously partially fixed but regressed.
Issue #2154: model_call timeout not enforced during agent run
Request timeout configured but not propagated to internal model_call invocations. This is why lastProgressAge: 168s exceeded typical timeouts (60-120s) without triggering abort.