Gateway Draining Deadlock Caused by Stalled model_call and Cross-Channel Delivery Routing Failure
Two interrelated bugs affect gateway restart: a stalled model_call task blocks draining indefinitely, and subagent notifications are routed to the wrong messaging channel (feishu instead of weixin), which adds failing delivery retries to the work being drained.
Symptoms
Bug 1: Gateway Draining Deadlock
The gateway becomes unresponsive and cannot complete restart sequence due to indefinite blocking by a stalled model_call task.
CLI Execution Sequence:
# Terminal 1: Observe stalled session diagnostic
$ tail -f ~/.openclaw/logs/gateway.log
[STALLED SESSION DETECTED]
sessionId=bbe85782-7c56-4ea6-bfdb-9ab2e2c5b3ab
state=processing age=175s queueDepth=1
reason=active_work_without_progress
classification=stalled_agent_run
activeWorkKind=model_call
lastProgress=model_call:started
lastProgressAge=168s
recovery=none
# Terminal 2: Trigger restart via gateway tool
$ openclaw gateway restart --force
GatewayDrainingError: Gateway is draining for restart; new tasks are not accepted
# Observe continuous draining loop
$ grep "still draining" ~/.openclaw/logs/gateway.log
still draining 6 active task(s) and 3 active embedded run(s) before restart
still draining 6 active task(s) and 3 active embedded run(s) before restart
... (repeats every 30s for 10+ minutes)
Exit Code Behavior:
- Gateway restart command exits with code `1` due to `GatewayDrainingError`
- Channel stop command times out after 5000ms: `channel stop exceeded 5000ms after abort; continuing shutdown`
- Final shutdown completes in 6310ms with forced continuation
Bug 2: Cross-Channel Delivery Routing Failure
Subagent completion notifications are routed to the incorrect messaging channel, causing API rejection.
CLI Inspection of Pending Deliveries:
$ ls -la ~/.openclaw/delivery-queue/
total 20
-rw-r--r-- 1 user staff 1024 Jun 15 03:21 file_001.json
-rw-r--r-- 1 user staff 1024 Jun 15 03:21 file_002.json
... (20 files)
$ cat ~/.openclaw/delivery-queue/file_001.json
{
  "channel": "feishu",
  "to": "[email protected]",
  "accountId": "2005b227a854-im-bot",
  "agentId": "main"
}
# Compare with correct requesterOrigin
$ cat ~/.openclaw/delivery-queue/file_001.json | jq .requesterOrigin
{
  "channel": "openclaw-weixin",
  "to": "[email protected]",
  "accountId": "2005b227a854-im-bot"
}
API Error Response:
HTTP 400 Bad Request
feishu_code: 99992360
error: Invalid ids: [[email protected]]
# Gateway recovery attempts
[RECOVERY] Attempting delivery retry 1/20
[RECOVERY] Attempting delivery retry 2/20
...
[RECOVERY] Attempting delivery retry 20/20
[RECOVERY] All retries exhausted; delivery marked as failed
Error Classification:
| Field | Expected Value | Actual Value |
|---|---|---|
| channel | openclaw-weixin | feishu |
| to | WeChat user ID (correct) | WeChat user ID (correct) |
| API Endpoint Called | WeChat API | Feishu API |
Root Cause
Bug 1: Draining Mechanism Lacks Hard Timeout
Architectural Failure Sequence:
- Stalled model_call detection: The diagnostic system correctly identifies the stalled session with `activeWorkKind=model_call` and `classification=stalled_agent_run`, but assigns `recovery=none`.
- Draining entry: When restart is requested, Gateway enters draining state and attempts graceful task completion before shutdown.
- Infinite blocking: The draining mechanism has no hard timeout for individual tasks. It repeatedly logs "still draining N active task(s)" every 30 seconds without taking forceful action.
- Resource starvation: The stalled `model_call` (168+ seconds without progress) blocks its parent session, preventing cleanup.
- Delivery recovery interference: As discovered in Bug 2, 20 pending delivery recovery tasks may also contribute to the "6 active task(s) and 3 active embedded run(s)" count.
- Forced shutdown cascade: Channel stop attempts abort after 5000ms, but the underlying model_call is never explicitly terminated.
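The unbounded wait described in the sequence above can be contrasted with a bounded one. The following is a minimal sketch, not the gateway's code: `withDeadline` and `TaskOutcome` are hypothetical names, and the sketch assumes each task exposes a promise that settles when the task finishes. It races that promise against a timer so the caller always regains control.

```typescript
// Sketch only: `withDeadline` and `TaskOutcome` are hypothetical, not gateway APIs.
type TaskOutcome<T> =
  | { status: "completed"; value: T }
  | { status: "timed_out" };

// Resolve with the task's value, or with "timed_out" once `timeoutMs` elapses.
// The underlying task is NOT cancelled here; the caller still has to abort it.
function withDeadline<T>(task: Promise<T>, timeoutMs: number): Promise<TaskOutcome<T>> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<TaskOutcome<T>>((resolve) => {
    timer = setTimeout(() => resolve({ status: "timed_out" }), timeoutMs);
  });
  const completed = task.then(
    (value): TaskOutcome<T> => ({ status: "completed", value })
  );
  // Whichever settles first wins; the timer is cleared so it cannot keep the process alive.
  return Promise.race([completed, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Stand-in for a model_call whose upstream API never answers.
const stalled = new Promise<string>(() => {});
withDeadline(stalled, 50).then((outcome) => console.log(outcome.status));
```

A timed-out outcome only tells the drain loop to stop waiting. The stalled work keeps running until something explicitly aborts it, which is the gap Fix 1 and Fix 2 address.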
Code Path Analysis:
// Simplified call sequence showing the deadlock
Gateway.restart()
  → DrainingState.enter()
  → SessionManager.getActiveTasks()   // returns 6 tasks including stalled model_call
  → for each task: Task.waitForCompletion()   // BLOCKS INDEFINITELY
    → model_call never returns (upstream API stalled)
  → DrainingState.checkComplete()
    → if incomplete: re-loop with 30s delay   // NO TIMEOUT CHECK
  → repeat indefinitely
Bug 2: Channel Routing Inherited from Agent Context, Not requesterOrigin
Root Cause Chain:
- Multi-channel agent binding: The `main` agent in `openclaw.json` is bound to both `openclaw-weixin` and `feishu` channels with the same `accountId: 2005b227a854-im-bot`.
- Subagent spawn context: When a subagent is spawned by a cron run from the WeChat channel, the requester origin correctly captures `{ channel: "openclaw-weixin", ... }`.
- Channel inheritance bug: The delivery notification system uses the agent's primary channel binding (derived from agent context at spawn time) rather than propagating the `requesterOrigin.channel` field.
- Channel mismatch: The delivery queue writes an entry with `channel: "feishu"` and a WeChat-format `to` address (ending in `@im.wechat`), so the Feishu API rejects the user ID as invalid.
- Retry storm: Delivery recovery retries each failed delivery up to 20 times with exponential backoff, contributing to the active task count during draining.
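To get a feel for how long such a retry storm can keep a delivery task alive, the cumulative wait can be estimated. This is a rough sketch under loudly stated assumptions: the source gives only the retry count (20) and says the backoff is exponential, so the base delay, growth factor, and cap below are hypothetical values chosen for illustration.

```typescript
// Hypothetical policy: only maxRetries = 20 comes from the source; the rest is assumed.
interface BackoffPolicy {
  baseDelayMs: number;
  factor: number;
  maxDelayMs: number;
  maxRetries: number;
}

// Delay before retry `attempt` (1-based), capped at maxDelayMs.
function delayForAttempt(policy: BackoffPolicy, attempt: number): number {
  const raw = policy.baseDelayMs * Math.pow(policy.factor, attempt - 1);
  return Math.min(raw, policy.maxDelayMs);
}

// Total time spent waiting between attempts if every retry fails.
function totalBackoffMs(policy: BackoffPolicy): number {
  let total = 0;
  for (let attempt = 1; attempt <= policy.maxRetries; attempt++) {
    total += delayForAttempt(policy, attempt);
  }
  return total;
}

const assumed: BackoffPolicy = { baseDelayMs: 1000, factor: 2, maxDelayMs: 60000, maxRetries: 20 };
console.log(totalBackoffMs(assumed)); // 903000 (about 15 minutes under these assumed values)
```

Under these assumed numbers a single misrouted delivery stays in the retry loop for roughly a quarter of an hour, which is why a cap on total retry time matters more than the retry count alone.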
Data Flow Diagram:
WeChat User → Cron Run (channel=weixin)
  → Subagent Spawn
    → requesterOrigin correctly set: { channel: "openclaw-weixin", to: "wechat_id" }
    → BUT delivery.channel inherited from: agent.config.channels[0] → "feishu"
  → Subagent Fails
  → Delivery Queue Entry Created
    → channel: "feishu" ← BUG: should be "openclaw-weixin"
    → to: "[email protected]" ← CORRECT
  → Feishu API Called → FAILURE: invalid user ID format
Configuration Context (from openclaw.json):
{
  "agents": {
    "main": {
      "accountId": "2005b227a854-im-bot",
      "channels": ["feishu", "openclaw-weixin"] // feishu is primary (index 0)
    }
  }
}
Step-by-Step Fix
Fix 1: Add Hard Timeout to Gateway Draining
File: packages/gateway/src/draining.ts
// BEFORE (line 45-52)
async drain(activeTasks: Task[], timeoutMs: number = 30000): Promise<void> {
  const checkInterval = 5000;
  // timeoutMs is accepted but never checked, so this loop can run forever
  while (activeTasks.some(t => !t.isComplete)) {
    await sleep(checkInterval);
    logger.info(`still draining ${activeTasks.length} active task(s)...`);
  }
}
// AFTER
async drain(activeTasks: Task[], timeoutMs: number = 60000): Promise<void> {
  const checkInterval = 5000;
  const deadline = Date.now() + timeoutMs;
  while (activeTasks.some(t => !t.isComplete)) {
    if (Date.now() >= deadline) {
      logger.warn(`Draining timeout exceeded (${timeoutMs}ms); forcing task abort`);
      // allSettled: one failing abort must not stop the remaining aborts
      await Promise.allSettled(
        activeTasks.filter(t => !t.isComplete).map(t => t.abort())
      );
      break;
    }
    await sleep(checkInterval);
    const remaining = Math.max(0, deadline - Date.now());
    // Report only the tasks that are still running, not the original total
    const pending = activeTasks.filter(t => !t.isComplete).length;
    logger.info(`still draining ${pending} active task(s)... (${Math.ceil(remaining/1000)}s remaining)`);
  }
}
Fix 2: Auto-Detect and Abort Stalled model_call Tasks
File: packages/gateway/src/stalled-task-detector.ts
// BEFORE (no automatic abort)
const STALLED_THRESHOLD_MS = 120000;
const RECOVERY_ACTIONS = { model_call: "none" }; // <-- recovery=none for model_call
// AFTER
const STALLED_THRESHOLD_MS = 120000;
const FORCIBLE_ABORT_TYPES = ["model_call", "tool_call", "api_request"];
const RECOVERY_ACTIONS = {
  model_call: "abort",   // Force abort stalled model_call
  tool_call: "abort",    // Force abort stalled tool_call
  api_request: "abort"   // Force abort stalled api_request
};
async handleStalledTask(session: Session): Promise<void> {
  const workKind = session.activeWorkKind;
  const classification = session.classification;
  // Only abort work kinds that are safe to terminate forcibly
  if (classification === "stalled_agent_run" && FORCIBLE_ABORT_TYPES.includes(workKind)) {
    logger.warn(`Auto-aborting stalled ${workKind} in session ${session.id}`);
    await session.abortActiveWork();
  }
}
Fix 3: Propagate requesterOrigin.channel to Delivery Routing
File: packages/delivery/src/router.ts
// BEFORE (line 78-85)
function determineDeliveryChannel(
  subagentContext: SubagentContext,
  requesterOrigin?: RequesterOrigin
): string {
  // Bug: uses agent's primary channel, ignores requesterOrigin
  return subagentContext.agentConfig.channels[0];
}
// AFTER
function determineDeliveryChannel(
  subagentContext: SubagentContext,
  requesterOrigin?: RequesterOrigin
): string {
  // Priority 1: Explicit channel from requester origin (for subagent callbacks)
  if (requesterOrigin?.channel) {
    return requesterOrigin.channel;
  }
  // Priority 2: Channel from original delivery request
  if (requesterOrigin?.deliveryChannel) {
    return requesterOrigin.deliveryChannel;
  }
  // Priority 3: Fallback to agent's primary channel
  return subagentContext.agentConfig.channels[0];
}
File: packages/delivery/src/queue.ts (write path)
// BEFORE (line 112-118)
const deliveryEntry = {
  channel: routingResult.channel,
  to: targetUserId,
  accountId: subagentContext.accountId,
  agentId: subagentContext.agentId
};
// AFTER
const deliveryEntry = {
  channel: routingResult.channel, // Now correctly uses requesterOrigin.channel
  to: targetUserId,
  accountId: subagentContext.accountId,
  agentId: subagentContext.agentId,
  requesterOrigin: subagentContext.requesterOrigin // Preserve for debugging
};
Fix 4: Validate Channel-to-Address Format Compatibility
File: packages/delivery/src/validators/channel-validator.ts
// ADD NEW FILE
const CHANNEL_ADDRESS_PATTERNS: Record<string, RegExp> = {
  // NOTE: as written, the feishu pattern also accepts WeChat-style addresses
  // ending in "@im.wechat"; tighten it if this check must reject them.
  "feishu": /^[a-zA-Z0-9_-]+@[a-zA-Z0-9_.-]+$/,
  "openclaw-weixin": /^o[a-zA-Z0-9_-]+@im\.wechat$/,
  "slack": /^U[a-zA-Z0-9_-]+$/,
  "discord": /^[0-9]{17,19}$/
};
export function validateChannelAddressMatch(channel: string, address: string): ValidationResult {
  const pattern = CHANNEL_ADDRESS_PATTERNS[channel];
  if (!pattern) {
    // Unknown channels pass with a warning rather than blocking delivery
    return { valid: true, warning: `Unknown channel: ${channel}` };
  }
  const valid = pattern.test(address);
  return {
    valid,
    error: valid ? undefined :
      `Address format mismatch: channel=${channel}, address=${address} does not match pattern ${pattern}`
  };
}
Integration in delivery queue write:
// In packages/delivery/src/queue.ts
import { validateChannelAddressMatch } from "./validators/channel-validator";
async function enqueueDelivery(entry: DeliveryEntry): Promise<void> {
  // Fail fast before the entry is written to the queue
  const validation = validateChannelAddressMatch(entry.channel, entry.to);
  if (!validation.valid) {
    logger.error(`Channel-address mismatch detected: ${validation.error}`);
    throw new DeliveryValidationError(validation.error);
  }
  // Proceed with enqueue...
}
Verification
Verification Steps for Fix 1: Draining Timeout
# Step 1: Start gateway with test workload
$ openclaw gateway start --port 18789
# Step 2: Simulate stalled model_call
$ curl -X POST http://localhost:18789/internal/test/stall-model-call \
-d '{"sessionId": "test-stall-001", "durationMs": 300000}'
# Step 3: Trigger restart and observe 60s timeout
$ time openclaw gateway restart --force 2>&1
# Should output: "Draining timeout exceeded (60000ms); forcing task abort"
# Expected duration: ~65-70 seconds
# Step 4: Verify exit code
$ echo $?
0 # Should be 0 on successful forced restart
Expected Log Output:
[DRAINING] Starting graceful drain of 1 task(s), timeout=60000ms
[DRAINING] still draining 1 active task(s)... (55s remaining)
[DRAINING] still draining 1 active task(s)... (50s remaining)
[DRAINING] still draining 1 active task(s)... (45s remaining)
[DRAINING] Draining timeout exceeded (60000ms); forcing task abort
[DRAINING] Task abort completed for session=test-stall-001
[DRAINING] All tasks cleared; proceeding with shutdown
[SHUTDOWN] Shutdown completed cleanly in 62300ms
Verification Steps for Fix 2: Stalled model_call Auto-Abort
# Step 1: Enable stalled task monitoring
$ openclaw config set gateway.stalledTaskDetection.enabled true
$ openclaw config set gateway.stalledTaskDetection.thresholdSeconds 120
# Step 2: Start session with intentional stall
$ openclaw session start --agent main --channel weixin \
--test-prompt "stall test" --mock-model-delay 300000
# Step 3: Observe automatic abort within 120s
$ tail -f ~/.openclaw/logs/gateway.log | grep -E "(STALLED|ABORT)"
# Should see:
[STALLED SESSION DETECTED] activeWorkKind=model_call
[AUTO-ABORT] Aborting stalled model_call in session=test-session-xxx
[ABORT COMPLETE] model_call terminated after 124s
Expected Behavior:
- Stalled session detected after 120 seconds
- Automatic abort triggered, not `recovery=none`
- Gateway remains responsive for new requests
Verification Steps for Fix 3: Correct Channel Routing
# Step 1: Clear existing delivery queue
$ rm -f ~/.openclaw/delivery-queue/*.json
# Step 2: Create test subagent from weixin context
$ openclaw test subagent-delivery \
--channel openclaw-weixin \
--user-id "[email protected]" \
--trigger-error
# Step 3: Inspect generated delivery entry
$ cat ~/.openclaw/delivery-queue/*.json | jq .
{
  "channel": "openclaw-weixin", # ← Should be weixin, not feishu
  "to": "[email protected]",
  "accountId": "2005b227a854-im-bot",
  "agentId": "main",
  "requesterOrigin": {
    "channel": "openclaw-weixin",
    "to": "[email protected]"
  }
}
# Step 4: Verify delivery succeeds (if test mode disabled)
$ openclaw delivery process --once
[DELIVERY] Processing 1 pending delivery
[DELIVERY] Channel=openclaw-weixin, API endpoint=wechat-api.internal
[DELIVERY] Success: notification sent to [email protected]
Verification Steps for Fix 4: Channel-Address Validation
# Step 1: Test mismatched delivery attempt (should fail fast)
$ openclaw test delivery-create \
--channel feishu \
--address "[email protected]"
# Expected output:
Error: Address format mismatch: channel=feishu, [email protected]
does not match pattern /^[a-zA-Z0-9_-]+@[a-zA-Z0-9_.-]+$/
DeliveryValidationError: Validation failed
# Step 2: Test matched delivery (should succeed)
$ openclaw test delivery-create \
--channel feishu \
--address "[email protected]"
[DELIVERY] Entry created successfully
[VALIDATION] ✓ Channel-address format validated
Common Pitfalls
Environment-Specific Traps
- macOS Darwin process management: Signal handling differs between macOS and Linux. Send `SIGTERM` for graceful shutdown and use `SIGKILL`, which a process cannot catch or handle, only as a last resort. On Darwin 25.4.0, the default 5000ms channel stop timeout may be insufficient under heavy model_call load.
- Node 24.4.1 async behavior: `Promise.allSettled` waits for every promise to settle, so a single abort handler that never resolves stalls the whole forced abort. When aborting multiple stalled tasks, ensure each task's abort handler clears internal state and returns promptly.
- Docker container resource limits: If running the gateway in Docker with memory limits, model_call stalls may trigger the OOM watchdog before the draining timeout. Allocate at least 512MB RAM to the gateway container.
Configuration Pitfalls
- Channel array ordering: The `openclaw.json` channels array order matters for fallback routing. Always put the primary channel first, or modify the routing logic to use `requesterOrigin` as primary (per Fix 3).
- Multi-account same accountId: When `feishu` and `openclaw-weixin` share the same `accountId`, channel validation may appear to pass but delivery will fail at the API layer. Always validate channel-to-address patterns.
- Delivery queue persistence: Pending deliveries in `~/.openclaw/delivery-queue/` survive gateway restarts. After applying Fix 3, manually clear stale entries with mismatched channels:
# Clear stale feishu deliveries with wechat addresses
# (assumes one JSON object per queue file, as in the listing under Symptoms)
for f in ~/.openclaw/delivery-queue/*.json; do
  channel=$(jq -r '.channel' "$f")
  address=$(jq -r '.to' "$f")
  if [[ "$channel" == "feishu" && "$address" == *"@im.wechat"* ]]; then
    rm -- "$f"
  fi
done
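The advice above to validate channel-to-address patterns only helps if one channel's pattern rejects another channel's addresses. The sketch below copies the patterns from the Fix 4 listing and reports which channels accept a given address. The sample address is a made-up placeholder (real IDs are redacted in the source), and `channelsAccepting` and `isFeishuAddress` are hypothetical helpers, with the latter showing one possible tightening rather than the project's actual fix.

```typescript
// Patterns copied from the Fix 4 listing in this document.
const CHANNEL_ADDRESS_PATTERNS: Record<string, RegExp> = {
  "feishu": /^[a-zA-Z0-9_-]+@[a-zA-Z0-9_.-]+$/,
  "openclaw-weixin": /^o[a-zA-Z0-9_-]+@im\.wechat$/,
  "slack": /^U[a-zA-Z0-9_-]+$/,
  "discord": /^[0-9]{17,19}$/,
};

// Return every channel whose pattern accepts the address.
function channelsAccepting(address: string): string[] {
  return Object.keys(CHANNEL_ADDRESS_PATTERNS).filter((channel) =>
    CHANNEL_ADDRESS_PATTERNS[channel].test(address)
  );
}

// One possible tightening (an assumption): also reject the WeChat domain on feishu.
function isFeishuAddress(address: string): boolean {
  return CHANNEL_ADDRESS_PATTERNS["feishu"].test(address) && !address.endsWith("@im.wechat");
}

// Hypothetical placeholder ID in WeChat format.
const sampleWeixinAddress = "o_example123@im.wechat";

console.log(channelsAccepting(sampleWeixinAddress)); // [ 'feishu', 'openclaw-weixin' ]
console.log(isFeishuAddress(sampleWeixinAddress));   // false
```

Because the generic feishu pattern matches any `name@domain` string, a WeChat-format address passes it. A check of this kind is worth running against every pair of channels before relying on the validator to catch misrouted deliveries.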
Runtime Behavior Pitfalls
- Aborting model_call mid-stream: If the model_call has an in-flight HTTP request to the Fireworks API, ensure the abort handler cancels the underlying `fetch` request using `AbortController`. Otherwise, the request continues consuming resources even after abort.
- Draining vs. forceful restart: Distinguish between graceful draining (`openclaw gateway restart`) and forced restart (`openclaw gateway restart --force`). The timeout applies only to graceful draining.
- Delivery retry backoff: The 20-retry limit with exponential backoff may cause delivery recovery tasks to appear "active" for extended periods. Consider adding a maximum total retry time (e.g., 5 minutes) regardless of retry count.
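For the first pitfall above, the essential point is that the abort handler and the request must share one `AbortSignal`. The following is a minimal sketch, assuming the model call can be expressed as a function that takes a signal. `startAbortableCall` and `stalledTransport` are hypothetical names, and the stalled transport stands in for a real `fetch(url, { signal })` call so the example runs without a network.

```typescript
// Hypothetical wrapper, not an OpenClaw API. `transport` must pass the signal to the
// real request, e.g. (signal) => fetch(url, { signal }).then((res) => res.text()).
interface AbortableCall<T> {
  result: Promise<T>;
  abort: () => void;
}

function startAbortableCall<T>(
  transport: (signal: AbortSignal) => Promise<T>
): AbortableCall<T> {
  const controller = new AbortController();
  return {
    result: transport(controller.signal),
    abort: () => controller.abort(), // cancels the request that holds this signal
  };
}

// Stand-in for a stalled upstream request: never resolves, rejects once aborted.
function stalledTransport(signal: AbortSignal): Promise<string> {
  return new Promise<string>((_resolve, reject) => {
    signal.addEventListener("abort", () => reject(new Error("aborted")), { once: true });
  });
}

const call = startAbortableCall(stalledTransport);
call.result.catch((err: Error) => console.log(`model_call ended: ${err.message}`));
call.abort();
```

With a real `fetch`, aborting the controller rejects the pending promise and releases the connection, so the session's abort path does not need to wait for the upstream API to answer.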
Edge Cases
- Nested subagents: If a subagent spawns another subagent, the `requesterOrigin` must propagate through the entire chain. Verify that `subagentContext.requesterOrigin` is passed correctly in the `SubagentContext` constructor.
- Channel-less delivery: For internal notifications (no channel), ensure routing logic handles `channel: null` gracefully by falling back to the agent's default channel.
- Rapid restart requests: If the user triggers restart multiple times while draining, ensure only one draining sequence runs. Add a mutex or state check to prevent duplicate drain attempts.
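The state check suggested for rapid restart requests can be sketched as a single-flight guard: while a drain is pending, later callers receive the same promise instead of starting a second drain. This is an illustrative sketch only. `SingleFlight` and `drainOnce` are hypothetical names, and the real drain routine would take the place of the counter.

```typescript
// Hypothetical helper: overlapping restart requests share one in-flight drain.
class SingleFlight<T> {
  private inFlight: Promise<T> | null = null;

  run(work: () => Promise<T>): Promise<T> {
    if (this.inFlight) {
      return this.inFlight; // a drain is already running; reuse it
    }
    const current = work().finally(() => {
      this.inFlight = null; // allow a later restart to drain again
    });
    this.inFlight = current;
    return current;
  }
}

let drainRuns = 0;
const drainGuard = new SingleFlight<void>();
// Stand-in for the real drain routine; it only counts how often it was started.
const drainOnce = async (): Promise<void> => {
  drainRuns += 1;
};

const first = drainGuard.run(drainOnce);
const second = drainGuard.run(drainOnce); // issued while the first is still pending
console.log(first === second, drainRuns); // true 1
```

Sharing the promise, rather than rejecting the second request, lets every caller observe the outcome of the one drain that actually ran.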
Related Errors
Directly Related Errors
- `GatewayDrainingError`: Gateway refuses new tasks during the draining phase. This is the expected behavior but becomes a problem when draining has no timeout. Related to: indefinite draining loop.
- `DeliveryValidationError`: Channel-address format mismatch detected. New error introduced by Fix 4 to fail fast on misrouted deliveries. Related to: Bug 2 cross-channel routing.
- `feishu_code: 99992360`: Feishu API rejection with "Invalid ids": the user ID format is not recognized by Feishu. This error confirms the routing bug (WeChat ID sent to Feishu).
- `stalled_agent_run` (classification): Diagnostic classification indicating no progress for 120+ seconds. Recovery action should be "abort", not "none", for `model_call` work kinds.
Indirectly Related Errors
- `ETIMEDOUT` (channel stop): Channel stop exceeded 5000ms after abort signal. Indicates that forceful abort handlers are not completing in time. May indicate model_call or delivery handlers blocking on network I/O.
- `ENOTFOUND` (delivery queue): Delivery queue file not found during recovery. Can occur if queue files are deleted while processing retries.
- `HTTP 400` (upstream API): Generic bad request error from messaging APIs. For Feishu specifically, check the error body for the `feishu_code` field to distinguish invalid IDs from other validation failures.
- Session state `processing` with `age: 175s`: Session stuck in processing state without progress. Should trigger automatic abort per Fix 2.
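The rule in the last item (a stuck `processing` session should be aborted rather than left with `recovery=none`) can be written as a small pure function. This is a sketch under assumptions: the field names mirror the diagnostic log shown under Symptoms, but the `SessionDiagnostic` type, the `chooseRecovery` function, and the millisecond unit are hypothetical.

```typescript
// Field names follow the diagnostic log; the type itself is hypothetical.
interface SessionDiagnostic {
  state: string;
  classification: string;
  activeWorkKind: string;
  lastProgressAgeMs: number;
}

type Recovery = "abort" | "none";

// Work kinds Fix 2 treats as safe to abort forcibly.
const ABORTABLE_WORK_KINDS = ["model_call", "tool_call", "api_request"];

// Abort only when the session is processing, classified as stalled, past the
// threshold, and doing a kind of work that can be terminated.
function chooseRecovery(diag: SessionDiagnostic, stalledThresholdMs: number): Recovery {
  const stalled =
    diag.state === "processing" &&
    diag.classification === "stalled_agent_run" &&
    diag.lastProgressAgeMs >= stalledThresholdMs;
  const abortable = ABORTABLE_WORK_KINDS.indexOf(diag.activeWorkKind) !== -1;
  return stalled && abortable ? "abort" : "none";
}

// Values taken from the stalled-session log in the Symptoms section.
const observed: SessionDiagnostic = {
  state: "processing",
  classification: "stalled_agent_run",
  activeWorkKind: "model_call",
  lastProgressAgeMs: 168000,
};
console.log(chooseRecovery(observed, 120000)); // "abort"
```

Keeping the decision separate from the abort call makes the threshold and the list of abortable work kinds easy to test without a running gateway.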
Historical Context
- Issue #1087: Gateway restart hangs on long-running tool_calls. Previous similar issue where `tool_call` blocked draining. Root cause was missing abort handler registration. Fix involved adding `AbortController` to all async tool invocations.
- Issue #892: Delivery to wrong channel for multi-channel agents. Agent bound to multiple channels delivered notifications to first channel regardless of requester. This is the same root cause as Bug 2, previously partially fixed but regressed.
- Issue #2154: model_call timeout not enforced during agent run. Request timeout configured but not propagated to internal model_call invocations. This is why `lastProgressAge: 168s` exceeded typical timeouts (60-120s) without triggering abort.