ReplyRunAlreadyActiveError Fires on 2026.5.4 for Discrete Sequential chat.send via WebSocket
Sequential, discrete chat.send calls through the gateway WebSocket path trigger ReplyRunAlreadyActiveError on 50% of calls despite the #77485 fix in 2026.5.4. This points to a coverage gap: the active-run guard cleanup covers the agent-runner's queued follow-up path but not the WS dispatcher path.
Symptoms
Primary Manifestation
ReplyRunAlreadyActiveError reproduces deterministically on 2026.5.4 when sequential chat.send requests are sent through the gateway WebSocket path. Calls alternate between failing and passing, for a 50% failure rate.
CLI Reproduction Sequence
Execute the following probe against a gateway running 2026.5.4 in embedded mode:
for i in 1 2 3 4 5 6 7 8 9 10; do
  # %3N (millisecond precision) requires GNU date
  START=$(date +%s%3N)
  RESPONSE=$(curl -s -w "\nHTTP_CODE:%{http_code}" -X POST http://127.0.0.1:18789/chat.send \
    -H "Content-Type: application/json" \
    -d "{\"sessionKey\":\"agent:test:main\",\"message\":\"Reply containing literal: ok-$i-$(date +%s)\"}")
  END=$(date +%s%3N)
  ELAPSED=$((END - START))
  echo "call $i: ${ELAPSED}ms"
  echo "$RESPONSE"
  sleep 1
done
Observed Output Pattern
| Call # | Elapsed | Status | Behavior |
|---|---|---|---|
| 1 | 317ms | FAIL | Empty/canned reply returned |
| 2 | 1689ms | PASS | Real LLM reply |
| 3 | 302ms | FAIL | Empty/canned reply returned |
| 4 | 1876ms | PASS | Real LLM reply |
| 5 | 299ms | FAIL | Empty/canned reply returned |
| 6 | 1592ms | PASS | Real LLM reply |
| 7 | 303ms | FAIL | Empty/canned reply returned |
| 8 | 1778ms | PASS | Real LLM reply |
| 9 | 315ms | FAIL | Empty/canned reply returned |
| 10 | 1604ms | PASS | Real LLM reply |
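The pattern in the table can be checked mechanically. The following is a minimal sketch that hard-codes the ten observed rows and computes the failure rate, whether the statuses strictly alternate, and the latency gap between the slowest failure and the fastest success. The function names are illustrative and not part of any OpenClaw tooling.

```python
# Observed probe results copied from the table: (call number, elapsed ms, status).
OBSERVED = [
    (1, 317, "FAIL"), (2, 1689, "PASS"), (3, 302, "FAIL"), (4, 1876, "PASS"),
    (5, 299, "FAIL"), (6, 1592, "PASS"), (7, 303, "FAIL"), (8, 1778, "PASS"),
    (9, 315, "FAIL"), (10, 1604, "PASS"),
]


def failure_rate(rows):
    """Fraction of calls whose status is FAIL."""
    return sum(1 for _, _, status in rows if status == "FAIL") / len(rows)


def strictly_alternating(rows):
    """True when no two consecutive calls share the same status."""
    statuses = [status for _, _, status in rows]
    return all(a != b for a, b in zip(statuses, statuses[1:]))


def latency_gap_ms(rows):
    """Return (slowest failing call, fastest passing call) in milliseconds."""
    slowest_fail = max(ms for _, ms, status in rows if status == "FAIL")
    fastest_pass = min(ms for _, ms, status in rows if status == "PASS")
    return slowest_fail, fastest_pass


if __name__ == "__main__":
    print("failure rate:", failure_rate(OBSERVED))
    print("strictly alternating:", strictly_alternating(OBSERVED))
    print("slowest fail / fastest pass (ms):", latency_gap_ms(OBSERVED))
```

The wide gap between the two latency groups is what makes elapsed time a usable proxy for pass/fail in this probe.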
Gateway Error Log Evidence
The gateway error log (pm2 logs openclaw-gateway) shows 16 occurrences of:
followup queue drain failed for agent:test:main: ReplyRunAlreadyActiveError: Reply run already active for agent:test:main
Technical Characteristics of Failures
- Fast-fail timing: Failed calls return in ~300ms, which is below typical provider RTT. The error is thrown before any LLM dispatch occurs.
- 1-second gap is insufficient: Despite the 1s pause between calls (well past the prior call's wall-clock completion), the guard remains active.
- Canned fallback returned: Failed calls return the agent-runner's fallback message ("I had a brief hiccup processing that. Could you try again?") rather than a legitimate LLM response.
- Binary verification passes: The installed binary is definitively 2026.5.4:
  - dist/run-state-Bg5KVIP6.js sha256:3cdea3a69fe7be00ccf0a77279c51fbe9e977cfc13868063f09259f6305538dd
  - dist/agent-runner.runtime-BwDd4yvB.js (updated from 5.3)
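To repeat the binary check on another machine, the digest of the installed file can be compared with the value reported here. This is a minimal sketch using Python's standard `hashlib`. The helper names are illustrative, and the path passed in is an assumption: it must be resolved relative to wherever the package is installed, which this report does not state.

```python
import hashlib
from pathlib import Path

# Digest reported in this document for dist/run-state-Bg5KVIP6.js on 2026.5.4.
EXPECTED_SHA256 = "3cdea3a69fe7be00ccf0a77279c51fbe9e977cfc13868063f09259f6305538dd"


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True when the file's digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_expected("dist/run-state-Bg5KVIP6.js")`, run from the package's install root, returns `True` only when the file is byte-identical to the build examined here.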
Baseline Comparison
Against 2026.4.26 (last known good), the same 10-call probe produces:
- All 10 calls succeed with real replies
- Warm latency: 1.2–1.7s per call
- Zero ReplyRunAlreadyActiveError events in gateway log
Root Cause
Architectural Overview
The OpenClaw gateway maintains an activeRunsByKey guard (a Map or Set keyed by sessionKey) to prevent concurrent reply runs for the same session. The guard is checked at request entry and cleared on run completion.
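The check-at-entry, clear-on-completion behavior can be illustrated with a toy model. This is a sketch in Python and not the gateway's actual code: the class `ActiveRunGuard` and its methods `begin`, `end`, and `run` are hypothetical names, and only the error name and message format are taken from the gateway log.

```python
class ReplyRunAlreadyActiveError(RuntimeError):
    """Stand-in for the gateway error of the same name."""


class ActiveRunGuard:
    """Toy per-session guard (hypothetical; models activeRunsByKey as a set)."""

    def __init__(self):
        self._active = set()  # session keys with a reply run in flight

    def begin(self, session_key):
        """Check at request entry: reject a second run for the same key."""
        if session_key in self._active:
            raise ReplyRunAlreadyActiveError(
                f"Reply run already active for {session_key}"
            )
        self._active.add(session_key)

    def end(self, session_key):
        """Clear on run completion; safe to call if the key is absent."""
        self._active.discard(session_key)

    def run(self, session_key, handler):
        """Run handler under the guard, clearing it on every exit path."""
        self.begin(session_key)
        try:
            return handler()
        finally:
            self.end(session_key)
```

In this model, two sequential `run` calls for the same key both succeed because `finally` always clears the entry. Any exit path that calls `begin` without a matching `end` leaves the key set, and the next call for that session fails immediately, before any provider dispatch.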
The Regression Introduction (2026.5.3 → 2026.5.4)
The fix for #77485 (commit a9817a5, shipped in 2026.5.4) addressed the queued auto-follow-up path. The release notes state:
“clear the active reply-run guard before draining queued same-session follow-up turns, so sequential chat.send calls no longer trip ReplyRunAlreadyActiveError”
However, this fix introduced or exposed a coverage gap for the discrete sequential chat.send path through the gateway WebSocket dispatcher.
Two Distinct Paths with Shared Guard
The activeRunsByKey guard is shared between two code paths: