May 07, 2026 • Version: 2026.5.4

ReplyRunAlreadyActiveError Fires on 2026.5.4 for Discrete Sequential chat.send via WebSocket

Sequential discrete chat.send calls through the gateway WebSocket path trigger ReplyRunAlreadyActiveError at a 50% rate despite the #77485 fix in 2026.5.4. This indicates a coverage gap in the active-run guard cleanup: the fix addressed the agent-runner's queued follow-up path, but not the WS dispatcher path.

🔍 Symptoms

Primary Manifestation

The ReplyRunAlreadyActiveError reproduces deterministically on 2026.5.4 when sequential chat.send requests are sent through the gateway WebSocket path. Calls alternate between pass and fail, giving a 50% failure rate.

CLI Reproduction Sequence

Execute the following probe against a gateway running 2026.5.4 in embedded mode:

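# Send 10 sequential chat.send requests for the same session, 1s apart.
for i in 1 2 3 4 5 6 7 8 9 10; do
  START=$(date +%s%3N)  # epoch milliseconds; %3N requires GNU date
  # Append the HTTP status on its own line after the response body.
  RESPONSE=$(curl -s -w "\nHTTP_CODE:%{http_code}" -X POST http://127.0.0.1:18789/chat.send \
    -H "Content-Type: application/json" \
    -d "{\"sessionKey\":\"agent:test:main\",\"message\":\"Reply containing literal: ok-$i-$(date +%s)\"}")
  END=$(date +%s%3N)
  ELAPSED=$((END - START))
  echo "call $i: ${ELAPSED}ms"
  echo "$RESPONSE"
  sleep 1  # gap between calls, so each request starts after the previous one returned
done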

Observed Output Pattern

Call #	Elapsed	Status	Behavior
1	317ms	FAIL	Empty/canned reply returned
2	1689ms	PASS	Real LLM reply
3	302ms	FAIL	Empty/canned reply returned
4	1876ms	PASS	Real LLM reply
5	299ms	FAIL	Empty/canned reply returned
6	1592ms	PASS	Real LLM reply
7	303ms	FAIL	Empty/canned reply returned
8	1778ms	PASS	Real LLM reply
9	315ms	FAIL	Empty/canned reply returned
10	1604ms	PASS	Real LLM reply
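The PASS/FAIL column above can be derived mechanically from each response rather than by eye. The helper below is a minimal sketch: `classify_reply` and `FALLBACK_TEXT` are hypothetical names, and it assumes the fallback message quoted in this report appears verbatim somewhere in the response body, which may not hold for every response format.

```shell
# Fallback text as quoted in this report. Matching it verbatim in the
# response body is an assumption about the response format.
FALLBACK_TEXT="I had a brief hiccup processing that. Could you try again?"

# Print FAIL when the body is empty or contains the fallback text, else PASS.
classify_reply() {
  if [ -z "$1" ]; then
    echo FAIL
    return 0
  fi
  case "$1" in
    *"$FALLBACK_TEXT"*) echo FAIL ;;  # quoted expansion is matched literally
    *) echo PASS ;;
  esac
}
```

Inside the probe loop, `classify_reply "$RESPONSE"` could be echoed next to the elapsed time. Because the probe appends an `HTTP_CODE:` trailer to the body, the empty-body branch only triggers if that trailer is stripped first.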

Gateway Error Log Evidence

The gateway error log (pm2 logs openclaw-gateway) shows 16 occurrences of:

followup queue drain failed for agent:test:main: ReplyRunAlreadyActiveError: Reply run already active for agent:test:main
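To re-check the occurrence count after a probe run, the matching lines can be tallied per session key. This is a sketch only: `count_guard_errors` is a hypothetical helper that reads log text on stdin, so it does not depend on where the log comes from.

```shell
# Count log lines on stdin that mention ReplyRunAlreadyActiveError
# for the given session key. Prints 0 when nothing matches.
count_guard_errors() {
  # grep -c exits non-zero on a zero count, so the status is ignored.
  grep -F "ReplyRunAlreadyActiveError" | grep -cF -e "$1" || true
}
```

One possible invocation is `pm2 logs openclaw-gateway --err --lines 1000 --nostream | count_guard_errors "agent:test:main"`, where the 1000-line window is an arbitrary choice and should be sized to cover the probe run.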

Technical Characteristics of Failures

- Fast-fail timing: Failed calls return in ~300ms, which is below typical provider RTT. The error is thrown before any LLM dispatch occurs.
- 1-second gap is insufficient: Despite the 1s pause between calls (well past the prior call's wall-clock completion), the guard remains active.
- Canned fallback returned: Failed calls return the agent-runner's fallback message ("I had a brief hiccup processing that. Could you try again?") rather than a legitimate LLM response.
- Binary verification passes: The installed binary is definitively 2026.5.4:
  - dist/run-state-Bg5KVIP6.js sha256: 3cdea3a69fe7be00ccf0a77279c51fbe9e977cfc13868063f09259f6305538dd
  - dist/agent-runner.runtime-BwDd4yvB.js (updated from 5.3)
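The binary check can be repeated on another install by comparing digests. A minimal sketch follows, assuming GNU coreutils `sha256sum` is on PATH (on macOS, `shasum -a 256` is the usual substitute); `verify_sha256` is a hypothetical helper name.

```shell
# Compare a file's SHA-256 digest with an expected lowercase hex string.
# Prints "match" on success; prints a mismatch to stderr and returns 1 otherwise.
verify_sha256() {
  actual=$(sha256sum "$1" | cut -d ' ' -f 1)
  if [ "$actual" = "$2" ]; then
    echo "match: $1"
  else
    echo "MISMATCH: $1 (got $actual)" >&2
    return 1
  fi
}
```

Run from the install directory, `verify_sha256 dist/run-state-Bg5KVIP6.js 3cdea3a69fe7be00ccf0a77279c51fbe9e977cfc13868063f09259f6305538dd` should report a match if the installed file is the one described here.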

Baseline Comparison

Against 2026.4.26 (last known good), the same 10-call probe produces:

- All 10 calls succeed with real replies
- Warm latency: 1.2–1.7s per call
- Zero ReplyRunAlreadyActiveError events in gateway log

🧠 Root Cause

Architectural Overview

The OpenClaw gateway maintains an activeRunsByKey guard (a Map or Set keyed by sessionKey) to prevent concurrent reply runs for the same session. The guard is checked at request entry and cleared on run completion.
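The check-at-entry, clear-on-completion behavior can be modeled in a few lines. The sketch below is not the gateway's code: `begin_run`, `end_run`, and `GUARD_DIR` are hypothetical names, and a marker directory per session key stands in for an entry in the in-memory guard. It assumes session keys contain no `/`.

```shell
# Hypothetical model of an active-run guard; not the gateway's implementation.
# One marker directory per session key stands in for an in-memory map entry.
GUARD_DIR="${GUARD_DIR:-$(mktemp -d)}"

# Claim the guard at request entry; fail fast if a run is already active.
begin_run() {
  if mkdir "$GUARD_DIR/$1" 2>/dev/null; then
    return 0
  fi
  echo "ReplyRunAlreadyActiveError: Reply run already active for $1" >&2
  return 1
}

# Release the guard on run completion.
end_run() {
  rmdir "$GUARD_DIR/$1" 2>/dev/null || true
}
```

In this model, a run that finishes without calling `end_run` leaves the marker behind, so the next `begin_run` for the same key fails immediately. That matches the fast-fail timing reported above, though the model says nothing about which gateway path skips the cleanup.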

The Regression Introduction (2026.5.3 → 2026.5.4)

The fix for #77485 (commit a9817a5, shipped in 2026.5.4) addressed the queued auto-follow-up path. The release notes state:

“clear the active reply-run guard before draining queued same-session follow-up turns, so sequential chat.send calls no longer trip ReplyRunAlreadyActiveError”

However, this fix introduced or exposed a coverage gap for the discrete sequential chat.send path through the gateway WebSocket dispatcher.

Two Distinct Paths with Shared Guard

The activeRunsByKey guard is shared between two code paths: