May 05, 2026 • Version: 2026.3.8

Control UI Chat Stuck on 'Stop' After Embedded Run Timeout

Embedded runs that time out or abort may not emit a terminal lifecycle event to the Control UI, leaving the webchat session stuck in a running state indefinitely.

🔍 Symptoms

Observable Behavior

When an embedded run times out on Windows, users observe the following:

UI State: The "Stop" button remains visible and does not transition to "Run" or display completion
Interaction Failure: Clicking the "Stop" button has no effect
Gateway Health: Gateway health monitoring continues to show normal status
Chat Content: Tool/edit actions already appear completed in the chat history
Zombied Session: The specific run/session appears permanently stuck in a running state

Log Evidence

The following timeout-related log entry is observed:

embedded run timeout: runId=<...> sessionId=<...> timeoutMs=600000

LLM request timeouts may also be logged around the same time.
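When triaging, it can help to pull the fields out of the embedded run timeout entry programmatically. The following is a minimal sketch, assuming the key=value layout shown in the entry above and that the values contain no whitespace; `parseEmbeddedTimeout` and the sample IDs are hypothetical.

```javascript
// Parses the "embedded run timeout" entry. The key=value layout comes from
// the log line above; whitespace-free values are an assumption.
function parseEmbeddedTimeout(line) {
  const match =
    /embedded run timeout: runId=(\S+) sessionId=(\S+) timeoutMs=(\d+)/.exec(line);
  if (!match) return null;
  const [, runId, sessionId, timeoutMs] = match;
  return { runId, sessionId, timeoutMs: Number(timeoutMs) };
}

// Sample line with made-up IDs in place of the elided values.
const entry = parseEmbeddedTimeout(
  'embedded run timeout: runId=run-123 sessionId=sess-456 timeoutMs=600000'
);
console.log(entry.runId, entry.sessionId); // run-123 sess-456
console.log(entry.timeoutMs / 60000); // 10 (minutes)
```

A `timeoutMs` of 600000 corresponds to the 10-minute timeout in the affected scenario.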

Reproduction Context

This behavior has been observed in:

Gateway Dashboard chat interface
Control UI sessions
Long-running embedded runs (10+ minute timeout scenarios)

🧠 Root Cause

Architectural Analysis

The issue stems from a disconnect between the embedded run abort path and the Control UI lifecycle event subscription.

Code Path Flow

Embedded Run Initialization: When an embedded run starts, agentRunStarted is set to true in the webchat/control-ui state
UI Finalization Dependency: Control UI finalizes runs only upon receiving lifecycle terminal events where phase === "end" or phase === "error"
Timeout Trigger: When the timeout fires, abortRun(true) / activeSession.abort() is invoked
Missing Terminal Event: The timeout/abort path does not directly emit a terminal lifecycle event
Terminal Event Dependency: Terminal lifecycle emission relies on the embedded subscription receiving agent_end

Failure Sequence

Embedded Run Starts
    ↓
agentRunStarted = true (UI enters dependent state)
    ↓
Timeout/Abort Triggered → abortRun(true) / activeSession.abort()
    ↓
[PATH A] Normal: agent_end received → terminal lifecycle event → UI finalized
[PATH B] Failure: agent_end NOT received → terminal lifecycle event NOT emitted → UI stuck

The Core Gap

The Control UI has no fallback finalization path when:

agentRunStarted === true
No terminal lifecycle event (phase === "end" or phase === "error") arrives
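
The gap can be illustrated with a minimal model of the UI-side run state. This is a sketch, not the actual Control UI code: `createRunState`, `onRunStarted`, `onLifecycleEvent`, and the status strings are hypothetical, and only `agentRunStarted` and the `phase` values come from the analysis above.

```javascript
// Hypothetical model of the Control UI run state. Names other than
// agentRunStarted and the phase values are illustrative assumptions.
function createRunState() {
  const state = { agentRunStarted: false, status: 'idle' };
  return {
    onRunStarted() {
      state.agentRunStarted = true;
      state.status = 'running'; // UI shows "Stop"
    },
    onLifecycleEvent(event) {
      // The only exit from "running" is a terminal lifecycle phase.
      if (event.phase === 'end' || event.phase === 'error') {
        state.agentRunStarted = false;
        state.status = event.phase === 'end' ? 'completed' : 'failed';
      }
    },
    snapshot() {
      return { ...state };
    },
  };
}

// PATH A: a terminal event arrives and the UI finalizes.
const pathA = createRunState();
pathA.onRunStarted();
pathA.onLifecycleEvent({ phase: 'end' });

// PATH B: the run is aborted but no terminal event is ever delivered.
const pathB = createRunState();
pathB.onRunStarted();

console.log(pathA.snapshot().status); // completed
console.log(pathB.snapshot().status); // running
```

In this model nothing other than a terminal event can move PATH B out of the running state, which matches the stuck "Stop" button.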

Contributing Factors

Unsubscribe Race Condition: subscribeEmbeddedPiSession(...) may unsubscribe before the terminal lifecycle event is emitted
Abort Signal Bypass: activeSession.abort() may bypass agent_end emission in certain timeout scenarios
No Defensive Finalization: When dispatchInboundMessage() settles but no terminal lifecycle event was observed, there is no defensive cleanup
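
The unsubscribe race can be reproduced in isolation. The sketch below uses a hand-rolled event source rather than the real embedded session; `createSession` and `runScenario` are hypothetical stand-ins, and the only detail taken from the analysis is that terminal lifecycle emission depends on the subscriber seeing agent_end.

```javascript
// Minimal stand-in for an embedded session event source (hypothetical).
function createSession() {
  const listeners = new Set();
  return {
    subscribe(listener) {
      listeners.add(listener);
      return () => listeners.delete(listener); // unsubscribe function
    },
    emit(type) {
      for (const listener of [...listeners]) listener(type);
    },
  };
}

function runScenario(unsubscribeBeforeEnd) {
  const session = createSession();
  const lifecycle = [];
  const unsubscribe = session.subscribe((type) => {
    // Terminal lifecycle emission depends on seeing agent_end.
    if (type === 'agent_end') lifecycle.push({ phase: 'end' });
  });

  if (unsubscribeBeforeEnd) unsubscribe(); // cleanup wins the race
  session.emit('agent_end'); // arrives after the abort
  return lifecycle;
}

console.log(runScenario(false).length); // 1: terminal event emitted
console.log(runScenario(true).length); // 0: event lost, UI never finalizes
```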

🛠️ Step-by-Step Fix

Fix 1: Ensure Terminal Lifecycle Event Emission in Timeout Path

File: Embedded run execution handler (runEmbeddedAttempt(…) / runEmbeddedPiAgent(…))

Before (JavaScript):

    // Timeout path - abort only
    function handleTimeout(runId, sessionId) {
        abortRun(true);
        activeSession.abort();
        // No terminal lifecycle event emitted
    }

After (JavaScript):

    // Timeout path - abort AND emit terminal event
    function handleTimeout(runId, sessionId) {
        abortRun(true);
        activeSession.abort();

        // Emit terminal lifecycle event to unblock UI
        emitLifecycleEvent({
            phase: 'error',
            runId: runId,
            sessionId: sessionId,
            reason: 'timeout'
        });
    }

Fix 2: Add Subscription Guards for Terminal Events

File: subscribeEmbeddedPiSession(…)

Before (JavaScript):

    // Unsubscribe may occur before agent_end
    session.subscribe({
        onAgentEnd: () => { /* cleanup */ },
        onError: () => { /* cleanup */ }
    });
    // No guard against premature unsubscription

After (JavaScript):

    // Guard terminal event emission
    let terminalEventEmitted = false;
    session.subscribe({
        onAgentEnd: () => {
            if (!terminalEventEmitted) {
                terminalEventEmitted = true;
                emitLifecycleEvent({ phase: 'end', /* … */ });
            }
        },
        onError: (error) => {
            if (!terminalEventEmitted) {
                terminalEventEmitted = true;
                emitLifecycleEvent({ phase: 'error', error, /* … */ });
            }
        }
    });
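
With Fix 1 in place, the timeout path and a late agent_end could both try to emit a terminal event for the same run, because the handleTimeout sketch does not consult terminalEventEmitted. One possible way to share the guard is a per-run once-only wrapper. This is a sketch under assumptions: `createTerminalEmitter` is a hypothetical helper, and the real `emitLifecycleEvent` signature may differ from the single-argument form used here.

```javascript
// Hypothetical helper: wraps an emit function so that each runId gets at
// most one terminal lifecycle event, whichever path fires first.
function createTerminalEmitter(emitLifecycleEvent) {
  const finalizedRuns = new Set();
  return function emitTerminal(event) {
    if (finalizedRuns.has(event.runId)) return false; // already finalized
    finalizedRuns.add(event.runId);
    emitLifecycleEvent(event);
    return true;
  };
}

const emitted = [];
const emitTerminal = createTerminalEmitter((event) => emitted.push(event));

// The timeout path fires first...
emitTerminal({ phase: 'error', runId: 'run-1', reason: 'timeout' });
// ...then a late agent_end for the same run is ignored.
emitTerminal({ phase: 'end', runId: 'run-1' });
// A different run is unaffected.
emitTerminal({ phase: 'end', runId: 'run-2' });

console.log(emitted.map((e) => `${e.runId}:${e.phase}`));
// [ 'run-1:error', 'run-2:end' ]
```

In a long-lived process the set of finalized run IDs would need to be pruned, for example when the session is disposed.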

Fix 3: Add Defensive Finalization in dispatchInboundMessage

File: Webchat/control-ui message dispatch handler

Before (JavaScript):

    async function dispatchInboundMessage(message) {
        await processMessage(message);
        // No terminal state check after settlement
    }

After (JavaScript):

    async function dispatchInboundMessage(message) {
        try {
            await processMessage(message);
        } finally {
            // Defensive: runs whether processing resolved or rejected.
            // If the run started but no terminal event was received, finalize it here.
            if (agentRunStarted && !terminalLifecycleReceived) {
                finalizeWithTerminalState({
                    phase: 'error',
                    reason: 'timeout_pending_finalization'
                });
            }
        }
    }

Fix 4: Guard against Premature Unsubscription

File: Session management in embedded runner

Before (JavaScript):

    // Subscription cleanup may happen too early
    function cleanup() {
        subscription.unsubscribe();
    }

After (JavaScript):

    // Subscription cleanup only after terminal state
    function cleanup() {
        // Only unsubscribe after terminal event or explicit finalization
        if (terminalEventEmitted || forceCleanup) {
            subscription.unsubscribe();
        } else {
            // Schedule cleanup after terminal state
            scheduleTerminalCleanup();
        }
    }
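
The fix above leaves scheduleTerminalCleanup() unspecified. One way it might work is to defer the unsubscribe until a terminal event is recorded, with a grace-period timer as a fallback so the subscription is never leaked. This is a sketch under assumptions: `createDeferredCleanup`, `markTerminal`, `requestCleanup`, and the grace period are all hypothetical, and the timer functions are injectable only so the demo can run synchronously.

```javascript
// Hypothetical sketch of deferred cleanup: unsubscribe once a terminal event
// has been seen, or after a grace period if none arrives.
function createDeferredCleanup(
  unsubscribe,
  graceMs,
  timers = {
    set: (fn, ms) => setTimeout(fn, ms),
    clear: (id) => clearTimeout(id),
  }
) {
  let terminalSeen = false;
  let cleanupRequested = false;
  let done = false;
  let timer = null;

  function finish() {
    if (done) return; // unsubscribe at most once
    done = true;
    if (timer !== null) timers.clear(timer);
    unsubscribe();
  }

  return {
    // Call from the subscription handlers after emitting a terminal event.
    markTerminal() {
      terminalSeen = true;
      if (cleanupRequested) finish();
    },
    // Call where cleanup() previously unsubscribed immediately.
    requestCleanup() {
      cleanupRequested = true;
      if (terminalSeen) finish();
      else if (timer === null && !done) timer = timers.set(finish, graceMs);
    },
  };
}

// Demo with a fake timer so the example runs synchronously.
const pending = [];
const fakeTimers = {
  set: (fn) => pending.push(fn) - 1,
  clear: (id) => { pending[id] = null; },
};

let unsubscribed = 0;
const cleanup = createDeferredCleanup(() => { unsubscribed += 1; }, 5000, fakeTimers);

cleanup.requestCleanup(); // no terminal event yet: deferred
console.log(unsubscribed); // 0
cleanup.markTerminal(); // terminal event arrives: unsubscribe now
console.log(unsubscribed); // 1
```

If the grace period expires first, the fallback still unsubscribes; in that case the timeout path or the defensive finalization is expected to have finalized the UI.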

🧪 Verification

Verification Steps

Simulate Embedded Timeout:

Start a long-running embedded run from the Gateway Dashboard chat interface and let it time out.

Open Gateway Dashboard
Start a chat with tool execution
Trigger an embedded run with an extended timeout
Wait for the timeout (600000 ms in the affected scenario)

Check Terminal Lifecycle Event:

Verify that a terminal lifecycle event is emitted by checking the logs for phase === "end" or phase === "error".

Expected log output:

embedded run timeout: runId=<…> sessionId=<…> timeoutMs=600000
lifecycle_event: phase="error" reason="timeout" runId=<…>

Verify UI State Transition:

After the timeout, verify that:

1. The "Stop" button changes to "Run" (or the run shows a completed state)
2. There are no further "Stop" button interaction failures
3. The session no longer appears zombied in session-level state monitoring

Expected behavior:

The UI transitions to a completed/aborted state within 5 seconds of the timeout
The "Stop" button becomes responsive again
The session clears from the running state tracker

Regression Test:

Confirm that the normal completion path still works.

Start a Gateway Dashboard chat
Run an embedded operation with normal completion (no timeout)
Verify: the phase === "end" event fires normally
Verify: the UI finalizes correctly on successful completion
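
The log check in the steps above can be automated. The sketch below flags run IDs that logged a timeout but never logged a terminal lifecycle event. It assumes the lifecycle_event line looks like the expected output shown earlier, with straight double quotes, which is an assumption about how the fixed build logs it; `findUnfinalizedRuns` and the sample IDs are hypothetical.

```javascript
// Returns runIds that logged a timeout but no terminal lifecycle event.
// The lifecycle_event format is assumed from the expected log output.
function findUnfinalizedRuns(lines) {
  const timedOut = new Set();
  const finalized = new Set();
  for (const line of lines) {
    const timeout = /embedded run timeout: runId=(\S+)/.exec(line);
    if (timeout) timedOut.add(timeout[1]);
    const lifecycle = /lifecycle_event: phase="(end|error)".*\brunId=(\S+)/.exec(line);
    if (lifecycle) finalized.add(lifecycle[2]);
  }
  return [...timedOut].filter((runId) => !finalized.has(runId));
}

const logLines = [
  'embedded run timeout: runId=run-1 sessionId=s-1 timeoutMs=600000',
  'lifecycle_event: phase="error" reason="timeout" runId=run-1',
  'embedded run timeout: runId=run-2 sessionId=s-2 timeoutMs=600000',
];
console.log(findUnfinalizedRuns(logLines)); // [ 'run-2' ]
```

An empty result after a simulated timeout suggests every timed-out run reached a terminal lifecycle event.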

Expected Output After Fix

Scenario	Before Fix	After Fix
Timeout	UI stuck on “Stop”	UI transitions to “Error” state
Abort	Session zombied	Session properly finalized
Normal completion	Works	Still works (no regression)
Click “Stop” after timeout	No effect	Clears properly

⚠️ Common Pitfalls

Race Condition in Unsubscription: Ensure subscription cleanup does not occur before agent_end is processed. Gate the unsubscribe call on a flag such as terminalEventEmitted; setting the flag alone does not delay cleanup.
Double Emission of Terminal Events: When adding guards, ensure the agent_end handler, the error handler, and the timeout path cannot each emit a terminal event for the same run. Share a single terminalEventEmitted flag across them to prevent duplicate emissions.
Defensive Finalization Timing: The fallback finalization in dispatchInboundMessage() must not trigger for in-progress runs. Only activate it when the run has started, message processing has settled, and no terminal lifecycle event was observed.
Windows-Specific Timing: The issue was observed on Windows 11. Ensure timeout handling does not rely on platform-specific timing assumptions. Use explicit timeout checks rather than implicit cleanup.
Partial Tool Completion: If tool operations complete but LLM request times out, ensure the terminal event is still emitted. The UI showing completed tool/edit actions does not mean the run is finalized.
Gateway Health Independence: Do not expect gateway health to indicate this issue. The gateway can remain healthy while individual sessions are zombied. Monitor session-level states, not just gateway health.
Session ID Correlation: When tracing logs, ensure runId and sessionId are correctly correlated. Mismatched IDs can cause confusion during debugging.

Environment-Specific Considerations

Environment	Consideration
Windows 11	Timeout paths may have different race characteristics due to process scheduling
Docker	Subscription timing may vary with container resource constraints
macOS	Similar to Windows but verify timeout handler execution order
Linux	Standard POSIX timing; generally more predictable

🔗 Related Errors

LLM Request Timeout (600000ms) — Often precedes the stuck UI state; indicates the timeout trigger, not the root cause
agent_end Not Received — The specific failure point where the terminal lifecycle event chain breaks
Subscription Unsubscribed Before Terminal Event — Race condition causing event loss in the embedded subscription path
Zombied Session State — Terminal symptom where sessions remain in running state despite internal completion
Stop Button Non-Responsive — UI symptom resulting from the session being in an invalid state
flushPendingToolResultsAfterIdle Bounded Cleanup — Verified as not the root cause; that cleanup path is properly bounded
dispatchInboundMessage Early Return — Verified as not the issue; no simple early-return bug was found

Historical Context

This issue is related to lifecycle event propagation in the embedded run subsystem. Similar patterns have been observed in:

Embedded Pi Agent session management (v2026.x series)
Control UI run state management
Gateway Dashboard chat finalization

Recommended Monitoring Additions

After applying the fix, monitor for:

# Metrics to track
- lifecycle_event_emission_rate (should be 100% of run terminations)
- ui_finalization_lag (time between run termination and UI update)
- subscription_cleanup_timing (should only occur after terminal event)
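
As a rough illustration of how the first two metrics might be derived, the sketch below summarizes a list of terminated runs. The record shape (`terminatedAt`, `terminalEventAt`, `uiFinalizedAt`, in milliseconds) and the `summarizeRuns` helper are hypothetical; the source does not define how these values are collected.

```javascript
// Hypothetical record shape: one entry per terminated run, timestamps in ms.
// terminalEventAt / uiFinalizedAt are null when the step never happened.
function summarizeRuns(runs) {
  const withEvent = runs.filter((run) => run.terminalEventAt !== null);
  const lags = runs
    .filter((run) => run.uiFinalizedAt !== null)
    .map((run) => run.uiFinalizedAt - run.terminatedAt);
  return {
    // Fraction of terminated runs that emitted a terminal event (target: 1).
    lifecycleEventEmissionRate: runs.length ? withEvent.length / runs.length : 1,
    // Worst observed delay between run termination and UI update.
    maxUiFinalizationLagMs: lags.length ? Math.max(...lags) : null,
    // Runs whose UI never finalized.
    stuckRuns: runs
      .filter((run) => run.uiFinalizedAt === null)
      .map((run) => run.runId),
  };
}

const summary = summarizeRuns([
  { runId: 'run-1', terminatedAt: 1000, terminalEventAt: 1050, uiFinalizedAt: 1200 },
  { runId: 'run-2', terminatedAt: 2000, terminalEventAt: null, uiFinalizedAt: null },
]);
console.log(summary.lifecycleEventEmissionRate); // 0.5
console.log(summary.maxUiFinalizationLagMs); // 200
console.log(summary.stuckRuns); // [ 'run-2' ]
```

An emission rate below 1 or a non-empty list of stuck runs would indicate that some termination path still bypasses terminal event emission.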