May 05, 2026 • Version: 2026.3.8

Control UI Chat Stuck on 'Stop' After Embedded Run Timeout

Embedded runs that time out or abort may not emit terminal lifecycle events to the Control UI, leaving the webchat session stuck in a running state indefinitely.

πŸ” Symptoms

Observable Behavior

When an embedded run times out on Windows, users observe the following:

  • UI State: The "Stop" button remains visible and does not transition to "Run" or display completion
  • Interaction Failure: Clicking the "Stop" button has no effect
  • Gateway Health: Gateway health monitoring continues to show normal status
  • Chat Content: Tool/edit actions already appear completed in the chat history
  • Zombied Session: The specific run/session appears permanently stuck in a running state

Log Evidence

The following timeout-related log entry is observed:

embedded run timeout: runId=<...> sessionId=<...> timeoutMs=600000

Additional timeout entries may include LLM request timeouts occurring around the same period.
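
When triaging, it can help to pull the identifiers out of that log line so they can be matched against session state. The following is a minimal sketch, assuming the line format is exactly as shown above; `parseEmbeddedTimeout` is a hypothetical helper name and not part of the product.

```javascript
// Hypothetical helper: extracts fields from the "embedded run timeout" log line.
// Assumes the fields appear in this order and contain no whitespace.
const TIMEOUT_LINE = /embedded run timeout: runId=(\S+) sessionId=(\S+) timeoutMs=(\d+)/;

function parseEmbeddedTimeout(line) {
  const match = TIMEOUT_LINE.exec(line);
  if (!match) {
    return null; // not a timeout entry
  }
  return {
    runId: match[1],
    sessionId: match[2],
    timeoutMs: Number(match[3]),
  };
}
```

Calling `parseEmbeddedTimeout(line)` on each log line and keeping the non-null results gives a list of run and session IDs to check against the stuck sessions.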

Reproduction Context

This behavior has been observed in:

  • Gateway Dashboard chat interface
  • Control UI sessions
  • Long-running embedded runs (10+ minute timeout scenarios)

🧠 Root Cause

Architectural Analysis

The issue stems from a disconnect between the embedded run abort path and the Control UI lifecycle event subscription.

Code Path Flow

  1. Embedded Run Initialization: When an embedded run starts, agentRunStarted is set to true in the webchat/control-ui state
  2. UI Finalization Dependency: Control UI finalizes runs only upon receiving lifecycle terminal events where phase === "end" or phase === "error"
  3. Timeout Trigger: When timeout occurs, abortRun(true) / activeSession.abort() is invoked
  4. Missing Terminal Event: The timeout/abort path does not directly emit a terminal lifecycle event
  5. Terminal Event Dependency: Terminal lifecycle emission relies on the embedded subscription receiving agent_end
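
The dependency described in steps 1 and 2 can be illustrated with a small state reducer. This is only a sketch under stated assumptions: the state shape, the `start` phase name, and the `tool` phase are hypothetical, since the real Control UI state is not shown in this guide. Only the `end` and `error` phases come from the description above.

```javascript
// Assumed state shape; the real Control UI state may differ.
function createRunState() {
  return { agentRunStarted: false, running: false, outcome: null };
}

// Only "end" and "error" finalize the run; every other event leaves it running.
function applyLifecycleEvent(state, event) {
  switch (event.phase) {
    case 'start': // hypothetical phase name for run start
      return { agentRunStarted: true, running: true, outcome: null };
    case 'end':
    case 'error':
      return { ...state, running: false, outcome: event.phase };
    default:
      return state;
  }
}

let state = createRunState();
state = applyLifecycleEvent(state, { phase: 'start' });
state = applyLifecycleEvent(state, { phase: 'tool' }); // hypothetical non-terminal event

// If the timeout path emits nothing, the UI is still "running" here.
const stuck = state.running;

// A terminal event is the only thing that releases it.
state = applyLifecycleEvent(state, { phase: 'error', reason: 'timeout' });
```

In this model there is no transition out of `running` other than a terminal event, which is why a missing event leaves the "Stop" button in place.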

Failure Sequence

    Embedded Run Starts
      ↓
    agentRunStarted = true (UI enters dependent state)
      ↓
    Timeout/Abort Triggered → abortRun(true) / activeSession.abort()
      ↓
    [PATH A] Normal:  agent_end received → terminal lifecycle event → UI finalized
    [PATH B] Failure: agent_end NOT received → terminal lifecycle event NOT emitted → UI stuck

The Core Gap

The Control UI has no fallback finalization path when:

  • agentRunStarted === true
  • No terminal lifecycle event (phase === "end" or phase === "error") arrives

Contributing Factors

  1. Unsubscribe Race Condition: subscribeEmbeddedPiSession(...) may unsubscribe before the terminal lifecycle event is emitted
  2. Abort Signal Bypass: activeSession.abort() may bypass agent_end emission in certain timeout scenarios
  3. No Defensive Finalization: When dispatchInboundMessage() settles but no terminal lifecycle event was observed, there is no defensive cleanup
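
The first contributing factor is easy to reproduce in isolation. The sketch below uses a hand-rolled publish/subscribe hub, not the real `subscribeEmbeddedPiSession(...)` implementation, and the `agent_start` event name is an assumption. It shows only that an event emitted after the listener is removed is silently dropped.

```javascript
// Minimal pub/sub hub, for illustration only.
function createHub() {
  const listeners = new Set();
  return {
    subscribe(listener) {
      listeners.add(listener);
      return () => listeners.delete(listener); // unsubscribe function
    },
    emit(event) {
      for (const listener of listeners) {
        listener(event);
      }
    },
  };
}

const hub = createHub();
const seen = [];
const unsubscribe = hub.subscribe((event) => seen.push(event.type));

hub.emit({ type: 'agent_start' }); // hypothetical event name
unsubscribe();                     // cleanup wins the race...
hub.emit({ type: 'agent_end' });   // ...so the terminal event is never observed

// seen now contains only 'agent_start'
```

Nothing throws and nothing is logged in this sequence, which is consistent with the gateway staying healthy while the session remains stuck.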

πŸ› οΈ Step-by-Step Fix

Fix 1: Ensure Terminal Lifecycle Event Emission in Timeout Path

File: Embedded run execution handler (runEmbeddedAttempt(…) / runEmbeddedPiAgent(…))

Before (JavaScript):

    // Timeout path - abort only
    function handleTimeout(runId, sessionId) {
        abortRun(true);
        activeSession.abort();
        // No terminal lifecycle event emitted
    }

After (JavaScript):

    // Timeout path - abort AND emit terminal event
    function handleTimeout(runId, sessionId) {
        abortRun(true);
        activeSession.abort();

        // Emit terminal lifecycle event to unblock UI
        emitLifecycleEvent({
            phase: 'error',
            runId: runId,
            sessionId: sessionId,
            reason: 'timeout'
        });
    }

Fix 2: Add Subscription Guards for Terminal Events

File: subscribeEmbeddedPiSession(…)

Before (JavaScript):

    // Unsubscribe may occur before agent_end
    session.subscribe({
        onAgentEnd: () => { /* cleanup */ },
        onError: () => { /* cleanup */ }
    });
    // No guard against premature unsubscription

After (JavaScript):

    // Guard terminal event emission so it happens at most once
    let terminalEventEmitted = false;
    session.subscribe({
        onAgentEnd: () => {
            if (!terminalEventEmitted) {
                terminalEventEmitted = true;
                emitLifecycleEvent({ phase: 'end', /* … */ });
            }
        },
        onError: (error) => {
            if (!terminalEventEmitted) {
                terminalEventEmitted = true;
                emitLifecycleEvent({ phase: 'error', error, /* … */ });
            }
        }
    });

Fix 3: Add Defensive Finalization in dispatchInboundMessage

File: Webchat/control-ui message dispatch handler

Before (JavaScript):

    async function dispatchInboundMessage(message) {
        await processMessage(message);
        // No terminal state check after settlement
    }

After (JavaScript):

    async function dispatchInboundMessage(message) {
        await processMessage(message);

        // Defensive: Check if run started but no terminal event received
        if (agentRunStarted && !terminalLifecycleReceived) {
            // Finalize as timeout if no progress after settlement
            finalizeWithTerminalState({
                phase: 'error',
                reason: 'timeout_pending_finalization'
            });
        }
    }
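
One detail worth considering for this fix: a promise also settles when it rejects. The sketch below is a variation under stated assumptions, not the project's implementation. It moves the check into a `finally` block so it also runs when processing throws. `finalizeIfUnterminated`, `dispatchWithFallback`, and the `runState` object are hypothetical names; the field names and the `timeout_pending_finalization` reason follow the snippet above.

```javascript
// Finalizes the run only if it started and no terminal event was seen.
// Returns true when the fallback actually fired.
function finalizeIfUnterminated(runState, finalize) {
  if (runState.agentRunStarted && !runState.terminalLifecycleReceived) {
    runState.terminalLifecycleReceived = true; // prevents a second fallback
    finalize({ phase: 'error', reason: 'timeout_pending_finalization' });
    return true;
  }
  return false;
}

// Runs the fallback whether processMessage resolves or rejects.
async function dispatchWithFallback(message, processMessage, runState, finalize) {
  try {
    return await processMessage(message);
  } finally {
    finalizeIfUnterminated(runState, finalize);
  }
}
```

Keeping the check in a separate synchronous function makes it straightforward to unit test without driving a full message dispatch.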

Fix 4: Guard against Premature Unsubscription

File: Session management in embedded runner

Before (JavaScript):

    // Subscription cleanup may happen too early
    function cleanup() {
        subscription.unsubscribe();
    }

After (JavaScript):

    // Subscription cleanup only after terminal state
    function cleanup() {
        // Only unsubscribe after terminal event or explicit finalization
        if (terminalEventEmitted || forceCleanup) {
            subscription.unsubscribe();
        } else {
            // Schedule cleanup after terminal state
            scheduleTerminalCleanup();
        }
    }
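
The snippet above calls `scheduleTerminalCleanup()` without showing it. One possible shape is sketched here as an assumption, not the actual implementation: defer the unsubscribe until a terminal event arrives, but bound the wait with a grace timer so the subscription cannot leak. `createSubscriptionCleanup`, its option names, and the 5000 ms default are all hypothetical.

```javascript
// Defers unsubscribe until a terminal event, with a bounded grace period.
// setTimer/clearTimer are injectable so the logic can be tested without real timers.
function createSubscriptionCleanup(unsubscribe, options = {}) {
  const { graceMs = 5000, setTimer = setTimeout, clearTimer = clearTimeout } = options;
  let done = false;
  let timer = null;

  function finish() {
    if (done) {
      return; // unsubscribe at most once
    }
    done = true;
    if (timer !== null) {
      clearTimer(timer);
    }
    unsubscribe();
  }

  return {
    // Call once a terminal lifecycle event has been emitted.
    onTerminalEvent: finish,
    // Call when the run wants to clean up; waits up to graceMs for the terminal event.
    requestCleanup() {
      if (done || timer !== null) {
        return;
      }
      timer = setTimer(finish, graceMs);
    },
  };
}
```

The grace timer is the trade-off here: too short and the race returns, too long and a dead subscription lingers. The value would need tuning against observed event latency.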

🧪 Verification

Verification Steps

  1. Simulate Embedded Timeout:
    # Start a long-running embedded run and trigger a timeout
    # using the Gateway Dashboard chat interface

    1. Open Gateway Dashboard
    2. Start a chat with tool execution
    3. Trigger embedded run with extended timeout
    4. Wait for timeout (600000ms in affected scenario)

  2. Check Terminal Lifecycle Event:
    # Verify the terminal lifecycle event is emitted
    # Check logs for: phase === "end" OR phase === "error"

    Expected log output:

        embedded run timeout: runId=<...> sessionId=<...> timeoutMs=600000
        lifecycle_event: phase="error" reason="timeout" runId=<...>

  3. Verify UI State Transition:
    # After timeout, verify:
    # 1. Stop button changes to Run (or shows completed state)
    # 2. No further "Stop" button interaction failures
    # 3. Session no longer appears stuck in a running state

    Expected behavior:

    • UI transitions to completed/aborted state within 5 seconds of timeout
    • "Stop" button becomes responsive again
    • Session clears from running state tracker

  4. Regression Test:
    # Test normal completion path still works

    1. Start Gateway Dashboard chat
    2. Run embedded operation with normal completion (no timeout)
    3. Verify: phase === "end" event fires normally
    4. Verify: UI finalizes correctly on successful completion

    Expected Output After Fix

    | Scenario | Before Fix | After Fix |
    |---|---|---|
    | Timeout | UI stuck on "Stop" | UI transitions to "Error" state |
    | Abort | Session zombied | Session properly finalized |
    | Normal completion | Works | Still works (no regression) |
    | Click "Stop" after timeout | No effect | Clears properly |

    ⚠️ Common Pitfalls

    • Race Condition in Unsubscription: Ensure subscription cleanup does not occur before agent_end is processed. Gating unsubscription on a boolean flag such as terminalEventEmitted prevents the listener from being removed too early.
    • Double Emission of Terminal Events: When adding guards, ensure the agent_end and error handlers cannot both emit. A shared terminalEventEmitted flag prevents duplicate emissions.
    • Defensive Finalization Timing: The fallback finalization in dispatchInboundMessage() must not trigger for in-progress runs. Activate it only when the run has started, message processing has settled, and no terminal lifecycle event was observed.
    • Windows-Specific Timing: The issue was observed on Windows 11. Ensure timeout handling does not rely on platform-specific timing assumptions. Use explicit timeout checks rather than implicit cleanup.
    • Partial Tool Completion: If tool operations complete but LLM request times out, ensure the terminal event is still emitted. The UI showing completed tool/edit actions does not mean the run is finalized.
    • Gateway Health Independence: Do not expect gateway health to indicate this issue. The gateway can remain healthy while individual sessions are zombied. Monitor session-level states, not just gateway health.
    • Session ID Correlation: When tracing logs, ensure runId and sessionId are correctly correlated. Mismatched IDs can cause confusion during debugging.
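
The double-emission pitfall becomes more likely once the timeout path, the agent_end handler, and the defensive fallback can all emit. A minimal sketch of one way to funnel them through a single guard follows; `createTerminalEmitter` is a hypothetical helper, and `emit` stands in for whatever function actually publishes the lifecycle event.

```javascript
// Wraps an emit function so that only the first terminal event per run goes through.
function createTerminalEmitter(emit) {
  let emitted = false;
  return function emitTerminal(event) {
    if (emitted) {
      return false; // a terminal event was already sent for this run
    }
    emitted = true;
    emit(event);
    return true;
  };
}

// Usage: every termination path shares the same guarded emitter.
const sent = [];
const emitTerminal = createTerminalEmitter((event) => sent.push(event));

const first = emitTerminal({ phase: 'error', reason: 'timeout' }); // timeout path
const second = emitTerminal({ phase: 'end' });                     // late agent_end
```

With this arrangement the first path to fire decides the outcome, so a late agent_end after a timeout cannot flip the UI from an error state back to a completed one.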

    Environment-Specific Considerations

    | Environment | Consideration |
    |---|---|
    | Windows 11 | Timeout paths may have different race characteristics due to process scheduling |
    | Docker | Subscription timing may vary with container resource constraints |
    | macOS | Similar to Windows but verify timeout handler execution order |
    | Linux | Standard POSIX timing; generally more predictable |

    • LLM Request Timeout (600000ms): Often precedes the stuck UI state; indicates the timeout trigger, not the root cause
    • agent_end Not Received: The specific failure point where the terminal lifecycle event chain breaks
    • Subscription Unsubscribed Before Terminal Event: Race condition causing event loss in the embedded subscription path
    • Zombied Session State: Terminal symptom where sessions remain in running state despite internal completion
    • Stop Button Non-Responsive: UI symptom resulting from the session being in an invalid state
    • flushPendingToolResultsAfterIdle Bounded Cleanup: Verified as not the root cause; that cleanup path is properly bounded
    • dispatchInboundMessage Early Return: Verified as not the issue; no simple early-return bug was found

    Historical Context

    This issue is related to lifecycle event propagation in the embedded run subsystem. Similar patterns have been observed in:

    • Embedded Pi Agent session management (v2026.x series)
    • Control UI run state management
    • Gateway Dashboard chat finalization

    After applying the fix, monitor for:

    # Metrics to track
    - lifecycle_event_emission_rate (should be 100% of run terminations)
    - ui_finalization_lag (time between run termination and UI update)
    - subscription_cleanup_timing (should only occur after terminal event)
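
The first metric can be approximated offline from captured lifecycle events. The sketch below assumes each record has a `runId` and a `phase`, and that run start is reported as a `start` phase; both are assumptions about the event shape, and `summarizeRuns` is a hypothetical helper.

```javascript
// Groups lifecycle events by run and reports runs with no terminal event
// (stuck) or more than one terminal event (double emission).
function summarizeRuns(events) {
  const runs = new Map();
  for (const { runId, phase } of events) {
    const run = runs.get(runId) ?? { started: false, terminalCount: 0 };
    if (phase === 'start') {
      run.started = true;
    }
    if (phase === 'end' || phase === 'error') {
      run.terminalCount += 1;
    }
    runs.set(runId, run);
  }

  const unterminated = [];
  const duplicated = [];
  for (const [runId, run] of runs) {
    if (run.started && run.terminalCount === 0) {
      unterminated.push(runId);
    }
    if (run.terminalCount > 1) {
      duplicated.push(runId);
    }
  }
  return { total: runs.size, unterminated, duplicated };
}
```

An empty `unterminated` list corresponds to the 100% emission target, and an empty `duplicated` list confirms the emission guard is holding. Runs still in progress when the events were captured will also appear as unterminated, so they need to be excluded before drawing conclusions.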
    

    Evidence & Sources

    This troubleshooting guide was automatically synthesized by the FixClaw Intelligence Pipeline from community discussions.