decision=surface_error reason=timeout
- **Gateway logs** show a `ConnectionAbortedError`, indicating that the connection was terminated:
  `ConnectionAbortedError: [WinError 10053] Your host software aborted an established connection.`
  `RemoteProtocolError: Server disconnected without sending a response.`
- **Web UI** displays a loading spinner indefinitely and never transitions to an error state or recovers.
- **User impact**: No error message is displayed, and the user cannot retry without refreshing the page (which loses conversation context)
## Root Cause Analysis
Analysis of the logs and the observed behavior points to the following root cause:
1. **Timeout detection works correctly**: The agent's timeout mechanism properly detects when an LLM response exceeds the threshold and logs `decision=surface_error reason=timeout`.
2. **Premature connection termination**: When the agent aborts a run due to timeout, the WebSocket connection is terminated before the agent can send a `final` event with `status: "timeout"` to the Web UI.
3. **Missing error propagation**: The `agent.wait` method is expected to return `status: "timeout"`, but this status is never transmitted to the UI client because the connection is already closed.
4. **Race condition**: The connection teardown happens faster than the error event can be dispatched, causing the UI to remain in a perpetual loading state.
5. **Gateway limitation**: The custom gateway (`ai_router.py`) handles retry logic correctly for network errors, but cannot compensate for the agent aborting the connection from its side.
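The ordering problem described in points 2 to 4 can be reproduced in isolation. The following is a minimal sketch under stated assumptions, not the agent's actual code: `FakeConnection`, `abort_run_buggy`, and `abort_run_fixed` are hypothetical names, and the fake connection only records frames in a list and rejects anything sent after it is closed.

```python
import asyncio


class FakeConnection:
    """Hypothetical stand-in for the UI WebSocket: records delivered frames."""

    def __init__(self) -> None:
        self.closed = False
        self.delivered: list[dict] = []

    async def send_json(self, frame: dict) -> None:
        await asyncio.sleep(0)  # yield to the event loop, like real network I/O
        if self.closed:
            raise ConnectionError("connection already closed")
        self.delivered.append(frame)

    async def close(self) -> None:
        self.closed = True


FINAL_TIMEOUT = {"type": "final", "status": "timeout"}


async def abort_run_buggy(conn: FakeConnection) -> None:
    # The dispatch is scheduled but not awaited, so teardown wins the race.
    send_task = asyncio.create_task(conn.send_json(FINAL_TIMEOUT))
    await conn.close()
    try:
        await send_task
    except ConnectionError:
        pass  # the event is lost; the UI never learns about the timeout


async def abort_run_fixed(conn: FakeConnection) -> None:
    # Await delivery first, then tear the connection down.
    await conn.send_json(FINAL_TIMEOUT)
    await conn.close()


async def main() -> tuple[list[dict], list[dict]]:
    buggy, fixed = FakeConnection(), FakeConnection()
    await abort_run_buggy(buggy)
    await abort_run_fixed(fixed)
    return buggy.delivered, fixed.delivered


if __name__ == "__main__":
    lost, delivered = asyncio.run(main())
    print("buggy path delivered:", lost)
    print("fixed path delivered:", delivered)
```

In the buggy path the list of delivered frames stays empty, which corresponds to the spinner that never resolves. In the fixed path the `final` frame is recorded before the connection closes.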
## Solution
To resolve this issue, the following changes are required:
1. **Ensure `final` event delivery before connection teardown**: Modify the agent's timeout handling to guarantee that a `final` event with `status: "timeout"` is sent to the Web UI **before** the WebSocket connection is terminated.
2. **Implement graceful timeout error response**: Update the agent's timeout logic to construct and send a proper error event:
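   ```python
   # Example implementation guidance.
   # `send_timeout_final` is an illustrative name. `websocket` is assumed to be
   # an open connection exposing an async `send_json` method (for example, a
   # Starlette/FastAPI WebSocket).
   async def send_timeout_final(websocket) -> None:
       """Send the terminal `final` event; await this before closing the socket."""
       await websocket.send_json({
           "type": "final",
           "status": "timeout",
           "error": {
               "code": "AGENT_TIMEOUT",
               "message": "Agent execution timed out waiting for LLM response",
           },
       })
   ```
3. **Add timeout event to WebSocket protocol**: Ensure the WebSocket handler in the gateway recognizes timeout events and propagates them to connected clients.
4. **UI timeout handling**: Verify that the Web UI correctly handles the `final` event with `status: "timeout"` and displays an appropriate error message with a retry option.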
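Steps 1 and 2 can be combined into a single timeout wrapper. The sketch below is one possible shape, assuming an asyncio-based agent. `RecordingSocket`, `run_with_timeout`, and `slow_llm_call` are hypothetical names, and the `"ok"` status for the success path is an assumption, since this report only specifies the `"timeout"` status.

```python
import asyncio
from typing import Any, Awaitable, Callable


class RecordingSocket:
    """Hypothetical stand-in for the UI WebSocket; logs calls in order."""

    def __init__(self) -> None:
        self.log: list[Any] = []

    async def send_json(self, frame: dict) -> None:
        self.log.append(frame)

    async def close(self) -> None:
        self.log.append("closed")


async def run_with_timeout(
    socket: RecordingSocket,
    llm_call: Callable[[], Awaitable[str]],
    timeout_s: float,
) -> str:
    """Run one LLM call and emit a `final` event before the socket closes."""
    try:
        try:
            text = await asyncio.wait_for(llm_call(), timeout=timeout_s)
            # "ok" is an assumed success status, not taken from the report.
            final = {"type": "final", "status": "ok", "text": text}
        except asyncio.TimeoutError:
            final = {
                "type": "final",
                "status": "timeout",
                "error": {
                    "code": "AGENT_TIMEOUT",
                    "message": "Agent execution timed out waiting for LLM response",
                },
            }
        # Delivery is awaited here, so it completes before the `finally` block.
        await socket.send_json(final)
        return final["status"]
    finally:
        await socket.close()


async def slow_llm_call() -> str:
    await asyncio.sleep(1.0)  # simulates a response slower than the threshold
    return "too late"


async def demo() -> list[Any]:
    socket = RecordingSocket()
    await run_with_timeout(socket, slow_llm_call, timeout_s=0.05)
    return socket.log
```

Running `asyncio.run(demo())` yields a log whose first entry is the `final` frame with `status: "timeout"` and whose last entry is `"closed"`, which is the ordering that step 1 asks for.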
## Prevention
To prevent similar issues in the future:
1. **Establish event delivery guarantees**: Implement a protocol where critical events (especially `final` events with any status) must be delivered before connection termination, using proper acknowledgment or flushing mechanisms.
2. **Add integration tests for timeout scenarios**: Create automated tests that verify timeout errors are correctly propagated to the UI across all deployment methods.
3. **Implement graceful connection shutdown**: Ensure WebSocket connections undergo a graceful shutdown sequence that flushes pending events before closing.
4. **Add monitoring for incomplete sessions**: Implement metrics and alerting for sessions that remain in the loading state beyond expected durations.
5. **Document WebSocket event protocol**: Maintain clear documentation of all event types and statuses that the UI should handle, including timeout scenarios.
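As an illustration of the monitoring item, the following sketch tracks runs that have started but not yet received a `final` event and reports those that exceed a threshold. `SessionMonitor` and its method names are hypothetical, and the 120-second threshold in the usage example is an arbitrary assumption, not a value taken from the agent's configuration.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SessionMonitor:
    """Tracks runs that have started but not yet received a `final` event."""

    max_loading_s: float
    _started: dict = field(default_factory=dict)  # session id -> start time

    def run_started(self, session_id: str, now: Optional[float] = None) -> None:
        self._started[session_id] = time.monotonic() if now is None else now

    def final_received(self, session_id: str) -> None:
        # Any `final` status (including "timeout") ends the loading state.
        self._started.pop(session_id, None)

    def stuck_sessions(self, now: Optional[float] = None) -> list:
        """Return ids of sessions loading for longer than the threshold."""
        current = time.monotonic() if now is None else now
        return sorted(
            sid
            for sid, started in self._started.items()
            if current - started > self.max_loading_s
        )


# Usage with explicit timestamps so the behavior is deterministic.
monitor = SessionMonitor(max_loading_s=120.0)
monitor.run_started("session-a", now=0.0)
monitor.run_started("session-b", now=100.0)
monitor.final_received("session-b")
stuck = monitor.stuck_sessions(now=130.0)  # only "session-a" is still loading
```

The list returned by `stuck_sessions` could feed whichever metrics or alerting system the deployment already uses.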
## Additional Information
**Affected deployment methods**: All deployment methods (Docker, bare metal, etc.) on any OS where LLM response times may exceed the agent timeout threshold.

**Workaround**: Refresh the page to reset the UI state, though this results in loss of conversation context.

**Related components**:
- Agent timeout handler
- WebSocket gateway service
- Web UI event listener

**Suggested debugging steps**:
- Enable WebSocket frame logging to capture all events sent to the UI
- Add timing instrumentation around the timeout detection and connection termination
- Trace the event delivery path from `agent.wait` to WebSocket `send`
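The first two debugging steps can be covered by a thin wrapper around the connection object. This is a sketch only: `FrameLoggingSocket` and `NullSocket` are hypothetical names, and the wrapped object is assumed to expose async `send_json` and `close` methods.

```python
import asyncio
import json
import logging
import time

logger = logging.getLogger("ws.frames")


class FrameLoggingSocket:
    """Wraps a socket-like object and timestamps every send and the close."""

    def __init__(self, inner) -> None:
        self._inner = inner
        self._t0 = time.monotonic()
        self.timeline: list[tuple[float, str]] = []

    def _mark(self, label: str) -> None:
        elapsed = time.monotonic() - self._t0
        self.timeline.append((elapsed, label))
        logger.debug("t=+%.3fs %s", elapsed, label)

    async def send_json(self, frame: dict) -> None:
        self._mark("send " + json.dumps(frame, sort_keys=True))
        await self._inner.send_json(frame)

    async def close(self) -> None:
        self._mark("close")
        await self._inner.close()


class NullSocket:
    """Hypothetical no-op connection used only for this demonstration."""

    async def send_json(self, frame: dict) -> None:
        pass

    async def close(self) -> None:
        pass


async def demo_timeline() -> list[str]:
    socket = FrameLoggingSocket(NullSocket())
    await socket.send_json({"type": "final", "status": "timeout"})
    await socket.close()
    return [label for _, label in socket.timeline]
```

If the recorded timeline for a timed-out run shows `close` with no preceding `send` of a `final` frame, that would support the race condition described in the root cause analysis.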

**Priority**: High. This bug blocks user workflows and makes the UI unusable until the page is refreshed.