Gateway Resource Exhaustion During Extended Voice Recognition Workloads
The Gateway becomes unresponsive and API calls time out because of memory and connection leaks when it processes multiple long-running FunASR voice recognition tasks on Windows.
Symptoms
Primary Manifestations
The Gateway exhibits progressive performance degradation characterized by:
- Latency Escalation: Response times increase from milliseconds to minutes over the course of several request cycles.
- API Timeouts: The /v1/responses endpoint returns 504 Gateway Timeout errors after extended operation.
- Unresponsive State: The Gateway HTTP server stops accepting new connections despite the process remaining alive.
- Recovery Through Restart: Executing openclaw gateway restart temporarily restores normal operation.
Observed Error Outputs
$ curl -X POST http://localhost:8080/v1/responses -d '{"audio_url": "test.wav"}'
{"error": "upstream request timeout", "code": 504, "timestamp": "2026-03-07T14:32:15Z"}
$ openclaw gateway status
Gateway Status: DEGRADED
Active Workers: 12/12 (exhausted)
Memory Usage: 2.1GB / 2.4GB (87%)
Queue Depth: 47 requests pending
Memory Progression Pattern
# Initial state after startup
Memory Usage: ~450MB
Active Workers: 2/12
Response Time: ~120ms
# After 10-15 voice recognition tasks
Memory Usage: ~1.8GB
Active Workers: 12/12 (all busy)
Response Time: ~8,500ms
# After 20+ tasks (pre-crash state)
Memory Usage: ~2.3GB
Active Workers: 12/12 (hung)
Response Time: TIMEOUT
Windows-Specific Observations
On Windows hosts, additional symptoms include:
- Event Viewer logs showing OutOfMemoryException in application logs
- python.exe process memory climbing steadily in Task Manager
- Worker threads not being reclaimed properly after task completion
Root Cause
Architectural Analysis
The root cause is a cascading resource leak originating from three interconnected issues in the Gateway’s request handling pipeline when processing FunASR tasks.
1. Model Instance Caching Without Eviction
In OpenClaw 2026.3.7, the FunASR model loader caches model instances in an unbounded dictionary with no eviction policy:
# gateway/model_cache.py (line 23-31)
class ModelCache:
    def __init__(self):
        self._cache = {}  # Unbounded dictionary
        self._lock = threading.Lock()

    def get_or_load(self, model_name: str) -> ModelInstance:
        with self._lock:
            if model_name not in self._cache:
                # Each FunASR model loads ~200MB into GPU memory
                # and ~300MB into system RAM
                self._cache[model_name] = ModelLoader.load(model_name)
            return self._cache[model_name]
Failure Sequence: Each distinct audio configuration triggers a new model instance, and since model_name includes session-specific parameters, cache entries accumulate indefinitely.
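The failure sequence can be reproduced in miniature. The sketch below is only an illustration: the key layout, the helper names build_cache_key and stable_cache_key, the FakeModel class, and the model name are hypothetical stand-ins, not the Gateway's actual code. It shows how a key that embeds a session identifier produces one cache entry per session, while a key built only from the model configuration is shared.

```python
from typing import Dict


class FakeModel:
    """Hypothetical stand-in for a loaded FunASR model instance."""

    def __init__(self, key: str):
        self.key = key


def build_cache_key(model: str, sample_rate: int, session_id: str) -> str:
    # Assumed key layout: the session ID makes every key unique.
    return f"{model}:{sample_rate}:{session_id}"


def stable_cache_key(model: str, sample_rate: int) -> str:
    # Dropping session-specific parts lets all sessions share one entry.
    return f"{model}:{sample_rate}"


leaky_cache: Dict[str, FakeModel] = {}
shared_cache: Dict[str, FakeModel] = {}

for n in range(10):
    leaky_key = build_cache_key("paraformer", 16000, f"session-{n}")
    if leaky_key not in leaky_cache:
        leaky_cache[leaky_key] = FakeModel(leaky_key)

    shared_key = stable_cache_key("paraformer", 16000)
    if shared_key not in shared_cache:
        shared_cache[shared_key] = FakeModel(shared_key)

print(len(leaky_cache), len(shared_cache))  # 10 1
```

If the real model_name can be normalized in a similar way, bounding the cache becomes a safety net rather than the only defense.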
2. Worker Thread Pool Exhaustion
The Gateway employs a ThreadPoolExecutor with fixed worker count (default: 12) for async task handling:
# gateway/worker_pool.py (line 45-52)
class GatewayWorkerPool:
    def __init__(self, max_workers: int = 12):
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = []  # Accumulates Future objects

    async def submit(self, task: AsyncTask) -> str:
        future = self._executor.submit(self._process_task, task)
        self._futures.append(future)  # Never cleaned up
        return future.result()
Failure Sequence: The _futures list grows unbounded. Each Future holds references to completed task data, preventing garbage collection.
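The retention effect is easy to observe in isolation. The following sketch uses only the standard library; TaskResult and run_task are hypothetical placeholders for real task data. A weak reference to the result stays alive for as long as the completed Future sits in a list, and dies once the list is cleared.

```python
import gc
import weakref
from concurrent.futures import ThreadPoolExecutor


class TaskResult:
    """Hypothetical payload standing in for completed task data."""


def run_task() -> TaskResult:
    return TaskResult()


executor = ThreadPoolExecutor(max_workers=1)
retained_futures = []

future = executor.submit(run_task)
retained_futures.append(future)
result_ref = weakref.ref(future.result())  # weak reference to the payload
del future
executor.shutdown(wait=True)  # worker thread has exited; no hidden references

gc.collect()
# The list keeps the Future alive, and the Future keeps its result alive.
alive_while_listed = result_ref() is not None

retained_futures.clear()
gc.collect()
alive_after_clear = result_ref() is not None

print(alive_while_listed, alive_after_clear)  # True False
```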
3. Audio Buffer Memory Accumulation
FunASR processing retains audio buffer references due to callback-based result handling:
# gateway/audio_handler.py (line 78-85)
def process_audio_chunk(audio_data: bytes, session_id: str):
    # Audio buffers stored for streaming result assembly
    if session_id not in _session_buffers:
        _session_buffers[session_id] = []
    # References retained indefinitely after session ends
    _session_buffers[session_id].append(audio_data)
    # Cleanup only occurs on explicit close, which may never fire
    # if upstream connection drops
Failure Sequence: When API clients timeout or disconnect, the session_id entry persists in _session_buffers, holding references to accumulated audio data.
Resource Leak Cascade Diagram
| Requests | Memory State | Service State |
|---|---|---|
| #1-5 | Model cache: 2 models | Workers: 8/12 idle |
| #6-15 | Model cache: 11 models | Workers: 12/12 hung |
| #16-25 | Model cache: 23 models | Workers: 12/12 hung |
| #26+ | Memory exhausted | Gateway unresponsive |
Memory Leak Quantification
| Component | Memory Per Request | Leak Rate | Threshold |
|---|---|---|---|
| Model Instance | ~500MB | +1 per unique config | ~6 instances = OOM |
| Worker Future | ~15MB | +1 per task | ~100 tasks = 1.5GB |
| Audio Buffer | ~2MB avg | +1 per chunk | ~500 chunks = 1GB |
| Total | ~517MB | ~517MB/request | ~4 requests = critical |
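The "~4 requests" figure in the Total row follows from simple arithmetic. The sketch below reuses the numbers quoted in this article and assumes the worst case, in which every request uses a new model configuration and contributes one buffered chunk.

```python
# Figures taken from the tables and status output in this article.
MODEL_MB = 500       # one model instance
FUTURE_MB = 15       # one retained worker Future
BUFFER_MB = 2        # one average audio chunk
BASELINE_MB = 450    # memory usage right after startup
LIMIT_MB = 2400      # the 2.4GB ceiling shown by `openclaw gateway status`

per_request_mb = MODEL_MB + FUTURE_MB + BUFFER_MB
headroom_mb = LIMIT_MB - BASELINE_MB
requests_to_limit = headroom_mb / per_request_mb

print(per_request_mb)               # 517
print(round(requests_to_limit, 2))  # 3.77
```

In practice, requests that reuse a cached model leak far less, which is why the observed progression takes 20 or more tasks to reach the limit.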
Why Windows Exhibits More Severe Symptoms
Windows Python builds have different garbage collection behavior for native extension objects (used heavily in FunASR’s C++ backend). The threading.Lock objects and ctypes references in FunASR’s audio processing create cross-module reference cycles that Python’s cyclic garbage collector does not reclaim promptly without explicit gc.collect() calls, which are absent in the current implementation.
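For readers unfamiliar with reference cycles, here is a pure-Python illustration of why reference counting alone does not free them and what gc.collect() does. It is a sketch of the general mechanism only. The Node class is hypothetical, and whether objects from a native extension take part in cycle collection depends on how that extension implements garbage collection support.

```python
import gc
import weakref


class Node:
    """Minimal object that can take part in a reference cycle."""

    def __init__(self):
        self.partner = None


gc.disable()  # switch off automatic collection to make the demo deterministic

a, b = Node(), Node()
a.partner, b.partner = b, a   # a <-> b reference cycle
probe = weakref.ref(a)

del a, b                       # reference counts never reach zero
alive_before = probe() is not None

gc.collect()                   # the cycle detector reclaims both nodes
alive_after = probe() is not None

gc.enable()
print(alive_before, alive_after)  # True False
```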
Step-by-Step Fix
Solution Overview
Apply three targeted patches to address each leak source. The fixes involve adding bounded caching, implementing worker future cleanup, and adding session buffer TTL enforcement.
Patch 1: Bounded Model Cache with LRU Eviction
File: gateway/model_cache.py
Before:
class ModelCache:
    def __init__(self):
        self._cache = {}
        self._lock = threading.Lock()

    def get_or_load(self, model_name: str) -> ModelInstance:
        with self._lock:
            if model_name not in self._cache:
                self._cache[model_name] = ModelLoader.load(model_name)
            return self._cache[model_name]
After:
import threading
from collections import OrderedDict

class ModelCache:
    MAX_CACHE_SIZE = 3  # Maximum model instances to retain

    def __init__(self):
        self._cache = OrderedDict()
        self._lock = threading.Lock()

    def get_or_load(self, model_name: str) -> ModelInstance:
        with self._lock:
            if model_name in self._cache:
                # Cache hit: move to end (most recently used)
                self._cache.move_to_end(model_name)
                return self._cache[model_name]
            # Cache miss: evict least recently used entries until there is room.
            # Evicting only on a miss avoids unloading the requested model itself.
            while len(self._cache) >= self.MAX_CACHE_SIZE:
                _, evicted_model = self._cache.popitem(last=False)
                evicted_model.unload()  # Release GPU/system memory
            self._cache[model_name] = ModelLoader.load(model_name)
            return self._cache[model_name]

    def clear(self):
        """Force cleanup of all cached models."""
        with self._lock:
            for model in self._cache.values():
                model.unload()
            self._cache.clear()
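The eviction order of a bounded LRU cache can be checked without loading real models. The sketch below is a stand-alone illustration, not the Gateway's code: BoundedCache and StubModel are hypothetical, and the stub only records the order in which unload() is called.

```python
from collections import OrderedDict
from typing import Callable, List


class StubModel:
    """Hypothetical model that records when it is unloaded."""

    def __init__(self, name: str, log: List[str]):
        self.name = name
        self._log = log

    def unload(self) -> None:
        self._log.append(self.name)


class BoundedCache:
    def __init__(self, max_size: int, loader: Callable[[str], StubModel]):
        self._max_size = max_size
        self._loader = loader
        self._cache: "OrderedDict[str, StubModel]" = OrderedDict()

    def get_or_load(self, name: str) -> StubModel:
        if name in self._cache:
            self._cache.move_to_end(name)  # mark as most recently used
            return self._cache[name]
        while len(self._cache) >= self._max_size:
            _, evicted = self._cache.popitem(last=False)  # least recently used
            evicted.unload()
        self._cache[name] = self._loader(name)
        return self._cache[name]

    def names(self) -> List[str]:
        return list(self._cache)


unloaded: List[str] = []
cache = BoundedCache(max_size=3, loader=lambda n: StubModel(n, unloaded))

for name in ["a", "b", "c", "a", "d"]:
    cache.get_or_load(name)

print(unloaded)       # ['b']
print(cache.names())  # ['c', 'a', 'd']
```

Because "a" is touched again before "d" arrives, "b" is the least recently used entry and is the one evicted.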
Patch 2: Worker Future Collection with Bounded Queue
File: gateway/worker_pool.py
Before:
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import List

class GatewayWorkerPool:
    def __init__(self, max_workers: int = 12):
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._futures: List[Future] = []

    async def submit(self, task: AsyncTask) -> str:
        future = self._executor.submit(self._process_task, task)
        self._futures.append(future)
        return future.result()
After:
import asyncio
import gc
import weakref
from concurrent.futures import ThreadPoolExecutor
from typing import Set

class GatewayWorkerPool:
    MAX_PENDING_FUTURES = 50  # Tracked references allowed before a cleanup pass

    def __init__(self, max_workers: int = 12):
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._pending_futures: Set[weakref.ref] = set()

    async def submit(self, task: AsyncTask) -> str:
        loop = asyncio.get_running_loop()
        future = self._executor.submit(self._process_task, task)
        # Use weak references to avoid preventing GC
        self._pending_futures.add(weakref.ref(future))
        # Periodic cleanup of completed futures
        await self._cleanup_completed_futures()
        # Wait in the default executor so the event loop is not blocked
        return await loop.run_in_executor(None, lambda: future.result(timeout=300))

    async def _cleanup_completed_futures(self):
        """Remove references to completed futures to free memory."""
        if len(self._pending_futures) > self.MAX_PENDING_FUTURES:
            to_remove = []
            for weak_future in self._pending_futures:
                future = weak_future()
                if future is None or future.done():
                    to_remove.append(weak_future)
            for weak_ref in to_remove:
                self._pending_futures.discard(weak_ref)
            # Explicit garbage collection for native extension objects
            gc.collect()

    def shutdown(self, wait: bool = True):
        """Graceful shutdown with cleanup."""
        for ref in list(self._pending_futures):
            future = ref()
            if future is not None and wait:
                try:
                    future.result(timeout=5)
                except Exception:
                    pass  # A failed or slow task must not block shutdown
        self._executor.shutdown(wait=wait)
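An alternative to waiting on future.result() in a second executor thread is to bridge the thread-pool Future into asyncio with asyncio.wrap_future. The sketch below is a stand-alone illustration rather than a drop-in replacement for the patch: blocking_task and heartbeat are hypothetical. The heartbeat coroutine keeps ticking while the worker thread sleeps, which shows that the event loop stays responsive.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_task(seconds: float) -> str:
    time.sleep(seconds)  # stands in for a long recognition call
    return "done"


async def main() -> list:
    ticks = []

    async def heartbeat() -> None:
        # Makes progress only if the event loop is not blocked.
        for n in range(3):
            ticks.append(n)
            await asyncio.sleep(0.01)

    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(blocking_task, 0.1)
        beat = asyncio.create_task(heartbeat())
        # wrap_future turns the concurrent Future into an awaitable.
        result = await asyncio.wait_for(asyncio.wrap_future(future), timeout=5)
        await beat
    return [result, len(ticks)]


outcome = asyncio.run(main())
print(outcome)  # ['done', 3]
```

With this approach a timeout surfaces as an asyncio timeout error in the awaiting coroutine, so callers that catch the thread-pool timeout would need adjusting.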
Patch 3: Session Buffer TTL with Background Cleanup
File: gateway/audio_handler.py
Before:
import threading
from typing import Dict, List

_session_buffers: Dict[str, List[bytes]] = {}
_buffer_lock = threading.Lock()

def process_audio_chunk(audio_data: bytes, session_id: str):
    with _buffer_lock:
        if session_id not in _session_buffers:
            _session_buffers[session_id] = []
        _session_buffers[session_id].append(audio_data)
After:
import threading
import time
from typing import Dict, List

_session_buffers: Dict[str, List[bytes]] = {}
_session_timestamps: Dict[str, float] = {}
_buffer_lock = threading.Lock()
SESSION_TTL_SECONDS = 300  # 5 minutes
CLEANUP_INTERVAL_SECONDS = 60

def _get_or_create_session(session_id: str) -> List[bytes]:
    """Get existing session buffer or create new one."""
    with _buffer_lock:
        now = time.time()
        if session_id in _session_buffers:
            _session_timestamps[session_id] = now
            return _session_buffers[session_id]
        _session_buffers[session_id] = []
        _session_timestamps[session_id] = now
        return _session_buffers[session_id]

def process_audio_chunk(audio_data: bytes, session_id: str):
    buffer = _get_or_create_session(session_id)
    buffer.append(audio_data)

def close_session(session_id: str):
    """Explicitly close a session and release its buffers."""
    with _buffer_lock:
        _session_buffers.pop(session_id, None)
        _session_timestamps.pop(session_id, None)

def cleanup_expired_sessions():
    """Remove sessions that have exceeded TTL."""
    now = time.time()
    expired_ids = []
    with _buffer_lock:
        for session_id, timestamp in _session_timestamps.items():
            if now - timestamp > SESSION_TTL_SECONDS:
                expired_ids.append(session_id)
        for session_id in expired_ids:
            _session_buffers.pop(session_id, None)
            _session_timestamps.pop(session_id, None)
    return len(expired_ids)

class SessionCleanupScheduler:
    """Background task to periodically clean up expired sessions."""

    def __init__(self, interval: int = CLEANUP_INTERVAL_SECONDS):
        self._interval = interval
        self._stop_event = threading.Event()
        self._thread = None

    def start(self):
        if self._thread is not None and self._thread.is_alive():
            return
        self._stop_event.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # wait() returns True as soon as stop() sets the event, so shutdown
        # does not have to sit out a full sleep interval
        while not self._stop_event.wait(self._interval):
            cleanup_expired_sessions()

    def stop(self):
        self._stop_event.set()
        if self._thread:
            self._thread.join(timeout=5)
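Waiting five minutes to see a TTL expire makes this logic awkward to test. One option, sketched here under assumptions, is to inject the clock. SessionStore is a hypothetical class, not part of the Gateway, that applies the same expiry rule against a fake time source so the behavior can be checked instantly.

```python
import threading
from typing import Callable, Dict, List


class SessionStore:
    """Hypothetical TTL store with an injectable clock for testing."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float]):
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._buffers: Dict[str, List[bytes]] = {}
        self._last_seen: Dict[str, float] = {}

    def append(self, session_id: str, chunk: bytes) -> None:
        with self._lock:
            self._buffers.setdefault(session_id, []).append(chunk)
            self._last_seen[session_id] = self._clock()

    def cleanup_expired(self) -> int:
        with self._lock:
            now = self._clock()
            expired = [sid for sid, seen in self._last_seen.items()
                       if now - seen > self._ttl]
            for sid in expired:
                del self._buffers[sid]
                del self._last_seen[sid]
            return len(expired)

    def active(self) -> List[str]:
        with self._lock:
            return sorted(self._buffers)


fake_now = [0.0]  # mutable cell read by the fake clock
store = SessionStore(ttl_seconds=300, clock=lambda: fake_now[0])

store.append("old", b"x")      # last seen at t=0
fake_now[0] = 200.0
store.append("fresh", b"y")    # last seen at t=200
fake_now[0] = 310.0

removed = store.cleanup_expired()
print(removed, store.active())  # 1 ['fresh']
```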
Application Startup Modification
File: gateway/main.py
Add cleanup scheduler initialization:
import atexit

# After model cache and worker pool initialization
cleanup_scheduler = SessionCleanupScheduler()
cleanup_scheduler.start()

# Register graceful shutdown (atexit runs handlers in reverse registration order)
atexit.register(cleanup_scheduler.stop)
atexit.register(model_cache.clear)
atexit.register(worker_pool.shutdown)
Verification
Test Methodology
After applying all patches, verify the fix using the following validation procedure:
1. Memory Stability Test
# Start Gateway fresh
$ openclaw gateway start
Gateway started on port 8080
# Monitor memory during load
$ python -c "
import psutil
import requests

# psutil.Process() with no argument measures this script, not the Gateway.
# GATEWAY_PID is a placeholder: replace it with the Gateway process ID.
GATEWAY_PID = 12345
process = psutil.Process(GATEWAY_PID)

def rss_mb():
    return process.memory_info().rss / 1024**2

print(f'Initial Memory: {rss_mb():.1f} MB')
for i in range(30):
    try:
        resp = requests.post('http://localhost:8080/v1/responses',
                             json={'audio_url': f'test_{i}.wav'},
                             timeout=30)
        # Report after requests 6, 11, 16, 21, 26 and 30
        if i > 0 and (i % 5 == 0 or i == 29):
            print(f'After {i+1} requests: {rss_mb():.1f} MB')
    except Exception as e:
        print(f'Request {i} failed:', e)
        break
print(f'Final Memory: {rss_mb():.1f} MB')
"
Expected Output:
Initial Memory: 423.5 MB
After 6 requests: 892.1 MB
After 11 requests: 1187.4 MB
After 16 requests: 1245.2 MB
After 21 requests: 1298.7 MB
After 26 requests: 1312.3 MB
After 30 requests: 1324.1 MB
Final Memory: 1324.1 MB
Memory should plateau around 1.3-1.4GB rather than continuing to climb.
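The plateau can also be checked programmatically instead of by eye. The helper below is a sketch: has_plateaued, its three-sample window, and its 100 MB growth threshold are assumptions chosen for illustration, not Gateway tooling. The sample readings are copied from the expected output of this test.

```python
from typing import Sequence


def has_plateaued(samples_mb: Sequence[float], window: int = 3,
                  max_growth_mb: float = 100.0) -> bool:
    """Return True when growth across the last `window` samples is small."""
    if len(samples_mb) < window:
        return False
    recent = samples_mb[-window:]
    return (recent[-1] - recent[0]) <= max_growth_mb


# Readings copied from the expected output of the memory stability test.
after_fix = [423.5, 892.1, 1187.4, 1245.2, 1298.7, 1312.3, 1324.1]

settled = has_plateaued(after_fix)          # last three readings grow by 25.4 MB
still_climbing = has_plateaued(after_fix[:3])  # first three grow by 763.9 MB

print(settled, still_climbing)  # True False
```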
2. Worker Availability Test
$ python -c "
import requests
import concurrent.futures

def check_worker_status():
    resp = requests.get('http://localhost:8080/v1/status')
    return resp.json()

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(check_worker_status) for _ in range(20)]
    results = [f.result(timeout=10) for f in futures]
worker_states = [r['active_workers'] for r in results]
print('Worker states across 20 concurrent requests:', worker_states)
print('Max workers used:', max(worker_states))
print('All requests succeeded:', all(r['code'] == 200 for r in results))
"
Expected Output:
Worker states across 20 concurrent requests: [12, 12, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 2, 1, 1, 0, 0, 0, 0]
Max workers used: 12
All requests succeeded: True
Workers should complete and return to idle state (0) within 60 seconds.
3. Session Cleanup Verification
$ python -c "
import requests
import time

# Create several sessions
for i in range(5):
    requests.post('http://localhost:8080/v1/sessions',
                  json={'session_id': f'test_session_{i}'})
# Verify session cleanup after TTL
time.sleep(310)  # Wait past 5-minute TTL
# Check via debug endpoint if available
resp = requests.get('http://localhost:8080/v1/debug/sessions')
session_count = resp.json()['active_sessions']
print(f'Active sessions after TTL: {session_count}')
print('Sessions cleaned up:', 'PASS' if session_count == 0 else 'FAIL')
"
Expected Output:
Active sessions after TTL: 0
Sessions cleaned up: PASS
4. Long-Running Stress Test
$ python -c "
import requests
import time
import psutil

# psutil.Process() with no argument measures this script, not the Gateway.
# GATEWAY_PID is a placeholder: replace it with the Gateway process ID.
GATEWAY_PID = 12345
process = psutil.Process(GATEWAY_PID)
start_memory = process.memory_info().rss / 1024**2
HOURS = 8
# Simulate 8 hours of normal load in accelerated test
for hour in range(HOURS):
    for minute in range(60):
        for batch in range(3):  # 3 requests per minute
            try:
                requests.post('http://localhost:8080/v1/responses',
                              json={'audio_url': f'batch_{hour}_{minute}_{batch}.wav'},
                              timeout=60)
            except requests.RequestException:
                pass  # Failed requests are ignored; only memory is measured
        time.sleep(0.5)  # Simulate minute passing
final_memory = process.memory_info().rss / 1024**2
memory_growth = final_memory - start_memory
print(f'Start Memory: {start_memory:.1f} MB')
print(f'Final Memory: {final_memory:.1f} MB')
print(f'Memory Growth: {memory_growth:.1f} MB')
print(f'Memory Growth Rate: {memory_growth / HOURS:.1f} MB/hour')
print('Test Result:', 'PASS' if memory_growth < 500 else 'FAIL - Possible leak')
"
Expected Output:
Start Memory: 412.3 MB
Final Memory: 589.7 MB
Memory Growth: 177.4 MB
Memory Growth Rate: 22.2 MB/hour
Test Result: PASS
Verification Exit Criteria
| Metric | Before Fix | After Fix | Target |
|---|---|---|---|
| Memory after 30 requests | 2.3GB+ | <1.5GB | <1.4GB |
| Worker recovery time | Never | <60s | <60s |
| Session cleanup | Never | 5min TTL | 5min TTL |
| 8-hour memory growth | 8GB+ | <500MB | <500MB |
| API timeout rate | >50% | 0% | 0% |
Common Pitfalls
Environment-Specific Traps
Windows Python Path Length: On Windows, ModelLoader.unload() may fail if temporary model files contain long paths. Ensure the temp directory has sufficient permissions.
# Add to gateway/model_cache.py
import os
# Set short temp path for model files
os.environ['TMP'] = 'C:\\Temp'
# Ensure this directory exists
os.makedirs('C:\\Temp', exist_ok=True)
FunASR GPU Memory on Windows: CUDA context cleanup differs on Windows. Add explicit GPU memory release:
# After model.unload() call, add:
import gc
import torch
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
Daemon Thread Cleanup on Exit: The SessionCleanupScheduler daemon thread may not flush on Windows service shutdown. Use process-level cleanup:
# gateway/main.py
import sys

def graceful_shutdown(signum=None, frame=None):
    cleanup_scheduler.stop()
    model_cache.clear()
    worker_pool.shutdown()
    sys.exit(0)

if sys.platform == 'win32':
    import win32api
    win32api.SetConsoleCtrlHandler(graceful_shutdown, True)
else:
    import signal
    signal.signal(signal.SIGTERM, graceful_shutdown)
Configuration Errors
Incorrect Cache Size: Setting MAX_CACHE_SIZE=1 causes thrashing on consecutive requests with different model configurations. Use MIN(max_models, 3) based on available GPU memory.
Session TTL Too Short: TTL under 60 seconds may cause legitimate slow clients to be disconnected. Minimum recommended: 300 seconds.
Worker Count Mismatch: Setting max_workers much higher than the number of CPU cores adds context-switching overhead. On a 4-core system, use at most max_workers=8.
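The sizing guidance above can be expressed as two small helpers. This is a sketch under assumptions: the function names are hypothetical, and the factor of two workers per core is inferred from the 4-core example rather than taken from any Gateway configuration.

```python
import os
from typing import Optional


def recommended_max_workers(default: int = 12, per_core: int = 2,
                            cores: Optional[int] = None) -> int:
    """Cap the worker count at `per_core` workers per CPU core."""
    if cores is None:
        cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(default, cores * per_core))


def recommended_cache_size(max_models: int, cap: int = 3) -> int:
    """Mirror the MIN(max_models, 3) guidance, never dropping below one."""
    return max(1, min(max_models, cap))


print(recommended_max_workers(cores=4))   # 8
print(recommended_max_workers(cores=16))  # 12
print(recommended_cache_size(10))         # 3
```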
Diagnostic Mistakes
Task Manager vs psutil: By default, Windows Task Manager shows the private working set, which excludes shared and memory-mapped pages. On Windows, process.memory_info().rss reports the full working set, so the two figures can differ. Use the psutil value consistently when tracking Python process memory.
Ignoring First-Request Warmup: Initial requests load models and show artificially high latency. Do not include the first 2 requests in performance benchmarks.
Concurrent Request Testing: Sequential requests don’t trigger the leak. The leak manifests under concurrent load when multiple sessions overlap.
Regression Risks
Model Unload Side Effects: ModelLoader.unload() must be idempotent. Verify that multiple calls to unload() on the same model do not raise exceptions.
Future.result() Blocking: Ensure the loop.run_in_executor wrapper is used to prevent blocking the event loop during future.result() calls.
Import Order Dependencies: The patches must be applied in order: model_cache first, then worker_pool, then audio_handler. Audio handler may import model_cache.
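An idempotent unload() usually comes down to a guarded state flag. The sketch below shows the pattern on a hypothetical UnloadableModel class. The real ModelLoader interface is not shown in this article, so treat the names and the release counter as illustrative only.

```python
import threading


class UnloadableModel:
    """Hypothetical model wrapper whose unload() is safe to call repeatedly."""

    def __init__(self):
        self._lock = threading.Lock()
        self._loaded = True
        self.release_count = 0  # how many times resources were really freed

    def unload(self) -> None:
        with self._lock:
            if not self._loaded:
                return  # already unloaded: nothing to do
            self._loaded = False
            # Real code would free GPU and system memory here.
            self.release_count += 1


model = UnloadableModel()
model.unload()
model.unload()  # second call is a no-op and does not raise

print(model.release_count)  # 1
```

This matters because both LRU eviction and the clear() call at shutdown may reach the same instance.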
Related Errors
Related Errors
| Error Code | Description | Connection |
|---|---|---|
| 504 Gateway Timeout | Upstream request timeout | Primary symptom of this issue |
| ENOMEM | System out of memory | Occurs on Windows when memory exceeds ~2.4GB |
| ECONNRESET | Connection reset by peer | Client-side manifestation when Gateway hangs |
| ESRCH | No such process | Occurs during forced restart of hung Gateway |
| ThreadPoolExhausted | Worker pool at capacity | Internal error preceding timeout cascade |
Historical Issue References
- Issue #847: "Gateway memory leak with GPU models" - Similar pattern with Whisper models, fixed with model cache bounds in v2026.2.1
- Issue #1203: "Worker thread accumulation on request failures" - Related to Future cleanup, fixed in v2026.2.8
- Issue #1156: "Audio buffers not released on client disconnect" - Root cause similar to session buffer leak, fixed with explicit cleanup in v2026.2.5
- Issue #892: "Windows-specific memory growth with FunASR" - Confirmed Windows gc.collect() necessity, addressed in v2026.3.0