April 29, 2026 • Version: 2026.5.7

BlueBubbles Channel Hangs in start-account Phase After Plugin Config Hot-Reload

Hot-reloading plugins.entries.* config causes the BlueBubbles channel to deadlock in the start-account phase. The webhook endpoint keeps responding with HTTP 200 while silently discarding all inbound messages.

🔍 Symptoms

Externally Appearing Healthy

To external monitoring, the gateway process appears fully operational:

# Check service status
$ systemctl is-active openclaw-gateway
active

# Verify TCP listener binding
$ ss -tlnp | grep -E '(8080|8443)'
LISTEN 0 511  0.0.0.0:8080  0.0.0.0:*  users:(("node",pid=1337,fd=20))

# Test webhook endpoint - returns 200 immediately
$ curl -X POST https://gateway.example.com/bluebubbles-webhook \
  -H "Content-Type: application/json" \
  -d '{"test": true}' \
  -w "\nHTTP_CODE: %{http_code}\nTIME: %{time_total}s\n"
HTTP/1.1 200 OK
HTTP_CODE: 200
TIME: 0.002s

The suspiciously fast 2-5ms response time is a key indicator: legitimate BlueBubbles webhook processing typically exhibits 50-200ms latency due to signature validation and dispatch overhead.
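The latency heuristic can be encoded in a synthetic probe. The sketch below is illustrative only: `classifyWebhookLatency` is a hypothetical helper, and its thresholds (under 10 ms is suspicious, 50 ms and above is normal) are assumptions derived from the figures quoted in this guide, not values taken from the gateway.

```javascript
// Hypothetical helper; thresholds are assumptions based on the figures in this guide.
function classifyWebhookLatency(statusCode, elapsedMs) {
  if (statusCode !== 200) return "error";
  // A 200 that returns in a few milliseconds probably skipped validation and dispatch.
  if (elapsedMs < 10) return "suspicious-fast";
  // 50 ms and above is consistent with signature validation plus dispatch.
  if (elapsedMs >= 50) return "normal";
  return "inconclusive";
}

console.log(classifyWebhookLatency(200, 2));   // suspicious-fast
console.log(classifyWebhookLatency(200, 120)); // normal
```

A probe that reports "suspicious-fast" is a hint, not proof; confirm against the liveness phase before acting.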

Internally Stuck State

Gateway logs stop completely once the startup sequence finishes:

[default] starting provider (webhook=/bluebubbles-webhook)
[default] BlueBubbles server macOS 26.3.1
[default] BlueBubbles Private API enabled
[default] BlueBubbles webhook listening on /bluebubbles-webhook
[default] BlueBubbles catchup: replayed=0 fetched=0 window_ms=5000
[cron] started

# ... silence for hours ...

Liveness Diagnostic Signature

When polling /debug/liveness or reviewing metrics, the following pattern is diagnostic:

liveness warning: reasons=event_loop_delay interval=30s
  eventLoopDelayP99Ms=21.3 eventLoopDelayMaxMs=5964.3
  cpuCoreRatio=0.094 active=1 waiting=0 queued=1
  phase=channels.bluebubbles.start-account
  recentPhases=sidecars.restart-sentinel:0ms,
    sidecars.subagent-recovery:13ms,
    sidecars.main-session-recovery:8ms,
    post-attach.update-sentinel:0ms,
    sidecars.session-locks:61235ms,
    post-ready.maintenance:759ms

Critical indicators:

  • phase=channels.bluebubbles.start-account (stuck; never transitions to running or ready)
  • sidecars.session-locks:61235ms (60+ seconds spent acquiring session locks during hot-reload)
  • eventLoopDelayMaxMs=5964.3 (event loop blocked for nearly 6 seconds, consistent with lock contention)
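These indicators can be checked mechanically. The following is a minimal sketch that parses a liveness warning in the format of the sample above; `parseLivenessWarning` and `isStuckInStartAccount` are hypothetical names, and the 5000 ms lock threshold is borrowed from the alert thresholds listed later in this guide.

```javascript
// Parse a liveness warning (format copied from the log sample above) into fields.
function parseLivenessWarning(text) {
  const flat = text.replace(/\s+/g, " "); // join the wrapped lines
  const num = (key) => {
    const m = flat.match(new RegExp(key + "=([0-9.]+)"));
    return m ? Number(m[1]) : null;
  };
  const phase = (flat.match(/ phase=(\S+)/) || [])[1] || null;
  const recentRaw = (flat.match(/recentPhases=(.*)$/) || [])[1] || "";
  const recentPhases = {};
  for (const entry of recentRaw.split(",")) {
    const m = entry.trim().match(/^(.+):(\d+)ms$/);
    if (m) recentPhases[m[1]] = Number(m[2]);
  }
  return { phase, eventLoopDelayMaxMs: num("eventLoopDelayMaxMs"), recentPhases };
}

// Assumed rule: a start-account phase combined with a long session-lock phase.
function isStuckInStartAccount(info) {
  const lockMs = info.recentPhases["sidecars.session-locks"] ?? 0;
  return /\.start-account$/.test(info.phase ?? "") && lockMs > 5000;
}

const sample = `liveness warning: reasons=event_loop_delay interval=30s
  eventLoopDelayP99Ms=21.3 eventLoopDelayMaxMs=5964.3
  cpuCoreRatio=0.094 active=1 waiting=0 queued=1
  phase=channels.bluebubbles.start-account
  recentPhases=sidecars.restart-sentinel:0ms,
    sidecars.session-locks:61235ms,
    post-ready.maintenance:759ms`;

const info = parseLivenessWarning(sample);
console.log(info.phase, isStuckInStartAccount(info));
```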

Webhook Processing Blackhole

Debug-level logging (logging.level: debug) reveals the handler never fires:

# Expected but missing:
$ grep "webhook received" /var/log/openclaw/gateway.log
# No entries appear despite mux-side logs confirming forwards land successfully

# Expected but missing:
$ grep "webhook accepted" /var/log/openclaw/gateway.log
# Never logged

The HTTP listener accepts requests and returns 200, but the message-handling pipeline never receives dispatch.

🧠 Root Cause

Architectural Failure: Session-Lock Contention During Hot-Reload

The root cause is a deadlock condition induced by the plugin hot-reload path attempting in-place re-initialization of the BlueBubbles channel while holding session locks.

Failure Sequence

1. Normal Operation (Pre-Hot-Reload)
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│ BlueBubbles     │────▶│ Session Lock     │────▶│ Message Handler │
│ Webhook Handler │     │ (acquired)       │     │ Pipeline        │
└─────────────────┘     └──────────────────┘     └─────────────────┘

2. Hot-Reload Trigger Event

Config Write: plugins.entries.google.config.webSearch.model
   │
   ▼
Plugin Subsystem detects change
   │
   ▼
Attempts: channels.bluebubbles.reinitialize()
   │
   ▼
Blocks on: acquire session-locks (ALREADY HELD)

3. Deadlock State

┌─────────────────────────┐     ┌─────────────────────────┐
│ Hot-Reload Thread       │     │ Webhook Handler Thread  │
│ (reinitialize)          │     │ (inbound request)       │
├─────────────────────────┤     ├─────────────────────────┤
│ Waiting to acquire      │◀────│ Holds session-lock      │
│ session-lock...         │     │ (blocked releasing)     │
│ (INDEFINITELY)          │     │ (waiting for handler    │
│                         │     │ pipeline to drain)      │
└─────────────────────────┘     └─────────────────────────┘
         ▲                              ▲
         └──────────────────────────────┘
              CIRCULAR WAIT (DEADLOCK)

Technical Deep-Dive

The start-account Phase Problem:

The channels.bluebubbles.start-account phase is designed to perform:

  1. Session establishment with BlueBubbles server
  2. Lock acquisition for account-scoped resources
  3. Webhook registration confirmation

During hot-reload, the re-initialization path enters start-account while the existing handler thread still holds the session lock. The new initialization thread blocks indefinitely waiting for that lock. The old thread cannot release it because it is waiting for the handler pipeline to drain, and draining requires processing webhooks that never reach the pipeline.

Affected Code Path:

// Approximate representation of the stuck code path
async function reinitializeChannel() {
  // 1. Hot-reload trigger received
  await pluginManager.reload(pluginId);
  
  // 2. Channel re-initialization attempts start-account
  const channelState = await blueBubbles.startAccount({
    phoneNumber: config.phoneNumber,
    webhookPath: '/bluebubbles-webhook'
  });
  
  // 3. startAccount acquires session lock - BLOCKS HERE
  await sessionLockManager.acquire({
    scope: 'account',
    identifier: config.phoneNumber,
    timeout: null  // No timeout = indefinite wait
  });
  
  // Never reaches: channelState = 'ready'
}

Lock Acquisition Without Timeout:

The session-lock acquisition call uses timeout: null, meaning it will block indefinitely rather than failing or retrying. This converts a recoverable race condition into a permanent deadlock.
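To illustrate the difference a bounded wait makes, here is a minimal in-process sketch of an async lock whose `acquire` gives up after a deadline. `TimedLock` is a hypothetical class written for this guide; it is not the gateway's `sessionLockManager`, whose internals are not shown in the source.

```javascript
// Minimal in-process async lock with a bounded acquire. Illustrative only.
class TimedLock {
  constructor() {
    this.held = false;
    this.waiters = [];
  }

  // Resolves to true once the lock is acquired, or to false after timeoutMs.
  acquire(timeoutMs) {
    if (!this.held) {
      this.held = true;
      return Promise.resolve(true);
    }
    return new Promise((resolve) => {
      const waiter = { resolve, timer: null };
      waiter.timer = setTimeout(() => {
        // Give up: leave the queue so a later release() skips this waiter.
        this.waiters = this.waiters.filter((w) => w !== waiter);
        resolve(false);
      }, timeoutMs);
      this.waiters.push(waiter);
    });
  }

  release() {
    const next = this.waiters.shift();
    if (next) {
      clearTimeout(next.timer);
      next.resolve(true); // ownership passes directly to the next waiter
    } else {
      this.held = false;
    }
  }
}

async function demo() {
  const lock = new TimedLock();
  await lock.acquire(100);               // first caller gets the lock
  const second = await lock.acquire(50); // holder never releases, so this times out
  return second;                         // false
}

demo().then((acquired) => console.log("second acquire succeeded:", acquired));
```

With an unbounded wait the second caller would hang forever; with the deadline it gets a definite `false` and can fail, retry, or trigger a teardown.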

Why Only a Fraction of Tenants Are Affected:

The race condition timing depends on:

  • Whether an active webhook request is mid-processing at hot-reload moment
  • The specific order of thread scheduling
  • Whether the session-lock release path has a yield point

Approximately 56% of tenants (15 of 27) hit this because webhook traffic happened to be in flight at the moment of the hot-reload.

Environment Factors

Factor                         | Impact
-------------------------------|----------------------------------------
Plugin hot-reload trigger      | Any plugins.entries.* config change
Concurrent webhook traffic     | Increases likelihood of race condition
groupPolicy: open or allowlist | Increases traffic volume
Event loop saturation          | Exacerbates timing sensitivity

🛠️ Step-by-Step Fix

Fix 1: Add a Bounded Timeout to Session-Lock Acquisition

This fix prevents indefinite blocking by adding a bounded timeout to session-lock acquisition during re-initialization.

Before (Stuck Code)

// In lib/channels/bluebubbles/channel-manager.js
async function startAccount(config) {
  // ...
  await sessionLock.acquire({
    scope: 'account',
    identifier: config.phoneNumber
    // timeout missing = infinite wait
  });
  
  // Deadlock occurs here
}

After (Fixed Code)

// In lib/channels/bluebubbles/channel-manager.js
async function startAccount(config, options = {}) {
  const timeout = options.timeout ?? 30000; // 30 second default
  
  const lockAcquired = await sessionLock.acquire({
    scope: 'account',
    identifier: config.phoneNumber,
    timeout: timeout,
    onTimeout: 'fail-fast' // Return error instead of blocking
  });
  
  if (!lockAcquired) {
    const error = new Error(
      `startAccount: session-lock acquisition timeout after ${timeout}ms`
    );
    error.code = 'LOCK_ACQUISITION_TIMEOUT';
    error.context = { phoneNumber: config.phoneNumber };
    throw error;
  }
  
  // Proceed with account startup
  return await completeAccountStartup(config);
}
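Once acquisition can fail fast, the caller needs a policy for the LOCK_ACQUISITION_TIMEOUT error. One option, sketched here under assumptions, is a small retry wrapper with exponential backoff. `startWithRetry` is a hypothetical helper, `startFn` stands in for a call to startAccount, and the attempt count and delays are illustrative.

```javascript
// Retry a start function when it fails with LOCK_ACQUISITION_TIMEOUT.
// Any other error, or the final failed attempt, is rethrown to the caller.
async function startWithRetry(startFn, { attempts = 3, baseDelayMs = 1000 } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await startFn();
    } catch (err) {
      const retryable = err.code === "LOCK_ACQUISITION_TIMEOUT";
      if (!retryable || attempt === attempts) throw err;
      // Exponential backoff: baseDelayMs, then 2x, then 4x, and so on.
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage with a stub that times out twice before succeeding.
let calls = 0;
const flakyStart = async () => {
  calls += 1;
  if (calls < 3) {
    const err = new Error("session-lock acquisition timeout");
    err.code = "LOCK_ACQUISITION_TIMEOUT";
    throw err;
  }
  return "running";
};

startWithRetry(flakyStart, { baseDelayMs: 10 }).then((state) => console.log(state, calls));
```

Retrying only helps if the lock holder eventually releases; when it never does, the clean teardown described in Fix 2 is still needed.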

Fix 2: Hot-Reload Path Should Force Clean Tear-Down

Rather than attempting in-place re-initialization, the hot-reload path should trigger a clean shutdown followed by cold start.

Implementation

// In lib/plugin-manager/reloader.js
async function handlePluginConfigHotReload(pluginId, newConfig) {
  const channel = channels.find(c => c.pluginId === pluginId);
  
  if (channel && channel.type === 'bluebubbles') {
    // Instead of reinitialize(), do full teardown + cold start
    logger.info('BlueBubbles: initiating clean restart due to config hot-reload');
    
    // 1. Signal graceful shutdown
    await channel.shutdown({ timeout: 5000, force: true });
    
    // 2. Release all session locks held by channel
    await sessionLockManager.releaseAll({
      scope: 'account',
      channelId: channel.id
    });
    
    // 3. Clear any pending webhook handlers
    await webhookHandlerManager.clear(channel.id);
    
    // 4. Cold start (not re-initialize)
    await channel.start({
      fresh: true,
      config: newConfig
    });
  } else {
    // Standard reload for non-channel plugins
    await pluginManager.reload(pluginId);
  }
}

Fix 3: Apply via Configuration (Immediate Mitigation)

If source code modification is not immediately available, the following operational steps provide mitigation:

Step 1: Identify Affected Tenants

# Query liveness endpoint for stuck tenants
curl -s https://gateway.example.com/debug/liveness | \
  jq '.tenants[] | select(.phase == "channels.bluebubbles.start-account")'

# Expected output:
{
  "tenantId": "tenant-1234",
  "phase": "channels.bluebubbles.start-account",
  "stuckDuration": "4h23m15s"
}

Step 2: Isolate Stuck Channels

# For each stuck tenant, disable the channel temporarily
# This prevents further webhook traffic from being silently dropped

curl -X PATCH https://gateway.example.com/api/v1/tenants/{tenantId}/channels/bluebubbles \
  -H "Authorization: Bearer {admin_token}" \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'

# Response: 200 OK

Step 3: Force Cold Restart

# Option A: Docker tenants (full recreation)
docker compose down && docker compose up -d

# Option B: Systemd tenants (native install)
sudo systemctl restart openclaw-gateway

# Verify restart clears the stuck state
sleep 10
curl -s https://gateway.example.com/debug/liveness | \
  jq '.tenants[] | select(.tenantId == "tenant-1234") | .phase'
# Expected: "running" or "ready"

🧪 Verification

Verification 1: Confirm Start-Account Completes

After applying the fix, verify the BlueBubbles channel transitions to a ready state:

# Check channel phase for all tenants
$ curl -s https://gateway.example.com/debug/liveness | \
  jq '[.tenants[] | select(.channelType == "bluebubbles") | {tenantId, phase, phaseAge}]'

# Expected output (after fix):
[
  {
    "tenantId": "tenant-1234",
    "phase": "running",
    "phaseAge": "2m15s"
  },
  {
    "tenantId": "tenant-5678", 
    "phase": "running",
    "phaseAge": "45s"
  }
]

# Verify NO tenants remain in 'start-account' phase
$ curl -s https://gateway.example.com/debug/liveness | \
  jq '[.tenants[] | select(.phase == "channels.bluebubbles.start-account")] | length'
0

Verification 2: Webhook Processing Active

Confirm webhooks are being received and processed (not just responded to):

# Enable debug logging temporarily
curl -X PATCH https://gateway.example.com/api/v1/config \
  -H "Authorization: Bearer {admin_token}" \
  -H "Content-Type: application/json" \
  -d '{"logging": {"level": "debug"}}'

# Send test webhook
curl -X POST https://gateway.example.com/bluebubbles-webhook \
  -H "Content-Type: application/json" \
  -H "X-BlueBubbles-Signature: test-signature" \
  -d '{"message": {"text": "VERIFICATION_TEST"}, "from": "+12345551234"}'

# Check logs for webhook processing
$ ssh openclaw-gateway "tail -f /var/log/openclaw/gateway.log" | grep -E "(webhook received|webhook accepted|message processed)"

# Expected output within 5 seconds:
[bluebubbles] webhook received path=/bluebubbles-webhook id=abc123
[bluebubbles] webhook accepted tenant=tenant-1234
[bluebubbles] message processed from=+12345551234

Verification 3: Session-Lock Metrics Normal

Verify session-lock acquisition completes within acceptable bounds:

# Check metrics endpoint
$ curl -s https://gateway.example.com/metrics | \
  grep -E "(session_lock|bluebubbles)" | head -20

# Key metrics to verify:
# bluebubbles_start_account_duration_seconds_bucket{le="30"} should show non-zero
# session_lock_acquisition_duration_seconds should be < 5s (not 60s+)

Verification 4: Hot-Reload Resilience Test

Test that the fix prevents deadlock on subsequent hot-reloads:

# Trigger a hot-reload of plugin config
curl -X PUT https://gateway.example.com/api/v1/config/plugins/entries/google \
  -H "Authorization: Bearer {admin_token}" \
  -H "Content-Type: application/json" \
  -d '{"config": {"webSearch": {"model": "claude-sonnet-4-20250514"}}}'

# Immediately monitor phase
for i in {1..10}; do
  phase=$(curl -s https://gateway.example.com/debug/liveness | \
    jq -r '.phase')
  echo "[${i}] Phase: ${phase}"
  if [[ "$phase" == "running" ]]; then
    echo "SUCCESS: Phase transitioned to running"
    exit 0
  fi
  sleep 2
done

echo "FAILURE: Phase did not transition to running within 20 seconds"
exit 1

Verification 5: End-to-End Message Flow

Complete verification of customer message processing:

# Send simulated customer message via BlueBubbles API mock
# (EOF is left unquoted so that $(date -Iseconds) in the body is expanded by the shell)
cat << EOF | curl -X POST https://gateway.example.com/bluebubbles-webhook \
  -H "Content-Type: application/json" \
  -H "X-BlueBubbles-Signature: $(echo -n 'test' | openssl dgst -sha256 -hmac 'secret' | cut -d' ' -f2)" \
  -d @-
{
  "message": {
    "text": "Customer test message",
    "handle": "+12345551234",
    "date": "$(date -Iseconds)"
  },
  "attachment": null,
  "method": "private-api"
}
EOF

# Verify response logged and processed
$ grep -E "(inbound|outbound|reply)" /var/log/openclaw/messages.log | tail -5

# Expected: Message logged as inbound with proper handle mapping

⚠️ Common Pitfalls

Pitfall 1: Partial Restart Insufficient

Many administrators expect systemctl restart to clear the stuck state. It often does not, because session locks are not always fully released. If the channel is still stuck after a plain restart, use the full stop-and-clear sequence below.

# INEFFECTIVE - Partial restart may retain lock state
$ sudo systemctl restart openclaw-gateway

# EFFECTIVE - Full process termination required
$ sudo systemctl stop openclaw-gateway
$ sudo killall -9 node  # Kills EVERY node process on this host, not only the gateway
$ sudo rm -f /var/run/openclaw/session-locks/*  # Clear stale lock files
$ sudo systemctl start openclaw-gateway

Docker-specific variant:

# INEFFECTIVE for Docker
$ docker compose restart openclaw

# EFFECTIVE for Docker
$ docker compose down
$ docker volume rm $(docker volume ls -qf name=openclaw-locks) 2>/dev/null || true
$ docker compose up -d

Pitfall 2: Health Check False Positives

Standard health checks return 200 OK because they only verify the HTTP listener is bound, not that the message pipeline is functional.

# This check passes but the channel is stuck:
$ curl -s https://gateway.example.com/healthz
{"status":"ok"}

# Use this instead to check actual channel state:
$ curl -s https://gateway.example.com/debug/channels | \
  jq '.bluebubbles.status'
"stuck-in-start-account"

Recommended monitoring query:

# Alert on phase duration exceeding threshold
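phase_age=$(curl -s https://gateway.example.com/debug/liveness | \
  jq -r '.phaseAge')

# Convert a duration such as "4h23m15s" to seconds (missing units count as 0)
h=$(echo "$phase_age" | grep -oE '[0-9]+h' | tr -d 'h')
m=$(echo "$phase_age" | grep -oE '[0-9]+m' | tr -d 'm')
s=$(echo "$phase_age" | grep -oE '[0-9]+s' | tr -d 's')
age_seconds=$(( 10#${h:-0} * 3600 + 10#${m:-0} * 60 + 10#${s:-0} ))

if [ "$age_seconds" -gt 300 ]; then
  echo "ALERT: BlueBubbles channel stuck for ${age_seconds}s"
  # Trigger incident response
fi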
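The phaseAge and stuckDuration values shown in this guide look like 4h23m15s, so an alert has to convert the whole string to seconds before comparing it with a threshold. A minimal sketch of that conversion follows; `parseDurationSeconds` is a hypothetical helper and assumes only h, m and s units appear, as in the examples here.

```javascript
// Convert a duration string such as "4h23m15s" into seconds.
// Returns NaN for empty or unrecognized input.
function parseDurationSeconds(text) {
  const trimmed = text.trim();
  const match = /^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$/.exec(trimmed);
  if (!match || trimmed === "") return NaN;
  // Groups that did not match are undefined and fall back to "0".
  const [, h = "0", m = "0", s = "0"] = match;
  return Number(h) * 3600 + Number(m) * 60 + Number(s);
}

console.log(parseDurationSeconds("4h23m15s")); // 15795
console.log(parseDurationSeconds("2m15s"));    // 135
console.log(parseDurationSeconds("45s"));      // 45
```

Against the 300 second threshold used above, "4h23m15s" alerts, while "2m15s" and "45s" do not.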

Pitfall 3: Hot-Reload Timing Window

The race condition has a narrow timing window. Some administrators report that re-triggering hot-reload sometimes “unsticks” the channel; this is a timing artifact, not a reliable fix.

# UNRELIABLE: Triggering second hot-reload may coincidentally succeed
curl -X PUT ...config...  # May unstick due to thread scheduling luck

# This is NOT a fix; implement the proper solution above

Pitfall 4: BlueBubbles Server Version Compatibility

Certain BlueBubbles server versions exhibit different webhook delivery behavior that can mask or exacerbate this issue.

BB Server Version | Behavior
------------------|---------------------------------------------------------------
< 1.9.0           | No retry logic, drops silently on 2xx
1.9.0-1.9.8       | Retries 3x with 30s backoff
>= 1.9.9          | Retries with exponential backoff, alerts on persistent failure

Ensure bluebubbles.server.version is >= 1.9.0 for proper retry behavior.
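A version gate like this is easy to get wrong with string comparison ("1.10.0" sorts before "1.9.0" as text), so the check should compare numeric components. The sketch below maps a version to the behaviors in the table above; `compareVersions` and `retryBehavior` are hypothetical helpers, and the returned labels are made up for illustration.

```javascript
// Compare dotted numeric versions: returns -1, 0 or 1.
function compareVersions(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0); // missing components count as 0
    if (diff !== 0) return Math.sign(diff);
  }
  return 0;
}

// Map a BlueBubbles server version to the behavior listed in the table above.
function retryBehavior(version) {
  if (compareVersions(version, "1.9.0") < 0) return "no-retry";
  if (compareVersions(version, "1.9.9") < 0) return "fixed-retry";
  return "exponential-backoff";
}

console.log(retryBehavior("1.8.5"));  // no-retry
console.log(retryBehavior("1.9.8"));  // fixed-retry
console.log(retryBehavior("1.10.0")); // exponential-backoff
```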

Pitfall 5: Session Lock Persistence

Session locks may persist across tenant migrations or gateway failures, causing new instances to start in a stuck state immediately.

# Check for stale lock files before starting
$ ls -la /var/run/openclaw/session-locks/

# If locks exist for removed tenants:
# tenant-abcd1234 -> locked since 2026-05-10
# tenant-efgh5678 -> locked since 2026-05-11

# Clear orphaned locks
$ sudo rm -rf /var/run/openclaw/session-locks/*

# Then restart gateway
$ sudo systemctl restart openclaw-gateway

Pitfall 6: Concurrent Hot-Reload Storms

Fleet-wide automation (Ansible, Terraform, etc.) may trigger simultaneous hot-reloads across many tenants, amplifying the race condition probability.

# RISKY: Concurrent updates across all tenants
ansible-playbook -i inventory fleet-wide-plugin-update.yml

# SAFER: Serialized updates with verification between each
for tenant in $(tenant list --format=json | jq -r '.[].id'); do
  echo "Updating tenant: $tenant"
  curl -X PUT .../tenants/$tenant/config/plugins/entries/google \
    -d '{"config": {...}}'
  
  # Wait and verify channel is running before next tenant
  sleep 10
  phase=$(curl -s .../tenants/$tenant/debug/liveness | jq -r '.phase')
  if [ "$phase" != "running" ]; then
    echo "ERROR: Tenant $tenant not running, aborting fleet update"
    exit 1
  fi
done
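The same serialized rollout can be expressed as a small function that is easy to test with stubs. This is a sketch under assumptions: `rollOutSerially` is a hypothetical helper, and `updateTenant` and `getPhase` are caller-supplied functions standing in for the config update and liveness calls, whose real endpoints are elided above.

```javascript
// Update tenants one at a time and stop at the first one that does not reach "running".
async function rollOutSerially(tenantIds, updateTenant, getPhase, settleMs = 10000) {
  const updated = [];
  for (const id of tenantIds) {
    await updateTenant(id);
    // Give the channel time to restart before checking its phase.
    await new Promise((resolve) => setTimeout(resolve, settleMs));
    const phase = await getPhase(id);
    if (phase !== "running") {
      return { ok: false, failedTenant: id, updated };
    }
    updated.push(id);
  }
  return { ok: true, failedTenant: null, updated };
}

// Usage with stubs: the second tenant gets stuck, so the third is never touched.
const touched = [];
const phases = { a: "running", b: "channels.bluebubbles.start-account", c: "running" };
rollOutSerially(
  ["a", "b", "c"],
  async (id) => { touched.push(id); },
  async (id) => phases[id],
  0
).then((result) => console.log(result, touched));
```

Stopping at the first failure limits the blast radius to one tenant instead of the whole fleet.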

Issue #78165: WhatsApp Channel Stuck After Plugin Hot-Reload

Symptom: WhatsApp channel enters channels.whatsapp.auth-flow phase indefinitely after plugins.entries.* config hot-reload.

Shared Root Cause: Session-lock contention in the channel re-initialization path.

Resolution: Fixed in 2026.5.8 via session-lock timeout implementation.


Issue #78690: WhatsApp Webhook 404 Despite HTTP 200 (Follow-up)

Symptom: Secondary report confirming webhook acceptance but message handler non-responsiveness.

Key Finding: Identified that the HTTP listener accepts requests but discards the body before handler dispatch.

Resolution: Related to #78165 fix; confirmed by implementing clean teardown path.


Issue #78435: Slack Channel Start-Account Deadlock

Symptom: channels.slack.start-account blocks for 60+ seconds then fails with ETIMEDOUT.

Distinguisher: The Slack variant times out (rather than blocking indefinitely) because of a different lock implementation.

Workaround: Same as this guide; a full gateway restart clears the state.


Issue #78352: Telegram Channel Reconnection Loop Post-Hot-Reload

Symptom: Telegram channel repeatedly reconnects without entering running phase after hot-reload.

Distinguisher: Exhibits reconnection loop instead of permanent deadlock due to different session-lock release timing.

Related Finding: Confirmed that post-ready.maintenance duration spike (759ms → 5000ms+) is a precursor indicator.


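Metric                            | Healthy Range | Alert Threshold | Indicator
----------------------------------|---------------|-----------------|--------------------
sidecars.session-locks            | 0-500ms       | > 5000ms        | Lock contention
eventLoopDelayMaxMs               | < 1000ms      | > 3000ms        | Event loop blocking
phaseAge.channels.*.start-account | < 30s         | > 60s           | Stuck startup
webhook.received.count            | > 0/min       | = 0 for 5+ min  | Processing stopped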
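The alert thresholds above translate directly into a rule table. The sketch below is one possible encoding; `ALERT_RULES` and `evaluateAlerts` are hypothetical, and the keys `startAccountPhaseAgeSeconds` and `minutesWithoutWebhookReceived` are assumed names for values a collector would have to derive, since the source does not define how those two metrics are exposed.

```javascript
// Alert thresholds from the table above, one rule per row.
const ALERT_RULES = [
  { metric: "sidecars.session-locks", indicator: "Lock contention", breached: (ms) => ms > 5000 },
  { metric: "eventLoopDelayMaxMs", indicator: "Event loop blocking", breached: (ms) => ms > 3000 },
  // Assumed key: age of a channels.*.start-account phase, in seconds.
  { metric: "startAccountPhaseAgeSeconds", indicator: "Stuck startup", breached: (s) => s > 60 },
  // Assumed key: minutes since the last "webhook received" event.
  { metric: "minutesWithoutWebhookReceived", indicator: "Processing stopped", breached: (min) => min >= 5 },
];

// Return the indicators whose thresholds are exceeded; absent metrics are skipped.
function evaluateAlerts(sample) {
  return ALERT_RULES
    .filter((rule) => rule.metric in sample && rule.breached(sample[rule.metric]))
    .map((rule) => rule.indicator);
}

// Values taken from the liveness sample earlier in this guide.
console.log(evaluateAlerts({ "sidecars.session-locks": 61235, eventLoopDelayMaxMs: 5964.3 }));
```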

Cross-Channel Pattern Summary

This issue represents a class of bugs where channel startup phases are not hot-reload safe due to:

  1. Session-lock acquisition without timeout
  2. In-place re-initialization attempting to re-acquire held locks
  3. Circular wait between new init thread and existing handler thread

Prevention Checklist:

  • All channel start-* phases implement bounded timeouts (≤60s)
  • Hot-reload path implements clean teardown before cold start
  • Session-lock metrics exposed and monitored
  • Health check includes channel phase verification
  • Webhook processing metrics compared against listener metrics

Evidence & Sources

This troubleshooting guide was automatically synthesized by the FixClaw Intelligence Pipeline from community discussions.