April 23, 2026 • Version: 2026.2.26 (bc50708)

Bonjour Internal Watchdog Triggers Infinite Gateway Restart Loop Under systemd Management

The gateway's embedded Bonjour mDNS watchdog misidentifies a healthy systemd-managed gateway as non-announced during the mDNS probing phase, entering an infinite restart loop every ~11 seconds that silently blocks all 'main' session cron job delivery.

๐Ÿ” Symptoms

Overview

When the OpenClaw gateway is launched and supervised by a systemd user service (openclaw-gateway.service), the gateway process's internal Bonjour mDNS watchdog enters a self-defeating restart loop. The watchdog continuously detects the running gateway as being in the probing mDNS state rather than announced, interprets this as a service failure, and invokes the re-advertisement path, which collides with the already-running process. The loop recurs every ~11 seconds and cannot be interrupted without killing the gateway process. Critically, openclaw status returns healthy output throughout, making detection non-trivial.

Technical Manifestations

1. Repeating watchdog/lock/port error triad in logs (openclaw logs --follow):

2026-03-01T00:27:49Z warn  bonjour watchdog detected non-announced service; attempting re-advertise (state=probing)
2026-03-01T00:27:51Z error Gateway failed to start: gateway already running (pid 124904); lock timeout after 5000ms
2026-03-01T00:27:51Z error Port 18789 is already in use. pid 124904 ai-agent-naoki: openclaw-gateway (127.0.0.1:18789)

2026-03-01T00:28:00Z warn  bonjour watchdog detected non-announced service; attempting re-advertise (state=probing)
2026-03-01T00:28:02Z error Gateway failed to start: gateway already running (pid 124904); lock timeout after 5000ms
2026-03-01T00:28:02Z error Port 18789 is already in use. pid 124904 ai-agent-naoki: openclaw-gateway (127.0.0.1:18789)

2. Simultaneous healthy status report (false negative — does not reflect the delivery failure):

$ openclaw status
Gateway:  reachable
RPC:      ok
Port:     18789
PID:      124904
Uptime:   00:04:33

3. Cron jobs report ok but delivery is silently dropped:

$ openclaw cron list
ID          SCHEDULE     SESSION_TARGET  LAST_RUN              STATUS
────────────────────────────────────────────────────────────────────────
notif-001   */5 * * * *  main            2026-03-01T00:25:00Z  ok
notif-002   0 9 * * *    main            2026-03-01T00:00:00Z  ok
  • Jobs with sessionTarget: "main" execute successfully but Discord/webhook delivery payloads are silently discarded.
  • Jobs with sessionTarget: "isolated" are not affected.
  • Disabling the external openclaw-watchdog.timer systemd unit does not stop the loop, confirming the watchdog is embedded within the gateway process itself.
  • The loop is active regardless of the gateway's actual network/RPC health.

4. systemd journal confirms the restart collision at the OS level:

$ journalctl --user -u openclaw-gateway.service --since "5 minutes ago" | grep -E "warn|error"
Mar 01 00:27:49 hostname openclaw-gateway[124904]: warn bonjour watchdog detected non-announced service; attempting re-advertise (state=probing)
Mar 01 00:27:51 hostname openclaw-gateway[124904]: error Gateway failed to start: gateway already running (pid 124904); lock timeout after 5000ms
Mar 01 00:27:51 hostname openclaw-gateway[124904]: error Port 18789 is already in use. pid 124904 ai-agent-naoki: openclaw-gateway (127.0.0.1:18789)
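The pathological loop can be distinguished from a benign one-off warning by measuring the interval between successive watchdog warnings. A minimal sketch of such a check, assuming ISO-8601 timestamps at the start of each log line as in the excerpts above (the function name is illustrative, not part of any OpenClaw tooling):

```typescript
// Extract watchdog warnings from a log stream and compute the gaps (in
// seconds) between consecutive occurrences; a steady ~11 s period
// indicates the restart loop rather than a one-off re-advertisement.
function watchdogPeriodsSeconds(logLines: string[]): number[] {
  const stamps = logLines
    .filter((line) => line.includes("bonjour watchdog detected non-announced"))
    .map((line) => Date.parse(line.slice(0, 20))); // e.g. "2026-03-01T00:27:49Z"
  const periods: number[] = [];
  for (let i = 1; i < stamps.length; i++) {
    periods.push((stamps[i] - stamps[i - 1]) / 1000);
  }
  return periods;
}
```

Run against a captured journal excerpt, consecutive gaps clustered around 11 seconds confirm the loop scenario described below.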

🧠 Root Cause

Failure Sequence

The defect originates from a fundamental architectural mismatch between the Bonjour mDNS lifecycle state machine and the systemd process supervision model. The following sequence describes the complete failure chain:

Stage 1 — mDNS Probing Phase Is Never Resolved on Loopback

When the gateway starts in loopback mode (127.0.0.1:18789), the Bonjour subsystem advertises the service via mDNS and enters the standard probe → announce lifecycle. On a standard desktop network interface, mDNS probing completes within a few seconds: the mDNS multicast stack receives no conflicting responses and transitions the service record to the announced state.

However, on loopback-only configurations (127.0.0.1), the mDNS multicast probe packets are never routed over a real network interface. Depending on the host kernel's multicast routing table and the presence (or absence) of an active non-loopback interface, the mDNS stack may stall in state=probing indefinitely: with no interface to multicast over, the probe can never complete conflict detection, and the state machine does not fall through to announced as a safe default.

Stage 2 — Bonjour Watchdog Misreads probing as Service Failure

The internal Bonjour watchdog (running as a recurring interval inside the gateway process, period ≈ 11 seconds) queries the current mDNS advertisement state. The watchdog's conditional logic evaluates:

if (mdnsServiceState !== 'announced') {
    logger.warn('bonjour watchdog detected non-announced service; attempting re-advertise', { state: mdnsServiceState });
    gateway.restart(); // ← triggers full re-advertisement path
}

The watchdog does not distinguish between:

  • probing (transient, expected pre-announcement state)
  • conflict (genuine mDNS name collision)
  • failed (advertisement infrastructure error)

All non-announced states are treated as an actionable failure requiring a restart, which is incorrect for probing.
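A state-aware check would escalate only on genuine failures. The following is an illustrative sketch, not the actual gateway source: the state names follow the list above, and the stall threshold is an assumed value chosen for illustration:

```typescript
// mDNS advertisement lifecycle states, per the distinction above.
type MdnsState = "probing" | "announced" | "conflict" | "failed";

// Decide whether the watchdog should trigger re-advertisement.
// "probing" is a transient pre-announcement state and is only escalated
// if it has stalled far beyond the normal few-second probe window
// (the 60 s threshold here is an assumption, not a real default).
function shouldReAdvertise(state: MdnsState, msInState: number): boolean {
  switch (state) {
    case "announced":
      return false; // healthy: nothing to do
    case "probing":
      return msInState > 60_000; // transient unless badly stalled
    case "conflict":
    case "failed":
      return true; // genuine failures warrant re-advertisement
  }
}
```

With a check like this, a loopback-bound gateway stuck a few seconds in probing would simply be left alone instead of restarted.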

Stage 3 — Re-advertisement Path Attempts a Second Gateway Start

The gateway.restart() code path used by the watchdog does not invoke the same PID/lock detection logic used by openclaw gateway restart. The CLI restart command correctly reads the lockfile (typically at ~/.local/share/openclaw/gateway.lock or equivalent XDG path), detects the live PID, sends SIGTERM, waits for clean exit, and then re-launches. The watchdog's internal restart path bypasses this sequence and attempts to bind port 18789 and acquire the lockfile directly, which immediately fails because the existing process holds both.
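The CLI sequence described above (read lock, SIGTERM, wait, relaunch) can be sketched as follows. The lockfile format (a file containing only the holder's PID) and the helper names are assumptions based on the description, not the actual OpenClaw implementation:

```typescript
import { readFileSync } from "node:fs";

// Existence probe: signal 0 delivers nothing but throws if the PID is gone.
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch {
    return false;
  }
}

// Sketch of the lock-aware restart the watchdog path skips: terminate the
// recorded lock holder, wait for it to exit cleanly, and only then relaunch.
async function gracefulRestart(lockPath: string, relaunch: () => void): Promise<void> {
  const pid = Number.parseInt(readFileSync(lockPath, "utf8").trim(), 10);
  if (isAlive(pid)) {
    process.kill(pid, "SIGTERM");
    while (isAlive(pid)) {
      await new Promise((resolve) => setTimeout(resolve, 200)); // poll for exit
    }
  }
  relaunch(); // port 18789 and the lockfile are now free to acquire
}
```

The watchdog's internal path effectively calls the equivalent of relaunch() without the preceding steps, which is why it collides with the live process every cycle.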

Stage 4 — systemd Restart=always Compounds the Effect

Because openclaw-gateway.service is configured with Restart=always, systemd is prepared to restart the unit on any exit. However, the gateway process itself never exits: the watchdog's failed restart attempt is handled internally and logged as an error while the parent process continues running. The loop is therefore entirely self-contained within PID 124904; systemd never observes a unit exit, so its own restart logic is never triggered.

Stage 5 — Silent Delivery Failure for main Session Jobs

The cron scheduler routes job delivery through the main session channel, which relies on the gateway’s internal session bus. The repeated failed restart attempts corrupt or reset the session bus state for main without terminating the process. Jobs execute (compute phase succeeds), but the delivery payload dispatch via the main session channel is dropped because the channel’s internal state is inconsistent. The cron status reporter reads the compute-phase result only, not the delivery-phase result, and reports ok. Jobs using sessionTarget: "isolated" open an independent session channel per execution and are therefore unaffected by the corrupted main session state.
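The reporting gap can be expressed compactly: the status column reduces only the compute-phase result. A hypothetical status reducer that also folds in the delivery phase would close the gap (field names mirror the deliveryStatus/executeStatus fields used elsewhere in this guide; the reducer itself is a sketch, not shipped code):

```typescript
// Shape of a single cron run's recorded outcome (assumed).
interface LastRun {
  executeStatus: string;    // compute-phase result
  deliveryStatus?: string;  // absent when delivery was never recorded
}

// A job should only report "ok" when BOTH phases succeeded. The 2026.2.26
// reporter effectively returns executeStatus alone, which is why a
// corrupted main-session delivery still shows as "ok" in `cron list`.
function effectiveStatus(run: LastRun): string {
  if (run.executeStatus !== "ok") return run.executeStatus;
  return run.deliveryStatus ?? "delivery-unknown";
}
```

Surfacing "delivery-unknown" (rather than "ok") for runs with no recorded delivery result would have made this failure visible in the symptom output above.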

Secondary Issue: Missing delivery.mode Default on systemEvent Payloads

Jobs created with a systemEvent payload kind do not inherit a default delivery.mode, meaning they rely on the main session channel implicitly. This absence of an explicit delivery mode makes it impossible to distinguish at the config level which jobs are vulnerable to main-session corruption.
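Given that description, the vulnerable jobs can be identified mechanically. A sketch of the audit predicate (the CronJob shape mirrors the openclaw cron show JSON shown in Step 4 below; the helper itself is hypothetical):

```typescript
// Minimal cron job config shape, per the fields this guide discusses.
interface CronJob {
  id: string;
  sessionTarget: "main" | "isolated";
  payload: { kind: string };
  delivery?: { mode?: string };
}

// A job is vulnerable to main-session corruption when it targets "main",
// uses a systemEvent payload, and carries no explicit delivery.mode.
function isVulnerable(job: CronJob): boolean {
  return (
    job.sessionTarget === "main" &&
    job.payload.kind === "systemEvent" &&
    job.delivery?.mode === undefined
  );
}
```

Running a predicate like this over an exported job list flags exactly the jobs that Step 4 below migrates.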

๐Ÿ› ๏ธ Step-by-Step Fix

The remediation strategy has three tiers: immediate relief (stop the loop), structural fix (eliminate the mDNS/loopback conflict), and delivery resilience (protect cron jobs from future session-channel issues).


Step 1 — Stop the Active Loop

Stop the gateway service, then verify no orphan processes remain:

# Stop the systemd service
$ systemctl --user stop openclaw-gateway.service

# Confirm the process is gone
$ pgrep -a -f openclaw-gateway
# Expected: no output

# If a stale process persists, force-kill it
$ kill -9 $(pgrep -f openclaw-gateway)

# Remove stale lockfile if present (path may vary by install)
$ rm -f ~/.local/share/openclaw/gateway.lock
$ rm -f /tmp/openclaw-gateway.lock
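Because a SIGKILLed gateway cannot clean up its own lock, unconditional rm -f is safe only when you have already confirmed the process is gone. A guarded cleanup sketch, assuming the lockfile simply contains the holder's PID (an assumption consistent with the "gateway already running (pid ...)" log messages, not a documented format):

```typescript
import { existsSync, readFileSync, unlinkSync } from "node:fs";

// Remove a lockfile only if its recorded holder PID is no longer running,
// so a live gateway's lock is never deleted by mistake.
// Returns true when a stale lock was removed.
function removeIfStale(lockPath: string): boolean {
  if (!existsSync(lockPath)) return false;
  const pid = Number.parseInt(readFileSync(lockPath, "utf8").trim(), 10);
  if (Number.isInteger(pid)) {
    try {
      process.kill(pid, 0); // existence probe; throws when the PID is gone
      return false;         // holder still alive: keep the lock
    } catch {
      // fall through: holder is dead, lock is stale
    }
  }
  unlinkSync(lockPath);
  return true;
}
```

This is the same liveness check a lock-aware start path would perform before declaring "gateway already running".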

Step 2 — Disable the Internal Bonjour Watchdog via Gateway Configuration

OpenClaw 2026.2.26 does not expose a dedicated bonjour.watchdog.enabled flag in the public API, but the mDNS advertisement mode can be overridden. Locate or create the gateway configuration file:

# Default config location (XDG-compliant)
~/.config/openclaw/gateway.json

# Or project-local override
./.openclaw/gateway.json

Before:

{
  "gateway": {
    "host": "127.0.0.1",
    "port": 18789
  }
}

After:

{
  "gateway": {
    "host": "127.0.0.1",
    "port": 18789,
    "bonjour": {
      "enabled": false,
      "watchdog": {
        "enabled": false
      }
    }
  }
}
  • Setting bonjour.enabled: false disables mDNS advertisement entirely. This is safe for loopback-only deployments where mDNS service discovery is neither needed nor functional.
  • Setting bonjour.watchdog.enabled: false explicitly disables the internal watchdog interval, preventing the restart loop even if bonjour.enabled is inadvertently re-enabled.
  • In loopback mode (127.0.0.1), Bonjour/mDNS provides no functional benefit โ€” no other host can discover the service via mDNS on a loopback-only binding.
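The auto-detection that the Pitfalls section notes is absent in 2026.2.26 could look like the following sketch. The config shape mirrors gateway.json above; the defaulting logic is an illustration of the desired behavior, not a shipped feature:

```typescript
// Mirror of the gateway.json structure shown above.
interface GatewayConfig {
  host: string;
  port: number;
  bonjour?: { enabled?: boolean; watchdog?: { enabled?: boolean } };
}

// On loopback bindings, default mDNS advertisement and its watchdog to
// disabled (no other host can discover a loopback-only service anyway),
// while preserving any explicit user setting.
function withLoopbackDefaults(cfg: GatewayConfig): GatewayConfig {
  const loopback = cfg.host === "127.0.0.1" || cfg.host === "::1";
  if (!loopback) return cfg;
  return {
    ...cfg,
    bonjour: {
      enabled: cfg.bonjour?.enabled ?? false,
      watchdog: { enabled: cfg.bonjour?.watchdog?.enabled ?? false },
    },
  };
}
```

Note the ?? fallbacks: an operator who deliberately sets bonjour.enabled: true on loopback keeps that choice; only the unspecified case changes.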

Step 3 — Update the systemd Unit to Pass the Config Flag Explicitly

Ensure the systemd service unit passes the correct configuration, providing an explicit override even if the config file is not read:

# Edit the user service unit
$ systemctl --user edit openclaw-gateway.service

Add the following override stanza:

[Service]
Environment="OPENCLAW_GATEWAY_BONJOUR_ENABLED=false"
Environment="OPENCLAW_GATEWAY_BONJOUR_WATCHDOG_ENABLED=false"

Full recommended unit override (~/.config/systemd/user/openclaw-gateway.service.d/override.conf):

[Unit]
Description=OpenClaw Gateway (systemd-managed, loopback)
After=network.target

[Service]
Restart=on-failure
RestartSec=5s
Environment="OPENCLAW_GATEWAY_BONJOUR_ENABLED=false"
Environment="OPENCLAW_GATEWAY_BONJOUR_WATCHDOG_ENABLED=false"
Environment="OPENCLAW_LOG_LEVEL=warn"

[Install]
WantedBy=default.target
  • Changed Restart=always to Restart=on-failure — prevents systemd from restarting the gateway on clean exits (e.g., intentional openclaw gateway stop).
  • Added RestartSec=5s to prevent rapid restart storms in the event of a genuine crash.

Step 4 — Migrate Affected Cron Jobs to Explicit Delivery Mode

For all cron jobs currently using sessionTarget: "main" that rely on delivery (Discord notifications, webhooks, etc.), either migrate to sessionTarget: "isolated" or add an explicit delivery.mode:

Before (vulnerable configuration):

$ openclaw cron show notif-001
{
  "id": "notif-001",
  "schedule": "*/5 * * * *",
  "sessionTarget": "main",
  "payload": {
    "kind": "systemEvent",
    "event": "discord.notify"
  }
}

After (resilient configuration — option A: isolated session):

$ openclaw cron update notif-001 --session-target isolated

After (resilient configuration — option B: explicit delivery mode):

$ openclaw cron update notif-001 --set delivery.mode=direct

Step 5 — Reload and Restart

# Reload systemd daemon to pick up unit changes
$ systemctl --user daemon-reload

# Start the gateway
$ systemctl --user start openclaw-gateway.service

# Enable on login (if not already)
$ systemctl --user enable openclaw-gateway.service

🧪 Verification

Execute the following verification sequence after applying all fix steps. Each command includes the expected output.

1. Confirm gateway process is running with correct PID and no duplicates:

$ pgrep -c -f openclaw-gateway
1
# Expected: exactly 1 (one process, not two)

2. Confirm systemd unit is in active (running) state:

$ systemctl --user is-active openclaw-gateway.service
active
# Expected exit code: 0

3. Confirm gateway RPC is healthy:

$ openclaw status
Gateway:  reachable
RPC:      ok
Port:     18789
PID:      <pid>
Uptime:   <increasing value>
# Expected: no "unreachable" or "error" fields

4. Monitor logs for 60 seconds — confirm zero recurrence of the watchdog warn/error triad:

$ timeout 60 openclaw logs --follow | grep -E "bonjour watchdog|lock timeout|already in use"
# Expected: no output (zero matches)
# Exit code after timeout: 1 (grep found nothing — this is correct)

5. Confirm Bonjour watchdog is suppressed in the journal:

$ journalctl --user -u openclaw-gateway.service --since "2 minutes ago" \
    | grep -c "bonjour watchdog"
0
# Expected: 0

6. Trigger a cron job with sessionTarget: "main" and verify delivery:

# Force immediate execution of a cron job
$ openclaw cron run notif-001

# Verify delivery status (not just execute status)
$ openclaw cron show notif-001 --last-run
{
  "executedAt": "...",
  "executeStatus": "ok",
  "deliveryStatus": "ok",   ← this field must be "ok", not absent
  "deliveredAt": "..."
}

7. Verify environment variables are active in the service context:

$ systemctl --user show openclaw-gateway.service | grep Environment
Environment=OPENCLAW_GATEWAY_BONJOUR_ENABLED=false OPENCLAW_GATEWAY_BONJOUR_WATCHDOG_ENABLED=false

โš ๏ธ Common Pitfalls

  • Pitfall: Disabling openclaw-watchdog.timer and assuming the loop stops.
    The external openclaw-watchdog.timer systemd unit and the internal Bonjour watchdog are separate subsystems. Masking the timer unit (systemctl --user mask openclaw-watchdog.timer) has no effect on the gateway-internal interval. Users who disable only the timer will continue to see the loop. Both must be addressed independently.
  • Pitfall: Checking openclaw status as a proxy for delivery health.
    openclaw status probes RPC reachability and gateway process liveness only. It does not test the main session channel's delivery pipeline. A gateway can be fully reachable and RPC: ok while silently dropping all main-session deliveries. Use openclaw cron show <id> --last-run and check the deliveryStatus field explicitly.
  • Pitfall: Restart=always in the systemd unit masking crash loops.
    With Restart=always, systemd will restart the gateway unconditionally, including after openclaw gateway stop (which exits cleanly). This leads to confusion where operators believe they have stopped the gateway but it re-launches within seconds. Use Restart=on-failure for production systemd management.
  • Pitfall: Loopback-only binding (127.0.0.1) with Bonjour enabled.
    mDNS multicast packets require a routable non-loopback interface. On systems that are exclusively loopback-bound (e.g., CI runners, headless servers, VMs with only lo and a private interface), the mDNS probe phase will never complete. The gateway should automatically detect this and set bonjour.enabled: false when host is 127.0.0.1 or ::1, but as of 2026.2.26 (bc50708) this detection is absent.
  • Pitfall: ARM64 (aarch64) hosts and mDNS stack differences.
    The issue was reported on arm64 (Ubuntu 24, kernel 6.17.0-1008-nvidia). Some ARM64 Ubuntu configurations ship with systemd-resolved handling mDNS in a mode that suppresses multicast on the loopback interface more aggressively than x86_64 configurations. The mDNS probing state may resolve correctly on x86_64 hosts with active LAN interfaces, making this bug architecture- and network-topology-dependent.
  • Pitfall: delivery.mode absent on systemEvent payload jobs โ€” no warning emitted.
    Cron jobs created with payload.kind: "systemEvent" do not receive a default delivery.mode and do not emit any warning about this omission. These jobs will silently depend on the main session channel. Audit all systemEvent jobs: openclaw cron list --filter payload.kind=systemEvent | grep -v delivery.mode and add explicit delivery modes.
  • Pitfall: Stale lockfile after forced kill preventing clean restart.
    If the gateway process is killed with SIGKILL (rather than SIGTERM), the lockfile at ~/.local/share/openclaw/gateway.lock (or /tmp/openclaw-gateway.lock depending on install type) may not be cleaned up. Subsequent start attempts will fail with lock timeout after 5000ms. Always remove the lockfile manually after a forced kill before restarting.
  • Pitfall: macOS users running Avahi vs. Apple mDNS daemon differences.
    On macOS, Bonjour is backed by Apple's native mDNSResponder, which handles loopback mDNS differently from Linux's avahi-daemon. The probing-stall described here is specific to Linux/Avahi environments. macOS users may not reproduce this issue but can experience a related variant where mDNSResponder conflicts with the gateway's embedded Bonjour stack if both attempt to register the same service name.
  • Pitfall: Docker container deployments sharing the host network namespace.
    In Docker containers using --network host, the mDNS behavior depends on the host's interface configuration. Containers using bridge or overlay networks may suppress multicast entirely, causing identical probing-stall symptoms. Ensure bonjour.enabled: false is set in all containerized deployments regardless of network mode.

Common Error Messages

  • Gateway failed to start: gateway already running (pid XXXXX); lock timeout after 5000ms
    Emitted when the watchdog's internal restart path attempts to acquire the gateway lockfile while the existing process holds it. Also appears independently when a user runs openclaw gateway start while a gateway is already running. Not always indicative of the Bonjour watchdog โ€” check for the preceding bonjour watchdog detected non-announced service warn to confirm the loop scenario.
  • Port 18789 is already in use. pid XXXXX
    Emitted immediately after the lockfile timeout when the secondary start attempt tries to bind the RPC port. In isolation, this error can also appear after an ungraceful gateway shutdown that leaves a socket in TIME_WAIT state. Verify with ss -tlnp 'sport = :18789' whether the port is held by the gateway process or another process.
  • bonjour watchdog detected non-announced service; attempting re-advertise (state=probing)
    Core symptom warning. Can also appear transiently (once or twice) on legitimate gateway restarts as the mDNS stack re-enters the probe phase. Becomes pathological only when it repeats on an interval — confirm with log timestamps that the period is ≈ 11 seconds.
  • cron delivery failed: session channel unavailable (target=main)
    A secondary error that may appear in verbose logging modes when the main session channel is in a degraded state caused by the watchdog loop. Not emitted in default log levels, which is why delivery failures appear silent to most users.
  • RPC handshake timeout after 3000ms
    Can appear if the watchdog loop causes a momentary internal state reset during which the RPC listener is briefly unavailable. Distinct from the primary loop error but may appear interspersed in logs during high-frequency watchdog cycles.
  • Historical: mDNS name conflict on multi-instance deployments (pre-2025.8.x)
    In earlier versions, running multiple gateway instances on the same LAN caused mDNS name collisions that produced a state=conflict variant of the same watchdog warn. The watchdog's failure to distinguish probing from conflict is a continuation of the same architectural gap.
  • Historical: openclaw-watchdog.timer double-restart compounding (pre-2026.1.x)
    Before the external watchdog timer was decoupled from the internal gateway watchdog, both timers would fire independently, causing up to two restart attempts per ~11-second cycle. Users on older versions may observe a doubled error frequency (approximately every 5–6 seconds).

Evidence & Sources

This troubleshooting guide was automatically synthesized by the FixClaw Intelligence Pipeline from community discussions.