Bonjour Internal Watchdog Triggers Infinite Gateway Restart Loop Under systemd Management
The gateway's embedded Bonjour mDNS watchdog misidentifies a healthy systemd-managed gateway as non-announced during the mDNS probing phase, entering an infinite restart loop every ~11 seconds that silently blocks all 'main' session cron job delivery.
🔍 Symptoms
Overview
When the OpenClaw gateway is launched and supervised by a systemd user service (openclaw-gateway.service), the gateway process's internal Bonjour mDNS watchdog enters a self-defeating restart loop. The watchdog continuously detects the running gateway as being in the probing mDNS state rather than announced, interprets this as a service failure, and invokes the re-advertisement path, which collides with the already-running process. The loop recurs every ~11 seconds and cannot be interrupted without killing the gateway process. Critically, openclaw status returns healthy output throughout, making detection non-trivial.
Technical Manifestations
1. Repeating watchdog/lock/port error triad in logs (openclaw logs --follow):
2026-03-01T00:27:49Z warn bonjour watchdog detected non-announced service; attempting re-advertise (state=probing)
2026-03-01T00:27:51Z error Gateway failed to start: gateway already running (pid 124904); lock timeout after 5000ms
2026-03-01T00:27:51Z error Port 18789 is already in use. pid 124904 ai-agent-naoki: openclaw-gateway (127.0.0.1:18789)
2026-03-01T00:28:00Z warn bonjour watchdog detected non-announced service; attempting re-advertise (state=probing)
2026-03-01T00:28:02Z error Gateway failed to start: gateway already running (pid 124904); lock timeout after 5000ms
2026-03-01T00:28:02Z error Port 18789 is already in use. pid 124904 ai-agent-naoki: openclaw-gateway (127.0.0.1:18789)
2. Simultaneous healthy status report (a false negative: it does not reflect the delivery failure):
$ openclaw status
Gateway: reachable
RPC: ok
Port: 18789
PID: 124904
Uptime: 00:04:33
3. Cron jobs report ok but delivery is silently dropped:
$ openclaw cron list
ID SCHEDULE SESSION_TARGET LAST_RUN STATUS
──────────────────────────────────────────────────────────────────────
notif-001 */5 * * * * main 2026-03-01T00:25:00Z ok
notif-002 0 9 * * * main 2026-03-01T00:00:00Z ok
- Jobs with `sessionTarget: "main"` execute successfully, but Discord/webhook delivery payloads are silently discarded.
- Jobs with `sessionTarget: "isolated"` are not affected.
- Disabling the external `openclaw-watchdog.timer` systemd unit does not stop the loop, confirming the watchdog is embedded within the gateway process itself.
- The loop is active regardless of the gateway's actual network/RPC health.
4. systemd journal confirms the restart collision at the OS level:
$ journalctl --user -u openclaw-gateway.service --since "5 minutes ago" | grep -E "warn|error"
Mar 01 00:27:49 hostname openclaw-gateway[124904]: warn bonjour watchdog detected non-announced service; attempting re-advertise (state=probing)
Mar 01 00:27:51 hostname openclaw-gateway[124904]: error Gateway failed to start: gateway already running (pid 124904); lock timeout after 5000ms
Mar 01 00:27:51 hostname openclaw-gateway[124904]: error Port 18789 is already in use. pid 124904 ai-agent-naoki: openclaw-gateway (127.0.0.1:18789)
🔧 Root Cause
Failure Sequence
The defect originates from a fundamental architectural mismatch between the Bonjour mDNS lifecycle state machine and the systemd process supervision model. The following sequence describes the complete failure chain:
Stage 1 – mDNS Probing Phase Is Never Resolved on Loopback
When the gateway starts in loopback mode (127.0.0.1:18789), the Bonjour subsystem advertises the service via mDNS and enters the standard probe → announce lifecycle. On a standard desktop network interface, mDNS probing completes within a few seconds: the mDNS multicast stack receives no conflicting responses and transitions the service record to the announced state.
However, on loopback-only configurations (127.0.0.1), the mDNS multicast probe packets are never routed over a real network interface. Depending on the host kernel's multicast routing table and the presence (or absence) of an active non-loopback interface, the mDNS stack may stall in state=probing indefinitely: the probe performs no conflict detection because no interface is available to multicast over, and the state machine does not fall through to announced as a safe default.
Stage 2 – Bonjour Watchdog Misreads `probing` as Service Failure
The internal Bonjour watchdog (a recurring interval inside the gateway process, period ≈ 11 seconds) queries the current mDNS advertisement state. The watchdog's conditional logic evaluates:
if (mdnsServiceState !== 'announced') {
logger.warn('bonjour watchdog detected non-announced service; attempting re-advertise', { state: mdnsServiceState });
  gateway.restart(); // triggers the full re-advertisement path
}
The watchdog does not distinguish between:
- `probing` (transient, expected pre-announcement state)
- `conflict` (genuine mDNS name collision)
- `failed` (advertisement infrastructure error)
All non-announced states are treated as an actionable failure requiring a restart, which is incorrect for probing.
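A state-aware check would treat `probing` as transient and escalate only when it persists or when a genuinely failed state is observed. The sketch below is illustrative, not the actual OpenClaw internals; the function name and the grace-window parameter are assumptions:

```javascript
// Hypothetical state-aware watchdog check. Restart only on genuine
// failure states ('conflict', 'failed'), or when the transient 'probing'
// state has persisted past a grace window.
function shouldRestart(mdnsServiceState, msInState, probeGraceMs = 30000) {
  if (mdnsServiceState === 'announced') return false; // healthy, nothing to do
  if (mdnsServiceState === 'conflict' || mdnsServiceState === 'failed') {
    return true; // genuine advertisement failure
  }
  if (mdnsServiceState === 'probing') {
    // Transient pre-announcement state: escalate only if it stalls.
    return msInState > probeGraceMs;
  }
  return true; // unknown state: treat conservatively as failure
}
```

With this shape, a loopback probe stall within the grace window produces no restart attempt at all, and the 11-second loop never begins.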
Stage 3 – Re-advertisement Path Attempts to Start a Second Gateway Process
The gateway.restart() code path used by the watchdog does not invoke the same PID/lock detection logic used by openclaw gateway restart. The CLI restart command correctly reads the lockfile (typically at ~/.local/share/openclaw/gateway.lock or equivalent XDG path), detects the live PID, sends SIGTERM, waits for clean exit, and then re-launches. The watchdog's internal restart path bypasses this sequence and attempts to bind port 18789 and acquire the lockfile directly, which immediately fails because the existing process holds both.
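The check the CLI path performs and the internal path skips can be illustrated with a minimal PID-liveness probe. The helper names are hypothetical; the signal-0 technique itself is standard POSIX behavior exposed by Node:

```javascript
// Sketch of a lock-aware pre-start check. Signal 0 performs an
// existence/permission check without delivering any signal.
function isProcessAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // ESRCH: no such process. EPERM: exists but owned by another user.
    return err.code === 'EPERM';
  }
}

function canStartGateway(lockfilePid) {
  // A live PID in the lockfile means another gateway holds the lock:
  // the correct action is SIGTERM + wait, not a blind bind attempt.
  return lockfilePid == null || !isProcessAlive(lockfilePid);
}
```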
Stage 4 – systemd Restart=always Compounds the Effect
Because openclaw-gateway.service is configured with Restart=always, systemd is prepared to restart the unit on any exit. However, the gateway process itself never exits: the watchdog's failed restart attempt is handled internally and logged as an error while the parent process continues running. The loop therefore runs entirely within the single PID 124904, systemd never observes a unit exit, and systemd's own restart logic is never triggered.
Stage 5 – Silent Delivery Failure for `main` Session Jobs
The cron scheduler routes job delivery through the main session channel, which relies on the gateway’s internal session bus. The repeated failed restart attempts corrupt or reset the session bus state for main without terminating the process. Jobs execute (compute phase succeeds), but the delivery payload dispatch via the main session channel is dropped because the channel’s internal state is inconsistent. The cron status reporter reads the compute-phase result only, not the delivery-phase result, and reports ok. Jobs using sessionTarget: "isolated" open an independent session channel per execution and are therefore unaffected by the corrupted main session state.
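The reporting gap can be sketched as a status reducer that folds in the delivery phase rather than reading the compute-phase result alone. Field names such as executeStatus and deliveryStatus mirror the cron last-run output elsewhere in this document; the reducer itself is hypothetical:

```javascript
// Hypothetical status reducer: a delivery-bearing job is only 'ok' when
// both the compute phase and the delivery phase succeeded. Reading
// executeStatus alone is how dropped deliveries get reported as 'ok'.
function overallStatus(lastRun, expectsDelivery = false) {
  if (lastRun.executeStatus !== 'ok') return 'error';
  if (!expectsDelivery) return 'ok'; // pure compute job, no delivery phase
  return lastRun.deliveryStatus === 'ok' ? 'ok' : 'delivery-failed';
}
```

Under this reducer, a job whose delivery payload was silently dropped (deliveryStatus absent) would surface as 'delivery-failed' instead of 'ok'.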
Secondary Issue: Missing delivery.mode Default on systemEvent Payloads
Jobs created with a systemEvent payload kind do not inherit a default delivery.mode, so they rely on the main session channel implicitly. Without an explicit delivery mode, it is impossible to tell at the config level which jobs are vulnerable to main-session corruption.
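One way to close this gap is a defaulting pass at job-creation time. The helper below is hypothetical (not part of the OpenClaw codebase); the 'direct' mode value mirrors the delivery.mode=direct option used in the fix steps:

```javascript
// Hypothetical defaulting pass: give systemEvent jobs an explicit
// delivery.mode so their dependence on the main session channel is
// visible (and overridable) at the config level.
function withDeliveryDefaults(job) {
  if (job.payload?.kind !== 'systemEvent') return job;
  if (job.delivery?.mode) return job; // already explicit, leave untouched
  return { ...job, delivery: { ...(job.delivery ?? {}), mode: 'direct' } };
}
```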
🛠️ Step-by-Step Fix
The remediation strategy has three tiers: immediate relief (stop the loop), structural fix (eliminate the mDNS/loopback conflict), and delivery resilience (protect cron jobs from future session-channel issues).
Step 1 โ Stop the Active Loop
Stop and mask the gateway service, then verify no orphan processes remain:
# Stop the systemd service
$ systemctl --user stop openclaw-gateway.service
# Confirm the process is gone
$ pgrep -a -f openclaw-gateway
# Expected: no output
# If a stale process persists, force-kill it
$ kill -9 $(pgrep -f openclaw-gateway)
# Remove stale lockfile if present (path may vary by install)
$ rm -f ~/.local/share/openclaw/gateway.lock
$ rm -f /tmp/openclaw-gateway.lock
Step 2 โ Disable the Internal Bonjour Watchdog via Gateway Configuration
OpenClaw 2026.2.26 does not expose a dedicated bonjour.watchdog.enabled flag in the public API, but the mDNS advertisement mode can be overridden. Locate or create the gateway configuration file:
# Default config location (XDG-compliant)
~/.config/openclaw/gateway.json
# Or project-local override
./.openclaw/gateway.json
Before:
{
"gateway": {
"host": "127.0.0.1",
"port": 18789
}
}
After:
{
"gateway": {
"host": "127.0.0.1",
"port": 18789,
"bonjour": {
"enabled": false,
"watchdog": {
"enabled": false
}
}
}
}
- Setting `bonjour.enabled: false` disables mDNS advertisement entirely. This is safe for loopback-only deployments, where mDNS service discovery is neither needed nor functional.
- Setting `bonjour.watchdog.enabled: false` explicitly disables the internal watchdog interval, preventing the restart loop even if `bonjour.enabled` is inadvertently re-enabled.
- In loopback mode (`127.0.0.1`), Bonjour/mDNS provides no functional benefit: no other host can discover the service via mDNS on a loopback-only binding.
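A quick sanity check of the resulting file can be scripted. This helper is hypothetical (not part of the OpenClaw CLI) and only mirrors the JSON shape shown above:

```javascript
// Hypothetical helper: verify a parsed gateway.json has both Bonjour
// switches off, matching the "After" config above.
function bonjourFullyDisabled(config) {
  const bonjour = config?.gateway?.bonjour;
  return bonjour?.enabled === false && bonjour?.watchdog?.enabled === false;
}

// Example usage against the default config location:
// const fs = require('fs');
// const cfg = JSON.parse(
//   fs.readFileSync(`${process.env.HOME}/.config/openclaw/gateway.json`, 'utf8')
// );
// console.log(bonjourFullyDisabled(cfg));
```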
Step 3 โ Update the systemd Unit to Pass the Config Flag Explicitly
Ensure the systemd service unit passes the correct configuration, providing an explicit override even if the config file is not read:
# Edit the user service unit
$ systemctl --user edit openclaw-gateway.service
Add the following override stanza:
[Service]
Environment="OPENCLAW_GATEWAY_BONJOUR_ENABLED=false"
Environment="OPENCLAW_GATEWAY_BONJOUR_WATCHDOG_ENABLED=false"
Full recommended unit override (~/.config/systemd/user/openclaw-gateway.service.d/override.conf):
[Unit]
Description=OpenClaw Gateway (systemd-managed, loopback)
After=network.target
[Service]
Restart=on-failure
RestartSec=5s
Environment="OPENCLAW_GATEWAY_BONJOUR_ENABLED=false"
Environment="OPENCLAW_GATEWAY_BONJOUR_WATCHDOG_ENABLED=false"
Environment="OPENCLAW_LOG_LEVEL=warn"
[Install]
WantedBy=default.target
- Changed `Restart=always` to `Restart=on-failure`, which prevents systemd from restarting the gateway on clean exits (e.g., an intentional `openclaw gateway stop`).
- Added `RestartSec=5s` to prevent rapid restart storms in the event of a genuine crash.
Step 4 โ Migrate Affected Cron Jobs to Explicit Delivery Mode
For all cron jobs currently using sessionTarget: "main" that rely on delivery (Discord notifications, webhooks, etc.), either migrate to sessionTarget: "isolated" or add an explicit delivery.mode:
Before (vulnerable configuration):
$ openclaw cron show notif-001
{
"id": "notif-001",
"schedule": "*/5 * * * *",
"sessionTarget": "main",
"payload": {
"kind": "systemEvent",
"event": "discord.notify"
}
}
After (resilient configuration โ option A: isolated session):
$ openclaw cron update notif-001 --session-target isolated
After (resilient configuration โ option B: explicit delivery mode):
$ openclaw cron update notif-001 --set delivery.mode=direct
Step 5 โ Reload and Restart
# Reload systemd daemon to pick up unit changes
$ systemctl --user daemon-reload
# Start the gateway
$ systemctl --user start openclaw-gateway.service
# Enable on login (if not already)
$ systemctl --user enable openclaw-gateway.service
🧪 Verification
Execute the following verification sequence after applying all fix steps. Each command includes the expected output.
1. Confirm gateway process is running with correct PID and no duplicates:
$ pgrep -c -f openclaw-gateway
1
# Expected: exactly 1 (one process, not two)
2. Confirm systemd unit is in active (running) state:
$ systemctl --user is-active openclaw-gateway.service
active
# Expected exit code: 0
3. Confirm gateway RPC is healthy:
$ openclaw status
Gateway: reachable
RPC: ok
Port: 18789
PID: <pid>
Uptime: <increasing value>
# Expected: no "unreachable" or "error" fields
4. Monitor logs for 60 seconds โ confirm zero recurrence of the watchdog warn/error triad:
$ timeout 60 openclaw logs --follow | grep -E "bonjour watchdog|lock timeout|already in use"
# Expected: no output (zero matches)
# Exit code after timeout: 1 (grep found nothing, which is correct)
5. Confirm Bonjour watchdog is suppressed in the journal:
$ journalctl --user -u openclaw-gateway.service --since "2 minutes ago" \
| grep -c "bonjour watchdog"
0
# Expected: 0
6. Trigger a cron job with sessionTarget: "main" and verify delivery:
# Force immediate execution of a cron job
$ openclaw cron run notif-001
# Verify delivery status (not just execute status)
$ openclaw cron show notif-001 --last-run
{
"executedAt": "...",
"executeStatus": "ok",
"deliveryStatus": "ok", โ this field must be "ok", not absent
"deliveredAt": "..."
}
7. Verify environment variables are active in the service context:
$ systemctl --user show openclaw-gateway.service | grep Environment
Environment=OPENCLAW_GATEWAY_BONJOUR_ENABLED=false OPENCLAW_GATEWAY_BONJOUR_WATCHDOG_ENABLED=false
⚠️ Common Pitfalls
- Pitfall: Disabling `openclaw-watchdog.timer` and assuming the loop stops.
  The external `openclaw-watchdog.timer` systemd unit and the internal Bonjour watchdog are separate subsystems. Masking the timer unit (`systemctl --user mask openclaw-watchdog.timer`) has no effect on the gateway-internal interval. Users who disable only the timer will continue to see the loop; both must be addressed independently.
- Pitfall: Checking `openclaw status` as a proxy for delivery health.
  `openclaw status` probes RPC reachability and gateway process liveness only. It does not test the main session channel's delivery pipeline. A gateway can be fully `reachable` with `RPC: ok` while silently dropping all main-session deliveries. Use `openclaw cron show <id> --last-run` and check the `deliveryStatus` field explicitly.
- Pitfall: `Restart=always` in the systemd unit masking crash loops.
  With `Restart=always`, systemd restarts the gateway unconditionally, including after `openclaw gateway stop` (which exits cleanly). Operators then believe they have stopped the gateway, only to see it re-launch within seconds. Use `Restart=on-failure` for production systemd management.
- Pitfall: Loopback-only binding (`127.0.0.1`) with Bonjour enabled.
  mDNS multicast packets require a routable non-loopback interface. On systems that are exclusively loopback-bound (e.g., CI runners, headless servers, VMs with only `lo` and a private interface), the mDNS probe phase will never complete. The gateway should automatically detect this and set `bonjour.enabled: false` when `host` is `127.0.0.1` or `::1`, but as of 2026.2.26 (bc50708) this detection is absent.
- Pitfall: ARM64 (aarch64) hosts and mDNS stack differences.
  The issue was reported on `arm64` (Ubuntu 24, kernel 6.17.0-1008-nvidia). Some ARM64 Ubuntu configurations ship with `systemd-resolved` handling mDNS in a mode that suppresses multicast on the loopback interface more aggressively than x86_64 configurations. The mDNS `probing` state may resolve correctly on x86_64 hosts with active LAN interfaces, making this bug architecture- and network-topology-dependent.
- Pitfall: `delivery.mode` absent on `systemEvent` payload jobs, with no warning emitted.
  Cron jobs created with `payload.kind: "systemEvent"` do not receive a default `delivery.mode` and emit no warning about the omission. These jobs silently depend on the main session channel. Audit all `systemEvent` jobs (`openclaw cron list --filter payload.kind=systemEvent | grep -v delivery.mode`) and add explicit delivery modes.
- Pitfall: Stale lockfile after a forced kill preventing clean restart.
  If the gateway process is killed with `SIGKILL` (rather than `SIGTERM`), the lockfile at `~/.local/share/openclaw/gateway.lock` (or `/tmp/openclaw-gateway.lock`, depending on install type) may not be cleaned up. Subsequent start attempts will fail with `lock timeout after 5000ms`. Always remove the lockfile manually after a forced kill before restarting.
- Pitfall: macOS differences between Avahi and Apple's mDNS daemon.
  On macOS, Bonjour is backed by Apple's native `mDNSResponder`, which handles loopback mDNS differently from Linux's `avahi-daemon`. The `probing` stall described here is specific to Linux/Avahi environments. macOS users may not reproduce this issue but can hit a related variant where `mDNSResponder` conflicts with the gateway's embedded Bonjour stack if both attempt to register the same service name.
- Pitfall: Docker container deployments sharing the host network namespace.
  In Docker containers using `--network host`, mDNS behavior depends on the host's interface configuration. Containers using bridge or overlay networks may suppress multicast entirely, causing identical `probing`-stall symptoms. Set `bonjour.enabled: false` in all containerized deployments regardless of network mode.
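The `systemEvent` audit mentioned above can also be done over the job list as structured data. This sketch assumes the job objects have the same shape as the `openclaw cron show` output in this document; the function itself is hypothetical:

```javascript
// Flag cron jobs that implicitly depend on the main session channel:
// systemEvent payloads with no explicit delivery.mode.
function findVulnerableJobs(jobs) {
  return jobs
    .filter((job) => job.payload?.kind === 'systemEvent')
    .filter((job) => !job.delivery?.mode)
    .map((job) => job.id);
}
```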
🔗 Related Errors
- `Gateway failed to start: gateway already running (pid XXXXX); lock timeout after 5000ms`
  Emitted when the watchdog's internal restart path attempts to acquire the gateway lockfile while the existing process holds it. Also appears independently when a user runs `openclaw gateway start` while a gateway is already running. Not always indicative of the Bonjour watchdog; check for the preceding `bonjour watchdog detected non-announced service` warning to confirm the loop scenario.
- `Port 18789 is already in use. pid XXXXX`
  Emitted immediately after the lockfile timeout, when the secondary start attempt tries to bind the RPC port. In isolation, this error can also appear after an ungraceful gateway shutdown that leaves a socket in `TIME_WAIT` state. Verify with `ss -tlnp 'sport = :18789'` whether the port is held by the gateway process or another process.
- `bonjour watchdog detected non-announced service; attempting re-advertise (state=probing)`
  Core symptom error. Can also appear transiently (once or twice) on legitimate gateway restarts as the mDNS stack re-enters the probe phase. Becomes pathological only when it repeats on an interval; confirm with log timestamps that the period is ≈11 seconds.
- `cron delivery failed: session channel unavailable (target=main)`
  A secondary error that may appear in verbose logging modes when the main session channel is in a degraded state caused by the watchdog loop. Not emitted at default log levels, which is why delivery failures appear silent to most users.
- `RPC handshake timeout after 3000ms`
  Can appear if the watchdog loop causes a momentary internal state reset during which the RPC listener is briefly unavailable. Distinct from the primary loop error, but may appear interspersed in logs during high-frequency watchdog cycles.
- Historical: mDNS name conflict on multi-instance deployments (pre-2025.8.x)
  In earlier versions, running multiple gateway instances on the same LAN caused mDNS name collisions that produced a `state=conflict` variant of the same watchdog warning. The watchdog's failure to distinguish `probing` from `conflict` is a continuation of the same architectural gap.
- Historical: `openclaw-watchdog.timer` double-restart compounding (pre-2026.1.x)
  Before the external watchdog timer was decoupled from the internal gateway watchdog, both timers would fire independently, causing up to two restart attempts per ~11-second cycle. Users on older versions may observe a doubled error frequency (approximately every 5–6 seconds).