Ghost peers and stuck busy state
Ghost peers
A peer shows online in list_peers but the agent process is gone (closed terminal, killed tmux pane, crashed runtime). Symptom: routing to it succeeds at the daemon but the agent never sees the message.
This happens when a peer exits without firing its SessionEnd / Stop hook — for example, a force-killed tmux pane, an OS-level kill, or a network partition during a remote session.
Recovery
Ghosts are evicted by lazy repair: the next MCP tool call from any peer triggers lazy_repair() (at most once per 30 seconds), which checks every peer's last_seen against the staleness threshold and demotes anything past it. There is no polling thread that does this — see lazy repair.
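The rate-limited check described above can be sketched roughly as follows. This is a minimal illustration, not Repowire's actual code: the Peer and LazyRepairer classes, their field names, and the 120-second staleness threshold are assumptions made for the example. Only the general shape (run on a tool call, at most once per 30 seconds, compare last_seen against a threshold, demote) comes from the text.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Peer:
    # Hypothetical peer record; field names are illustrative only.
    name: str
    last_seen: float
    status: str = "online"


@dataclass
class LazyRepairer:
    min_interval: float = 30.0   # run at most once per 30 seconds (from the text)
    stale_after: float = 120.0   # hypothetical staleness threshold, in seconds
    _last_run: float = field(default=float("-inf"))

    def lazy_repair(self, peers, now=None):
        """Demote stale peers; return the names demoted on this call."""
        now = time.monotonic() if now is None else now
        if now - self._last_run < self.min_interval:
            return []  # rate-limited: this call does no repair work
        self._last_run = now
        demoted = []
        for peer in peers:
            if peer.status == "online" and now - peer.last_seen > self.stale_after:
                peer.status = "offline"
                demoted.append(peer.name)
        return demoted
```

Because the check piggybacks on ordinary tool calls, a mesh with no traffic does no repair work at all, which is the trade-off the no-polling design accepts.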
If you need an immediate eviction, run repowire peer prune. It removes all peers whose last_seen exceeds daemon.prune_max_age_hours (default 24h). For a faster cleanup, call any MCP tool from another peer; the next routing call will run lazy repair.
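To make the age cutoff concrete, here is a small sketch of the comparison involved. The prune_peers function and the name-to-timestamp mapping are hypothetical and exist only for illustration; the 24-hour default mirrors daemon.prune_max_age_hours as stated above, and nothing here reflects how Repowire stores peers internally.

```python
from datetime import datetime, timedelta


def prune_peers(peers, now: datetime, prune_max_age_hours: float = 24.0):
    """Split peers into (kept, removed) by last_seen age.

    `peers` maps a peer name to its last_seen datetime. This shape is an
    assumption for the sketch, not Repowire's storage format.
    """
    cutoff = now - timedelta(hours=prune_max_age_hours)
    # Keep peers seen at or after the cutoff; everything older is removed.
    kept = {name: seen for name, seen in peers.items() if seen >= cutoff}
    removed = sorted(name for name in peers if name not in kept)
    return kept, removed
```

With a 24-hour cutoff, a ghost that died an hour ago survives a prune, which is why the quicker route for recent ghosts is to trigger lazy repair instead.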
Stuck busy state
A peer shows busy long after the turn that triggered it has ended. The most common causes:
- Stop/AfterAgent hook didn't fire. The peer never marked itself online again. The next user prompt should reset it; if not, re-run repowire setup to rewrite the hook entries.
- Hook script error. The hook ran but failed before reaching the status update. Check the hook log (visible directly in Gemini output; for Claude Code / Codex, look at repowire serve foreground output or the user-service log).
- turn_state=awaiting_input. The peer is mid-turn waiting on user input (a permission prompt, a read -p, an MCP tool that suspended). This is not stuck; the peer is correctly reporting its state. Send input to unblock it.
- turn_state=pending_first_turn. A spawn-seeded peer whose seed message never reached the agent. Re-send via notify_peer.
Lazy repair also has a stale-state fallback for missed cancel/interrupt paths:
on the next normal routing or peer-list request, it resets peers that are still
busy with turn_state=working and have had no recent liveness progress for
longer than daemon.stale_busy_timeout_seconds (default 1800 seconds). It does
not touch awaiting_input, and it is not a guarantee that every backend emitted
a cancel event — it only reconciles stale daemon state without adding a polling
loop.
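The fallback rule above can be sketched as a single filter over peer state. This is an illustrative sketch only: the PeerState record, its field names, and the "idle" value written on reset are assumptions, not Repowire's data model. The 1800-second default and the rule that only busy peers with turn_state=working are eligible come from the text.

```python
from dataclasses import dataclass


@dataclass
class PeerState:
    # Hypothetical record; field names are illustrative only.
    name: str
    status: str           # "online" or "busy"
    turn_state: str       # e.g. "working", "awaiting_input"
    last_progress: float  # time of last liveness progress, in seconds


def reset_stale_busy(peers, now, stale_busy_timeout_seconds=1800.0):
    """Reset peers stuck in busy/working past the timeout; return their names."""
    reset = []
    for peer in peers:
        if peer.status != "busy" or peer.turn_state != "working":
            continue  # awaiting_input and every other state are left alone
        if now - peer.last_progress > stale_busy_timeout_seconds:
            peer.status = "online"
            peer.turn_state = "idle"  # hypothetical resting value
            reset.append(peer.name)
    return reset
```

Keeping awaiting_input out of the filter is the important design point: a peer blocked on a prompt may legitimately sit for longer than the timeout.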
For Codex specifically, manual interrupt/Esc can abort the visible turn without
emitting a reliable Stop/cancel hook. In that case the daemon may continue to
show the peer as busy even though the TUI is ready for input again. This is a
known runtime-signal limitation, not a state Repowire should infer from timing
alone; lowering the stale-busy timeout too far can make long-running turns look
idle incorrectly. When Codex exposes a clean interrupt/cancel lifecycle event,
Repowire should use that instead of a heuristic.
Why no polling
Repowire deliberately has no heartbeat or watchdog thread. State catches up on the next routing call. If a fully idle mesh leaves you with stale state for too long, reach for repowire peer prune rather than asking Repowire to poll.