fix(BA-5499): prevent health checker from recreating deleted etcd keys in Traefik mode #10666
seedspirit wants to merge 8 commits into main
Add _circuit_locks dict and _get_lock() helper to CircuitManager. Wrap update_circuit_routes() and unload_circuits() with per-circuit locks to prevent race conditions between health checker route updates and circuit deletion. Locks are cleaned up after unload completes. Refs: BA-5499 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ecker propagate_route_updates_to_workers() now fetches a fresh circuit from the database before writing to etcd. If the circuit was deleted (ObjectNotFound), it logs and skips the write, preventing stale service keys from being recreated in Traefik's weighted service pool. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…und variable Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Task 4 (live verification) requires manual QA with running services, inference endpoints, and scale operations that cannot be safely automated without disrupting the active environment. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Prevents Traefik-mode health checking from recreating deleted etcd keys by serializing per-circuit updates/unloads and ensuring route propagation uses fresh DB state (and skips deleted circuits).
Changes:
- Add per-circuit `asyncio.Lock` in `CircuitManager` to serialize `update_circuit_routes()` and `unload_circuits()` per circuit.
- Re-read circuits from DB in `propagate_route_updates_to_workers()` and skip propagation when the circuit has been deleted.
- Add changelog entry and a deviation report documenting manual verification steps.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/ai/backend/appproxy/coordinator/types.py | Introduces per-circuit locks and uses them to serialize route updates and circuit unloads in Traefik mode. |
| src/ai/backend/appproxy/coordinator/health_checker.py | Re-reads circuit state from DB prior to propagation to avoid writing stale/deleted state. |
| changes/10666.fix.md | Adds a changelog note for the regression fix. |
    async with self._get_lock(circuit.id):
        log.debug("Acquired lock for circuit {} in update_circuit_routes", circuit.id)
        if self.local_config.proxy_coordinator.enable_traefik:
            await self.update_traefik_circuit_routes(circuit, old_routes)
        else:
            await self.update_legacy_circuit_routes(circuit, old_routes)
unload_circuits() releases the per-circuit lock and then removes it from _circuit_locks. If an update_circuit_routes() call was already queued waiting on that same lock, it will run immediately after unload completes and can still recreate Traefik/etcd state for a circuit you just unloaded/deleted. To make the serialization effective, keep the lock around and add an explicit 'unloaded/deleted' guard checked inside the lock (e.g., a per-circuit tombstone flag/set that makes update_circuit_routes() return early), and only remove the lock when you can guarantee no further work can be queued for that circuit (or during a broader manager shutdown/cleanup).
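
The guard suggested in this comment might look roughly like the sketch below. Everything in it is an assumption for illustration: the `Manager` class, the `_unloaded` set, the singular `unload_circuit()` method, and the dict standing in for the etcd key space are hypothetical and do not reflect the real `CircuitManager`.

```python
import asyncio


class Manager:
    """Hypothetical, simplified model of the lock-plus-tombstone idea."""

    def __init__(self) -> None:
        self._locks: dict[str, asyncio.Lock] = {}
        self._unloaded: set[str] = set()  # tombstones for unloaded circuits
        self.etcd: dict[str, list[str]] = {}  # stand-in for the etcd key space

    def _get_lock(self, circuit_id: str) -> asyncio.Lock:
        return self._locks.setdefault(circuit_id, asyncio.Lock())

    async def update_circuit_routes(self, circuit_id: str, routes: list[str]) -> bool:
        async with self._get_lock(circuit_id):
            # Checked inside the lock: an update that was queued behind an
            # unload sees the tombstone and returns without writing.
            if circuit_id in self._unloaded:
                return False
            self.etcd[circuit_id] = routes
            return True

    async def unload_circuit(self, circuit_id: str) -> None:
        async with self._get_lock(circuit_id):
            await asyncio.sleep(0.01)  # simulated etcd I/O while holding the lock
            self.etcd.pop(circuit_id, None)
            self._unloaded.add(circuit_id)
        # The lock is deliberately left in _locks so queued waiters still
        # serialize on the same object.


async def demo() -> tuple[bool, dict[str, list[str]]]:
    m = Manager()
    m.etcd["c1"] = ["old"]
    unload = asyncio.create_task(m.unload_circuit("c1"))
    await asyncio.sleep(0)  # let the unload acquire the lock first
    update = asyncio.create_task(m.update_circuit_routes("c1", ["new"]))
    await unload
    wrote = await update
    return wrote, m.etcd
```

In this model the queued update returns `False` and the key stays deleted. The open question the sketch does not answer is when tombstones can safely be discarded.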
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…review
- Add _unloaded_circuits tombstone set to CircuitManager; update_circuit_routes() checks it before proceeding, preventing stale updates after unload
- Keep lock alive after unload (no pop) so queued updates hit tombstone guard
- Merge two separate readonly DB sessions into one in propagate_route_updates_to_workers() for consistent snapshot and less overhead
Refs: BA-5499 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tombstone (_unloaded_circuits set) adds complexity with unclear cleanup semantics. The fresh DB re-read in health_checker.py already detects deleted circuits via ObjectNotFound, providing sufficient protection without the memory management concern. Refs: BA-5499 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
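
The re-read guard this commit relies on can be sketched as follows. This is a simplified model under stated assumptions: `ObjectNotFound` here is a local stand-in exception, and `FakeDB`, `get_circuit()`, `propagate_route_updates()`, and the dict-based etcd are hypothetical names, not the project's actual API.

```python
import asyncio


class ObjectNotFound(Exception):
    """Local stand-in for the exception raised when a circuit row is missing."""


class FakeDB:
    """Hypothetical in-memory replacement for the coordinator database."""

    def __init__(self, circuits: dict[str, dict]) -> None:
        self.circuits = dict(circuits)

    async def get_circuit(self, circuit_id: str) -> dict:
        try:
            return self.circuits[circuit_id]
        except KeyError:
            raise ObjectNotFound(circuit_id) from None


async def propagate_route_updates(db: FakeDB, etcd: dict, circuit_id: str) -> bool:
    """Write routes to etcd only if the circuit still exists in the DB."""
    try:
        # Fetch current state instead of trusting an in-memory snapshot.
        fresh = await db.get_circuit(circuit_id)
    except ObjectNotFound:
        # Circuit was deleted in the meantime: skip the write so the
        # deleted keys are not recreated.
        return False
    etcd[circuit_id] = list(fresh["route_info"])
    return True
```

A short usage example: after the circuit row is removed from `FakeDB`, a later call for the same ID returns `False` and leaves the etcd dict untouched.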
Summary
- Add per-circuit `asyncio.Lock` to `CircuitManager` to serialize `update_circuit_routes()` and `unload_circuits()` for the same circuit, preventing race conditions
- Re-read the circuit from the DB in `propagate_route_updates_to_workers()` before the etcd write; skip propagation if the circuit was deleted (`ObjectNotFound`)
- Use fresh `route_info` instead of the stale in-memory snapshot to prevent stale backend URLs from persisting in Traefik's weighted service pool

Test plan
- `pants test tests/unit/appproxy/coordinator::`
- `pants check src/ai/backend/appproxy/coordinator::`
- `pants lint --changed-since=origin/main`

Resolves BA-5499