Skip to content

fix(BA-5499): prevent health checker from recreating deleted etcd keys in Traefik mode#10666

Open
seedspirit wants to merge 8 commits intomainfrom
BA-5499
Open

fix(BA-5499): prevent health checker from recreating deleted etcd keys in Traefik mode#10666
seedspirit wants to merge 8 commits intomainfrom
BA-5499

Conversation

@seedspirit
Copy link
Copy Markdown
Contributor

Summary

  • Add per-circuit asyncio.Lock to CircuitManager to serialize update_circuit_routes() and unload_circuits() for the same circuit, preventing race conditions
  • Re-read circuit from DB in propagate_route_updates_to_workers() before etcd write; skip propagation if circuit was deleted (ObjectNotFound)
  • Use fresh route_info instead of stale in-memory snapshot to prevent stale backend URLs from persisting in Traefik's weighted service pool

Test plan

  • Unit tests pass (pants test tests/unit/appproxy/coordinator::)
  • Type checking passes (pants check src/ai/backend/appproxy/coordinator::)
  • Lint passes (pants lint --changed-since=origin/main)
  • Manual QA: scale-in scenario does not leave stale session service keys in etcd (see DEVIATION.md)

Resolves BA-5499

seedspirit and others added 4 commits March 30, 2026 17:58
Add _circuit_locks dict and _get_lock() helper to CircuitManager.
Wrap update_circuit_routes() and unload_circuits() with per-circuit
locks to prevent race conditions between health checker route updates
and circuit deletion. Locks are cleaned up after unload completes.

Refs: BA-5499

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ecker

propagate_route_updates_to_workers() now fetches a fresh circuit from the
database before writing to etcd. If the circuit was deleted (ObjectNotFound),
it logs and skips the write, preventing stale service keys from being
recreated in Traefik's weighted service pool.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…und variable

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Task 4 (live verification) requires manual QA with running services,
inference endpoints, and scale operations that cannot be safely
automated without disrupting the active environment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 30, 2026 09:05
@github-actions github-actions bot added size:M 30~100 LoC comp:app-proxy Related to App Proxy component labels Mar 30, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Prevents Traefik-mode health checking from recreating deleted etcd keys by serializing per-circuit updates/unloads and ensuring route propagation uses fresh DB state (and skips deleted circuits).

Changes:

  • Add per-circuit asyncio.Lock in CircuitManager to serialize update_circuit_routes() and unload_circuits() per circuit.
  • Re-read circuits from DB in propagate_route_updates_to_workers() and skip propagation when the circuit has been deleted.
  • Add changelog entry and a deviation report documenting manual verification steps.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
src/ai/backend/appproxy/coordinator/types.py Introduces per-circuit locks and uses them to serialize route updates and circuit unloads in Traefik mode.
src/ai/backend/appproxy/coordinator/health_checker.py Re-reads circuit state from DB prior to propagation to avoid writing stale/deleted state.
changes/10666.fix.md Adds a changelog note for the regression fix.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +164 to +169
async with self._get_lock(circuit.id):
log.debug("Acquired lock for circuit {} in update_circuit_routes", circuit.id)
if self.local_config.proxy_coordinator.enable_traefik:
await self.update_traefik_circuit_routes(circuit, old_routes)
else:
await self.update_legacy_circuit_routes(circuit, old_routes)
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unload_circuits() releases the per-circuit lock and then removes it from _circuit_locks. If an update_circuit_routes() call was already queued waiting on that same lock, it will run immediately after unload completes and can still recreate Traefik/etcd state for a circuit you just unloaded/deleted. To make the serialization effective, keep the lock around and add an explicit 'unloaded/deleted' guard checked inside the lock (e.g., a per-circuit tombstone flag/set that makes update_circuit_routes() return early), and only remove the lock when you can guarantee no further work can be queued for that circuit (or during a broader manager shutdown/cleanup).

Copilot uses AI. Check for mistakes.
seedspirit and others added 3 commits March 30, 2026 18:08
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…review

- Add _unloaded_circuits tombstone set to CircuitManager; update_circuit_routes()
  checks it before proceeding, preventing stale updates after unload
- Keep lock alive after unload (no pop) so queued updates hit tombstone guard
- Merge two separate readonly DB sessions into one in
  propagate_route_updates_to_workers() for consistent snapshot and less overhead

Refs: BA-5499

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tombstone (_unloaded_circuits set) adds complexity with unclear
cleanup semantics. The fresh DB re-read in health_checker.py
already detects deleted circuits via ObjectNotFound, providing
sufficient protection without the memory management concern.

Refs: BA-5499

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:app-proxy Related to App Proxy component size:M 30~100 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants