Handle OS signals (SIGTERM/SIGINT) for graceful pipeline shutdown#2325
Handle OS signals (SIGTERM/SIGINT) for graceful pipeline shutdown#2325cijothomas wants to merge 1 commit intoopen-telemetry:mainfrom
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2325 +/- ##
==========================================
- Coverage 87.58% 87.55% -0.03%
==========================================
Files 571 571
Lines 194095 194550 +455
==========================================
+ Hits 169996 170339 +343
- Misses 23573 23685 +112
Partials 526 526
🚀 New features to boost your workflow:
|
| // Give pipelines a generous deadline to drain (60 s by default — | ||
| // matches the default Kubernetes terminationGracePeriodSeconds). | ||
| let deadline = | ||
| std::time::Instant::now() + std::time::Duration::from_secs(60); |
There was a problem hiding this comment.
nit: named constant for 60 secs?
|
|
||
| // ── Second signal: force exit ─────────────────────── | ||
| let signal_name = Self::recv_termination_signal().await; | ||
| otel_error!( |
There was a problem hiding this comment.
Currently it seems like it is sync, but if this ever uses a bufferred writer (which I doubt it will) this message may not get printed before exit. Maybe an eprintln! right under this for good measure?
There was a problem hiding this comment.
agree. Will swap to eprintln.
There was a problem hiding this comment.
This is the purpose of raw_error!
I would prefer to use it for consistency.
| let mut errors = Vec::new(); | ||
| for sender in &senders { | ||
| if let Err(e) = sender.try_send_shutdown( | ||
| deadline, |
There was a problem hiding this comment.
One correctness issue here: try_send_shutdown() can drop the shutdown request if the pipeline control channel is full at signal time. That makes graceful shutdown best-effort under backpressure.
For this PR, I think the smallest fix could be a bounded retry inside the existing rt.block_on(async { ... }) block, e.g. retry try_send_shutdown() a few times with a short tokio::time::sleep(...).await between attempts before giving up and logging the error. That avoids trait changes and closes the immediate gap.
As a follow-up, we can either add a proper async shutdown send on the trait or move shutdown onto a dedicated out-of-band signal such as a watch channel.
Longer term, a dedicated out-of-band shutdown signal (e.g. watch channel per pipeline) also gives us a clean reusable ShutdownHandle that supervisor and OpAMP can call directly - same shutdown path regardless of trigger source.
| } | ||
|
|
||
| #[cfg(not(unix))] | ||
| { |
There was a problem hiding this comment.
Here is the windows equivalent that handles both ctrl_c, ctrl_break.
#[cfg(windows)]
{
use tokio::signal::windows::{ctrl_c, ctrl_break};
let mut sigint = ctrl_c()
.expect("failed to register Ctrl-C handler");
let mut sigterm = ctrl_break()
.expect("failed to register Ctrl-Break handler");
tokio::select! {
_ = sigterm.recv() => "CTRL_BREAK (SIGTERM-equivalent)",
_ = sigint.recv() => "CTRL_C (SIGINT-equivalent)",
}
}
There was a problem hiding this comment.
FYI: On windows platform Ctrl+C can't be reliably sent to a process without console handle and it will be ignored. Safest option is to use CTRL_BREAK.
The main executable has no signal handling today— when K8s sent SIGTERM (or a local user hit Ctrl+C), the process was killed immediately without draining in-flight data.
This PR adds OS signal handling that follows the same double-signal convention as the Go OTel Collector:
First SIGINT/SIGTERM → sends graceful shutdown messages to all pipelines with a 60s drain deadline
Second signal → forces immediate exit via process::exit(1)