Handle OS signals (SIGTERM/SIGINT) for graceful pipeline shutdown#2325

Draft
cijothomas wants to merge 1 commit into open-telemetry:main from cijothomas:cijothomas/shutdown

@cijothomas
Member

The main executable has no signal handling today: when Kubernetes sends SIGTERM (or a local user hits Ctrl+C), the process is killed immediately without draining in-flight data.

This PR adds OS signal handling that follows the same double-signal convention as the Go OTel Collector:

First SIGINT/SIGTERM → sends graceful shutdown messages to all pipelines with a 60s drain deadline
Second signal → forces immediate exit via process::exit(1)
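The double-signal convention above can be modeled as a small state machine. The following is a minimal, std-only sketch rather than the PR's implementation: `SignalAction`, `SignalState`, and `on_signal` are hypothetical names, and the real code awaits signals asynchronously instead of being called with a timestamp.

```rust
use std::time::{Duration, Instant};

/// What the process should do in response to a termination signal.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
pub enum SignalAction {
    /// First signal: ask every pipeline to drain before `deadline`.
    BeginGracefulShutdown { deadline: Instant },
    /// Any later signal: stop waiting and exit with this status code.
    ForceExit { code: i32 },
}

/// Tracks whether a graceful shutdown is already in progress.
#[derive(Debug, Default)]
pub struct SignalState {
    shutting_down: bool,
}

impl SignalState {
    /// Maps the next SIGINT/SIGTERM to an action.
    pub fn on_signal(&mut self, now: Instant, drain: Duration) -> SignalAction {
        if self.shutting_down {
            // A drain is already under way, so the operator wants out now.
            SignalAction::ForceExit { code: 1 }
        } else {
            self.shutting_down = true;
            SignalAction::BeginGracefulShutdown { deadline: now + drain }
        }
    }
}
```

Keeping the decision separate from the signal source makes the convention easy to unit test without delivering real signals.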

@github-actions github-actions bot added the rust label (Pull requests that update Rust code) Mar 14, 2026
@codecov

codecov bot commented Mar 14, 2026

Codecov Report

❌ Patch coverage is 14.28571% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.55%. Comparing base (bde436e) to head (573aa7d).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2325      +/-   ##
==========================================
- Coverage   87.58%   87.55%   -0.03%     
==========================================
  Files         571      571              
  Lines      194095   194550     +455     
==========================================
+ Hits       169996   170339     +343     
- Misses      23573    23685     +112     
  Partials      526      526              
Components Coverage Δ
otap-dataflow 89.57% <14.28%> (-0.05%) ⬇️
query_abstraction 80.61% <ø> (ø)
query_engine 90.61% <ø> (ø)
syslog_cef_receivers ∅ <ø> (∅)
otel-arrow-go 52.44% <ø> (ø)
quiver 91.91% <ø> (ø)

// Give pipelines a generous deadline to drain (60 s by default —
// matches the default Kubernetes terminationGracePeriodSeconds).
let deadline =
std::time::Instant::now() + std::time::Duration::from_secs(60);
Contributor

nit: named constant for 60 secs?
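One way the nit could be addressed is sketched below. The names `SHUTDOWN_DRAIN_TIMEOUT` and `drain_deadline` are hypothetical and not taken from the PR. Separately, the Kubernetes default for `terminationGracePeriodSeconds` is 30 s, so the code comment's claim that 60 s matches it may be worth rechecking.

```rust
use std::time::{Duration, Instant};

/// How long pipelines get to drain after the first termination signal.
/// (Hypothetical constant name; the value is the 60 s from the diff.)
pub const SHUTDOWN_DRAIN_TIMEOUT: Duration = Duration::from_secs(60);

/// Computes the drain deadline relative to `now`.
pub fn drain_deadline(now: Instant) -> Instant {
    now + SHUTDOWN_DRAIN_TIMEOUT
}
```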


// ── Second signal: force exit ───────────────────────
let signal_name = Self::recv_termination_signal().await;
otel_error!(
Contributor

Currently it seems to be synchronous, but if this ever uses a buffered writer (which I doubt it will), this message may not get printed before exit. Maybe an eprintln! right under this for good measure?

Member Author

Agree. Will swap to eprintln!.

Contributor

This is the purpose of raw_error!
I would prefer to use it for consistency.
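The concern in this thread is that `std::process::exit` does not run destructors, so anything still held in a buffered writer is lost. The sketch below shows the general idea of writing and explicitly flushing the final message before exiting. It uses only the standard library and makes no assumption about how the project's `raw_error!` macro is implemented; `write_final_message` and `force_exit` are hypothetical helpers.

```rust
use std::io::{self, Write};

/// Writes `msg` as one line and flushes, so nothing is left in a buffer.
pub fn write_final_message<W: Write>(out: &mut W, msg: &str) -> io::Result<()> {
    writeln!(out, "{msg}")?;
    out.flush()
}

/// Hypothetical force-exit path: report the signal, then exit immediately.
pub fn force_exit(signal_name: &str) -> ! {
    // Write errors are ignored: there is nothing useful to do with them
    // when the process is about to terminate anyway.
    let _ = write_final_message(
        &mut io::stderr().lock(),
        &format!("received second {signal_name}, forcing exit"),
    );
    // `exit` skips destructors, which is why the flush above matters.
    std::process::exit(1)
}
```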

let mut errors = Vec::new();
for sender in &senders {
if let Err(e) = sender.try_send_shutdown(
deadline,
Member

@lalitb lalitb Mar 20, 2026

One correctness issue here: try_send_shutdown() can drop the shutdown request if the pipeline control channel is full at signal time. That makes graceful shutdown best-effort under backpressure.

For this PR, I think the smallest fix could be a bounded retry inside the existing rt.block_on(async { ... }) block, e.g. retry try_send_shutdown() a few times with a short tokio::time::sleep(...).await between attempts before giving up and logging the error. That avoids trait changes and closes the immediate gap.
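A bounded retry of this kind might look like the sketch below. It is a synchronous, std-only stand-in: a `std::sync::mpsc::SyncSender` plays the role of the pipeline control channel and `thread::sleep` stands in for `tokio::time::sleep(...).await`. The helper name `send_with_retry` and its parameters are assumptions, not part of the PR.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};
use std::thread;
use std::time::Duration;

/// Tries to enqueue `msg`, retrying while the bounded channel is full.
/// Makes at least one attempt and at most `max_attempts` (if that is larger).
/// On failure the error carries the undelivered message back to the caller.
pub fn send_with_retry<T>(
    sender: &SyncSender<T>,
    mut msg: T,
    max_attempts: usize,
    backoff: Duration,
) -> Result<(), TrySendError<T>> {
    let mut attempt = 1;
    loop {
        match sender.try_send(msg) {
            Ok(()) => return Ok(()),
            // Channel full: wait briefly and retry while attempts remain.
            Err(TrySendError::Full(returned)) if attempt < max_attempts => {
                msg = returned;
                attempt += 1;
                thread::sleep(backoff);
            }
            // Receiver disconnected, or attempts exhausted: give up.
            Err(e) => return Err(e),
        }
    }
}
```

The caller can then log the returned error once, after the last attempt, instead of on every failed try.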

As a follow-up, we can either add a proper async shutdown send on the trait or move shutdown onto a dedicated out-of-band signal such as a watch channel.

Longer term, a dedicated out-of-band shutdown signal (e.g. watch channel per pipeline) also gives us a clean reusable ShutdownHandle that supervisor and OpAMP can call directly - same shutdown path regardless of trigger source.
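To illustrate the out-of-band idea, here is a minimal std-only sketch of a shared shutdown slot that can never be "full". It is an assumption-laden stand-in for a watch channel, not a proposal for the actual API: the `ShutdownHandle` shape, `request`, and `requested` are hypothetical, and a real version would likely notify waiters rather than rely on polling.

```rust
use std::sync::{Arc, Mutex};
use std::time::Instant;

/// Hypothetical out-of-band shutdown signal shared by every trigger source
/// (OS signal handler, supervisor, OpAMP) and every pipeline.
#[derive(Clone, Default)]
pub struct ShutdownHandle {
    deadline: Arc<Mutex<Option<Instant>>>,
}

impl ShutdownHandle {
    /// Requests shutdown. The first request's deadline wins; later requests
    /// are ignored. Returns `true` if this call initiated the shutdown.
    pub fn request(&self, deadline: Instant) -> bool {
        let mut slot = self.deadline.lock().unwrap();
        if slot.is_none() {
            *slot = Some(deadline);
            true
        } else {
            false
        }
    }

    /// Checked by a pipeline between units of work; `Some(deadline)` means
    /// shutdown has been requested and draining should finish by then.
    pub fn requested(&self) -> Option<Instant> {
        *self.deadline.lock().unwrap()
    }
}
```

Because the slot holds a single value instead of a queue, a shutdown request cannot be dropped under backpressure on the control channel.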

}

#[cfg(not(unix))]
{

Here is the Windows equivalent that handles both ctrl_c and ctrl_break.
#[cfg(windows)]
{
use tokio::signal::windows::{ctrl_c, ctrl_break};

let mut sigint = ctrl_c()
    .expect("failed to register Ctrl-C handler");
let mut sigterm = ctrl_break()
    .expect("failed to register Ctrl-Break handler");

tokio::select! {
    _ = sigterm.recv() => "CTRL_BREAK (SIGTERM-equivalent)",
    _ = sigint.recv()  => "CTRL_C (SIGINT-equivalent)",
}

}


FYI: On Windows, Ctrl+C can't be reliably sent to a process without a console handle; it will be ignored. The safest option is to use CTRL_BREAK.
