Perf test - try in windows runners#2309
cijothomas wants to merge 49 commits into open-telemetry:main
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##             main    #2309      +/-   ##
==========================================
- Coverage   87.73%   87.71%    -0.02%
==========================================
  Files         578      578
  Lines      198325   198325
==========================================
- Hits       173992   173970       -22
- Misses      23807    23829       +22
  Partials      526      526
```
```diff
@@ -0,0 +1,94 @@
+# 100kLRPS Performance Test (Windows) - OTLP in, OTLP out
```
A hybrid test that starts with 100K EPS and then goes back to the idle state in the same run would be useful.
It will show:
- Whether the CPU/memory characteristics return to the expected idle-state characteristics
- How long it takes to return to the idle-state characteristics
Excellent suggestion. I think we should have it for the existing ones as well, not just windows.
Since some environments will want to cap the max resource utilization (CPU and/or memory) of the OTel receiver/df engine, it would be good to have perf tests that run with these knobs enabled.
Blocked on open-telemetry#2194.

Trying to introduce the batch processor to the perf tests, so as to catch issues like the above earlier, and also to actually measure the perf impact of batching!

---------

Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
Co-authored-by: Laurent Quérel <l.querel@f5.com>
…2265)

### Summary

The traffic generator's static data source produces hardcoded resource attributes with no way to customize them. This makes it impossible to load-test content-based routing pipelines (e.g. `content_router` routing by `tenant.id`) using the built-in generator.

This adds an optional `resource_attributes` field to the traffic generator config. It accepts three forms:

**Single map** (all batches carry the same attributes):

```yaml
resource_attributes:
  tenant.id: prod
  service.namespace: frontend
```

**List of maps** (equal round-robin rotation per batch):

```yaml
resource_attributes:
  - {tenant.id: prod, service.namespace: frontend}
  - {tenant.id: ppe, service.namespace: backend}
```

**Weighted list** (proportional batch split):

```yaml
resource_attributes:
  - attrs: {tenant.id: prod, service.namespace: frontend}
    weight: 3
  - attrs: {tenant.id: ppe, service.namespace: backend}
    weight: 1
```

The weighted form produces a 75%/25% batch split, simulating realistic skewed multi-tenant traffic on a single connection. All three forms are backward-compatible: existing configs are unaffected (defaults to empty).

### Implementation

Rotation uses a precomputed index table built once at startup: `[0, 0, 0, 1]` for the 3:1 example above. The hot path is a single modulo lookup:

```
slot = rotation[batch_rotation_index % rotation.len()]
attrs = &entries[slot].attrs
```

`batch_rotation_index` is a dedicated counter incremented once per emitted batch, fully independent of `signal_count`. Rotation advances exactly once per OTLP message regardless of batch size.
### Limitations

- `resource_attributes` only applies to `data_source: static`
- With `generation_strategy: pre_generated`, only the first attribute set is used; rotation requires fresh generation (or templates)
- Rotation order is naive (`[0, 0, 0, 1]` for 3:1), not smoothly interleaved; smooth weighted round-robin is left as a follow-up

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Laurent Quérel <laurent.querel@gmail.com>
Co-authored-by: albertlockett <a.lockett@f5.com>
…hutdown on shared pdata channels (open-telemetry#2310)

# Change Summary

fix: SharedReceiver::try_recv maps Empty to Closed, causing spurious shutdown on shared pdata channels.

## What issue does this PR close?

None; a spontaneous bug find.

## How are these changes tested?

Unit tests. Temporarily reverted the fix and validated that the `test_recv_when_false_shared_empty_alive_no_shutdown` test fails.

## Are there any user-facing changes?

No

Co-authored-by: Laurent Quérel <l.querel@f5.com>
open-telemetry#2294) To be consistent with other places.
…er. (open-telemetry#2262)

# Change Summary

The Geneva exporter was missing the `start_periodic_telemetry()` call, which meant `CollectTelemetry` messages were never delivered: all exporter metrics showed zero despite successful uploads. This PR fixes that and adds metrics to match what the Azure Monitor exporter already tracks:

- `batches_uploaded` / `batches_failed`: batch-level success/failure
- `records_uploaded` / `records_failed`: individual record counts
- `bytes_uploaded`: compressed payload throughput
- `upload_duration`: upload latency (Mmsc)
- `encode_duration`: encode + compress latency (Mmsc)
- `conversion_errors`, `empty_payloads_skipped`, `unsupported_signals`: error path counters

Tests updated to handle the new telemetry timer message.

## What issue does this PR close?

* Closes #NNN

## How are these changes tested?

Through unit tests.

## Are there any user-facing changes?

N/A
…2292)

# Change Summary

Next part of open-telemetry#1847 and open-telemetry#2086. Moves:

* fanout_processor
* filter_processor
* signal_type_router
* batch_processor

## How are these changes tested?

* Unit tests / CI
* Compiled and ran `df_engine` and confirmed all nodes are still available

## Are there any user-facing changes?

No
# Change Summary

Fixes the values pushed to the heartbeat table.

## How are these changes tested?

Local, manual testing and unit tests.

## Are there any user-facing changes?

No
…metry#2320)

# Change Summary

Missed adding this for the one step in open-telemetry#2279.
…ry#2314)

# Change Summary

Next part of open-telemetry#1847 and open-telemetry#2086. Moves:

* attributes_processor
* content_router
* durable_buffer_processor
* retry_processor
* transform_processor

## How are these changes tested?

* Unit tests / CI
* Compiled and ran `df_engine` and confirmed all nodes are still available

## Are there any user-facing changes?

No
Removed redundant exponential backoff

The Azure SDK already performs exponential backoff internally (e.g., 6 retries over 72s for IMDS via ManagedIdentityCredential). Our additional exponential backoff (5s → 30s with jitter) on top of that added negligible value (4–30% extra wait) and unnecessary complexity. Replaced with a fixed 1-second pause to prevent tight-spinning between SDK retry cycles.

Improved get_token_failed WARN message

Added a message field that tells operators:

- Token acquisition failed
- The exporter will keep retrying (counteracting the SDK's inner error text, which says "the request will no longer be retried")
- The "retries exhausted" language in the error refers to an internal retry layer, not the exporter's outer loop

Full error details remain available at DEBUG level via get_token_failed.details.

Before (two noisy WARN lines per failure, misleading retry timing):

```txt
WARN get_token_failed [attempt=1, error=Auth error: ManagedIdentityCredential authentication failed. retry policy expired and the request will no longer be retried]
WARN retry_scheduled [delay_secs=5.23]
```

After (single clear WARN per failure, self-explanatory):

```txt
WARN get_token_failed [message=Token acquisition failed. Will keep retrying. The error may mention retries being exhausted; that refers to an internal retry layer, not this outer loop., attempt=1, error=Auth error (token acquisition): ManagedIdentityCredential authentication failed. retry policy expired and the request will no longer be retried]
```
This PR contains the following updates: | Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) | |---|---|---|---| | [duckdb](https://redirect.github.com/duckdb/duckdb-python) ([changelog](https://redirect.github.com/duckdb/duckdb-python/releases)) | `==1.4.4` → `==1.5.0` |  |  | --- ### Release Notes <details> <summary>duckdb/duckdb-python (duckdb)</summary> ### [`v1.5.0`](https://redirect.github.com/duckdb/duckdb-python/releases/tag/v1.5.0): DuckDB Python 1.5.0 "Variegata" [Compare Source](https://redirect.github.com/duckdb/duckdb-python/compare/v1.4.4...v1.5.0) This is the 1.5.0 release of DuckDB's Python bindings. For a list of changes in DuckDB core, have a look at the [DuckDB release notes](https://redirect.github.com/duckdb/duckdb/releases/tag/v1.5.0) and [the blogpost](https://duckdb.org/2026/03/09/announcing-duckdb-150.html). ##### Breaking Changes - **Dropped Python 3.9 support.** The minimum supported version is now Python 3.10. - **Removed deprecated `duckdb.typing` and `duckdb.functional` modules.** These were deprecated in 1.4.0. Use `duckdb.sqltypes` and `duckdb.func` instead. - **Renamed `column` parameter to `expression`** in relational API functions (e.g., `min`, `max`, `sum`, `mean`, etc.) to better reflect that these accept expressions, not just column names. - **Deprecated `fetch_arrow_table()` and `fetch_record_batch()`** on connections and relations. Use the new `to_arrow_table()` and `to_arrow_reader()` methods instead. ##### New Features - **Polars LazyFrame projection and filter pushdown.** DuckDB can now push down projections and filters when scanning Polars LazyFrames, including support for cast nodes and unstrict casts. - **Polars Int128 / UInt128 support.** - **VARIANT type support** — Python conversion, NumPy array wrapping, and type stubs. - **TIME\_NS type support** — nanosecond-precision time values across Python, NumPy, and Spark type systems. 
- **Profiling API** — new `get_profiling_info()` and `get_profiling_json()` methods on connections, plus a refactored `query_graph` module with improved HTML visualization (dark mode, expandable phases, depth). - **`to_arrow_table()` and `to_arrow_reader()`** — new methods on connections and relations as the preferred Arrow export API. ##### Performance - **`__arrow_c_stream__` on relations** — relations now export via the Arrow PyCapsule interface using `PhysicalArrowCollector` for zero-copy streaming. - **Unified Arrow stream scanning** via `__arrow_c_stream__`, with filter pushdown only when pyarrow is present. - **Arrow schema caching** to avoid repeated lookups during scanning. - **Arrow object type caching** to avoid repeated detection. - **Empty params treated as None for `.sql()`** — avoids unnecessary parameter binding overhead. - **Simplified GIL management** for `FetchRow`. ##### Bug Fixes - **Fixed Python object leak in scalar UDFs** — `PyObject_CallObject` return values are now properly stolen to avoid reference count leaks. - **Fixed reference cycle** between connections and relations that could prevent garbage collection. - **Relations now hold a reference to their connection**, preventing premature connection closure. - **Fixed fsspec race condition** in the Python filesystem implementation. - **Fixed numeric conversion logic** — improved handling of large integers (fallback to VARCHAR) and UNION types. - **`pyarrow.dataset` import is now optional** — no longer fails if pyarrow is installed without the dataset module. - **Thrown a reasonable error** when an Arrow array stream has already been consumed. ##### Build & Packaging - **jemalloc enabled on Linux x86\_64 only** (aligned with DuckDB core), removed as a separately bundled extension. - **MSVC runtime linked statically** on Windows — eliminates the VS2019 workaround from [duckdb/duckdb#17991](https://redirect.github.com/duckdb/duckdb/issues/17991). 
</details>

---

### Configuration

📅 **Schedule**: Branch creation - "before 8am on Monday" (UTC), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

This PR was generated by [Mend Renovate](https://mend.io/renovate/). View the [repository job log](https://developer.mend.io/github/open-telemetry/otel-arrow).

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
…emetry#2331)

This PR contains the following updates:

| Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) |
|---|---|---|---|
| [charset-normalizer](https://redirect.github.com/jawah/charset_normalizer) ([changelog](https://redirect.github.com/jawah/charset_normalizer/blob/master/CHANGELOG.md)) | `==3.4.5` → `==3.4.6` |  |  |

### Release Notes

<details>
<summary>jawah/charset_normalizer (charset-normalizer)</summary>

### [`v3.4.6`](https://redirect.github.com/jawah/charset_normalizer/blob/HEAD/CHANGELOG.md#346-2026-03-15)

[Compare Source](https://redirect.github.com/jawah/charset_normalizer/compare/3.4.5...3.4.6)

##### Changed

- Flattened the logic in `charset_normalizer.md` for higher performance. Removed `eligible(..)` and `feed(...)` in favor of `feed_info(...)`.
- Raised upper bound for mypy\[c] to 1.20, for our optimized version.
- Updated `UNICODE_RANGES_COMBINED` using Unicode blocks v17.

##### Fixed

- Edge case where noise difference between two candidates can be almost insignificant. ([#​672](https://redirect.github.com/jawah/charset_normalizer/issues/672))
- CLI `--normalize` writing to wrong path when passing multiple files in. ([#​702](https://redirect.github.com/jawah/charset_normalizer/issues/702))

##### Misc

- Freethreaded pre-built wheels now shipped in PyPI starting with 3.14t. ([#​616](https://redirect.github.com/jawah/charset_normalizer/issues/616))

</details>

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
…etry#2332)

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [azure_core](https://redirect.github.com/azure/azure-sdk-for-rust) | workspace.dependencies | minor | `0.32.0` → `0.33.0` |
| [azure_identity](https://redirect.github.com/azure/azure-sdk-for-rust) | workspace.dependencies | minor | `0.32.0` → `0.33.0` |

### Release Notes

<details>
<summary>azure/azure-sdk-for-rust (azure_core)</summary>

### [`v0.33.0`](https://redirect.github.com/Azure/azure-sdk-for-rust/releases/tag/azure_identity%400.33.0)

[Compare Source](https://redirect.github.com/azure/azure-sdk-for-rust/compare/azure_core@0.32.0...azure_core@0.33.0)

#### 0.33.0 (2026-03-09)

##### Breaking Changes

- Support for `wasm32-unknown-unknown` has been removed ([#​3377](https://redirect.github.com/Azure/azure-sdk-for-rust/issues/3377))
- `ClientCertificateCredential::new()` now takes `SecretBytes` instead of `Secret` for the `certificate` parameter. Pass the raw PKCS12 bytes wrapped in `SecretBytes` instead of a base64-encoded string wrapped in `Secret`.

</details>

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Network errors (connect failures, timeouts) were **invisible** in the exporter's metrics dashboard: all HTTP status counters showed 0 even during total export failure. Added a `laclient_network_errors` counter that increments on each failed HTTP attempt before a response is received, making connectivity issues immediately diagnosable.

Tested by turning Wi-Fi off and running the exporter; the new counter helps troubleshoot connectivity issues quickly.
…lemetry#2337) This PR contains the following updates: | Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) | |---|---|---|---| | [opentelemetry-api](https://redirect.github.com/open-telemetry/opentelemetry-python) | `==1.39.1` → `==1.40.0` |  |  | | [opentelemetry-exporter-otlp](https://redirect.github.com/open-telemetry/opentelemetry-python) | `==1.39.1` → `==1.40.0` |  |  | | [opentelemetry-exporter-otlp-proto-common](https://redirect.github.com/open-telemetry/opentelemetry-python) | `==1.39.1` → `==1.40.0` |  |  | | [opentelemetry-exporter-otlp-proto-grpc](https://redirect.github.com/open-telemetry/opentelemetry-python) | `==1.39.1` → `==1.40.0` |  |  | | [opentelemetry-exporter-otlp-proto-http](https://redirect.github.com/open-telemetry/opentelemetry-python) | `==1.39.1` → `==1.40.0` |  |  | | [opentelemetry-proto](https://redirect.github.com/open-telemetry/opentelemetry-python) | `==1.39.1` → `==1.40.0` |  |  | | [opentelemetry-sdk](https://redirect.github.com/open-telemetry/opentelemetry-python) | `==1.39.1` → `==1.40.0` |  |  | --- ### Release Notes <details> <summary>open-telemetry/opentelemetry-python (opentelemetry-api)</summary> ### [`v1.40.0`](https://redirect.github.com/open-telemetry/opentelemetry-python/blob/HEAD/CHANGELOG.md#Version-1400061b0-2026-03-04) [Compare Source](https://redirect.github.com/open-telemetry/opentelemetry-python/compare/v1.39.1...v1.40.0) - `opentelemetry-sdk`: deprecate `LoggingHandler` in favor of `opentelemetry-instrumentation-logging`, see `opentelemetry-instrumentation-logging` documentation ([#​4919](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4919)) - `opentelemetry-sdk`: Clarify log processor error handling expectations in documentation ([#​4915](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4915)) - bump semantic-conventions to v1.40.0 
([#​4941](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4941)) - Add stale PR GitHub Action ([#​4926](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4926)) - `opentelemetry-sdk`: Drop unused Jaeger exporter environment variables (exporter removed in 1.22.0) ([#​4918](https://redirect.github.com/open-telemetry/opentelemetry-python/issues/4918)) - `opentelemetry-sdk`: Clarify timeout units in environment variable documentation ([#​4906](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4906)) - `opentelemetry-exporter-otlp-proto-grpc`: Fix re-initialization of gRPC channel on UNAVAILABLE error ([#​4825](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4825)) - `opentelemetry-exporter-prometheus`: Fix duplicate HELP/TYPE declarations for metrics with different label sets ([#​4868](https://redirect.github.com/open-telemetry/opentelemetry-python/issues/4868)) - Allow loading all resource detectors by setting `OTEL_EXPERIMENTAL_RESOURCE_DETECTORS` to `*` ([#​4819](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4819)) - `opentelemetry-sdk`: Fix the type hint of the `_metrics_data` property to allow `None` ([#​4837](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4837)). 
- Regenerate opentelemetry-proto code with v1.9.0 release ([#​4840](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4840)) - Add python 3.14 support ([#​4798](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4798)) - Silence events API warnings for internal users ([#​4847](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4847)) - opentelemetry-sdk: make it possible to override the default processors in the SDK configurator ([#​4806](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4806)) - Prevent possible endless recursion from happening in `SimpleLogRecordProcessor.on_emit`, ([#​4799](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4799)) and ([#​4867](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4867)). - Implement span start/end metrics ([#​4880](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4880)) - Add environment variable carriers to API ([#​4609](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4609)) - Add experimental composable rule based sampler ([#​4882](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4882)) - Make ConcurrentMultiSpanProcessor fork safe ([#​4862](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4862)) - `opentelemetry-exporter-otlp-proto-http`: fix retry logic and error handling for connection failures in trace, metric, and log exporters ([#​4709](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4709)) - `opentelemetry-sdk`: avoid RuntimeError during iteration of view instrument match dictionary in MetricReaderStorage.collect() ([#​4891](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4891)) - Implement experimental TracerConfigurator ([#​4861](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4861)) - `opentelemetry-sdk`: Fix instrument creation race condition 
([#​4913](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4913))
- bump semantic-conventions to v1.39.0 ([#​4914](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4914))
- `opentelemetry-sdk`: automatically generate configuration models using OTel config JSON schema ([#​4879](https://redirect.github.com/open-telemetry/opentelemetry-python/pull/4879))

</details>

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
…zation (open-telemetry#2318)

# Change Summary

- Pre-serialize resource+scope fields once per ScopeLogs as a JSON byte prefix
- Write per-record fields directly to the byte buffer using itoa/ryu, bypassing serde_json::Map entirely
- Add write_json_string, write_json_hex, write_field_value_json for zero-allocation JSON output
- Make config and metrics modules public for benchmark access
- Add criterion benchmark under contrib-nodes/benches/exporters/azure_monitor_exporter/
- Added contrib bench to rust-bench workflow

Benchmark results (1000 records):

- Original: 1.60ms (~625K records/s)
- Hoisted: 1.36ms (~735K records/s), +17%
- Hoisted + Direct Serialization: 425us (~2.35M records/s), +275%

## What issue does this PR close?

## How are these changes tested?

- Existing unit tests already cover mapping uniqueness; also added tests to make sure that overlapping fields are rejected by config validation across the resource -> scope -> log hierarchy.
- Added tests validating against encoding issues for non-ASCII characters.

## Are there any user-facing changes?

None.
…ry#2336)

This PR contains the following updates:

| Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) |
|---|---|---|---|
| [pydantic-core](https://redirect.github.com/pydantic/pydantic-core) | `==2.41.5` → `==2.42.0` |  |  |

> [!WARNING]
> Some dependencies could not be looked up. Check the [Dependency Dashboard](..open-telemetry/issues/417) for more information.

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
open-telemetry#2323) … to core-nodes crate

# Change Summary

Next part of open-telemetry#1847 and open-telemetry#2086. Moves:

* perf_exporter
* topic_exporter
* internal_telemetry_receiver
* topic_receiver

## How are these changes tested?

* Unit tests / CI
* Compiled and ran `df_engine` and confirmed all nodes are still available

## Are there any user-facing changes?

No
# Change Summary

Last Wednesday's SIG meeting discussion hinted that our OTAP query engine probably had some performance to be gained in its filtering implementation. I ran the profiler on the benchmarks and, lo and behold, we were spending a lot of time creating `RoaringBitmap`s and checking if IDs were contained therein.

Note - when filtering we were using these `RoaringBitmap`s in two places:

a) When filtering by an attribute value in a query like `where attributes["x"] = "y"`, we filter the attribute record batch where `key == "x" and str == "y"`, then create a bitmap of the parent_id column, then we scan the parent record batch, determining which values from the `id` column are in this bitmap, and keep only those rows. In these situations, we may also and/or/not the bitmaps together if the query was something like `attributes["x1"] == "y1" and attributes["x2"] == "y2"`.

b) When we're finished with the entire filter execution, we will have determined which rows to keep for the parent batch, and then we need to recursively filter the child record batches. We would do this by creating a bitmap of the ID column on the parent batch, then scanning the child `parent_id` column, checking which rows were in this bitmap and keeping only those rows.

This PR replaces the usage of `RoaringBitmap` with a simpler type called `IdBitmap`, which contains pages, each a flat bitmap backed by an array of u64s. This is much faster because, unlike `RoaringBitmap`, we never have to search for the container during insert/contains. While `RoaringBitmap` is designed for potentially sparse data, our data is dense: in the vast majority of cases the ID columns will be 0...max with few or no gaps, so a more traditional bitmap makes sense here. This PR also reuses the heap allocations of the bitmap between batches.

Note that because we potentially create many of these bitmaps during a filter execution (see note above about logically combining attribute filters), we implement reuse via a pool.

Benchmark results:

| Benchmark | Batch Size | Before | After | Change |
|---|---|---|---|---|
| `simple_field_filter` | 32 | 7.58 µs | 7.61 µs | ~0% |
| `simple_field_filter` | 1024 | 22.32 µs | 14.39 µs | **-35.5%** |
| `simple_field_filter` | 8192 | 140.86 µs | 65.67 µs | **-53.4%** |
| `simple_attr_filter` | 32 | 10.17 µs | 9.93 µs | **-2.4%** |
| `simple_attr_filter` | 1024 | 41.43 µs | 22.75 µs | **-45.1%** |
| `simple_attr_filter` | 8192 | 308.44 µs | 120.45 µs | **-60.9%** |
| `attr_or_attr_filter` | 32 | 12.92 µs | 12.57 µs | **-2.7%** |
| `attr_or_attr_filter` | 1024 | 56.77 µs | 31.04 µs | **-45.3%** |
| `attr_or_attr_filter` | 8192 | 427.42 µs | 167.81 µs | **-60.7%** |
| `attr_and_prop_filter` | 32 | 10.84 µs | 10.92 µs | ~0% |
| `attr_and_prop_filter` | 1024 | 35.98 µs | 21.54 µs | **-40.1%** |
| `attr_and_prop_filter` | 8192 | 263.80 µs | 100.78 µs | **-61.8%** |
| `attr_and_attr_filter` | 32 | 4.32 µs | 4.38 µs | ~0% |
| `attr_and_attr_filter` | 1024 | 13.40 µs | 9.41 µs | **-29.8%** |
| `attr_and_attr_filter` | 8192 | 92.65 µs | 49.49 µs | **-46.6%** |
| `attr_and_or_together_filter` | 32 | 16.31 µs | 16.37 µs | ~0% |
| `attr_and_or_together_filter` | 1024 | 52.78 µs | 34.01 µs | **-35.6%** |
| `attr_and_or_together_filter` | 8192 | 387.77 µs | 164.41 µs | **-57.6%** |
| `or_short_circuit` | 32 | 3.38 µs | 3.25 µs | **-3.8%** |
| `or_short_circuit` | 1024 | 24.19 µs | 8.49 µs | **-64.9%** |
| `or_short_circuit` | 8192 | 127.50 µs | 48.67 µs | **-61.8%** |

(Note that filter benchmarks `attr_and_or_together_filter` & `and_attrs_short_circuit` are omitted from the table because they filter out all the rows before any bitmap operations, so there is no performance change.)

## What issue does this PR close?

* Closes open-telemetry#2330

## How are these changes tested?

Unit tests

## Are there any user-facing changes?

No

## Future work/followups

We should probably look at other places we use `RoaringBitmap` for this same purpose to see if we can get some performance benefit from using this new IdBitmap type. For example, I think this same type of procedure is also done in the filter processor.
# Change Summary

Allow users to define test containers that start up and run alongside the pipeline engine, to test pipeline nodes that need an external system to talk to. Allow users to describe connections to test containers in the Scenario: for example, users can configure their Generator/Capture to send to or receive from a test container, and can also specify which pipeline nodes should connect to a test container. All test container wiring is handled by the framework; users just need to specify what internal port they want to talk to.

Updated README.md with new functions and examples of the test container feature.

## What issue does this PR close?

* Closes open-telemetry#2258

## How are these changes tested?

Added unit tests

## Are there any user-facing changes?

no

---------

Co-authored-by: Drew Relmas <drewrelmas@gmail.com>
Old runner:
- name: `oracle-bare-metal-64cpu-512gb-x86-64`
- 512gb memory
- Oracle Linux 8

New runner:
- name: `oracle-bare-metal-64cpu-1024gb-x86-64-ubuntu-24`
- 1024gb memory
- Ubuntu 24

I realize this could have some impact on benchmark baselines, so please post on open-telemetry/community#3333 once you have migrated and are comfortable with the old one being removed.
…pen-telemetry#2306)

# Implement zero-copy view for OTAP Traces

Fixes open-telemetry#2053 (Traces portion). This PR introduces a `.traces` sub-module inside `pdata/src/views/otap.rs`, implementing the `OtapTracesView`.

## Changes made

- Created `OtapTracesView` in `crates/pdata/src/views/otap/traces.rs`.
- Added zero-copy traversal elements mirroring the OTLP traces model:
  - `OtapResourceSpansView`
  - `OtapScopeSpansView`
  - `OtapSpanView`
  - `OtapEventView`, `OtapLinkView`, `OtapStatusView`
- Exposed the `traces` module in `otap.rs`.
- Adapted array access patterns to use standard traits like `ByteArrayAccessor` and `StringArrayAccessor`.
- Modified `OtapAttributeView` in `logs.rs` to expose its `key` and `value` fields as `pub(crate)` so they can be re-used by `traces.rs`.

## Validation results

- `cargo test -p otap-df-pdata` passes.
- No memory leaks introduced; the logic is completely zero-copy across all RecordBatch abstractions for traces.
- Unit tests (`test_create_otap_traces_view`, `test_span_fields`, `test_span_status`, `test_missing_optional_columns`, `test_events_iteration`) run and pass without lifetime or compilation issues.

Co-authored-by: albertlockett <a.lockett@f5.com>
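The zero-copy pattern these views follow can be illustrated with a small sketch. The real `OtapSpanView` borrows from Arrow `RecordBatch` columns via accessor traits; here a plain slice stands in, and the struct/field names are illustrative, not the crate's actual types.

```rust
// A "view" owns no data: it borrows the underlying storage and returns
// borrowed slices instead of copying values out.
struct SpanView<'a> {
    names: &'a [String], // stand-in for an Arrow string column
    index: usize,        // row position of this span in the batch
}

impl<'a> SpanView<'a> {
    // Returns a &str borrowing directly from the backing storage.
    fn name(&self) -> &'a str {
        &self.names[self.index]
    }
}

fn main() {
    let names = vec!["span-a".to_string(), "span-b".to_string()];
    let view = SpanView { names: &names, index: 1 };
    // No allocation or copy happens on access.
    assert_eq!(view.name(), "span-b");
    println!("{}", view.name());
}
```

The lifetime parameter `'a` is what makes "zero-copy across all RecordBatch abstractions" checkable by the compiler: a view cannot outlive the batch it borrows from.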
…elemetry#2351)

# Change Summary

As noted in [this workflow run](https://github.com/open-telemetry/otel-arrow/actions/runs/23181837916/job/67356323365?pr=2306), recent Renovate updates to some Python dependencies for perf testing didn't work properly.

open-telemetry#2336 updated `pydantic-core` from `2.41.5` to `2.42.0`. However, this is an indirect dependency of `pydantic`, which is a direct dependency in the requirements file:

https://github.com/open-telemetry/otel-arrow/blob/59ef72fdbe2003a1425bb5c700d3de0579ffb050/tools/pipeline_perf_test/orchestrator/requirements.txt#L6

`pydantic` `2.12.5` requires EXACTLY `pydantic-core` `2.41.5`, so it was a bad Renovate update. Based on the Renovate docs, we should be able to disable indirect dependency updates like this by matching on `matchDepTypes: ["indirect"]` and disabling the rule.

In addition, there is some mismatch between the Python versions used in the repo: `3.11` and `3.14`. This can also lead to bad side effects if lock files are generated with a version different from the one used at runtime during workflow runs.
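A minimal sketch of such a rule, assuming a standard `renovate.json` `packageRules` entry (the exact placement in this repo's Renovate config may differ):

```json
{
  "packageRules": [
    {
      "matchDepTypes": ["indirect"],
      "enabled": false
    }
  ]
}
```

With this rule, Renovate skips transitive dependencies like `pydantic-core` and only proposes updates to direct dependencies, whose constraints the lock resolution can then satisfy consistently.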
…rtup stalls (open-telemetry#2335)

# Change Summary

Fixes a startup race where pipeline cores could get stuck in `Pending` on high-core machines.

Engine lifecycle events (`Admitted`, `Ready`) shared the same fixed-size bounded observed-state channel as lossy async log events. Under a startup burst, `Admitted` could be dropped when `send_timeout(1ms)` expired. When `Ready` arrived later, the state machine rejected the `Pending -> Ready` transition as invalid, leaving the core stuck.

This change separates reliability classes: engine lifecycle events now go through a dedicated unbounded channel, while async log events stay on the existing bounded lossy path. The unbounded channel is intentional: engine events are low-volume, correctness-critical, and naturally bounded by pipeline/core lifecycle activity.

**Alternate approaches considered:**

- Increasing the bounded channel size would only reduce the probability of failure under burst; it would not guarantee delivery of lifecycle events.
- Making the state machine accept `Pending` -> `Ready` would mask dropped lifecycle events instead of fixing delivery.

## What issue does this PR close?

* Closes open-telemetry#2328

## How are these changes tested?

```bash
$ cargo test -p otap-df-state -- --nocapture
$ cargo check -p otap-df-controller -p otap-df-state -p otap-df-telemetry -p otap-df-config
```

## Are there any user-facing changes?

No
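The two reliability classes described above can be sketched with standard-library channels. This is illustrative only: the engine's actual channels and event types differ, and the names here are invented.

```rust
use std::sync::mpsc;

fn main() {
    // Lossy bounded path for high-volume log events: when the buffer is
    // full, the event is dropped rather than blocking the sender.
    let (log_tx, log_rx) = mpsc::sync_channel::<&str>(1);
    log_tx.try_send("log-1").unwrap();
    // Buffer is full here, so this send fails and the event is lost.
    assert!(log_tx.try_send("log-2").is_err());

    // Dedicated unbounded path for lifecycle events: delivery is
    // guaranteed, which preserves the Pending -> Admitted -> Ready order.
    let (life_tx, life_rx) = mpsc::channel::<&str>();
    life_tx.send("Admitted").unwrap();
    life_tx.send("Ready").unwrap();

    assert_eq!(log_rx.recv().unwrap(), "log-1");
    assert_eq!(life_rx.recv().unwrap(), "Admitted");
    assert_eq!(life_rx.recv().unwrap(), "Ready");
    println!("lifecycle delivery guaranteed");
}
```

The design point is that mixing the two classes on one bounded channel forces a choice between blocking log senders and dropping lifecycle events; splitting the channels removes the trade-off entirely.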
… crate (open-telemetry#2339)

# Change Summary

Next part of open-telemetry#1847 and open-telemetry#2086.

## How are these changes tested?

* Unit tests / CLI
* Compiled and ran `df_engine` and confirmed all nodes are still available

## Are there any user-facing changes?

No
…2346)

# Change Summary

I previously added payload definitions in open-telemetry#2240, mostly focused on solving the dictionary key size problem. This PR builds on that work with a more refined implementation that also covers things like required vs. optional columns, as a prerequisite for open-telemetry#2289.

The major changes:

- Added an OTAP `Schema` type and redefined the payloads in terms of it. This is a better construct that is roughly symmetrical to Arrow schemas, allows defining recursive Struct and List types, and consequently removes the need for nested lookup tables.
- Added deep equality checks between record batches and `Schema` types.
- Updated some small bits of the `transform` module for the changes.

## What issue does this PR close?

* Part of open-telemetry#2289

## How are these changes tested?

Unit

## Are there any user-facing changes?

No

---------

Co-authored-by: albertlockett <a.lockett@f5.com>
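A rough sketch of what a recursive schema type of this shape can look like, in the spirit of Arrow's `DataType`/`Field`. The names and variants here are illustrative, not the crate's actual `Schema` definitions.

```rust
// A data type that can nest: Struct holds named fields, List holds an
// element type, so arbitrary recursive shapes can be described.
#[derive(Debug, PartialEq)]
enum DataType {
    UInt16,
    Utf8,
    Struct(Vec<Field>),
    List(Box<DataType>),
}

// A named column with a required/optional flag, mirroring the
// "required vs optional columns" distinction mentioned above.
#[derive(Debug, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    required: bool,
}

fn main() {
    // Example: a required "resource" struct containing an optional
    // list-of-strings "attrs" column.
    let resource = Field {
        name: "resource".into(),
        data_type: DataType::Struct(vec![Field {
            name: "attrs".into(),
            data_type: DataType::List(Box::new(DataType::Utf8)),
            required: false,
        }]),
        required: true,
    };
    assert!(resource.required);
    println!("{}", resource.name);
}
```

Because the nesting lives inside the type itself, deep equality between two schemas (or between a schema and a record batch's layout) is a plain recursive comparison, with no side lookup tables.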
…erator (open-telemetry#2347)

We need to see pipeline performance when using large payloads etc. The generator currently repeats the same string, which compresses well; addressing that (with truly random values) would be a future addition.
…elemetry#2211)

Add a `node.processor` metric set with `process.success.duration` and `process.failed.duration` Mmsc instruments for measuring the wall-clock duration of the work done in a `process()` call. A closure is used to prevent inclusion of async await points in the measurement. The metric is registered via the node telemetry context. This is intended to be gated by `MetricLevel >= Normal`.

Fixes open-telemetry#2210.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Utkarsh Umesan Pillai <66651184+utpilla@users.noreply.github.com>
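The closure-based timing approach can be sketched as follows. This is a simplified stand-in: the actual instrument recording and the engine's telemetry context are omitted, and `measure` is a hypothetical helper.

```rust
use std::time::{Duration, Instant};

// Run `f` synchronously and return its result together with the
// wall-clock duration of just the closure body. Because the closure is
// not async, no await point can land inside the measured region.
fn measure<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

fn main() {
    let (sum, elapsed) = measure(|| (1..=100).sum::<u64>());
    assert_eq!(sum, 5050);
    // `elapsed` covers only the work itself, not any surrounding
    // scheduling or await gaps an async measurement would include.
    assert!(elapsed < Duration::from_secs(1));
    println!("sum={sum}");
}
```

Measuring inside a synchronous closure is what keeps the metric meaningful: timing across an `.await` would also count time the task spent parked on the executor, not just the processing work.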
https://github.com/open-telemetry/community/blob/main/assets.md#large-windows-runners

Looks like we have access to Windows runners, so we can try those instead of the GitHub-hosted runners, which are not very powerful and give flaky behavior.
Attempting to see if we can replicate a subset of the perf runs on Windows. Given there are no dedicated perf machines, we have to try the GitHub runners.
Starting with just the idle state to see if this is even feasible in CI.