**Codecov Report** ✅ All modified and coverable lines are covered by tests.

Coverage Diff (main vs. #1528):

|          | main   | #1528  | +/-    |
|----------|--------|--------|--------|
| Coverage | 87.73% | 87.71% | -0.02% |
| Files    | 578    | 578    |        |
| Lines    | 198334 | 198413 | +79    |
| Hits     | 174012 | 174044 | +32    |
| Misses   | 23796  | 23843  | +47    |
| Partials | 526    | 526    |        |
**lquerel** left a comment:
I really like this document.
I think we should also include the OTLP to OTLP scenario in the different sections since it will be one of the most common scenarios, at least in the beginning.
I also think we should add the wait_for_result mode in the otel-arrow section because it provides a true end to end unified ack/nack mechanism, which I believe is not fully supported by the Go collector.
Fixed one TODO! #1528 - Still working on this separately, which will include actual numbers for key scenarios, so readers don't have to go through the graphs themselves!
This pull request has been marked as stale due to lack of recent activity. It will be closed in 30 days if no further activity occurs. If this PR is still relevant, please comment or push new commits to keep it active.
> performance characteristics and efficient resource utilization across varying
> load conditions. The engine uses a [thread-per-core
> architecture](#thread-per-core-design) where resource consumption scales with
> the number of configured cores.
> resource consumption scales with the number of configured cores

I found this a bit hard to interpret. Do you mean "the throughput scales with the number of configured CPU cores, almost in a linear fashion"?
ya it reads weird.. I'll update with a better wording
(I meant throughput scales linearly, but so does memory consumption)
> All performance tests are executed on bare-metal compute instance with the
> following specifications:
>
> - **CPU**: 64 physical cores / 128 logical cores (x86-64 architecture)
Consider calling out the number of NUMA groups?
ya. We just confirmed that the CNCF machine has 2 sockets
https://github.com/open-telemetry/otel-arrow/actions/runs/23278418373#summary-67686308469
So far no tests have run the engine on more than 32 cores (and those cores were all on the same NUMA node). I have to see if we can actually do that, given that load-gen and fake-backend also need cores to run on.
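As a side note on the NUMA discussion above, the node/CPU layout of a Linux test machine can be checked programmatically before pinning the engine, load-gen, and fake-backend to cores. A minimal sketch (assumes a Linux sysfs layout; this is illustrative tooling, not part of the benchmark suite):

```python
# Illustrative helper: count NUMA nodes and logical CPUs on Linux via sysfs.
import os
import re

def numa_node_count(sysfs: str = "/sys/devices/system/node") -> int:
    """Return the number of NUMA nodes exposed by the kernel, or 1 if
    the sysfs directory is absent (non-Linux or no NUMA info)."""
    try:
        return len([d for d in os.listdir(sysfs)
                    if re.fullmatch(r"node\d+", d)])
    except FileNotFoundError:
        return 1  # no NUMA topology exposed; treat as a single node

print("NUMA nodes:", numa_node_count())
print("logical CPUs:", os.cpu_count())
```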
> ### Test Environment
>
> All performance tests are executed on bare-metal compute instance with the
Suggested change: "All performance tests are executed on bare-metal compute instance with the" → "All performance tests are executed on a dedicated bare-metal compute instance with the"
Not sure if this is a shared resource or dedicated, consider calling it out.
There was a problem hiding this comment.
it is dedicated. Will update
> *Note: CPU usage is normalized (percentage of total system capacity). Memory
> usage scales with core count due to the [thread-per-core
> architecture](#thread-per-core-design).*
Memory usage could be confusing (cached, shared, non-paged pool, virtual vs. physical memory, ...); consider aligning with, and pointing to, the OTel System metrics semantic conventions: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/system/system-metrics.md.
> This represents the optimal scenario where the dataflow engine operates with its
> native protocol end-to-end, eliminating protocol conversion overhead.
>
> ##### Standard Load - OTLP -> OTLP (Standard Protocol)
Which OTLP? (gRPC, proto via HTTP 1.1, JSON, TLS enabled vs. not)
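One way to resolve this ambiguity in the doc would be to pin down the exact endpoint configuration under test. A hedged sketch in OpenTelemetry Collector-style config (the endpoints and TLS settings below are illustrative, not the actual benchmark configuration):

```yaml
# Illustrative only: the doc should state which OTLP flavor is benchmarked.
receivers:
  otlp:
    protocols:
      grpc:                      # OTLP/gRPC (protobuf over HTTP/2)
        endpoint: 0.0.0.0:4317
      http:                      # OTLP/HTTP (protobuf or JSON over HTTP/1.1)
        endpoint: 0.0.0.0:4318
        # tls:                   # enable to benchmark the TLS-on variant
        #   cert_file: server.crt
        #   key_file: server.key
```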
> engine and the OpenTelemetry Collector, we use **Syslog (UDP/TCP)** as the
> ingress protocol for both systems.
>
> #### Rationale for Syslog-Based Comparison
Suggested change: "Rationale for Syslog-Based Comparison" → "Rationale for Syslog-based Comparison"
> Scaling Efficiency = (Throughput at N cores) / (N * Single-core throughput)
>
> ### Architecture
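The scaling-efficiency metric quoted above is straightforward to compute; a small sketch with made-up throughput numbers (perfect linear scaling yields 1.0, real measurements will come from the benchmark suite):

```python
# Hedged sketch of the quoted metric; the numbers below are illustrative,
# not actual benchmark results.

def scaling_efficiency(throughput_n_cores: float, n: int,
                       single_core_throughput: float) -> float:
    """Scaling Efficiency = (Throughput at N cores) / (N * single-core throughput)."""
    return throughput_n_cores / (n * single_core_throughput)

single = 100_000  # events/sec on 1 core (hypothetical)
# 380k events/sec on 4 cores vs. an ideal 400k -> efficiency 0.95
print(round(scaling_efficiency(380_000, 4, single), 2))  # prints 0.95
```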
It is a bit weird to have an architecture section in the benchmark document (unless it is talking about the benchmarking environment's own architecture).
This doc is an attempt at the schema for our Phase 2 performance summary, once Phase 2 is completed. It defines the key scenarios (Idle, 100k Load, Saturation) and the comparative analysis with OTLP/Collector. I've put TBD for the actual numbers, as this is just attempting to finalize what we want to have in an easy-to-consume format. Actual numbers will be filled in later. This can also be used to spot gaps in the perf test suites, i.e. tests we may want to add.
The existing pages like https://open-telemetry.github.io/otel-arrow/benchmarks/nightly/backpressure/ are still retained. This doc will have distilled information from them.