This repository demonstrates how batching strategy affects persistent queuing in the OpenTelemetry Collector. It is a companion to the talk Crash-Proofing Your OpenTelemetry Collector, which compares two approaches to handling batches with persistent queues.
When an OpenTelemetry Collector crashes or is terminated unexpectedly, any telemetry data held only in memory is lost. This demo shows why the batch processor should be avoided in this scenario, and how to prevent data loss by combining persistent storage with the exporter's built-in batching (referred to here as OTLP batch).
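To see why a file-backed queue survives a crash while an in-memory buffer does not, here is a minimal sketch in plain Python. This is not Collector code: the `FileQueue` class, the JSON on-disk format, and the span names are all illustrative assumptions; the real Collector uses the file_storage extension with its own encoding.

```python
import json
import os
import tempfile


class FileQueue:
    """Toy file-backed queue: every item is persisted to disk before use."""

    def __init__(self, path):
        self.path = path
        self.items = []
        # On startup, recover anything a previous (crashed) process left behind.
        if os.path.exists(path):
            with open(path) as f:
                self.items = json.load(f)

    def _persist(self):
        with open(self.path, "w") as f:
            json.dump(self.items, f)

    def put(self, item):
        self.items.append(item)
        self._persist()  # durable before the producer moves on

    def drain(self):
        drained, self.items = self.items, []
        self._persist()
        return drained


# A producer enqueues three spans, then the process "crashes" before exporting.
path = os.path.join(tempfile.mkdtemp(), "queue.json")
q = FileQueue(path)
for span in ["span-1", "span-2", "span-3"]:
    q.put(span)
del q  # simulated crash: all in-memory state is gone, the file remains

# On restart, a fresh queue instance recovers the un-exported spans from disk.
q2 = FileQueue(path)
recovered = q2.drain()
print(recovered)  # → ['span-1', 'span-2', 'span-3']
```

An in-memory buffer (as with the batch processor) would have lost all three spans at the `del q` step; the file-backed queue hands them to the exporter after restart.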
The demo consists of:
- OpenTelemetry Collector - Receives and exports telemetry data
- Jaeger - Backend for visualizing traces
- Telemetrygen - Tool for generating sample traces
- File Storage - Persistent storage to store telemetry data
To run the demo you will need:
- Docker and Docker Compose installed
- A Bash shell (Linux/macOS)
The automated test script (run_test.sh) performs the following steps:
- Cleans up any existing file storage to start fresh
- Starts the Collector and Jaeger backend using Docker Compose
- Sends 100 traces to the Collector using telemetrygen
- Waits 15 seconds
- Forcefully kills the Collector container to simulate a crash scenario
- Prompts you to check Jaeger (you should see no traces yet)
- Instructs you to restart the Collector to observe two different scenarios:
- Data loss with the batch processor
- Data recovery with OTLP batch
Run one of the two scenarios:

```shell
./run_test.sh batch_processor
# or
./run_test.sh otlp_batcher
```

After the Collector is killed, open Jaeger in your browser: http://localhost:16686
You should see no traces because they were still in the buffer when the Collector crashed.
Restart the Collector with:
```shell
# For the batch_processor test
docker compose --profile batch_processor up otelcol-batch_processor
```

Now refresh Jaeger. No matter how long you wait, no traces appear: the batched telemetry was lost along with the crashed process.
```shell
# For the otlp_batcher test
docker compose --profile otlp_batcher up otelcol-otlp_batcher
```

Now refresh Jaeger. After a couple of seconds you should see the 100 traces that were recovered from persistent storage!
Location: otelcol-batch_processor/otelcol-config.yaml
```yaml
processors:
  batch:
    timeout: 30s

exporters:
  otlp:
    sending_queue:
      storage: file_storage
```

This configuration uses:
- The batch processor with a 30s timeout
- A file-backed sending queue in the exporter
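For context, here is a sketch of how these pieces could fit into a complete Collector config. Everything beyond the snippet above is an assumption, not the repository's exact file: the receiver wiring, the `file_storage` directory, and the `jaeger:4317` endpoint are illustrative.

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage   # assumed path

receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:
    timeout: 30s        # spans sit in processor memory for up to 30s

exporters:
  otlp:
    endpoint: jaeger:4317    # assumed backend endpoint
    sending_queue:
      storage: file_storage  # persistence begins only after batching

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

The key weakness is visible in the pipeline order: spans wait in the batch processor's memory before they ever reach the persistent sending queue, so a crash during that window loses them.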
Location: otelcol-otlp_batcher/otelcol-config.yaml
```yaml
exporters:
  otlp:
    sending_queue:
      storage: file_storage
      sizer: "items"
      queue_size: 10000
      batch:
        min_size: 10000
        flush_timeout: 30s
```

This configuration uses:
- Batching configured directly in the exporter's sending queue
- A file-backed sending queue with item-based sizing
- A batch flush timeout of 30s
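As with the previous scenario, a fuller sketch may help place the snippet in a complete config. The receiver wiring, `file_storage` directory, and `jaeger:4317` endpoint are assumptions for illustration only.

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage   # assumed path

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp:
    endpoint: jaeger:4317    # assumed backend endpoint
    sending_queue:
      storage: file_storage
      sizer: "items"
      queue_size: 10000
      batch:
        min_size: 10000
        flush_timeout: 30s

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]      # no batch processor in the pipeline
```

Because there is no batch processor, spans are written to the file-backed queue as soon as they leave the receiver; batching happens inside the queue, where the data is already durable.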
To stop all containers and clean up file storage:
```shell
./run_test.sh cleanup
```

This will:
- Stop and remove all Docker containers from both profiles
- Clean up both file storage directories
- OpenTelemetry Collector Resiliency Documentation
- Batch Processor
- Exporter Helper (contains OTLP batch)
- Storage Extension
This project is open source and available under the Apache License 2.0.
