
Crash-Proofing Your OpenTelemetry Collector


This repository demonstrates how the choice of batching strategy affects persistent queuing in the OpenTelemetry Collector. It is a companion for the talk Crash-Proofing Your OpenTelemetry Collector, which compares two different approaches for handling batches with persistent queues.

Overview

When an OpenTelemetry Collector crashes or is terminated unexpectedly, any telemetry data held only in memory is lost. This demo shows why the batch processor is vulnerable to exactly that, and how to prevent data loss by combining persistent storage with OTLP batch (batching inside the exporter's sending queue).

Architecture

The demo consists of:

  • OpenTelemetry Collector - Receives and exports telemetry data
  • Jaeger - Backend for visualizing traces
  • Telemetrygen - Tool for generating sample traces
  • File Storage - On-disk storage that persists queued telemetry data across Collector restarts
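The services above are wired together with Docker Compose. The repository's actual compose file is not reproduced here; a minimal sketch of one profile (image tags, ports, and volume paths are assumptions — the profile and service names come from the commands later in this README) could look like:

```yaml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI

  otelcol-batch_processor:
    image: otel/opentelemetry-collector-contrib:latest
    profiles: [batch_processor]
    ports:
      - "4317:4317"     # OTLP gRPC receiver
    volumes:
      # Collector config for this scenario
      - ./otelcol-batch_processor/otelcol-config.yaml:/etc/otelcol-contrib/config.yaml
      # Directory backing the file_storage extension, so the
      # persistent queue survives container restarts
      - ./file_storage:/var/lib/otelcol/file_storage
```

The otlp_batcher profile would define an analogous otelcol-otlp_batcher service pointing at its own config file.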

Prerequisites

  • Docker and Docker Compose installed
  • Bash shell (Linux/macOS)

What the run_test script does

The automated test script (run_test.sh) performs the following steps:

  1. Cleans up any existing file storage to start fresh
  2. Starts the Collector and Jaeger backend using Docker Compose
  3. Sends 100 traces to the Collector using telemetrygen
  4. Waits 15 seconds
  5. Forcefully kills the Collector container to simulate a crash scenario
  6. Prompts you to check Jaeger (you should see no traces yet)
  7. Instructs you to restart the Collector to observe two different scenarios:
    1. Data loss with the batch processor
    2. Data recovery with OTLP batch
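The steps above can be sketched as a shell script. This is an illustrative reconstruction, not the repository's actual run_test.sh — the storage path, container naming, and echo messages are assumptions:

```shell
#!/usr/bin/env bash
set -euo pipefail

PROFILE="$1"  # batch_processor or otlp_batcher

# 1. Clean up any existing file storage to start fresh
rm -rf ./file_storage/*

# 2. Start the Collector and Jaeger backend
docker compose --profile "$PROFILE" up -d

# 3. Send 100 traces to the Collector
telemetrygen traces --otlp-insecure --otlp-endpoint localhost:4317 --traces 100

# 4. Wait while the spans sit in the Collector's buffer/queue
sleep 15

# 5. Forcefully kill the Collector container to simulate a crash
docker kill "otelcol-$PROFILE"

# 6./7. Manual steps: check Jaeger, then restart the Collector
echo "Open http://localhost:16686 - no traces should be visible yet."
echo "Restart with: docker compose --profile $PROFILE up otelcol-$PROFILE"
```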

Quick Start

Running the demo with the batch processor

./run_test.sh batch_processor

Running the demo with the OTLP batch

./run_test.sh otlp_batcher

Verifying the Results

Step 1: Check for Missing Data

After the Collector is killed, open Jaeger in your browser: http://localhost:16686

You should see no traces because they were still in the buffer when the Collector crashed.
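If you prefer the command line, Jaeger's standard HTTP query API can confirm the same thing (assuming the default port mapping above):

```shell
# Lists the services Jaeger knows about. Right after the crash,
# the telemetrygen service should not appear in the response yet.
curl -s http://localhost:16686/api/services
```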

Step 2: Restart the Collector and Verify Data Loss/Recovery

Restart the Collector with:

Batch processor (data loss)

# For batch_processor test
docker compose --profile batch_processor up otelcol-batch_processor

Now refresh Jaeger. No matter how long you wait, no traces appear: the spans were buffered in memory by the batch processor and were lost when the Collector was killed. That is telemetry data loss!

OTLP batch (data recovery)

# For otlp_batcher test
docker compose --profile otlp_batcher up otelcol-otlp_batcher

Now refresh Jaeger. After a couple of seconds you should see the 100 traces that were recovered from persistent storage!

Configuration Comparison

batch processor

Location: otelcol-batch_processor/otelcol-config.yaml

processors:
  batch:
    timeout: 30s

exporters:
  otlp:
    sending_queue:
      storage: file_storage

This configuration uses:

  • The batch processor with a 30s timeout (spans accumulate in memory before they reach the exporter's queue)
  • A file-backed sending queue in the exporter
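The snippet above shows only the relevant fragment. A fuller version with the receiver, extension, and pipeline wiring (the storage directory, endpoints, and TLS settings here are assumptions) might look like:

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 30s

exporters:
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true
    sending_queue:
      storage: file_storage

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]   # spans wait here, in memory, for up to 30s
      exporters: [otlp]
```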

OTLP batch

Location: otelcol-otlp_batcher/otelcol-config.yaml

exporters:
  otlp:
    sending_queue:
      storage: file_storage
      sizer: "items"
      queue_size: 10000
      batch:
        min_size: 10000
        flush_timeout: 30s

This configuration uses:

  • Batching configured directly in the exporter's sending queue
  • A file-backed sending queue with item-based sizing
  • A batch flush timeout of 30s
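As above, a fuller version with the surrounding wiring (endpoints, storage directory, and TLS settings are assumptions) might look like:

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true
    sending_queue:
      storage: file_storage
      sizer: "items"
      queue_size: 10000
      batch:
        min_size: 10000
        flush_timeout: 30s

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: []        # no batch processor in the pipeline
      exporters: [otlp]
```

The key difference: with batching inside the sending queue, spans are written to the persistent queue as they arrive and only batched on the way out, so a crash leaves them on disk for recovery. The batch processor instead holds spans in memory before they ever reach the queue.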

Cleanup

To stop all containers and clean up file storage:

./run_test.sh cleanup

This will:

  • Stop and remove all Docker containers from both profiles
  • Clean up both file storage directories


License

This project is open source and available under the Apache License 2.0.
