Our eBPF Observability Probes Created Heisenbugs
The promise of eBPF is the holy grail for any engineer: deep, kernel-level visibility into your systems with minimal performance overhead. It’s a revolutionary technology that lets you observe the inner workings of your applications without changing their code. But what happens when the tool you use to find problems is the problem? We learned this lesson the hard way when we discovered that our eBPF observability probes created Heisenbugs, turning our production environment into a quantum mechanics experiment we never signed up for.
This isn't a story about abandoning eBPF. It's a cautionary tale about the immense power it wields and the respect it demands. If you're running eBPF in production, or plan to, understanding how observability can change the observed is critical to avoiding the phantom issues that haunted our team for weeks.
The Seductive Promise of eBPF Observability
Like many engineering teams, we were captivated by the power of eBPF (extended Berkeley Packet Filter). The ability to attach small, sandboxed programs directly to kernel hooks like tracepoints, kprobes, and network events felt like a superpower. We could finally answer complex performance questions without cumbersome agents or application-level instrumentation.
Our goals were to:
- Trace application requests across microservices.
- Monitor network latency with granular detail.
- Identify sources of system call overhead.
We invested heavily in building a suite of eBPF-powered observability tools. We deployed probes across our fleet, collecting metrics and traces that gave us unprecedented insight. For a while, it was perfect. We solved bugs faster and gained a deeper understanding of our system's behavior. But then, the strange reports started trickling in.
The Observer Effect: When Monitoring Introduces Bugs
The first signs of trouble were subtle and frustratingly intermittent. A critical service would experience random latency spikes, but only under heavy load. When we tried to drill down with more detailed tracing, the problem would vanish. We were dealing with a classic Heisenbug—a bug that alters or disappears when you try to study it.
Our dashboards would show a clean bill of health, yet our users (and our downstream services) were reporting timeouts. Our on-call engineers were chasing ghosts. We blamed everything from network hardware to garbage collection pauses in the application runtime. The one thing we didn't suspect was the very tool we were using to find the problem: our shiny new eBPF probes. The irony was painful. The eBPF-induced Heisenbugs were a direct result of our attempt to create a more stable system.
A Technical Deep Dive: How Our eBPF Probes Broke Production
After weeks of dead ends, we finally correlated the strange behavior with the deployment of a specific set of eBPF probes. The root cause wasn't a single catastrophic failure but a combination of subtle performance degradations that, under load, cascaded into system-wide instability.
The High Cost of Frequent kprobes
Our most significant mistake was attaching a kprobe to a very high-frequency function in the kernel's networking stack. A kprobe allows you to dynamically break into almost any kernel function, which is incredibly powerful for debugging. However, this power comes at a cost.
We had a probe attached to tcp_sendmsg, a function invoked for virtually every send on a TCP socket. Our eBPF program was simple: it grabbed a few details and updated a map. But even a few hundred nanoseconds of overhead per invocation becomes a massive performance penalty when the function is called millions of times per second.
Consider this simplified bpftrace example, which is conceptually similar to what we deployed:
# WARNING: Attaching to high-frequency functions can impact performance.
bpftrace -e 'kprobe:tcp_sendmsg { @calls[comm] = count(); }'
This one-liner counts calls to tcp_sendmsg by process name. While useful, the overhead of the kprobe mechanism itself, plus the map update operation, introduced enough CPU pressure to create a bottleneck. This is a prime example of how eBPF performance issues can manifest not as a crash, but as a subtle drag on the entire system.
Memory Pressure and Unintended Kernel Lock Contention
The second factor was more insidious. Our eBPF programs used maps to aggregate data in kernel space before sending it to user space. While efficient, these maps consume non-swappable kernel memory. One of our probes had a bug that led to a slow leak in an eBPF map, gradually increasing memory pressure on the kernel.
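Our buggy probe isn't worth reproducing in full, but the leak pattern it fell into is easy to sketch. The hypothetical one-liner below keys a map by tcp_sendmsg's first argument (the socket pointer) and never deletes entries, so as connections come and go the map only grows:
# Hypothetical sketch of the leak pattern: keying a map by a high-cardinality,
# ever-changing value (here the struct sock pointer) with no cleanup path.
bpftrace -e 'kprobe:tcp_sendmsg { @per_sock_calls[arg0] = count(); }'
Every new socket adds a key that is never reclaimed while the trace runs, which is exactly the kind of slow, non-swappable growth described above.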
More critically, the execution of our probe within a sensitive kernel function slightly extended the time a spinlock was held. In a low-traffic environment, this was unnoticeable. But under production load, this tiny delay created massive lock contention. Other CPUs trying to acquire the same lock had to wait, leading to a domino effect of micro-stalls that we perceived as random application latency. Our eBPF observability probes were creating the very bugs we were trying to find.
Mitigating the Observer Effect: Best Practices for Safe eBPF Deployment
Our debugging saga forced us to develop a more disciplined approach to using eBPF. It's a tool that operates at the heart of the OS, and it must be handled with care. Here are the best practices we now follow to prevent observability from causing instability:
- Favor Tracepoints Over kprobes: Whenever possible, use static kernel tracepoints instead of kprobes. Tracepoints are stable, well-defined API points in the kernel designed for tracing. They are significantly more efficient and less likely to break between kernel versions (a sketch follows this list).
- Profile Your Probes: Before deploying a probe to production, profile its CPU and memory overhead. Use tools like perf to measure the cost of your eBPF program, and never assume a probe is "zero-cost" (see the profiling commands after this list).
- Avoid High-Frequency Functions: Be extremely cautious when attaching probes to functions in the critical path of the scheduler, network stack, or memory management subsystems. A tiny overhead here can have an outsized impact.
- Implement a Global Kill Switch: Have a simple, reliable mechanism to disable all custom eBPF probes across your fleet instantly. When you're chasing a production fire, you need to be able to quickly rule out your monitoring tools as the cause.
- Use Bounded Maps: Always use bounded eBPF maps (e.g., arrays and hash maps with a max_entries value) to prevent unbounded kernel memory growth, and set up alerts to monitor map sizes (see the map-cap example after this list).
- Stage Your Rollouts: Deploy new or updated probes gradually. Start in a staging environment, then move to a small canary set of production servers. Monitor key performance indicators (CPU, latency, error rates) at each stage.
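To make the tracepoint recommendation concrete, here is a minimal sketch in the same style as the earlier one-liner. It assumes a kernel that exposes the sock:inet_sock_set_state tracepoint (4.16 and later); the probe fires only on TCP connection state changes, which is orders of magnitude less frequent than a per-send kprobe on tcp_sendmsg.
# Count TCP socket state changes, keyed by the command on-CPU when they fire,
# using a stable tracepoint instead of a kprobe on a hot function.
bpftrace -e 'tracepoint:sock:inet_sock_set_state { @state_changes[comm] = count(); }'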
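For the profiling advice, one low-effort option (assuming a 5.1+ kernel with bpftool installed, run as root) is the kernel's built-in BPF statistics: switch them on briefly, read each program's cumulative run time and run count, then switch them off.
# Enable per-program BPF runtime accounting (this itself adds a small overhead).
sysctl -w kernel.bpf_stats_enabled=1
# With stats enabled, each loaded program reports run_time_ns and run_cnt;
# run_time_ns divided by run_cnt approximates the average cost per invocation.
bpftool prog show
# Disable accounting once you have your numbers.
sysctl -w kernel.bpf_stats_enabled=0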
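On bounded maps: programs loaded with libbpf declare max_entries when the map is defined, and if you prototype with bpftrace, its BPFTRACE_MAP_KEYS_MAX environment variable plays a similar role. A hedged sketch, reusing the hypothetical per-socket map from the leak example above:
# Cap how many keys bpftrace will store per map, so a leaky aggregation
# cannot grow past the cap.
BPFTRACE_MAP_KEYS_MAX=4096 bpftrace -e 'kprobe:tcp_sendmsg { @per_sock_calls[arg0] = count(); }'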
Conclusion: Wielding Power Responsibly
eBPF remains one of the most powerful tools in our observability arsenal. It provides insights we simply cannot get any other way. However, our experience was a stark reminder that there is no such thing as a "pure" observer in a complex system. The act of measurement can, and sometimes will, affect the system being measured.
The key takeaway is not to fear eBPF, but to respect its position within the kernel. By understanding its potential pitfalls, profiling for performance, and deploying with caution, you can harness its incredible power safely. Don't let your observability solution become your next production outage.
Have you ever been bitten by an observability-induced Heisenbug, with eBPF or another tool? We’d love to hear your story in the comments below.

