The path to troubleshooting production issues automatically

Applying machine learning to root cause analysis has immense potential, but also presents some core challenges. How can organizations navigate these hurdles to enable instant automated troubleshooting?

Four foundational pillars can pave the way to AI-driven incident resolution. Although they emerged from bleeding-edge research, these capabilities now have clear pathways to productionization. Let’s examine the critical components powering the next generation of AIOps:

1. Pervasive, non-intrusive data collection

The fuel for effective ML pipelines is data – and lots of it. Systems must ingest metrics, logs, and traces at massive breadth and depth to spot emerging issues.

But traditional data collection strategies create bottlenecks. Observability systems that are built in silos, segment telemetry, and rely on instrumenting code provide only snapshots of already-siloed data. They fail to adapt to rapidly changing infrastructure and deliver incomplete, disjointed telemetry.

The solution lies in non-intrusive host instrumentation techniques like extended Berkeley Packet Filter (eBPF). Originating from network packet inspection, eBPF has evolved into a ubiquitous observability primitive exposing kernel, system, and user space activity without containers or hosts noticing.

With eBPF, data pipelines consume rich OS, network, application, and hardware signals without any code changes, with minimal performance overhead, and with no need to configure collection rules. This makes eBPF a key enabler for ML pipelines (so long as observability is intelligently constructed around it). Machine learning models thrive on this high-fidelity telemetry spanning all environment layers.

2. Automated service topology mapping

Even with immense data ingestion, models still require a topology – a wiring diagram – detailing how thousands of microservices, datastores, load balancers and infrastructure components interrelate at runtime.

Manually generating this system graph takes immense work (diverting engineers from innovation), and the result still becomes inaccurate as environments dynamically scale.

Instead, next-gen solutions automatically discover services through traffic inspection, define communication patterns, and serialize dependency chains into an evolving real-time topology. Integrating this core knowledge representation into ML pipelines provides contextual awareness enabling complex causality analysis.
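To make the idea concrete, here is a minimal sketch of how observed traffic could be folded into a dependency graph. It is an illustration under stated assumptions, not any vendor's implementation: the flow records, the service names, and the build_topology helper are all hypothetical, and a real system would derive flows from inspected traffic rather than a hard-coded list.

```python
from collections import defaultdict

# Hypothetical flow records (caller, callee), standing in for what a
# traffic-inspection layer might emit. Service names are illustrative.
observed_flows = [
    ("frontend", "checkout"),
    ("checkout", "payments"),
    ("checkout", "inventory-db"),
    ("frontend", "checkout"),
    ("payments", "payments-db"),
]


def build_topology(flows):
    """Fold flow records into a graph: caller -> {callee: observed call count}."""
    graph = defaultdict(lambda: defaultdict(int))
    for caller, callee in flows:
        graph[caller][callee] += 1
    # Convert to plain dicts so the result is easy to inspect and serialize.
    return {caller: dict(callees) for caller, callees in graph.items()}


topology = build_topology(observed_flows)
for caller, callees in topology.items():
    print(caller, "->", callees)
```

Re-running the fold over a sliding window of recent flows is one simple way such a map could stay current as services scale up, down, or disappear.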

3. Specialized causality analysis algorithms

Given a rich data lake and system topology, machines can now infer root causes. But production troubleshooting involves unique complexities. Generic correlation algorithms cannot reasonably search enormous state spaces, because they lack heuristics for where anomalous yet business-critical degradations may emerge.

The industry has made enormous strides in scoping out new specialized graph analysis techniques. Tools that can analyze anomalies across all types of data – such as Kubernetes data, metrics, and traces – identify the issues most likely to be the root cause behind service degradation. Effectively, these tools replicate the “intuition” of human engineers: the experience that helps site reliability teams home in on the right spot in an enormously complex production environment.
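One simple topology-aware heuristic, shown here purely as a sketch, is to start at the alerting service and follow the chain of anomalous dependencies until it ends. The services where the chain stops are root cause candidates. Everything below is assumed for illustration: the topology, the set of anomalous services, and the root_cause_candidates helper are hypothetical, and production-grade models would weigh far more signals than a yes/no anomaly flag.

```python
# Hypothetical dependency graph: caller -> list of services it depends on.
topology = {
    "frontend": ["checkout"],
    "checkout": ["payments", "inventory-db"],
    "payments": ["payments-db"],
}

# Hypothetical output of an anomaly detector: services currently misbehaving.
anomalous = {"frontend", "checkout", "payments", "payments-db"}


def root_cause_candidates(topology, anomalous, alerting_service):
    """Return anomalous services reachable from the alert through anomalous
    dependencies only, and whose own dependencies all look healthy."""
    candidates = []
    seen = set()
    stack = [alerting_service]
    while stack:
        service = stack.pop()
        if service in seen:
            continue
        seen.add(service)
        if service not in anomalous:
            continue  # a healthy service ends this branch of the search
        anomalous_deps = [d for d in topology.get(service, ()) if d in anomalous]
        if anomalous_deps:
            stack.extend(anomalous_deps)  # the problem likely lies deeper
        else:
            candidates.append(service)  # the anomaly chain ends here
    return sorted(candidates)


candidates = root_cause_candidates(topology, anomalous, "frontend")
print(candidates)  # ['payments-db']
```

The point of the sketch is the pruning: the search only follows anomalous edges of the topology, so it never has to consider the healthy inventory-db branch at all.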

Industry progress toward efficient, robust, and scalable algorithms has built a strong foundation for specialized, proprietary ML models – ones that can deliver true causality inference at scale. Although these algorithms cannot achieve true automated root cause analysis off the shelf (more on the limitations of existing AIOps solutions in our blog post), they serve as enablers for development.

Next-gen ML models will be able to achieve lightning-fast root cause analysis, navigating from alert to origin at remarkable speed. But even after these bespoke developments emerge, one final challenge remains:

4. Actionable reporting

Machine learning can pinpoint root cause – but how do we make probabilistic model scores interpretable for actual remediation? Alert fatigue already overwhelms teams; they need clear guidance.

True troubleshooting automation requires interfaces that translate algorithmic insights into engineering understanding. Investigation should center around natural language, allowing teams to query failures, understand impact sequencing, and receive mitigation suggestions.

This final mile remains an open challenge. But solutions are emerging, such as topological heuristics for quantifying blast radius, the results of which can be visualized as hierarchical incident graphs. Integrating these capabilities provides response blueprints.
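As a rough illustration of a topological blast-radius heuristic, the sketch below inverts the dependency graph and walks outward from a suspected root cause, recording how many hops away each affected service sits. The topology, the service names, and the blast_radius helper are hypothetical, and hop count is only one of many possible ways to rank impact.

```python
from collections import deque

# Hypothetical dependency graph: caller -> list of services it depends on.
topology = {
    "frontend": ["checkout"],
    "checkout": ["payments", "inventory-db"],
    "payments": ["payments-db"],
}


def blast_radius(topology, root_cause):
    """Map each service that transitively depends on root_cause to its
    distance in hops, using a breadth-first walk over inverted edges."""
    # Invert the graph: dependency -> set of services that call it.
    dependents = {}
    for caller, callees in topology.items():
        for callee in callees:
            dependents.setdefault(callee, set()).add(caller)

    hops = {}
    visited = {root_cause}
    queue = deque([(root_cause, 0)])
    while queue:
        service, distance = queue.popleft()
        for upstream in sorted(dependents.get(service, ())):
            if upstream not in visited:
                visited.add(upstream)
                hops[upstream] = distance + 1
                queue.append((upstream, distance + 1))
    return hops


impact = blast_radius(topology, "payments-db")
print(impact)  # {'payments': 1, 'checkout': 2, 'frontend': 3}
```

Grouping the result by hop count gives the layers of a hierarchical incident graph: direct dependents first, then the services that depend on them, and so on up to user-facing entry points.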

AIOps is a category of technology that applies AI and other advanced techniques to optimize IT operations – particularly around event correlation, anomaly detection, and causality determination. Senser, for example, realizes this vision of AI-driven troubleshooting through:

eBPF for complete, lightweight data collection

We leverage extended Berkeley Packet Filter (eBPF) for high-fidelity, low-overhead data collection across your entire software stack – infrastructure, Kubernetes, applications, and network. No more blind spots or missing vital data. eBPF enables comprehensive observability without code changes or noticeable performance hits.

Automatic service discovery and dependency mapping

Senser automatically constructs a topology of your environment – no configuration or dashboarding required. And the service map updates in real time as your environment evolves, so it’s never inaccurate or incomplete. 

Purpose-built ML to pinpoint root cause 

Our topology-aware ML models go beyond simply spotting anomalies to trace their origin across layers. Unlike other approaches, our models start with user and business flows rather than individual systems and data sources – enabling you to capture complex cross-layer dependencies and root causes. 

Interpretable insights, effective remediation

We translate model findings into understandable insights – including the full impact chain and the business impact of any service issue. A natural language interface enables users from different backgrounds to easily drill into issues and access relevant insights to drive effective remediation.

Ultimately, the right AIOps platform can enable teams to finally capture the benefits of automated root cause analysis at scale. 

The bottom line

Automatically identifying the root cause of complex production issues requires advances in data collection, service mapping, ML, and the communication of technical insights with the goal of remediation. 

The right AIOps platform can help teams overcome the biggest hurdles on the path to automated root cause analysis – reducing MTTD and MTTR, improving SLA performance, and freeing dev time for innovation rather than firefighting.