{"id":930,"date":"2024-03-12T13:22:32","date_gmt":"2024-03-12T13:22:32","guid":{"rendered":"https:\/\/senser.tech\/?p=930"},"modified":"2024-03-27T13:34:28","modified_gmt":"2024-03-27T13:34:28","slug":"the-path-to-troubleshooting-production-issues-automatically","status":"publish","type":"post","link":"https:\/\/senser.tech\/the-path-to-troubleshooting-production-issues-automatically\/","title":{"rendered":"The path to troubleshooting production issues automatically"},"content":{"rendered":"\n
Applying machine learning to root cause analysis has immense potential, but also presents some core challenges. How can organizations navigate these hurdles to enable instant automated troubleshooting?<\/p>\n\n\n\n
Four foundational pillars pave the way to AI-driven incident resolution. Though these capabilities emerged from bleeding-edge research, they now have clear pathways to production. Let’s examine the critical components powering the next generation of AIOps:<\/p>\n\n\n\n
1. Pervasive, non-intrusive data collection<\/strong><\/p>\n\n\n\n The fuel for effective ML pipelines is data \u2013 and lots of it. Systems must ingest metrics, logs, and traces at massive breadth and depth to spot emerging issues.<\/p>\n\n\n\n But traditional data collection strategies create bottlenecks. Observability systems built in silos segment telemetry, and manual code instrumentation captures only snapshots of that fragmented data. These approaches fail to adapt to rapidly changing infrastructure and deliver incomplete, disjointed telemetry.<\/p>\n\n\n\n The solution lies in non-intrusive host instrumentation techniques like extended Berkeley Packet Filter (eBPF). Originating in network packet inspection, eBPF has evolved into a ubiquitous observability primitive that exposes kernel, system, and user-space activity without containers or hosts noticing.<\/p>\n\n\n\n With eBPF, data pipelines consume rich OS, network, application, and hardware signals without code changes, performance overhead, or manually configured collection rules. So long as observability is intelligently constructed around it, eBPF enables ML pipelines to succeed: machine learning models thrive on this high-fidelity telemetry spanning every layer of the environment.<\/p>\n\n\n\n 2. Automated service topology mapping<\/strong><\/p>\n\n\n\n Even with immense data ingestion, models still require a topology \u2013 a wiring diagram \u2013 detailing how thousands of microservices, datastores, load balancers, and infrastructure components interrelate at runtime.<\/p>\n\n\n\n Manually generating this system graph demands immense effort (diverting engineers from innovation), and the result quickly becomes inaccurate as environments dynamically scale.<\/p>\n\n\n\n Instead, next-gen solutions automatically discover services through traffic inspection, define communication patterns, and serialize dependency chains into an evolving real-time topology. 
Integrating this core knowledge representation into ML pipelines provides the contextual awareness that enables complex causality analysis.<\/p>\n\n\n\n 3. Specialized causality analysis algorithms<\/strong><\/p>\n\n\n\n Given a rich data lake and a system topology, machines can now infer root causes. But production troubleshooting involves unique complexities. Generic correlation algorithms cannot feasibly search such enormous state spaces; they lack heuristics for where anomalous yet business-critical degradations may emerge.<\/p>\n\n\n\n The industry has made enormous strides<\/a> in developing new specialized graph analysis techniques. Tools that analyze anomalies across all types of data \u2013 such as Kubernetes data, metrics, and traces \u2013 can identify the issues most likely to be the root cause behind a service degradation. Effectively, these tools recreate the \u201cintuition\u201d of human engineers: the experience that helps site reliability teams home in on the right spot in an enormously complex production environment.<\/p>\n\n\n\n Industry progress toward efficient, robust, and scalable algorithms has built a strong foundation for specialized, proprietary ML models \u2013 ones that can deliver true causality inference at scale. Although off-the-shelf algorithms alone can\u2019t achieve true automated root cause analysis (more on the limitations of existing AIOps solutions in our blog post<\/a>), they serve as enablers for development.<\/p>\n\n\n\n Next-gen ML models will achieve lightning-fast root-cause analysis, navigating from alert to origin at remarkable speed. But even after these bespoke developments emerge, one final challenge remains:<\/p>\n\n\n\n 4. Actionable reporting<\/strong><\/p>\n\n\n\n Machine learning can pinpoint the root cause \u2013 but how do we make probabilistic model scores interpretable for actual remediation? 
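As a toy illustration of how that translation can work, here is a minimal Python sketch that joins probabilistic anomaly scores with a discovered topology to produce a ranked shortlist of likely root causes. Every service name, score, and weighting here is hypothetical; this is a sketch of the general idea, not any vendor's actual algorithm.

```python
# Toy "blast radius" scoring over a service topology.
# All service names, scores, and the weighting scheme are hypothetical.
from collections import defaultdict


def rank_root_causes(topology, anomalies):
    """Rank anomalous services by how much anomalous traffic depends on them.

    topology:  dict mapping each service to the services it calls
    anomalies: dict mapping each service to an anomaly score in [0, 1]
    Returns a list of (service, score) pairs, most suspicious first.
    """
    # Invert the call graph: for each service, who calls (depends on) it?
    dependents = defaultdict(set)
    for caller, callees in topology.items():
        for callee in callees:
            dependents[callee].add(caller)

    scores = {}
    for svc, own_score in anomalies.items():
        if own_score == 0:
            continue
        # Walk upstream through transitive dependents: a failing shared
        # dependency drags down everything that calls it, so its "blast"
        # accumulates the anomaly scores of all affected callers.
        seen, stack, blast = set(), list(dependents[svc]), 0.0
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            blast += anomalies.get(node, 0)
            stack.extend(dependents[node])
        scores[svc] = own_score * (1 + blast)

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Three services, all anomalous: checkout calls payments and the database,
# and payments also calls the database.
ranked = rank_root_causes(
    topology={"checkout": ["payments", "db"], "payments": ["db"], "db": []},
    anomalies={"checkout": 0.9, "payments": 0.8, "db": 0.7},
)
```

In this sketch the database ranks first even though its raw anomaly score is the lowest, because both of the other anomalous services transitively depend on it: a crude, code-level version of the SRE "intuition" described above, and the kind of ranked output a report can be built from.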
Alert fatigue already overwhelms teams; they need clear guidance.<\/p>\n\n\n\n True troubleshooting automation requires interfaces that translate algorithmic insights into engineering understanding. Investigation should center on natural language, allowing teams to query failures, understand impact sequencing, and receive mitigation suggestions.<\/p>\n\n\n\n This final mile remains an open challenge. But solutions are emerging, such as topological heuristics that quantify blast radius and can be visualized as hierarchical incident graphs. Integrating these capabilities gives engineers response blueprints.<\/p>\n\n\n\nHow AIOps platforms can help<\/h2>\n\n\n\n