{"id":930,"date":"2024-03-12T13:22:32","date_gmt":"2024-03-12T13:22:32","guid":{"rendered":"https:\/\/senser.tech\/?p=930"},"modified":"2024-03-27T13:34:28","modified_gmt":"2024-03-27T13:34:28","slug":"the-path-to-troubleshooting-production-issues-automatically","status":"publish","type":"post","link":"https:\/\/senser.tech\/the-path-to-troubleshooting-production-issues-automatically\/","title":{"rendered":"The path to troubleshooting production issues automatically"},"content":{"rendered":"\n<p>Applying machine learning to root cause analysis has immense potential, but also presents some core challenges. How can organizations navigate these hurdles to enable instant automated troubleshooting?<\/p>\n\n\n\n<p>Four foundational pillars can pave the way to AI-driven incident resolution. While emerging from bleeding edge research, these capabilities now have clear pathways to productionization. Let&#8217;s examine the critical components powering the next generation of AIOps:<\/p>\n\n\n\n<p><strong>1. Pervasive, non-intrusive data collection<\/strong><\/p>\n\n\n\n<p>The fuel for effective ML pipelines is data \u2013 and lots of it. Systems must ingest metrics, logs, and traces at massive breadth and depth to spot emerging issues.<\/p>\n\n\n\n<p>But traditional data collection strategies create bottlenecks. Observability systems building-in silos, segmenting telemetry, and instrumenting code only provide snapshots of this siloed data. They fail to adapt to rapidly changing infrastructure and deliver incomplete, disjointed telemetry.<\/p>\n\n\n\n<p>The solution lies in non-intrusive host instrumentation techniques like extended Berkeley Packet Filter (eBPF). Originating from network packet inspection, eBPF has evolved into a ubiquitous observability primitive exposing kernel, system, and user space activity without containers or hosts noticing.<\/p>\n\n\n\n<p>With eBPF, data pipelines consume rich OS, network, application, and hardware signals without any code changes, performance overhead, or need to configure collection rules. eBPF is a tool that enables the success of ML pipelines (so long as observability is intelligently constructed around it). Machine learning models thrive on this high-fidelity telemetry spanning all environment layers.<\/p>\n\n\n\n<p><strong>2. Automated service topology mapping<\/strong><\/p>\n\n\n\n<p>Even with immense data ingestion, models still require a topology \u2013 a wiring diagram \u2013 detailing how thousands of microservices, datastores, load balancers and infrastructure components interrelate at runtime.<\/p>\n\n\n\n<p>Manually generating this system graph introduces immense work (diverting engineers from innovation) while still being inaccurate as environments dynamically scale.<\/p>\n\n\n\n<p>Instead, next-gen solutions automatically discover services through traffic inspection, define communication patterns, and serialize dependency chains into an evolving real-time topology. Integrating this core knowledge representation into ML pipelines provides contextual awareness enabling complex causality analysis.<\/p>\n\n\n\n<p><strong>3. Specialized causality analysis algorithms<\/strong><\/p>\n\n\n\n<p>Given a rich data lake and system topology, machines can now infer root causes. But production troubleshooting involves unique complexities. Generic correlation algorithms cannot reasonably search enormous state spaces, lacking heuristics for where anomalous yet business-critical degradations may emerge.<\/p>\n\n\n\n<p>The industry <a href=\"https:\/\/deepmind.google\/discover\/blog\/2023-a-year-of-groundbreaking-advances-in-ai-and-computing\/\" target=\"_blank\" rel=\"noreferrer noopener\">has made enormous strides<\/a> in scoping out new specialized graphs analysis techniques. Tools that can analyze anomalies across all types of data \u2013 such as Kubernetes data, metrics, and traces \u2013 identify issues that potentially have a high probability of being the root cause behind service degradation. Effectively, these tools remake the \u201cintuition\u201d of human engineers; the experience that helps site reliability teams hone in on the right spot in an enormously complex production environment.<\/p>\n\n\n\n<p>Industry progress toward efficient, robust, and scalable algorithms has built a strong foundation for specialized, proprietary ML models \u2013 ones that can guarantee true causality inference at scale. Although we can\u2019t use these off-the-shelf algorithms to achieve true automated root cause analysis (more on the limitations of existing AIOps solutions in our <a href=\"https:\/\/senser.tech\/why-existing-approaches-to-root-cause-analysis-fall-short\/\" target=\"_blank\" rel=\"noreferrer noopener\">blog post<\/a>), they serve as enablers for development.<\/p>\n\n\n\n<p>Next-gen ML models will be able to achieve lightning-fast root-cause analysis; navigating from alert to origin at remarkable speed. But even after these bespoke developments emerge, one final challenge remains:<\/p>\n\n\n\n<p><strong>4. Actionable reporting<\/strong><\/p>\n\n\n\n<p>Machine learning can pinpoint root cause \u2013 but how do we make probabilistic model scores interpretable for actual remediation? Alert fatigue already overwhelms teams; they need clear guidance.<\/p>\n\n\n\n<p>True troubleshooting automation requires interfaces that translate algorithmic insights into engineering understanding. Investigation should center around natural language, allowing teams to query failures, understand impact sequencing, and receive mitigation suggestions.<\/p>\n\n\n\n<p>This final mile remains an open challenge. But solutions are emerging, such as topological heuristics for quantifying blast radius which can be visualized as hierarchical incident graphs. Integrating these capabilities provides response blueprints.<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-text-color has-link-color wp-elements-06e7b27d7a8f57e75b2085657ca909bd\" style=\"color:#197dff\">How AIOps platforms can help<\/h2>\n\n\n\n<p>AIOps is a category of technology designed to use AI and other advanced technology to optimize IT operations \u2013 particularly around event correlation, anomaly detection, and causality determination. <a href=\"https:\/\/senser.tech\/infrastructure-and-application-observability\/\" target=\"_blank\" rel=\"noreferrer noopener\">Senser<\/a>, for example, realizes this vision of AI-driven troubleshooting through:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">eBPF for complete, lightweight data collection<\/h3>\n\n\n\n<p>We leverage extended Berkeley Packet Filter (eBPF) for high-fidelity, low-overhead data collection across your entire software stack \u2013 infrastructure, Kubernetes, applications, and network. No more blindspots or vital data missing. eBPF enables comprehensive observability without code changes or performance hits.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Automatic service discovery and dependency mapping<\/strong><\/h3>\n\n\n\n<p>Senser automatically constructs a topology of your environment \u2013 no configuration or dashboarding required. And the service map updates in real time as your environment evolves, so it\u2019s never inaccurate or incomplete.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Purpose-built ML to pinpoint root cause<\/strong>&nbsp;<\/h3>\n\n\n\n<p>Our topology-aware ML models go beyond simply spotting anomalies to trace their origin across layers. Unlike other approaches, our models start with <em>user and business flows<\/em> rather than individual systems and data sources \u2013 enabling you to capture complex cross-layer dependencies and root causes.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Interpretable insights, effective remediation<\/strong><\/h3>\n\n\n\n<p>We translate model findings into understandable insights \u2013 including the full impact chain and the business impact of any service issue. A natural language interface enables users from different backgrounds to easily double-click on issues and access relevant insights to drive effective remediation.&nbsp;<\/p>\n\n\n\n<p>Ultimately, the right AIOps platform can enable teams to finally capture the benefits of automated root cause analysis at scale.&nbsp;<\/p>\n\n\n\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained\">\n<h4 class=\"wp-block-heading\">The bottom line<\/h4>\n\n\n\n<p>Automatically identifying the root cause of complex production issues requires advances in data collection, service mapping, ML, and the communication of technical insights with the goal of remediation.&nbsp;<\/p>\n\n\n\n<p>The right AIOps platform can help teams overcome the biggest hurdles on the path to automated root cause analysis \u2013 helping teams accelerate MTTD and MTTR, improve SLA performance, and spend dev time on innovation rather than firefighting.&nbsp;<\/p>\n<\/div><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Explore the 4 critical components powering the next generation of AIOps.<\/p>\n","protected":false},"author":4,"featured_media":931,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v22.1 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>The path to troubleshooting production issues automatically -<\/title>\n<meta name=\"description\" content=\"Automatic root cause analysis requires advances in data collection, service mapping, ML, and actionable reporting.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/senser.tech\/the-path-to-troubleshooting-production-issues-automatically\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"The path to troubleshooting production issues automatically -\" \/>\n<meta property=\"og:description\" content=\"Automatic root cause analysis requires advances in data collection, service mapping, ML, and actionable reporting.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/senser.tech\/the-path-to-troubleshooting-production-issues-automatically\/\" \/>\n<meta property=\"article:published_time\" content=\"2024-03-12T13:22:32+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-03-27T13:34:28+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/senser.tech\/wp-content\/uploads\/2024\/03\/The-path-to-troubleshooting-issues.png\" \/>\n\t<meta property=\"og:image:width\" content=\"800\" \/>\n\t<meta property=\"og:image:height\" content=\"400\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Yuval Lev\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Yuval Lev\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/senser.tech\/the-path-to-troubleshooting-production-issues-automatically\/\",\"url\":\"https:\/\/senser.tech\/the-path-to-troubleshooting-production-issues-automatically\/\",\"name\":\"The path to troubleshooting production issues automatically -\",\"isPartOf\":{\"@id\":\"https:\/\/senser.tech\/#website\"},\"datePublished\":\"2024-03-12T13:22:32+00:00\",\"dateModified\":\"2024-03-27T13:34:28+00:00\",\"author\":{\"@id\":\"https:\/\/senser.tech\/#\/schema\/person\/437d82e941487b35320df36aaa3ae6ab\"},\"description\":\"Automatic root cause analysis requires advances in data collection, service mapping, ML, and actionable reporting.\",\"breadcrumb\":{\"@id\":\"https:\/\/senser.tech\/the-path-to-troubleshooting-production-issues-automatically\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/senser.tech\/the-path-to-troubleshooting-production-issues-automatically\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/senser.tech\/the-path-to-troubleshooting-production-issues-automatically\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/senser.tech\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"The path to troubleshooting production issues automatically\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/senser.tech\/#website\",\"url\":\"https:\/\/senser.tech\/\",\"name\":\"\",\"description\":\"\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/senser.tech\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/senser.tech\/#\/schema\/person\/437d82e941487b35320df36aaa3ae6ab\",\"name\":\"Yuval Lev\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/senser.tech\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/a985ba845187eff5799c272e9a0792f0?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/a985ba845187eff5799c272e9a0792f0?s=96&d=mm&r=g\",\"caption\":\"Yuval Lev\"},\"url\":\"https:\/\/senser.tech\/author\/yuvalsenser-tech\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"The path to troubleshooting production issues automatically -","description":"Automatic root cause analysis requires advances in data collection, service mapping, ML, and actionable reporting.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/senser.tech\/the-path-to-troubleshooting-production-issues-automatically\/","og_locale":"en_US","og_type":"article","og_title":"The path to troubleshooting production issues automatically -","og_description":"Automatic root cause analysis requires advances in data collection, service mapping, ML, and actionable reporting.","og_url":"https:\/\/senser.tech\/the-path-to-troubleshooting-production-issues-automatically\/","article_published_time":"2024-03-12T13:22:32+00:00","article_modified_time":"2024-03-27T13:34:28+00:00","og_image":[{"width":800,"height":400,"url":"https:\/\/senser.tech\/wp-content\/uploads\/2024\/03\/The-path-to-troubleshooting-issues.png","type":"image\/png"}],"author":"Yuval Lev","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Yuval Lev","Est. reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/senser.tech\/the-path-to-troubleshooting-production-issues-automatically\/","url":"https:\/\/senser.tech\/the-path-to-troubleshooting-production-issues-automatically\/","name":"The path to troubleshooting production issues automatically -","isPartOf":{"@id":"https:\/\/senser.tech\/#website"},"datePublished":"2024-03-12T13:22:32+00:00","dateModified":"2024-03-27T13:34:28+00:00","author":{"@id":"https:\/\/senser.tech\/#\/schema\/person\/437d82e941487b35320df36aaa3ae6ab"},"description":"Automatic root cause analysis requires advances in data collection, service mapping, ML, and actionable reporting.","breadcrumb":{"@id":"https:\/\/senser.tech\/the-path-to-troubleshooting-production-issues-automatically\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/senser.tech\/the-path-to-troubleshooting-production-issues-automatically\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/senser.tech\/the-path-to-troubleshooting-production-issues-automatically\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/senser.tech\/"},{"@type":"ListItem","position":2,"name":"The path to troubleshooting production issues automatically"}]},{"@type":"WebSite","@id":"https:\/\/senser.tech\/#website","url":"https:\/\/senser.tech\/","name":"","description":"","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/senser.tech\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/senser.tech\/#\/schema\/person\/437d82e941487b35320df36aaa3ae6ab","name":"Yuval Lev","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/senser.tech\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/a985ba845187eff5799c272e9a0792f0?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/a985ba845187eff5799c272e9a0792f0?s=96&d=mm&r=g","caption":"Yuval Lev"},"url":"https:\/\/senser.tech\/author\/yuvalsenser-tech\/"}]}},"_links":{"self":[{"href":"https:\/\/senser.tech\/wp-json\/wp\/v2\/posts\/930"}],"collection":[{"href":"https:\/\/senser.tech\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/senser.tech\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/senser.tech\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/senser.tech\/wp-json\/wp\/v2\/comments?post=930"}],"version-history":[{"count":3,"href":"https:\/\/senser.tech\/wp-json\/wp\/v2\/posts\/930\/revisions"}],"predecessor-version":[{"id":951,"href":"https:\/\/senser.tech\/wp-json\/wp\/v2\/posts\/930\/revisions\/951"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/senser.tech\/wp-json\/wp\/v2\/media\/931"}],"wp:attachment":[{"href":"https:\/\/senser.tech\/wp-json\/wp\/v2\/media?parent=930"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/senser.tech\/wp-json\/wp\/v2\/categories?post=930"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/senser.tech\/wp-json\/wp\/v2\/tags?post=930"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}