{"id":836,"date":"2024-01-08T12:29:59","date_gmt":"2024-01-08T12:29:59","guid":{"rendered":"https:\/\/senser.tech\/?p=836"},"modified":"2024-04-30T13:08:33","modified_gmt":"2024-04-30T13:08:33","slug":"why-root-cause-analysis-is-no-longer-optional","status":"publish","type":"post","link":"https:\/\/senser.tech\/why-root-cause-analysis-is-no-longer-optional\/","title":{"rendered":"Why automated root cause analysis is no longer optional"},"content":{"rendered":"\n

In today’s world of complex, interconnected production environments, outages and performance issues have never been costlier. A few minutes of downtime can mean millions in lost revenue and irreparable damage to a company’s reputation.

This environment of hyper-fragility is only getting more severe. Trends like cloud-native architectures, hybrid cloud, and microservices introduce flexibility but also exponentially increase complexity. Microservices, in particular, let each service use its own programming language and tech stack – more independence, but yet another layer of complexity. The typical enterprise now runs on a tangled web of interdependent services and infrastructure that makes debugging failures vastly more difficult.

When an issue arises, traditional approaches to troubleshooting fall painfully short. Teams waste countless hours trying to trace error messages and alerts back to a root cause. Is the problem in the network, the database, the load balancer? The culprit constantly seems to shift as teams play whack-a-mole with symptoms.

#### Manual troubleshooting: a losing battle

Let’s walk through a typical example of how troubleshooting happens today. Say an e-commerce site experiences a surge of 503 errors during peak traffic. The site reliability engineer (SRE) gets paged at 3 a.m.:

**Step 1:** The SRE checks dashboards and sees high memory utilization on the web servers. They restart the web pods and add more instances to handle load (or rely on an autoscaler). But errors persist.
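A minimal sketch of what that first mitigation might look like with the official Kubernetes Python client. The deployment name `web`, the namespace `prod`, and the replica count are hypothetical placeholders, not details from the incident above.

```python
import datetime
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
apps = client.AppsV1Api()

# "Restart the web pods" by bumping the restart annotation on the pod template
# (the same mechanism `kubectl rollout restart` uses), which triggers a rolling restart.
restart_patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "kubectl.kubernetes.io/restartedAt":
                        datetime.datetime.now(datetime.timezone.utc).isoformat()
                }
            }
        }
    }
}
apps.patch_namespaced_deployment(name="web", namespace="prod", body=restart_patch)

# "Add more instances to handle load": scale the deployment up directly.
apps.patch_namespaced_deployment_scale(
    name="web", namespace="prod", body={"spec": {"replicas": 10}}
)
```

Scaling by hand like this is exactly the kind of symptom-level fix the scenario describes: it changes capacity without telling you why memory was high in the first place.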

**Step 2:** Logging shows timeout errors calling the product catalog microservice. The SRE suspects the catalog service can’t scale, so they provision more instances. But errors continue.
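The second round of guesswork could look something like the sketch below: scan the web pods’ recent logs for timeouts against the catalog service, then scale that service up. The label selector `app=web`, the deployment name `catalog`, and the replica count are again hypothetical.

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
apps = client.AppsV1Api()

# Look for timeout errors in the web pods' recent logs that mention the catalog service.
for pod in core.list_namespaced_pod("prod", label_selector="app=web").items:
    logs = core.read_namespaced_pod_log(pod.metadata.name, "prod", tail_lines=200)
    timeouts = [line for line in logs.splitlines()
                if "catalog" in line and "timeout" in line.lower()]
    if timeouts:
        print(pod.metadata.name, timeouts[-1])

# Suspecting the catalog service can't keep up, provision more instances.
apps.patch_namespaced_deployment_scale(
    name="catalog", namespace="prod", body={"spec": {"replicas": 8}}
)
```

Again, the action treats a symptom (timeouts downstream) without confirming the catalog service is actually the bottleneck.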

**Step 3:** Finally, after a war room is called and multiple teams get pulled in, the root cause emerges: the Kubernetes cluster has hit its resource quotas, preventing new pods and services from launching no matter how many replicas are requested. Fixing that eliminates the errors.
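For contrast, here is a rough sketch of how that root cause could have been surfaced directly: compare each ResourceQuota’s usage against its limits and look for pods that never made it past Pending, along with the events that explain why. The `prod` namespace is a hypothetical placeholder.

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# A fully consumed quota blocks any new pods from being created in the namespace.
for quota in core.list_namespaced_resource_quota("prod").items:
    for resource, hard in (quota.status.hard or {}).items():
        used = (quota.status.used or {}).get(resource)
        if used == hard:
            print(f"Quota {quota.metadata.name}: {resource} exhausted ({used}/{hard})")

# Pods that could not be scheduled show up as Pending; pods that could not even be
# created leave FailedCreate events on their owning ReplicaSets.
pending = [p.metadata.name for p in core.list_namespaced_pod("prod").items
           if p.status.phase == "Pending"]
print("Pending pods:", pending)

for event in core.list_namespaced_event("prod").items:
    if event.reason in ("FailedCreate", "FailedScheduling"):
        print(event.involved_object.kind, event.involved_object.name, event.message)
```

The point of the sketch is not the specific checks but that the signal existed all along; it simply lived two layers away from where the symptoms appeared.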

This process took 4+ hours of precious engineering time to find a root cause that spanned multiple layers. And that is just one example – the number of cross-functional dependencies in modern environments makes scenarios like this increasingly common.

#### The costs add up

In addition to the hard costs of downtime during outages and performance issues, poor root cause analysis drives up costs in multiple ways: