There’s no better way to check the pulse of the observability problem space than by attending industry events and speaking directly with the SREs, IT professionals, and architects who face these challenges every day.
That’s why this conference season, we attended all of the top events—KCD UK, SRECon EMEA, KubeCon US, and (soon) AWS re:Invent. As the year winds down, we’re sharing the most common and pressing challenges dominating conversations as we head into 2025.
Drawing on candid discussions at the Senser booths, insights from keynotes, and casual lunch chats, here’s some of what our team overheard.
#1: “We still have too many blind spots from scaling and manual instrumentation.”
Blind spots remain a nagging frustration for SREs. Monitoring, detection, and resolution are still too cumbersome when application and infrastructure components like databases, microservices, and distributed workloads operate in silos. Each component often comes with its own monitoring tool or telemetry source, and their incompatibilities create visibility gaps. These gaps allow small, under-the-radar issues to escalate into major incidents requiring immediate remediation. Worse, the same complexities that let minor issues fester also turn root cause analysis into a stress-inducing riddle.
Scaling compounds these problems. SREs frequently rely on custom integrations and manual instrumentation to piece together data across environments. While effective in smaller systems, this fragmented, overly bespoke approach becomes unmanageable as systems grow. Maintaining these integrations imposes a resource-heavy burden, slowing incident response and diverting attention from other priorities.
#2: “It’s hard to tell if we’re even monitoring the right metrics.”
Collecting data is easy; skillfully sifting through it and prioritizing the right metrics isn’t. SREs need to see and act on the metrics that matter most, in real time, and with enough context for decision-making. This is especially challenging in distributed systems, where the volume of telemetry data can be overwhelming.
In Kubernetes environments, metrics like pod memory usage, CPU utilization, and network latency are fundamental. But to fully understand system health, teams must also track data from other internal and external sources—databases, APIs, and legacy systems.
SREs end up monitoring these metrics across separate tools, making it harder to prioritize relevant data in real time. In critical situations, such as incident response, manually piecing together incomplete information from Kubernetes and external systems introduces delays that can impact service reliability and response time.
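To make that juggling act concrete, here’s a minimal sketch of pulling just the “fundamental” in-cluster metrics via the Prometheus HTTP API (assuming a Prometheus server at http://prometheus:9090 scraping kubelet/cAdvisor metrics; the `checkout` namespace is hypothetical). Everything outside the cluster, from databases to legacy systems, would still need its own queries and tooling.

```python
import requests

# Assumed Prometheus endpoint; adjust for your environment.
PROM_URL = "http://prometheus:9090/api/v1/query"

# PromQL for the "fundamental" Kubernetes metrics discussed above
# (standard cAdvisor metric names; the "checkout" namespace is hypothetical).
QUERIES = {
    "pod_cpu_cores": 'sum(rate(container_cpu_usage_seconds_total{namespace="checkout"}[5m])) by (pod)',
    "pod_memory_bytes": 'sum(container_memory_working_set_bytes{namespace="checkout"}) by (pod)',
}

def fetch(promql: str):
    """Run one instant query against the Prometheus HTTP API."""
    resp = requests.get(PROM_URL, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    for name, promql in QUERIES.items():
        for series in fetch(promql):
            pod = series["metric"].get("pod", "<unknown>")
            _, value = series["value"]  # instant vector: [timestamp, value-as-string]
            print(f"{name} {pod}: {float(value):.2f}")
```

Note that network latency, database health, and anything running outside the cluster would still arrive from different systems with different query languages, which is exactly the fragmentation SREs describe.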
#3: “Alert fatigue is real, and it’s caused by a daily flood of notifications lacking proper context and prioritization.”
Alert fatigue isn’t just a buzzword; it’s a pervasive problem. SREs continue to face a daily barrage of alerts pinging in from different monitoring tools with varying levels of severity and relevance. Without a clear hierarchy or context, each alert demands attention, leading to unnecessary disruptions, missed priorities, and a constant scramble to determine which issues require immediate action and which can be put on the back burner or ignored altogether. Worse, alerts often come in faster than they can be investigated, forcing risky, best-guess prioritization.
In many cases, alerts lack the contextual information that would allow SREs to adequately understand their impact or urgency. For instance, an alert indicating high memory usage on a single pod might sound urgent, but without additional context—like whether this pod is part of a critical service or merely a low-priority background task—the alert may lead to unnecessary escalation. Similarly, repetitive or duplicate alerts for transient issues can divert focus from more critical incidents, overwhelming teams and reducing their ability to respond to genuine threats.
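As a rough illustration of what “context” means here, the sketch below (entirely hypothetical, not any particular tool’s API) enriches raw alerts with service criticality and collapses duplicates before ranking them, so a memory warning on a critical checkout pod outranks the same warning on a background batch job.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical service catalog mapping each service to a criticality tier.
SERVICE_CRITICALITY = {"checkout": 3, "payments": 3, "batch-reports": 1}
SEVERITY_WEIGHT = {"critical": 3, "warning": 2, "info": 1}

@dataclass
class Alert:
    name: str      # e.g. "HighMemoryUsage"
    severity: str  # "critical" | "warning" | "info"
    service: str   # owning service, resolved from pod labels
    pod: str

def prioritize(alerts):
    """Collapse duplicate alerts per service, then rank by severity x criticality."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert.name, alert.service)].append(alert)

    ranked = []
    for (_, service), group in groups.items():
        first = group[0]
        score = SEVERITY_WEIGHT[first.severity] * SERVICE_CRITICALITY.get(service, 1)
        ranked.append((score, len(group), first))
    return sorted(ranked, key=lambda item: item[0], reverse=True)

alerts = [
    Alert("HighMemoryUsage", "warning", "batch-reports", "batch-reports-7f9c"),
    Alert("HighMemoryUsage", "warning", "checkout", "checkout-5d4b"),
    Alert("HighMemoryUsage", "warning", "checkout", "checkout-9a1e"),
]
for score, count, alert in prioritize(alerts):
    print(f"score={score} (x{count}) {alert.name} on {alert.service}")
```

The same warning-level memory alert scores several times higher on checkout than on a low-priority batch job, which is exactly the triage SREs otherwise perform by hand under pressure.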
The core issue is not just the volume of alerts but the lack of actionable insight into what each alert means, what’s causing it, and how it relates to other alerts and other components in the system.
Future-Proofing Observability with Visibility, Identification, and Analysis
The ideal observability solution addresses these key challenges—blind spots, metric overload, and alert fatigue—by providing intelligent, automated coverage and insights across complex environments.
To tackle blind spots, the solution should deliver instantaneous, comprehensive coverage of the environment without extensive manual configuration or custom integrations. It would gather and correlate data from all critical sources—whether infrastructure, application layers, or third-party services—without requiring SREs to manually stitch together data streams (hint: eBPF can help). By automating this data collection and ensuring components are consistently and accurately observed within the context of their relationships, blind spots can be minimized, allowing SREs to respond to issues before they escalate.
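As a tiny illustration of why eBPF helps, the sketch below (assuming a Linux host with the bcc toolkit installed and root privileges) attaches a kernel probe to `tcp_v4_connect` and logs which processes open outbound TCP connections, with zero changes to the applications themselves. Production-grade eBPF observability goes much further, but the principle is the same: the data comes from the kernel rather than from hand-wired instrumentation.

```python
from bcc import BPF  # bcc toolkit: https://github.com/iovisor/bcc

# Kernel-side program: fires on every IPv4 TCP connect attempt.
bpf_source = r"""
int trace_connect(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    bpf_trace_printk("outbound tcp connect, pid=%d\n", pid);
    return 0;
}
"""

b = BPF(text=bpf_source)
b.attach_kprobe(event="tcp_v4_connect", fn_name="trace_connect")

print("Tracing outbound TCP connects... Ctrl-C to stop")
b.trace_print()  # streams bpf_trace_printk output from the kernel trace pipe
```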
To help monitor the right metrics and prevent metric overload, an advanced observability tool would dynamically identify and prioritize key metrics based on system behavior and SRE needs, presenting actionable data in a single pane of glass. By continuously adjusting to reflect the most relevant data streams and contextual information, this solution eliminates the need for excessive searching or manual configuration, allowing SREs to make informed, rapid decisions. This enables a shift from reactive monitoring to proactive management, helping teams respond more effectively to potential issues as they arise.
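To ground the idea (as a toy sketch only, not a description of how any particular product, Senser’s included, does this), one simple way to surface “the metrics that matter” is to score each stream by how far its latest readings drift from their own recent baseline:

```python
from statistics import mean, stdev

def deviation_score(history: list, latest: float) -> float:
    """How unusual is the latest reading relative to this metric's own baseline?"""
    if len(history) < 2:
        return 0.0
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return 0.0
    return abs(latest - baseline) / spread  # z-score-style deviation

# Toy metric streams: name -> (recent history, latest value)
streams = {
    "checkout_p99_latency_ms": ([120, 118, 125, 122, 119], 240),
    "batch_job_cpu_pct":       ([55, 60, 58, 52, 57], 61),
    "api_error_rate_pct":      ([0.2, 0.3, 0.2, 0.25, 0.2], 2.1),
}

ranked = sorted(
    ((deviation_score(hist, latest), name) for name, (hist, latest) in streams.items()),
    reverse=True,
)
for score, name in ranked:
    print(f"{score:5.1f}  {name}")
```

Real systems layer seasonality, topology, and service criticality on top of this, but even a crude deviation score pushes the checkout latency spike and error-rate jump above routine CPU wobble.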
Finally, to combat alert fatigue, the ideal observability system would automatically assess and contextualize alerts based on factors like severity, potential impact, and historical patterns. By prioritizing alerts and filtering out noise, teams can focus on high-impact issues and critical incidents without constant disruptions from low-priority alerts. This context-driven approach cuts interruptions and empowers SREs to maintain greater control over their environments with minimal manual intervention.
At Senser, we’re building solutions with these challenges in mind, creating tools that deliver intelligent, automated observability to support resilient, reliable systems.
The Bottom Line
Blind spots, metric overload, and alert fatigue are all products of the growing complexity of modern production environments, which keeps them feeling like moving targets. But future-proof solutions are beginning to emerge, offering long-awaited support for SREs and operations teams. The future of observability lies in smart, automated tools that provide comprehensive visibility, insightful metrics, and meaningful alerts, enabling teams to build reliable, resilient systems while reducing the operational burden.
We look forward to continuing along the conference trail and deepening our understanding of the challenges ahead for the observability space. As an AIOps company, Senser is always innovating to close gaps like the ones discussed in this post. Though many of them will remain top issues heading into 2025, there’s more hope than ever that SREs will be able to put these challenges behind them in the near term and focus on higher-value, proactive initiatives. If you’re interested in learning more, please reach out to set up an introductory conversation or demo.