“I have Grafana. Now what?” – Building a better observability stack

As an organization moves from legacy monoliths to Kubernetes, its DevOps team must answer the question: “How do I structure my new monitoring and alerting tech stack to produce high-quality, holistic data for observability?”

If the team asks other ITOps professionals working in complex cloud environments, odds are they’ll get a litany of responses – but nearly all of them will involve some hodgepodge of open-source tools:

  • “We use Loki, Grafana, Tempo, and Mimir (LGTM), with OTLP as the protocol.”
  • “We use OTel collectors, Thanos for metrics, Grafana for visualization.”
  • “We’ve got Grafana for dashboarding, Prometheus as a data source, and cAdvisor for container resource and performance monitoring.”

(If you look closely, you’ll see that all of the examples above feature at least one Grafana product. More on this below.)

You get the idea. How should this team think about the tradeoffs involved in building an efficient observability stack that delivers insights to improve production stability, uptime, and performance? 

Let’s imagine you’re an SRE at a mid-market org that is moving away from a commercial APM tool to an open-source observability (OSO) stack. After weeks of research and product advocacy to your leadership team, you end up with an observability stack consisting of Prometheus, Jaeger, ELK, and Grafana.

At a high level, let’s break down what each of those frontends and backends is used for (a minimal instrumentation sketch follows the list):

  • Prometheus – collects and stores time-series metrics
  • Jaeger – collects and visualizes distributed traces, mapping the flow of requests and data across services and APIs
  • ELK – Elasticsearch, Logstash, and Kibana (open-source tools that make up the Elastic Stack). Used to centralize logs from operational systems and structure telemetry data for efficient search.
  • Grafana – consolidates and visualizes the telemetry data in dashboards
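
To make the division of labor concrete, here’s a minimal sketch of the application side of that stack: a small Python service exposing metrics for Prometheus to scrape, which Grafana can then chart. The metric names, port, and workload shown are illustrative assumptions, not a prescribed setup.

```python
# Minimal sketch (assumed Python service): expose request metrics on /metrics
# so a Prometheus scrape job can collect them; Grafana then reads Prometheus
# as a data source. Metric names and the port are hypothetical.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "checkout_requests_total", "Total checkout requests handled", ["status"]
)
LATENCY = Histogram("checkout_request_seconds", "Checkout request latency")

def handle_request() -> None:
    """Simulate handling a request, recording its latency and outcome."""
    with LATENCY.time():
        time.sleep(random.uniform(0.01, 0.2))
    REQUESTS.labels(status="ok" if random.random() > 0.05 else "error").inc()

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://<pod-ip>:8000/metrics
    while True:
        handle_request()
```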

Let’s double-click on Grafana here, and on why many DevOps teams either start with it or quickly realize it’s a necessity for their stack.

By consolidating data from various sources and presenting it in customizable dashboards, Grafana is a boon for monitoring and alerting teams. Not only does it help practitioners visualize the data in their environment, but it also helps stakeholders and decision-makers view that data in an easily digestible way – a must-have for ITOps folks who often need to report back to external teams.
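
As a hedged illustration of that consolidation, the sketch below pushes a simple dashboard to Grafana’s HTTP API so the same Prometheus query is visible to engineers and stakeholders alike. The URL, API token, metric name, and panel layout are assumptions for demonstration only.

```python
# Minimal sketch: create a Grafana dashboard over the HTTP API.
# The Grafana URL, token, and the metric queried are hypothetical.
import requests

GRAFANA_URL = "http://localhost:3000"  # assumed local Grafana instance
API_TOKEN = "glsa_example_token"       # assumed service-account token

payload = {
    "dashboard": {
        "title": "Checkout service overview",
        "panels": [
            {
                "type": "timeseries",
                "title": "Request rate by status",
                "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
                "targets": [
                    {"expr": "sum by (status) (rate(checkout_requests_total[5m]))"}
                ],
            }
        ],
    },
    "overwrite": True,  # replace an existing dashboard with the same uid/title
}

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json().get("url"))  # path of the newly created dashboard
```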

If we go back to the three example open-source stacks above – the “LGTM,” the “Thanos/OTel,” and so on – we’ll notice that the common throughline is the presence of Grafana. Regardless of how teams build their open-source stack, the common denominator is the need for a powerful consolidation and dashboarding tool.

That said, full observability of all layers in a production environment (infrastructure, apps, K8s, etc.) still requires a suite of tools (even if many of them are available under the Grafana umbrella), rather than one unified platform. Building out the rest of that observability functionality is often the biggest hurdle for teams to clear.

For lean teams growing their cloud environment with Kubernetes, and/or making the switch from APM (Application Performance Management), a fully open-source observability stack seems to be the most cost-efficient and customizable path to strong observability. Organizations priced out of SaaS observability platforms turn to increasingly complicated open-source stacks that essentially replicate the full-service functionality of a New Relic or a Datadog. But even these open-source tools can quickly lead to ballooning costs.

The obvious challenge lies in the sheer volume of tools to configure and maintain – although open-source software may be free to install, it is rarely free to deploy or support. Couple that with the need to keep up with five or six annual release cycles across your entire stack, and you end up with a labor- and resource-intensive process.

The problem is that you need all of these frontend and backend tools to actually build strong observability. Grafana itself offers a “core stack” that handles logs, metrics, traces, and visualization (known as “LGTM”), but the observability stack doesn’t end there. Open-source advocates will tell you they supplement LGTM with OpenTelemetry collectors, application telemetry, labeling, dashboard configuration, and alerting configuration.
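
To give a feel for that supplemental work, here is a minimal, hypothetical sketch of application-side tracing with the OpenTelemetry Python SDK, exporting spans over OTLP to a collector. The collector endpoint, service name, and span attributes are assumptions; real deployments add samplers, resource attributes, metric and log pipelines, and alert rules on top of this.

```python
# Minimal sketch: emit a trace from application code to an OpenTelemetry
# Collector over OTLP. The endpoint and names below are hypothetical.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Wire the SDK to a collector; the collector's exporter config decides whether
# spans land in Jaeger, Tempo, or elsewhere.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def charge_card(order_id: str) -> None:
    """Record one unit of work as a span with a searchable attribute."""
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)
        # ... call the payment provider here ...

if __name__ == "__main__":
    charge_card("order-123")
    provider.shutdown()  # flush pending spans before exit
```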

So we come back to the question: “I have Grafana. Now what?” Going with a suite of open-source tools to complement your Grafana investment seems like a solution, but as we outline below, cobbling together robust observability from multiple tools creates plenty of issues of its own. Ostensibly, the alternative is a SaaS or self-hosted platform – but most of these are either prohibitively expensive or don’t fit the way your DevOps team’s environment actually works. And even with Grafana in place, teams still have to work hard to turn data into actionable insights and serve it to human engineers in a digestible way.

Let’s go back to that hypothetical SRE at a mid-market org, looking to build an OSO stack. They’re confident in their Prometheus, ELK, Grafana, and Jaeger observability stack. But a few months into the endeavor, the supposed cost-saving power of an open-source stack isn’t all it’s cracked up to be. Their stack is experiencing growing pains:

  • A slow onboarding process caused by heavy instrumentation requirements and complex configurations is costing the team time and money. 
  • Incomplete telemetry is leading to blind spots. 
  • Monitoring agents running across five namespaces are consuming 3x the resources of the main workloads just to handle logs, metrics, and traces.

Decision-makers at the organization want to switch to a paid version of a leading SaaS provider, but keep Grafana for dashboarding. The SRE ends up with a stack that is totally different from, but just as convoluted as, the one they started with: Prometheus and Grafana, and now New Relic.

This all-too-familiar story illustrates the double-edged sword of building an observability stack. You can “save money” by constructing an open-source stack around a “core stack” package like Grafana, only to experience configuration and resourcing issues as you try to manage the stack’s disparate components. You end up switching to a paid SaaS platform at high cost, only to keep some elements of your open-source stack for ease of dashboarding. You get the worst of both worlds and the best of neither.

Most pro-open-source SREs will tell you that the ease of dashboarding and visualization provided by Grafana makes it a must-have for site reliability teams. We tend to agree. The backend – actually sourcing the data that Grafana visualizes – is where most teams end up adding a plethora of different tools to build out their observability.

On paper, open-source tools exist to provide reliability practitioners with all the functionality of highly expensive SaaS solutions at a fraction of the cost. The issue is that Grafana cannot stand on its own, and the sheer number of tools needed to support it leads to a mountain of onboarding woes that site reliability teams must work through before achieving strong observability.

Another significant limitation (common to both open source and commercial solutions) emerges when it comes to providing comprehensive coverage and facilitating effective root cause analysis. Open source stacks compartmentalize data into silos, without offering a unified view that spans all layers of an application or infrastructure. Even with the ease of dashboarding and visualization offered by Grafana, this segmentation results in more time spent sifting through dashboards looking for the root cause during a time-sensitive production incident.

Senser is a cost-effective alternative to both ultra-premium SaaS solutions and the “hidden” costs of open-source observability. It provides zero-instrumentation production insights via advanced data-sampling techniques built on eBPF, and deploys in minutes with no complex configurations or code changes, reducing the organizational cost to implement and maintain an observability backend. By using eBPF to build a dynamic, real-time system topology and leveraging that data to power machine-learning root cause analysis, Senser handles all of the backend work – metrics, logs, traces, events – and creates a single pane of glass for observability.

By mapping a complete picture of any given production environment, and delivering actionable insights for all layers, Senser helps teams avoid the pitfall of onboarding an expensive SaaS tool – only to realize it doesn’t cover all the gaps they’re looking to fill (such as holistic causality tracing through all production layers). In short, switching to Senser saves teams the trouble of bailing on their OSO stack in favor of an incomplete SaaS alternative, only to return to some OSO tools to make their Grafana-based observability stack work.

In summary, Senser complements Grafana in three ways:

1. Completing telemetry coverage

  • Filling coverage gaps with eBPF-powered data collection across all applications, infrastructure, networks, and APIs
  • Providing a single pane of glass for visibility into all environments, layers, and telemetry data types

2. Enhancing observability

  • Mapping complex dependencies across all layers of the environment
  • Generating dashboards and reports automatically

3. Turning data into actionable insights

  • Identifying issues automatically (especially “unknown unknowns”) to drive proactive rather than reactive problem-solving
  • Accelerating troubleshooting by correlating issues across the environment and automating root cause analysis

Senser integrates without re-instrumentation and lets SREs keep visualizing through their existing Grafana flows. It is a true complement to the Grafana foundation, using highly scalable and cloud-friendly technology to deliver on the promise of a litany of other, disparate open-source tools.