Shubham Srivastava from our team had the pleasure of meeting Andreas Grabner at KubeCon + CloudNativeCon Europe earlier this year. Andreas wears many hats in his daily work, primarily serving as a DevOps Activist at Dynatrace, where he has spent over 16 years shaping the Observability solutions we see today. He is also a Developer Advocate at Keptn, helping teams automate and orchestrate their deployments end-to-end, and plays an active role as an Ambassador in the CNCF community.

With extensive experience in the Observability and Cloud Native space, Andreas is a recognized authority on the needs of DevOps and Site Reliability professionals today.

What are some misconceptions about OpenTelemetry? Is AI the future of observability? What does a good observability framework look like? We talk about these questions and more with Andreas!

Tune in to hear our chat with Andreas Grabner!

Shubham: What were you looking forward to at KubeCon?

Andreas: I'm actively involved with the CNCF community, particularly around the open-source project Keptn. Additionally, OpenFeature is significant for us at Dynatrace since we launched it as an open-source project.

Another area we're very interested in is OpenTelemetry, especially in terms of observability. Dynatrace has been focusing on observability for about 18 to 20 years, and I've been part of the team for 16 years. Historically, we've used our agents for observability. However, we now appreciate having OpenTelemetry as an open standard. It enables developers to integrate telemetry information into their apps and define what is important for them.

So, they get this information, whether it's logs, metrics, or traces. OpenTelemetry is a big deal. What I would like to get out of this conference is a deeper understanding of the use cases for observability that we haven't considered yet. Additionally, I want to learn what extra resources developers and the community need to make better use of observability.

We know that observability, long referred to as monitoring, and telemetry have both been around for a while, yet many people are still new to these concepts.

There is still a lack of information, tutorials, documentation, and best practices. We're trying to figure out what is needed so we can create better content to address these gaps.

Shubham: It's strange to hear that observability is still not a day-one concern for many teams building software today. There's been significant development in this area though, with many competitors emerging. People keep bringing up the Datadog $65 million bill controversy, which shows that the market is indeed looking for fresher options. What do you think of the space? Do you feel there are just a lot of vendors running towards the money, or is there real innovation happening?

Andreas: There's always innovation happening. Having multiple people trying to solve a problem is always good because otherwise, it would be a monopoly, and there wouldn't be any innovation at all.

I think it's good, as you said, that we are collaborating.

Even though some of us are competitors, we work together towards the common goal of making it easier to get observability data.

The reason for this is that it's really hard to build and maintain agents that handle all the instrumentation.

Building one agent is easy, but building it for 10 or 15 different runtimes and languages and keeping it up to date is hard. That's why OpenTelemetry is beneficial for everyone—for the developers who use it and for the vendors. I think that's important. Another significant topic is eBPF, which has been relevant for a while.

We see a lot of new vendors entering that space, which also compels existing vendors to explore this technology because these newcomers often introduce innovative ways to capture observability data. 

"One misconception I've noticed is that people sometimes believe OpenTelemetry is a silver bullet. "

OpenTelemetry is just one part of the story. It defines how we gather data and what types of data we collect. However, you still need mechanisms to extract this data from the applications that generate metrics, logs, and traces according to that standard. Then, you must transfer and store this data somewhere before analyzing it. 

Many people still mistakenly believe that OpenTelemetry alone is sufficient. No, OpenTelemetry is not all you need. It allows developers to specify which logs, metrics, traces, or spans they want to capture. However, you still need to address how to efficiently collect and transport this data at scale to where it needs to be stored, managed, and analyzed.

In between, there are components like the OpenTelemetry Collector, which collects data and forwards it to backend storage. This is something I've observed in my conversations: many people understand that OpenTelemetry is not the entire solution, but not everyone is aware of it yet.
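
To make that concrete, here is a minimal sketch of an OpenTelemetry Collector pipeline, assuming an application that already emits OTLP and a generic OTLP-compatible backend; the endpoint is a placeholder.

```yaml
# Minimal OpenTelemetry Collector config: receive OTLP data from apps,
# batch it, and forward it to an observability backend of your choice.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}   # batch spans, metrics, and logs before export to reduce overhead

exporters:
  otlphttp:
    endpoint: https://otel-backend.example.com   # placeholder backend endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

The Collector only handles the middle of the journey; the application still has to emit the data, and the backend still has to store and analyze it.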

The Reliability Stories You Won’t Hear on LinkedIn
In a conversation with Ponmani, a LinkedIn SRE, we discussed his career path, the challenges he’s faced in reliability engineering, and the best practices he’s developed. Click here to find out!

Shubham: You know, we were talking about how every organization today needs to have a good observability framework. There's one thing that I've seen missing in this space so far. You can maybe correct me on this, but the user experience for someone being onboarded onto observability tools is still not very straightforward. It's a pretty steep learning curve, especially for younger engineers.

Andreas: Well, I think this is where platform engineering comes into play. Whether you call it platform engineering, DevOps, or any other name, it's about ensuring that the tools and processes for observability are accessible and usable for everyone involved.

We've had many discussions about the terminology, but I think what platform engineering is capable of doing is making observability easily accessible. In the end, as I recall from a ThoughtWorks blog, platform engineering is about centralizing expertise while decentralizing innovation. This means if you have people in your organization who understand observability, data collection, and analysis – empower them to build a self-service platform. This way, developers can easily access the data they need.

Developers like to configure things with YAML or JSON, so we need to make observability just as easy to switch on, like setting a flag to "yes" in a YAML file, right?

For instance, I've seen some of our customers use a configuration file where developers simply specify what they need: logs, traces, metrics, and so on. 

They commit that file, and the platform takes care of collecting these data types and pushing them back to developers in their preferred tools, such as Backstage. Alternatively, the platform might aggregate logs and metrics daily and send them back to a Slack channel. I believe it's crucial for developers to easily access and consume this data without having to learn numerous new tools.
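
As an illustration only, a self-service request like the one Andreas describes might look something like this; the file name and every key here are hypothetical and not taken from any specific platform.

```yaml
# observability.yaml - hypothetical self-service request a developer commits
# alongside their service; the platform team's tooling reads it and wires up
# collection and delivery.
service: checkout
signals:
  logs: true
  traces: true
  metrics: true
delivery:
  backstage: true              # surface dashboards in the team's Backstage portal
  slack:
    channel: "#team-checkout"
    daily-summary: true        # aggregated logs and metrics pushed back once a day
```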

Shubham: Gen AI shot up last year. You know, a lot of people are talking about how AI could make observability easier as well. We saw some companies trying to innovate on how you could “talk” to your metrics. Do you see more features aligned with this very trend coming up?

Andreas: Yeah, it's interesting because, at Dynatrace, we've been working on this for a while.

About eight years ago, we introduced our own AI named "Davis", which is still actively performing analysis. We were also among the first to implement a natural language interface to our AI. This included integrations with Alexa and Google, allowing users to interact through chat or voice.

Six or seven years ago, you could already ask questions like, "Hey, Davis, show me the response time of my service," and it would provide textual responses and even render graphs.

Back then, we noticed that only a few people were using it—it was more of a novelty. Perhaps now with the current surge in AI technology, the timing might be more favorable.

People are increasingly accepting the idea of using natural language to ask questions and receive responses, and we're fortunate to have been in this space for so long and pioneered these capabilities.

However, the latest advancements in Large Language Models (LLMs) are also helping us make our offerings more accessible. So, to answer your question, yes, I believe these advancements will certainly benefit us.

On the other hand, when it comes to more seasoned developers, I predict they would still prefer the familiar ways of communicating and interfacing with systems. I feel they still want a YAML file, checkboxes, tables, command-line tools, or whatever they need to execute within their IDE. Very straightforward.

Bob Lee’s Proven Strategies for Scaling Systems Reliably
Learn all about Bob's journey as Lead DevOps Engineer at Twingate and uncover the strategies powering his progress.

Shubham: I've spoken with many larger enterprises that have been established for a while and are quite set in their ways. When a new open-source technology gains popularity, there's often hesitation from these enterprises. They lack trust in open-source solutions, uncertain about their maturity and potential rough edges that might impact their infrastructure. They typically wait until the technology has been thoroughly tested by the industry before adopting it.

For open-source maintainers, do you have any advice on making your projects seem more mature and building trust with enterprises?

Andreas:  The interesting thing is that organizations can be hesitant to adopt new technologies, especially larger enterprises with a long history. They've likely experienced various hype cycles and might have been burned in the past by jumping on trends too quickly.

For open-source projects or any software, it's crucial to demonstrate real adoption. One of the biggest challenges for open-source projects is gauging how many people are using them, since tracking users isn't usually allowed. Engaging with the user community and encouraging them to come forward and share their success stories is essential. Adding their names to an adopters file can build confidence in potential users by showing that multiple organizations are successfully using the project.

Diversity in contributors is also important. A project should not be driven by just one or two companies. A good mix of contributors from various organizations helps ensure the project's robustness and longevity.

Ultimately, whether you're dealing with vendor lock-in or community lock-in, the same principle applies: you need to invest in a sustainable community. If the community behind an open-source project isn't thriving, the project itself may not last. Making investments in fostering a long-lasting and active community is crucial for the success and trustworthiness of open-source projects.

Shubham: And that’s a perfect segue to move on to Keptn now, where you're also a developer advocate. Can you let us know more about this project and the vision behind it? 

Andreas: Let me explain the problem we're addressing.

Kubernetes offers great tools like Argo, Flux, or even simple kubectl apply for deploying workloads. When deployments go smoothly, everything is perfect. But what if deployments suddenly fail or become slower? How do you troubleshoot why Kubernetes is struggling to deploy? This is where Keptn comes in.

We provide observability into Kubernetes deployments by extending the Kubernetes pod scheduler using webhooks. Whether you're deploying with Argo, Flux, or kubectl apply, Keptn sits at the center, creating OpenTelemetry traces that feed into distributed traces. These traces precisely show what the pod scheduler is doing as it places pods.

You can see dependencies between pods, initialization container wait times, and any errors encountered. Essentially, we make deployments observable and easily troubleshootable. If a deployment that used to take 10 seconds suddenly takes a minute, you can see exactly why.

In addition to traces, we generate metrics such as deployment frequency, duration, and failure rates—classic DORA metrics—to give you insights into deployment performance.

But that's not all. Keptn also allows you to execute tasks before and after deployment to ensure the environment is ready and the application is healthy. You can define tasks and use SLO-based evaluations to validate conditions before deployment—like checking resource availability or database readiness. After deployment, you can verify application health using integrations with tools like Prometheus, Dynatrace, New Relic, or Datadog to ensure metrics and logs are flowing correctly and everything is functioning as expected.

We emphasize that just because a pod is ready doesn't necessarily mean the application is ready, which is why Keptn helps you validate every step of the deployment process.
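
As a rough sketch of how this can look on a workload, the annotations below follow Keptn's documented pre- and post-deployment hooks (exact keys can differ between Keptn versions), and the task and evaluation names are placeholders.

```yaml
# Deployment annotated for Keptn: the labels identify the workload, and the
# annotations hook in pre-/post-deployment checks. "check-db-ready" and
# "verify-slo" are placeholder task/evaluation definition names.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 1
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
        app.kubernetes.io/name: checkout        # used to identify the workload
        app.kubernetes.io/version: "1.4.2"
      annotations:
        keptn.sh/pre-deployment-tasks: check-db-ready        # placeholder task
        keptn.sh/post-deployment-evaluations: verify-slo     # placeholder evaluation
    spec:
      containers:
        - name: checkout
          image: ghcr.io/example/checkout:1.4.2              # placeholder image
```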

Shubham: Have you been in a tense War Room in your many years of working in the software industry? What role does proper observability play in such situations?

Andreas: I have been involved in some critical situations at times. Sometimes, I've even caused those issues myself by making changes to systems that led to failures. This is where observability becomes crucial because, during a War Room scenario — although I believe it should be called something more positive, like a "peace room" — you need insights and observability to understand the impact of a problem.

It's essential to differentiate between a real crisis and a false alarm. Observability helps us gauge the true impact of an issue. You also need observability to pinpoint the root cause. Knowing only the symptoms makes it challenging to resolve the problem effectively.

When we discuss observability, we often talk about logs, metrics, and traces, but events are also crucial. Events, especially change events, help track who made changes, deployed new versions, altered service routing, or turned on a feature flag. This information is important for identifying the root cause of issues. If changes go unnoticed or unrecorded, troubleshooting becomes nearly impossible.

This is why I'm a strong advocate for GitOps. Changes to configurations should be made in Git, which provides automatic version control and a process for approvals. This transparency allows you to see who made, approved, and deployed changes, making it easier to identify the root cause of problems.

Behind the Scenes with an Observability Advocate - Akshay
Join Akshay, an observability advocate, behind the scenes. Learn about the challenges and rewards of navigating complex systems, onboarding new engineers, and intriguing war room stories.

Shubham: Zenduty focuses on making noiseless alerting possible for fast-moving teams around the world. This requires our customers to set the right thresholds and get the most out of their observability tools. How can organizations optimize their use of observability tools to achieve effective, noiseless alerting?

Do you have any advice for setting up observability tools to minimize excessive alerts? How can engineers ensure they receive only the most critical notifications, freeing up their bandwidth for other tasks?

Andreas: Setting static thresholds is something we've been cautious about at my company over the past 16 years. Instead, we emphasize dynamic baselining to understand normal system behavior and alert on anomalies. Different metrics may have seasonal variations or consistent patterns, requiring various mathematical approaches.

Even with modern AI capabilities, eliminating the need for thresholds isn't feasible. It's crucial to determine what aspects of your software performance directly impact your business or user experience—like page load times affecting user satisfaction or revenue generation. These specific indicators should have clear thresholds set, ideally in collaboration with your business stakeholders.

For other metrics, relying on anomaly detection based on learned baselines is advisable. While these systems can adapt to observed patterns, they can't interpret your business goals. Therefore, combining automated anomaly detection with well-defined threshold settings aligned with business objectives is key to effective alert management.

Shubham: My last question for you, Andreas: As someone from a leading Observability vendor, what would you like incident alerting and response tools like Zenduty to work on to make the lives of our joint end-users easier?

Andreas: One thing we've been focusing on to make all our lives easier is enhancing integration. We're working on extracting more metadata from the systems we observe, which helps streamline the incident alerting process and improves the overall efficiency of our observability solutions.

So, what I mean by this is that when deploying a new app on Kubernetes, you should include ownership information with the deployment. For instance, simply add a label to your workload that states "I am team Andy" and indicates that team Andy prefers to receive notifications via Slack.

Specifying this metadata directly on the deployment allows observability tools like ours to recognize when there's an uptick in error logs. We can then trigger your tool to alert you, saying, "Hey, we're seeing an increase in error logs on this service, and Team Andy, who prefers Slack notifications, is responsible." This approach ensures that the right teams are notified promptly based on their preferences.

You can then take this and add more context. So the message to developers or platform engineers is: don't just think about observability in terms of logs, metrics, and traces. Think about including enough metadata so that any tool analyzing this data understands the context.

For example, specify that a service is critical, belongs to a specific team, and prefers notifications in a certain way. I believe that's crucial. Fortunately, Kubernetes and its resource definitions allow us to augment metadata with additional information through annotations.
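
For illustration, here is the kind of metadata Andreas is describing; the team label is straightforward, but the annotation keys are invented for this example, since there is no single standard for routing metadata yet.

```yaml
# Hypothetical ownership and routing metadata for a workload's pod template
# (slots into spec.template.metadata of a Deployment). The annotation keys
# are invented for illustration only.
metadata:
  labels:
    team: andy                                  # who owns this service
  annotations:
    example.com/criticality: "high"             # how important the service is
    example.com/notify-via: "slack"             # preferred alerting channel
    example.com/slack-channel: "#team-andy"     # where alerts should land
```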

Shubham: And I think there are a lot of players, including us, who would appreciate developments like this. Perhaps by the next time we chat, progress on this front will already be underway.

Andreas: Yeah, and the good news is that there are also initiatives within the CNCF to standardize some of these labels and annotations. It's not only about ownership but also includes information like the version of the workload and which application it belongs to. This standardization effort aims to enhance clarity and consistency across observability practices within the industry.

So folks, if you're listening, check out the Kubernetes.io labels for components like name and version. There are various standard labels and annotations promoted by the CNCF working group on platform engineering as best practices. These resources can help you implement consistent and effective observability strategies in your deployments that make your incident response processes faster, easier, and more effective.
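
The recommended labels Andreas mentions are documented on kubernetes.io; a typical set on a workload looks like this.

```yaml
# Kubernetes recommended labels (see "Recommended Labels" on kubernetes.io).
metadata:
  labels:
    app.kubernetes.io/name: checkout        # the application's name
    app.kubernetes.io/version: "1.4.2"      # the version currently running
    app.kubernetes.io/part-of: webshop      # the higher-level application it belongs to
    app.kubernetes.io/component: backend    # the role within the architecture
    app.kubernetes.io/managed-by: argocd    # the tool managing the workload
```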

If you're fascinated by reliability and the intricate process of recovering from downtime, check out our podcast, Incidentally Reliable, where veterans from Docker, Amazon, Walmart, and other industry-leading organizations share their experiences, challenges, and success stories from the Cloud Native world.


If you're looking to streamline your incident management process, Zenduty can reduce your MTTA & MTTR by at least 60%. With our platform, engineers receive timely alerts, reducing fatigue and boosting productivity.

Sign up for a free trial today!

Anjali Udasi

Along with Shubham Srivastava

As a technical writer, I love simplifying technical terms and writing about the latest technologies.