The Reliability Stories You Won’t Hear on LinkedIn
We had the pleasure of meeting Ponmani Palanisamy, a Staff Site Reliability Engineer at LinkedIn, at a recent SRE Meetup in Bangalore.
Ponmani gave an insightful talk on "Improving data redundancy and rebalancing data in HDFS." We were captivated by his talk and eager to learn more about his experience in the reliability space.
We talked about everything: his journey, his experiences over a steady 17-year career, and, of course, his most memorable war room stories. Here's what he had to share.
Anjali:
You started your journey as a Software Engineer and eventually moved to an SRE/DevOps role. How did your priorities change during this transition? Do you miss being a bigger part of the development cycle?
Ponmani: I started as a Software Developer, switched to a DevOps engineer role after 4 years, switched back to a Developer role 3 years later, and worked a combined SDE+SRE role for the next 5 years. I have been an SRE again for the last 2 years. I love the entire spectrum and don’t think one is better than the other.
In my various roles, priorities differed greatly. Software engineers are all about building features fast. Their primary objective is to deliver functionality to customers as quickly as possible. This involves designing and building efficient, reliable, and scalable services.
While they consider non-functional requirements (NFRs) like security and performance, these are secondary to core functionality. SWE/SDE roles are fast-paced; engineers work closely with product owners to continuously ship new features. The main metrics for success are speed to market and revenue generated.
An SRE's mission is all about keeping the house running smoothly for the customers. Unlike SWEs who focus on building new features, SREs are the guardians of reliability and security. Their top priorities are:
- Guaranteed uptime: Ensuring the website or application is always available and accessible to users (keeping up SLOs).
- Flawless performance: Maintaining high-quality service with minimal errors or slowdowns.
- Fortress-like security: Protecting the system from vulnerabilities and keeping data safe.
Moving on to the second part of the question:
"I am a firm believer that software development skills do help an SRE to understand system design better so they can build efficient tools and platforms. In the same way, having a good understanding of System Engineering and aspects of reliability makes a SWE develop more resilient software."
The current SRE landscape is changing so fast that SREs have to build large, complex platforms and observability tools and automate a lot of operational work. So I don’t miss being an SWE, since most of the SWE aspects are still part of my job.
Anjali:
LinkedIn averages over 300 million monthly active users. What kinds of unpredictable problems come with this unprecedented scale?
Ponmani: Large systems, like those at LinkedIn, are under constant pressure, and that pressure creates vulnerabilities. Just like in mechanics, stress over time can cause cracks to appear. In software systems, these cracks become exposed under high load, leading to failures.
Scale acts as a magnifying glass, revealing weaknesses throughout the system. This includes not just core applications and data storage, but also seemingly unrelated services like DNS, monitoring systems, and even logging.
The complex web of dependencies in large organizations creates additional risk. A seemingly unimportant service can cause a critical system to fail because of these interconnected layers. Even routine maintenance activities like upgrades can disrupt operations at scale.
To overcome these challenges, a multi-pronged approach is crucial. Designing with resilience in mind, implementing robust change management practices, and employing automatic failovers are all essential. Comprehensive monitoring and alerting systems are also needed to catch problems before they impact users.
Anjali:
When building microservices that handle millions of transactions daily, what does your process entail? Where do you start, and what considerations do younger teams tend to miss?
Ponmani: In high-scale systems with thousands of microservices handling millions of requests per second, building resiliency into every service is crucial. This means defining clear SLOs (Service Level Objectives) for each service, certifying them for a maximum load, and constantly monitoring performance to ensure they meet those targets.
"Skipping the SLO definition is a recipe for disaster."
With thousands of microservices serving millions of QPS, engineers need to ensure they have solid platforms to take care of the following aspects:
- Deployment
- Auto Scaling
- Config management
- Fleet management
- Traffic Management
- Load balancing
- Service discovery
- BCP and Failovers
- Observability (Logs, Events, Metrics and Traces)
These need to be given the utmost importance and not treated as an afterthought.
Let's dive into some common pitfalls that younger teams might encounter when operating high-scale systems.
Handling failures gracefully:
While retries seem like a simple solution to service failures, they can backfire without proper safeguards.
"Uncontrolled retries can lead to "thundering herd" problems, where a surge in retries overwhelms the struggling service further."
"The key is to implement exponential backoff and circuit breakers."
Exponential backoff increases the retry interval after each attempt, preventing a concentrated barrage of requests. Circuit breakers act as a safety net, automatically stopping retries for a period when failures become excessive.
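As a rough illustration of these two safeguards (a sketch, not LinkedIn's client code), a retry loop with exponential backoff plus jitter, gated by a small circuit breaker, could look like this:

```python
import random
import time

class CircuitBreaker:
    """Tiny illustrative circuit breaker: opens after N consecutive failures."""
    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: let a trial request through once the cool-down has passed.
        return time.monotonic() - self.opened_at >= self.reset_after

    def record(self, success: bool) -> None:
        if success:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

def call_with_backoff(request, breaker, max_attempts=4, base_delay=0.2):
    """Retry with exponential backoff and jitter, gated by the circuit breaker."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open, failing fast")
        try:
            result = request()
            breaker.record(success=True)
            return result
        except Exception:
            breaker.record(success=False)
            # 0.2s, 0.4s, 0.8s, ... plus jitter to avoid synchronized retries.
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
    raise RuntimeError("exhausted retries")
```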
This approach extends beyond application logic to resource management. Data stores, like database clusters, should also employ quota-based throttling. This ensures no single service can monopolize resources and bring down the entire cluster.
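One common way to enforce such quotas is a per-caller token bucket; the sketch below is purely illustrative and not tied to any particular data store:

```python
import time
from collections import defaultdict

class QuotaThrottle:
    """Illustrative per-caller token bucket: each service gets its own quota."""
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = defaultdict(lambda: burst)
        self.last = defaultdict(time.monotonic)

    def allow(self, caller: str) -> bool:
        now = time.monotonic()
        # Refill the caller's bucket based on elapsed time, capped at burst size.
        self.tokens[caller] = min(
            self.burst, self.tokens[caller] + (now - self.last[caller]) * self.rate
        )
        self.last[caller] = now
        if self.tokens[caller] >= 1:
            self.tokens[caller] -= 1
            return True
        return False  # caller exhausted its quota; shed or queue the request

# Usage: cap any single caller at 100 QPS with bursts of up to 20 requests.
throttle = QuotaThrottle(rate_per_sec=100, burst=20)
```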
Autoscaling has a limit:
"While scaling stateless services with CPU and memory seems straightforward, it's not always the silver bullet. "
Databases, a critical component in many systems, often present a bottleneck. Scaling them can be complex, often involving time-consuming data repartitioning and migration.
"Rather than relying solely on scaling, it's wise to build in failsafe mechanisms and quota-based throttling."
This proactive approach ensures graceful degradation under heavy load and prevents a single service from hogging resources.
"Remember, infinite scaling is a myth – plan for realistic limitations."
Observability in Microservices:
Instrumenting and capturing metrics, logs, traces, and events is mandatory; running microservices without them is unwise. A good observability stack is paramount in the world of microservices.
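As a minimal illustration of the metrics piece (using the open-source prometheus_client library as a stand-in, not LinkedIn's internal stack), instrumenting a request handler might look like this; logs, traces, and events would be wired up in the same spirit, for example through OpenTelemetry:

```python
# pip install prometheus_client -- a minimal metrics sketch, not LinkedIn's stack.
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Requests served", ["endpoint", "status"])
LATENCY = Histogram("http_request_seconds", "Request latency", ["endpoint"])

def handle_profile_request():
    # Record latency for this endpoint and count the outcome by status code.
    with LATENCY.labels(endpoint="/profile").time():
        try:
            ...  # actual business logic goes here
            REQUESTS.labels(endpoint="/profile", status="200").inc()
        except Exception:
            REQUESTS.labels(endpoint="/profile", status="500").inc()
            raise

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for the monitoring system to scrape
```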
Anjali:
Automation in DevOps and SRE processes has become essential for ensuring the reliability and efficiency of modern systems. Would you like to provide a few examples of successful automation initiatives you've implemented for our readers?
Ponmani:
"The mantra for high-scale systems: Automate everything you can!"
Any mitigation step or action documented in a runbook is a prime candidate for automation. This approach reduces manual workload, improves consistency, and minimizes human error.
To illustrate the power of automation, here are some examples from my work at LinkedIn, alongside industry favourites that impressed me:
- Before joining LinkedIn, I tackled the challenge of setting up an on-demand load-testing environment in Kubernetes at my previous company. Our goal was to comprehensively test performance, including databases, caches, Kafka, and other dependencies.
"The load test needs to be scripted, which we did. But we didn’t want this cluster to be up and running all the time and cost us a fortune. The key challenge was optimizing resource utilization."
We didn't want a constantly running cluster. The solution? Automating the entire workflow.
This included:
- Cluster creation on demand: Kubernetes spun up a cluster only when needed for load tests.
- Service deployment: All required services like databases and caches were automatically deployed within the cluster.
- Data population: Test data was efficiently copied into the environment.
- Load test execution: The scripted load test ran its course, capturing performance metrics.
- Metrics upload: Test results were uploaded for analysis.
- Cluster teardown: Once the test was completed, the entire cluster was automatically shut down to minimize cost.
Infrastructure as code (IaC) proved to be a powerful tool, enabling us to manage the entire process with ease and control.
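A drastically simplified orchestration of that workflow might look like the sketch below; the cluster commands (eksctl here) and manifest paths are placeholders for whatever tooling and IaC definitions a team actually uses:

```python
import subprocess

def run(cmd):
    """Run a shell command and fail loudly, so the pipeline stops on errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def load_test_pipeline():
    try:
        # 1. Cluster creation on demand (eksctl is one option; any IaC tool works).
        run(["eksctl", "create", "cluster", "--name", "loadtest", "--nodes", "6"])
        # 2. Deploy databases, caches, Kafka, and the services under test.
        run(["kubectl", "apply", "-f", "manifests/"])
        # 3. Populate test data, then 4. run the scripted load test as Kubernetes Jobs.
        run(["kubectl", "apply", "-f", "jobs/seed-data.yaml"])
        run(["kubectl", "apply", "-f", "jobs/load-test.yaml"])
        run(["kubectl", "wait", "--for=condition=complete", "job/load-test", "--timeout=2h"])
        # 5. Metrics upload happens from inside the load-test job itself.
    finally:
        # 6. Tear the whole cluster down so it never sits idle and burns money.
        run(["eksctl", "delete", "cluster", "--name", "loadtest"])
```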
- At LinkedIn, we have a framework called Nurse that lets us attach auto-remediation steps as actions to any alert.
This framework provides a menu of pre-built actions, allowing on-call teams to address common alerts swiftly and efficiently. Additionally, custom actions can be integrated for more specialized situations; a generic sketch of this alert-to-action pattern follows the examples below.
For example:
- Hardware issues: If an alert indicates hardware problems on a host, Nurse can automatically exclude that host from the serving pool, preventing further disruptions.
- Application performance: If an application fails due to high GC, Nurse can trigger a heap dump capture to aid troubleshooting before restarting the application.
- LinkedIn’s OS Upgrades Automation, the Automated Live Load Testing framework, and production testing with dark canaries are other great examples of automation making lives easier, reducing failures, and saving costs.
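Nurse itself is internal to LinkedIn, but the underlying pattern, mapping alert types to remediation actions and falling back to a human when no action is registered, can be sketched generically (this is not Nurse's actual API):

```python
# Generic alert-to-remediation dispatch: illustrates the pattern, not Nurse's API.
REMEDIATIONS = {}

def remediation(alert_type):
    """Register a function as the automatic action for a given alert type."""
    def register(fn):
        REMEDIATIONS[alert_type] = fn
        return fn
    return register

@remediation("hardware_failure")
def remove_host_from_pool(alert):
    print(f"Excluding {alert['host']} from the serving pool")

@remediation("high_gc_pause")
def capture_heap_dump_and_restart(alert):
    print(f"Capturing heap dump on {alert['host']}, then restarting the app")

def page_oncall(alert):
    print(f"Paging on-call for unhandled alert: {alert['type']}")

def handle_alert(alert):
    action = REMEDIATIONS.get(alert["type"])
    if action:
        action(alert)      # known failure mode: auto-remediate
    else:
        page_oncall(alert) # no known fix: escalate to a human
```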
Anjali:
And lastly Ponmani, any exciting war room stories you’d like to share with our readers?
Ponmani: We encountered this problem in 2018, most likely. While the initial impact wasn't significant, identifying and fixing it turned out to be very complex.
"I was working for a retail company. We were trying to go to production with Kubernetes for the first time for a new business initiative. Most of the Kubernetes components were in beta back then. We deployed and opened the feature for a small set of customers. We right away started to notice that a lot of errors and logs were filled with HTTP timeout error messages."
We identified that any network I/O call made outside of the Kubernetes cluster was timing out. The deployment was using KubeDNS for DNS and Weave as the CNI.
On analyzing further, we identified that the DNS system calls were taking too long and timing out. This was due to the default search domains added by Kubernetes and the default ndots value of 5 set in /etc/resolv.conf on all the pods.
The resolver was requesting both A and AAAA records over a single UDP socket even though we weren't using IPv6. Since the AAAA (IPv6) responses never came back, the socket waited for 5 seconds and timed out.
Because Kubernetes added three search domains and the external hostnames were not FQDNs (no trailing dot), every external name was first tried against each search domain before being resolved as an absolute name; for example, a name like api.example.com would be looked up under each cluster search domain before api.example.com itself. With each attempt stalling, it could take 15+ seconds to finally resolve a name. All our clients had a 10-second timeout, so every single call to the external world was timing out.
We tried setting the DNS config option single-request-reopen so that the A and AAAA records would be fetched over separate sockets, which should have resolved the issue. But it didn't work, because only glibc honors this option, and our containers were built on Alpine Linux, which uses musl instead of glibc for DNS resolution.
So we changed the default ndots value to 1, so that any hostname containing at least one dot was resolved as an FQDN first, and that fixed it. We had to redeploy all the containers with an additional pod config option. A night to remember for the learnings.
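For readers who haven't run into this, the fix boils down to a pod-level DNS override; a minimal sketch using the standard Kubernetes dnsConfig fields (not the company's actual manifests) looks like:

```yaml
# Pod spec excerpt: override the resolver's ndots so external names with at
# least one dot are tried as absolute names first.
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "1"
```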
Ahh, wasn't that worth the read? It's always insightful to talk to the folks upholding the reliability of the apps we use every day, without us ever realising what it takes to keep the gears moving.
Stay tuned for more stories and experiences coming your way every month.
If you're fascinated by reliability and the intricate process of recovering from downtime, check out our podcast, Incidentally Reliable, where veterans from Docker, Amazon, Walmart, Flipkart, and other industry-leading organizations share their experiences, challenges, and success stories!
If you're looking to streamline your incident management process, Zenduty can improve your MTTA and MTTR by at least 60%. With our platform, engineers receive timely alerts, reducing fatigue and boosting productivity.
Sign up for a free trial today!
Anjali Udasi
As a technical writer, I love simplifying technical terms and writing about the latest technologies.