
The Curious Case of Our 3000 Apache Servers Crashing on New Year's Day


Chapter 1: The Unforeseen Outage

Picture a serene Saturday morning, the dawn of a new year. You awaken to a notification that your entire infrastructure has gone offline! This was the unfortunate reality for one of my colleagues on January 1, 2022.

The immediate priority was to restore all services to reduce the downtime. After rebooting all Apache servers, they appeared to function normally. However, we were left grappling with a pressing question: why did all servers crash on the very first day of the year? Surely, it couldn't just be a coincidence.

The only clue we had was the error log from our servers:

AH00171: Graceful restart requested, doing restart
libgomp: could not create thread pool destructor.

libgomp? To honor the programmer's instinct, our first step was to search online for this issue. We stumbled upon a similar case on Server Fault, but unfortunately, it did not provide a usable solution. What struck us as odd was the mention in that report that the crashes occurred every 24 to 36 hours!

Section 1.1: Investigating the Logs

Every morning, we perform a routine log rotation, which involves restarting the servers to create a fresh log file. Although Apache seemed to reload properly, it quickly crashed due to the libgomp error.

Disheartened after sifting through countless web pages, we began examining the libgomp source code to better understand the problem. But what exactly is libgomp? According to its documentation, it is an implementation of OpenMP for C, C++, and Fortran compilers, designed to simplify parallel programming on various GNU systems. But how could a parallel-programming runtime bring down our web servers?

A search through the libgomp source code revealed a single occurrence of the relevant message.

Subsection 1.1.1: Understanding Thread Keys

Libgomp source code snippet

The snippet showed that libgomp creates a thread key with pthread_key_create and aborts with this message when that call fails. According to the pthread_key_create manpage, the function creates a thread-specific data key visible to all threads in a process.

When can it fail? According to the manpage, the function fails if:

  • There are insufficient resources to create another key, or the system limit on keys per process (PTHREAD_KEYS_MAX) has been exceeded (EAGAIN).
  • There is insufficient memory to create the key (ENOMEM).

Upon reviewing the function code, we discovered that PTHREAD_KEYS_MAX is defined as 1024 on our systems, so a process can hold at most 1024 keys at any one time. Keys are claimed through a simple Compare-And-Swap operation, and a claimed key stays taken until it is explicitly released with pthread_key_delete.
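The behaviour described above can be modelled in a few lines. This is a simplified sketch, not glibc's implementation: the KeyTable class and its method names are hypothetical, and the Compare-And-Swap slot claim is replaced by a plain loop. Only the 1024-key limit and the EAGAIN failure come from the text and the manpage.

```python
import errno

PTHREAD_KEYS_MAX = 1024  # per-process key limit discussed in the text


class KeyTable:
    """Toy model (hypothetical) of a per-process table of thread keys."""

    def __init__(self, size=PTHREAD_KEYS_MAX):
        self.in_use = [False] * size

    def key_create(self):
        """Claim the first free slot.

        Returns (0, key) on success or (errno.EAGAIN, None) when every
        slot is taken, mirroring how pthread_key_create reports the limit.
        """
        for key, used in enumerate(self.in_use):
            if not used:
                self.in_use[key] = True
                return 0, key
        return errno.EAGAIN, None

    def key_delete(self, key):
        """Release a slot so a later key_create can reuse it."""
        self.in_use[key] = False


table = KeyTable()
results = [table.key_create() for _ in range(PTHREAD_KEYS_MAX)]
print(results[-1])                    # last key that could be created
status, _ = table.key_create()        # one key too many
print(status == errno.EAGAIN)         # the table is exhausted
table.key_delete(0)                   # releasing a key frees its slot
print(table.key_create())             # creation succeeds again
```

The point of the model is the last three lines: once the table is full, creation only succeeds again after some key has been deleted.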

We believed we had found the culprit: we needed to increase PTHREAD_KEYS_MAX. However, this value is a compile-time constant and cannot be tuned. We found a discussion indicating that some Apache setups with many modules could malfunction on platforms like NetBSD due to a low PTHREAD_KEYS_MAX value.

Section 1.2: Digging Deeper into the Problem

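Reloading Apache also reloaded a PHP module called Imagick. So, what is Imagick? It's a PHP extension for creating and modifying images using the ImageMagick library. ImageMagick can use OpenMP for multithreading, which is how libgomp ends up loaded inside the Apache process.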

Imagick PHP extension

Disabling Imagick's threading by setting an environment variable could potentially work around the thread key problem. While this seemed like a sensible solution, we were still perplexed by the timing of the outage on January 1.

Why did this happen after so many years? Was there a breaking update? We couldn't accept this as the final answer. Our investigation continued, diving deeper into the code for both Apache and libgomp, yet nothing seemed amiss.

Chapter 2: The Mystery Unfolds

As we continued our search, we even encountered humorous forum posts discussing odd date-related bugs, such as a premature Year 2038 problem. Still, none of these offered a satisfactory explanation.

We examined the ChangeLogs for Imagick and discovered updates aimed at preventing segmentation faults during shutdown. This hinted that setting Imagick's thread limit to 1 might resolve our issues, but we were still left pondering the mysterious timing.

Then, we conducted more targeted searches, seeking anyone who had experienced similar issues in January. One link led us to reconsider our situation: what if the thread keys had never been released?

We calculated that with 1024 possible keys and one Apache reload every morning, it would take nearly three years to exceed that limit. What if a thread key had been created and never released every single morning for the last ~1024 days?
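The estimate is easy to check. The sketch below assumes exactly one leaked key per daily reload, no other keys in use, and no full restart in between; with other keys already allocated in the process, the limit would be reached somewhat sooner.

```python
from datetime import date, timedelta

PTHREAD_KEYS_MAX = 1024          # key limit from the text
outage = date(2022, 1, 1)        # day of the crash

# Assumption: one key leaked per day, counting back from the outage.
first_leak = outage - timedelta(days=PTHREAD_KEYS_MAX)
years = PTHREAD_KEYS_MAX / 365.25

print(first_leak.isoformat())    # 2019-03-14
print(round(years, 2))           # 2.8
```

Under these assumptions, the Apache processes would have been running without a full restart since roughly mid-March 2019.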

Section 2.1: Reproducing the Issue

With a flicker of hope, we created a test environment and executed a simple bash script to reload Apache 1100 times. To our surprise, we successfully replicated the issue; Apache became inactive after a little over 1024 reloads, returning the same libgomp error.
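The original test used a bash script that is not reproduced here. The following is a rough Python equivalent under stated assumptions: apachectl is on the PATH, the server process is named apache2 (it is httpd on some distributions), and reload_until_failure is a hypothetical helper name.

```python
import subprocess
import time


def reload_apache():
    """Ask Apache for a graceful restart (assumes apachectl is on PATH)."""
    return subprocess.run(["apachectl", "graceful"]).returncode == 0


def apache_is_running():
    """Assumption: a live server has a process named exactly 'apache2'."""
    result = subprocess.run(["pgrep", "-x", "apache2"],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0


def reload_until_failure(max_reloads=1100, reload=reload_apache,
                         is_running=apache_is_running, pause=1.0):
    """Reload repeatedly; return the reload count at which the server
    was found dead, or None if it survived every reload."""
    for attempt in range(1, max_reloads + 1):
        reload()
        time.sleep(pause)  # give the graceful restart time to complete
        if not is_running():
            return attempt
    return None
```

Calling reload_until_failure() on a disposable test machine should return a number a little above 1024 if the leak is present, and None if the server survives all 1100 reloads. The reload and is_running parameters are there so the loop can be exercised with stand-in functions.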

It was now time to explore whether environment variables like MAGICK_THREAD_LIMIT (and OMP_THREAD_LIMIT in newer versions) could resolve this problem. Unfortunately, the issue persisted. The next step was to upgrade Imagick to version 3.5.0 or higher, a solution that appeared to work even after numerous reloads.

Imagick version update

Final Thoughts: Lessons Learned

The lingering question remained: did the updated Imagick delete the thread key? To answer this, we employed a tool called ltrace, which monitors dynamic library calls made by an executed process.

First, we ran ltrace on the older version of Imagick (v3.4.4) to examine its behavior. The results confirmed our suspicion: it was only calling pthread_key_create without ever deleting the key.

After running the same command on the updated version (v3.6.0), we found that it no longer created or destroyed any keys.
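This comparison can be automated by counting calls in saved trace output. A minimal sketch, assuming the trace is plain text with one call per line and the function name directly followed by an opening parenthesis; count_key_calls and leaked_keys are hypothetical helper names, not part of ltrace.

```python
import re
from collections import Counter

# Matches the two calls of interest wherever they appear in a line.
CALL = re.compile(r"\b(pthread_key_create|pthread_key_delete)\(")


def count_key_calls(lines):
    """Count pthread_key_create / pthread_key_delete calls in trace lines."""
    counts = Counter({"pthread_key_create": 0, "pthread_key_delete": 0})
    for line in lines:
        match = CALL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def leaked_keys(counts):
    """Keys created but never deleted, according to the trace."""
    return counts["pthread_key_create"] - counts["pthread_key_delete"]
```

Feeding the function the lines of a trace captured across a reload gives a quick answer: a positive result from leaked_keys matches the behavior we saw with the older version, while zero creations and zero deletions matches the updated one.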

In conclusion, while we managed to resolve the issue, we were left with an unsettling thought: had we really gone so long without restarting any of these services? We chose not to dwell on this question, as Sherlock Holmes famously stated, “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.”

As we reflected on resolving this issue, we felt a mix of pride and concern about how many other long-standing services might also be on the brink of failure, waiting for just the right moment to stop functioning—perhaps on a quiet Saturday morning, on the first day of January.
