Recently, an automation that creates new nginx hosts stopped working properly. The process would go through flawlessly, but when I tried to access the site, it redirected to another host. The same thing also occurred in certificate renewals. Only after a full nginx restart would the site work correctly.
This behavior led me down a debugging rabbit hole. In this post, I’ll share how I diagnosed this issue and the solution I found to fix it.
The TL;DR (For Those in a Hurry)
Your Nginx is hitting its maximum file descriptor limit, preventing it from properly reloading. You need to:
- Increase the `worker_rlimit_nofile` setting in your nginx.conf
- Add a systemd override to increase the service limit
- Restart (not reload) the service
Keep reading for the step-by-step diagnostic process and solution.
The Problem: “I Swear I Reloaded Nginx!”
When mysterious bugs like this appear, my first instinct is to check the logs. But in this case, they didn’t yield any results.
I double-checked file permissions, certificate paths, configuration… Everything looked perfect. Yet somehow, Nginx was ignoring my changes.
The thing is, when I ran `systemctl restart nginx`, everything worked normally. My automation, on the other hand, runs `reload`: BINGO!
The difference between `restart` and `reload`
See, when you run a `reload`, nginx updates your configuration in a very smart way:
- Reads the new configuration and validates it
- Starts new worker processes with the updated configuration
- Signals old worker processes to gracefully shut down
- Continues serving existing connections through old workers until they complete
```mermaid
graph TB
    subgraph "Nginx Reload Process"
        A1[Master Process] --> B1[Validate New Configuration]
        B1 --> C1[Start New Worker Processes]
        C1 --> D1[Signal Old Workers to Finish]
        D1 --> E1[Old Workers Complete Connections]
        E1 --> F1[Old Workers Terminate]
    end

    subgraph "Nginx Restart Process"
        A2[Stop Master Process] --> B2[Close All Connections]
        B2 --> C2[Terminate All Workers]
        C2 --> D2[Start Fresh Master Process]
        D2 --> E2[Start New Worker Processes]
    end
```
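Under the hood, a graceful reload is triggered by sending a SIGHUP signal to the master process, which is effectively what `systemctl reload nginx` does. Any of these commands should be equivalent (the PID file path can vary by distro):

```shell
# Ask the running master process to reload its configuration gracefully
sudo systemctl reload nginx

# Or use nginx's own signal helper
sudo nginx -s reload

# Or send SIGHUP directly (PID file path may vary by distro)
sudo kill -HUP "$(cat /run/nginx.pid)"
```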
This essentially means your users won’t see errors in their application, because nginx gracefully handles the configuration transition for you.
It might be a good time to mention that my automation did exactly that: it reloaded nginx. Keep in mind, this is a production environment; throwing errors in users’ faces is not acceptable. So maybe the problem was there?
The Real Culprit: File Descriptor Limits
Here’s what was actually happening: Nginx was hitting the operating system’s limit on how many files the process can have open simultaneously. When this happens, the reload can’t properly initialize the new worker processes, causing your certificate changes (and possibly other configuration changes) to be ignored.
Remember that process in which a new worker needs to be created to handle connections with the new configuration? Well, how would nginx create new ones if it’s already at its limit?
Each connection, file, and socket counts as a file descriptor, which explains why it “suddenly” stopped working: we had added a bunch of new hosts, and traffic had increased recently.
This is also why a full restart works when a reload doesn’t; the restart completely closes all connections before starting fresh, thus removing the old file descriptors that were causing the limit to be reached.
Diagnosing the Issue
First, let’s verify if you’re indeed hitting the file descriptor limit:
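Start by finding the nginx master process and its PID:

```shell
# The [n] trick keeps grep from matching its own process in the list
ps aux | grep "[n]ginx: master"
```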
You’ll see output like:

```
root     12345  0.0  0.2  56820  8056 ?        Ss   Apr12   0:00 nginx: master process /usr/sbin/nginx
```
Make note of that process ID (12345 in this example). Now let’s check its limits:
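With the PID in hand, the kernel exposes that process’s limits under `/proc`:

```shell
# Substitute your nginx master PID for 12345
grep "Max open files" /proc/12345/limits
```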
You might see:

```
Max open files            1024                 4096                 files
```
The first number (1024) is the “soft limit” and the second (4096) is the “hard limit.”
Now, let’s see how many file descriptors Nginx is actually using:
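Counting the entries in the process’s `fd` directory gives the current usage:

```shell
# Each entry under /proc/<pid>/fd is one open file descriptor
# (root is needed to inspect processes you don't own)
sudo ls /proc/12345/fd | wc -l
```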
If this number is close to or exceeding your soft limit, we’ve found our problem!
The Solution: Raising the Limits
Fixing this requires changes in two places: Nginx’s own configuration and the systemd service that manages it.
Step 1: Update Nginx Configuration
Open your nginx.conf file (usually at `/etc/nginx/nginx.conf`) in your editor of choice.
Find the section with `worker_processes` and add the `worker_rlimit_nofile` directive:
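The directive belongs in the main (top-level) context, next to `worker_processes`. The value 65535 below is just a common choice; size it to your workload:

```nginx
worker_processes auto;

# Raise the per-worker open-file limit; must live in the main context
worker_rlimit_nofile 65535;
```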
Step 2: Create a systemd Override
Even with the Nginx configuration updated, we need to tell systemd to allow Nginx to use more file descriptors:
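Create the drop-in directory and override file by hand (running `sudo systemctl edit nginx` accomplishes the same thing):

```shell
sudo mkdir -p /etc/systemd/system/nginx.service.d
sudo nano /etc/systemd/system/nginx.service.d/override.conf
```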
Add the following to this new file:
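The override only needs the `LimitNOFILE` setting; keep it in sync with the `worker_rlimit_nofile` value from Step 1:

```ini
[Service]
LimitNOFILE=65535
```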
Step 3: Apply the Changes
Now let’s make these changes take effect:
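Validate the configuration, reload systemd’s view of the unit, and then do a full restart. A reload won’t pick up the new limit, for exactly the reasons discussed above:

```shell
# Check the nginx configuration for syntax errors first
sudo nginx -t

# Make systemd pick up the override file
sudo systemctl daemon-reload

# A full restart, not a reload, so the new limits apply
sudo systemctl restart nginx
```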
Verifying the Fix
Let’s check if our changes worked:
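Inspect the running master process’s limits again, this time reading the PID from the PID file (assuming your distro keeps it at `/run/nginx.pid`):

```shell
# Read the master PID from the PID file and check its open-files limit
grep "Max open files" "/proc/$(cat /run/nginx.pid)/limits"
```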
You should now see the increased limit. Try updating your SSL certificates and reloading Nginx – it should work properly now!
Have you encountered other mysterious Nginx behaviors? Let me know in the comments below!