Recently, an automation that creates new nginx hosts stopped working properly. The process would run through flawlessly, but when I tried to access the new site, it redirected to another host. The same thing also happened with certificate renewals. Only after a full nginx restart would the site work correctly.

This behavior led me down a debugging rabbit hole. In this post, I’ll share how I diagnosed this issue and the solution I found to fix it.

The TL;DR (For Those in a Hurry)

Your Nginx is hitting its file descriptor limit, which prevents it from reloading properly. You need to:

  1. Increase the worker_rlimit_nofile setting in your nginx.conf
  2. Add a systemd override to increase the service limit
  3. Restart (not reload) the service

Keep reading for the step-by-step diagnostic process and solution.
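
If you just want the commands, here's a condensed sketch of those three steps. Paths assume a typical Debian/Ubuntu layout, and 65535 is an example value rather than a magic number:

# 1. In /etc/nginx/nginx.conf (main context), add:
#      worker_rlimit_nofile 65535;

# 2. Raise the limit systemd imposes on the service
sudo mkdir -p /etc/systemd/system/nginx.service.d/
printf '[Service]\nLimitNOFILE=65535\n' | sudo tee /etc/systemd/system/nginx.service.d/limits.conf

# 3. Pick up the override and restart (not reload)
sudo systemctl daemon-reload
sudo systemctl restart nginx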

The Problem: “I Swear I Reloaded Nginx!”

When mysterious bugs like this appear, my first instinct is to check the logs. But in this case, they didn't yield any results.

I double-checked file permissions, certificate paths, configuration… Everything looked perfect. Yet somehow, Nginx was ignoring my changes.

The thing is, when I ran systemctl restart nginx manually, everything worked normally, while my automation only runs a reload: BINGO!

The difference between restart and reload

See, when you reload, Nginx updates your configuration in a very smart way:

  1. Reads the new configuration and validates it
  2. Starts new worker processes with the updated configuration
  3. Signals old worker processes to gracefully shut down
  4. Continues serving existing connections through old workers until they complete

The two flows look like this:

graph TB
    subgraph "Nginx Reload Process"
        A1[Master Process] --> B1[Validate New Configuration]
        B1 --> C1[Start New Worker Processes]
        C1 --> D1[Signal Old Workers to Finish]
        D1 --> E1[Old Workers Complete Connections]
        E1 --> F1[Old Workers Terminate]
    end

    subgraph "Nginx Restart Process"
        A2[Stop Master Process] --> B2[Close All Connections]
        B2 --> C2[Terminate All Workers]
        C2 --> D2[Start Fresh Master Process]
        D2 --> E2[Start New Worker Processes]
    end
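
From the command line, the two operations look like this; on most systemd setups, the reload ends up sending the Nginx master a SIGHUP, which is what kicks off the graceful swap above:

# Graceful reload: the master re-reads the config and swaps workers
sudo systemctl reload nginx

# Equivalent, signalling Nginx directly
sudo nginx -s reload

# Full restart: stop everything, then start fresh processes
sudo systemctl restart nginx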

This essentially means your users won't see errors, because Nginx gracefully handles the configuration transition for you.

It might be a good time to mention that my automation does exactly that: it reloads Nginx. Keep in mind, this is a production environment, and throwing errors in users' faces is not acceptable. So maybe the problem was there?

The Real Culprit: File Descriptor Limits

Here's what was actually happening: Nginx was hitting the operating system's limit on how many files a single process can have open simultaneously. When this happens, the reload process can't properly initialize the new worker processes, causing your certificate changes (and possibly other configuration changes) to be ignored.

Remember the reload process, where new workers need to be created to handle connections with the new configuration? Well, how can Nginx create them if the process is already at its limit?

Each connection, file, and socket counts as a file descriptor, which explains why it "suddenly" stopped working: we had added a bunch of new hosts, and traffic had increased recently.

This is also why a full restart works when a reload doesn’t; the restart completely closes all connections before starting fresh, thus removing the old file descriptors that were causing the limit to be reached.
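
If you want to watch the descriptors pile up yourself, a rough check (assuming root access and a standard /proc layout) is to count each Nginx process's entries under /proc/<pid>/fd:

# Count open file descriptors for every running nginx process
for pid in $(pgrep nginx); do
    echo "$pid: $(sudo ls /proc/$pid/fd 2>/dev/null | wc -l) open file descriptors"
done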

Diagnosing the Issue

First, let’s verify if you’re indeed hitting the file descriptor limit:

# Get the process ID of the Nginx master process
ps aux | grep "nginx: master"

You’ll see output like:

root  12345  0.0  0.2  56820  8056 ?  Ss  Apr12  0:00 nginx: master process /usr/sbin/nginx

Make note of that process ID (12345 in this example). Now let’s check its limits:

# Check the file descriptor limits for this process
cat /proc/12345/limits | grep 'Max open files'

You might see:

Max open files  1024  4096  files

The first number (1024) is the “soft limit” and the second (4096) is the “hard limit.”
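
The soft limit is what the process is actually held to; the hard limit is the ceiling it could raise itself to. You can see the same pair for your own shell, and, if util-linux's prlimit is available, query another process directly:

ulimit -Sn                     # soft limit of the current shell
ulimit -Hn                     # hard limit of the current shell

# Same information for the Nginx master process (12345 is the example PID)
prlimit --nofile --pid 12345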

Now, let’s see how many file descriptors Nginx is actually using:

sudo lsof -p 12345 | wc -l

If this number is close to or exceeding your soft limit, we’ve found our problem!
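
If you manage several hosts, the same check can be wrapped in a small convenience script; this is just a sketch that repeats the manual steps above:

#!/usr/bin/env bash
# Compare Nginx's open file descriptors against its soft limit
pid=$(pgrep -f "nginx: master" | head -n 1)
limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits")
used=$(sudo lsof -p "$pid" 2>/dev/null | wc -l)
echo "master pid=$pid, descriptors in use=$used, soft limit=$limit"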

The Solution: Raising the Limits

Fixing this requires changes in two places: Nginx’s own configuration and the systemd service that manages it.

Step 1: Update Nginx Configuration

Open your nginx.conf file (usually at /etc/nginx/nginx.conf):

sudo nano /etc/nginx/nginx.conf

Find the section with worker_processes and add the worker_rlimit_nofile directive:

worker_processes auto;

# Add this line to increase Nginx's internal limit
worker_rlimit_nofile 65535;

pid /run/nginx.pid;
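
Note that worker_rlimit_nofile belongs in the main context, outside the events and http blocks. It's worth validating the file before moving on:

sudo nginx -t

An often-cited rule of thumb is to keep worker_rlimit_nofile at least as high as worker_connections (roughly twice as high if you proxy, since each proxied request holds two descriptors open), but the right value depends on your traffic.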

Step 2: Create a systemd Override

Even with the Nginx configuration updated, we need to tell systemd to allow Nginx to use more file descriptors:

# Create the override directory
sudo mkdir -p /etc/systemd/system/nginx.service.d/

# Create the override file
sudo nano /etc/systemd/system/nginx.service.d/limits.conf

Add the following to this new file:

[Service]
LimitNOFILE=65535
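
To double-check what systemd will hand to the service, you can query the unit directly; the new value only applies to processes started after the restart in the next step:

systemctl show nginx --property=LimitNOFILE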

Step 3: Apply the Changes

Now let’s make these changes take effect:

# Reload systemd to recognize the override
sudo systemctl daemon-reload

# Restart (not reload) Nginx
sudo systemctl restart nginx

Verifying the Fix

Let’s check if our changes worked:

# Get the new process ID
ps aux | grep "nginx: master"

# Check the new limits
cat /proc/NEW_PID/limits | grep 'Max open files'

You should now see the increased limit. Try updating your SSL certificates and reloading Nginx – it should work properly now!
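
For the certificate case specifically, a quick spot-check with openssl confirms which certificate is actually being served (replace example.com with your own host; this is just an illustrative check):

sudo nginx -t && sudo systemctl reload nginx

# Show the validity dates of the certificate the server is presenting
echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null \
    | openssl x509 -noout -dates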

Have you encountered other mysterious Nginx behaviors? Let me know in the comments below!
