We cannot always share details about our work with customers, but it is still nice to show our technical achievements and share some of the solutions we implemented.
Our monitoring alerted me to an HTTP 500 error from a central reverse proxy running Nginx. Checking the error logs revealed the following issue:
2019/05/09 08:43:35 [crit] 25655#0: *524505514 open() "/usr/share/nginx/html/50x.html" failed (24: Too many open files)
[...]
2019/05/09 09:04:27 [alert] 28720#0: *59757 socket() failed (24: Too many open files) while connecting to upstream,
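To gauge how frequent these events were, the error log can simply be grepped for the errno text (assuming the standard Debian log path):
# grep -c "Too many open files" /var/log/nginx/error.log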
Such errors mean that the Nginx process had too many files open, which could also be verified on the Nginx status page; the graph from check_nginx_status.pl showed this nicely.
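The number of file descriptors a single worker currently holds can also be counted directly from /proc. A quick sketch, picking an arbitrary worker PID with pgrep:
# ls /proc/$(pgrep -u www-data nginx | head -1)/fd | wc -l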
The default is a limit of 4096 files per (worker) process, as can be seen in /etc/default/nginx:
# cat /etc/default/nginx
# Note: You may want to look at the following page before setting the ULIMIT.
# http://wiki.nginx.org/CoreModule#worker_rlimit_nofile
# Set the ulimit variable if you need defaults to change.
# Example: ULIMIT="-n 4096"
#ULIMIT="-n 4096"
However, don't be fooled: changing this file doesn't help. Instead, the limit needs to be set in /etc/security/limits.conf:
# tail /etc/security/limits.conf
#@faculty hard nproc 50
#ftp hard nproc 0
#ftp - chroot /ftp
#@student - maxlogins 4
# Added Nginx limits
nginx soft nofile 30000
nginx hard nofile 50000
# End of file
Here, a soft limit of 30,000 and a hard limit of 50,000 open files are defined per nginx process.
Note: I first tried this with www-data (the user Nginx runs as), but that didn't work, even though a user name should be usable as a "domain" in this config file.
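Whether a limits.conf entry is picked up at all can be tested by opening a login session as that user, since pam_limits applies these values at login time. A quick check (forcing a shell because www-data normally has nologin; note that on some distributions pam_limits is not enabled for su by default):
# su -l www-data -s /bin/bash -c 'ulimit -Sn; ulimit -Hn'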
Additionally, Nginx itself should be told how many files it may open. In the main config file /etc/nginx/nginx.conf, add:
# head /etc/nginx/nginx.conf
user www-data;
worker_processes 4;
pid /run/nginx.pid;
# 2019-05-09 Increase open files
worker_rlimit_nofile 30000;
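Before restarting, the changed configuration can be validated first:
# nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful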
After a service nginx restart the limits of the worker processes can be checked:
# ps auxf | grep nginx
root 7027 0.0 0.3 103620 13348 ? Ss 09:21 0:00 nginx: master process /usr/sbin/nginx
www-data 7028 8.6 1.0 127900 40724 ? R 09:21 2:37 \_ nginx: worker process
www-data 7029 8.9 1.0 127488 40536 ? S 09:21 2:44 \_ nginx: worker process
www-data 7031 9.5 1.0 127792 40896 ? S 09:21 2:53 \_ nginx: worker process
www-data 7032 8.1 1.0 128472 41244 ? S 09:21 2:29 \_ nginx: worker process
# cat /proc/7028/limits | grep "open files"
Max open files 30000 30000 files
The "too many open files" errors disappeared from the Nginx logs after this change.
But what caused this sudden problem? As the graph above shows, the number of Writing (and Waiting) connections suddenly increased sharply. It turned out that an upstream server behind this reverse proxy had stopped working, and this particular virtual host received a lot of traffic, causing general slowness and holding files open while waiting for a timeout from Nginx (a 504 in this case).
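One way to keep a dead upstream from holding descriptors open for so long is to tighten the proxy timeouts on the affected virtual host. A hedged sketch (the upstream name and the values are illustrative, not the ones from this incident):
location / {
    proxy_pass http://backend;     # illustrative upstream name
    proxy_connect_timeout 5s;      # fail fast on a dead upstream (default 60s)
    proxy_read_timeout 30s;        # shorter waits free descriptors sooner (default 60s)
    proxy_send_timeout 30s;        # default 60s
}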
Update: February 1st 2021
The above fix was written two years ago and worked fine on a system without Systemd as the init system. However, when Nginx is started and controlled by Systemd, the limits defined in /etc/security/limits.conf seem to be ignored; Systemd applies its own default limits instead. See Fredrik Averpil's blog post for additional info.
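The limit Systemd will actually enforce on a unit can be queried directly; on the system described below this prints the 4096 hard limit also visible in /proc:
root@nginx:~# systemctl show nginx -p LimitNOFILE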
This can be nicely verified. Different nofile limits, including unlimited ones, were defined for multiple domains in /etc/security/limits.conf to see which, if any, would be applied to Nginx's processes:
root@nginx:~# cat /etc/security/limits.conf
[...]
# Added Nginx nofile limit
nginx soft nofile 50000
nginx hard nofile 80000
root soft nofile unlimited
root hard nofile unlimited
www-data soft nofile unlimited
www-data hard nofile unlimited
But even after setting worker_rlimit_nofile in nginx.conf and restarting Nginx, the default limits still apply:
root@nginx:~# ps auxf
[...]
root 21114 0.0 0.4 64508 18264 ? Ss 09:36 0:00 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
www-data 21115 40.0 0.6 70340 26880 ? R 09:36 0:01 \_ nginx: worker process
www-data 21116 2.3 0.6 68888 25252 ? S 09:36 0:00 \_ nginx: worker process
www-data 21117 7.0 0.6 68888 25376 ? S 09:36 0:00 \_ nginx: worker process
www-data 21118 16.0 0.6 68888 25196 ? S 09:36 0:00 \_ nginx: worker process
www-data 21119 0.0 0.5 68888 21312 ? S 09:36 0:00 \_ nginx: cache manager process
www-data 21120 0.0 0.5 68888 20912 ? S 09:36 0:00 \_ nginx: cache loader process
[...]
root@nginx:~# cat /proc/21114/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 15598 15598 processes
Max open files 1024 4096 files
Max locked memory 16777216 16777216 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 15598 15598 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
Even the nginx master process, running as root, still has an open file limit of 1024 (soft) and 4096 (hard).
This obviously causes errors as soon as Nginx needs to open new sockets or file descriptors, and the error log can then contain events like this:
2021/02/01 09:28:13 [emerg] 28935#28935: open() "/var/log/nginx/example.com.access.log" failed (24: Too many open files)
To solve this, the limits must be changed in the Systemd service unit configuration for Nginx. The quickest way is to copy the original Nginx service unit to /etc/systemd/system/ and add the LimitNOFILE option (followed by a systemctl daemon-reload so Systemd picks up the new unit):
root@nginx:~# cp /lib/systemd/system/nginx.service /etc/systemd/system/
root@nginx:~# cat /etc/systemd/system/nginx.service
# Stop dance for nginx
# =======================
#
# ExecStop sends SIGSTOP (graceful stop) to the nginx process.
# If, after 5s (--retry QUIT/5) nginx is still running, systemd takes control
# and sends SIGTERM (fast shutdown) to the main process.
# After another 5s (TimeoutStopSec=5), and if nginx is alive, systemd sends
# SIGKILL to all the remaining processes in the process group (KillMode=mixed).
#
# nginx signals reference doc:
# http://nginx.org/en/docs/control.html
#
[Unit]
Description=A high performance web server and a reverse proxy server
Documentation=man:nginx(8)
After=network.target
[Service]
Type=forking
PIDFile=/run/nginx.pid
ExecStartPre=/usr/sbin/nginx -t -q -g 'daemon on; master_process on;'
ExecStart=/usr/sbin/nginx -g 'daemon on; master_process on;'
ExecReload=/usr/sbin/nginx -g 'daemon on; master_process on;' -s reload
ExecStop=-/sbin/start-stop-daemon --quiet --stop --retry QUIT/5 --pidfile /run/nginx.pid
TimeoutStopSec=5
KillMode=mixed
LimitNOFILE=500000
[Install]
WantedBy=multi-user.target
Note: A cleaner solution is to create a drop-in directory (/etc/systemd/system/nginx.service.d) and put the LimitNOFILE option into a small config file with a [Service] section, as sketched below.
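A minimal sketch of that drop-in approach (the file name limits.conf inside the directory is arbitrary; systemctl edit nginx achieves the same interactively):
root@nginx:~# mkdir -p /etc/systemd/system/nginx.service.d
root@nginx:~# cat > /etc/systemd/system/nginx.service.d/limits.conf << EOF
[Service]
LimitNOFILE=500000
EOF
root@nginx:~# systemctl daemon-reload
root@nginx:~# systemctl restart nginx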
After another Nginx restart, the new limits can be verified:
root@nginx:~# ps auxf|grep nginx
root 26732 0.0 0.0 14428 1012 pts/0 S+ 10:03 0:00 \_ grep --color=auto nginx
root 21636 0.0 0.4 64508 18260 ? Ss 09:37 0:00 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
www-data 21637 0.1 0.6 68888 25368 ? S 09:37 0:01 \_ nginx: worker process
www-data 21638 10.6 0.7 75440 32228 ? R 09:37 2:42 \_ nginx: worker process
www-data 21639 1.6 0.6 69608 26276 ? R 09:37 0:24 \_ nginx: worker process
www-data 21640 29.0 1.1 87836 44740 ? S 09:37 7:22 \_ nginx: worker process
www-data 21641 0.0 0.5 68888 21368 ? S 09:37 0:00 \_ nginx: cache manager process
root@nginx:~# cat /proc/21636/limits | grep "open files"
Max open files 500000 500000 files
root@nginx:~# cat /proc/21637/limits |grep "open files"
Max open files 500000 500000 files
Or even quicker, without having to manually find the PID of the Nginx master process:
root@nginx:~# cat /proc/$(pgrep -u root nginx)/limits|grep "open"
Max open files 500000 500000 files
The limit configured in Systemd's service unit for Nginx is applied to both the master and the worker processes.
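To check all Nginx processes at once, the PIDs can be looped over:
root@nginx:~# for pid in $(pgrep nginx); do grep "open files" /proc/$pid/limits; done
Max open files 500000 500000 files
Max open files 500000 500000 files
[...]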