Fully-Faltoo blog by Pratyush

2nd Jan. 2024

"Too many files" error in NGINX

Screener was quite slow yesterday. More importantly, once in a while we were getting a "504 Gateway Timeout" error.

One thing I remember the most from Max Kanat-Alexander's "Understanding Software" is to never fix a problem until you can reproduce it.

Sometimes people have a very hard time debugging. Mostly these are people who believe that in order to debug a system, you have to think about it instead of looking at it.

[...]

It can be tempting to think that you already know the answer. Sometimes you can guess and you're right. It doesn't happen very often, but it happens often enough to trick people into thinking that guessing the answer is a good method of debugging.

However, most of the time, you will spend hours, days, or weeks guessing the answer and trying different fixes with no result other than complicating the code. In fact, some codebases are full of "solutions" to "bugs" that are actually just guesses – and these "solutions" are a significant source of complexity in the codebase.

Memory usage was at 60%. uptime showed the CPU load at around 90%, never crossing 100%. We needed to look further before simply adding resources.

We soon discovered "too many files" errors in the NGINX logs.

NGINX seems to create temporary files when there are more incoming requests than the upstream server can process. The files are kept open until the request times out.

We had over 9000 active visitors on the website at that moment, our highest ever. While the load on the upstream server was making the website slow, the file limits in NGINX were causing the 504 errors.

This blog post helped us debug and fix this issue. We increased the overall open-files limit of NGINX using LimitNOFILE in systemd. We increased the per-worker file limit using worker_rlimit_nofile. These two changes fixed the NGINX issues.
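
One way to confirm that the new limits actually reached the running processes is to read the "Max open files" row from /proc/<pid>/limits on Linux. The following is a minimal sketch, not the method we used: the function names and the sample text are made up for illustration, and the PID of an NGINX worker has to be looked up separately.

```python
import re
from typing import Optional, Tuple


def parse_max_open_files(limits_text: str) -> Optional[Tuple[str, str]]:
    """Return the (soft, hard) 'Max open files' values from the text of
    /proc/<pid>/limits, or None if the row is missing.

    Values are kept as strings because a limit can also be 'unlimited'.
    """
    for line in limits_text.splitlines():
        match = re.match(r"Max open files\s+(\S+)\s+(\S+)", line)
        if match:
            return match.group(1), match.group(2)
    return None


def read_max_open_files(pid: int) -> Optional[Tuple[str, str]]:
    """Read the limits of a live process (Linux only, hypothetical helper)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_max_open_files(fh.read())


# Illustrative sample in the layout of /proc/<pid>/limits (not real output
# from our servers).
SAMPLE_LIMITS = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max open files            1024                 4096                 files
"""
```

Calling read_max_open_files with the PID of an NGINX worker before and after the change should show whether the soft and hard limits moved.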

Increasing the number of workers in the uWSGI server fixed the load and improved the speed.

Updated Maintenance Checklist

We do a server check every week. There is a small checklist with all the maintenance commands. Someone runs them to check for:
  • Adequate disk space (utilisation should be less than 70%)
  • Adequate RAM (usage should be less than 70%)
  • Adequate system load (should be less than 70%)
  • Errors and warnings in Postfix logs and DMARC reports
  • Exhaustion of primary keys in MySQL
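
The three percentage checks in this list could be scripted roughly as below. This is only a sketch under some assumptions: our checklist runs shell commands by hand, the function names here are hypothetical, "system load" is interpreted as the 1-minute load average divided by the CPU count, and RAM usage is derived from MemAvailable in /proc/meminfo (Linux only).

```python
import os
import shutil

THRESHOLD = 0.70  # every reading should stay below 70%


def disk_used_fraction(path="/"):
    """Fraction of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def ram_used_fraction(meminfo_text):
    """Fraction of RAM in use, computed from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return 1 - fields["MemAvailable"] / fields["MemTotal"]


def load_fraction(load_1min, cpu_count):
    """Load average relative to the number of CPUs (an assumed definition)."""
    return load_1min / cpu_count


def over_threshold(readings, threshold=THRESHOLD):
    """Names of the readings that are at or above the threshold."""
    return [name for name, value in readings.items() if value >= threshold]


def collect_readings():
    """Gather live readings from the current machine (Linux only)."""
    with open("/proc/meminfo") as fh:
        ram = ram_used_fraction(fh.read())
    return {
        "disk": disk_used_fraction("/"),
        "ram": ram,
        "load": load_fraction(os.getloadavg()[0], os.cpu_count() or 1),
    }
```

Running over_threshold(collect_readings()) on a server returns the names of the checks that need attention; an empty list means all three are under 70%.
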
We have added a new task to the above list.
  • Check for [alert] and [warn] in NGINX logs.
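
The new task amounts to counting log lines by severity. Here is a minimal sketch, assuming the default NGINX error log layout (a timestamp followed by the level in square brackets) and the common /var/log/nginx/error.log path; the function names and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# A default NGINX error log line starts like:
#   2024/01/01 10:00:00 [alert] 1234#1234: ...
LINE_RE = re.compile(r"^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} \[(\w+)\] ")
WATCHED_LEVELS = ("alert", "warn")


def count_levels(lines, watched=WATCHED_LEVELS):
    """Count the log lines whose severity is one of the watched levels."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(1) in watched:
            counts[match.group(1)] += 1
    return counts


def count_levels_in_file(path="/var/log/nginx/error.log"):
    """Same count for a log file on disk (the path is an assumption)."""
    with open(path, errors="replace") as fh:
        return count_levels(fh)


# Made-up sample lines, only to show the shape of the input.
SAMPLE_LOG = [
    "2024/01/01 10:00:00 [alert] 1234#1234: *1 socket() failed (24: Too many open files) while connecting to upstream",
    "2024/01/01 10:00:01 [error] 1234#1234: *2 upstream timed out",
    "2024/01/01 10:00:02 [warn] 1234#1234: *3 an upstream response is buffered to a temporary file",
]
```

For the sample above, count_levels(SAMPLE_LOG) counts one alert and one warn and ignores the error line.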
We have learnt most of our coding, web development and server skills thanks to this slow and steady traffic growth. Steady organic growth helps us build better systems and face one problem at a time.

4 Comments

Jigar Patel
23 Jan 2024

Pratyush, I use logwatch to monitor items like those in your checklist. I have set it up so that I receive an email every day with those stats. It also works well with other tools like fail2ban.

Pratyush (admin)
23 Jan 2024

Thanks Jigar for sharing about Logwatch. I will give it a try. How has your experience been with fail2ban? Do you use it with NGINX? Can you please share the config for NGINX?

Jigar Patel
24 Jan 2024

I use fail2ban to ban ssh auth attempts. The default config is enough for that. I don't use fail2ban with a web server as I don't get too many 40X requests. To use it with nginx or any other service, you just need to point fail2ban to that service's log file so that it can monitor it. I start my server setups with this basic config and move on from there. https://gist.github.com/jagira/84724cf584d070fe176c4a30ec05154e

Satya mohanty
28 Jan 2024

I want to create a filter for stocks where the profit for the 3rd quarter of last FY (22-23) was greater than the profit for the 2nd quarter of last FY (22-23). Please guide.
