Screener was quite slow yesterday. More importantly, once in a while we were getting a "504 Gateway Timeout" error.
Sometimes people have a very hard time debugging. Mostly these are people who believe that in order to debug a system, you have to think about it instead of looking at it.
[...]
It can be tempting to think that you already know the answer. Sometimes you can guess and you're right. It doesn't happen very often, but it happens often enough to trick people into thinking that guessing the answer is a good method of debugging.
However, most of the time, you will spend hours, days, or weeks guessing the answer and trying different fixes with no result other than complicating the code. In fact, some codebases are full of "solutions" to "bugs" that are actually just guesses, and these "solutions" are a significant source of complexity in the codebase.
The memory usage was 60%. uptime showed the CPU load around 90% but not crossing 100%. We needed to look further before just increasing the resources.
We soon discovered "too many files" errors in the NGINX logs.
NGINX seems to create temporary files when there are more incoming requests than the upstream server can process. These files are kept open until the request times out.
We had over 9000 active visitors on the website at that moment. Our highest ever. While the load on the upstream server was making the website slow, the file limits in NGINX were causing the 504 errors.
This blog post helped us debug and fix this issue. We increased the overall file limit of NGINX using LimitNOFILE in systemd. We increased the per-worker file limit using worker_rlimit_nofile. These two changes fixed the NGINX issues.
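After changing limits like these, it can be useful to confirm that the running processes actually picked them up. On Linux, the limits of a process are listed in /proc/<pid>/limits. The following is a minimal sketch of reading the "Max open files" row, assuming a Linux host; the function names are hypothetical and only for illustration.

```python
from pathlib import Path

ROW_NAME = "Max open files"


def parse_max_open_files(limits_text):
    """Return (soft, hard) open-file limits from /proc/<pid>/limits text.

    A limit reported as 'unlimited' is returned as None.
    """
    for line in limits_text.splitlines():
        if line.startswith(ROW_NAME):
            # Remaining columns are: soft limit, hard limit, units.
            fields = line[len(ROW_NAME):].split()
            soft, hard = fields[0], fields[1]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


def open_file_limits(pid):
    """Read the open-file limits of a running process (Linux only)."""
    return parse_max_open_files(Path(f"/proc/{pid}/limits").read_text())
```

Calling open_file_limits with the PID of an NGINX worker process should show whether the raised limit is in effect for that worker.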
Increasing the number of workers in the uWSGI server fixed the load and improved the speed.
Updated Maintenance Checklist
We do a server check every week. There is a small checklist with all the maintenance commands. Someone runs them to ensure:
- Adequate disk space (utilisation should be below 70%)
- Adequate RAM (usage should be below 70%)
- Adequate system load (should be below 70%)
- No errors or warnings in Postfix logs and DMARC reports
- No risk of primary key exhaustion in MySQL
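The first three checks are simple enough to script. The sketch below is one possible way to do it, not the commands from our checklist. It makes two assumptions: "system load" is read as the 1-minute load average divided by the number of CPUs, and RAM usage is derived from MemAvailable in /proc/meminfo (Linux only). All function names are hypothetical.

```python
import os
import shutil

THRESHOLD = 70.0  # percent, the limit used in the checklist


def percent(used, total):
    """Return used/total as a percentage."""
    return 100.0 * used / total


def disk_percent(path="/"):
    """Disk utilisation of the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return percent(usage.used, usage.total)


def memory_percent(meminfo_text):
    """RAM in use, computed from the text of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            values[key] = int(rest.split()[0])  # sizes are reported in kB
    total = values["MemTotal"]
    return percent(total - values["MemAvailable"], total)


def load_percent():
    """1-minute load average as a percentage of the CPU count (assumption)."""
    return percent(os.getloadavg()[0], os.cpu_count() or 1)


def over_threshold(readings, threshold=THRESHOLD):
    """Return the names of readings at or above the threshold."""
    return [name for name, value in readings.items() if value >= threshold]


def collect_readings():
    """Gather all three readings on a Linux host."""
    with open("/proc/meminfo") as f:
        meminfo = f.read()
    return {
        "disk": disk_percent("/"),
        "ram": memory_percent(meminfo),
        "load": load_percent(),
    }
```

Running over_threshold(collect_readings()) on the server returns the names of the checks that need attention, or an empty list when everything is below 70%.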
We have added a new task to the above list.
- Check for [alert] and [warn] in NGINX logs.
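This check can also be scripted. The sketch below counts [alert] and [warn] entries in NGINX error log lines. It assumes the usual error log layout of date, time, then the level in square brackets; the function name is hypothetical, and the log path in the usage note is a common default rather than necessarily ours.

```python
import re
from collections import Counter

# An NGINX error log line starts with a date, a time, then the level, e.g.
# "2024/01/01 12:00:00 [alert] 1234#0: ... (24: Too many open files)"
LEVEL_RE = re.compile(r"^\S+ \S+ \[(\w+)\]")
WATCHED = ("alert", "warn")


def count_levels(lines, watched=WATCHED):
    """Count log lines whose level is one of `watched`."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.match(line)
        if match and match.group(1) in watched:
            counts[match.group(1)] += 1
    return counts
```

For example, with the log at /var/log/nginx/error.log, passing the open file object to count_levels gives the number of [alert] and [warn] lines since the last log rotation.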
We have learnt most of our coding, web development and server skills thanks to this slow and steady traffic growth. Steady organic growth helps us build better systems and face one problem at a time.