Handling DDoS attacks
Handling millions of requests from thousands of IPs is hard! The flood clogs one drain after another.
This is what happened when we had a DDoS attack this week.
- First, our hard disk got filled due to access logs
- Then our hard disk got filled due to "too many open files" error logs
- Various services started hitting file descriptor limits
- The system started dropping packets and killing connections as the connection table got full
Is it a DDoS?
The best way to check if it is a DDoS is to run the netstat command:
sudo netstat -nta | grep 'ESTABLISHED' | awk '{sub(/^[[]/,"",$5); sub(/[]]?:[0-9]+$/,"",$5); print $5}' | sort | uniq -c | sort -nr | head -n 20
It shows the top 20 IPs with the most active connections.
During a DDoS, the IPs will be flooding the system with hundreds or thousands of connections.
Handling Disk Space Issues
The first step is to stop NGINX for a while. If growing logs are what is filling the disk, then increasing the disk size won't help for long. Stop the tap:
sudo systemctl stop nginx
We can then immediately free up some disk space by removing some log files or old backups.
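For instance, something along these lines can reclaim space quickly (the paths are illustrative; double-check before deleting anything):
# remove old rotated logs
sudo rm /var/log/nginx/*.gz
# empty the live access log in place (nginx keeps its file handle)
sudo truncate -s 0 /var/log/nginx/access.log
# shrink the systemd journal
sudo journalctl --vacuum-size=500M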
Checking What Is Taking Up the Space
We can check which folder is taking up space using du.
# which folder is taking space in root
sudo du -h --max-depth 1 / | sort -h
# then incrementally inspect inside that folder
sudo du -h --max-depth 1 /var/ | sort -h
sudo du -h --max-depth 1 /var/log/ | sort -h
Temporary Solutions:
- Disable access logs
- Raise the error log level so that only more severe errors get written (see the sketch after this list)
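A minimal sketch of both changes in an nginx config, assuming the default log location (adjust paths and level to taste):
# inside the http {} block
access_log off;                            # temporary: stop writing access logs
error_log /var/log/nginx/error.log crit;   # temporary: log only critical errors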
Permanent Solutions:
- Stop the flood. Ban bad IPs (check below).
- Increase the disk size to accommodate a few days of logs
- Set up an alerting system for 80% disk utilization
- Create a temporary 5 GB filler file. The next time you run out of space, delete it to get some immediate breathing room (see the sketch below).
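One way to create such a filler file, assuming the root filesystem is the one that fills up (the path is arbitrary):
# reserve 5 GB of space as a ballast file
sudo fallocate -l 5G /ballast.img
# when the disk is full again, delete it to buy time
sudo rm /ballast.img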
Handling "too many open files" errors
Services like nginx or wsgi servers have restrictions on the number of files they can open at a time. These limits are sometimes very low. It was just 1024 in some of our cases, while we had set a much higher worker_connections at the nginx level. These limits are often different for a service's main process and its child processes. We can check them using the following commands:
# checking the system-wide limit
cat /proc/sys/fs/file-max
# list processes for a service
ps -ef | grep nginx
# the second column in the above is the process id
# the first process is often the main process
# and the others are child processes
# both can have different limits
# checking process limits
cat /proc/<pid>/limits | grep "Max open files"
We can increase the open-file limit for these services by setting LimitNOFILE in their systemd configs.
# for NGINX
# add this in /etc/systemd/system/nginx.service.d/limits.conf
[Service]
LimitNOFILE=200000
# for other services, for example uwsgi
# add this in their service.conf
# eg /etc/systemd/system/my-wsgi.service
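After adding the drop-in, systemd needs to reload its configuration and the service needs a restart for the new limit to take effect; roughly:
sudo systemctl daemon-reload
sudo systemctl restart nginx
# verify the limit on the running main process
# (the pid file path may differ on your setup)
cat /proc/$(cat /run/nginx.pid)/limits | grep "Max open files"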
Handling "nf_conntrack: table full" errors
Like the open-file limits, the system has a limit on the maximum number of connections it can keep in its connection-tracking table. This includes internal as well as external connections. We can use the following commands to inspect and increase this limit:
# check max connections value
sudo sysctl net.netfilter.nf_conntrack_max
# check current number of connections
sudo sysctl net.netfilter.nf_conntrack_count
# we need to increase the limit if the current count
# is near the max limit
sudo sysctl -w net.netfilter.nf_conntrack_max=1048576
# the above sets the limit temporarily
# to make the above persistent on restarts
vi /etc/sysctl.d/90-conntrack.conf
# add this
net.netfilter.nf_conntrack_max = 1048576
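After creating that file, the setting can be loaded without a reboot (this reapplies everything under /etc/sysctl.d/):
sudo sysctl --system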
Stop the flood - ban bad IPs
We had this in our nginx.conf:
limit_conn_zone $binary_remote_addr zone=ipclients:100m;
limit_conn ipclients 50;
limit_conn_status 429;
We expected the above to block an IP if it had more than 50 connections. Yet the netstat command continuously showed IPs with more than 500 connections.
limit_conn doesn't block the IPs
limit_conn returns a polite 429 error to IPs sending too many requests. It doesn't block them at the TCP connection level. limit_conn only refuses to serve the food if a person asks for too many plates. It doesn't restrict those hooligans from entering your party. These hooligans still create a crowd and disrupt the party. We need bouncers for them. The bouncer should throw them out and prevent them from entering again. This is what fail2ban does. We configured fail2ban to monitor the NGINX error logs and ban an IP if it triggers too many 429 errors.
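We won't reproduce our exact configuration here, but a fail2ban setup for this generally needs a filter that matches the offending log lines and a jail that applies the ban. A rough sketch; the jail name, regex, and thresholds below are illustrative assumptions, not our exact values:
# /etc/fail2ban/filter.d/nginx-429.conf
[Definition]
# nginx writes a "limiting connections by zone" line to the error log when limit_conn rejects a client
failregex = limiting connections by zone.*client: <HOST>

# /etc/fail2ban/jail.d/nginx-429.conf
[nginx-429]
enabled  = true
port     = http,https
filter   = nginx-429
logpath  = /var/log/nginx/error.log
findtime = 60
maxretry = 20
bantime  = 3600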
We asked gemini-2.5-pro to help us set up fail2ban.
Have Proper Instrumentation
It's a pity that AWS doesn't provide a simple out-of-the-box dashboard for the core vitals: CPU, RAM, and disk usage. Most other hosting solutions provide one by default. It is a must.
We use Prometheus with Grafana to create a dashboard that monitors these metrics (a minimal scrape-config sketch follows the list):
- CPU Utilization %
- CPU Load
- RAM Utilization % (memory)
- Disk Utilization
- Number of connections
- Disk IO
- Bandwidth utilization
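A minimal sketch of the Prometheus side, assuming node_exporter runs on the web server to expose the host metrics (the job name and target address are illustrative):
# prometheus.yml (fragment)
scrape_configs:
  - job_name: "node"
    scrape_interval: 15s
    static_configs:
      - targets: ["10.0.0.5:9100"]   # node_exporter on the web server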
The above charts tell us what is happening. They help us focus on fixing the right things.
Statistics
Jean has written this wonderful blog post on how to set up fail2ban. He also shares a script to extract the locations and organisations of the banned IPs. We improved upon this script to generate stats in batches.
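The general idea is easy to sketch, though this is not the actual script: pull the banned IPs out of fail2ban and run a whois lookup on each (the jail name below is the illustrative one from earlier):
# list banned IPs for the jail
sudo fail2ban-client status nginx-429 | grep 'Banned IP list'
# rough per-country counts (whois output formats vary)
for ip in $(sudo fail2ban-client status nginx-429 | grep 'Banned IP list' | sed 's/.*list:\s*//'); do
  whois "$ip" | grep -im1 '^country'
done | awk '{print toupper($2)}' | sort | uniq -c | sort -nr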
These were the details from the last batch of 1500 IPs.
Top 5 Countries
286 CHINA
165 INDONESIA
98 UNITED STATES
73 RUSSIA
72 DENMARK
Top 5 ORGs
82 AS37963 Hangzhou Alibaba Advertising Co.,Ltd.
75 AS4134 CHINANET-BACKBONE
38 AS14061 DigitalOcean, LLC
24 AS24940 Hetzner Online GmbH
23 AS45090 Shenzhen Tencent Computer Systems Company Limited
Wish you the best. Hope you too are able to beat the attackers.