Handling DDoS attacks
Handling millions of requests from thousands of IPs is hard! The flood clogs one drain after another.
This is what happened when we had a DDoS attack this week.
- First, our hard disk got filled due to access logs
- Then our hard disk got filled due to "too many open files" error logs
- Various services started hitting file descriptor limits
- The system started dropping packets and killing connections as the connection table got full
Is it a DDoS?
The best way to check if it is a DDoS is to run the netstat command:
sudo netstat -nta | grep 'ESTABLISHED' | awk '{sub(/^[[]/,"",$5); sub(/[]]?:[0-9]+$/,"",$5); print $5}' | sort | uniq -c | sort -nr | head -n 20
It shows the top 20 IPs with the most active connections.
During a DDoS, the IPs will be flooding the system with hundreds or thousands of connections.
Handling Disk Space Issues
The first step is to stop NGINX for a while. If growing logs are what is filling the disk, then increasing the disk size won't help for long. Stop the tap:
sudo systemctl stop nginx
We can then immediately free up some disk space by removing some log files or old backups.
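For instance, something along these lines can reclaim space quickly (the paths are illustrative; double-check before deleting anything):
# remove old rotated logs
sudo rm /var/log/nginx/*.gz
# empty the live access log in place (nginx keeps its file handle)
sudo truncate -s 0 /var/log/nginx/access.log
# shrink the systemd journal
sudo journalctl --vacuum-size=500M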
Checking What Is Taking Up the Space
We can check which folder is taking up space using du.
# which folder is taking space in root
sudo du -h --max-depth 1 / | sort -h
# then incrementally inspect inside that folder
sudo du -h --max-depth 1 /var/ | sort -h
sudo du -h --max-depth 1 /var/log/ | sort -h
Temporary Solutions:
- Disable access logs
- Raise the error log level so that only more severe errors get written (see the sketch after this list)
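A minimal sketch of both changes in an nginx config, assuming the default log location (adjust paths and level to taste):
# inside the http {} block
access_log off;                            # temporary: stop writing access logs
error_log /var/log/nginx/error.log crit;   # temporary: log only critical errors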
Permanent Solutions:
- Stop the flood. Ban bad IPs (check below).
- Increase the disk size to accommodate a few days of logs
- Set up an alerting system for 80% disk utilization
- Create a temporary 5 GB filler file. The next time you run out of space, delete it to get some immediate breathing room (see the sketch below).
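One way to create such a filler file, assuming the root filesystem is the one that fills up (the path is arbitrary):
# reserve 5 GB of space as a ballast file
sudo fallocate -l 5G /ballast.img
# when the disk is full again, delete it to buy time
sudo rm /ballast.img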
Handling "too many open files" errors
Services like nginx or wsgi servers have restrictions on the number of files they can open at a time. These limits are sometimes very low. It was just 1024 in some of our cases, while we had set a much higher worker_connections at the nginx level. These limits are often different for a service's main process and its child processes. We can check them using the following commands:
# checking the system-wide limit
cat /proc/sys/fs/file-max
# list processes for a service
ps -ef | grep nginx
# the second column in the above is the process id
# the first process is often the main process
# and the others are child processes
# both can have different limits
# checking process limits
cat /proc/<pid>/limits | grep "Max open files"
We can increase the open-file limit for these services by setting LimitNOFILE in their systemd configs.
# for NGINX
# add this in /etc/systemd/system/nginx.service.d/limits.conf
[Service]
LimitNOFILE=200000
# for other services, for example uwsgi
# add this in their service.conf
# eg /etc/systemd/system/my-wsgi.service
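After adding the drop-in, systemd needs to reload its configuration and the service needs a restart for the new limit to take effect; roughly:
sudo systemctl daemon-reload
sudo systemctl restart nginx
# verify the limit on the running main process
# (the pid file path may differ on your setup)
cat /proc/$(cat /run/nginx.pid)/limits | grep "Max open files"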
Handling "nf_conntrack: table full" errors
Like the open-file limits, the system has a limit on the maximum number of connections it can keep in its connection-tracking table. This includes internal as well as external connections. We can use the following commands to inspect and increase this limit:
# check max connections value
sudo sysctl net.netfilter.nf_conntrack_max
# check current number of connections
sudo sysctl net.netfilter.nf_conntrack_count
# we need to increase the limit if the current count
# is near the max limit
sudo sysctl -w net.netfilter.nf_conntrack_max=1048576
# the above sets the limit temporarily
# to make the above persistent on restarts
vi /etc/sysctl.d/90-conntrack.conf
# add this
net.netfilter.nf_conntrack_max = 1048576
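After creating that file, the setting can be loaded without a reboot (this reapplies everything under /etc/sysctl.d/):
sudo sysctl --system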
Stop the flood - ban bad IPs
We had this in our nginx.conf:
limit_conn_zone $binary_remote_addr zone=ipclients:100m;
limit_conn ipclients 50;
limit_conn_status 429;
We expected the above to block an IP if it had more than 50 connections. Yet the netstat command continuously showed IPs with more than 500 connections.
limit_conn doesn't block the IPs
limit_conn returns a polite 429 error to IPs sending too many requests. It doesn't block them at the TCP connection level. limit_conn only refuses to serve the food if a person asks for too many plates. It doesn't restrict those hooligans from entering your party. These hooligans still create a crowd and disrupt the party. We need bouncers for them. The bouncer should throw them out and prevent them from entering again. This is what fail2ban does. We configured fail2ban to monitor the NGINX error logs and ban an IP if it triggers too many 429 errors.
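We won't reproduce our exact configuration here, but a fail2ban setup for this generally needs a filter that matches the offending log lines and a jail that applies the ban. A rough sketch; the jail name, regex, and thresholds below are illustrative assumptions, not our exact values:
# /etc/fail2ban/filter.d/nginx-429.conf
[Definition]
# nginx writes a "limiting connections by zone" line to the error log when limit_conn rejects a client
failregex = limiting connections by zone.*client: <HOST>

# /etc/fail2ban/jail.d/nginx-429.conf
[nginx-429]
enabled  = true
port     = http,https
filter   = nginx-429
logpath  = /var/log/nginx/error.log
findtime = 60
maxretry = 20
bantime  = 3600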
We asked gemini-2.5-pro to help us set up fail2ban.
Have Proper Instrumentation
It's a pity that AWS doesn't provide a simple out-of-the-box dashboard for the core vitals: CPU, RAM, and disk usage. Most other hosting solutions provide one by default. It is a must.
We use Prometheus with Grafana to create a dashboard that monitors these metrics (a minimal scrape-config sketch follows the list):
- CPU Utilization %
- CPU Load
- RAM Utilization % (memory)
- Disk Utilization
- Number of connections
- Disk IO
- Bandwidth utilization
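A minimal sketch of the Prometheus side, assuming node_exporter runs on the web server to expose the host metrics (the job name and target address are illustrative):
# prometheus.yml (fragment)
scrape_configs:
  - job_name: "node"
    scrape_interval: 15s
    static_configs:
      - targets: ["10.0.0.5:9100"]   # node_exporter on the web server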
The above charts tell us what is happening. They help us focus on fixing the right things.
Statistics
Jean has written this wonderful blog post on how to set up fail2ban. He also shares a script to extract the locations and organisations of the banned IPs. We improved upon this script to generate stats in batches.
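The general idea is easy to sketch, though this is not the actual script: pull the banned IPs out of fail2ban and run a whois lookup on each (the jail name below is the illustrative one from earlier):
# list banned IPs for the jail
sudo fail2ban-client status nginx-429 | grep 'Banned IP list'
# rough per-country counts (whois output formats vary)
for ip in $(sudo fail2ban-client status nginx-429 | grep 'Banned IP list' | sed 's/.*list:\s*//'); do
  whois "$ip" | grep -im1 '^country'
done | awk '{print toupper($2)}' | sort | uniq -c | sort -nr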
These were the details from the last batch of 1500 IPs.
Top 5 Countries
286 CHINA
165 INDONESIA
98 UNITED STATES
73 RUSSIA
72 DENMARK
Top 5 ORGs
82 AS37963 Hangzhou Alibaba Advertising Co.,Ltd.
75 AS4134 CHINANET-BACKBONE
38 AS14061 DigitalOcean, LLC
24 AS24940 Hetzner Online GmbH
23 AS45090 Shenzhen Tencent Computer Systems Company Limited
Wish you the best. Hope you too are able to beat the attackers.