
Rate Limiting an aggressive bot in Nginx

By John Locke on August 22, 2023

High load isn't necessarily an emergency, but it can be a heads-up before a site noticeably slows down. Sometimes there are weird spikes that simply go away, but sometimes high load is a sign of a Denial of Service attack.

Rate Limiting

Nginx has a rate-limiting module that can be used to handle cases where a slow URL is being targeted. Today one of our sites triggered high load alerts. Here's how I handled it:

  1. Check to see what was using the load. I logged into the server, and ran top, which immediately showed 7 - 8 PHP processes near the top of the list.
  2. Check the responsible process's status. I went to the PHP status page we have in our Nginx config, and saw that all available processes were in use, and many of them had a /index.php?document-type=... path -- e.g. a search path with a lot of parameters.
  3. Review the logs. Looking in the Nginx access_log, I found that the /search path was being hit 2 - 3 times per second, from multiple different IP addresses but all sharing the same user agent -- "Bytespider". This is a crawler from ByteDance (the company that owns TikTok).
  4. Determine a plan. In this case, after figuring out that Bytespider had hit the search page 33K times in the previous 6 hours, and it was also loading all page-related assets, I decided this was not something to block entirely, but a good situation to use a rate limit.
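The triage in steps 2-4 boils down to a couple of log one-liners. Here's a minimal sketch against a small sample log (the log path and the combined-format lines are assumptions for illustration; the real log lives wherever your access_log directive points):

```shell
# Build a small sample access log to run the one-liners against
log=/tmp/access_sample.log
cat > "$log" <<'EOF'
1.2.3.4 - - [22/Aug/2023:10:00:01 +0000] "GET /search?q=a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"
5.6.7.8 - - [22/Aug/2023:10:00:02 +0000] "GET /search?q=b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"
9.9.9.9 - - [22/Aug/2023:10:00:03 +0000] "GET /about HTTP/1.1" 200 512 "-" "Mozilla/5.0"
EOF

# How many requests came from Bytespider?
grep -c Bytespider "$log"

# Which IPs is it crawling from? (field 1 of the combined log format is the client IP)
grep Bytespider "$log" | awk '{print $1}' | sort | uniq -c | sort -rn
```

Running the real version of these against 6 hours of logs is how the 33K figure above falls out.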

Implementation

A couple searches gave me some good references/instructions on this: https://serverfault.com/questions/639671/nginx-how-to-limit..., https://theawesomegarage.com/blog/limit-bandwidth..., https://www.nginx.com/blog/rate-limiting-nginx/

In our normal Nginx configurations, any .conf file in /etc/nginx/conf.d/*.conf is included in the main "http" block, so I created a "limits.conf" file there for the 'limit_req_zone' directive:

map $http_user_agent $limit_bot {
  default "";
  ~*(Bytespider) $http_user_agent;
}

limit_req_zone $limit_bot zone=bots:10m rate=30r/m;

... I had to read up on map and limit_req_zone to understand this -- a map creates a new variable, and limit_req_zone skips a request entirely when its key (the first argument) evaluates to an empty string. So this map outputs either the Bytespider user agent string, or nothing.

The limit_req_zone creates a zone named 'bots', allocating 10 MB of space to track variations (which is probably entirely unnecessary, we only have a single variation here!). And then each variation of the user agent passed in $limit_bot is limited to 30 requests per minute -- 1 every 2 seconds.
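A nice property of the map approach is that other crawlers can later share the same zone just by extending the regex -- for example (GPTBot here is purely an illustration, not part of our actual config):

```nginx
map $http_user_agent $limit_bot {
  default "";
  ~*(Bytespider|GPTBot) $http_user_agent;
}
```

Each matching user agent string becomes its own key in the zone, so each bot gets its own 30r/m budget rather than sharing one.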

To actually apply this, I could have added it directly to the sites-enabled file for the site, but that file gets reset by our configuration management system (SaltStack) every day. So instead I created another file, /etc/nginx/includes/limit_search.conf. Files in includes need to be explicitly included to take effect, and we have a section in our Salt configuration to do this.

In this file, I applied the actual limit to the search path:

location ~ /search {
  limit_req zone=bots burst=4;
  limit_req_status 429;
  try_files $uri @rewrite;
}

... this applies the "bots" zone to the /search path. The burst=4 allows up to 4 requests above the configured rate to queue up before Nginx starts rejecting requests with an error (queued requests are delayed to match the rate, since nodelay isn't set).
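To make the burst behavior concrete, here's a toy shell model of the bucket: rate=30r/m means one slot drains every 2 seconds, and burst=4 means up to 4 excess requests can be held. This is an illustration with whole-second timing, not Nginx's exact millisecond accounting:

```shell
# Toy model of limit_req with rate=30r/m (one slot per 2s) and burst=4.
# Six back-to-back requests at t=0, then one more 10 seconds later.
rate_interval=2
burst=4
last=0
excess=0
statuses=""
for t in 0 0 0 0 0 0 10; do
  # Slots drained since the previous request
  recovered=$(( (t - last) / rate_interval ))
  excess=$(( excess - recovered ))
  if [ "$excess" -lt 0 ]; then excess=0; fi
  if [ "$excess" -gt "$burst" ]; then
    code=429                      # burst exhausted: rejected
  else
    code=200                      # accepted (possibly delayed by Nginx)
    excess=$(( excess + 1 ))
  fi
  statuses="$statuses $code"
  echo "t=${t}s -> $code"
  last=$t
done
```

The first five instant requests get through (one at the rate plus four burst slots), the sixth is rejected, and after a 10-second pause the bucket has drained and requests pass again.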

By default Nginx sends a 503 error for rejected requests -- but looking up status codes, I found that 429 is "Too Many Requests", meant specifically for rate limiting, so I used limit_req_status to send that instead.

And finally, this location block now matched all /search requests -- but /search doesn't exist on disk, so I had to add the try_files directive to hand the request off to Drupal/PHP as before.

And then the only thing left was to add it to the site configuration, using the line:

include includes/limit_search.conf;

... and the traffic and load immediately dropped back to normal levels:

[Grafana graph showing traffic and load dropping back to normal]

Comments

Eric A-M (not verified)

September 29, 2023

Thanks a bunch for this! This works to rate limit bots across multiple IPs, which is better than the IP approach I was using previously.

You're awesome!

mydeardiary (not verified)

August 2, 2024

Hi, thanks for sharing this tip.

I got to apply rate limiting because recently GPTBot accessed my site more than I do and used up more bandwidth than my typical usage.
