@corbet @LWN I sympathize; it's an exasperating problem. I've found microcaching all public-facing content to be extremely effective.
- The web server sits behind a reverse proxy with a microcache
- Preseed the cache by crawling every valid public path
- Configure the time-to-live of cached content to be as short as you want
- Critically, ensure that every request is always served the stale cached copy while the cache leisurely repopulates with a fresh one in the background (see the sketch after this list). This eliminates bot overwhelm by decoupling the inbound request rate, from ANY IP, from the rate of requests hitting the upstream
- Rather than blocking aggressive crawlers, rate-limit them, with the limit tuned by profiling the maximum request rate a human would plausibly generate
- For bots with known user agents, plus those identified by profiling their traffic, divert all their requests to a duplicate long-lived cache that never invokes the upstream for updates
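In practice we do this with a stock reverse proxy cache rather than custom code, but if the serve-stale bullet sounds too good to be true, here is the whole idea in a few lines of Python. It's only a sketch: the upstream address and TTL are made up, and error handling is minimal.

```python
# Minimal sketch of serve-stale microcaching: clients always get the cached
# copy if one exists; the upstream is only asked for a fresh copy at most
# once per path per TTL window, no matter how fast requests arrive.
import threading
import time
import urllib.request

UPSTREAM = "http://127.0.0.1:8080"   # hypothetical origin server
FRESH_TTL = 1.0                      # seconds a cached copy counts as "fresh"

_cache = {}          # path -> (fetched_at, body)
_refreshing = set()  # paths with a background refresh already in flight
_lock = threading.Lock()

def _fetch(path):
    """Pull a fresh copy from the upstream and store it in the cache."""
    try:
        with urllib.request.urlopen(UPSTREAM + path) as resp:
            body = resp.read()
        with _lock:
            _cache[path] = (time.monotonic(), body)
        return body
    finally:
        with _lock:
            _refreshing.discard(path)

def get(path):
    """Serve a path, never making the client wait on the upstream if any
    cached copy exists, even a stale one."""
    with _lock:
        entry = _cache.get(path)
        if entry is not None:
            fetched_at, body = entry
            stale = time.monotonic() - fetched_at > FRESH_TTL
            if stale and path not in _refreshing:
                # Kick off exactly one background refresh; keep serving stale.
                _refreshing.add(path)
                threading.Thread(target=_fetch, args=(path,), daemon=True).start()
            return body
    # Cold cache: only the very first request for a path waits on the upstream,
    # which is why preseeding by crawling every public path matters.
    return _fetch(path)
```

The point of the design is that upstream load is bounded by (number of paths) x (1 / TTL) rather than by the client request rate, so a bot hammering one URL from a thousand IPs costs the origin the same as a single polite visitor.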
Microcaching has saved us thousands on compute and eliminated the weekly outages caused by abusive bots. There's also plenty of room to tune it for your particular content and squeeze out more efficiency.
Shout out to the infrastructure team at NPR@flipboard.com - a blog post they published 9 years ago (now long gone) described this approach.