The Wikimedia Foundation published a report last week about how AI scrapers are pillaging their site. Not only has scraping skyrocketed, but because of the way scrapers work, it’s the most expensive network usage they serve.
From the report:

While human readers tend to focus on specific – often similar – topics, crawler bots tend to “bulk read” larger numbers of pages and visit also the less popular pages. This means these types of requests are more likely to get forwarded to the core datacenter, which makes it much more expensive in terms of consumption of our resources.
While undergoing a migration of our systems, we noticed that only a fraction of the expensive traffic hitting our core datacenters was behaving how web browsers would usually do, interpreting javascript code. When we took a closer look, we found out that at least 65% of this resource-consuming traffic we get for the website is coming from bots, a disproportionate amount given the overall pageviews from bots are about 35% of the total. This high usage is also causing constant disruption for our Site Reliability team, who has to block overwhelming traffic from such crawlers before it causes issues for our readers.
Wikipedia is uniquely positioned to suffer from this problem: it has an enormous amount of content, and that content is repeatedly edited by human editors, leaving a deep revision trail that shows how coherent narratives are perfected by human minds.
You are probably not serving anything like Wikipedia, but if you have a large amount of content, you might find that AI bots are putting a heavy load on your systems. What can you do?
Enter Nepenthes and iocaine
iocaine is a tool designed to poison AI crawlers. It “generates an infinite maze of garbage”.
In other words, it snares AI crawlers into a never-ending maze they can’t escape from. Set up properly, it can divert bot load from your main site, forcing the scrapers to waste time – potentially infinite time – perusing nonsense and training on it, instead of pillaging your real content.
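To give a feel for the mechanism, here is a toy sketch in Python. It is not iocaine’s code; the word list, port, and page layout are invented for illustration. Every URL returns a generated page of filler whose links point only to more generated URLs, so a crawler that keeps following links never reaches anything real.

import hashlib
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

# Tiny vocabulary for filler text; a real tarpit would serve Markov babble
# built from a large corpus instead.
WORDS = ["quantum", "harvest", "lattice", "ponder", "vellum", "orbit", "gable"]

class MazeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Seed the RNG from the path so each URL is stable but unique.
        rng = random.Random(hashlib.sha256(self.path.encode()).hexdigest())
        filler = " ".join(rng.choices(WORDS, k=300))
        # Ten links, each to another generated page: there is no exit.
        links = " ".join(f'<a href="/{rng.getrandbits(64):x}">more</a>'
                         for _ in range(10))
        body = f"<html><body><p>{filler}</p>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Bind locally; the reverse proxy decides which visitors get sent here.
    HTTPServer(("127.0.0.1", 8080), MazeHandler).serve_forever()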
The iocaine documentation is blunt about what you’re getting into:

This is deliberately malicious software, intended to cause harm. Do not deploy if you aren’t fully comfortable with what you are doing. LLM scrapers are relentless and brutal, they will place additional burden on your server, even if you only serve static content. With iocaine, there’s going to be increased computing power used. It’s highly recommended to implement rate limits at the reverse proxy level, such as with the caddy-ratelimit plugin, if using Caddy.

Entrapment is done by the reverse proxy. Anything that ends up being served by iocaine will be trapped there: there are no outgoing links. Be careful what you route towards it.
The tradeoff, of course, is the CPU your server spends generating this maze.
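To make the routing side concrete, here is a hypothetical Caddyfile sketch. The backend address, the document root, and the list of user agents are my assumptions, not something taken from the iocaine docs, so adjust them to your own deployment:

example.com {
    # Match a few well-known AI crawler user agents (illustrative list).
    @aibots header_regexp User-Agent "(GPTBot|ClaudeBot|CCBot|Bytespider)"

    # Matched traffic is proxied to the local iocaine instance; it never
    # sees the real site, and every link it finds leads deeper into the maze.
    reverse_proxy @aibots 127.0.0.1:42069

    # Rate limiting (for example via the caddy-ratelimit plugin) would also
    # go here; check that plugin's README for its exact directive syntax.

    # Everyone else gets the normal site.
    root * /var/www/example.com
    file_server
}

The same pattern works with nginx or HAProxy: match suspect user agents or misbehaving clients and proxy them to iocaine instead of your application.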
While this may seem dubiously ethical – after all, the software is named after the famous fictional poison from The Princess Bride – it’s actually defending sites against unethical behavior. If someone building an LLM contacted you and asked permission to scrape your site, that’s one thing. Likewise, if you publish under a very permissive license, you’ve essentially said “anyone can read this”.
But what if you published it All Rights Reserved? What if you put a notice that scraping is not allowed? This will not stop most AI bots, and indeed copyright concerns over LLMs are in the headlines nearly every day. What if the scraper ignores your robots.txt? Seems to be the norm.
iocaine says: “Let’s make AI poisoning the norm. If we all do it, they won’t have anything to crawl.”
Nepenthes is a similar project, and its creator was interviewed by Ars Technica (“AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt”):
[Nepenthes is] not to be deployed by site owners uncomfortable with trapping AI crawlers and sending them down an “infinite maze” of static files with no exit links, where they “get stuck” and “thrash around” for months, he tells users. Once trapped, the crawlers can be fed gibberish data, aka Markov babble, which is designed to poison AI models. That’s likely an appealing bonus feature for any site owners who, like Aaron, are fed up with paying for AI scraping and just want to watch AI burn.
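“Markov babble” is just text produced by a simple Markov chain: each word is chosen based on the few words before it, so the output reads as locally plausible but globally meaningless. A toy generator, not Nepenthes’ actual one, might look like this (the corpus filename is a placeholder):

import random
from collections import defaultdict

def build_chain(text, order=2):
    # Map each run of `order` words to the words observed following it.
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def babble(chain, length=80):
    # Random-walk the chain to produce endless gibberish.
    key = random.choice(list(chain))
    out = list(key)
    for _ in range(length):
        nxt = random.choice(chain.get(key, [random.choice(out)]))
        out.append(nxt)
        key = tuple(out[-len(key):])
    return " ".join(out)

if __name__ == "__main__":
    corpus = open("corpus.txt", encoding="utf-8").read()  # any public-domain text
    print(babble(build_chain(corpus)))

Feed a crawler pages of this and whatever model is trained on the crawl ingests noise instead of your writing.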
This is a new and interesting arms race.