The Wikimedia Foundation published a report last week about how AI scrapers are pillaging their site. Not only has scraping skyrocketed, but because of the way scrapers work, it’s the most expensive network usage they serve.
While human readers tend to focus on specific – often similar – topics, crawler bots tend to “bulk read” larger numbers of pages and visit also the less popular pages. This means these types of requests are more likely to get forwarded to the core datacenter, which makes it much more expensive in terms of consumption of our resources.
While undergoing a migration of our systems, we noticed that only a fraction of the expensive traffic hitting our core datacenters was behaving how web browsers would usually do, interpreting javascript code. When we took a closer look, we found out that at least 65% of this resource-consuming traffic we get for the website is coming from bots, a disproportionate amount given the overall pageviews from bots are about 35% of the total. This high usage is also causing constant disruption for our Site Reliability team, who has to block overwhelming traffic from such crawlers before it causes issues for our readers.
Wikipedia is uniquely positioned to suffer from this problem: it hosts an enormous amount of content, and that content is repeatedly edited by human editors, leaving a deep trail of how coherent narratives are perfected by human minds.
You are probably not serving anything like Wikipedia, but if you have a large amount of content, you might find that AI bots are putting a heavy load on your systems. What can you do?
Enter Nepenthes and iocaine
iocaine is a tool designed to poison AI crawlers. It “generates an infinite maze of garbage”.
In other words, it snares AI crawlers into a never-ending maze they can’t escape from. Set up properly, it can divert bot load from your main site, forcing the scrapers to waste time – potentially infinite time – perusing nonsense and training on it, instead of pillaging your main site.
This is deliberately malicious software, intended to cause harm. Do not deploy if you aren’t fully comfortable with what you are doing. LLM scrapers are relentless and brutal, they will place additional burden on your server, even if you only serve static content. With iocaine, there’s going to be increased computing power used. It’s highly recommended to implement rate limits at the reverse proxy level, such as with the caddy-ratelimit plugin, if using Caddy.

Entrapment is done by the reverse proxy. Anything that ends up being served by iocaine will be trapped there: there are no outgoing links. Be careful what you route towards it.
The tradeoff, of course, is the CPU on your server spent generating this maze.
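To make the mechanics concrete, here is a toy sketch of the general tarpit idea in Python, using only the standard library. It is an illustration of the concept, not iocaine’s actual code: every response is machine-generated gibberish whose links only lead deeper into the maze, and the port and word list are arbitrary.

```python
# Toy sketch of the tarpit idea (not iocaine itself): every request gets a
# page of gibberish whose links only lead deeper into the maze.
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["lorem", "ipsum", "gradient", "walrus", "quantum", "teapot"]  # arbitrary

def babble(n=200):
    # Cheap filler text; a real deployment would use a Markov generator.
    return " ".join(random.choice(WORDS) for _ in range(n))

class Maze(BaseHTTPRequestHandler):
    def do_GET(self):
        base = self.path.rstrip("/")
        # Links point only to further randomly named pages, never back out.
        links = " ".join(
            f'<a href="{base}/{random.randint(0, 10**9)}">more</a>' for _ in range(5)
        )
        body = f"<html><body><p>{babble()}</p>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Port is arbitrary; a reverse proxy would route suspected crawlers here.
    HTTPServer(("127.0.0.1", 8080), Maze).serve_forever()
```

In a real setup, the reverse proxy decides who gets sent here (for example, by matching known crawler user agents), which is why the iocaine documentation stresses rate limiting at the proxy level and being careful about what you route toward it.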
While this may seem dubiously ethical – after all, the software is named after the famous fictional poison from The Princess Bride – it’s actually defending sites against unethical behavior. If someone building an LLM contacted you and asked if they could scrape your site, that’s one thing. Or if you publish it under a very permissive license, then you’ve essentially said “anyone can read this”.
But what if you published it All Rights Reserved? What if you put a notice that scraping is not allowed? This will not stop most AI bots, and indeed copyright concerns over LLMs are in the headlines nearly every day. What if the scraper ignores your robots.txt? Seems to be the norm.
iocaine says: “Let’s make AI poisoning the norm. If we all do it, they won’t have anything to crawl.”
Nepenthes is another similar project, and its creator was interviewed on Ars Technica (“AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt”):
[Nepenthes is] not to be deployed by site owners uncomfortable with trapping AI crawlers and sending them down an “infinite maze” of static files with no exit links, where they “get stuck” and “thrash around” for months, he tells users. Once trapped, the crawlers can be fed gibberish data, aka Markov babble, which is designed to poison AI models. That’s likely an appealing bonus feature for any site owners who, like Aaron, are fed up with paying for AI scraping and just want to watch AI burn.
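For anyone unfamiliar with the term, “Markov babble” is text generated from simple word-to-word transition statistics: locally it reads like language, globally it means nothing. A minimal sketch of the idea (not Nepenthes’ actual generator) might look like this:

```python
# Minimal sketch of "Markov babble": statistically plausible, meaningless text.
import random
from collections import defaultdict

def build_chain(text):
    # Map each word to the list of words that have followed it in the seed text.
    chain = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def babble(chain, length=50):
    word = random.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:                 # dead end: restart anywhere in the chain
            word = random.choice(list(chain))
        else:
            word = random.choice(followers)
        out.append(word)
    return " ".join(out)

if __name__ == "__main__":
    seed = "the crawler reads the page and the page links to the maze and the maze never ends"
    print(babble(build_chain(seed)))
```

A scraper that trains on enough of this gets statistically shaped noise instead of human writing, which is the whole point of the poisoning.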
This is a new and interesting arms race.