Meadow

0004 - scraper honeypots

Give me more give me more.

Today marks the end of our beautiful time at the beach. Of all days, yesterday was my favorite. Things just seemed to flow better, and I also found myself in a better mood than usual. I wonder if setting an intention to have a “good day” in yesterday's word vomit had something to do with it? I shared some pictures in yesterday's entry if you're interested.

Something I'm realizing is that whenever I sit down to write one of these unscripted posts, I have lots of ideas I want to talk about. Of course, in writing it's much harder to mix and match disparate ideas, so what usually happens is that I either end up talking about a single thing or, more commonly, write about many things at the same time, which makes everything hard to follow. Maybe in these cases I can tackle the topics as a list? Short sections for each one?

Anyway. One thing I've wanted to talk about ever since my first vomit is my new project (source code here). So here goes!

You know how LLMs need a lot of training data? This is obtained by “scraping” the web, which basically means that an automated program visits a link and saves the content as training data, then visits another link from that page, and so on.
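
If you've never seen one, the core loop of a scraper looks roughly like this (a toy sketch in Python, not any particular crawler's code; real scrapers are parallel, distributed, and far more careful):

```python
# Toy crawler: fetch a page, save its text as "training data",
# queue up the links it finds, repeat.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, max_pages: int = 100) -> dict[str, str]:
    seen: set[str] = set()
    queue = deque([start_url])
    corpus: dict[str, str] = {}

    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)

        resp = requests.get(url, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")

        # Save the page text, then follow every link on the page.
        corpus[url] = soup.get_text(" ", strip=True)
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http"):
                queue.append(link)

    return corpus
```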

In theory, there's an implicit web standard called robots.txt that a webmaster can use to tell scrapers which parts of their websites are allowed to be scraped and which aren't. Sadly, this is only “in theory.”
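
For reference, this is what honoring robots.txt is supposed to look like from the scraper's side. A sketch using Python's built-in parser (the URLs and bot name are just placeholders):

```python
# A well-behaved scraper checks robots.txt before fetching anything.
# A robots.txt file looks something like:
#
#   User-agent: *
#   Disallow: /private/
#
#   User-agent: SomeBot
#   Disallow: /
#
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the file

if rp.can_fetch("SomeBot", "https://example.com/posts/1"):
    print("allowed to scrape this page")
else:
    print("robots.txt says no -- a polite scraper stops here")
```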

There are many, many dishonest scrapers out there that don't care what's allowed and what isn't and just scrape indiscriminately.

I think this behavior is unacceptable and doesn't help produce the kind of internet that we all want to see: the kind of internet where we freely express ourselves and collaborate. That's why you see many people now requiring human confirmation before you can even access their site.

However, human confirmation is not perfect (though it's definitely better than no confirmation at all, if you care about such things). It's also really, really hard (impossible?) to programmatically tell a bot scraper from an actual human, since bots send the server all the signals needed to make it think it's dealing with a human (precisely to avoid being blocked in the first place)!

So what to do? Advances in human confirmation are playing an important role, but there's also another, sillier approach: honeypot traps.

I think it was Maurycy¹ who first came up with the idea (or at least was the first I'm aware of). The concept is quite simple: scrapers are indiscriminate and want as much content as possible, so why not give it to them? The catch here is that the content we'll be feeding them is pure and absolute garbage. “Garbage for the garbage king,” says Maurycy.

The content is actually not entirely gibberish, but not entirely proper English either. For example, Maurycy uses a Markov chain approach to generate human-looking text.
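
If you've never played with one, a word-level Markov chain just records which word tends to follow which in a corpus and then walks those transitions at random. A toy sketch of the idea (an illustration, not Maurycy's actual code):

```python
# Toy word-level Markov chain text generator.
import random
from collections import defaultdict

def build_chain(corpus: str) -> dict[str, list[str]]:
    # Map each word to the list of words that follow it in the corpus.
    words = corpus.split()
    chain: dict[str, list[str]] = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain: dict[str, list[str]], length: int = 50) -> str:
    word = random.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # Dead end: jump to a random word and keep going.
        word = random.choice(followers) if followers else random.choice(list(chain))
        out.append(word)
    return " ".join(out)

# chain = build_chain(open("corpus.txt").read())  # hypothetical corpus file
# print(generate(chain))
```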

The beauty of this idea is that there is a practically infinite number of pages that can be generated in this way, so a web scraper will get stuck scraping that site until either:

  1. the scraping pipeline realizes something is wrong, or hits a maximum number of pages it will scrape from a single domain, or

  2. a human supervisor intervenes and marks the domain as poisoned or adds it to some sort of exception list.

Both cases are positive for us, I think. In the first, we might get lucky and have our garbage end up in the model's actual training data, which produces much subtler issues down the line that are considerably harder to debug. It's known, for example, that even a small number of poisoned examples in the training data can cause a model to misbehave. In the second case, we'll have wasted some “evil actor's” time by forcing them to purge our website from their current data set.

I've been itching to make such a honeypot trap ever since I read about it in Maurycy's post. However, I wasn't convinced by the Markov chain approach, as it's not exactly energy efficient. If it costs me more to generate a page than it costs a scraper to scrape it, then I'm losing the competition.

It wasn't until I read Herman's post that I realized (as he points out) that we don't really need to be so careful with how we generate the data. We don't need to fool a human reviewer but a machine, so using a more powerful generative method like Markov chains is not required. The approach Herman suggests is to randomly pick paragraphs from books in the public domain (from Project Gutenberg).
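
Something in that spirit could look like this (a rough sketch under my own assumptions; the file layout is made up, and Herman's actual implementation may well differ):

```python
# Split a handful of public-domain books into paragraphs once,
# then build each fake page by sampling paragraphs at random.
import random
from pathlib import Path

def load_paragraphs(book_dir: str) -> list[str]:
    paragraphs: list[str] = []
    for book in Path(book_dir).glob("*.txt"):
        text = book.read_text(encoding="utf-8")
        # Gutenberg plain-text files separate paragraphs with blank lines.
        paragraphs.extend(p.strip() for p in text.split("\n\n") if p.strip())
    return paragraphs

def fake_page(paragraphs: list[str], n: int = 8) -> str:
    return "\n\n".join(random.choice(paragraphs) for _ in range(n))

# paragraphs = load_paragraphs("books/")  # hypothetical directory of Gutenberg .txt files
# print(fake_page(paragraphs))
```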

This approach is great! It (a) generates valid English text and (b) is computationally really cheap. However, it has the shortcoming that any given paragraph is internally consistent (not garbage), and that the same paragraph is much more likely to repeat across different generations (how likely depends, of course, on how many input books you use).

For my honeypot, I used a similar approach, but rather than randomizing at the paragraph level, I'm just picking random sentences. The text looks like proper English, but one sentence usually has no connection with the next. I say “usually,” and this is crucial: sometimes adjacent sentences almost make sense together, or they directly contradict each other, which I think is much more likely to break training by directly impacting a model's logic capabilities.
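
The gist of the sentence-level idea is something like this (a simplified sketch, not the actual code in the repo linked above):

```python
# Split the corpus into sentences once, then stitch a "paragraph"
# together from sentences picked at random: each sentence reads
# fine on its own, but rarely follows from the previous one.
import random
import re

def load_sentences(corpus: str) -> list[str]:
    # Naive splitter: break on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", corpus)
    return [s.strip() for s in sentences if s.strip()]

def fake_paragraph(sentences: list[str], n: int = 6) -> str:
    return " ".join(random.choice(sentences) for _ in range(n))

# sentences = load_sentences(open("books.txt").read())  # hypothetical corpus file
# print(fake_paragraph(sentences))
```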

I made my trap public on Nov 17, 2025, and since then I've gotten a whopping 1,652,590 scrapes! (There's a counter at the bottom of the page; I followed pretty much the same counting strategy that Herman used, where every visit is counted as a scrape.)

[Image: screenshot of the scrape counter at the bottom of the trap]

That's about 53,000 requests per day. Quite a lot, and much more than my actual blog gets on its best days 😅. We can actually get a better idea of scraper behavior if we look at the “amount of output data over time” graph for the VM that hosts my trap.

[Image: graph of output data over time for the VM hosting the trap]

Interestingly, you can see a bunch of requests right after I made the trap public. It was around that date that I announced it on Mastodon, and I suppose scrapers picked it up from there. However, what I find most interesting is that traffic immediately tapered off, and it's only during this month that it has been picking up again, and quite steadily at that! I'm curious to see what “scrape speed” we'll be seeing a week or so from now. I might update this post with any interesting data.

I found this project really fun to work on. After implementing the main “blog” section, I added one for “haikus” and one for social media (playing with on-the-fly image generation was satisfying). But as you can see from the counter screenshot I shared above, scrapers REALLY love the blog section. It's probably exactly the kind of content they're looking for.

Now, it's very likely that my trap will eventually be flagged and excluded from further scraping. I think this approach of laying honeypots is doomed to fail unless more people start creating their own.

Imagine what it would look like if 1% of folks with a domain decided to host their own honeypot. It would be really hard for a human reviewer to keep up with them! Not only that, but since the generated texts are pretty much valid English, they're also quite hard to flag automatically as invalid unless the scrapers use a more powerful model, which makes their scraping a lot more expensive.

(I shared the source code to my project above, but here it is again if you're interested. If you need any help setting it up, please let me know!)

Now, as a closing thought, I want to say that (contrary to what it might seem from this post) I'm not against AI. I actually think LLMs and the greater AI ecosystem have a lot of potential to be used for good. I sincerely believe that it's already helping in a lot of areas, and we're going to see lots of improvements going forward.

I'm not against my content being scraped either. What I'm against is assholes who don't respect ownership or my preference as the creator of the content they so desperately want.


Thoughts

  • I really should try and get back into meditating. For a long time, I used to have a fairly consistent practice, but now I'm struggling to find time and motivation. Perhaps I should try and do something extreme like waking up at 4:45am every morning to meditate before everyone wakes up? Extreme schedules seem to work for me for some reason.

Some pics from today. Both of them are from a restaurant where we had brunch before starting our drive back home. It had a nice message at the entrance that I thought was appropriate to share, as well as a cool mushroom design on the table where we sat.

[Image: the message at the restaurant's entrance]
[Image: the mushroom design on the table]

Footnotes

  1. I'm assuming “Maurycy” is their name based on their URL. I did a quick search but wasn't able to find any “about” page or anything that would give me this information.