Robots.txt Rules to Block LLM AI like ChatGPT from Crawling Your Web Content

How to block Large Language Model AI from crawling your website with robots.txt User-agent block list

I recently checked my Google Search Console and found that some of my visitors are AI agents.

These aren't humans visiting and reading my articles. Most likely they're AI agents scraping my content and serving a summary to their users.

I love the development of AI, especially generative AI. But the business practices aren't fair yet: how can people stay motivated to write something original if they get no benefit from it at all?

If you have a website, you can try to block the AI agents by adding some rules to your robots.txt file.

Here’s the section of my robots.txt file that tries to block AI agents from crawling my website.

User-agent: anthropic-ai
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: cohere-ai
User-agent: Diffbot
User-agent: FacebookBot
User-agent: GPTBot
User-agent: ImagesiftBot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: Omgilibot
User-agent: PerplexityBot
User-agent: Timpibot
User-agent: Google-Extended
Disallow: /
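If you want to verify that a group like this actually blocks the agents you listed, Python's standard-library robots.txt parser can check it locally. This is just a sanity-check sketch; the group below is a shortened copy of the list above, and the paths are arbitrary examples.

```python
# Sanity-check the robots.txt group above with Python's stdlib parser.
# The RULES string is a shortened copy of the block from this post.
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Listed AI agents are denied everywhere; other crawlers stay allowed.
print(parser.can_fetch("GPTBot", "/some-article"))     # False: blocked
print(parser.can_fetch("Googlebot", "/some-article"))  # True: not in the group
```

Note that a single `Disallow: /` line applies to every `User-agent` line in the same group, which is why the block above only needs one `Disallow` at the end.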

I got some of these user agents from Google itself. Funnily enough, Google also tries to block AI agents from crawling its own content; just take a look at Google's robots.txt.

I understand this method isn't 100% foolproof (robots.txt is only a request, and a crawler can simply ignore it), but it's a good first step.

I'm going to try to update the list once in a while, so we can always have an up-to-date list.

Thanks for reading!

References