Powerful models like GPT-4 and ChatGPT answer instantly, while AI crawlers scrape freely and threaten the value of original content

小蓝

2024-8-31

Industry News


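AI crawlers are like invisible thieves that slip into our websites and carry things off. They work around the clock and never tire, all to gather training data for the big players' AI models. Would you be happy with that? I doubt it! Today let's talk about how, as a site owner, you can protect your original content and keep it from becoming a meal for AI.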

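The Threat of AI Crawlers

First, we need to recognize that AI crawlers are a real headache. They can make the work we put so much effort into worth far less, and can even cut into our income. Think about it: if people can get whatever they want to know straight from an AI, who will still bother to visit your site? Isn't that trampling on our hard work and on the spirit of original creation?

What's more, these AI crawlers are not very transparent about what they do. Some companies openly identify their crawlers, but others stay silent and quietly collect our content like thieves. It's like a furtive hand in the dark, and discovering it is an unpleasant surprise!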

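Protective Measure One: robots.txt

So what do we do? Let's deal with these 'friends' who show up without knocking! One of the most common methods is to put robots.txt to work. This small file tells crawlers what they may fetch and what is off limits. Set the rules properly and the crawlers that honor them will stay outside the door.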
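
To make the idea concrete, here is a minimal Python sketch that checks how a rule set treats different crawlers, using the standard library's urllib.robotparser. The two-group rule set, the example.com URLs, and the helper name is_allowed are assumptions made for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical, shortened rule set used only for this illustration.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /wp-admin/
"""


def is_allowed(user_agent: str, url: str, rules: str = RULES) -> bool:
    """Return True if a crawler that obeys robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())  # parse the text directly, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Googlebot"):
        print(agent, is_allowed(agent, "https://example.com/post/1"))
```

This only shows what a well-behaved crawler would decide. Python's standard parser also matches paths by simple prefix, so wildcard patterns such as /*.png$ may be evaluated differently than they are by the major search engines.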

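But robots.txt alone is far from enough: some crawlers simply ignore your rules and take your content anyway. So we need something tougher, such as Cloudflare's automated WAF rules, to make our protection more effective!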


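Cloudflare's Automated WAF Rules

Turning on Cloudflare's automated WAF rules can noticeably strengthen a site's defenses. With these rules in place, unwanted crawlers have a much harder time getting through unnoticed. It's that simple: like putting a wall around the site so that ill-intentioned crawlers can't just wander in.

Another big plus of Cloudflare's automated WAF rules is that they keep being updated, so they stay ready for crawler behavior that changes drastically over time. That means we no longer have to watch our sites anxiously around the clock, worrying that some new crawler will suddenly show up.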
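
For crawlers that ignore robots.txt, one complementary option is a hand-written custom rule that blocks on the User-Agent header. The sketch below is not Cloudflare's automated rule set. It only builds an expression string in the style of Cloudflare's rule language (the http.user_agent field with the contains operator), which could then be pasted into a custom rule with a block action. The function name and the bot list are illustrative assumptions, and the exact syntax should be checked against Cloudflare's documentation before use.

```python
from typing import Iterable


def build_block_expression(bot_names: Iterable[str]) -> str:
    """Join one 'contains' clause per bot name with 'or'."""
    clauses = []
    for name in bot_names:
        # Refuse characters that would break out of the quoted string.
        if '"' in name or "\\" in name:
            raise ValueError(f"unsupported character in bot name: {name!r}")
        clauses.append(f'(http.user_agent contains "{name}")')
    return " or ".join(clauses)


if __name__ == "__main__":
    # Hypothetical selection of crawler names for the example.
    print(build_block_expression(["GPTBot", "CCBot", "Bytespider"]))
```

Because the User-Agent header is easy to spoof, a rule like this only stops crawlers that identify themselves honestly.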

The Present and Future of AI Crawlers

User-agent: Baiduspider
Allow: /
User-agent: Mediapartners-Google
Allow: /
User-agent: Google-Display-Ads-Bot
Allow: /
User-agent: Googlebot
Allow: /
User-agent: Googlebot-Mobile
Allow: /
User-agent: Googlebot-Image
Allow: /
User-agent: Adsbot-Google
Allow: /
User-agent: Sogou
Allow: /
User-agent: DotBot
Disallow: /
User-agent: DataForSeoBot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: Feedly
Disallow: /
User-agent: YandexBot
Disallow: /
User-agent: ias-ir
Disallow: /
User-agent: adsbot
Disallow: /
User-agent: barkrowler
Disallow: /
User-agent: Mail.RU_Bot
Disallow: /
User-agent: SEOkicks
Disallow: /
User-agent: ias-va
Disallow: /
User-agent: proximic
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: grapeshot
Disallow: /
User-agent: BLEXBot
Disallow: /
# Block AI crawlers
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: ImagesiftBot
Disallow: /
User-agent: GoogleOther
Disallow: /
User-agent: Applebot
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: peer39 crawler
Disallow: /
User-agent: FriendlyCrawler
Disallow: /
User-agent: magpie-crawler
Disallow: /
User-agent: omgili
Disallow: /
User-agent: Meltwater
Disallow: /
User-agent: AwarioSmartBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: img2dataset
Disallow: /
User-agent: YouBot
Disallow: /
User-agent: PipiBot
Disallow: /
User-agent: Seekr
Disallow: /
User-agent: scoop.it
Disallow: /
User-agent: AwarioRssBot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: *
Allow: /robots.txt
Allow: /ads.txt
Allow: /*.ico$
Allow: /*.webp$
Allow: /*.png$
Allow: /*.jpg$
Allow: /*.jpeg$
Allow: /*.gif$
Allow: /*.bmp$
Allow: /wp-admin/admin-ajax.php
Allow: /timthumb/
Allow: /wp-content/uploads/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /cdn-cgi/
Disallow: /*?replytocom=*
Disallow: /?s=*
Disallow: /redirect*
Sitemap: https://www.imydl.com/wp-sitemap.xml
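
A rule file of this length is easy to get subtly wrong, for example through a misspelled directive or a crawler that is listed twice. The following Python sketch shows one way to check for both. The function name lint_robots, the set of directives treated as known, and the sample text are assumptions made for this illustration; it is a rough check, not a full robots.txt validator.

```python
from collections import Counter

# Directives this sketch treats as valid (an assumption; adjust as needed).
KNOWN = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}

# Hypothetical sample with one repeated crawler and one misspelled directive.
SAMPLE = """\
User-agent: CCBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: *
Aloow: /*.webp$
"""


def lint_robots(text: str):
    """Return (unknown directives with line numbers, repeated user agents)."""
    unknown = []
    agents = Counter()
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        directive, value = line.split(":", 1)
        directive = directive.strip().lower()
        if directive not in KNOWN:
            unknown.append((number, directive))
        elif directive == "user-agent":
            agents[value.strip().lower()] += 1
    duplicates = sorted(name for name, count in agents.items() if count > 1)
    return unknown, duplicates


if __name__ == "__main__":
    print(lint_robots(SAMPLE))
```

Crawlers generally skip lines they do not recognize, so a misspelled rule fails silently, which is why a quick check like this can be worth running before the file is deployed.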

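Even though some countermeasures already exist, the AI crawler problem is not going away. As technology advances, crawlers will get smarter and harder to guard against. So we have to stay alert and keep upgrading the ways we protect ourselves.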
