#RobotsTxt

Jonathan Bailey: I asked ChatGPT about the recent copyright news. It rehashed my latest column and misconstrued the facts. But why was it on my site at all? <a href="https://www.plagiarismtoday.com/2025/07/23/chatgpt-ignores-robots-txt-rehashes-my-column/" rel="nofollow noopener" translate="no" target="_blank">https://www.plagiarismtoday.com/2025/07/23/chatgpt-ignores-robots-txt-rehashes-my-column/</a> <a href="https://mastodon.world/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#AI</a> <a href="https://mastodon.world/tags/ChatGPT" class="mention hashtag" rel="nofollow noopener" target="_blank">#ChatGPT</a> <a href="https://mastodon.world/tags/OpenAI" class="mention hashtag" rel="nofollow noopener" target="_blank">#OpenAI</a> <a href="https://mastodon.world/tags/RobotsTxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#RobotsTxt</a>

Inautilo: <a href="https://mastodon.social/tags/Development" class="mention hashtag" rel="nofollow noopener" target="_blank">#Development</a> <a href="https://mastodon.social/tags/Trends" class="mention hashtag" rel="nofollow noopener" target="_blank">#Trends</a> Who’s crawling your site in 2025 · The most active and blocked bots and crawlers <a href="https://ilo.im/1652mx" rel="nofollow noopener" translate="no" target="_blank">https://ilo.im/1652mx</a> <a href="https://mastodon.social/tags/Bots" class="mention hashtag" rel="nofollow noopener" target="_blank">#Bots</a> <a href="https://mastodon.social/tags/Crawlers" class="mention hashtag" rel="nofollow noopener" target="_blank">#Crawlers</a> <a href="https://mastodon.social/tags/Website" class="mention hashtag" rel="nofollow noopener" target="_blank">#Website</a> <a href="https://mastodon.social/tags/Business" class="mention hashtag" rel="nofollow noopener" target="_blank">#Business</a> <a href="https://mastodon.social/tags/SEO" class="mention hashtag" rel="nofollow noopener" target="_blank">#SEO</a> <a href="https://mastodon.social/tags/UserAgents" class="mention hashtag" rel="nofollow noopener" target="_blank">#UserAgents</a> <a href="https://mastodon.social/tags/RobotsTxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#RobotsTxt</a> <a href="https://mastodon.social/tags/WebDev" class="mention hashtag" rel="nofollow noopener" target="_blank">#WebDev</a> <a href="https://mastodon.social/tags/Frontend" class="mention hashtag" rel="nofollow noopener" target="_blank">#Frontend</a> <a href="https://mastodon.social/tags/Backend" class="mention hashtag" rel="nofollow noopener" target="_blank">#Backend</a>

Inautilo: <a href="https://mastodon.social/tags/Business" class="mention hashtag" rel="nofollow noopener" target="_blank">#Business</a> <a href="https://mastodon.social/tags/Findings" class="mention hashtag" rel="nofollow noopener" target="_blank">#Findings</a> Most blocked SEO bots · Insights from ~140 million websites <a href="https://ilo.im/16439x" rel="nofollow noopener" translate="no" target="_blank">https://ilo.im/16439x</a> <a href="https://mastodon.social/tags/SEO" class="mention hashtag" rel="nofollow noopener" target="_blank">#SEO</a> <a href="https://mastodon.social/tags/Bots" class="mention hashtag" rel="nofollow noopener" target="_blank">#Bots</a> <a href="https://mastodon.social/tags/Crawlers" class="mention hashtag" rel="nofollow noopener" target="_blank">#Crawlers</a> <a href="https://mastodon.social/tags/Content" class="mention hashtag" rel="nofollow noopener" target="_blank">#Content</a> <a href="https://mastodon.social/tags/Website" class="mention hashtag" rel="nofollow noopener" target="_blank">#Website</a> <a href="https://mastodon.social/tags/Blog" class="mention hashtag" rel="nofollow noopener" target="_blank">#Blog</a> <a href="https://mastodon.social/tags/RobotsTxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#RobotsTxt</a> <a href="https://mastodon.social/tags/Development" class="mention hashtag" rel="nofollow noopener" target="_blank">#Development</a> <a href="https://mastodon.social/tags/WebDev" class="mention hashtag" rel="nofollow noopener" target="_blank">#WebDev</a> <a href="https://mastodon.social/tags/Backend" class="mention hashtag" rel="nofollow noopener" target="_blank">#Backend</a>

Inautilo: <a href="https://mastodon.social/tags/Business" class="mention hashtag" rel="nofollow noopener" target="_blank">#Business</a> <a href="https://mastodon.social/tags/Explorations" class="mention hashtag" rel="nofollow noopener" target="_blank">#Explorations</a> What would happen if I blocked big search? · Pros and cons of blocking major search engines <a href="https://ilo.im/163yb3" rel="nofollow noopener" translate="no" target="_blank">https://ilo.im/163yb3</a> <a href="https://mastodon.social/tags/SearchEngine" class="mention hashtag" rel="nofollow noopener" target="_blank">#SearchEngine</a> <a href="https://mastodon.social/tags/SEO" class="mention hashtag" rel="nofollow noopener" target="_blank">#SEO</a> <a href="https://mastodon.social/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#AI</a> <a href="https://mastodon.social/tags/Website" class="mention hashtag" rel="nofollow noopener" target="_blank">#Website</a> <a href="https://mastodon.social/tags/Blog" class="mention hashtag" rel="nofollow noopener" target="_blank">#Blog</a> <a href="https://mastodon.social/tags/RobotsTxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#RobotsTxt</a> <a href="https://mastodon.social/tags/Development" class="mention hashtag" rel="nofollow noopener" target="_blank">#Development</a> <a href="https://mastodon.social/tags/WebDev" class="mention hashtag" rel="nofollow noopener" target="_blank">#WebDev</a> <a href="https://mastodon.social/tags/Frontend" class="mention hashtag" rel="nofollow noopener" target="_blank">#Frontend</a> <a href="https://mastodon.social/tags/Backend" class="mention hashtag" rel="nofollow noopener" target="_blank">#Backend</a>

Tino Eberl: <a href="https://mastodon.online/tags/Google" class="mention hashtag" rel="nofollow noopener" target="_blank">#Google</a> uses content for <a href="https://mastodon.online/tags/KI" class="mention hashtag" rel="nofollow noopener" target="_blank">#KI</a> (AI) training even when authors object. This has now been officially confirmed. According to Google <a href="https://mastodon.online/tags/Deepmind" class="mention hashtag" rel="nofollow noopener" target="_blank">#Deepmind</a>, the objection applies only to certain <a href="https://mastodon.online/tags/Konzernbereiche" class="mention hashtag" rel="nofollow noopener" target="_blank">#Konzernbereiche</a> (corporate divisions). Anyone who wants to protect their data has to remove the site from <a href="https://mastodon.online/tags/Google" class="mention hashtag" rel="nofollow noopener" target="_blank">#Google</a> Search entirely. <a href="https://mastodon.online/tags/Verlage" class="mention hashtag" rel="nofollow noopener" target="_blank">#Verlage</a> (publishers) and <a href="https://mastodon.online/tags/Webseitenbetreiber" class="mention hashtag" rel="nofollow noopener" target="_blank">#Webseitenbetreiber</a> (website operators) see themselves as economically disadvantaged by this. <a href="https://www.golem.de/news/kuenstliche-intelligenz-google-trainiert-ki-auch-wenn-urheber-es-nicht-erlauben-2505-195928.html" rel="nofollow noopener" translate="no" target="_blank">https://www.golem.de/news/kuenstliche-intelligenz-google-trainiert-ki-auch-wenn-urheber-es-nicht-erlauben-2505-195928.html</a> <a href="https://mastodon.online/tags/Urheberrecht" class="mention hashtag" rel="nofollow noopener" target="_blank">#Urheberrecht</a> <a href="https://mastodon.online/tags/KITraining" class="mention hashtag" rel="nofollow noopener" target="_blank">#KITraining</a> <a href="https://mastodon.online/tags/Gemini" class="mention hashtag" rel="nofollow noopener" target="_blank">#Gemini</a> <a href="https://mastodon.online/tags/Suchmaschinen" class="mention hashtag" rel="nofollow noopener" target="_blank">#Suchmaschinen</a> <a href="https://mastodon.online/tags/RobotsTXT" class="mention hashtag" rel="nofollow noopener" target="_blank">#RobotsTXT</a> <a href="https://mastodon.online/tags/Kartellverfahren" class="mention hashtag" rel="nofollow noopener" target="_blank">#Kartellverfahren</a>

Inautilo: <a href="https://mastodon.social/tags/Business" class="mention hashtag" rel="nofollow noopener" target="_blank">#Business</a> <a href="https://mastodon.social/tags/Introductions" class="mention hashtag" rel="nofollow noopener" target="_blank">#Introductions</a> Meet LLMs.txt · A proposed standard for AI website content crawling <a href="https://ilo.im/16318s" rel="nofollow noopener" translate="no" target="_blank">https://ilo.im/16318s</a> <a href="https://mastodon.social/tags/SEO" class="mention hashtag" rel="nofollow noopener" target="_blank">#SEO</a> <a href="https://mastodon.social/tags/GEO" class="mention hashtag" rel="nofollow noopener" target="_blank">#GEO</a> <a href="https://mastodon.social/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#AI</a> <a href="https://mastodon.social/tags/Bots" class="mention hashtag" rel="nofollow noopener" target="_blank">#Bots</a> <a href="https://mastodon.social/tags/Crawlers" class="mention hashtag" rel="nofollow noopener" target="_blank">#Crawlers</a> <a href="https://mastodon.social/tags/LlmsTxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#LlmsTxt</a> <a href="https://mastodon.social/tags/RobotsTxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#RobotsTxt</a> <a href="https://mastodon.social/tags/Development" class="mention hashtag" rel="nofollow noopener" target="_blank">#Development</a> <a href="https://mastodon.social/tags/WebDev" class="mention hashtag" rel="nofollow noopener" target="_blank">#WebDev</a> <a href="https://mastodon.social/tags/Backend" class="mention hashtag" rel="nofollow noopener" target="_blank">#Backend</a>

PrivacyDigest: <a href="https://mas.to/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#AI</a> haters build <a href="https://mas.to/tags/tarpits" class="mention hashtag" rel="nofollow noopener" target="_blank">#tarpits</a> to trap and trick <a href="https://mas.to/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#AI</a> <a href="https://mas.to/tags/scrapers" class="mention hashtag" rel="nofollow noopener" target="_blank">#scrapers</a> that ignore <a href="https://mas.to/tags/robots" class="mention hashtag" rel="nofollow noopener" target="_blank">#robots</a>.txt <a href="https://mas.to/tags/tarpit" class="mention hashtag" rel="nofollow noopener" target="_blank">#tarpit</a> <a href="https://mas.to/tags/security" class="mention hashtag" rel="nofollow noopener" target="_blank">#security</a> <a href="https://mas.to/tags/privacy" class="mention hashtag" rel="nofollow noopener" target="_blank">#privacy</a> <a href="https://mas.to/tags/robotstxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#robotstxt</a> <a href="https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/" rel="nofollow noopener" translate="no" target="_blank">https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/</a>

Konstantin Weddige: This is not an acceptable situation, and therefore I propose extending the robots.txt standard and the corresponding HTML meta tags. For robots.txt, I see two ways to approach this: The first option would be to introduce a meta-user-agent that can be used to define rules for all AI bots, e.g. "User-agent: §MODELTRAINING§". The second option would be a directive like "Crawl-delay" that indicates how the data may be used, for example "Model-training: disallow". 2/3 <a href="https://gruene.social/tags/robotstxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#robotstxt</a> <a href="https://gruene.social/tags/rfc" class="mention hashtag" rel="nofollow noopener" target="_blank">#rfc</a>
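
As an illustration only: the "Model-training" directive proposed in this post is not part of any robots.txt standard, and no crawler is known to read it. Assuming it were written like other per-group directives, a consumer might extract it roughly as follows. The function name, the sample file, and the bot name ExampleSearchBot are hypothetical.

```python
def model_training_policy(robots_txt: str) -> dict[str, str]:
    """Map each user-agent to the value of a HYPOTHETICAL Model-training line."""
    policy: dict[str, str] = {}
    agents: list[str] = []
    in_rules = False  # True once the current group has seen a non-User-agent line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents = []
                in_rules = False
            agents.append(value)
        else:
            in_rules = True
            if field == "model-training":  # assumed directive, not a standard one
                for agent in agents:
                    policy[agent] = value.lower()
    return policy


SAMPLE = """\
User-agent: ExampleSearchBot
Model-training: allow

User-agent: *
Model-training: disallow
"""

print(model_training_policy(SAMPLE))
```

Parsers that do not know the directive would simply skip it, which is why a proposal like this only helps if crawler operators agree to honor it.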

Konstantin Weddige: I was just confronted with the question of how to prevent a website from being used to train AI models. Using robots.txt, I only see two options, both of which are bad in different ways: I can either disallow all known AI bots, while being guaranteed to miss some. Or I can disallow all bots and explicitly allow known search engines. This way I will overblock and favour the big search engines over as yet unknown challengers. 1/3 <a href="https://gruene.social/tags/robotstxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#robotstxt</a> <a href="https://gruene.social/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#AI</a> <a href="https://gruene.social/tags/LLM" class="mention hashtag" rel="nofollow noopener" target="_blank">#LLM</a>
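
The trade-off between the two options in this post can be sketched with Python's standard urllib.robotparser. All bot names (ExampleAIBot, AnotherAIBot, ExampleSearchBot, BrandNewAIBot, NewSearchChallenger) and the example.com URL are hypothetical placeholders, not real crawlers, and the sketch only models crawlers that actually obey robots.txt.

```python
from urllib.robotparser import RobotFileParser


def parse(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Option 1: disallow a list of known AI bots (hypothetical names).
BLOCKLIST = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: AnotherAIBot
Disallow: /
"""

# Option 2: disallow everyone, then explicitly allow known search crawlers.
ALLOWLIST = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: *
Disallow: /
"""

URL = "https://example.com/post"
blocklist, allowlist = parse(BLOCKLIST), parse(ALLOWLIST)

for agent in ("ExampleAIBot", "BrandNewAIBot", "ExampleSearchBot", "NewSearchChallenger"):
    print(f"{agent:20} blocklist={blocklist.can_fetch(agent, URL)} "
          f"allowlist={allowlist.can_fetch(agent, URL)}")
```

Under the blocklist, the unlisted BrandNewAIBot is still allowed (the "guaranteed to miss some" problem); under the allowlist, the unlisted NewSearchChallenger is shut out (the overblocking problem).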

apfeltalk :verified: Apple Intelligence training: major websites opt out. According to a new report, many of the largest websites have decided not to take part in the training of Apple Intelligence. <a href="https://www.apfeltalk.de/magazin/services/apple-intelligence-training-grosse-websites-entscheiden-sich-gegen-die-teilnahme/" rel="nofollow noopener" target="_blank">https://www.apfeltalk.de/magazin/services/apple-intelligence-training-grosse-websites-entscheiden-sich-gegen-die-teilnahme/</a> <a href="https://creators.social/tags/News" class="mention hashtag" rel="nofollow noopener" target="_blank">#News</a> <a href="https://creators.social/tags/Services" class="mention hashtag" rel="nofollow noopener" target="_blank">#Services</a> <a href="https://creators.social/tags/AppleIntelligence" class="mention hashtag" rel="nofollow noopener" target="_blank">#AppleIntelligence</a> <a href="https://creators.social/tags/Applebot" class="mention hashtag" rel="nofollow noopener" target="_blank">#Applebot</a> <a href="https://creators.social/tags/Datenschutz" class="mention hashtag" rel="nofollow noopener" target="_blank">#Datenschutz</a> <a href="https://creators.social/tags/Facebook" class="mention hashtag" rel="nofollow noopener" target="_blank">#Facebook</a> <a href="https://creators.social/tags/Instagram" class="mention hashtag" rel="nofollow noopener" target="_blank">#Instagram</a> <a href="https://creators.social/tags/KITraining" class="mention hashtag" rel="nofollow noopener" target="_blank">#KITraining</a> <a href="https://creators.social/tags/robotstxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#robotstxt</a> <a href="https://creators.social/tags/TheAtlantic" class="mention hashtag" rel="nofollow noopener" target="_blank">#TheAtlantic</a> <a href="https://creators.social/tags/TheNewYorkTimes" class="mention hashtag" rel="nofollow noopener" target="_blank">#TheNewYorkTimes</a> <a href="https://creators.social/tags/Urheberrecht" class="mention hashtag" rel="nofollow noopener" target="_blank">#Urheberrecht</a>

tomate: Blogmojo.ai: Plagiarism in 2023 (sic!) - When Artificial Intelligence Steals Texts!
Reading time: 3 minutes. [Update 11 August 2024], [Update 13 August 2024]
A new bot has shown up in my log files: WordPress/6.6.1; <a href="https://blogmojo.ai" rel="nofollow noopener" translate="no" target="_blank">https://blogmojo.ai</a>. Oh, another one of these AI scrapers? Let's see what <a href="https://blogmojo.ai" rel="nofollow noopener" target="_blank">Blogmojo.ai</a> actually does: "Generate epic blog articles in less than 2 minutes," the website greets me. In three simple steps, this AI generates a post on any topic. Step 1 specifies the topic and the target audience. Step 2 lets me provide the URL of a post so that the AI can imitate that post's style (this is where the bot comes in). In step 3, if I throw money at it, I can choose better models than GPT-3.5, include Google search results, optimize the text for keywords, and generate more than 500 characters.
So this bot copies writing styles, the thing that defines authors apart from their content. Authors work on their style for years; for some it has become a trademark and therefore part of their income. If I can choose who writes a text, I pick the author whose style I like best. "Choose whether the search results for your main keyword should be included when generating the blog article": the default is no; for money I can click yes. I assume that the content of the Google results gets scraped and that other people's work flows into the generated post. All in all, I really just want to know (1) whether I can block the bot and (2) whether I get paid when the bot processes my texts.
To get answers to my questions, I sent an email to the address given in the site's legal notice (Impressum):
Hello, I have a few questions about the bot "WordPress/6.6.1; <a href="https://blogmojo.ai" rel="nofollow noopener" translate="no" target="_blank">https://blogmojo.ai</a>" and the Blogmojo.ai service. 1. Does the bot respect an existing robots.txt? 2. Which IP address(es) does the bot use? 3. How do you ensure that authors whose posts have to serve as sample posts for the writing style are compensated? The style being copied is part of the authors' creativity and their own intellectual effort, which the bot grazes on only to copy it shamelessly. Thank you for answering these questions. Please note that your answers may be published in an article on my blog. Best regards, Jascha Urbach
I'm very curious whether there will be a reply. By the way, the AI generated the headline of this post for me. I had given "Blogmojo.ai uses other authors' texts without asking and without payment" as the topic. And what can I say? The article that came out of it is pure gold:
Blogmojo.ai: Plagiarism in 2023 - When Artificial Intelligence Steals Texts!
Problem: Imagine you have worked for days on a blog article or an academic paper, only to find out that your texts have been plagiarized by an artificial intelligence. Exactly that could become reality in 2023 if we are not careful.
Approach: Blogmojo.ai is a new tool that uses artificial intelligence to generate blog articles automatically. That sounds practical at first, but it also carries the risk that these AI texts are used by others without permission.
Solution: It is important that we as bloggers and authors are aware that plagiarism by artificial intelligence could be a real threat in 2023. By regularly checking our texts and using tools like CopyScape, we can make sure that our content is protected.
Conclusion: It is up to all of us to handle artificial intelligence responsibly and to make sure that our own texts are protected. Blogmojo.ai may be a useful resource, but we must also keep the risks in view.
Update, 11 August 2024: I have not yet received a reply from Blogmojo, so I sent this friendly reminder: Hello, this is a friendly reminder to answer my questions. Best regards, Jascha Urbach. I'm curious!
Update, 13 August 2024: I did get a reply after all, and it is not really satisfying:
Hi Jascha, no problem, happy to. 😊 1. No, the bot does not respect robots.txt. 2. 185.30.32.227 3. The writing-style analysis only analyzes the writing style in bullet-point form. No content is taken over directly, so plagiarism is ruled out. Here is the prompt used, from the Blogmojo.ai source code, in case it is relevant to you: (inline image from the reply email) According to OpenAI's own information, the submitted blog articles are not used as training data (see: <a href="https://web.archive.org/web/20241005072507/https://platform.openai.com/docs/concepts" rel="nofollow noopener" target="_blank">https://platform.openai.com/docs/concepts</a>). Incidentally, directly including search results or web pages as a content template for generated blog articles is not possible either, because there is no Pro version (and none is currently planned; I have put the project on hold). Kind regards, Finn
On the one hand, I'm quite glad that this AI project is on hold for now; on the other hand, I find bots that ignore a robots.txt really annoying. Now I just have to block an IP address, and I don't find answer 3 really satisfying. I really don't know what to make of projects like this. What do you think? <a rel="nofollow noopener" class="hashtag u-tag u-category" href="https://jascha.wtf/tag/ai/" target="_blank">#AI</a> <a rel="nofollow noopener" class="hashtag u-tag u-category" href="https://jascha.wtf/tag/bot/" target="_blank">#Bot</a> <a rel="nofollow noopener" class="hashtag u-tag u-category" href="https://jascha.wtf/tag/generative-ki/" target="_blank">#GenerativeKI</a> <a rel="nofollow noopener" class="hashtag u-tag u-category" href="https://jascha.wtf/tag/ki/" target="_blank">#KI</a> <a rel="nofollow noopener" class="hashtag u-tag u-category" href="https://jascha.wtf/tag/robots-txt/" target="_blank">#robotsTxt</a>

Ecologia Digital: <a href="https://mato.social/tags/Robotstxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#Robotstxt</a> <a href="https://mato.social/tags/CrawlerBacklash" class="mention hashtag" rel="nofollow noopener" target="_blank">#CrawlerBacklash</a> Trickle-down effects: "people start blocking all crawlers, and some crawlers are very important, for search indexing, internet archiving, some are used for academic research, and so the bad behaviours of all these <a href="https://mato.social/tags/AIcompanies" class="mention hashtag" rel="nofollow noopener" target="_blank">#AIcompanies</a>, and the backlash to it, is kind of fundamentally changing how the Internet works, how it is remembered and indexed..." <a href="https://pca.st/yto6v3il?t=11m34s" rel="nofollow noopener" translate="no" target="_blank">https://pca.st/yto6v3il?t=11m34s</a>

Miguel Afonso Caetano: <a href="https://tldr.nettime.org/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#AI</a> <a href="https://tldr.nettime.org/tags/GenerativeAI" class="mention hashtag" rel="nofollow noopener" target="_blank">#GenerativeAI</a> <a href="https://tldr.nettime.org/tags/AITraining" class="mention hashtag" rel="nofollow noopener" target="_blank">#AITraining</a> <a href="https://tldr.nettime.org/tags/Anthropic" class="mention hashtag" rel="nofollow noopener" target="_blank">#Anthropic</a> <a href="https://tldr.nettime.org/tags/WebCrawlers" class="mention hashtag" rel="nofollow noopener" target="_blank">#WebCrawlers</a> <a href="https://tldr.nettime.org/tags/WebScraping" class="mention hashtag" rel="nofollow noopener" target="_blank">#WebScraping</a> <a href="https://tldr.nettime.org/tags/Robotstxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#Robotstxt</a>: "Hundreds of websites trying to block the AI company Anthropic from scraping their content are blocking the wrong bots, seemingly because they are copy/pasting outdated instructions to their robots.txt files, and because companies are constantly launching new AI crawler bots with different names that will only be blocked if website owners update their robots.txt. In particular, these sites are blocking two bots no longer used by the company, while unknowingly leaving Anthropic’s real (and new) scraper bot unblocked. This is an example of “how much of a mess the robots.txt landscape is right now,” the anonymous operator of Dark Visitors told 404 Media. Dark Visitors is a website that tracks the constantly-shifting landscape of web crawlers and scrapers—many of them operated by AI companies—and which helps website owners regularly update their robots.txt files to prevent specific types of scraping. 
The site has seen a huge increase in popularity as more people try to block AI from scraping their work." <a href="https://www.404media.co/websites-are-blocking-the-wrong-ai-scrapers-because-ai-companies-keep-making-new-ones/" rel="nofollow noopener" translate="no" target="_blank">https://www.404media.co/websites-are-blocking-the-wrong-ai-scrapers-because-ai-companies-keep-making-new-ones/</a>
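
The failure mode described in this quote, a robots.txt that names retired bots but not the current one, can be checked mechanically. The following is a minimal sketch using Python's standard urllib.robotparser; the bot names RetiredBotA, RetiredBotB, and CurrentBot are hypothetical stand-ins, not the crawler names from the article, and the list of agents to test would have to come from an up-to-date source.

```python
from urllib.robotparser import RobotFileParser


def unblocked_agents(robots_txt: str, agents: list[str],
                     url: str = "https://example.com/") -> list[str]:
    """Return the agents from `agents` that this robots.txt still lets fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if parser.can_fetch(agent, url)]


# A copy-pasted file that only names bots the vendor has retired (hypothetical).
STALE_ROBOTS = """\
User-agent: RetiredBotA
Disallow: /

User-agent: RetiredBotB
Disallow: /
"""

print(unblocked_agents(STALE_ROBOTS, ["RetiredBotA", "RetiredBotB", "CurrentBot"]))
```

Running such a check against a maintained list of crawler names would flag the gap, though it says nothing about crawlers that ignore robots.txt altogether.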

Ecologia Digital: "…the <a href="https://mato.social/tags/backlash" class="mention hashtag" rel="nofollow noopener" target="_blank">#backlash</a> to AI tools from content creators and website owners who do not want their work to be used for AI training purposes without permission or compensation is not only real but is becoming increasingly widespread. The analysis also highlights the limitations of robots.txt—while many companies respect robots.txt instructions, some do not. Perplexity have been caught circumventing & ignoring <a href="https://mato.social/tags/robotstxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#robotstxt</a>." <a href="https://www.404media.co/the-backlash-against-ai-scraping-is-real-and-measurable/" rel="nofollow noopener" translate="no" target="_blank">https://www.404media.co/the-backlash-against-ai-scraping-is-real-and-measurable/</a>

Ecologia Digital"…researchers estimate that in the 3 data sets—called C4, RefinedWeb and Dolma—5% of all data, and 25% of data from the highest-quality sources, has been restricted…set up through the <a href="https://mato.social/tags/RobotsExclusionProtocol" class="mention hashtag" rel="nofollow noopener" target="_blank">#RobotsExclusionProtocol</a>, a method for website owners to prevent automated bots from crawling their pages using a file called <a href="https://mato.social/tags/robotstxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#robotstxt</a>."<a href="https://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html?unlocked_article_code=1.8k0.8eMA.cGAaZ0i10aZE&smid=nytcore-ios-share&referringSource=articleShare" rel="nofollow noopener" translate="no" target="_blank">https://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html?unlocked_article_code=1.8k0.8eMA.cGAaZ0i10aZE&smid=nytcore-ios-share&referringSource=articleShare</a>

PJ Coffey<a href="https://raggedfeathers.com/@lilithsaintcrow" class="u-url mention" rel="nofollow noopener" target="_blank">@lilithsaintcrow</a> Right with you. <a href="https://mastodon.ie/tags/AIethics" class="mention hashtag" rel="nofollow noopener" target="_blank">#AIethics</a> <a href="https://mastodon.ie/tags/robotstxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#robotstxt</a> <a href="https://mastodon.ie/tags/mastodonAdmin" class="mention hashtag" rel="nofollow noopener" target="_blank">#mastodonAdmin</a> <a href="https://mastodon.ie/tags/Perplexity" class="mention hashtag" rel="nofollow noopener" target="_blank">#Perplexity</a> <a href="https://mastodon.ie/tags/perplexityai" class="mention hashtag" rel="nofollow noopener" target="_blank">#perplexityai</a>

Stefan Bohacek: Interesting results, thanks everyone for voting! I wrote more on this topic here: <a href="https://stefanbohacek.com/blog/which-top-sites-block-ai-crawlers/" rel="nofollow noopener" translate="no" target="_blank">https://stefanbohacek.com/blog/which-top-sites-block-ai-crawlers/</a> <a href="https://stefanbohacek.online/tags/ai" class="mention hashtag" rel="nofollow noopener" target="_blank">#ai</a> <a href="https://stefanbohacek.online/tags/ChatGPT" class="mention hashtag" rel="nofollow noopener" target="_blank">#ChatGPT</a> <a href="https://stefanbohacek.online/tags/RobotsTxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#RobotsTxt</a>

Stefan BohacekInterestingly, a few of the top websites actively invite AI crawlers to crawl them.<a href="https://stefanbohacek.com/blog/which-top-sites-block-ai-crawlers/" rel="nofollow noopener" translate="no" target="_blank">https://stefanbohacek.com/blog/which-top-sites-block-ai-crawlers/</a><a href="https://stefanbohacek.online/tags/dataviz" class="mention hashtag" rel="nofollow noopener" target="_blank">#dataviz</a> <a href="https://stefanbohacek.online/tags/data" class="mention hashtag" rel="nofollow noopener" target="_blank">#data</a> <a href="https://stefanbohacek.online/tags/ai" class="mention hashtag" rel="nofollow noopener" target="_blank">#ai</a> <a href="https://stefanbohacek.online/tags/RobotsTxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#RobotsTxt</a> <a href="https://stefanbohacek.online/tags/WebCrawlers" class="mention hashtag" rel="nofollow noopener" target="_blank">#WebCrawlers</a> <a href="https://stefanbohacek.online/tags/internet" class="mention hashtag" rel="nofollow noopener" target="_blank">#internet</a> <a href="https://stefanbohacek.online/tags/TheWeb" class="mention hashtag" rel="nofollow noopener" target="_blank">#TheWeb</a>

Stefan Bohacek: Just throwing out a thought before I do some research on this, but I think robots.txt needs an update. Ideally I'd like to define an "allow list" that tells web scrapers how my content can be used. E.g.:
- monetizable: false
- fediverse: true
- nonfediverse: false
- ai: false
Etc. And I'd like to apply this to my social media profile and any other web presence, not just my personal website. <a href="https://stefanbohacek.online/tags/internet" class="mention hashtag" rel="nofollow noopener" target="_blank">#internet</a> <a href="https://stefanbohacek.online/tags/fediverse" class="mention hashtag" rel="nofollow noopener" target="_blank">#fediverse</a> <a href="https://stefanbohacek.online/tags/SocialMedia" class="mention hashtag" rel="nofollow noopener" target="_blank">#SocialMedia</a> <a href="https://stefanbohacek.online/tags/robotsTxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#robotsTxt</a>

Ecologia Digital"the rise of AI products like <a href="https://mato.social/tags/ChatGPT" class="mention hashtag" rel="nofollow noopener" target="_blank">#ChatGPT</a>, and the <a href="https://mato.social/tags/LLMs" class="mention hashtag" rel="nofollow noopener" target="_blank">#LLMs</a> underlying them, have made high-quality training data one of the internet’s most valuable commodities. That has caused internet providers of all sorts to reconsider the value of the data on their servers, and rethink who gets access to what. Being too permissive can bleed your website of all its value; being too restrictive can make you invisible." <a href="https://mato.social/tags/robotstxt" class="mention hashtag" rel="nofollow noopener" target="_blank">#robotstxt</a> <a href="https://www.theverge.com/24067997/robots-txt-ai-text-file-web-crawlers-spiders" rel="nofollow noopener" translate="no" target="_blank">https://www.theverge.com/24067997/robots-txt-ai-text-file-web-crawlers-spiders</a>
