eupolicy.social is one of the many independent Mastodon servers you can use to participate in the fediverse.
This Mastodon server is a friendly and respectful discussion space for people working in areas related to EU policy. When you request to create an account, please tell us something about you.


#webscraping

Holger Hellingers' Polente
The HTML bomb is a novel defense against AI crawlers: it inflates a page to over 10 GB and crashes the bots. A creative countermeasure against unwanted crawling.
https://polente.de/2025/08/22/die-html-bombe-digitale-abwehr-gegen-ki-crawler/
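The idea behind such an "HTML bomb" can be sketched in a few lines: highly repetitive content compresses almost to nothing, so a server can ship a response that is tiny on the wire but enormous once the client decompresses it. A minimal, hypothetical illustration (not the implementation from the linked post):

```python
import gzip

def gzip_bomb(uncompressed_size: int) -> bytes:
    """Compress a run of zero bytes. Repetitive data shrinks
    dramatically under gzip, so a multi-gigabyte page can be
    shipped as a payload of just a few kilobytes."""
    return gzip.compress(b"\0" * uncompressed_size, compresslevel=9)

# Demo with 1 MiB so it runs quickly; the post describes pages over 10 GB.
payload = gzip_bomb(1024 * 1024)
print(len(payload))  # compressed size is roughly a thousandth of the input
```

In practice the payload would be pre-generated rather than built in memory on every request, and served with `Content-Encoding: gzip` only to clients identified as abusive crawlers, so regular visitors are unaffected.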

Cloudflare says Perplexity evaded website blocks with stealth crawlers, sparking debate over AI data ethics ⚠️
Perplexity denies the claims, calling the analysis flawed and insisting user-driven access only 🤖

Users split: some defend AI access, others back stricter protections for site owners 🔐

@itsfoss

news.itsfoss.com/perplexity-ig

It's FOSS News · Is Perplexity a Shameless AI Company That Won't Take No for an Answer?
Perplexity keeps crawling websites, even when it's told no, says Cloudflare.

New Open-Source Tool Spotlight 🚨🚨🚨

Scrapling is redefining Python web scraping. Adaptive, stealthy, and fast, it can bypass anti-bot measures while auto-tracking changes in website structure. A standout: 4.5x faster than AutoScraper for text-based extractions. #Python #WebScraping

🔗 Project link on #GitHub 👉 github.com/D4Vinci/Scrapling

#Infosec #Cybersecurity #Software #Technology #News #CTF #Cybersecuritycareer #hacking #redteam #blueteam #purpleteam #tips #opensource #cloudsecurity
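The "adaptive" part of tools in this space means recovering when a site's markup changes. This hypothetical sketch (stdlib only, not Scrapling's actual API) illustrates the concept with a fallback list of candidate CSS classes tried in order:

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect the text of elements whose class attribute contains a target class."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # >0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self.depth or self.target_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts.append(data.strip())

def extract_with_fallback(html, candidate_classes):
    """Try each candidate class in order and return text from the first one
    that matches. Keeping old selectors in the list means the scraper still
    works after a site redesign renames the class."""
    for cls in candidate_classes:
        parser = ClassTextExtractor(cls)
        parser.feed(html)
        found = [t for t in parser.texts if t]
        if found:
            return " ".join(found)
    return None

page = '<div class="product-title-v2">Example Product</div>'
# "item-name" was the old selector; the redesigned page now uses "product-title-v2".
print(extract_with_fallback(page, ["item-name", "product-title-v2"]))  # Example Product
```

Real adaptive scrapers go further (similarity matching on element position and attributes rather than a hand-kept fallback list), but the failure mode they guard against is the same.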

✨
🔐 P.S. Found this helpful? Tap Follow for more cybersecurity tips and insights! I share weekly content for professionals and people who want to get into cyber. Happy hacking 💻🏴‍☠️

Q: Based on his ideas, would Adolf Hitler be for or against GDPR and right to erasure nowadays if he still lived?

A: It's reasonable to infer that Hitler would not support a regulation like #GDPR which emphasizes individual rights such as #privacy protection, data accessibility or erasure; and instead might favor more centralized control over information dissemination for propaganda purposes.

"The report, titled “Are AI Bots Knocking Cultural Heritage Offline?” was written by Weinberg of the GLAM-E Lab, a joint initiative between the Centre for Science, Culture and the Law at the University of Exeter and the Engelberg Center on Innovation Law & Policy at NYU Law, which works with smaller cultural institutions and community organizations to build open access capacity and expertise. GLAM is an acronym for galleries, libraries, archives, and museums. The report is based on a survey of 43 institutions with open online resources and collections in Europe, North America, and Oceania. Respondents also shared data and analytics, and some followed up with individual interviews. The data is anonymized so institutions could share information more freely, and to prevent AI bot operators from undermining their countermeasures.

Of the 43 respondents, 39 said they had experienced a recent increase in traffic. Twenty-seven of those 39 attributed the increase in traffic to AI training data bots, with an additional seven saying the AI bots could be contributing to the increase.

“Multiple respondents compared the behavior of the swarming bots to more traditional online behavior such as Distributed Denial of Service (DDoS) attacks designed to maliciously drive unsustainable levels of traffic to a server, effectively taking it offline,” the report said. “Like a DDoS incident, the swarms quickly overwhelm the collections, knocking servers offline and forcing administrators to scramble to implement countermeasures. As one respondent noted, ‘If they wanted us dead, we’d be dead.’”"

404media.co/ai-scraping-bots-a

404 Media · AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums
“This is a moment where that community feels collectively under threat and isn't sure what the process is for solving the problem.”

Are AI bots overwhelming digital collections?
A new GLAM-E Lab report shows how scrapers for AI training datasets are putting real strain on the infrastructures of galleries, libraries, archives, and museums. Technical bottlenecks, ethical dilemmas, and escalating costs—open culture is under pressure.
Read the full analysis:
glamelab.org/products/are-ai-b
#DigitalHeritage #GLAM #WebScraping #OpenAccess #CulturalData #MuseTech #DigitalHumanities #GLAMlab

GLAM-E Lab · Are AI Bots Knocking Cultural Heritage Offline?

"To reiterate, whatever one's opinion of these particular AI tools, scraping itself is not the problem. Automated access is a fundamental technique of archivists, computer scientists, and everyday users that we hope is here to stay—as long as it can be done non-destructively. However, we realize that not all implementers will follow our suggestions for bots above, and that our mitigations are both technically advanced and incomplete.

Because we see so many bots operating for the same purpose at the same time, it seems there's an opportunity here to provide these automated data consumers with tailored data providers, removing the need for every AI company to scrape every website, seemingly, every day.

And on the operators' end, we hope to see more web-hosting and framework technology that is built with an awareness of these issues from day one, perhaps building in responses like just-in-time static content generation or dedicated endpoints for crawlers."
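The "dedicated endpoints for crawlers" idea the EFF describes can be sketched as routing known AI-crawler user agents to a pre-rendered static snapshot instead of running the expensive dynamic render on every hit. A minimal, hypothetical illustration (the crawler names are real published user-agent tokens; the routing logic is an assumption, not EFF's code):

```python
# User-agent tokens published by the respective operators.
KNOWN_AI_CRAWLERS = ("GPTBot", "ClaudeBot", "CCBot", "PerplexityBot")

STATIC_SNAPSHOT = "<html><body>Pre-rendered catalogue snapshot</body></html>"

def render_dynamic_page() -> str:
    # Placeholder for an expensive database-backed render.
    return "<html><body>Live page</body></html>"

def handle_request(user_agent: str) -> str:
    """Serve crawlers a cached static copy; everyone else gets the live page."""
    if any(bot in user_agent for bot in KNOWN_AI_CRAWLERS):
        return STATIC_SNAPSHOT  # cheap: no database work per crawler hit
    return render_dynamic_page()

print(handle_request("Mozilla/5.0 (compatible; GPTBot/1.0)"))
```

User-agent strings can be spoofed, so production setups combine this with IP-range verification, but as a first-order cost reduction it captures the "just-in-time static content" suggestion above.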

eff.org/deeplinks/2025/06/keep

Electronic Frontier Foundation · Keeping the Web Up Under the Weight of AI Crawlers
If you run a site on the open web, chances are you've noticed a big increase in traffic over the past few months, whether or not your site has been getting more viewers, and you're not alone. Operators everywhere have observed a drastic increase in automated traffic—bots—and in most cases attribute...

📣 Research Support Hub 5 June: Workshop Web Scraping using Python 🐍
Is much of the information you need for your #research available on websites, but not as downloadable #datasets or #files? This workshop will introduce you to the basics of #webscraping in a clear, practical way!

Also drop by at the Support Café, where experts will be on hand for your quick (or big ;-)) questions about #R, #Python, #Statistics, #MachineLearning, #HPC and #Geo!

ℹ️ More information 👉🏼 edu.nl/rw7vd

I'm having trouble figuring out what kind of botnet has been hammering our web servers over the past week. Requests come in from tens of thousands of addresses, just once or twice each (and not getting blocked by fail2ban), with different user-agent strings (Chrome versions ranging from 24.0.1292.0 to 108.0.5163.147) and ridiculous cobbled-together paths like /about-us/1-2-3-to-the-zoo/the-tiny-seed/10-little-rubber-ducks/1-2-3-to-the-zoo/the-tiny-seed/the-nonsense-show/slowly-slowly-slowly-said-the-sloth/the-boastful-fisherman/the-boastful-fisherman/brown-bear-brown-bear-what-do-you-see/the-boastful-fisherman/brown-bear-brown-bear-what-do-you-see/brown-bear-brown-bear-what-do-you-see/pancakes-pancakes/pancakes-pancakes/the-tiny-seed/pancakes-pancakes/pancakes-pancakes/slowly-slowly-slowly-said-the-sloth/the-tiny-seed

(I just put together a bunch of Eric Carle titles as an example. The actual paths are pasted together from valid paths on our server but in invalid order, with as many as 32 subdirectories.)

Has anyone else been seeing this and do you have an idea what's behind it?
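One log-side heuristic for the pattern described above — valid path segments glued together in invalid order, often with repeats and excessive depth — is to flag requests whose paths are deeper than anything legitimately on the site or that repeat the same segment. A hypothetical sketch (the threshold is an assumption; tune it to your site's real URL depth):

```python
from collections import Counter

MAX_DEPTH = 8  # assumed limit; legitimate paths on most sites are shallower

def looks_cobbled(path: str) -> bool:
    """Flag paths that are suspiciously deep or that repeat a segment,
    as pasted-together paths like the example above tend to do."""
    segments = [s for s in path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return True
    # Any single segment appearing three or more times is a strong signal.
    return any(count >= 3 for count in Counter(segments).values())

print(looks_cobbled("/about-us/contact"))                 # False
print(looks_cobbled("/a/b/c/a/b/c/a/b/c/a/b/c"))          # True
```

A filter like this won't identify the botnet, but it can feed a block list or rate limiter while the requests remain distributed across too many IPs for fail2ban to catch.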

"On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s e-commerce site was down. It looked to be some kind of distributed denial-of-service attack.

He soon discovered the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous site.

“We have over 65,000 products, each product has a page,” Tomchuk told TechCrunch. “Each page has at least three photos.”

OpenAI was sending “tens of thousands” of server requests trying to download all of it, hundreds of thousands of photos, along with their detailed descriptions.

“OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it’s way more,” he said of the IP addresses the bot used to attempt to consume his site.

“Their crawlers were crushing our site,” he said. “It was basically a DDoS attack.”

Triplegangers’ website is its business. The seven-employee company has spent over a decade assembling what it calls the largest database of “human digital doubles” on the web, meaning 3D image files scanned from actual human models.

It sells the 3D object files, as well as photos — everything from hands to hair, skin, and full bodies — to 3D artists, video game makers, anyone who needs to digitally recreate authentic human characteristics."
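OpenAI documents that GPTBot honors robots.txt, and the article's broader point is that sites without one are treated as fair game. The minimal directive to opt a whole site out of GPTBot crawling (per OpenAI's published crawler documentation) is:

```
User-agent: GPTBot
Disallow: /
```

This only helps against crawlers that choose to respect it; the Cloudflare/Perplexity dispute above is precisely about bots that allegedly do not.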

techcrunch.com/2025/01/10/how-

TechCrunch · How OpenAI's bot crushed this seven-person company's website ‘like a DDoS attack’
OpenAI was sending “tens of thousands” of server requests trying to download Triplegangers' entire site, which hosts hundreds of thousands of photos.