#dataset

“OpenAI says models are programmed to make stuff up instead of admitting ignorance”

‘In theory, #AI model makers could eliminate hallucinations by using a #dataset that contains no errors. But the paper admits such a scenario isn't remotely possible, particularly since the huge volumes of data used in #training likely contain #mistakes.’

BoyGenius A: I asked ChatGPT why you eat using the outer cutlery?

BoyGenius B: And it made up a reason? We don’t want to make consumers uneasy.

#OpenAI / #LLM / #DataIntegrity <theregister.com/2025/09/17/ope>

Hugging Face 🤗 has released FinePDFs - a dataset of more than 475 million PDFs from Common Crawl snapshots of the web (2013-2025). The 3.65 TB dataset is split by language (1733 languages!) and contains the full text extracted from the PDFs. It is intended for training large language models. License: ODC-By 1.0 + Common Crawl terms & conditions.

huggingface.co/datasets/Huggin

No mention of the rights of the original creators of the PDFs and their content.

HuggingFaceFW/finepdfs · Datasets at Hugging Face (huggingface.co)

README files are a simple but crucial component of data sharing. Without a #README, the benefits of #opendata — improved understanding, reproducibility, trust, and attention — are moot. That’s because data cannot be interpreted, validated, or reused without certain key details that a README provides.

Learn how to maximize the impact of your dataset with an effective README ➡️ blog.datadryad.org/2025/09/09/