#datalabeling

"The incredible demand for high-quality human-annotated data is fueling soaring revenues of data labeling companies. In tandem, the cost of human labor has been consistently increasing. We estimate that obtaining high-quality human data for LLM post-training is more expensive than the marginal compute itself1 and will only become even more expensive. In other words, high-quality human data will be the bottleneck for AI progress if these trends continue.

The revenue of major data labeling companies and the marginal compute cost of training frontier models for major AI providers in 2024.

To assess the proportion of data labeling costs within the overall AI training budget, we collected and estimated both data labeling and compute expenses for leading AI providers in 2024:

- Data labeling costs: We collected revenue estimates of major data labeling companies, such as Scale AI, Surge AI, Mercor, and LabelBox.
- Compute costs: We gathered publicly reported marginal costs of compute associated with training top models released in 2024, including Sonnet 3.5, GPT-4o, DeepSeek-V3, Mistral Large, Llama 3.1-405B, and Grok 2.

We then take the sum of costs in each category as the estimate of the market total. As shown above, the total cost of data labeling is approximately 3.1 times higher than the total marginal compute cost. This finding provides clear evidence: the cost of acquiring high-quality human-annotated data is rapidly outpacing the compute costs required for training state-of-the-art AI models."

ddkang.substack.com/p/human-da

Daniel’s Substack · Human Data is (Probably) More Expensive Than Compute for Training Frontier LLMs · By Daniel Kang
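The arithmetic behind that 3.1x figure is simple enough to sketch. Below is a minimal Python illustration of the method the post describes: sum the labeling companies' revenues, sum the reported per-model marginal training costs, and take the ratio. Every dollar figure here is a made-up placeholder for illustration, not one of the post's actual estimates.

```python
# Sketch of the post's estimation method. All figures are hypothetical
# placeholders in millions of USD, NOT the post's actual estimates.

# Proxy for total data labeling spend: 2024 revenues of major labeling firms.
labeling_revenue_musd = {
    "Scale AI": 900,
    "Surge AI": 1000,
    "Mercor": 80,
    "LabelBox": 60,
}

# Publicly reported marginal compute costs of training 2024 frontier models.
marginal_compute_musd = {
    "Sonnet 3.5": 120,
    "GPT-4o": 100,
    "DeepSeek-V3": 6,
    "Mistral Large": 30,
    "Llama 3.1-405B": 170,
    "Grok 2": 110,
}

# Sum each category as the estimate of the market total, then compare.
total_labeling = sum(labeling_revenue_musd.values())
total_compute = sum(marginal_compute_musd.values())
ratio = total_labeling / total_compute

print(f"Data labeling total:    ${total_labeling:,}M")
print(f"Marginal compute total: ${total_compute:,}M")
print(f"Labeling / compute ratio: {ratio:.1f}x")
```

With the post's actual collected estimates in place of these placeholders, the same sum-and-divide comparison is what yields the reported ~3.1x.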

Data Annotation vs Data Labelling: Find the Right One for You

Key takeaways:

• Understand the core difference between annotation and labeling
• Explore use cases across NLP, computer vision & more
• Learn how each process impacts model training and accuracy

Read now to make smarter data decisions:

hitechbpo.com/blog/data-annota

"Scale AI is basically a data annotation hub that does essential grunt work for the AI industry. To train an AI model, you need quality data. And for that data to mean anything, an AI model needs to know what it's looking at. Annotators manually go in and add that context.

As is the means du jour in corporate America, Scale AI built its business model on an army of egregiously underpaid gig workers, many of them overseas. The conditions have been described as "digital sweatshops," and many workers have accused Scale AI of wage theft.

It turns out this was not an environment for fostering high-quality work.

According to internal documents obtained by Inc, Scale AI's "Bulba Experts" program to train Google's AI systems was supposed to be staffed with authorities across relevant fields. But instead, during a chaotic 11 months between March 2023 and April 2024, its dubious "contributors" inundated the program with "spam," which was described as "writing gibberish, writing incorrect information, GPT-generated thought processes."

In many cases, the spammers, who were independent contractors who worked through Scale AI-owned platforms like Remotasks and Outlier, still got paid for submitting complete nonsense, according to former Scale contractors, since it became almost impossible to catch them all. And even if they did get caught, some would come back by simply using a VPN.

"People made so much money," a former contributor told Inc. "They just hired everybody who could breathe.""

futurism.com/scale-ai-zuckerbe

Futurism · The AI Company Zuckerberg Just Poured $14 Billion Into Is Reportedly a Clown Show of Ludicrous Incompetence · By Frank Landymore

"The production of artificial intelligence (AI) requires human labour, with tasks ranging from well-paid engineering work to often-outsourced data work. This commentary explores the economic and policy implications of improving working conditions for AI data workers, specifically focusing on the impact of clearer task instructions and increased pay for data annotators. It contrasts rule-based and standard-based approaches to task instructions, revealing evidence-based practices for increasing accuracy in annotation and lowering task difficulty for annotators. AI developers have an economic incentive to invest in these areas as better annotation can lead to higher quality AI systems. The findings have broader implications for AI policy beyond the fairness of labour standards in the AI economy. Testing the design of annotation instructions is crucial for the development of annotation standards as a prerequisite for scientific review and effective human oversight of AI systems in protection of ethical values and fundamental rights."

journals.sagepub.com/doi/10.11

According to Reuters, a major shift is underway as Google plans to cut ties with Scale AI, its largest data-labeling partner, following Meta's acquisition of a 49% stake in Scale. This strategic move aims to protect proprietary interests amid rising competitive threats. As Google explores alternatives for AI services, this could significantly impact Scale's revenue and open doors for new competitors. Read more about the implications [here](cnbc.com/2025/06/14/google-sca). Kudos to Reuters for the insightful coverage! #Google #ScaleAI #Meta #AI #DataLabeling #MachineLearning #BusinessStrategy #Technology #Competitors

CNBC · Google, Scale AI's largest customer, plans split after Meta deal, sources say · Google plans to cut ties with Scale AI after news broke that rival Meta is taking a 49% stake in the AI data-labeling startup, Reuters reported, citing sources.

TechXplore: Third-party data annotators often fail to accurately read the emotions of others, study finds. “Machine learning algorithms and large language models (LLMs), such as the model underpinning the functioning of the platform ChatGPT, have proved to be effective in tackling a wide range of tasks. These models are trained on various types of data (e.g., texts, images, videos, and/or […]

https://rbfirehose.com/2025/05/22/techxplore-third-party-data-annotators-often-fail-to-accurately-read-the-emotions-of-others-study-finds/

"The familiar narrative is that artificial intelligence will take away human jobs: machine-learning will let cars, computers and chatbots teach themselves - making us humans obsolete.

Well, that's not very likely, and we're gonna tell you why. There's a growing global army of millions toiling to make AI run smoothly. They're called "humans in the loop:" people sorting, labeling, and sifting reams of data to train and improve AI for companies like Meta, OpenAI, Microsoft and Google. It's gruntwork that needs to be done accurately, fast, and - to do it cheaply – it's often farmed out to places like Africa –

Naftali Wambalo: The robots or the machines, you are teaching them how to think like human, to do things like human.

We met Naftali Wambalo in Nairobi, Kenya, one of the main hubs for this kind of work. It's a country desperate for jobs… because of an unemployment rate as high as 67% among young people. So Naftali, father of two, college educated with a degree in mathematics, was elated to finally find work in an emerging field: artificial intelligence."

cbsnews.com/news/labelers-trai

#AI #GenerativeAI #LLMs #AITraining #DataLabeling #GigEconomy: "Who are the workers behind the training datasets powering the biggest LLMs on the market? In this explainer, we delve into data labeling as part of the AI supply chain, the labourers behind this data labeling, and how this exploitative labour ecosystem functions, aided by algorithms and larger systemic governance issues that exploit microworkers in the gig economy.

Key points:

- High-quality training data is the crucial element for producing a performant LLM, and high-quality training data means labeled datasets.

- Several digital labour platforms have risen to the task of supplying data labeling for LLM training. However, a lack of transparency and the use of algorithmic decision-making models undergird their exploitative business models.

- Workers are often not informed about who or what they are labeling raw datasets for, and they are subjected to algorithmic surveillance and decision-making systems that leave job stability unreliable and wages unpredictable."

privacyinternational.org/expla

Privacy International · Humans in the AI loop: the data labelers behind some of the most powerful LLMs' training datasets · Behind every machine is a human person who makes the cogs in that machine turn: there's the developer who builds (codes) the machine, the human evaluators who assess the basic machine's performance, even the people who build the physical parts for the machine.