https://www.europesays.com/2399795/ Sources: AI training startup Mercor eyes $10B+ valuation on $450 million run rate #business #DataLabeling #Entrepreneurship #FelicisVentures #Mercor #SPV
"The incredible demand for high-quality human-annotated data is fueling soaring revenues of data labeling companies. In tandem, the cost of human labor has been consistently increasing. We estimate that obtaining high-quality human data for LLM post-training is more expensive than the marginal compute itself1 and will only become even more expensive. In other words, high-quality human data will be the bottleneck for AI progress if these trends continue.
The revenue of major data labeling companies and the marginal compute cost of training frontier models for major AI providers in 2024.
To assess the proportion of data labeling costs within the overall AI training budget, we collected and estimated both data labeling and compute expenses for leading AI providers in 2024:
- Data labeling costs: We collected revenue estimates of major data labeling companies, such as Scale AI, Surge AI, Mercor, and LabelBox.
- Compute costs: We gathered publicly reported marginal costs of compute associated with training top models released in 2024, including Sonnet 3.5, GPT-4o, DeepSeek-V3, Mistral Large, Llama 3.1-405B, and Grok 2.
We then calculated the sum of costs in each category as an estimate of the market total. As shown above, the total cost of data labeling is approximately 3.1 times higher than total marginal compute costs. This provides clear evidence that the cost of acquiring high-quality human-annotated data is rapidly outpacing the compute costs required for training state-of-the-art AI models."
https://ddkang.substack.com/p/human-data-is-probably-more-expensive
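The comparison method described above (sum each category across companies and models, then take the ratio) can be sketched in a few lines. The figures below are hypothetical placeholders for illustration, not the post's actual estimates; only the aggregation method follows the source.

```python
# Hypothetical figures in USD millions -- placeholders only,
# not the post's actual revenue or compute estimates.
labeling_revenue = {
    "Labeling Co A": 870,
    "Labeling Co B": 1000,
    "Labeling Co C": 100,
}
compute_cost = {
    "Model X": 100,
    "Model Y": 80,
    "Model Z": 120,
}

# Sum each category as the estimate of the market total, then compare.
total_labeling = sum(labeling_revenue.values())
total_compute = sum(compute_cost.values())
ratio = total_labeling / total_compute
print(f"labeling spend is {ratio:.1f}x marginal compute")
```

With the post's real estimates plugged in, the same two sums yield the roughly 3.1x ratio it reports.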
Data Annotation vs Data Labeling: find the right one for you
Key takeaways:
• Understand the core difference between annotation and labeling
• Explore use cases across NLP, computer vision & more
• Learn how each process impacts model training and accuracy
Read now to make smarter data decisions.
"Scale AI is basically a data annotation hub that does essential grunt work for the AI industry. To train an AI model, you need quality data. And for that data to mean anything, an AI model needs to know what it's looking at. Annotators manually go in and add that context.
As is the means du jour in corporate America, Scale AI built its business model on an army of egregiously underpaid gig workers, many of them overseas. The conditions have been described as "digital sweatshops," and many workers have accused Scale AI of wage theft.
It turns out this was not an environment for fostering high-quality work.
According to internal documents obtained by Inc, Scale AI's "Bulba Experts" program to train Google's AI systems was supposed to be staffed with authorities across relevant fields. But instead, during a chaotic 11 months between March 2023 and April 2024, its dubious "contributors" inundated the program with "spam," which was described as "writing gibberish, writing incorrect information, GPT-generated thought processes."
In many cases, the spammers, who were independent contractors who worked through Scale AI-owned platforms like Remotasks and Outlier, still got paid for submitting complete nonsense, according to former Scale contractors, since it became almost impossible to catch them all. And even if they did get caught, some would come back by simply using a VPN.
"People made so much money," a former contributor told Inc. "They just hired everybody who could breathe.""
"The production of artificial intelligence (AI) requires human labour, with tasks ranging from well-paid engineering work to often-outsourced data work. This commentary explores the economic and policy implications of improving working conditions for AI data workers, specifically focusing on the impact of clearer task instructions and increased pay for data annotators. It contrasts rule-based and standard-based approaches to task instructions, revealing evidence-based practices for increasing accuracy in annotation and lowering task difficulty for annotators. AI developers have an economic incentive to invest in these areas as better annotation can lead to higher quality AI systems. The findings have broader implications for AI policy beyond the fairness of labour standards in the AI economy. Testing the design of annotation instructions is crucial for the development of annotation standards as a prerequisite for scientific review and effective human oversight of AI systems in protection of ethical values and fundamental rights."
Meta's Scale AI Gambit Ignites Exodus of Big-Tech Customers and AI Labs
#Meta #ScaleAI #AI #Google #BigTech #AITraining #Zuckerberg #AINeutrality #DataLabeling
According to Reuters, a major shift is underway as Google plans to cut ties with Scale AI, its largest data-labeling partner, following Meta's acquisition of a 49% stake in Scale. This strategic move aims to protect proprietary interests amid rising competitive threats. As Google explores alternatives for AI services, this could significantly impact Scale's revenue and open doors for new competitors. Read more about the implications [here](https://www.cnbc.com/2025/06/14/google-scale-ais-largest-customer-plans-split-after-meta-deal.html). Kudos to Reuters for the insightful coverage! #Google #ScaleAI #Meta #AI #DataLabeling #MachineLearning #BusinessStrategy #Technology #Competitors
Google, Scale AI’s top client, is ending its partnership after Meta acquired a 49% stake in Scale. Microsoft, OpenAI, and xAI are also stepping back, prompting major shifts in AI data-labeling partnerships. #Google #ScaleAI #Meta #AI #DataLabeling #Microsoft #OpenAI #xAI #TechNews
TechXplore: Third-party data annotators often fail to accurately read the emotions of others, study finds. “Machine learning algorithms and large language models (LLMs), such as the model underpinning the functioning of the platform ChatGPT, have proved to be effective in tackling a wide range of tasks. These models are trained on various types of data (e.g., texts, images, videos, and/or […]
I created an offline-ready web app for labeling text data. Select a text file (one entry per line), and assign categories/labels to each entry. Your progress is automatically saved in your browser. Export labeled data as text files (again, one entry per line) or combined CSV. 1/
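The workflow the app describes (one entry per line in, labels assigned per entry, combined CSV out) can be sketched as below. The function names and exact output format here are assumptions for illustration, not the app's actual code.

```python
import csv
import io

def load_entries(text):
    """Parse a text file's contents: one entry per line, blanks skipped."""
    return [line for line in text.splitlines() if line.strip()]

def export_combined_csv(pairs):
    """Write (entry, label) pairs as a combined CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["text", "label"])
    writer.writerows(pairs)
    return buf.getvalue()

# Label two entries and export them as combined CSV.
entries = load_entries("great battery life\n\ntoo expensive\n")
labels = ["positive", "negative"]
pairs = list(zip(entries, labels))
csv_out = export_combined_csv(pairs)
print(csv_out)
```

Exporting back to plain text (one labeled entry per line) would just replace the CSV writer with a `"\n".join(...)` over the pairs.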
"The familiar narrative is that artificial intelligence will take away human jobs: machine-learning will let cars, computers and chatbots teach themselves - making us humans obsolete.
Well, that's not very likely, and we're gonna tell you why. There's a growing global army of millions toiling to make AI run smoothly. They're called "humans in the loop:" people sorting, labeling, and sifting reams of data to train and improve AI for companies like Meta, OpenAI, Microsoft and Google. It's gruntwork that needs to be done accurately, fast, and - to do it cheaply – it's often farmed out to places like Africa –
Naftali Wambalo: The robots or the machines, you are teaching them how to think like human, to do things like human.
We met Naftali Wambalo in Nairobi, Kenya, one of the main hubs for this kind of work. It's a country desperate for jobs… because of an unemployment rate as high as 67% among young people. So Naftali, father of two, college educated with a degree in mathematics, was elated to finally find work in an emerging field: artificial intelligence."
#AI #GenerativeAI #LLMs #AITraining #DataLabeling #GigEconomy: "Who are the workers behind the training datasets powering the biggest LLMs on the market? In this explainer, we delve into data labeling as part of the AI supply chain, the labourers behind this data labeling, and how this exploitative labour ecosystem functions, aided by algorithms and larger systemic governance issues that exploit microworkers in the gig economy.
Key points:
- High-quality training data is the crucial element in producing a performant LLM, and that training data consists of labeled datasets.
- Several digital labour platforms have risen to the task of supplying data labeling for LLM training. However, a lack of transparency and the use of algorithmic decision-making models undergird their exploitative business models.
- Workers are often not informed about who or what they are labeling raw datasets for, and they are subjected to algorithmic surveillance and decision-making systems that produce unstable jobs and unpredictable wages."