"The current landscape of AI agent evaluation faces several critical challenges. Benchmark evaluations tend to focus on accuracy while ignoring costs, leading to uninformative evaluations for downstream developers. What does it mean if an agent has 1% higher accuracy on a benchmark but is 10x more expensive? The lack of standardized evaluation practices makes it difficult to assess real-world capabilities and prevents meaningful comparisons between different approaches. As shown in "AI Agents That Matter" (arXiv:2407.01502), these issues have led to confusion about which advances actually improve performance.
HAL addresses these challenges through two key components: 1) a central leaderboard platform that incorporates cost-controlled evaluations by default, providing clear insight into the cost-performance tradeoffs of different agents, and 2) a standardized evaluation harness that enables reproducible agent evaluations across benchmarks, tracking token usage and execution traces without requiring changes to agent code or tying developers to a particular agent framework. Evaluations can be run locally or in the cloud and fully parallelized.
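To make "no changes to agent code, no required framework" concrete, here is a minimal illustrative sketch. It is not the HAL harness API; all names in it (Task, evaluate, the per-token prices) are hypothetical. The idea is that an agent is just a callable, and the loop around it records token usage and cost alongside accuracy.

```python
# Illustrative sketch only: NOT the HAL harness API. It shows the general shape
# of a framework-agnostic, cost-aware evaluation loop in which an "agent" is a
# plain callable. All names and prices here are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    prompt: str
    expected: str

# Hypothetical per-1K-token prices (USD); real prices depend on the model used.
PRICE_PER_1K_INPUT = 0.0025
PRICE_PER_1K_OUTPUT = 0.01

def evaluate(agent: Callable[[str], tuple[str, int, int]], tasks: list[Task]) -> dict:
    """Run `agent` on every task, recording accuracy and total cost.

    The agent is any callable that maps a prompt to
    (answer, input_tokens, output_tokens); no agent framework is assumed.
    """
    correct, cost = 0, 0.0
    for task in tasks:
        answer, in_tok, out_tok = agent(task.prompt)
        correct += int(answer.strip() == task.expected)
        cost += in_tok / 1000 * PRICE_PER_1K_INPUT + out_tok / 1000 * PRICE_PER_1K_OUTPUT
    return {"accuracy": correct / len(tasks), "total_cost_usd": round(cost, 4)}

if __name__ == "__main__":
    # A trivial stand-in "agent"; a real one would call an LLM or tool chain.
    def echo_agent(prompt: str) -> tuple[str, int, int]:
        return "42", len(prompt.split()), 1

    tasks = [Task("t1", "What is 6 * 7?", "42"), Task("t2", "What is 2 + 2?", "4")]
    print(evaluate(echo_agent, tasks))
```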
TLDR: We aim to standardize AI agent evaluations by providing a third-party platform for comparing agents across a range of benchmarks. Our goal with HAL is to serve as a one-stop shop for agent evaluations that accounts for both accuracy and cost by default. The accompanying HAL harness offers a simple, scalable way to run agent evals, either locally or in the cloud.
https://hal.cs.princeton.edu/