The adoption of interoperability standards, such as the Model Context Protocol (MCP), can provide enterprises with insights into how agents and models function outside their own walls. However, many benchmarks fail to capture real-life interactions with MCP.
Salesforce AI Research has developed a new open-source benchmark it calls MCP-Universe, which aims to track LLMs as they interact with MCP servers in the real world, arguing that it paints a better picture of real-life, real-time interactions between models and the tools enterprises actually use. In its initial testing, Salesforce found that models like OpenAI’s recently released GPT-5 are strong, but still fall short in real-world scenarios.
“Existing benchmarks predominantly focus on isolated aspects of LLM performance, such as instruction following, math reasoning, or function calling, without providing a comprehensive assessment of how models interact with real-world MCP servers across diverse scenarios,” Salesforce said in a paper.
MCP-Universe captures model performance through tool usage, multi-turn tool calls, long context windows and large tool spaces. It’s grounded on existing MCP servers with access to actual data sources and environments.
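To make that grounding concrete, here is a minimal sketch (not MCP-Universe's own code) of how an agent harness can connect to a live MCP server and call one of its tools using the official MCP Python SDK; the server launch command and the tool name are assumptions for illustration.

```python
# Illustrative only: connecting to a live MCP server with the official
# MCP Python SDK and invoking one of its tools. The server launch command
# and tool name are assumptions, not MCP-Universe internals.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch a locally installed MCP server over stdio (hypothetical choice).
    server = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-github"],
    )

    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # The large "tool space" the benchmark stresses: every tool the
            # server exposes would be offered to the model for selection.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # A single tool call; in the benchmark, the model itself chooses
            # the tool and arguments, often across many turns.
            result = await session.call_tool(
                "search_repositories", {"query": "model context protocol"}
            )
            print(result.content)


asyncio.run(main())
```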
Junnan Li, director of AI research at Salesforce, told VentureBeat that many models “still face limitations that hold them back on enterprise-grade tasks.”
“Two of the biggest are long-context challenges, where models can lose track of information or struggle to reason consistently when handling very long or complex inputs,” Li said, “and unknown-tool challenges, where models often aren’t able to seamlessly use unfamiliar tools or systems the way humans can adapt on the fly. This is why it’s crucial not to take a DIY approach with a single model to power agents alone, but instead to rely on a platform that combines data context, enhanced reasoning, and trust guardrails to truly meet the needs of enterprise AI.”
MCP-Universe joins other proposed MCP-based benchmarks, such as MCP-Radar from the University of Massachusetts Amherst and Xi’an Jiaotong University, as well as MCPWorld from the Beijing University of Posts and Telecommunications. It also builds on MCPEvals, a benchmark Salesforce released in July that focuses mainly on agents. Li said the biggest difference between MCP-Universe and MCPEvals is that the latter is evaluated with synthetic tasks.
How it works
MCP-Universe evaluates how well each model performs a series of tasks that mimic those undertaken by enterprises. Salesforce said it designed MCP-Universe around six core enterprise domains: location navigation, repository management, financial analysis, 3D design, browser automation and web search. The benchmark connects to 11 MCP servers for a total of 231 tasks.
- Location navigation focuses on geographic reasoning and the execution of spatial tasks. The researchers tapped the Google Maps MCP server for this process.
- The repository management domain looks at codebase operations and connects to the GitHub MCP to expose version control tools like repo search, issue tracking and code editing.
- Financial analysis connects to the Yahoo Finance MCP server to evaluate quantitative reasoning and financial market decision-making.
- 3D design evaluates the use of computer-aided design tools through the Blender MCP.
- Browser automation, connected to Playwright’s MCP, tests browser interaction.
- The web searching domain employs the Google Search MCP server and the Fetch MCP to check “open-domain information seeking” and is structured as a more open-ended task.
Salesforce said it had to design new MCP tasks that reflect real use cases. For each domain, the researchers created four to five kinds of tasks they expect LLMs to be able to complete. For example, one task assigned the models a goal that involved planning a route, identifying the optimal stops and then locating the destination.
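To illustrate what one of these task definitions might look like in practice, here is a hypothetical, simplified record for the route-planning example above; the field names, server labels and evaluator hooks are assumptions for illustration, not the benchmark's actual schema.

```python
# Hypothetical, simplified sketch of a location-navigation task record.
# Field names and structure are assumptions, not MCP-Universe's real schema.
from dataclasses import dataclass


@dataclass
class MCPTask:
    domain: str             # e.g. "location_navigation"
    mcp_servers: list[str]  # real MCP servers the agent may call
    instruction: str        # natural-language goal given to the model
    evaluators: list[str]   # which checks score the final answer


route_planning_task = MCPTask(
    domain="location_navigation",
    mcp_servers=["google-maps"],
    instruction=(
        "Plan a driving route from the office to the airport, "
        "pick the two optimal rest stops, and return the destination "
        "address in the required JSON format."
    ),
    evaluators=["format", "dynamic"],  # the answer depends on live map data
)
```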

Each model is evaluated on how it completes the tasks. Li and his team opted for an execution-based evaluation paradigm rather than the more common LLM-as-a-judge approach. The researchers noted that the LLM-as-a-judge paradigm “is not well-suited for our MCP-Universe scenario, since some tasks are designed to use real-time data, while the knowledge of the LLM judge is static.”
Salesforce researchers used three types of evaluators: format evaluators to check that agents and models follow format requirements, static evaluators to check answers that remain stable over time, and dynamic evaluators for answers that fluctuate, such as flight prices or GitHub issues.
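As a rough sketch of how such execution-based checks can be composed (illustrative only, not Salesforce's implementation), each evaluator type can be modeled as a small callable that inspects the agent's final answer or re-queries the live environment at grading time:

```python
# Illustrative sketch of the three evaluator styles described above;
# names and signatures are assumptions, not MCP-Universe's actual API.
import json
from typing import Callable, Protocol


class Evaluator(Protocol):
    def __call__(self, answer: str) -> bool: ...


def format_evaluator(answer: str) -> bool:
    """Format check: did the agent return valid JSON with the required key?"""
    try:
        return "destination" in json.loads(answer)
    except json.JSONDecodeError:
        return False


def make_static_evaluator(expected: str) -> Evaluator:
    """Static check: compare against a ground truth that does not change."""
    return lambda answer: expected.lower() in answer.lower()


def make_dynamic_evaluator(fetch_live_value: Callable[[], str]) -> Evaluator:
    """Dynamic check: re-query the live source (e.g. a flight price or a
    GitHub issue state) at grading time and compare it to the answer."""
    return lambda answer: fetch_live_value() in answer


def score(answer: str, evaluators: list[Evaluator]) -> bool:
    # A task passes only if every configured evaluator passes.
    return all(evaluator(answer) for evaluator in evaluators)
```

In this composition, the dynamic evaluator re-checks live sources at grading time, which is exactly what an LLM judge with static knowledge cannot do.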
“MCP-Universe focuses on creating challenging real-world tasks with execution-based evaluators, which can stress-test the agent in complex scenarios. Furthermore, MCP-Universe offers an extendable framework/codebase for building and evaluating agents,” Li said.
Even the big models have trouble
To test MCP-Universe, Salesforce evaluated several popular proprietary and open-source models. These include Grok-4 from xAI; Anthropic’s Claude 4 Sonnet and Claude 3.7 Sonnet; OpenAI’s GPT-5, o4-mini, o3, GPT-4.1, GPT-4o and GPT-oss; Google’s Gemini 2.5 Pro and Gemini 2.5 Flash; GLM-4.5 from Zai; Moonshot’s Kimi-K2; Qwen’s Qwen3-Coder and Qwen3-235B-A22B-Instruct-2507; and DeepSeek-V3-0324 from DeepSeek. Each model tested had at least 120B parameters.
In its testing, Salesforce found GPT-5 had the best success rate, performing especially well on financial analysis tasks. Grok-4 followed, beating all other models on browser automation, and Claude 4 Sonnet rounded out the top three, though it did not top either of the two models ahead of it in any category. Among open-source models, GLM-4.5 performed best.

However, MCP-Universe showed the models had difficulty handling long contexts, especially for location navigation, browser automation and financial analysis, where efficiency fell significantly. Performance also dropped the moment the LLMs encountered unknown tools. Overall, the models struggled to complete more than half of the tasks that enterprises typically perform.

“These findings highlight that current frontier LLMs still fall short in reliably executing tasks across diverse real-world MCP tasks. Our MCP-Universe benchmark, therefore, provides a challenging and necessary testbed for evaluating LLM performance in areas underserved by existing benchmarks,” the paper said.
Li told VentureBeat that he hopes enterprises will use MCP-Universe to gain a deeper understanding of where agents and models fail on tasks so that they can improve either their frameworks or the implementation of their MCP tools.
