Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations

Hallucinations, or factually inaccurate responses, continue to plague large language models (LLMs). Models falter particularly when they are given more complex tasks and when users are looking for specific and highly detailed responses.

It’s a challenge data scientists have struggled to overcome, and now, researchers from Google DeepMind say they have come a step closer to achieving true factuality in foundation models. They have introduced FACTS Grounding, a benchmark that evaluates LLMs’ ability to generate factually accurate responses based on long-form documents. Models are also judged on whether their responses are detailed enough to provide useful, relevant answers to prompts.

Along with the new benchmark, the researchers have released a FACTS leaderboard to the Kaggle data science community.

As of this week, Gemini 2.0 Flash topped the leaderboard, with a factuality score of 83.6%. Others in the top 9 include Google’s Gemini 1.5 Flash and Gemini 1.5 Pro; Anthropic’s Claude 3.5 Sonnet and Claude 3.5 Haiku; and OpenAI’s GPT-4o, 4o-mini, o1-mini and o1-preview. These all scored 61.7% or higher in terms of accuracy.

The researchers say the leaderboard will be actively maintained and continually updated to include new models and their different iterations.

“We believe that this benchmark fills a gap in evaluating a wider variety of model behaviors pertaining to factuality, in comparison to benchmarks that focus on narrower use cases…such as summarization alone,” the researchers write in a technical paper published this week.

Weeding out inaccurate responses

Ensuring factual accuracy in LLM responses is difficult because of modeling (architecture, training and inference) and measuring (evaluation methodologies, data and metrics) factors. Typically, researchers point out, pre-training focuses on predicting the next token given previous tokens.

“While this objective may teach models salient world knowledge, it does not directly optimize the model towards the various factuality scenarios, instead encouraging the model to generate generally plausible text,” the researchers write.

To address this, the FACTS dataset incorporates 1,719 examples — 860 public and 859 private — each requiring long-form responses based on context in provided documents. Each example includes:

A system prompt (system_instruction) with general directives and the order to only answer based on provided context;
A task (user_request) that includes a specific question to be answered;
A long document (context_document) with necessary information.
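The three fields above can be pictured as a simple record. The following is a minimal sketch, not the benchmark's actual schema or prompt template: the field names come from the list above, while the `GroundingExample` class, the `build_prompt` helper, the section labels and the sample strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundingExample:
    """One benchmark example, using the three field names listed above."""
    system_instruction: str  # general directives, including "answer only from context"
    user_request: str        # the specific question or task
    context_document: str    # the long document holding the needed information


def build_prompt(example: GroundingExample) -> str:
    """Join the three fields into one prompt string.

    The ordering and the "Document:" / "Request:" labels are illustrative
    assumptions, not the benchmark's real template.
    """
    return "\n\n".join([
        example.system_instruction,
        f"Document:\n{example.context_document}",
        f"Request:\n{example.user_request}",
    ])


# Placeholder content, made up for illustration.
example = GroundingExample(
    system_instruction="Answer using only the provided document.",
    user_request="Why did revenue decrease in Q3?",
    context_document="(annual financial report text goes here)",
)
prompt = build_prompt(example)
```

Keeping the document and the request in separate fields makes it straightforward to check later whether each claim in a response traces back to `context_document`.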

To succeed and be labeled “accurate,” the model must process the long-form document and produce a long-form response that is both comprehensive and fully attributable to the document. Responses are labeled “inaccurate” if the model’s claims are not directly supported by the document or if the response is not highly relevant or useful.

For example, a user may ask a model to summarize the main reasons why a company’s revenue decreased in Q3, and provide it with detailed information including a company’s annual financial report discussing quarterly earnings, expenses, planned investments and market analysis.

If a model then, say, returned: “The company faced challenges in Q3 that impacted its revenue,” it would be deemed inaccurate.

“The response avoids specifying any reasons, such as market trends, increased competition or operational setbacks, which would likely be in the document,” the researchers point out. “It doesn’t demonstrate an attempt to engage with or extract relevant details.”

By contrast, if a user prompted, “What are some tips on saving money?” and provided a compilation of categorized money-saving tips for college students, a correct response would be highly detailed: “Utilize free activities on campus, buy items in bulk and cook at home. Also, set spending goals, avoid credit cards and conserve resources.”

DeepMind uses LLMs to judge LLMs

To allow for diverse inputs, researchers included documents of varying lengths, up to 32,000 tokens (or the equivalent of 20,000 words). These cover areas including finance, technology, retail, medicine and law. User requests are also broad, including Q&A generation, requests for summarization and rewriting.

Each example is judged in two phases. First, responses are evaluated for eligibility: If they don’t satisfy user requests, they are disqualified. Second, responses must be hallucination-free and fully grounded in the documents provided.

These factuality scores are calculated by three different LLM judges — specifically Gemini 1.5 Pro, GPT-4o and Claude 3.5 Sonnet — that determine individual scores based on the percentage of accurate model outputs. Subsequently, the final factuality determination is based on an average of the three judges’ scores.
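The two-phase judging and the averaging across judges can be sketched roughly as follows. This is an illustration under stated assumptions, not DeepMind's implementation: the `Verdict` structure, the function names, the judge labels and the verdict data are all hypothetical, and the sketch assumes that a response disqualified in the first phase simply counts as inaccurate.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Verdict:
    """One judge's ruling on one response (hypothetical structure)."""
    eligible: bool  # phase 1: does the response satisfy the user request?
    grounded: bool  # phase 2: is every claim supported by the document?

    @property
    def accurate(self) -> bool:
        # Assumption: a response disqualified in phase 1 counts as inaccurate.
        return self.eligible and self.grounded


def judge_score(verdicts: list[Verdict]) -> float:
    """Share of responses this judge labeled accurate, from 0.0 to 1.0."""
    if not verdicts:
        raise ValueError("a judge needs at least one verdict")
    return sum(v.accurate for v in verdicts) / len(verdicts)


def factuality_score(verdicts_by_judge: dict[str, list[Verdict]]) -> float:
    """Unweighted mean of the per-judge scores."""
    return mean(judge_score(v) for v in verdicts_by_judge.values())


# Made-up verdicts for four responses, only to exercise the functions.
verdicts_by_judge = {
    "judge_a": [Verdict(True, True), Verdict(True, True),
                Verdict(True, False), Verdict(False, True)],
    "judge_b": [Verdict(True, True), Verdict(True, True),
                Verdict(True, True), Verdict(False, False)],
    "judge_c": [Verdict(True, True), Verdict(True, False),
                Verdict(True, False), Verdict(True, True)],
}
final_score = factuality_score(verdicts_by_judge)
```

With these invented verdicts the per-judge scores are 0.5, 0.75 and 0.5, so the averaged score is about 0.583. Averaging keeps any single judge from dominating the result.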

Researchers point out that judge models are often biased towards responses from members of their own model family, by a mean increase of around 3.23%, so the combination of different judges was critical to help ensure responses were indeed factual.

Ultimately, the researchers emphasize that factuality and grounding are key factors to the future success and usefulness of LLMs. “We believe that comprehensive benchmarking methods, coupled with continuous research and development, will continue to improve AI systems,” they write.

However, they also concede: “We are mindful that benchmarks can be quickly overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is just the beginning.”

Daily insights on business use cases with VB Daily

If you want to impress your boss, VB Daily has you covered. We give you the inside scoop on what companies are doing with generative AI, from regulatory shifts to practical deployments, so you can share insights for maximum ROI.

Read our Privacy Policy

Thanks for subscribing. Check out more VB newsletters here.

An error occured.

Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy, bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

Ovintiv raises 2026 guidance on productivity gains

Speaking to analysts and investors on July 24 after Ovintiv reported its second-quarter results, McCracken and his team said the efficiency gains stem from a cocktail of innovations around well designs, development patterns, and the usage of proppants and surfactants, among other things. “It starts with the culture, that relentless

Cisco, AMD bring enterprise-level security, visibility to Ryzen AI Halo systems

“Running more AI locally can help improve responsiveness, keep sensitive data closer to users, and reduce dependence on cloud-only approaches, but enterprises also need a way to monitor and manage these systems at scale. AMD and Cisco are addressing that gap by collaborating to pair high-performance local AI compute with

Cloudflare Internal DNS puts public and private DNS on one policy engine

“Instead of operating two separate DNS systems, customers use one API, one audit trail, one dashboard, and one policy engine for every DNS query—whether it is for a public website or an internal application,” Somoza said. Policy first. The resolver sits ahead of every lookup, not behind it. “Architecturally, Cloudflare Gateway

Atos launches sovereign cloud service to power comeback

Atos has launched a new sovereign cloud platform aimed squarely at European public sector bodies, healthcare providers and defense organizations. It’s the latest effort by European companies in their fight back against US dominance. Atos Sovereign Cloud offers a range of controls for data management, providing customers with resilience and

Energy Secretary Secures Grid Across 17 States Amid Period of Hot Weather

WASHINGTON—The U.S. Department of Energy (DOE) issued an emergency order to keep Americans across 17 states powered during the region’s energy emergency brought on by hot weather conditions. The order directs the Southwest Power Pool, Inc. (SPP) to dispatch specified generation units and to order their operation as needed to maintain reliability. The order also authorizes SPP to direct backup generation resources to operate as a last resort before declaring an Energy Emergency Alert (EEA) 3 or during an EEA 3. The order was issued pursuant to a request from SPP. “The Trump Administration is tapping into an abundant supply of unused backup generation to maintain affordable, reliable, and secure power for hardworking American families and businesses,” said U.S. Secretary of Energy Chris Wright. “The previous administration’s energy subtraction policies weakened the grid, leaving Americans more vulnerable during emergency events. Thanks to President Trump’s leadership, we are reversing those failures and using every available tool to ensure Americans have continued access to affordable, reliable, and secure energy to power and cool their homes.” DOE estimates more than 35 gigawatts (GW) of unused backup generation remain available nationwide. On day one of his second term, President Trump declared a national energy emergency after the Biden administration’s energy subtraction agenda left behind a grid increasingly vulnerable to blackouts. Power outages cost the American people $44 billion per year, according to data from DOE’s National Laboratories. This order mitigates the possibility of power outages in the region and highlights the commonsense policies of the Trump Administration to ensure Americans have access to affordable, reliable, and secure electricity. The order is effective on July 26, 2026, and shall expire at 11:59 PM CDT on August 3, 2026.

Magnolia expands Giddings position with $4-billion WildFire Energy acquisition

In the filing, Magnolia said WildFire’s second-quarter 2025 production is expected to average 53,000 boe/d, about 70% oil, primarily from the Eagle Ford, Austin Chalk, and Woodbine formations. Magnolia said the acquisition would strengthen its position in the Eagle Ford/Austin Chalk trend by expanding its inventory of high-return drilling locations, adding development flexibility and longer laterals, and leveraging its technical expertise to improve well performance and lower costs. “WildFire has a large, low-decline oily PDP base with historic development centered on the Eagle Ford. While there are significant future Eagle Ford development opportunities, our technical teams see extensive future potential in the Austin Chalk with further upside in the Woodbine as well as other appraisal opportunities that should expand on our success in Giddings since 2018,” said Chris Stavros, Magnolia’s chairman, president, and chief executive officer. The deal is expected to result in a pro forma position in Giddings of more than 1.25 million net acres, add more than 500 miles of gas-gathering pipelines, and offer various cost savings, the company said. “Magnolia is guiding to $100 million in run rate synergies by the end of 2027, with savings coming from the chance to deploy long laterals, shared facilities and infrastructure and additional sand sourcing for operations from WildFire’s in-basin mine. As always, successful execution will be key for the longer-term success of the deal,” Enverus’ Dittmar said. Total consideration consists of $2.65 billion in cash, 32.2 million shares of Magnolia Class A common stock, and the assumption of $600 million of outstanding debt.

Vår Energi inks deal to acquire BlueNord

Vår Energi ASA has agreed to buy BlueNord ASA as part of a proposed merger that, if completed, will expand Vår Energi’s presence beyond the Norwegian Continental Shelf (NCS), positioning the operator as Europe’s largest independent oil and gas producer. Acqusition of BlueNord would add producing assets on the Danish Continental Shelf (DCS) to Vår Energi’s current holdings, with the combined post-merger portfolio anticipated to lift long-term production to about 450,000 boe/d, with about 2.4 billion boe of reserves and resources and an estimated reserve and resource life of about 15 years. BlueNord’s portfolio includes interests in the Tyra, Halfdan, Dan, and Gorm hub areas, which are part of the Danish Underground Consortium operated by TotalEnergies SE. The assets are expected to contribute about 45,000 boe/d of net production beginning in 2026 and include about 195 million boe of net 2P reserves and 2C contingent resources, extending production beyond 2040. “The transaction marks a significant milestone in Vår Energi’s growth journey, creating the largest independent producer of oil and gas in Europe with a long-term production target of [about 450,000 b/d] and reinforcing our role as a reliable and secure supplier of energy to Europe,” said Nick Walker, Vår Energi’s chief executive officer. Vår Energi said the DCS assets complement its existing North Sea operations because of their geological, operational, and fiscal similarities to the NCS. The combination also expands the company’s exposure to European natural gas markets through access to the Nybro and Den Helder gas delivery points. The combined portfolio would maintain a production mix of about 65% oil and 35% natural gas, with operating costs projected to remain at $10-11/boe. The proposed merger remains subject to approval by BlueNord shareholders, regulatory and governmental approvals, license and partner consent, and other customary conditions. If approved, the companies said

Bahrain’s GPIC enlists Fluor for new unit at Sitra complex

Gulf Petrochemical Industries Co. (GPIC) has awarded Fluor Corp. a contract to execute front-end engineering and design (FEED) for a proposed aromatics plant to be built at GPIC’s petrochemicals complex located across 60 hectares of reclaimed land in Sitra, Bahrain. As part of the contract, Fluor will deliver a FEED study based on commercially proven process technologies for the plant’s targeted production of 1.2 million tonnes/year (tpy) of paraxylene and 500,000 tpy of benzene, the service provider said on July 21. Critical building blocks for plastics, polyester fibers, and packaging materials, paraxylene and benzene production from the plant would help meet global demand for high‑performance consumer and industrial products, as well as expand capabilities of GPIC’s current operations at Sitra, Fluor said. GPIC’s existing complex currently uses a feedstock of natural gas domestically produced in Bahrain to produce about 1.2 million tonnes/day of ammonia, 1.2 million tonnes/day of methanol, and 1.7 million tonnes/day of urea. Neither Fluor nor GPIC revealed details regarding a timeline for completion of the proposed aromatics plant. GPIC is a joint venture of Bahrain Petroleum Co. (33.3%), SABIC Agri-Nutrients Investment Co. (33.3%), and Kuwait’s Petrochemical Industries Co. (PIC; 33.3%).

Oil prices surge as Hormuz, Bab el-Mandeb risks escalate amid renewed US–Iran tensions

Oil prices jumped on Wednesday, July 22, with escalating geopolitical tensions and mounting risks to key maritime chokepoints driving the rally. International Brent crude rose nearly 5% to above $95/bbl, its highest level in almost 6 weeks, while US crude climbed more than 4% to above $88/bbl. The gains extend a strong upward trend, with prices up about 30% since the start of the month and more than 55% year to date, reversing declines seen after a mid-June memorandum of understanding (MOU) between the US and Iran. Stay updated on oil price volatility, shipping disruptions, LNG market analysis, and production output through OGJ’s Iran war content hub. The earlier agreement, aimed at de-escalating conflict and reopening the Strait of Hormuz, was declared “over” on July 8 by President Donald Trump. Since then, hostilities have intensified, with US forces carrying out an 11th consecutive night of strikes on Iran. Comments from US Secretary of State Marco Rubio further dampened expectations for near-term diplomacy, noting that while Washington remains open to talks, Iran does not appear to be engaging seriously. At the same time, security risks to global shipping have increased. The UK Maritime Trade Operations (UKMTO) has reported multiple recent attacks on vessels in the region, including incidents that forced crews to abandon ships. As a result, traffic through the Strait of Hormuz has fallen sharply, with just 13 vessels transiting Monday and 9 on Tuesday, according to MarineTraffic data. Concerns are also growing at the Bab el-Mandeb Strait, another critical oil transit route linking the Red Sea to the Gulf of Aden. Iranian-backed Houthi forces in Yemen have threatened a maritime blockade targeting Saudi Arabia, raising fears of broader supply disruptions. While vessel traffic through Bab el-Mandeb remains relatively steady—73 ships transited Tuesday—it has edged lower and signs of hesitation among

Global LNG trade hits record in 2025 as 2026 tests market resilience

Global LNG trade reached a record 437 million tonnes in 2025, up 6.3% year on year (y-o-y) and marking the fastest growth since 2022, according to the International Gas Union’s (IGU) World LNG Report 2026. The increase of roughly 25 million tonnes was driven primarily by rising US supply, alongside higher exports from Qatar, Malaysia, Angola, and Nigeria. Canada and the Mauritania–Senegal project also shipped their first LNG cargoes, expanding the pool of exporting countries. Investment kept pace with market growth. Developers sanctioned 68.4 million tonnes/year (tpy) of new liquefaction capacity in 2025—the highest annual total since 2019—bringing approvals over the 2021–25 period to about 206 million tpy, roughly double the volume sanctioned in the previous 5-year cycle. Much of the new capacity was concentrated in US Gulf Coast projects. The outlook for 2026, however, is more uncertain. The Middle East conflict has knocked Qatar and the UAE—together about 16% of global liquefaction capacity—off the market for periods this year, and missile strikes on Qatar’s Ras Laffan complex are expected to keep roughly 12.8 million tpy of capacity offline for 3-5 years. Shell PLC’s separately published LNG Outlook 2026 is blunter about the near-term picture: Depending on how quickly the Strait of Hormuz reopens, 2026 could see global LNG trade contract year-on-year—something that’s never happened before in the past decade of rapid growth Shell has tracked. The Asia Pacific has absorbed most of the supply shock so far, responding through storage draws, fuel switching, demand curtailment and increased spot buying, while a wave of US cargoes has been rerouted from Europe toward Asia to fill the gap. Despite near-term volatility, both reports highlight a strong long-term trajectory. IGU expects global LNG supply capacity, including existing and under-construction projects, to exceed 700 million tonnes by 2030, a roughly 40% increase from

Up to 50% of data center capacity slated for 2026 could be delayed

A primary obstacle is electricity. After a number of instances where local citizens saw their electric bill skyrocket after a data center opened up shop in their neighborhood, there has been tremendous pushback from cities and states on large scale data centers. In some instances, operators are being required to provide their own power rather than get power from the public grid, according to Currence. Although projects powered entirely by on-site generation or hybrid systems account for fewer than 10% of announced facilities, they represent nearly half of the total announced capacity, according to the report. Mindful of their public image, hyperscalers are responding quickly to these demands. Google has expanded its strategy by acquiring a large renewable energy development pipeline, while Amazon has increased direct investments in solar generation and battery storage.

When Buildability Breaks: What Prince William and New York Signal for Data Center Development

For several years, the Prince William Digital Gateway represented data center ambition at its largest scale: a proposed 2,100-acre technology corridor near Gainesville, Virginia, capable of accommodating tens of millions of square feet of digital infrastructure. Its location also made it uniquely contentious. The corridor bordered Manassas National Battlefield Park and other historic, environmental and residential resources, drawing the data center development debate beyond its usual industry and land-use constituencies. Opposition increasingly centered not only on the project’s scale, but on whether development of that magnitude belonged alongside one of the country’s most significant Civil War landscapes. In July 2026, that vision effectively ended. QTS Data Centers terminated its participation in the Digital Gateway and withdrew its remaining petitions before the Supreme Court of Virginia. The decision followed Compass Datacenters’ withdrawal in April, leaving neither of the project’s original developers pursuing the corridor. QTS said it reached the decision after “careful consideration,” while emphasizing that Virginia remains an important market for the company. From Proposed Capacity to Executable Capacity The collapse of the Digital Gateway is more than the cancellation of one unusually large development. It comes as the data center industry confronts a widening gap between announced capacity and executable capacity. Power remains the most visible constraint. But permitting discipline, environmental review, community acceptance and the durability of political support are increasingly determining whether a project can progress from land control and conceptual capacity to construction and operation. A separate development in New York underscored that shift less than two weeks after QTS withdrew. On July 14, Gov. 
Kathy Hochul issued Executive Order 62, establishing what the state describes as the nation’s first statewide moratorium on new hyperscale data centers. The order temporarily holds in abeyance certain incomplete state environmental permit applications for data centers capable of drawing at

Q&A: Google’s AI and computing chief talks about its shapeshifting data centers

Mark Lohmeyer: We’ve seen the rise of agents and agentic use cases. Years ago, it was the chat phase: Ask a question, get an answer. Now we’re in the agentic era, where you express your intent, agents spin off multiple sub-agents, working in parallel, preserving state. This is a radical shift in what infrastructure needs to do; make them fast, cost effective, secure, reliable. We’re delivering infrastructure optimized for the age of agents. NW: What’s the goal of the infrastructure buildout, and what should customers expect regarding costs? ML: Ultimately, it’s about enabling customers with leading-edge capabilities and models at scale cost-effectively. With agents, inference transactions increase by 50x, 100x versus non-agentic workloads. We’re driving the cost per transaction down exponentially. In our latest platforms, we reduce the cost by almost 2x for the same work. Customers serve twice the number of users at the same cost, directly driving profitability.

Google transforms its data center architecture for agent era

Google adjusted the Google Kubernetes Engine into an agent-native environment, where agents could be quickly spun up in sandboxes and containers. “From an infrastructure perspective, you need to spin up a bunch of TPUs or GPUs very rapidly. Then you need to be able to run them and spin them back down,” Lohmeyer said. Google also made drastic improvements to its silicon to support its middleware changes. It recently introduced new AI chips, with the TPU-8t for training, and TPU-8i for inference. The 8t chip has three times more computing power than the previous-generation Ironwood chip. The 8i chip has 384 megabytes of SRAM and 288GB of HBM3e memory, which is 50% more than the previous-generation chip. The platform is optimized for KV cache (key-value cache), which stores important contextual information needed by agents to make decisions, which reduces the round trips to other memory and storage systems. “Being able to store more of the KV cache directly on the chip allows you to respond much more rapidly and cost-effectively,” Lohmeyer said.

10 Reasons You Cannot Afford to Miss DCF Trends Summit 2026

The data center industry has no shortage of AI infrastructure ambition. What it lacks is certainty. Power is harder to secure. Designs are advancing faster than facilities can be built. Supply chains remain vulnerable. Liquid cooling is adding operational demands. Projects that look viable on paper can still stall on permitting, commissioning or community opposition. The question in 2026 is no longer how large the AI opportunity may become. It is what can actually be delivered, and who has learned how to deliver it. That question defines the 2026 Data Center Frontier Trends Summit, August 4–6 at the Hyatt Regency Reston. Across three days, the people building, powering, financing and operating next-generation infrastructure will examine what is working, where execution is failing and how the market is responding. This is not a conference about whether AI will create demand. It is about who will be able to meet it. The advantage will belong to those who join the conversation before its conclusions become market consensus. Here are 10 reasons to be in the room. 1. The industry has entered the execution era For several years, the market has been defined by projected demand, capacity, density and investment. The next phase will be defined by execution. AI data center announcements remain abundant. Energized, commissioned and operational capacity is harder to find. DCFTS begins with a live editorial calibration, followed by “The New Geography of AI,” featuring EdgeCore CEO Lee Kestler, Data Center Frontier founder Rich Miller and DCF Editor in Chief Matt Vincent. The focus: how power, entitled land, utility partnerships and execution speed are determining where AI capacity can be built—and who can deliver it. Demand creates opportunity. Execution determines who captures it. 2. Power will be treated as the foundation of AI strategy Power is no longer one workstream

Time to Power: Sage Geosystems CEO Cindy Taff on Geothermal’s AI Infrastructure Moment

Three years ago, the data center industry’s energy conversation was largely framed around emissions. Hyperscale operators were setting carbon-free energy targets, signing renewable power agreements, and aligning their expanding infrastructure portfolios with corporate sustainability commitments. The arrival of generative AI has not eliminated those priorities. But it has reordered them. “Three years ago, data center energy, they were really focused on low emissions, no emissions,” said Cindy Taff, CEO of Sage Geosystems. “Now the primary challenge is just enough energy.” Speaking on the Data Center Frontier Show podcast, Taff described an energy market being reshaped by the speed and physical scale of AI infrastructure development. After decades of relatively flat U.S. electricity demand, AI has introduced a new class of concentrated, rapidly arriving industrial load. The result is a shift away from thinking only about how much generating capacity exists in aggregate and toward a harder question: Can usable power be delivered at a specific site, on a predictable schedule, in the quantities an AI campus requires? For hyperscalers, neocloud providers, data center developers, utilities, and energy companies, that distinction is becoming central to project execution. “I think time to power is the most precious metric right now versus cost or total capacity,” Taff said. Capacity on Paper Is Not Power at the Site Announcements of new generation can create the appearance of an energy system capable of meeting rising data center demand. But a megawatt located far from a planned campus, trapped behind a transmission constraint, or unavailable until the next decade has limited value to a developer trying to energize an AI facility within several years. “Aggregate capacity is not going to solve the problem if the power really isn’t where and when you need it,” Taff said. Data centers are large physical facilities tied to specific parcels,

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one that is ramping up its investments into AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs). In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple would between them devote $200 billion to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are way higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

John Deere unveils more autonomous farm machines to address skill labor shortage

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. The Moline, Illinois-based John Deere has been in business for 187 years, yet it’s been a regular as a non-tech company showing off technology at the big tech trade show in Las Vegas and is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech. The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually; and the agricultural work force continues to shrink. (This is my hint to the anti-immigration crowd). John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app. While each of these industries experiences their own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% percent of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

2025 playbook for enterprise AI success, from agents to evals

2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year.

1. Agents: the next generation of automation

AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for companies, and that recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more. The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks.

Going all-in on red teaming pays practical, competitive dividends

It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S.’s National Institute of Standards and Technology (NIST), which all had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find.
What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle
