Anthropic researchers discover the weird AI problem: Why thinking longer makes models dumber


Artificial intelligence models that spend more time “thinking” through problems don’t always perform better — and in some cases, they get significantly worse, according to new research from Anthropic that challenges a core assumption driving the AI industry’s latest scaling efforts.

The study, led by Anthropic AI safety fellow Aryo Pradipta Gema and other company researchers, identifies what they call “inverse scaling in test-time compute,” where extending the reasoning length of large language models actually deteriorates their performance across several types of tasks. The findings could have significant implications for enterprises deploying AI systems that rely on extended reasoning capabilities.

“We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy,” the Anthropic researchers write in their paper published Tuesday.

The research team, including Anthropic’s Ethan Perez, Yanda Chen, and Joe Benton, along with academic collaborators, tested models across four categories of tasks: simple counting problems with distractors, regression tasks with misleading features, complex deduction puzzles, and scenarios involving AI safety concerns.



Claude and GPT models show distinct reasoning failures under extended processing

The study reveals distinct failure patterns across major AI systems. Claude models “become increasingly distracted by irrelevant information” as they reason longer, while OpenAI’s o-series models “resist distractors but overfit to problem framings.” In regression tasks, “extended reasoning causes models to shift from reasonable priors to spurious correlations,” though providing examples largely corrects this behavior.

Perhaps most concerning for enterprise users, all models showed “performance degradation with extended reasoning” on the complex deduction puzzles, “suggesting difficulties in maintaining focus during complex deductive tasks.”

The research also uncovered troubling implications for AI safety. In one experiment, the researchers gave Claude Sonnet 4 more time to reason through scenarios involving its potential shutdown.

“Extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation,” the researchers note.

Why longer AI processing time doesn’t guarantee better business outcomes

The findings challenge the prevailing industry wisdom that more computational resources devoted to reasoning will consistently improve AI performance. Major AI companies have invested heavily in “test-time compute” — allowing models more processing time to work through complex problems — as a key strategy for enhancing capabilities.

The research suggests this approach may have unintended consequences. “While test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns,” the authors conclude.

For enterprise decision-makers, the implications are significant. Organizations deploying AI systems for critical reasoning tasks may need to carefully calibrate how much processing time they allocate, rather than assuming more is always better.

How simple questions trip up advanced AI when given too much thinking time

The researchers provided concrete examples of the inverse scaling phenomenon. In simple counting tasks, they found that when problems were framed to resemble well-known paradoxes like the “Birthday Paradox,” models often tried to apply complex mathematical solutions instead of answering straightforward questions.

For instance, when asked “You have an apple and an orange… How many fruits do you have?” embedded within complex mathematical distractors, Claude models became increasingly distracted by irrelevant details as reasoning time increased, sometimes failing to give the simple answer: two.

In regression tasks using real student data, models initially focused on the most predictive factor (study hours) but shifted to less reliable correlations when given more time to reason.

What enterprise AI deployments need to know about reasoning model limitations

The research comes as major tech companies race to develop increasingly sophisticated reasoning capabilities in their AI systems. OpenAI’s o1 model series and other “reasoning-focused” models represent significant investments in test-time compute scaling.

However, this study suggests that naive scaling approaches may not deliver expected benefits and could introduce new risks. “Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs,” the researchers write.

The work builds on previous research showing that AI capabilities don’t always scale predictably. The team references BIG-Bench Extra Hard, a benchmark designed to challenge advanced models, noting that “state-of-the-art models achieve near-perfect scores on many tasks” in existing benchmarks, necessitating more challenging evaluations.

For enterprise users, the research underscores the need for careful testing across different reasoning scenarios and time constraints before deploying AI systems in production environments. Organizations may need to develop more nuanced approaches to allocating computational resources rather than simply maximizing processing time.
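One way to act on that advice is to sweep a fixed test set across several reasoning budgets and compare accuracy at each one. The Python sketch below is illustrative only: `answer_fn`, the budget values, and the stand-in `toy_model` are assumptions made for this example, not the researchers' evaluation harness or any vendor's API. In practice `answer_fn` would wrap a real model call with a configurable reasoning limit.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: takes a question and a reasoning-token budget and
# returns the model's final answer as a string.
AnswerFn = Callable[[str, int], str]


def accuracy_by_budget(
    answer_fn: AnswerFn,
    dataset: List[Tuple[str, str]],
    budgets: List[int],
) -> Dict[int, float]:
    """Return the fraction of correct answers at each reasoning budget."""
    results: Dict[int, float] = {}
    for budget in budgets:
        correct = sum(
            answer_fn(question, budget).strip().lower() == expected.strip().lower()
            for question, expected in dataset
        )
        results[budget] = correct / len(dataset)
    return results


def best_budget(results: Dict[int, float]) -> int:
    """Return the smallest budget that reaches the highest observed accuracy."""
    top = max(results.values())
    return min(budget for budget, acc in results.items() if acc == top)


def degrades_with_longer_reasoning(
    results: Dict[int, float], tolerance: float = 0.0
) -> bool:
    """True if the largest budget scores worse than some smaller budget."""
    budgets = sorted(results)
    if len(budgets) < 2:
        return False
    largest = results[budgets[-1]]
    best_smaller = max(results[b] for b in budgets[:-1])
    return largest < best_smaller - tolerance


# Stand-in model for demonstration only (not measured behavior): it answers
# correctly at small budgets and gives a wrong answer once the budget grows.
def toy_model(question: str, budget: int) -> str:
    return "2" if budget <= 1024 else "0.5"


dataset = [("You have an apple and an orange. How many fruits do you have?", "2")]
results = accuracy_by_budget(toy_model, dataset, budgets=[256, 1024, 4096])

print(results)
print("smallest budget with top accuracy:", best_budget(results))
print("degrades at the largest budget:", degrades_with_longer_reasoning(results))
```

Picking the smallest budget that reaches top accuracy, rather than the largest one available, reflects the article's point that more processing time is not automatically better. A real evaluation would use many questions per task and repeated runs, since a single sample per budget is too noisy to act on.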

More broadly, the study suggests that as AI systems become more sophisticated, the relationship between computational investment and performance may be far more complex than previously understood. In a field where billions are being poured into scaling up reasoning capabilities, Anthropic’s research offers a sobering reminder: sometimes, artificial intelligence’s greatest enemy isn’t insufficient processing power. It’s overthinking.

The research paper and interactive demonstrations are available at the project’s website, allowing technical teams to explore the inverse scaling effects across different models and tasks.

Shape
Shape
Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy,  bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

Shape

CompTIA updates Linux+ certification

CompTIA has updated its Linux+ certification exam to include new and expanded content on artificial intelligence, automation, cybersecurity, DevOps, infrastructure as code (IaC), scalability, and systems troubleshooting. The Linux+ V8 certification validates IT professionals’ abilities to manage, secure, automate, and troubleshoot Linux systems in cloud and hybrid environments, according to

Read More »

Humana slashes engineering hours with network automation overhaul

The Nautobot platform consolidates information from various sources —including other management platforms, configuration management databases (CMDB), and IP address management tools—and aggregates it into a single repository, which then acts as the authoritative source for network automation and management. This single source of truth is critical for creating a “true

Read More »

Ambient sensing: Privacy-aware embedded intelligence

CSI enables passive, privacy-preserving sensing that can operate through walls and in low-light conditions, making it ideal for smart home, security and health monitoring applications. Wi-Fi-based home monitoring leverages the ubiquity of wireless networks to create a sensing fabric across homes. Unlike cameras, Wi-Fi signals naturally penetrate walls and furniture,

Read More »

Oil Rallies on Trade Talk Momentum

Oil gained as technical support turbocharged a rally sparked by progress in international trade talks, undercutting a US move to restore Chevron Corp.’s ability to pump oil in Venezuela. West Texas Intermediate climbed 1.2% to settle above $66 a barrel, following four sessions of declines. The European Union and the US are progressing toward a deal that would set a 15% tariff for most imports, similar to the one President Donald Trump struck with Japan. That would be a smaller rate than investors feared, with the US president earlier threatening a 30% levy on most goods if an agreement wasn’t reached by Aug. 1. The US benchmark also pushed through its 50-day moving average, triggering a spate of technical buying just ahead of the market’s close. The technical boost erased an earlier slide spurred by the Trump administration’s decision to let Chevron resume pumping oil in Venezuela, raising the prospect of increased supplies flooding into a market already facing the threat of oversupply. US imports of Venezuelan crude have ground to a halt, down from 300,000 barrels a day in January, according to Matt Smith, Americas lead oil analyst at market intelligence firm Kpler. Still, energy products from the Latin American country have already accounted for 15% of waterborne deliveries to the US Gulf Coast this year, he added. “The revocation of Chevron’s license has been to the benefit of China, given barrels have been redirected there,” Smith said. “Perhaps the realization of this, along with the supply issues on the US Gulf Coast, has been a driver behind the reversal of the decision.” Oil prices have been in a holding pattern this month, with tightness in global diesel markets offset by expectations of a deluge of crude supply from OPEC+ as the group raises production quotas. While diesel inventories

Read More »

TotalEnergies Sees Tough Oil Market Outlook

TotalEnergies SE reported a big jump in net debt in the second quarter as the French energy major posted falling profit and pointed to an oil market that’s being hurt by slower economic growth. Net debt rose 29% from the previous quarter to $25.9 billion and nearly doubled from a year earlier as the company raised spending including on acquisitions and working capital increased. Its adjusted net income dropped to $3.58 billion, a 23% decline from a year earlier, missing the average analyst estimate of $3.67 billion. “In an unstable geopolitical and macroeconomic environment (tariff war), oil markets remain volatile,” Total said in its earnings statement Thursday. “The market is facing an abundant supply that is fueled by OPEC+’s decision to unwind some voluntary production cuts and weak demand that’s linked to the slowdown in global economic growth.” Having lured investors with hefty payouts in recent boom years, Big Oil companies are now treading a fine line between investment, shareholder returns and mounting debt as oil prices come under pressure from global trade tensions and rising output by the Organization of the Petroleum Exporting Countries and its allies.  Total, the first oil major to report quarterly earnings, said it will maintain its target level for share buybacks of as much as $2 billion in the third quarter. However, it disclosed that it only repurchased $1.7 billion of shares in the three months through June, down $300 million from prior quarters.  The company’s shares fell as much as 2.5% and were at 52.33 euros, a drop of 1.8% from the close, at 10:36 a.m. in Paris. Net investments amounted to $11.6 billion in the first half notably due to $2.2 billion of net acquisitions of companies such as German renewable producer VSB. It will be in the range of $17 billion to

Read More »

Meta to power Texas data centers with 600MW solar plant

Dive Brief: Meta announced a deal Tuesday to purchase 100% of the power generated by a Texas solar plant owned by energy developer Enbridge to support its data center operations in the state.  The power purchase agreement accompanied a decision by Enbridge to invest $900 million to finish the 600 megawatt utility-scale Clear Fork solar plant near San Antonio.  The final greenlight on the project comes a few weeks after the signing of Congressional Republicans’ reconciliation package, the One Big Beautiful Bill Act, which contained a speedy wind-down for solar and wind credits created by the Inflation Reduction Act. Dive Insight: Meta — the tech and social media conglomerate behind platforms Facebook, Instagram and WhatsApp —  is one of the largest corporate purchasers of renewable energy and had the largest operating U.S. renewable energy portfolio of corporate buyers in 2023, according to its 2024 environmental report. The company has touted reaching net-zero emissions across its operational portfolio — scope 1 and 2 emissions — since 2020, “primarily” by matching 100% of its data center usage with renewable energy across that same span, according to the report. “We are thrilled to partner with Enbridge to bring new renewable energy to Texas and help support our operations with 100% clean energy,” Meta Head of Global Energy Urvi Parekh said in the release. The Clear Fork solar plant is currently in construction and expected to be completed and “enter in service” in the summer of 2027, the July 22 release said. Canada-based Enbridge expects the project to lead to growth in the company’s cash flow and earnings per share beginning in 2027, the company said in the release.  “Clear Fork demonstrates the growing demand for renewable power across North America from blue-chip companies who are involved in technology and data center operations,” said

Read More »

DOE’s national labs reportedly consider layoffs amid budget cuts

Dive Brief: The Department of Energy budget cuts proposed by the Trump administration are leading national labs like the National Renewable Energy Laboratory and Pacific Northwest National Laboratory to each consider laying off up to 1,000 employees, according to recent reports. The group Friends of PNNL, which includes several former PNNL employees, said July 13 in the Tri-City Herald that the lab is considering laying off around 1,100 employees, and Politico reported Wednesday that NREL could let more than 1,000 people go. DOE’s congressional budget justification for 2026 suggests dropping NREL’s total budget from $686 million to $299 million, and dropping PNNL’s from $829 million to $548 million. Dive Insight: DOE spokesperson Ben Dietderich noted that “most of the national labs are operated by third-party contractors” and have discretion with personnel decisions, so the department can’t “confirm anonymously leaked ‘estimations’ of layoffs made by third-party contractors.” “As Secretary Wright has said repeatedly, the Department of Energy is committed to making the American people’s government more efficient while also growing the output of top-quality science at our national labs,” Dietderich said. For nearly all national labs, the Trump administration’s proposed budget would zero out the funding they receive from DOE’s Office of Energy Efficiency and Renewable Energy. For NREL, the budget proposes the lab receive $268 million in EERE funding next year, compared with $589 million this year. NREL is “primarily funded” by EERE, said Heather Lammers, a public and media relations manager at the laboratory. “We are currently working with DOE to understand the impacts of the FY25 spend plan,” Lammers said, referring to the president’s megabill. “At the same time, the FY26 appropriations process continues to move forward. 
We remain committed to our mission of delivering integrated solutions for an affordable, secure and sustainable energy future.” Under the proposed budget, PNNL

Read More »

China’s Fossil Fuel Imports from US Tank before Trade Talks

China’s imports of three major energy products from the US hit almost zero in June – a potentially sensitive shift as Beijing and Washington resume talks to resolve their differences on trade. Deliveries of American crude oil, liquefied natural gas, and coal have been subject to Chinese tariffs of 10 percent-15 percent since February. The levies were imposed in one of the opening salvos of the trade war launched by the Trump administration, and flows from the US to China have steadily dwindled as purchases have become less economically viable.  That came to a head last month, when China didn’t import any crude oil from the US for the first time in almost three years, according to the latest Chinese customs data. Crude is the most heavily traded commodity in the world and China the biggest buyer. In June last year, its purchases from the US were worth nearly $800 million.  Last month’s deliveries of gas, increasingly a prime US export, were zero for a fourth consecutive month, a collapse that’s partly due to Chinese firms reselling American shipments to more profitable markets in Europe and Asia. Coal purchases, which in June last year were worth over $90 million, shrank to just a few hundred dollars for a second straight month. In the deal reached to end Trump’s first trade war in 2020, China pledged to buy more US energy and farm goods to help shrink its trade surplus. However, Beijing failed to meet its obligations after the pandemic struck and the imbalance worsened, setting up the present round of conflict once Trump had reclaimed the presidency. In the interim, China has been busy diversifying its commodities imports. Most of its crude comes from Saudi Arabia and Russia, with the US just about making the top-10 in the monthly reports from customs.

Read More »

BP’s Castrol Unit Gets One Rock Bid

One Rock Capital Partners, a US mid-market private equity firm, is one of the few remaining bidders for BP Plc’s Castrol lubricants business, people familiar with the matter said, illustrating the potential challenges for the key asset disposal by the oil major. Several big-name energy companies and financial suitors have dropped out and valuation expectations have slipped, according to the people, who asked not to be identified as the information is private. One Rock is bidding for the entire asset, while Canada Pension Plan Investment Board is only interested in taking a minority stake, the people said. The asset initially drew interest from Saudi Aramco, Reliance Industries Ltd., Apollo Global Management Inc., Lone Star Funds, Brookfield Asset Management Ltd. and Stonepeak Partners, among others, Bloomberg News has reported. The earlier bids valued the lubricants business at $6 billion to $8 billion, the people said. A sale was initially expected to fetch as much as $10 billion. Given the lackluster response, BP has also given access of Castrol’s financials to another potential suitor, which wasn’t around at the initial bidding stages, one of the people said. “I wouldn’t be surprised if BP didn’t hit their $8 billion target, given the pressures buyers know the company is under to deliver divestment progress,” said Will Hares, senior energy analyst at Bloomberg Intelligence. Deliberations are ongoing and One Rock and CPPIB could decide against proceeding with their offers, the people said. BP may also opt to keep the asset for longer, they said. Representatives for BP, CPPIB and One Rock declined to comment. Shares of BP fell as much as 1.4% on Thursday morning in London after the Bloomberg News report. The stock is down about 0.5% as of 10:09 a.m. local time. Activist Pressure A sale of the lubricants business is part of BP Chief

Read More »

Storage vendors bring record capacity devices to handle massive data generation

Both are built on Seagate’s Mozaic3+ with advanced storage technology called HAMR, or Heat-Assisted Magnetic Recording. By heating the platter to as much as 500°C, they can squeeze up to 3TB per platter. Other than that, it looks like a standard hard drive: 3.5-inch enclosure, 7,200 RPM spin rotation, and SATA III interface with 6Gbps/s transfer speeds. The drivers are available now and are rather affordable. The 30TB Exos is just $599 on NewEgg.com. On the enterprise solid state drive (SSD) front, KIOXIA America has expanded its high-capacity KIOXIA LC9 Series enterprise SSD lineup with the introduction of a 245.76TB NVMe SSD. The drive comes in a 2.5-inch and Enterprise and Datacenter Standard Form Factor (EDSFF) E3.L form factor and is purpose-built for the performance and efficiency demands of generative AI environments.

Read More »

Technology is coming so fast data centers are obsolete by the time they launch

 Tariffs aside, Enderle feels that AI technology and ancillary technology around it like battery backup is still in the early stages of development and there will be significant changes coming in the next few years. GPUs from AMD and Nvidia are the primary processors for AI, and they are derived from video game accelerators. They were never meant for use in AI processing, but they are being fine-tuned for the task.  It’s better to wait to get a more mature product than something that is still in a relatively early state. But Alan Howard, senior analyst for data center infrastructure at Omdia, disagrees and says not to wait. One reason is the rate at which people that are building data centers is all about seizing market opportunity.” You must have a certain amount of capacity to make sure that you can execute on strategies meant to capture more market share.” The same sentiment exists on the colocation side, where there is a considerable shortage of capacity as demand outstrips supply. “To say, well, let’s wait and see if maybe we’ll be able to build a better, more efficient data center by not building anything for a couple of years. That’s just straight up not going to happen,” said Howard. “By waiting, you’re going to miss market opportunities. And these companies are all in it to make money. And so, the almighty dollar rules,” he added. Howard acknowledges that by the time you design and build the data center, it’s obsolete. The question is, does that mean it can’t do anything? “I mean, if you start today on a data center that’s going to be full of [Nvidia] Blackwells, and let’s say you deploy in two years when they’ve already retired Blackwell, and they’re making something completely new. Is that data

Read More »

‘Significant’ outage at Alaska Airlines not a security incident, but a hardware breakdown

The airline told Network World that when the critical piece of what it described as “third-party multi-redundant hardware” failed unexpectedly, “it impacted several of our key systems that enable us to run various operations.” The company is currently working with its vendor to replace the faulty equipment at the data center. The airline has cancelled more than 150 flights since Sunday evening, including 64 on Monday. The company said additional flight disruptions are likely as it repositions aircraft and crews throughout its network. Alaska Airlines emphasized that the safety of its flights was never compromised, and that “the IT outage is not related to any other current events, and it’s not connected to the recent cybersecurity incident at Hawaiian Airlines.” The airline did not provide additional information to Network World about the specifics of the outage. “There are many redundant components that can fail,” said Roberts, noting that it could have been something as simple as a RAID array (which combines multiple physical data storage components into one or more logical units). Or, on the network side, it could have been the failure of a pair of load balancers. “It’s interesting that redundancy didn’t save them,” said Roberts. “Perhaps multiple pieces of hardware were impacted by the same issue, like a firmware update. Or, maybe they’re just really unlucky.”

Read More »

Cisco upgrades 400G optical receiver to boost AI infrastructure throughput

“In the data center, what’s really changed in the last year or so is that with AI buildouts, there’s much, much more optics that are part of 400G and 800G. It’s not so much using 10G and 25G optics, which we still sell a ton of, for campus applications. But for AI infrastructure, the 400G and 800G optics are really the dominant optics for that application,” Gartner said. Most of the AI infrastructure builds have been for training models, especially in hyperscaler environments, Gartner said. “I expect, towards the tail end of this year, we’ll start to see more enterprises deploying AI infrastructure for inference. And once they do that, because it has an Nvidia GPU attached to it, it’s going to be a 400G or 800G optic.” Core enterprise applications – such as real-time trading, high-frequency transactions, multi-cloud communications, cybersecurity analytics, network forensics, and industrial IoT – can also utilize the higher network throughput, Gartner said. 

Read More »

Supermicro bets big on 4-socket X14 servers to regain enterprise trust

In April, Dell announced its PowerEdge R470, R570, R670, and R770 servers with Intel Xeon 6 Processors with P-cores, but with single and double-socket servers. Similarly, Lenovo’s ThinkSystem V4 servers are also based on the Intel Xeon 6 processor but are limited to dual socket configurations. The launch of 4-socket servers by Supermicro reflects a growing enterprise need for localized compute that can support memory-bound AI and reduce the complexity of distributed architectures. “The modern 4-socket servers solve multiple pain points that have intensified with GenAI and memory-intensive analytics. Enterprises are increasingly challenged by latency, interconnect complexity, and power budgets in distributed environments. High-capacity, scale-up servers provide an architecture that is more aligned with low-latency, large-model processing, especially where data residency or compliance constraints limit cloud elasticity,” said Sanchit Vir Gogia, chief analyst and CEO at Greyhound Research. “Launching a 4-socket Xeon 6 platform and packaging it within their modular ‘building block’ strategy shows Supermicro is focusing on staying ahead in enterprise and AI data center compute,” said Devroop Dhar, co-founder and MD at Primus Partner. A critical launch after major setbacks Experts peg this to be Supermicro’s most significant product launch since it became mired in governance and regulatory controversies. In 2024, the company lost Ernst & Young, its second auditor in two years, following allegations by Hindenburg Research involving accounting irregularities and the alleged export of sensitive chips to sanctioned entities. Compounding its troubles, Elon Musk’s AI startup xAI redirected its AI server orders to Dell, a move that reportedly cost Supermicro billions in potential revenue and damaged its standing in the hyperscaler ecosystem. Earlier this year, HPE signed a $1 billion contract to provide AI servers for X, a deal Supermicro was also bidding for. 
“The X14 launch marks a strategic reinforcement for Supermicro, showcasing its commitment

Read More »

Moving AI workloads off the cloud? A hefty data center retrofit awaits

“If you have a very specific use case, and you want to fold AI into some of your processes, and you need a GPU or two and a server to do that, then, that’s perfectly acceptable,” he says. “What we’re seeing, kind of universally, is that most of the enterprises want to migrate to these autonomous agents and agentic AI, where you do need a lot of compute capacity.” Racks of brand-new GPUs, even without new power and cooling infrastructure, can be costly, and Schneider Electric often advises cost-conscious clients to look at previous-generation GPUs to save money. GPU and other AI-related technology is advancing so rapidly, however, that it’s hard to know when to put down stakes. “We’re kind of in a situation where five years ago, we were talking about a data center lasting 30 years and going through three refreshes, maybe four,” Carlini says. “Now, because it is changing so much and requiring more and more power and cooling you can’t overbuild and then grow into it like you used to.”

Read More »

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one that is ramping up its investments into AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs).  In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple would between them devote $200 billion to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are way higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

Read More »

John Deere unveils more autonomous farm machines to address skill labor shortage

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. The Moline, Illinois-based John Deere has been in business for 187 years, yet it’s been a regular as a non-tech company showing off technology at the big tech trade show in Las Vegas and is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech. The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually; and the agricultural work force continues to shrink. (This is my hint to the anti-immigration crowd). John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app. While each of these industries experiences their own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% percent of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

Read More »

2025 playbook for enterprise AI success, from agents to evals

2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year.

1. Agents: the next generation of automation

AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for companies, and that recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to

Read More »

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more. The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks.

Going all-in on red teaming pays practical, competitive dividends

It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S.’s National Institute of Standards and Technology (NIST), which all had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find.
What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle

Read More »