How to build a better AI benchmark

It’s not easy being one of Silicon Valley’s favorite benchmarks.

SWE-Bench (pronounced “swee bench”) launched in November 2024 to evaluate an AI model’s coding skill, using more than 2,000 real-world programming problems pulled from the public GitHub repositories of 12 different Python-based projects.

In the months since then, it’s quickly become one of the most popular tests in AI. A SWE-Bench score has become a mainstay of major model releases from OpenAI, Anthropic, and Google, and outside of foundation models, the fine-tuners at AI firms are in constant competition to see who can rise above the pack. The top of the leaderboard is a pileup between three different fine-tunings of Anthropic’s Claude Sonnet model and Amazon’s Q developer agent. Auto Code Rover, one of the Claude modifications, nabbed the number two spot in November and was acquired just three months later.

Despite all the fervor, this isn’t exactly a truthful assessment of which model is “better.” As the benchmark has gained prominence, “you start to see that people really want that top spot,” says John Yang, a researcher on the team that developed SWE-Bench at Princeton University. As a result, entrants have begun to game the system—which is pushing many others to wonder whether there’s a better way to actually measure AI achievement.

Developers of these coding agents aren’t necessarily doing anything as straightforward as cheating, but they’re crafting approaches that are too neatly tailored to the specifics of the benchmark. The initial SWE-Bench test set was limited to programs written in Python, which meant developers could gain an advantage by training their models exclusively on Python code. Soon, Yang noticed that high-scoring models would fail completely when tested on different programming languages, revealing an approach to the test that he describes as “gilded.”

“It looks nice and shiny at first glance, but then you try to run it on a different language and the whole thing just kind of falls apart,” Yang says. “At that point, you’re not designing a software engineering agent. You’re designing to make a SWE-Bench agent, which is much less interesting.”

The SWE-Bench issue is a symptom of a more sweeping (and complicated) problem in AI evaluation, and one that’s increasingly sparking heated debate: the benchmarks the industry uses to guide development are drifting further and further away from evaluating actual capabilities, calling their basic value into question. Making the situation worse, several benchmarks, most notably FrontierMath and Chatbot Arena, have recently come under fire for an alleged lack of transparency. Nevertheless, benchmarks still play a central role in model development, even if few experts are willing to take their results at face value. OpenAI cofounder Andrej Karpathy recently described the situation as “an evaluation crisis”: the industry has fewer trusted methods for measuring capabilities and no clear path to better ones.

“Historically, benchmarks were the way we evaluated AI systems,” says Vanessa Parli, director of research at Stanford University’s Institute for Human-Centered AI. “Is that the way we want to evaluate systems going forward? And if it’s not, what is the way?”

A growing group of academics and AI researchers are making the case that the answer is to go smaller, trading sweeping ambition for an approach inspired by the social sciences. Specifically, they want to focus more on testing validity, which for quantitative social scientists refers to how well a given questionnaire measures what it’s claiming to measure and, more fundamentally, whether what it is measuring has a coherent definition. That could cause trouble for benchmarks assessing hazily defined concepts like “reasoning” or “scientific knowledge,” and for developers aiming to reach the much-hyped goal of artificial general intelligence, but it would put the industry on firmer ground as it looks to prove the worth of individual models.

“Taking validity seriously means asking folks in academia, industry, or wherever to show that their system does what they say it does,” says Abigail Jacobs, a University of Michigan professor who is a central figure in the new push for validity. “I think it points to a weakness in the AI world if they want to back off from showing that they can support their claim.”

The limits of traditional testing

If AI companies have been slow to respond to the growing failure of benchmarks, it’s partially because the test-scoring approach has been so effective for so long.

One of the biggest early successes of contemporary AI was the ImageNet challenge, a kind of antecedent to contemporary benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to categorize into 1,000 different classes.

Crucially, the test was completely agnostic to methods, and any successful algorithm quickly gained credibility regardless of how it worked. When an algorithm called AlexNet broke through in 2012, with a then-unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet’s convolutional neural nets would be the secret to unlocking image recognition, but after it scored well, no one dared dispute it. (One of AlexNet’s developers, Ilya Sutskever, would go on to cofound OpenAI.)

A large part of what made this challenge so effective was that there was little practical difference between ImageNet’s object classification challenge and the actual process of asking a computer to recognize an image. Even if there were disputes about methods, no one doubted that the highest-scoring model would have an advantage when deployed in an actual image recognition system.

But in the 12 years since, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures—which, in turn, makes it hard to use the findings responsibly.

Where things break down

Anka Reuel, a PhD student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced the evaluation problem is the result of this push toward generality. “We’ve moved from task-specific models to general-purpose models,” Reuel says. “It’s not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder.”

Like the University of Michigan’s Jacobs, Reuel thinks “the main issue with benchmarks is validity, even more than the practical implementation,” noting: “That’s where a lot of things break down.” For a task as complicated as coding, for instance, it’s nearly impossible to incorporate every possible scenario into your problem set. As a result, it’s hard to gauge whether a model is scoring better because it’s more skilled at coding or because it has more effectively manipulated the problem set. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.
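One rough way to probe that question is to compare a model’s pass rate on the public problem set with its pass rate on held-out problems it could not have been tuned against, such as tasks written in a different programming language. The Python sketch below is only an illustration: the function names, the two agents, and their pass/fail results are all made up, and a large gap is a warning sign rather than proof of gaming.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of tasks resolved; an empty list counts as 0.0."""
    return sum(results) / len(results) if results else 0.0


def generalization_gap(public: list[bool], held_out: list[bool]) -> float:
    """Public pass rate minus held-out pass rate.

    A large positive gap hints that a system is tuned to the public set
    rather than to the underlying skill. It is a warning sign, not proof.
    """
    return pass_rate(public) - pass_rate(held_out)


# Made-up results for two hypothetical agents (True = task resolved).
agent_a = {"public": [True] * 8 + [False] * 2, "held_out": [True] * 7 + [False] * 3}
agent_b = {"public": [True] * 9 + [False] * 1, "held_out": [True] * 3 + [False] * 7}

for name, r in (("agent_a", agent_a), ("agent_b", agent_b)):
    gap = generalization_gap(r["public"], r["held_out"])
    print(f"{name}: public={pass_rate(r['public']):.0%} gap={gap:+.0%}")
```

In this invented data, agent_b ranks higher on the public set but loses far more ground on the held-out tasks, which is the pattern Yang describes as “gilded.”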

For developers, the hope is that success on lots of specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. “There’s just many more knobs you can turn,” says Sayash Kapoor, a computer scientist at Princeton and a prominent critic of sloppy practices in the AI industry. “When it comes to agents, they have sort of given up on the best practices for evaluation.”

In a paper from last July, Kapoor called out specific issues in how AI models were approaching the WebArena benchmark, designed by Carnegie Mellon University researchers in 2024 as a test of an AI agent’s ability to traverse the web. The benchmark consists of more than 800 tasks to be performed on a set of cloned websites mimicking Reddit, Wikipedia, and others. Kapoor and his team identified an apparent hack in the winning model, called STeP. STeP included specific instructions about how Reddit structures URLs, allowing STeP models to jump directly to a given user’s profile page (a frequent element of WebArena tasks).

This shortcut wasn’t exactly cheating, but Kapoor sees it as “a serious misrepresentation of how well the agent would work had it seen the tasks in WebArena for the first time.” Because the technique was successful, though, a similar policy has since been adopted by OpenAI’s web agent Operator. (“Our evaluation setting is designed to assess how well an agent can solve tasks given some instruction about website structures and task execution,” an OpenAI representative said when reached for comment. “This approach is consistent with how others have used and reported results with WebArena.” STeP did not respond to a request for comment.)

Further highlighting the problem with AI benchmarks, late last month Kapoor and a team of researchers wrote a paper that revealed significant problems in Chatbot Arena, the popular crowdsourced evaluation system. According to the paper, the leaderboard was being manipulated; many top foundation models were conducting undisclosed private testing and releasing their scores selectively.
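To see why selective reporting matters, consider a toy simulation. It is an illustration under stated assumptions, not a reproduction of the paper’s analysis: every privately tested variant is assumed to have the same true skill, each measured score is that skill plus random noise, and the constants `TRUE_SKILL` and `NOISE_SD` and the helper `best_of_n` are hypothetical. Publishing only the best of several private runs raises the reported score even though the underlying model has not improved.

```python
import random
import statistics

TRUE_SKILL = 1200.0   # assumed true rating, identical for every variant
NOISE_SD = 30.0       # assumed measurement noise per private test


def measured_score(rng: random.Random) -> float:
    """One noisy measurement of the same underlying skill."""
    return rng.gauss(TRUE_SKILL, NOISE_SD)


def best_of_n(n: int, rng: random.Random) -> float:
    """Test n variants privately and publish only the highest score."""
    return max(measured_score(rng) for _ in range(n))


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 2000
honest = statistics.mean(best_of_n(1, rng) for _ in range(trials))
selective = statistics.mean(best_of_n(10, rng) for _ in range(trials))

print(f"publish every run:   {honest:.1f}")
print(f"publish best of ten: {selective:.1f}")
```

The averaged “best of ten” score sits well above the true skill purely because of selection, which is why undisclosed private testing can distort a leaderboard.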

Today, even ImageNet itself, the mother of all benchmarks, has started to fall victim to validity problems. A 2023 study from researchers at the University of Washington and Google Research found that when ImageNet-winning algorithms were pitted against six real-world data sets, the architecture improvement “resulted in little to no progress,” suggesting that the external validity of the test had reached its limit.

Going smaller

For those who believe the main problem is validity, the best fix is reconnecting benchmarks to specific tasks. As Reuel puts it, AI developers “have to resort to these high-level benchmarks that are almost meaningless for downstream consumers, because the benchmark developers can’t anticipate the downstream task anymore.” So what if there was a way to help the downstream consumers identify this gap?

In November 2024, Reuel launched a public ranking project called BetterBench, which rates benchmarks on dozens of different criteria, such as whether the code has been publicly documented. But validity is a central theme, with particular criteria challenging designers to spell out what capability their benchmark is testing and how it relates to the tasks that make up the benchmark.

“You need to have a structural breakdown of the capabilities,” Reuel says. “What are the actual skills you care about, and how do you operationalize them into something we can measure?”

The results are surprising. One of the highest-scoring benchmarks is also the oldest: the Arcade Learning Environment (ALE), established in 2013 as a way to test models’ ability to learn how to play a library of Atari 2600 games. One of the lowest-scoring is the Massive Multitask Language Understanding (MMLU) benchmark, a widely used test for general language skills; by the standards of BetterBench, the connection between the questions and the underlying skill was too poorly defined.

BetterBench hasn’t meant much for the reputations of specific benchmarks, at least not yet; MMLU is still widely used, and ALE is still marginal. But the project has succeeded in pushing validity into the broader conversation about how to fix benchmarks. In April, Reuel quietly joined a new research group hosted by Hugging Face, the University of Edinburgh, and EleutherAI, where she’ll develop her ideas on validity and AI model evaluation with other figures in the field. (An official announcement is expected later this month.)

Irene Solaiman, Hugging Face’s head of global policy, says the group will focus on building valid benchmarks that go beyond measuring straightforward capabilities. “There’s just so much hunger for a good benchmark off the shelf that already works,” Solaiman says. “A lot of evaluations are trying to do too much.”

Increasingly, the rest of the industry seems to agree. In a paper in March, researchers from Google, Microsoft, Anthropic, and others laid out a new framework for improving evaluations—with validity as the first step.

“AI evaluation science must,” the researchers argue, “move beyond coarse grained claims of ‘general intelligence’ towards more task-specific and real-world relevant measures of progress.”

Measuring the “squishy” things

To help make this shift, some researchers are looking to the tools of social science. A February position paper argued that “evaluating GenAI systems is a social science measurement challenge,” specifically unpacking how the validity systems used in social measurements can be applied to AI benchmarking.

The authors, largely employed by Microsoft’s research branch but joined by academics from Stanford and the University of Michigan, point to the standards that social scientists use to measure contested concepts like ideology, democracy, and media bias. Applied to AI benchmarks, those same procedures could offer a way to measure concepts like “reasoning” and “math proficiency” without slipping into hazy generalizations.

In the social science literature, it’s particularly important that metrics begin with a rigorous definition of the concept measured by the test. For instance, if the test is to measure how democratic a society is, it first needs to establish a definition for a “democratic society” and then establish questions that are relevant to that definition.

To apply this to a benchmark like SWE-Bench, designers would need to set aside the classic machine learning approach, which is to collect programming problems from GitHub and create a scheme to validate answers as true or false. Instead, they’d first need to define what the benchmark aims to measure (“ability to resolve flagged issues in software,” for instance), break that into subskills (different types of problems or types of program that the AI model can successfully process), and then finally assemble questions that accurately cover the different subskills.
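Those three steps can be sketched in code. What follows is a minimal, hypothetical illustration in Python, not anything drawn from SWE-Bench or the February paper: the construct wording comes from the example above, but the `BenchmarkSpec` class, the subskill names, and the task IDs are assumptions invented for the sketch. It checks only one narrow thing, whether every declared subskill is covered by at least a minimum number of tasks.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class BenchmarkSpec:
    """A benchmark described construct-first (hypothetical structure)."""
    construct: str                      # what the benchmark claims to measure
    subskills: list[str]                # the construct broken into parts
    tasks: dict[str, str] = field(default_factory=dict)  # task id -> subskill

    def add_task(self, task_id: str, subskill: str) -> None:
        # Reject tasks that do not map onto a declared subskill.
        if subskill not in self.subskills:
            raise ValueError(f"unknown subskill: {subskill!r}")
        self.tasks[task_id] = subskill

    def coverage_gaps(self, min_tasks: int = 2) -> list[str]:
        """Return subskills covered by fewer than min_tasks tasks."""
        counts = Counter(self.tasks.values())
        return [s for s in self.subskills if counts[s] < min_tasks]


spec = BenchmarkSpec(
    construct="ability to resolve flagged issues in software",
    subskills=["bug fix", "feature request", "performance regression"],
)
spec.add_task("task-001", "bug fix")
spec.add_task("task-002", "bug fix")
spec.add_task("task-003", "feature request")

gaps = spec.coverage_gaps(min_tasks=2)
print(gaps)  # subskills that still need more tasks
```

The design choice here is that tasks cannot be added without naming the subskill they test, so the link between each question and the stated construct is recorded up front rather than inferred afterward.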

It’s a profound change from how AI researchers typically approach benchmarking—but for researchers like Jacobs, a coauthor on the February paper, that’s the whole point. “There’s a mismatch between what’s happening in the tech industry and these tools from social science,” she says. “We have decades and decades of thinking about how we want to measure these squishy things about humans.”

Even though the idea has made a real impact in the research world, it’s been slow to influence the way AI companies are actually using benchmarks.

The last two months have seen new model releases from OpenAI, Anthropic, Google, and Meta, and all of them lean heavily on multiple-choice knowledge benchmarks like MMLU—the exact approach that validity researchers are trying to move past. After all, model releases are, for the most part, still about showing increases in general intelligence, and broad benchmarks continue to be used to back up those claims.

For some observers, that’s good enough. Benchmarks, Wharton professor Ethan Mollick says, are “bad measures of things, but also they’re what we’ve got.” He adds: “At the same time, the models are getting better. A lot of sins are forgiven by fast progress.”

For now, the industry’s long-standing focus on artificial general intelligence seems to be crowding out a more focused validity-based approach. As long as AI models can keep growing in general intelligence, then specific applications don’t seem as compelling—even if that leaves practitioners relying on tools they no longer fully trust.

“This is the tightrope we’re walking,” says Hugging Face’s Solaiman. “It’s too easy to throw the system out, but evaluations are really helpful in understanding our models, even with these limitations.”

Russell Brandom is a freelance writer covering artificial intelligence. He lives in Brooklyn with his wife and two cats.

This story was supported by a grant from the Tarbell Center for AI Journalism.

Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy, bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

Palo Alto Networks readies security for AI-first world

Palo Alto has articulated the value of a security platform for several years. But now, given the speed at which AI is moving, the value shifts from cost consolidation to agility. With AI, most customers don’t know what their future operating environment will look like, and a platform approach lets

Chevron executives see 2025 production growth nearing 8%

Executives of Chevron Corp., Houston, expect the company’s 2025 production growth, excluding former Hess operations, to be near the top of their guidance range of 6-8%, they said Oct. 31. Chevron’s total production for the 3 months that ended Sept. 30 totaled nearly 4.09 MMboe/d compared with 3.37 MMboe/d in

Cisco unveils integrated edge platform for AI

Announced at Cisco’s Partner Summit, Unified Edge will likely be part of many third-party packages that can be configured in a variety of ways, Cisco stated. “The platform is customer definable. For example, if a customer has a workload and they’ve decided they want to use Nutanix, they can go

Infoblox bolsters Universal DDI Platform with multi-cloud integrations

Universal DDI for Microsoft Management integration enables enterprises to gain control of their DNS and DHCP by centrally managing DNS and DHCP hosted on Microsoft server platforms. Integration with Google Cloud Internal Range applies consistent IPAM policies across Google Cloud, on-premises, and other cloud environments, which helps enterprise IT to

Oil Retreats on Strong Greenback

Oil fell, halting a four-session run of gains, pressured by a strong dollar and a backdrop of oversupply. West Texas Intermediate fell 0.8% to settle below $61 a barrel on Tuesday. A global equities rally hit a speed bump amid concerns about lofty valuations while the greenback climbed to the highest in more than five months, weighing on crude and other dollar-denominated commodities. Oil declined because of “the dollar funding stress and the second-order effect on global liquidity and, in turn, global growth,” said Jon Byrne, an analyst at Strategas Securities. The Organization of the Petroleum Exporting Countries and its allies said over the weekend they planned to hold back from lifting production quotas in the first quarter. The decision came as market observers brace for what is expected to be a global crude glut. The US oil benchmark has retreated almost 16% this year as OPEC+ and non-member nations ramped up production. Prices rebounded from five-month lows when the US recently announced sanctions on Rosneft PJSC and Lukoil PJSC, Russia’s two biggest oil companies, but have since surrendered some of those advances. Russian seaborne crude shipments fell sharply in the wake of the sanctions, dropping by the most since January 2024, according to data tracked by Bloomberg. Cargo discharges have been hit even harder than loadings, with oil held in tanker ships surging. Still, some are skeptical the restrictions will stop Russian oil from finding buyers. “Down the line, you will see that more and more of the disrupted Russian oil, one way or another, finds its way to the market,” Torbjörn Törnqvist, chief executive officer of Gunvor Group, said during an interview on Tuesday. “It always does somehow.” Eni SpA CEO Claudio Descalzi said Monday that any concerns about oversupply will be short-lived, the latest comments by an

Gunvor CEO Says Deal for Lukoil Assets Is a ‘Clean Break’

Gunvor Group Chief Executive Officer Torbjörn Törnqvist said a deal to acquire the international assets of sanctioned Russian oil producer Lukoil PJSC represents a “clean break” for the portfolio and should pass muster with regulators. “We believe it is satisfying all the concerns that may arise from a transaction of this magnitude and given the parties involved,” Törnqvist said in an interview with Bloomberg Television. He ruled out selling any of the assets back should sanctions on Lukoil be removed. “It’s a clean break; the moment the deal is done — that’s it.” Lukoil last week announced it had agreed to sell Gunvor its vast international network of oil wells, refineries and gas stations, as well as its trading book, without disclosing terms. If finalized, the deal is a coup for Gunvor, a large trader of oil and gas that has longstanding ties to Russia’s energy industry. In 2014, co-founder Gennady Timchenko was sanctioned by the US, which claimed Russian President Vladimir Putin had “investments in Gunvor,” which the company has consistently denied. Törnqvist said he believes any concerns the authorities might have about continued Russian influence over the portfolio would be satisfied. “We’re pretty confident that this deal ticks off all the critical boxes,” he said Tuesday. The US blacklisted Lukoil and fellow Russian oil giant Rosneft PJSC last month as part of a fresh bid to end the war in Ukraine by depriving Moscow of revenues. Gunvor’s subsequent deal is subject to clearance from the US Treasury’s Office of Foreign Assets Control, among other authorities. As part of the sanctions, Lukoil and its Litasco trading arm have a short window to wind down business dealings. Gunvor is in talks with US regulators to secure an extension to a license to transact with the Russian company. The US license

Energy Department Announces $625 Million to Advance the Next Phase of National Quantum Information Science Research Centers

WASHINGTON— The U.S. Department of Energy (DOE) today announced $625 million in funding to renew its five National Quantum Information Science (QIS) Research Centers, originally established under the National Quantum Initiative Act signed into law by President Trump in December 2018. The renewal of DOE’s National Quantum Information Science Research Centers advances President Trump’s directive to restore American leadership in quantum science and technology. The DOE is aligning its quantum research enterprise with national priorities, focusing resources on advancing critical R&D across the American QIS, strengthening the quantum innovation ecosystem, accelerating discoveries that power next-generation technologies, and securing American leadership in quantum computing, hardware, and applications. “President Trump positioned America to lead the world in quantum science and technology and today, a new frontier of scientific discovery lies before us. Breakthroughs in QIS have the potential to revolutionize the ways we sense, communicate, and compute, sparking entirely new technologies and industries,” said U.S. Department of Energy Under Secretary for Science Darío Gil. “The renewal of DOE’s National Quantum Information Science Research Centers will empower America to secure our advantage in pioneering the next generation of scientific and engineering advancements needed for this technology.” Each NQISRC: Supports fundamental science with disruptive potential across quantum computing, simulation, networking, and sensing. Develops unique tools, equipment, and instrumentation that unlock transformative new QIS capabilities. Advances quantum technology through application to DOE’s most pressing scientific and national security challenge areas. Establishes community resources, workforce opportunities, and industry partnerships to strengthen the entire QIS ecosystem. 
Center renewals include: Co-design Center for Quantum Advantage (C2QA) – Brookhaven National Laboratory will advance quantum computing and sensing by improving materials used in superconducting and plasma-grown, diamond-based quantum devices and developing modular approaches for superconducting and neutral-atom systems. Superconducting Quantum Materials and Systems Center (SQMS) – Fermi National Accelerator Laboratory

Xcel proposes doubling battery storage at Minnesota coal plant

Xcel Energy on Friday asked Minnesota regulators for permission to double the battery storage capacity at a location ajacent to its coal-fired Sherco power plant, which is slated to retire at the end of 2030. “We’re making a significant investment in battery storage because we see it as a critical part of Minnesota’s energy future,” Bria Shea, president of Xcel Energy-Minnesota, North Dakota and South Dakota, said in a statement. The Minnesota Public Utilities Commission has already approved 300 MW to be installed at Sherco. Xcel’s proposal would increase that capacity to 600 MW, making it the largest battery storage site in the upper Midwest, according to the utility. It would also add another 135.5 MW at the company’s Blue Lake facility and expand the company’s Sherco Solar facility with an additional 200-MW array. Xcel plans to start construction on the battery storage projects in 2026, and bring them online in late 2027. The projects will use lithium iron phosphate battery cell technology that “discharge energy in four-hour increments and are quick to recharge, allowing for regular use,” the utility said in a statement. Xcel said it plans to reuse existing grid connections for the batteries to store energy produced by wind, solar, nuclear and natural gas facilities across its system. “Batteries help us store energy when it’s inexpensive to produce and dispatch it when needed, allowing us to continue delivering reliable electricity to customers while keeping bills low,” Shea said. Xcel said it anticipates the projects will qualify for federal tax credits, offsetting 30% of the cost for the Blue Lake battery and 40% for the Sherco solar and battery projects. Xcel serves about 3.9 million electric customers across eight states, and expects retail sales to grow 5% through 2030. The utility on Thursday unveiled a $15 billion addition to

BP Profit Exceeds Expectations

BP Plc’s profit exceeded expectations, with operational improvements and higher oil and gas production outweighing lower prices, as the company’s turnaround plan builds momentum. The British energy giant posted adjusted third-quarter net income of $2.21 billion, higher than the average analyst estimate of $1.98 billion. Its quarterly share buyback plan was maintained and net debt rose slightly. The results signal Chief Executive Officer Murray Auchincloss is starting to deliver a turnaround plan to win back investor confidence by focusing on oil and gas production, selling non-strategic assets and cutting costs. “We continue to make good progress to cut costs, strengthen our balance sheet and increase cash flow and returns,” Auchincloss said in BP’s earnings statement. “We are looking to accelerate delivery of our plans, including undertaking a thorough review of our portfolio.” BP shares were little changed in London trading, as crude prices declined. BP’s plan to divest $20 billion of assets by the end of 2027 to improve the balance sheet still includes expectations of a transaction for lubricants business Castrol, Auchincloss said in an interview on Bloomberg TV. The firm also raised its disposal expectations for 2025, saying proceeds will exceed $4 billion after previously guiding between $3 to $4 billion. Quarterly share buybacks were held at $750 million, a reduced level BP announced earlier this year along with a strategic reset. Gearing — a ratio of net debt to equity that analysts have flagged as elevated compared to peers — ticked higher to 25.1%, from 24.6% in the previous quarter. Even though the company returned to focusing on fossil fuels, BP said its full year reported upstream production is expected to be slightly lower than last year. But in a telephone interview on Tuesday, Auchincloss said “maybe we’ll do better than that, but we don’t want to

Biden staffers say IRA was hobbled by slow deployment

Implementation of the Inflation Reduction Act and Bipartisan Infrastructure Law suffered from muddled aims, and projects took too long to materialize under the Biden administration, according to an October report from former Department of Energy staffers who interviewed more than 80 of their former colleagues on the topic. The slow rollout meant that the “political theory animating the [Biden] administration’s approach — that the economic development generated by clean energy projects and industries would create a durable bipartisan coalition — was never truly tested,” and the Trump administration has been able to claw back much of the associated funding, the report says. “Programs frequently tried to satisfy multiple aims at once: decarbonization, onshoring, labor, equity, national security,” the report says. “This layering of priorities blurred mandates and slowed action. This proved to be particularly challenging for requirements that were at odds with energy industry realities (e.g., impractical [Build America Buy America] requirements for every component; labor union requirements for transmission projects where union labor didn’t exist).” The report was written by Ramsey Fahs, a former policy advisor at DOE; Louise White, a former senior consultant with DOE’s Loan Programs Office and Office of Technology Transitions; and Alan Propp, who first worked as a senior strategy consultant with DOE’s LPO and then served as a deputy chief of staff in its Loan Underwriting and Structuring Division. All three left the agency in January. 
The authors say they interviewed more than 80 “political appointees and career staff who sat at the heart of implementation, with a primary focus on the infrastructure offices” at DOE, and noted that the interviews “are not exhaustive and at times interviewees reported conflicting information or divergent experiences.” However, interviewees seemed to agree that the implementation of the IRA and BIL was hampered by jumbled priorities, as well

Cisco centralizes customer experience around AI

The idea is to make sure enterprises are effectively choosing, implementing, and using the technologies they purchase to achieve their business goals, according to the company. Cisco CX offers a suite of services to help customers optimize their network infrastructure, security, collaboration, cloud and data center operations – from planning and design to implementation and maintenance. “For too long, the delivery of services has been fragmented, with support and professional services using different tools optimized for specific functions or lifecycle stages. This has led to a fragmented experience where customers, partners, and Cisco teams spend more time on data collection and tool maintenance than on high-value analysis,” wrote Bhaskar Jayakrishnan, senior vice president of engineering with the Cisco CX group in a blog about the new technology. “Historically, the handoffs between these stages have been inefficient. Designs are interpreted by humans and then converted into code. Operational data is manually analyzed to inform optimizations. This process is slow, error-prone, and loses critical context at every step.” “Cisco IQ represents a shift from this tool-centric model to an intelligence-centric one. It is a multi-persona system, serving customers, partners, and our own services teams through an API-first architecture. Our objective is to turn decades of institutional knowledge into a living, adaptive system that makes your infrastructure smarter, more resilient, and more secure,” Jayakrishnan wrote.

Data Center Jobs: Engineering, Construction, Commissioning, Sales, Field Service and Facility Tech Jobs Available in Major Data Center Hotspots

Each month Data Center Frontier, in partnership with Pkaza, posts some of the hottest data center career opportunities in the market. Here’s a look at some of the latest data center jobs posted on the Data Center Frontier jobs board, powered by Pkaza Critical Facilities Recruiting. Looking for Data Center Candidates? Check out Pkaza’s Active Candidate / Featured Candidate Hotlist Data Center Facility Technician (All Shifts Available) Impact, TX This position is also available in: Ashburn, VA; Abilene, TX; Needham, MA and New York, NY. Navy Nuke / Military Vets leaving service accepted! This opportunity is working with a leading mission-critical data center provider. This firm provides data center solutions custom-fit to the requirements of their client’s mission-critical operational facilities. They provide reliability of mission-critical facilities for many of the world’s largest organizations facilities supporting enterprise clients, colo providers and hyperscale companies. This opportunity provides a career-growth minded role with exciting projects with leading-edge technology and innovation as well as competitive salaries and benefits. Electrical Commissioning Engineer Montvale, NJ This traveling position is also available in: New York, NY; White Plains, NY; Richmond, VA; Ashburn, VA; Charlotte, NC; Atlanta, GA; Hampton, GA; Fayetteville, GA; New Albany, OH; Cedar Rapids, IA; Phoenix, AZ; Dallas, TX or Chicago IL *** ALSO looking for a LEAD EE and ME CxA Agents and CxA PMs. *** Our client is an engineering design and commissioning company that has a national footprint and specializes in MEP critical facilities design. They provide design, commissioning, consulting and management expertise in the critical facilities space. They have a mindset to provide reliability, energy efficiency, sustainable design and LEED expertise when providing these consulting services for enterprise, colocation and hyperscale companies. 
This career-growth minded opportunity offers exciting projects with leading-edge technology and innovation as well as competitive salaries and benefits. Data Center MEP Construction

NVIDIA at GTC 2025: Building the AI Infrastructure of Everything

Omniverse DSX Blueprint Unveiled Also at the conference, NVIDIA released a blueprint for how other firms should build massive, gigascale AI data centers, or AI factories, in which Oracle, Microsoft, Google, and other leading tech firms are investing billions. The most powerful and efficient of those, company representatives said, will include NVIDIA chips and software. A new NVIDIA AI Factory Research Center in Virginia will use that technology. This new “mega” Omniverse DSX Blueprint is a comprehensive, open blueprint for designing and operating gigawatt-scale AI factories. It combines design, simulation, and operations across factory facilities, hardware, and software. • The blueprint expands to include libraries for building factory-scale digital twins, with Siemens’ Digital Twin software first to support the blueprint and FANUC and Foxconn Fii first to connect their robot models. • Belden, Caterpillar, Foxconn, Lucid Motors, Toyota, Taiwan Semiconductor Manufacturing Co. (TSMC), and Wistron build Omniverse factory digital twins to accelerate AI-driven manufacturing. • Agility Robotics, Amazon Robotics, Figure, and Skild AI build a collaborative robot workforce using NVIDIA’s three-computer architecture. NVIDIA Quantum Gains And then there’s quantum computing. It can help data centers become more energy-efficient and faster with specific tasks such as optimization and AI model training. Conversely, the unique infrastructure needs of quantum computers, such as power, cooling, and error correction, are driving the development of specialized quantum data centers. Huang said it’s now possible to make one logical qubit, or quantum bit, that’s coherent, stable, and error corrected. However, these qubits—the units of information enabling quantum computers to process information in ways ordinary computers can’t—are “incredibly fragile,” creating a need for powerful technology to do quantum error correction and infer the qubit’s state. 
To connect quantum and GPU computing, Huang announced the release of NVIDIA NVQLink — a quantum‑GPU interconnect that enables real‑time CUDA‑Q calls from quantum

The Evolution of the Neocloud: From Niche to Mainstream Hyperscale Challenger

Infrastructure and Supply Chain Race Cloud competition is increasingly defined by the ability to secure power, land, and chips— three resources that dictate project timelines and customer onboarding. Neoclouds and hyperscalers face a common set of constraints: local utility availability, substation interconnection bottlenecks, and fierce competition for high-density GPU inventory. Power stands as the gating factor for expansion, often outpacing even chip shortages in severity. Facilities are increasingly being sited based on access to dedicated, reliable megawatt-scale electricity, rather than traditional latency zones or network proximity. AI growth forecasts point to four key ceilings: electrical capacity, chip procurement cycles, latency wall between computation and data, and scalable data throughput for model training. With hyperscaler and neocloud deployments now competing for every available GPU from manufacturers, deployment agility has become a prime differentiator. Neoclouds distinguish themselves by orchestrating microgrid agreements, securing direct-source utility contracts, and compressing build-to-operational timelines. Converting a bare site to a functional data hall with operators that can viably offer a shortened deployment timeline gives neoclouds a material edge over traditional hyperscale deployments that require broader campus and network-level integration cycles. The aftereffects of the COVID era supply chain disruptions linger, with legacy operators struggling to source critical electrical components, switchgear, and transformers, sometimes waiting more than a year for equipment. As a result, neocloud providers have moved aggressively into site selection strategies, regional partnerships, and infrastructure stack integration to hedge risk and shorten delivery cycles. 
Microgrid solutions and island modes for power supply are increasingly utilized to ensure uninterrupted access to electricity during ramp-up periods and supply chain outages, fundamentally rebalancing the competitive dynamics of AI infrastructure deployment. Creditworthiness, Capital, and Risk Management Securing capital remains a decisive factor for the growth and sustainability of neoclouds. Project finance for campus-scale deployments hinges on demonstrable creditworthiness; lenders demand

Canyon Magnet Energy: The Superconducting Future of Powering AI Data Centers

At this year’s Data Center Frontier Trends Summit, Honghai Song, founder of Canyon Magnet Energy, presented his company’s breakthrough superconducting magnet technology during the “6 Moonshot Trends for the 2026 Data Center Frontier” panel—showcasing how high-temperature superconductors (HTS) could reshape both fusion energy and AI data-center power systems. In this episode of the Data Center Frontier Show, Editor in Chief Matt Vincent speaks with Song about how Canyon Magnet Energy—founded in 2023 and based in New Jersey with research roots at Stony Brook University—is bridging fusion research and AI infrastructure through next-generation magnet and energy-storage technology. From Fusion Research to Data Center Reality Founded in 2023, Canyon Magnet Energy emerged from the advanced-magnet research ecosystem around Stony Brook and now operates a manufacturing line in Newark, New Jersey. Its team draws on decades of experience designing the ultra-strong magnetic fields that enable the confinement and stability of fusion plasma—but their ambitions go far beyond the laboratory. “Super magnets are the foundation of fusion,” Song explains in the interview. “But the same high-temperature superconductors that can make fusion practical can also dramatically improve how we move and store electricity in data centers.” The company’s magnets are built using REBCO (Rare Earth Barium Copper Oxide) tape, which operates at around 77 Kelvin—cold, but far warmer and more manageable than traditional low-temperature superconductors. The result is a zero-resistance pathway for electricity, unlocking new possibilities in power transmission, energy storage, and grid integration. Why High-Temperature Superconductors Matter Since their discovery in 1986, high-temperature superconductors have progressed from exotic physics experiments to industrial-scale wire and magnet manufacturing. 
Canyon Magnet Energy is among a new generation of companies moving this technology into the AI data-center context—where efficiency and instantaneous power responsiveness are increasingly critical. With AI training clusters consuming power at hundreds of megawatts per campus,

OpenAI spends even more money it doesn’t have

The aim, said Gogia, “is continuity, not cost efficiency. These deals are forward leaning, relying on revenue forecasts that remain speculative. In that context, OpenAI must continue to draw heavily on outside capital, whether through venture rounds, debt, or a future public offering.” He pointed out, “the company’s recent legal and corporate restructuring was designed to open the doors to that capital. Removing Microsoft’s exclusivity makes room for more vendors but also signals that no one provider can meet OpenAI’s demands. In several cases, suppliers are stepping in with financing arrangements that link product sales to future performance. While these strategies help close funding gaps, they introduce fragility. What looks like revenue is often pre-paid consumption, not realized margin.” Execution risks, he said, add to the concern. “Building and energizing enough data centers to meet OpenAI’s projected needs is not a function of ambition alone. It requires grid access, cooling capacity, and regional stability. Microsoft has acknowledged that it lacks the power infrastructure to fully deploy the GPUs it owns. Without physical readiness, all of these agreements sit on shaky ground.” Lots of equity swapping going on Scott Bickley, advisory fellow at Info-Tech Research Group, said he has not only been astounded by the funding announcements over the last few months, but is also appalled, primarily, he said, “because of the disconnect to what this does to the underlying technology stocks and their market prices versus where the technology is at from a development and ROI perspective … and from a boots on the ground perspective.” He added that while the financial pledges involve “huge, staggering numbers, most of them are tied up in ways that are not necessarily going to require all the cash to come from OpenAI. In a lot of cases, there is equity swapping. You have

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one that is ramping up its investments into AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs). In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple would between them devote $200 billion to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are way higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

John Deere unveils more autonomous farm machines to address skill labor shortage

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. The Moline, Illinois-based John Deere has been in business for 187 years, yet it’s been a regular as a non-tech company showing off technology at the big tech trade show in Las Vegas and is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech. The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually; and the agricultural work force continues to shrink. (This is my hint to the anti-immigration crowd). John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app. While each of these industries experiences their own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% percent of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

2025 playbook for enterprise AI success, from agents to evals

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More 2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year. 1. Agents: the next generation of automation AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for companies, and that recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more. The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks.

Going all-in on red teaming pays practical, competitive dividends

It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S.’s National Institute of Standards and Technology (NIST), which all had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find.
What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle
