I Tried Making my Own (Bad) LLM Benchmark to Cheat in Escape Rooms

Recently, DeepSeek announced their latest model, R1, and article after article came out praising its performance relative to cost, and how the release of such open-source models could genuinely change the course of LLMs forever. That is really exciting! And also, too big of a scope to write about… but when a model like DeepSeek comes out of nowhere with a steel chair, boasting similar performance levels to other models, what does performance really mean in this context?

If you follow AI releases, you’ve seen this dance before. Every new model drops with its graphs showing how it’s somehow simultaneously better than GPT-4 on math problems while being smaller and more efficient. But what exactly are these benchmarks measuring? How are they created? And more importantly, how can we cut through the hype to create our own benchmarks for specific use cases?

I wanted to learn more about LLM Benchmarking.

Part 1: What is a Benchmark? (in 3 seconds)

TL;DR: The SATs (multiple, actually) for LLMs.

Part 1.1: What is a Benchmark? (in more than 3 seconds)

Before we dive into the nitty-gritty of specific benchmarks, let’s take a moment to unpack what we even mean by “LLM Benchmark.” Because calling them the “SATs for AI” feels both right and also slightly oversimplified.

LLM benchmarks are, at their core, structured tests used to measure how well large language models perform on certain tasks. These tasks can be anything from identifying if a statement is true or false, to summarizing a legal document, to generating valid Python functions. Think of them as curated obstacle courses specially designed by AI researchers to test every relevant muscle these models might have. These frameworks typically provide a dataset of inputs with known correct outputs, allowing for consistent comparison between models.

Modern benchmarks employ various evaluation methodologies. Classification metrics like accuracy work for tasks with discrete correct answers, while overlap-based metrics (BLEU, ROUGE) evaluate free-form text generation. Some benchmarks use functional testing for code generation, or employ other LLMs as judges to evaluate response quality.
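To make those two metric families a bit more concrete, here is a minimal sketch of an exact-match accuracy and a token-overlap F1. The overlap score is a simplified, ROUGE-1-style stand-in rather than the official BLEU or ROUGE implementation, and both function names are made up for illustration.

```python
from collections import Counter


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference (case and outer whitespace ignored)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


def unigram_f1(candidate, reference):
    """Token-overlap F1 between two strings (simplified, ROUGE-1-style)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Accuracy suits tasks with one right answer, while the overlap score gives partial credit when a free-form answer shares words with the reference.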

A typical benchmark usually comes packaged as:

  • A standardized dataset of questions, prompts, or tasks (with correct or reference answers).
  • An evaluation protocol specifying how to measure success, like accuracy, F1 score, BLEU/ROUGE for text generation, or pass/fail rates for coding tasks.
  • A leaderboard or some form of comparative scoreboard, often with big flashy graphs.

Some really famous benchmarks include MMLU for testing multitask language understanding, TruthfulQA for assessing factual accuracy, and HumanEval for measuring coding capabilities. Results are pretty often published on public leaderboards, which lets people compare different models transparently.

From the DeepSeek paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

What Makes a Good Benchmark?

  1. A Clear Task Definition: We want tasks that are unambiguous. The more straightforward and well-specified the challenge, the easier it is to trust the results.
  2. Data Integrity: The test set shouldn’t be floating around in the training data. Because if the model’s seen the exact same question 50 times before, the evaluation is about as useful as giving a math quiz to someone who already has the answer key.
  3. Quantifiable Metrics: You need a standard for scoring performance — like how many times the model’s code passes test cases or how close the generated summary is to a “ground-truth” summary.
  4. Task Diversity & Difficulty: If a benchmark is too easy, everyone just ACES it on day one, and we learn… well, nothing. If it’s too niche (like “We test only the model’s ability to count the digits of Pi for 20 minutes”), that’s also not so helpful.
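The data integrity point can be spot-checked with a crude n-gram overlap test. This is only a sketch under assumptions: real contamination checks are more involved, the window size of 8 words is an arbitrary choice, and the function names are hypothetical.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in a piece of text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_ratio(test_item, corpus_docs, n=8):
    """Share of the test item's n-grams that also appear in any corpus document."""
    item_grams = word_ngrams(test_item, n)
    if not item_grams:
        # Item shorter than n words: nothing to compare.
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= word_ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

A ratio near 1.0 suggests the question appears almost verbatim in the corpus, which is the "already has the answer key" situation.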

Life Ain’t All about The Grades

Benchmarks capture only a slice of what LLMs can do. In the real world, your chatbot might need to juggle domain knowledge, keep track of conversation context, abide by your company’s policies, and produce fluent, non-offensive replies. No single standardized test out there fully covers that. As we’ll see in the upcoming case studies, the design and execution of a benchmark can heavily shape the picture you get of your model’s performance… and sometimes lead you astray if you’re not careful with how you measure success.

Now that we have a sense of what LLM benchmarks are designed to accomplish (and where they might fall short), let’s explore a couple of examples to see how people actually build and use them in practice, with mixed results!

Case Study #1: Leetcode as an LLM Benchmark

As a student in the tech space, the word “Leetcode” popping up during my search for cool benchmarks raised my blood pressure by a statistically significant amount. Unlike Leetcode, which sucks, the paper “A Performance Study of LLM-Generated Code on Leetcode” was very interesting. It asks a deceptively simple question: can we use Leetcode to benchmark LLM code generation? Their findings reveal both the promise and pitfalls of this approach.

The Benchmark Design

The researchers built a three-stage validation system. Local tests catch basic errors, Leetcode’s judge verifies correctness, and a custom benchmarking setup measures performance. This setup revealed something critical: benchmarking code performance is harder than it looks.

When they compared local measurements to Leetcode’s metrics, they found only a 0.28 correlation. Leetcode’s measurements showed much higher variation (0.089 vs 0.035 locally). Even worse, Leetcode’s rankings proved unstable — identical solutions could drop from the 77th to 54th percentile just based on submission timing.

A Performance Study of LLM-Generated Code on Leetcode, In 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024), Salerno, Italy (2024)
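For a feel of what a correlation like 0.28 is measuring, here is a minimal sketch of a Pearson correlation between two sets of timings, plus a coefficient of variation as one common way to express relative spread. Whether the paper's variation figures are exactly this statistic is an assumption on my part, and the helper names are hypothetical.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (relative spread)."""
    mean = statistics.fmean(samples)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(samples) / mean
```

Feeding in local runtimes as `xs` and platform-reported runtimes as `ys` for the same solutions would give the kind of agreement number discussed above: values near 1 mean the two measurements rank solutions the same way, values near 0 mean they barely agree.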

The Real Problems

Three major issues emerged that challenge Leetcode’s viability as a benchmark:

Data Contamination: Using public problems risks LLMs having seen the solutions during training. The researchers had to use only problems from 2023 to mitigate this.

Platform Instability: Leetcode’s metrics drift over time — memory measurements showed a -0.24 correlation with test date. This makes reproducible benchmarking nearly impossible.

Measurement Reliability: The weak correlation between local and platform measurements raises questions about what we’re actually testing.

What It Means for LLM Benchmarking

This study doesn’t just critique Leetcode — it highlights what we need in a code generation benchmark: reproducible measurements, reliable performance metrics, and guaranteed training-test separation. Until we have platforms built specifically for this purpose, we need to be extremely cautious about using competition platforms as benchmarks.

So! We know that not all benchmarks are viable benchmarks — what about a more mainstream one?

Case Study #2: SuperGLUE — Building a Better Language Understanding Benchmark

The SuperGLUE paper tackles a fascinating problem in AI benchmarking: what do you do when models get too good at your tests? When GLUE became insufficient (with models surpassing human performance), the researchers had to rethink how we measure language understanding.

The Benchmark Design

SuperGLUE’s core innovation is its task selection methodology. The researchers collected task proposals from the NLP community and filtered them through a rigorous process: each task needed clear evaluation metrics, public training data, and — most importantly — significant headroom between machine and human performance.

This resulted in eight tasks (I’ve simplified the table from the document here, it’s a little less readable but you should get the sense of what the questions are asking):

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada (2019)

What makes these tasks special is their diversity in format. Unlike GLUE’s focus on sentence classification, SuperGLUE includes coreference resolution, reading comprehension, and more complex reasoning tasks. Each task measures different aspects of language understanding while maintaining clear, quantifiable metrics.


Part 2: Let’s Build a Physical Reasoning Benchmark: To Cheat at Escape Rooms

After looking at some benchmarks like SuperGLUE and Leetcode, I had an idea: what if we tested LLMs on something completely different — physical reasoning… through escape room puzzles?

It’s a pretty valid idea: escape rooms pose possibilities and consequences for failure. Screw up one too many puzzles, and your friends will think you’re pretty stupid and relegate you to spectator duty. Luckily for us, however, they (or the poor employees) don’t know that you can sneak a phone into an escape room, and you know just who to ask for the answers. Today, LLMs face off against the puzzles of a physical escape room.

Note: This is NOT a rigorous academic benchmark (please don’t cite this in papers, why would you even want to do that?), or even close to it, and it’s just supposed to be a fun way to test LLM benchmarking and evaluation. Please do not destroy my prompts, I am aware they are bad.

Why Physical Reasoning?

For real, though… most LLM benchmarks focus on linguistic tasks (like SuperGLUE) or code generation (like Leetcode). And for good reason — these are well-defined domains with clear evaluation metrics. But real-world problem solving often requires understanding physical principles and their interactions. The famous “Can GPT-4 do physics?” debates usually center around mathematical problem-solving, not practical physical reasoning.

Looking at existing benchmarks taught me a few key principles:

  1. Clear evaluation metrics are crucial (from SuperGLUE’s task-specific scores)
  2. Problems should have unambiguous solutions (from HumanEval’s test cases)
  3. The benchmark should test distinct capabilities (from MMLU’s subject categories)

Designing the Problems

I settled on escape room puzzles for two reasons. First, they naturally combine physical reasoning with clear goals. Second, they have unambiguous success conditions: either you solve it the intended way, or you don’t. Third, and most importantly, they let me include “red herrings”, irrelevant items that test if the LLM can identify what matters physically. Fourth, I just really like doing escape rooms (did I mention that already?).

I am aware that this is more than two reasons, but if LLMs can’t count how many r’s there are in strawberry, I’m allowed to mess up once in a while too.

Here’s how I structured the five core problems:

Fluid Dynamics (FLUID_001) (Ping pong ball stuck in a tube)

  • Tests understanding of buoyancy and fluid displacement
  • Inspired by classic physics problems but in practical context
  • Includes intentionally irrelevant items (like squishy food models)

Light Properties (UV_001) (UV light on a push number lock)

  • Tests understanding of UV fluorescence and material properties
  • Combines multiple physical principles (light, material science)
  • Requires understanding of environmental conditions

Mechanical Understanding (CIPHER_001) (A cipher ring)

  • Tests spatial reasoning and mechanical alignment
  • No red herrings; tests correlating a dial to a cipher wheel
  • Requires understanding rotational symmetry

Force Application (VAC_001) (Can stuck in hole)

  • Tests understanding of vacuum forces and surface adhesion
  • Multiple possible solution approaches
  • Requires understanding force multiplication

Collaborative Physics (COLLAB_001) (Can two people shimmy a key?)

  • Tests understanding of physical constraints in multi-agent scenarios
  • Requires combining multiple physical principles
  • Tests understanding of tool creation and friction

Sounds really fancy… but it’s just some basic physical puzzles. You can access them on my GitHub.

The Technical Part

The benchmark implementation has three main components:

1. Problem Definition Layer

Problems are defined in a structured JSON format that enforces consistent evaluation:

{
    "problem_id": "FLUID_001",
    "setup": {
        "scenario": "A ping pong ball is at the bottom of a narrow tube...",
        "available_items": ["bottle of water", "squishy food models"...],
        "constraints": ["tube too narrow for manual retrieval"]
    },
    "physical_principles": ["buoyancy", "fluid displacement"],
    "red_herrings": ["squishy food models", "milk carton"],
    "solution": {
        "steps": ["pour water into tube", "allow ball to float"],
        "key_insights": ["water displaces air", "ping pong ball less dense"]
    }
}

This structure draws from SuperGLUE’s design — each component is clearly separated and machine-readable. The physical_principles field explicitly lists what’s being tested, while red_herrings helps in scoring the LLM’s ability to ignore irrelevant information.
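A definition like this can be loaded and sanity-checked before any API credits get spent. The sketch below is not the repository's loader: the `Problem` dataclass, `parse_problem`, and the rule that every red herring must also be an available item are assumptions made for illustration.

```python
import json
from dataclasses import dataclass

REQUIRED_KEYS = ("problem_id", "setup", "physical_principles", "red_herrings", "solution")


@dataclass(frozen=True)
class Problem:
    problem_id: str
    scenario: str
    available_items: tuple
    constraints: tuple
    physical_principles: tuple
    red_herrings: tuple
    solution_steps: tuple
    key_insights: tuple


def parse_problem(raw: str) -> Problem:
    """Parse one problem definition and fail loudly on structural mistakes."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    setup, solution = data["setup"], data["solution"]
    items = tuple(setup.get("available_items", ()))
    herrings = tuple(data["red_herrings"])
    # Assumed rule: a red herring only makes sense if the solver can pick it up.
    unknown = [h for h in herrings if h not in items]
    if unknown:
        raise ValueError(f"red herrings not in available_items: {unknown}")
    return Problem(
        problem_id=data["problem_id"],
        scenario=setup.get("scenario", ""),
        available_items=items,
        constraints=tuple(setup.get("constraints", ())),
        physical_principles=tuple(data["physical_principles"]),
        red_herrings=herrings,
        solution_steps=tuple(solution.get("steps", ())),
        key_insights=tuple(solution.get("key_insights", ())),
    )
```

Catching a misspelled item name at load time is a lot cheaper than discovering it in the judge's scores.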

2. Evaluation Framework

The evaluation system uses Python’s asyncio for concurrent testing, with retry logic for a little bit more API stability:

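from typing import Dict

import aiohttp
# Assumption: the retry helpers come from the tenacity library.
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
async def evaluate_response(self, criteria: JudgingCriteria) -> Dict:
    """Evaluate a model's response using GPT-4 as judge."""
    async with aiohttp.ClientSession() as session:
        # ... evaluation logic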
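The same retry-plus-concurrency idea can be sketched with only the standard library. This is a stand-in for illustration, not the benchmark's actual code: `call_with_retry`, `evaluate_all`, and the concurrency cap of 5 are all assumptions.

```python
import asyncio


async def call_with_retry(make_call, attempts=3, base_delay=1.0, max_delay=10.0):
    """Await make_call(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts - 1:
                # Out of attempts: let the caller see the last error.
                raise
            await asyncio.sleep(min(max_delay, base_delay * 2 ** attempt))


async def evaluate_all(problem_ids, evaluate_one, max_concurrency=5):
    """Evaluate every problem concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(problem_id):
        async with semaphore:
            result = await call_with_retry(lambda: evaluate_one(problem_id))
            return problem_id, result

    pairs = await asyncio.gather(*(guarded(pid) for pid in problem_ids))
    return dict(pairs)
```

The semaphore keeps the number of in-flight requests bounded, which helps with rate limits, while the backoff gives a flaky endpoint a moment to recover between attempts.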

The scoring system looks at three components:

Physical Understanding Score (PUS) ∈ [0,2]

  • Measures understanding of relevant physical principles
  • Calculated as normalized sum of demonstrated principles

Solution Path Score (SPS) ∈ [0,2]

  • Evaluates completeness and correctness of solution steps
  • Considers practical feasibility of proposed solutions

Red Herring Handling (RHH) ∈ {0,1}

  • A binary score for avoiding irrelevant items
  • Tests ability to focus on physically relevant factors
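As a rough sketch of how these three numbers could be computed and kept in range: in the benchmark itself the judge model assigns the scores, so the set-based helpers and the unweighted total below are assumptions made for illustration, not the scoring code.

```python
from dataclasses import dataclass


def physical_understanding_score(demonstrated, required):
    """Scale the share of required principles that were demonstrated onto [0, 2]."""
    required_set = set(required)
    if not required_set:
        return 0.0
    hits = len(required_set & set(demonstrated))
    return 2.0 * hits / len(required_set)


def red_herring_handling(items_used, red_herrings):
    """Binary score: 1 if no red herring was used, otherwise 0."""
    return 0 if set(items_used) & set(red_herrings) else 1


@dataclass(frozen=True)
class ScoreCard:
    pus: float  # Physical Understanding Score, in [0, 2]
    sps: float  # Solution Path Score, in [0, 2]
    rhh: int    # Red Herring Handling, 0 or 1

    def __post_init__(self):
        if not 0 <= self.pus <= 2:
            raise ValueError("pus must be in [0, 2]")
        if not 0 <= self.sps <= 2:
            raise ValueError("sps must be in [0, 2]")
        if self.rhh not in (0, 1):
            raise ValueError("rhh must be 0 or 1")

    def total(self):
        """Unweighted sum out of 5; the equal weighting is an assumption."""
        return self.pus + self.sps + self.rhh
```

Under this scheme, a response that uses the UV light in the can puzzle gets an RHH of 0 no matter how sensible the use sounds.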

And yes, there are many other scoring methods, better and worse, that could be used! For example, RHH could count how many irrelevant items are used in the solution, or it could measure how viable each use is… the point is that picking these metrics is often pretty arbitrary, but it is very, very important to making your benchmark credible, which mine is very much not.

Additionally, I did not want to rewrite any code after. Sue me.

3. Model Interface Layer

The benchmark supports multiple LLM backends through a common interface:

class ModelInterface:
    """Interface for different LLM APIs."""

    async def generate_response(self, prompt: str) -> str:
        raise NotImplementedError


class GPT4Interface(ModelInterface):
    async def generate_response(self, prompt: str) -> str:
        # GPT-4 specific implementation
        ...


class ClaudeInterface(ModelInterface):
    async def generate_response(self, prompt: str) -> str:
        # Claude specific implementation
        ...

Two models… I can’t really afford any more, please understand.
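One perk of a common interface is that a free, offline stand-in can be swapped in for dry runs before spending credits. The sketch below assumes a hypothetical `CannedInterface` and `collect_responses` helper; neither is part of the actual benchmark.

```python
import asyncio


class ModelInterface:
    """Interface for different LLM APIs."""

    async def generate_response(self, prompt: str) -> str:
        raise NotImplementedError


class CannedInterface(ModelInterface):
    """Hypothetical offline backend that always returns a fixed reply."""

    def __init__(self, reply: str):
        self.reply = reply

    async def generate_response(self, prompt: str) -> str:
        return self.reply


async def collect_responses(models, prompt):
    """Send one prompt to every backend concurrently and map name to reply."""
    names = list(models)
    replies = await asyncio.gather(
        *(models[name].generate_response(prompt) for name in names)
    )
    return dict(zip(names, replies))
```

Running `asyncio.run(collect_responses({"stub": CannedInterface("use the suction cup")}, "..."))` exercises the whole pipeline shape without a single API call.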

Let’s Test It!

So after some refilling of API balances, I was ready to go. I ran the benchmark, and after totally not wasting credits on like, 5 different errors, the results were in. And they were… interesting! I used GPT-4 and Claude 3.5 Haiku (I am aware these are not similarly spec-ed models) as our victims.

The Challenge: A can is stuck in a hole that almost exactly matches its volume. A combination is written on the bottom of the can that needs to be retrieved.

The inspiration
"VAC_001": {
        "problem_id": "VAC_001",
        "setup": {
            "available_items": [
                "ruler",
                "notebook",
                "suction cup",
                "UV light"
            ],
            "constraints": [
                "can fits too tightly for finger grip",
                "hole is same diameter as can",
                "can is clean and dry"
            ]
        },
        "physical_principles": [
            "vacuum suction",
            "surface adhesion",
            "force multiplication"
        ],
        "red_herrings": [
            "ruler",
            "notebook",
            "UV light"
        ],
        "solution": {
            "steps": [
                "inspect suction cup and can surface",
                "ensure can surface is clean",
                "firmly press suction cup onto can top",
                "create vacuum seal",
                "pull can out using suction cup"
            ],
            "key_insights": [
                "suction cup creates vacuum seal",
                "vacuum force allows extraction",
                "other items don't provide necessary force"
            ]
        }
}

GPT-4’s Solution

Physical Principles: The main physical principle involved in this solution is the concept of air pressure differential, which is the same principle that allows a suction cup to stick to a surface. Air pressure is higher on one side of the suction cup (outside) compared to the other (inside), and this difference in pressure causes the cup to stick to the surface.

Solution Steps:

  1. Attach the suction cup to the top of the can. The flat surface of the can should provide a good contact for the suction cup to adhere to.
  2. Once the suction cup is firmly attached, gently pull upwards. The air pressure differential should cause the suction cup (and thus the can) to be lifted out of the hole.
  3. Once the can is out of the hole, remove the suction cup from the top of the can.
  4. Use the UV light to illuminate the combination written on the bottom of the can. Some inks only become visible under UV light, which is why this step is necessary.
  5. Write down the combination on the notebook to ensure it’s not forgotten.

Required Items:

  • Suction cup: This is required to adhere to the can and pull it out of the hole, as the can fits too tightly for finger grip.
  • UV light: This is required to reveal the combination written on the bottom of the can, as it may be written in ink that only becomes visible under UV light.
  • Notebook: This is required to record the combination once it’s visible, ensuring it’s not forgotten.

Not bad at all! GPT-4 manages to figure out the physical principle behind the solution (even if it is very wordy), and manages to step-by-step describe how to solve the problem. It does end up using the UV light and notebook, but that’s more of a problem on my end, not punishing the use of unnecessary items in the prompt enough. Also, the uses for these items seem decently reasonable… so why not!

Haiku-3.5’s Solution

Physical Principles: The key physical principles involved in this scenario are the concept of volume and the use of suction to extract the can from the hole. Since the can fits almost exactly into the hole, it is likely that the can is stuck due to the tight fit and the lack of any air gaps or space between the can and the hole. This tight fit creates a vacuum-like effect, making it difficult to grip and pull the can out.

Solution Steps:

  1. Measure the diameter of the can using the ruler. This will help determine the size of the hole and confirm that the can fits tightly within it.
  2. Inspect the bottom of the can to locate the combination written on it. Use the UV light to illuminate the bottom of the can and make the combination visible, if it is not already clearly visible.
  3. Attach the suction cup to the top of the can. The suction cup will create a temporary seal on the can’s surface, allowing you to grip and pull it out of the hole.
  4. Gently pull the can upwards using the suction cup. The suction force should be strong enough to overcome the vacuum-like effect and extract the can from the hole.
  5. Once the can is out, record the combination written on the bottom of the can in the notebook.

Required Items:

  • Ruler: To measure the diameter of the can and confirm the tight fit in the hole.
  • UV light: To illuminate the bottom of the can and make the combination visible, if it is not already clearly visible.
  • Suction cup: To create a temporary seal on the can’s surface and provide the necessary grip to pull the can out of the hole.
  • Notebook: To record the combination written on the bottom of the can.

On the other hand, Haiku’s solution was… certainly a solution of all time.

First, it wants to measure the diameter of the can using the ruler??? To determine the size of the hole and confirm that the can fits tightly within it? Why would we need to do this? And do we need a ruler for that?

Second, it tells us to inspect the bottom of the can to locate the combination, when the entire problem is about not being able to pull the can out of the hole conventionally. This might just be an issue of order, but now I truly understand my friends’ feelings whenever I would tell them “just fix it man” to their numerous problems.

But it eventually does get the solution. So… not the worst.

Here’s a fancy radar graph of the results!

We see that both models are pretty similar in their capabilities, with GPT-4 being slightly better in physical understanding and solution path, and Haiku being slightly better in red herring handling. Overall though, both models kind of suck. Dang.

There are also only… 5 questions.

If you’d like to see the full breadth of questions, they’re on my GitHub.

LLM-as-a-Judge

By the way, the method I used to generate the evaluations, LLM-as-a-judge, has gained significant traction in the AI community, particularly after the work of Zheng et al. in their 2023 paper “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” They found that strong judges like GPT-4 can reach over 80% agreement with human preferences, about the same level of agreement humans have with each other!

Here’s where my experiment gets kind of cool (arguably, maybe, subjectively) — I used this methodology and had GPT-4 judge other LLMs’ physical reasoning abilities. Yes, I’m using an AI to judge other AIs.

Why does this work? Well, judging a response is actually a simpler task than generating one. When GPT-4 generates a solution to a physical puzzle, it needs to:

  • Understand the physical principles involved
  • Plan a sequence of steps
  • Consider all constraints
  • Generate a coherent explanation

But when judging, it only needs to check if specific criteria are met in an existing solution. The evaluation prompt is very focused:

def _create_evaluation_prompt(self, criteria: JudgingCriteria) -> str:
    return f"""You are an expert judge evaluating an LLM's understanding of physical reasoning puzzles.
Evaluate based on three criteria:
1. Physical Understanding Score (0-2): Does the solution correctly apply relevant physical principles?
2. Solution Path Score (0-2): Are the steps complete and feasible?
3. Red Herring Handling (0-1): Does it avoid using irrelevant items?
Scenario: {criteria.scenario}
Physical Principles Required: {criteria.correct_principles}
Solution Given: {criteria.model_response}
"""
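The judge's reply still has to be turned back into numbers. The sketch below assumes the prompt is extended to ask for a JSON object with three keys, which the prompt shown above does not do; the key names, the `JudgingCriteria` field types, and `parse_judge_reply` are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class JudgingCriteria:
    # Field names follow the prompt template above; the types are assumptions.
    scenario: str
    correct_principles: list
    model_response: str


# Assumed reply keys and their allowed score ranges.
SCORE_RANGES = {
    "physical_understanding": (0, 2),
    "solution_path": (0, 2),
    "red_herring_handling": (0, 1),
}


def parse_judge_reply(reply: str) -> dict:
    """Extract the first JSON object from a judge reply and validate the scores."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(reply[start:end + 1])
    scores = {}
    for key, (low, high) in SCORE_RANGES.items():
        value = float(data[key])
        if not low <= value <= high:
            raise ValueError(f"{key}={value} is outside [{low}, {high}]")
        scores[key] = value
    return scores
```

Rejecting out-of-range scores up front keeps one chatty or confused judge reply from quietly skewing the averages.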

To validate this approach, I followed the validation framework suggested by Zheng et al., performing spot-checks of GPT-4’s evaluations against my own judgments. Surprisingly (or perhaps unsurprisingly, given the broader research on LLM evaluation), it was remarkably consistent in identifying both correct physical understanding and flawed reasoning.
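A spot-check like this boils down to counting how often the judge and the human land on the same score. Here is a minimal sketch, with a hypothetical `agreement_rate` helper and an optional tolerance for near-misses; it is not the procedure from Zheng et al.

```python
def agreement_rate(judge_scores, human_scores, tolerance=0.0):
    """Fraction of items where judge and human scores differ by at most `tolerance`."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    if not judge_scores:
        raise ValueError("need at least one scored item")
    matches = sum(
        abs(judge - human) <= tolerance
        for judge, human in zip(judge_scores, human_scores)
    )
    return matches / len(judge_scores)
```

With only five problems and two models, the number of comparisons is tiny, so any agreement figure from this benchmark should be read as anecdotal.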

Is this perfect? Absolutely not. There’s something philosophically weird about using one LLM to evaluate another. But in practice, it can work surprisingly well — just like how I moan and groan about the visual presentation of a dish on Masterchef, while setting my kitchen aflame trying to microwave a hot dog.

What I Learned

Building this benchmark taught me several things about benchmark design:

Clear Metrics Matter: Even for complex tasks like physical reasoning, you need unambiguous scoring criteria.

Red Herrings Are Powerful: Including irrelevant items reveals a lot about an LLM’s reasoning process.

Context Control is Hard: Ensuring LLMs don’t “hallucinate” additional physical context is challenging.

Is this a perfect benchmark? Not even close. Please don’t rub it in. Is it scientifically rigorous? Definitely not. But it’s been a fascinating exploration into an aspect of LLM capabilities, and sometimes the best we can learn can come from just trying things out and seeing what happens.

Now, if you’ll excuse me, I will be sneaking in a phone with an internet connection into my next escape room, for reasons that I am legally unmotivated to disclose.

[1] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), Datasets and Benchmarks Track (2023)

[2] T. Coignion, C. Quinton, R. Rouvoy, “A Performance Study of LLM-Generated Code on Leetcode,” In 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024), Salerno, Italy (2024)

[3] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems,” In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada (2019)

[5] DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y. Wu, Z.F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao et al., “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948 (2025)

[6] Unless otherwise stated, all images are created by the author.

Shape
Shape
Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy,  bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

Shape

Eying AI factories, Nvidia buys bigger stake in CoreWeave

Nvidia continues to throw its sizable bank account around, this time making a $2 billion investment in GPU cloud service provider CoreWeave. The company says the investment reflects Nvidia’s “confidence in CoreWeave’s business, team and growth strategy as a cloud platform built on Nvidia infrastructure.” CoreWeave is not the only

Read More »

AI, security tailwinds signal promising 2026 for Cisco

A big component of AI in communications is agentic agents talking to employees and customers, and bringing trust to the system is where Cisco should shine. It builds and runs its own infrastructure, which is secure by design. Cisco has relationships with governments all over the world, and between Webex

Read More »

Enterprise Spotlight: Manufacturing Reimagined

Emerging technologies from AI and extended reality to edge computing, digital twins, and more are driving big changes in the manufacturing world.  Download the February 2026 issue of the Enterprise Spotlight from the editors of CIO, Computerworld, CSO, InfoWorld, and Network World and learn about the new tech at the forefront

Read More »

Texas Upstream Employment Rises

Employment in the Texas upstream sector increased between November and December 2025. That’s what the Texas Independent Producers and Royalty Owners Association (TIPRO) said in a statement sent to Rigzone on Friday, which cited the latest Current Employment Statistics (CES) report from the U.S. Bureau of Labor Statistics (BLS) at the time. TIPRO highlighted in the statement that oil and natural gas extraction jobs rose by 500, or 0.7 percent, month on month, to 70,200, and support activities employment grew by 1,500, or 1.1 percent month on month, to 133,200. TIPRO reported in the statement that combined upstream employment increased by 2,000 jobs, or 1.0 percent month on month, to 203,400. “From January to December 2025, employment in the Texas upstream sector showed early gains followed by later fluctuations,” TIPRO said in the statement. “Oil and Gas Extraction added a net 2,000 jobs (+2.9 percent), reaching a peak of 70,200 in June, July, and December, driven by robust Permian production despite market pressures,” it added. “Support Activities employment recorded a net loss of 2,100 jobs (-1.6 percent), with a February0May surge (+2,800) partially offset by mid-year declines (-3,400 in June-July) and subsequent volatility, reflecting rig count reductions and service sector adjustments,” it continued. “Combined, the sectors ended essentially flat, with a net change of -100 jobs (-0.05 percent), reaching 203,400 by December and underscoring the industry’s critical yet volatile role in sustaining Texas’ energy workforce,” TIPRO noted. In the statement, TIPRO said its workforce data “continues to indicate strong job postings for the Texas oil and natural gas industry in December” but added that analysis “revealed a continued decline in Q4 driven by lower oil prices, industry consolidation, and ongoing efficiency gains, which allow companies to maintain or increase production with reduced hiring activity”. 
There were 7,887 unique industry job postings in Texas during the

Read More »

Strategists Forecast Week on Week USA Crude Inventory Rise

In an oil and gas report sent to Rigzone by the Macquarie team prior to the release of the U.S. Energy Information Administration’s (EIA) weekly petroleum status report, Macquarie strategists revealed that they are forecasting that U.S. crude inventories will increase week on week. “We are forecasting U.S. crude inventories up 1.9 million barrels for the week ending January 30,” the strategists, including Walt Chancellor, said in the report. “This follows a 2.3 million barrel draw in the prior week, with the crude balance realizing tighter relative to our expectations,” they added. “For this week’s stats, we see significant room for volatility due to winter storm impacts on oil production, refinery runs, and product demand,” they continued. “In any event, for the week ending 1/30, from refineries, we look for another reduction in crude runs (-0.3 million barrels per day), with storm effects and turnaround timing adding noise to the picture,” they noted. “Among net imports”, the strategists said in the report that they “model a large increase, with exports sharply lower (-1.0 million barrels per day) and imports sharply higher (+0.8 million barrels per day) on a nominal basis”. The strategists warned that timing of cargoes remains a source of potential volatility in the weekly crude balance. They went on to state in the report that, “from implied domestic supply (prod.+adj.+transfers)”, they “look for a large nominal reduction (-1.6 million barrels per day) following a strong print in the prior week and accounting for freeze impacts”. “Notably, while visibility on the ultimate impact of last week’s freeze event remains limited, we believe oil production has largely recovered to this point. Rounding out the picture, we anticipate a smaller increase (+0.2 million barrels) in SPR [U.S. Strategic Petroleum Reserve] stocks for the week ending 1/30,” they added. “Among products, we

Read More »
