Stay Ahead, Stay ONMINE

I Tried Making my Own (Bad) LLM Benchmark to Cheat in Escape Rooms

Recently, DeepSeek announced their latest model, R1, and article after article came out praising its performance relative to cost, and how the release of such open-source models could genuinely change the course of LLMs forever. That is really exciting! And also, too big of a scope to write about… but when a model like DeepSeek […]

Recently, DeepSeek announced their latest model, R1, and article after article came out praising its performance relative to cost, and how the release of such open-source models could genuinely change the course of LLMs forever. That is really exciting! And also, too big of a scope to write about… but when a model like DeepSeek comes out of nowhere with a steel chair, boasting similar performance levels to other models, what does performance really mean in this context?

If you follow AI releases, you’ve seen this dance before. Every new model drops with its graphs showing how it’s somehow simultaneously better than GPT-4 on math problems while being smaller and more efficient. But what exactly are these benchmarks measuring? How are they created? And more importantly, how can we cut through the hype to create our own benchmarks for specific use cases?

I wanted to learn more about LLM Benchmarking.

Part 1: What is a Benchmark? (in 3 seconds)

TL:DR — The SATs (multiple, actually) for LLMs.

Part 1.1: What is a Benchmark? (in more than 3 seconds)

Before we dive into the nitty-gritty of specific benchmarks, let’s take a moment to unpack what we even mean by “LLM Benchmark.” Because calling them the “SATs for AI” feels both right and also slightly oversimplified.

LLM benchmarks are, at their core, structured tests used to measure how well large language models perform on certain tasks. These tasks can be anything from identifying if a statement is true or false, to summarizing a legal document, to generating valid Python functions. Think of them as curated obstacle courses specially designed by AI researchers to test every relevant muscle these models might have. These frameworks typically provide a dataset of inputs with known correct outputs, allowing for consistent comparison between models.

Modern benchmarks employ various evaluation methodologies. Classification metrics like accuracy work for tasks with discrete correct answers, while overlap-based metrics (BLEU, ROUGE) evaluate free-form text generation. Some benchmarks use functional testing for code generation, or employ other LLMs as judges to evaluate response quality.

A typical benchmark usually comes packaged as:

  • A standardized dataset of questions, prompts, or tasks (with correct or reference answers).
  • An evaluation protocol specifying how to measure success, like accuracy, F1 score, BLEU/ROUGE for text generation, or pass/fail rates for coding tasks.
  • A leaderboard or some form of comparative scoreboard, often with big flashy graphs.

Some really famous benchmarks include MMLU for testing multitask language understanding, TruthfulQA for assessing factual accuracy, and HumanEval for measuring coding capabilities. Results are pretty often published on public leaderboards, which let’s people perform some transparent comparison between different models.

From the DeepSeek paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

What Makes a Good Benchmark?

  1. A Clear Task Definition: We want tasks that are unambiguous. The more straightforward and well-specified the challenge, the easier it is to trust the results.
  2. Data Integrity: The test set shouldn’t be floating around in the training data. Because if the model’s seen the exact same question 50 times before, the evaluation is about as useful as giving a math quiz to someone who already has the answer key.
  3. Quantifiable Metrics: You need a standard for scoring performance — like how many times the model’s code passes test cases or how close the generated summary is to a “ground-truth” summary.
  4. Task Diversity & Difficulty: If a benchmark is too easy, everyone just ACES it on day one, and we learn… well, nothing. If it’s too niche (like “We test only the model’s ability to count the digits of Pi for 20 minutes”), that’s also not so helpful.

Life Ain’t All about The Grades

Benchmarks capture only a slice of what LLMs can do. In the real world, your chatbot might need to juggle domain knowledge, keep track of conversation context, abide by your company’s policies, and produce fluent, non-offensive replies. No single standardized test out there fully covers that. As we’ll see in the upcoming case studies, the design and execution of a benchmark can heavily shape the picture you get of your model’s performance… and sometimes lead you astray if you’re not careful with how you measure success.

Now that we have a sense of what Llm Benchmarks are designed to accomplish (and where they might fall short), let’s explore a couple of examples to see how people actually build and use them in practice — with mixed results!

Case Study #1: Leetcode as an LLM Benchmark

As a student in the tech space, the word “Leetcode” popping up during my search for cool benchmarks raised by blood pressure by a statistically significant amount. Unlike Leetcode, which sucks, the paper “Performance Study of LLM-Generated Code on Leetcode” was very interesting — it asks a deceptively simple question: can we use Leetcode to benchmark LLM code generation? Their findings reveal both the promise and pitfalls of this approach.

The Benchmark Design

The researchers built a three-stage validation system. Local tests catch basic errors, Leetcode’s judge verifies correctness, and a custom benchmarking setup measures performance. This setup revealed something critical: benchmarking code performance is harder than it looks.

When they compared local measurements to Leetcode’s metrics, they found only a 0.28 correlation. Leetcode’s measurements showed much higher variation (0.089 vs 0.035 locally). Even worse, Leetcode’s rankings proved unstable — identical solutions could drop from the 77th to 54th percentile just based on submission timing.

A Performance Study of LLM-Generated Code on Leetcode,” In 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024), Salerno, Italy (2024)

The Real Problems

Three major issues emerged that challenge Leetcode’s viability as a benchmark:

Data Contamination: Using public problems risks LLMs having seen the solutions during training. The researchers had to use only problems from 2023 to mitigate this.

Platform Instability: Leetcode’s metrics drift over time — memory measurements showed a -0.24 correlation with test date. This makes reproducible benchmarking nearly impossible.

Measurement Reliability: The weak correlation between local and platform measurements raises questions about what we’re actually testing.

What It Means for LLM Benchmarking

This study doesn’t just critique Leetcode — it highlights what we need in a code generation benchmark: reproducible measurements, reliable performance metrics, and guaranteed training-test separation. Until we have platforms built specifically for this purpose, we need to be extremely cautious about using competition platforms as benchmarks.

So! We know that not all benchmarks are viable benchmarks — what about a more mainstream one?

Case Study #2: SuperGLUE — Building a Better Language Understanding Benchmark

The SuperGLUE paper tackles a fascinating problem in AI benchmarking: what do you do when models get too good at your tests? When GLUE became insufficient (with models surpassing human performance), the researchers had to rethink how we measure language understanding.

The Benchmark Design

SuperGLUE’s core innovation is its task selection methodology. The researchers collected task proposals from the NLP community and filtered them through a rigorous process: each task needed clear evaluation metrics, public training data, and — most importantly — significant headroom between machine and human performance.

This resulted in eight tasks (I’ve simplified the table from the document here, it’s a little less readable but you should get the sense of what the questions are asking):

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada (2019)

What makes these tasks special is their diversity in format. Unlike GLUE’s focus on sentence classification, SuperGLUE includes coreference resolution, reading comprehension, and more com plex reasoning tasks. Each task measures different aspects of language understanding while maintaining clear, quantifiable metrics.


Part 2: Let’s Build a Physical Reasoning Benchmark: To Cheat at Escape Rooms

After looking at some benchmarks like SuperGLUE and Leetcode, I had an idea: what if we tested LLMs on something completely different — physical reasoning… through escape room puzzles?

It’s a pretty valid idea — escape rooms poses possibilities and consequences for failure — screw up one too many puzzles, and your friends will think you’re pretty stupid, and relegate you to spectator duty. Luckily for us however, they (or the poor employees) don’t know that you can sneak a phone into an escape room — and you know just who to ask for the answers. Today, LLMs face off against the puzzles of a physical escape room.

Note: This is NOT a rigorous academic benchmark (please don’t cite this in papers, why would you even want to do that?), or even close to it, and it’s just supposed to be a fun way to test LLM benchmarking and evaluation. Please do not destroy my prompts, I am aware they are bad.

Why Physical Reasoning?

For real, though… most LLM benchmarks focus on linguistic tasks (like SuperGLUE) or code generation (like Leetcode). And for good reason — these are well-defined domains with clear evaluation metrics. But real-world problem solving often requires understanding physical principles and their interactions. The famous “Can GPT-4 do physics?” debates usually center around mathematical problem-solving, not practical physical reasoning.

Looking at existing benchmarks taught me a few key principles:

  1. Clear evaluation metrics are crucial (from SuperGLUE’s task-specific scores)
  2. Problems should have unambiguous solutions (from HumanEval’s test cases)
  3. The benchmark should test distinct capabilities (from MMLU’s subject categories)

Designing the Problems

I settled on escape room puzzles for two reasons. First, they naturally combine physical reasoning with clear goals. Second, they have unambiguous success conditions — either you solve it through the intended way, or you don’t. Third, and most importantly, they let me include “red herrings” — irrelevant items that test if the LLM can identify what matters physically. Fourth, I just really like doing escape rooms (did I mention that already?),

I am aware that this is more than two reasons, but if LLMs can’t count how many rs’ there are in strawberry, I’m allowed to mess up once in a while too.

Here’s how I structured the five core problems:

Fluid Dynamics (FLUID_001) (Ping pong ball stuck in a tube)

  • Tests understanding of buoyancy and fluid displacement
  • Inspired by classic physics problems but in practical context
  • Includes intentionally irrelevant items (like squishy food models)

Light Properties (UV_001) (UV light on a push numebr lock)

  • Tests understanding of UV fluorescence and material properties
  • Combines multiple physical principles (light, material science)
  • Requires understanding of environmental conditions

Mechanical Understanding (CIPHER_001) (A cipher ring)

  • Tests spatial reasoning and mechanical alignment
  • No red herrings — tests for correlating a dial to a cypher wheel
  • Requires understanding rotational symmetry

Force Application (VAC_001) (Can stuck in hole)

  • Tests understanding of vacuum forces and surface adhesion
  • Multiple possible solution approaches
  • Requires understanding force multiplication

Collaborative Physics (COLLAB_001) (Can two people shimmy a key?)

  • Tests understanding of physical constraints in multi-agent scenarios
  • Requires combining multiple physical principles
  • Tests understanding of tool creation and friction

Sounds really fancy… but it’s just some basic physical puzzles. You can access them on my GitHub.

The Technical Part

The benchmark implementation has three main components:

Problem Definition Layer

Problems are defined in a structured JSON format that enforces consistent evaluation:

{
    "problem_id": "FLUID_001",
    "setup": {
        "scenario": "A ping pong ball is at the bottom of a narrow tube...",
        "available_items": ["bottle of water", "squishy food models"...],
        "constraints": ["tube too narrow for manual retrieval"]
    },
    "physical_principles": ["buoyancy", "fluid displacement"],
    "red_herrings": ["squishy food models", "milk carton"],
    "solution": {
        "steps": ["pour water into tube", "allow ball to float"],
        "key_insights": ["water displaces air", "ping pong ball less dense"]
    }
}

This structure draws from SuperGLUE’s design — each component is clearly separated and machine-readable. The physical_principles field explicitly lists what’s being tested, while red_herrings helps in scoring the LLM’s ability to ignore irrelevant information.

2. Evaluation Framework

The evaluation system uses Python’s asyncio for concurrent testing, with retry logic for a little bit more API stability:

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
async def evaluate_response(self, criteria: JudgingCriteria) -> Dict:
    """Evaluate a model's response using GPT-4 as judge."""
    async with aiohttp.ClientSession() as session:
        # ... evaluation logic

The scoring system looks at three components:

Physical Understanding Score (PUS) ∈ [0,2]

  • Measures understanding of relevant physical principles
  • Calculated as normalized sum of demonstrated principles

Solution Path Score (SPS) ∈ [0,2]

  • Evaluates completeness and correctness of solution steps
  • Considers practical feasibility of proposed solutions

Red Herring Handling (RHH) ∈ {0,1}

  • A Binary score for avoiding irrelevant items
  • Tests ability to focus on physically relevant factors

And yes, there are also so many other scoring methods, better and worse, that could be used! For example, RHH could be about how many irrelevant items are used in the solution, or it could be a measure of how viable the use is… the point is that picking these metrics are often times pretty arbitrary, but are very very important to making your benchmark is credible, which mine is very much not.

Additionally, I did not want to rewrite any code after. Sue me.

3. Model Interface Layer

The benchmark supports multiple LLM backends through a common interface:

class ModelInterface:
    """Interface for different LLM APIs."""
    async def generate_response(self, prompt: str) -> str:
        raise NotImplementedError
class GPT4Interface(ModelInterface):
    async def generate_response(self, prompt: str) -> str:
        # GPT-4 specific implementation
class ClaudeInterface(ModelInterface):
    async def generate_response(self, prompt: str) -> str:
        # Claude specific implementation

Two models… I can’t really afford any more, please understand.

Let’s Test It!

So after some refilling of API balances, I was ready to go. I ran the benchmark, and after totally not wasting credits on like, 5 different errors, the results were in. And they were… interesting! I used GPT4 and Claude 3.5 Haiku (I am aware these are not similarly spec-ed models) as our victims.

The Challenge: A can is stuck in a hole that almost exactly matches its volume. A combination is written on the bottom of the can that needs to be retrieved.

The inspiration
"VAC_001": {
        "problem_id": "VAC_001",
        "setup": {
            "available_items": [
                "ruler",
                "notebook",
                "suction cup",
                "UV light"
            ],
            "constraints": [
                "can fits too tightly for finger grip",
                "hole is same diameter as can",
                "can is clean and dry"
            ]
        },
        "physical_principles": [
            "vacuum suction",
            "surface adhesion",
            "force multiplication"
        ],
        "red_herrings": [
            "ruler",
            "notebook",
            "UV light"
        ],
        "solution": {
            "steps": [
                "inspect suction cup and can surface",
                "ensure can surface is clean",
                "firmly press suction cup onto can top",
                "create vacuum seal",
                "pull can out using suction cup"
            ],
            "key_insights": [
                "suction cup creates vacuum seal",
                "vacuum force allows extraction",
                "other items don't provide necessary force"
            ]
        }

GPT-4’s Solution

Physical Principles: The main physical principle involved in this solution is the concept of air pressure differential, which is the same principle that allows a suction cup to stick to a surface. Air pressure is higher on one side of the suction cup (outside) compared to the other (inside), and this difference in pressure causes the cup to stick to the surface.

Solution Steps:

  1. Attach the suction cup to the top of the can. The flat surface of the can should provide a good contact for the suction cup to adhere to.
  2. Once the suction cup is firmly attached, gently pull upwards. The air pressure differential should cause the suction cup (and thus the can) to be lifted out of the hole.
  3. Once the can is out of the hole, remove the suction cup from the top of the can.
  4. Use the UV light to illuminate the combination written on the bottom of the can. Some inks only become visible under UV light, which is why this step is necessary.
  5. Write down the combination on the notebook to ensure it’s not forgotten.

Required Items:

  • Suction cup: This is required to adhere to the can and pull it out of the hole, as the can fits too tightly for finger grip.
  • UV light: This is required to reveal the combination written on the bottom of the can, as it may be written in ink that only becomes visible under UV light.
  • Notebook: This is required to record the combination once it’s visible, ensuring it’s not forgotten.”

Not bad at all! GPT-4 manages to figure out the physical principle behind the solution (even if it is very wordy), and manages to step-by-step describe how to solve the problem. It does end up using the UV light and notebook, but that’s more of a problem on my end, not punishing the use of unnecessary items in the prompt enough. Also, the uses for these items seem decently reasonable… so why not!

Haiku-3.5’s Solution

Physical Principles: The key physical principles involved in this scenario are the concept of volume and the use of suction to extract the can from the hole. Since the can fits almost exactly into the hole, it is likely that the can is stuck due to the tight fit and the lack of any air gaps or space between the can and the hole. This tight fit creates a vacuum-like effect, making it difficult to grip and pull the can out.

Solution Steps:

  1. Measure the diameter of the can using the ruler. This will help determine the size of the hole and confirm that the can fits tightly within it.
  2. Inspect the bottom of the can to locate the combination written on it. Use the UV light to illuminate the bottom of the can and make the combination visible, if it is not already clearly visible.
  3. Attach the suction cup to the top of the can. The suction cup will create a temporary seal on the can’s surface, allowing you to grip and pull it out of the hole.
  4. Gently pull the can upwards using the suction cup. The suction force should be strong enough to overcome the vacuum-like effect and extract the can from the hole.
  5. Once the can is out, record the combination written on the bottom of the can in the notebook.

Required Items:

  • Ruler: To measure the diameter of the can and confirm the tight fit in the hole.
  • UV light: To illuminate the bottom of the can and make the combination visible, if it is not already clearly visible.
  • Suction cup: To create a temporary seal on the can’s surface and provide the necessary grip to pull the can out of the hole.
  • Notebook: To record the combination written on the bottom of the can.

On the other hand, Haiku’s solution was.. certainly a solution of all time.

First, it wants to measure the diameter of the can using the ruler??? To determine the size of the hole and confirm that the can fits tightly within it? Why would we need to do this? And do we need a ruler for that?

Second, it tells us to inspect the bottom of the can to locate the combination, when the entire problem is about not being able to pull the can out of the hole conventionally. This might just be an issue of order, but now I truly understand my friends’ feelings whenever I would tell them “just fix it man” to their numerous problems.

But it eventually does get the solution. So… not the worst.

Here’s a fancy radar graph of the results!

We see that both models are pretty similar in their capabilities, with GPT-4 being slightly better in physical understanding and solution path, and Haiku being slightly better in red herring handling. Overall though, both models kind of suck. Dang.

There are also only… 5 questions.

If you’d like to see the full breadth of questions, they’re on my GitHub.

LLM-as-a-Judge

By the way, the method I used to generate the evaluations, LLM-as-a-judge, has gained significant traction in the AI community, particularly after the work of Zheng et al. in their 2023 paper “Judging LLM-as-a-Judge.” The technique has proven remarkably effective, achieving over 80% agreement with human evaluators in tasks ranging from code assessment to dialogue quality evaluation!

Here’s where my experiment gets kind of cool (arguably, maybe, subjectively) — I used this methodology and had GPT-4 judge other LLMs’ physical reasoning abilities. Yes, I’m using an AI to judge other AIs.

Why does this work? Well, judging a response is actually a simpler task than generating one. When GPT-4 generates a solution to a physical puzzle, it needs to:

  • Understand the physical principles involved
  • Plan a sequence of steps
  • Consider all constraints
  • Generate a coherent explanation

But when judging, it only needs to check if specific criteria are met in an existing solution. The evaluation prompt is very focused:

def _create_evaluation_prompt(self, criteria: JudgingCriteria) -> str:
    return f"""You are an expert judge evaluating an LLM's understanding of physical reasoning puzzles.
Evaluate based on three criteria:
2. Physical Understanding Score (0-2): Does the solution correctly apply relevant physical principles?
3. Solution Path Score (0-2): Are the steps complete and feasible?
4. Red Herring Handling (0-1): Does it avoid using irrelevant items?
Scenario: {criteria.scenario}
Physical Principles Required: {criteria.correct_principles}
Solution Given: {criteria.model_response}
"""

To validate this approach, I followed the validation framework suggested by Zheng et al., performing spot-checks of GPT-4’s evaluations against my own judgments. Surprisingly (or perhaps unsurprisingly, given the broader research on LLM evaluation), it was remarkably consistent in identifying both correct physical understanding and flawed reasoning.

Is this perfect? Absolutely not. There’s something philosophically weird about using one LLM to evaluate another. But in practice, it can work surprisingly well — just like how I moan and groan about the visual presentation of a dish on Masterchef, while setting my kitchen aflame trying to microwave a hot dog.

What I Learned

Building this benchmark taught me several things about benchmark design:

Clear Metrics Matter: Even for complex tasks like physical reasoning, you need unambiguous scoring criteria.

Red Herrings Are Powerful: Including irrelevant items reveals a lot about an LLM’s reasoning process.

Context Control is Hard: Ensuring LLMs don’t “hallucinate” additional physical context is challenging.

Is this a perfect benchmark? Not even close. Please don’t rub it in. Is it scientifically rigorous? Definitely not. But it’s been a fascinating exploration into an aspect of LLM capabilities, and sometimes the best we can learn can come from just trying things out and seeing what happens.

Now, if you’ll excuse me, I will be sneaking in a phone with an internet connection into my next escape room, for reasons that I am legally unmotivated to disclose.

[1] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), Datasets and Benchmarks Track (2023)

[2] T. Coignion, C. Quinton, R. Rouvoy, “A Performance Study of LLM-Generated Code on Leetcode,” In 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024), Salerno, Italy (2024)

[3] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems,” In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada (2019)

[5] DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y. Wu, Z.F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao et al., “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948 (2025)

[6] Unless otherwise stated, all images are created by the author.

Shape
Shape
Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy,  bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

Shape

Four new vulnerabilities found in Ingress NGINX

NGINX is a reverse proxy/load balancer that generally acts as the front-end web traffic receiver and directs it to the application service for data transformation. Ingress NGINX is a version used in Kubernetes as the controller for traffic coming into the infrastructure. It takes care of mapping traffic to pods

Read More »

WTI, Brent Gain as Talks Ease Conflict Fears

Oil edged marginally higher after a choppy session as investors assessed the status of nuclear talks between the US and Iran. West Texas Intermediate settled above $63 a barrel, with markets reacting sharply to headlines tied to the meeting. Iranian Foreign Minister Abbas Araghchi said the talks had a “good start,” even as the Wall Street Journal reported that Tehran stood by its refusal to end enrichment of nuclear fuel, a major sticking point for the US. The escalation in the Middle East, which provides about a third of the world’s crude, has added a risk premium to benchmark oil prices. Traders have weighed the geopolitical tensions against an outlook for oversupply. Still, futures in New York notched their first weekly retreat since mid-December as the US-Iran talks helped allay concerns over a broader conflict in the region. Prices also extended gains after data showed US consumer sentiment unexpectedly improved to the highest in six months, calming some concerns over an economic slowdown in the country that could lead to weaker oil demand. Meanwhile, in trilateral negotiations with the US, Ukraine and Russia agreed to exchange prisoners for the first time in five months as they sought to end their four-year conflict. Talks were making progress, with results expected “in the coming weeks,” President Donald Trump’s special envoy said. Saudi Arabia cut prices for buyers in Asia by less than expected, signaling confidence in demand for its barrels, although prices have still been reduced to the lowest levels since late 2020. Oil Prices WTI for March delivery settled 0.4% higher at $63.55 a barrel in New York. Brent for April settlement rose 0.7% to close at $68.05 a barrel. What do you think? We’d love to hear from you, join the conversation on the Rigzone Energy Network. The Rigzone Energy

Read More »

Saudis Cut Key Oil Price for Asian Buyers

Saudi Arabia cut the price of its main oil grade for buyers in Asia to the lowest in years, a further sign that global supplies are running ahead of demand. State oil producer Saudi Aramco will reduce the price of its Arab Light grade by 30 cents a barrel to parity with the regional benchmark for March, according to a price list seen by Bloomberg. That brings pricing for the kingdom’s most plentiful crude blend to the lowest level since late 2020. Still, Aramco’s cut was not as deep as buyers expected, coming in smaller than even the most modest estimate of a reduction in a survey of refiners and traders. That offers a sign that the kingdom has faith in demand for its barrels and Aramco’s Chief Executive Officer Amin Nasser has previously said that fears of a glut are overblown. Saudi Arabia’s monthly crude pricing is keenly watched by traders across the globe as it sets the tone for other sellers in the world’s top producing regions. Asia is the biggest market for Middle Eastern crude, with the prices set for refiners determining the profitability of processing and influencing the cost of fuels like gasoline and diesel the world over. Aramco also cut pricing for its Arab Medium and Arab Heavy crude grades to Asia to the lowest levels since mid 2020, while it increased prices for the Extra Light and Super Light blends. That split reflects that dynamic in the Middle East market where prices for the heavier and more sulfurous crudes that are most plentiful in the region have trailed those for the lighter blends. The OPEC+ producers group, led by Saudi Arabia and Russia, agreed to keep production levels steady during talks on Feb. 1, maintaining an earlier decision to forgo output increases to avoid

Read More »

Shell to Pause Kazakh Oil and Gas Investments

Shell Plc will pause investment in Kazakhstan as it navigates legal claims from the OPEC+ nation against oil majors that could stretch into the billions of dollars, Chief Executive Officer Wael Sawan said. Kazakhstan is pressing multiple western oil companies for compensation across a series of cases both in the Central Asian country’s courts and in international arbitration. This month, it emerged that Shell and partners lost a dispute that could see them pay as much as $4 billion. There is also ongoing litigation about sulfur breaches and project costs. “It does impact our appetite to invest further in Kazakhstan,” Sawan said Thursday during an earnings conference call with analysts. While the company sees plenty of investment opportunities in the future, “we will hold until we have a better line of sight to where things end up.” The setbacks in Kazakhstan come as Shell seeks to ensure future production growth with a healthy inventory of projects. Acquisitions have largely filled the company’s production gap through 2030, buying time to deal with the 2030-2035 period, Sawan said in an interview on Thursday. The Kazakh energy ministry didn’t reply to an emailed request for comment sent outside normal working hours. Sawan didn’t elaborate on whether the pause would apply to new or existing projects. Shell didn’t immediately respond to a request to clarify whether the CEO was talking about new or existing investments. The latest dispute was against the Karachaganak field joint venture led by Italy’s Eni SpA and Shell, over cost deductions. Other partners include Chevron Corp., Lukoil PJSC and KazMunayGas National Co. The venture may still appeal the decision.   Last year, the companies proposed settling the dispute by building a plant that would process natural gas from the field for domestic use. WHAT DO YOU THINK? Generated by readers,

Read More »

Tankers With Russian Oil Flock to East Asia

More than a dozen tankers loaded with Russian Urals oil are sailing toward Asia or idling along the route, a sign of producers racing to get cargoes closer to China as India pulls back from the trade.  These vessels — carrying a combined 10 million to 12 million barrels of oil — are spread across the Indian Ocean, and off the coasts of Malaysia, China and Russia. Five of them are indicating ‘for orders’ or ‘China for orders’ as their status, according to data intelligence firm Kpler, a category that usually means they don’t yet have a specific buyer or discharge port. Another six are signaling Singapore and Malaysia, and are likely heading to a popular spot for ship-to-ship transfers in the South China Sea where they can wait until the crude is bought. Four are floating off Malaysia, China and Russia’s Far East, without indicating a clear destination. Urals — Russia’s flagship crude grade, which is loaded from ports in the Baltic Sea — has become the variety of choice for Indian refiners since the invasion of Ukraine in early 2022 saw it become heavily discounted. But pressure from Washington has pushed imports lower, reaching an average of 1.2 million barrels a day in January compared with a peak of more than 2 million barrels a day in mid-2024. Indian imports of the crude could be trimmed further after President Donald Trump said on Monday the country would stop buying Russian oil as part of deal to cut trade tariffs. Prime Minister Narendra Modi confirmed the agreement but didn’t comment on oil. Some refiners are holding off purchases while they seek clarification from New Delhi.  The big question is where the surplus cargoes of Urals — the bulk of which have gone to India over the last few years — will now end up. China’s

Read More »

BP, KOC Sign ETSA Extension

In a statement sent to Rigzone on Thursday, BP announced that it and Kuwait Oil Company have signed an extension of the Enhanced Technical Services Agreement (ETSA) between the companies. The agreement “paves the way for both companies to collaboratively progress Kuwait’s most strategic asset fields”, BP noted in the statement. BP added that the deal enables it to “bring expertise in enhanced oil recovery to the Greater Burgan oil field and develop local capabilities with Kuwait Oil Company to manage the development of South and East Kuwait fields through 50 secondment opportunities of BP’s technical experts”. Rigzone asked BP to disclose the deal’s value. A BP spokesperson was unable to do so. The ETSA was originally signed in 2016 for a period of 10 years, the statement highlighted, adding that it will now extend through to March 2029. BP Executive Vice President, Gas & Low Carbon Energy, William Lin, noted in the statement, “BP’s commitment to Kuwait dates back to our participation in the discovery of the Greater Burgan oil field in the 1930s, and we appreciate the trust placed in our expertise in giant oil and gas fields to continue to help develop this important strategic asset”. “This is another example of the deep relationships we’ve formed across governments, partners, and supply chains in the regions where we operate. We look forward to continuing our strong collaboration with Kuwait and to working with KOC to help support the country’s long-term energy resilience,” he added. BP notes on its website that it was one of the founders of the original Kuwait Oil Company, which it highlighted first discovered oil at Burgan in 1938. “Exportation of KOC began in 1946, in which the first export of Kuwait crude was loaded on to the bp vessel ‘Fusilier’,” BP’s site adds. BP

Read More »

IPAA Promotes Naatz to Chief Policy Officer

In a statement sent to Rigzone on Thursday, the Independent Petroleum Association of America (IPAA) announced that Dan Naatz has been promoted to Executive Vice President and Chief Policy Officer. As Chief Policy Officer, Naatz will lead IPAA’s policy priorities and oversee all government relations and advocacy efforts, the organization said in the statement, adding that he will focus “on the issues most critical to independent producers across regulatory, legislative, and permitting environments”. “He will also continue to build consensus across IPAA’s diverse membership and strengthen partnerships with aligned organizations,” the IPAA noted. Naatz also serves as Corporate Secretary on the IPAA board of directors, the statement highlighted. Prior to joining the IPAA in 2003, Naatz spent 12 years on Capitol Hill working for the late Senator Craig Thomas in various capacities, the IPAA pointed out in its statement. “Dan has led IPAA’s advocacy efforts on Capitol Hill, securing meaningful wins for independent oil and natural gas producers on issues including methane regulation, federal leasing, and permitting reform,” the statement said. “His decades of leadership and judgment have strengthened IPAA’s voice in Washington at a critical time for IPAA members,” it added. A bio on Naatz hosted on the IPAA website states that, “for more than two decades, Dan has been the public face for independent oil and natural gas producers in Washington, representing the industry in congressional hearings on Capitol Hill, roundtables, coalition meetings, and with federal regulating agencies”. “IPAA actively engages with agencies including the Environmental Protection Agency (EPA), Department of Interior (DOI), Bureau of Land Management (BLM) in support of rules and timelines that are technically feasible and cost-effective, so producers can do what they do best: provide reliable and affordable energy for Americans,” the bio page adds. IPAA President and CEO Edith Naegele said in the IPAA statement, “IPAA

Read More »

Nvidia’s $100 Billion OpenAI Bet Shrinks and Signals a New Phase in the AI Infrastructure Cycle

One of the most eye-popping figures of the AI boom – a proposed $100 billion Nvidia commitment to OpenAI and as much as 10 gigawatts of compute for the company’s Stargate AI infrastructure buildout – is no longer on the table. And that partial retreat tells the data center industry something important. According to multiple reports surfacing at the end of January, Nvidia has paused and re-scoped its previously discussed, non-binding investment framework with OpenAI, shifting from an unprecedented capital-plus-infrastructure commitment to a much smaller (though still massive) equity investment. What was once framed as a potential $100 billion alignment is now being discussed in the $20-30 billion range, as part of OpenAI’s broader effort to raise as much as $100 billion at a valuation approaching $830 billion. For data center operators, infrastructure developers, and power providers, the recalibration matters less for the headline number and more for what it reveals about risk discipline, competitive dynamics, and the limits of vertical circularity in AI infrastructure finance. From Moonshot to Measured Capital The original September 2025 memorandum reportedly contemplated not just capital, but direct alignment on compute delivery: a structure that would have tightly coupled Nvidia’s balance sheet with OpenAI’s AI-factory roadmap. By late January, however, sources indicated Nvidia executives had grown uneasy with both the scale and the structure of the deal. Speaking in Taipei on January 31, Nvidia CEO Jensen Huang pushed back on reports of friction, calling them “nonsense” and confirming Nvidia would “absolutely” participate in OpenAI’s current fundraising round. But Huang was also explicit on what had changed: the investment would be “nothing like” $100 billion, even if it ultimately becomes the largest single investment Nvidia has ever made. That nuance matters. Nvidia is not walking away from OpenAI. But it is drawing a clearer boundary around

Read More »

Data Center Jobs: Engineering, Construction, Commissioning, Sales, Field Service and Facility Tech Jobs Available in Major Data Center Hotspots

Each month Data Center Frontier, in partnership with Pkaza, posts some of the hottest data center career opportunities in the market. Here’s a look at some of the latest data center jobs posted on the Data Center Frontier jobs board, powered by Pkaza Critical Facilities Recruiting. Looking for Data Center Candidates? Check out Pkaza’s Active Candidate / Featured Candidate Hotlist Onsite Engineer – Critical FacilitiesCharleston, SC This is NOT a traveling position. Having degreed engineers seems to be all the rage these days. I can also use this type of candidate in following cities: Ashburn, VA; Moncks Corner, SC; Binghamton, NY; Dallas, TX or Indianapolis, IN. Our client is an engineering design and commissioning company that is a subject matter expert in the data center space. This role will be onsite at a customer’s data center. They will provide onsite design coordination and construction administration, consulting and management support for the data center / mission critical facilities space with the mindset to provide reliability, energy efficiency, sustainable design and LEED expertise when providing these consulting services for enterprise, colocation and hyperscale companies. This career-growth minded opportunity offers exciting projects with leading-edge technology and innovation as well as competitive salaries and benefits. Electrical Commissioning Engineer Ashburn, VA This traveling position is also available in: New York, NY; White Plains, NY;  Richmond, VA; Montvale, NJ; Charlotte, NC; Atlanta, GA; Hampton, GA; New Albany, OH; Cedar Rapids, IA; Phoenix, AZ; Salt Lake City, UT; Dallas, TX; Kansas City, MO; Omaha, NE; Chesterton, IN or Chicago, IL. *** ALSO looking for a LEAD EE and ME CxA Agents and CxA PMs *** Our client is an engineering design and commissioning company that has a national footprint and specializes in MEP critical facilities design. They provide design, commissioning, consulting and management expertise in the critical facilities space. They

Read More »

Operationalizing AI at Scale: Google Cloud on Data Infrastructure, Search, and Enterprise AI

The AI conversation has been dominated by model announcements, benchmark races, and the rapid evolution of large language models. But in enterprise environments, the harder problem isn’t building smarter models. It’s making them work reliably with real-world data. On the latest episode of the Data Center Frontier Show Podcast, Sailesh Krishnamurthy, VP of Engineering for Databases at Google Cloud, pulled back the curtain on the infrastructure layer where many ambitious AI initiatives succeed, or quietly fail. Krishnamurthy operates at the intersection of databases, search, and AI systems. His perspective underscores a growing reality across enterprise IT: AI success increasingly depends on how organizations manage, integrate, and govern data across operational systems, not just how powerful their models are. The Disconnect Between LLMs and Reality Enterprises today face a fundamental challenge: connecting LLMs to real-time operational data. Search systems handle documents and unstructured information well. Operational databases manage transactions, customer data, and financial records with precision. But combining the two remains difficult. Krishnamurthy described the problem as universal. “Inside enterprises, knowledge workers are often searching documents while separately querying operational systems,” he said. “But combining unstructured information with operational database data is still hard to do.” Externally, customers encounter the opposite issue. Portals expose personal data but struggle to incorporate broader contextual information. “You get a narrow view of your own data,” he explained, “but combining that with unstructured information that might answer your real question is still challenging.” The result: AI systems often operate with incomplete context. Vector Search Moves Into the Database Vector search has emerged as a bridge between structured and unstructured worlds. But its evolution over the past three years has changed how enterprises deploy it. Early use cases focused on semantic search, i.e. finding meaning rather than exact keyword matches. Bug tracking systems, for example, began

Read More »

Transmission at the Breaking Point: Why the Grid Is Becoming the Defining Constraint for AI Data Centers

Regions in a Position to Scale California (A- overall)California continues to lead in long-term, scenario-based transmission planning. CAISO’s most recent transmission plan identifies $4.8 billion in new projects to accommodate approximately 76 gigawatts of additional capacity by 2039, explicitly accounting for data center growth alongside broader electrification. For data center developers, California’s challenge is less about planning quality and more about execution. Permitting timelines, cost allocation debates, and political scrutiny remain significant hurdles. Plains / Southwest Power Pool (B- overall, A in regional planning)SPP stands out nationally for embracing ultra-high-voltage transmission as a backbone strategy. Its recent Integrated Transmission Plans approve more than $16 billion in new projects, including multiple 765-kV lines, with benefit-cost ratios exceeding 10:1. This approach positions the Plains region as one of the most structurally “AI-ready” grids in North America, particularly for multi-gigawatt campuses supported by wind, natural gas, and emerging nuclear resources. Midwest / MISO (B overall)MISO’s Long-Range Transmission Planning framework aligns closely with federal best practices, co-optimizing generation and transmission over long planning horizons. While challenges remain—particularly around interregional coordination—the Midwest is comparatively well positioned for sustained data center growth. Regions Facing Heightened Risk Texas / ERCOT (D- overall)Texas has approved massive new transmission investments, including 765-kV projects tied to explosive load growth in the Permian Basin. However, the report criticizes ERCOT’s planning for remaining largely siloed and reliability-driven, with limited long-term scenario analysis and narrow benefit assessments. For data centers, ERCOT still offers speed to market, but increasingly with risks tied to congestion, price volatility, and political backlash surrounding grid reliability. Southeast (F overall)The Southeast receives failing grades across all categories, with transmission development remaining fragmented, utility-driven, and largely disconnected from durable regional planning frameworks. As AI data centers increasingly target the region for its land availability and tax incentives, the lack of

Read More »

From Row-Level CDUs to Facility-Scale Cooling: DCX Ramps Liquid Cooling for the AI Factory Era

Enter the 8MW CDU Era The next evolution arrived just days later. On Jan. 20, DCX announced its second-generation facility-scale unit, the FDU V2AT2, pushing capacity into territory previously unimaginable for single CDU platforms. The system delivers up to 8.15 megawatts of heat transfer capacity with record flow rates designed to support 45°C warm-water cooling, aligning directly with NVIDIA’s roadmap for rack-scale AI systems, including Vera Rubin-class deployments. That temperature target is significant. Warm-water cooling at this level allows many facilities to eliminate traditional chillers for heat rejection, depending on climate and deployment design. Instead of relying on compressor-driven refrigeration, operators can shift toward dry coolers or other simplified heat rejection strategies. The result: • Reduced mechanical complexity• Lower energy consumption• Improved efficiency at scale• New opportunities for heat reuse According to DCX CTO Maciek Szadkowski, the goal is to avoid obsolescence in a single hardware generation: “As the datacenter industry transitions to AI factories, operators need cooling systems that won’t be obsolete in one platform cycle. The FDU V2AT2 replaces multiple legacy CDUs and enables 45°C supply water operation while simplifying cooling topology and significantly reducing both CAPEX and OPEX.” The unit incorporates a high-capacity heat exchanger with a 2°C approach temperature, N+1 redundant pump configuration, integrated water quality control, and diagnostics systems designed for predictive maintenance. In short, this is infrastructure built not for incremental density growth, but for hyperscale AI facilities where megawatts of cooling must scale as predictably as compute capacity. Liquid Cooling Becomes System Architecture The broader industry implication is clear: cooling is no longer an auxiliary mechanical function. It is becoming system architecture. DCX’s broader 2025 performance metrics underscore the speed of this transition. The company reported 600% revenue growth, expanded its workforce fourfold, and shipped or secured contracts covering more than 500 MW

Read More »

AI Infrastructure Scales Out and Up: Edge Expansion Meets the Gigawatt Campus Era

The AI infrastructure boom is often framed around massive hyperscale campuses racing to secure gigawatts of power. But an equally important shift is happening in parallel: AI infrastructure is also becoming more distributed, modular, and sovereign, extending compute far beyond traditional data center hubs. A wave of recent announcements across developers, infrastructure investors, and regional operators shows the market pursuing a dual strategy. On one end, developers are accelerating delivery of hyperscale campuses measured in hundreds of megawatts, and increasingly gigawatts, often located where power availability and energy economics offer structural advantage, and in some cases pairing compute directly with dedicated generation. On the other, providers are building increasingly capable regional and edge facilities designed to bring AI compute closer to users, industrial operations, and national jurisdictions. Taken together, these moves point toward a future in which AI infrastructure is no longer purely centralized, but built around interconnected hub-and-spoke architectures combining energy-advantaged hyperscale cores with rapidly deployable edge capacity. Recent developments across hyperscale developers, edge specialists, infrastructure investors, and regional operators illustrate how quickly this model is taking shape. Sovereign AI Moves Beyond the Core On Feb. 5, 2026, San Francisco-based Armada and European AI infrastructure builder Nscale signed a letter of intent to jointly deploy both large-scale and edge AI infrastructure worldwide. The collaboration targets enterprise and public sector customers seeking sovereign, secure, geographically distributed AI environments. Nscale is building large AI supercomputer clusters globally, offering vertically integrated capabilities spanning power, data centers, compute, and software. Armada specializes in modular deployments through its Galleon data centers and Armada Edge Platform, delivering compute and storage into remote or infrastructure-poor environments. The combined offering addresses a growing challenge: many governments and enterprises want AI capability deployed within their own jurisdictions, even where traditional hyperscale infrastructure does not yet exist. “There is

Read More »

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one that is ramping up its investments into AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs).  In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple would between them devote $200 billion to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are way higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

Read More »

John Deere unveils more autonomous farm machines to address skill labor shortage

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. The Moline, Illinois-based John Deere has been in business for 187 years, yet it’s been a regular as a non-tech company showing off technology at the big tech trade show in Las Vegas and is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech. The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually; and the agricultural work force continues to shrink. (This is my hint to the anti-immigration crowd). John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app. While each of these industries experiences their own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% percent of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

Read More »

2025 playbook for enterprise AI success, from agents to evals

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More 2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year. 1. Agents: the next generation of automation AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for companies, and that recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to

Read More »

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more. The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks. Going all-in on red teaming pays practical, competitive dividends It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S.’s National Institute of Standards and Technology (NIST), which all had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find. What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle

Read More »