I Tried Making my Own (Bad) LLM Benchmark to Cheat in Escape Rooms

Recently, DeepSeek announced their latest model, R1, and article after article came out praising its performance relative to cost, and how the release of such open-source models could genuinely change the course of LLMs forever. That is really exciting! And also, too big of a scope to write about… but when a model like DeepSeek comes out of nowhere with a steel chair, boasting similar performance levels to other models, what does performance really mean in this context?

If you follow AI releases, you’ve seen this dance before. Every new model drops with graphs showing how it’s somehow simultaneously better than GPT-4 on math problems while being smaller and more efficient. But what exactly are these benchmarks measuring? How are they created? And more importantly, how can we cut through the hype to create our own benchmarks for specific use cases?

I wanted to learn more about LLM Benchmarking.

Part 1: What is a Benchmark? (in 3 seconds)

TL;DR — The SATs (multiple, actually) for LLMs.

Part 1.1: What is a Benchmark? (in more than 3 seconds)

Before we dive into the nitty-gritty of specific benchmarks, let’s take a moment to unpack what we even mean by “LLM Benchmark.” Because calling them the “SATs for AI” feels both right and also slightly oversimplified.

LLM benchmarks are, at their core, structured tests used to measure how well large language models perform on certain tasks. These tasks can be anything from identifying if a statement is true or false, to summarizing a legal document, to generating valid Python functions. Think of them as curated obstacle courses specially designed by AI researchers to test every relevant muscle these models might have. These frameworks typically provide a dataset of inputs with known correct outputs, allowing for consistent comparison between models.

Modern benchmarks employ various evaluation methodologies. Classification metrics like accuracy work for tasks with discrete correct answers, while overlap-based metrics (BLEU, ROUGE) evaluate free-form text generation. Some benchmarks use functional testing for code generation, or employ other LLMs as judges to evaluate response quality.
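To make that concrete, here is a minimal sketch of two of those metrics: exact-match accuracy and a hand-rolled ROUGE-1 F1. It is written from scratch for illustration; real benchmarks use vetted metric implementations.

from collections import Counter

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answer."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated text and a reference text."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(accuracy(["4", "Paris"], ["4", "London"]))  # 0.5
print(round(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"), 2))  # 0.83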

A typical benchmark usually comes packaged as:

  • A standardized dataset of questions, prompts, or tasks (with correct or reference answers).
  • An evaluation protocol specifying how to measure success, like accuracy, F1 score, BLEU/ROUGE for text generation, or pass/fail rates for coding tasks.
  • A leaderboard or some form of comparative scoreboard, often with big flashy graphs.

Some really famous benchmarks include MMLU for testing multitask language understanding, TruthfulQA for assessing factual accuracy, and HumanEval for measuring coding capabilities. Results are often published on public leaderboards, which lets people make transparent comparisons between different models.

From the DeepSeek paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

What Makes a Good Benchmark?

  1. A Clear Task Definition: We want tasks that are unambiguous. The more straightforward and well-specified the challenge, the easier it is to trust the results.
  2. Data Integrity: The test set shouldn’t be floating around in the training data. Because if the model’s seen the exact same question 50 times before, the evaluation is about as useful as giving a math quiz to someone who already has the answer key.
  3. Quantifiable Metrics: You need a standard for scoring performance — like how many times the model’s code passes test cases or how close the generated summary is to a “ground-truth” summary.
  4. Task Diversity & Difficulty: If a benchmark is too easy, everyone just ACES it on day one, and we learn… well, nothing. If it’s too niche (like “We test only the model’s ability to count the digits of Pi for 20 minutes”), that’s also not so helpful.

Life Ain’t All about The Grades

Benchmarks capture only a slice of what LLMs can do. In the real world, your chatbot might need to juggle domain knowledge, keep track of conversation context, abide by your company’s policies, and produce fluent, non-offensive replies. No single standardized test out there fully covers that. As we’ll see in the upcoming case studies, the design and execution of a benchmark can heavily shape the picture you get of your model’s performance… and sometimes lead you astray if you’re not careful with how you measure success.

Now that we have a sense of what LLM benchmarks are designed to accomplish (and where they might fall short), let’s explore a couple of examples to see how people actually build and use them in practice — with mixed results!

Case Study #1: Leetcode as an LLM Benchmark

As a student in the tech space, the word “Leetcode” popping up during my search for cool benchmarks raised my blood pressure by a statistically significant amount. Unlike Leetcode, which sucks, the paper “A Performance Study of LLM-Generated Code on Leetcode” was very interesting — it asks a deceptively simple question: can we use Leetcode to benchmark LLM code generation? Their findings reveal both the promise and pitfalls of this approach.

The Benchmark Design

The researchers built a three-stage validation system. Local tests catch basic errors, Leetcode’s judge verifies correctness, and a custom benchmarking setup measures performance. This setup revealed something critical: benchmarking code performance is harder than it looks.

When they compared local measurements to Leetcode’s metrics, they found only a 0.28 correlation. Leetcode’s measurements showed much higher variation (0.089 vs 0.035 locally). Even worse, Leetcode’s rankings proved unstable — identical solutions could drop from the 77th to 54th percentile just based on submission timing.

From “A Performance Study of LLM-Generated Code on Leetcode,” 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024), Salerno, Italy (2024)
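Checks like these boil down to correlation statistics. If you wanted to run the same sanity check on your own benchmarking setup, it is only a few lines; the runtimes below are made up, purely to show the shape of the check.

import numpy as np

# Hypothetical runtimes (ms) for the same five solutions, measured twice
local_ms = np.array([12.1, 45.3, 8.7, 30.2, 22.5])      # local machine
platform_ms = np.array([48.0, 61.0, 95.0, 55.0, 70.0])  # online judge

r = np.corrcoef(local_ms, platform_ms)[0, 1]
print(f"Pearson r = {r:.2f}")  # a weak r, like the paper's 0.28, is a red flag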

The Real Problems

Three major issues emerged that challenge Leetcode’s viability as a benchmark:

Data Contamination: Using public problems risks LLMs having seen the solutions during training. The researchers had to use only problems from 2023 to mitigate this.

Platform Instability: Leetcode’s metrics drift over time — memory measurements showed a -0.24 correlation with test date. This makes reproducible benchmarking nearly impossible.

Measurement Reliability: The weak correlation between local and platform measurements raises questions about what we’re actually testing.

What It Means for LLM Benchmarking

This study doesn’t just critique Leetcode — it highlights what we need in a code generation benchmark: reproducible measurements, reliable performance metrics, and guaranteed training-test separation. Until we have platforms built specifically for this purpose, we need to be extremely cautious about using competition platforms as benchmarks.

So! We know that not all benchmarks are viable benchmarks — what about a more mainstream one?

Case Study #2: SuperGLUE — Building a Better Language Understanding Benchmark

The SuperGLUE paper tackles a fascinating problem in AI benchmarking: what do you do when models get too good at your tests? When GLUE became insufficient (with models surpassing human performance), the researchers had to rethink how we measure language understanding.

The Benchmark Design

SuperGLUE’s core innovation is its task selection methodology. The researchers collected task proposals from the NLP community and filtered them through a rigorous process: each task needed clear evaluation metrics, public training data, and — most importantly — significant headroom between machine and human performance.

This resulted in eight tasks (I’ve simplified the table from the paper here; it’s a little less readable, but you should get the sense of what the questions are asking):

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada (2019)

What makes these tasks special is their diversity in format. Unlike GLUE’s focus on sentence classification, SuperGLUE includes coreference resolution, reading comprehension, and more complex reasoning tasks. Each task measures different aspects of language understanding while maintaining clear, quantifiable metrics.


Part 2: Let’s Build a Physical Reasoning Benchmark: To Cheat at Escape Rooms

After looking at some benchmarks like SuperGLUE and Leetcode, I had an idea: what if we tested LLMs on something completely different — physical reasoning… through escape room puzzles?

It’s a pretty valid idea — escape rooms pose real possibilities and consequences for failure: screw up one too many puzzles, and your friends will think you’re pretty stupid and relegate you to spectator duty. Luckily for us, however, they (or the poor employees) don’t know that you can sneak a phone into an escape room — and you know just who to ask for the answers. Today, LLMs face off against the puzzles of a physical escape room.

Note: This is NOT a rigorous academic benchmark (please don’t cite this in papers, why would you even want to do that?), or even close to it, and it’s just supposed to be a fun way to test LLM benchmarking and evaluation. Please do not destroy my prompts, I am aware they are bad.

Why Physical Reasoning?

For real, though… most LLM benchmarks focus on linguistic tasks (like SuperGLUE) or code generation (like Leetcode). And for good reason — these are well-defined domains with clear evaluation metrics. But real-world problem solving often requires understanding physical principles and their interactions. The famous “Can GPT-4 do physics?” debates usually center around mathematical problem-solving, not practical physical reasoning.

Looking at existing benchmarks taught me a few key principles:

  1. Clear evaluation metrics are crucial (from SuperGLUE’s task-specific scores)
  2. Problems should have unambiguous solutions (from HumanEval’s test cases)
  3. The benchmark should test distinct capabilities (from MMLU’s subject categories)

Designing the Problems

I settled on escape room puzzles for two reasons. First, they naturally combine physical reasoning with clear goals. Second, they have unambiguous success conditions — either you solve it through the intended way, or you don’t. Third, and most importantly, they let me include “red herrings” — irrelevant items that test if the LLM can identify what matters physically. Fourth, I just really like doing escape rooms (did I mention that already?).

I am aware that this is more than two reasons, but if LLMs can’t count how many r’s there are in “strawberry,” I’m allowed to mess up once in a while too.

Here’s how I structured the five core problems:

Fluid Dynamics (FLUID_001) (Ping pong ball stuck in a tube)

  • Tests understanding of buoyancy and fluid displacement
  • Inspired by classic physics problems but in practical context
  • Includes intentionally irrelevant items (like squishy food models)

Light Properties (UV_001) (UV light on a push number lock)

  • Tests understanding of UV fluorescence and material properties
  • Combines multiple physical principles (light, material science)
  • Requires understanding of environmental conditions

Mechanical Understanding (CIPHER_001) (A cipher ring)

  • Tests spatial reasoning and mechanical alignment
  • No red herrings — tests for correlating a dial to a cipher wheel
  • Requires understanding rotational symmetry

Force Application (VAC_001) (Can stuck in hole)

  • Tests understanding of vacuum forces and surface adhesion
  • Multiple possible solution approaches
  • Requires understanding force multiplication

Collaborative Physics (COLLAB_001) (Can two people shimmy a key?)

  • Tests understanding of physical constraints in multi-agent scenarios
  • Requires combining multiple physical principles
  • Tests understanding of tool creation and friction

Sounds really fancy… but it’s just some basic physical puzzles. You can access them on my GitHub.

The Technical Part

The benchmark implementation has three main components:

1. Problem Definition Layer

Problems are defined in a structured JSON format that enforces consistent evaluation:

{
    "problem_id": "FLUID_001",
    "setup": {
        "scenario": "A ping pong ball is at the bottom of a narrow tube...",
        "available_items": ["bottle of water", "squishy food models"...],
        "constraints": ["tube too narrow for manual retrieval"]
    },
    "physical_principles": ["buoyancy", "fluid displacement"],
    "red_herrings": ["squishy food models", "milk carton"],
    "solution": {
        "steps": ["pour water into tube", "allow ball to float"],
        "key_insights": ["water displaces air", "ping pong ball less dense"]
    }
}

This structure draws from SuperGLUE’s design — each component is clearly separated and machine-readable. The physical_principles field explicitly lists what’s being tested, while red_herrings helps in scoring the LLM’s ability to ignore irrelevant information.
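Loading and sanity-checking a problem file is then straightforward. Here is a minimal sketch; the field names match the JSON above, while the file path is hypothetical.

import json

REQUIRED_FIELDS = {"problem_id", "setup", "physical_principles",
                   "red_herrings", "solution"}

def load_problem(path: str) -> dict:
    """Load one problem definition and check its structure."""
    with open(path) as f:
        problem = json.load(f)
    missing = REQUIRED_FIELDS - problem.keys()
    if missing:
        raise ValueError(f"{path} is missing fields: {missing}")
    # A red herring only works if it actually appears among the items
    stray = [h for h in problem["red_herrings"]
             if h not in problem["setup"]["available_items"]]
    if stray:
        raise ValueError(f"red herrings not in available_items: {stray}")
    return problem

problem = load_problem("problems/fluid_001.json")  # hypothetical path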

2. Evaluation Framework

The evaluation system uses Python’s asyncio for concurrent testing, with retry logic for a little bit more API stability:

from typing import Dict
import aiohttp
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
async def evaluate_response(self, criteria: JudgingCriteria) -> Dict:
    """Evaluate a model's response using GPT-4 as judge."""
    async with aiohttp.ClientSession() as session:
        # ... evaluation logic

The scoring system looks at three components:

Physical Understanding Score (PUS) ∈ [0,2]

  • Measures understanding of relevant physical principles
  • Calculated as normalized sum of demonstrated principles

Solution Path Score (SPS) ∈ [0,2]

  • Evaluates completeness and correctness of solution steps
  • Considers practical feasibility of proposed solutions

Red Herring Handling (RHH) ∈ {0,1}

  • A binary score for avoiding irrelevant items
  • Tests ability to focus on physically relevant factors
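Combined, scoring one response might look like the sketch below. The normalization to a single [0, 1] total is my assumption for illustration, not a standard.

from dataclasses import dataclass

@dataclass
class ProblemScore:
    physical_understanding: float  # PUS in [0, 2]
    solution_path: float           # SPS in [0, 2]
    red_herring_handling: int      # RHH in {0, 1}

    def total(self) -> float:
        """Normalize the three components to a single [0, 1] score."""
        raw = (self.physical_understanding + self.solution_path
               + self.red_herring_handling)
        return raw / 5.0  # maximum possible is 2 + 2 + 1

score = ProblemScore(physical_understanding=1.5, solution_path=2.0,
                     red_herring_handling=0)
print(f"{score.total():.2f}")  # 0.70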

And yes, there are so many other scoring methods, better and worse, that could be used! For example, RHH could count how many irrelevant items are used in the solution, or measure how viable each use is… the point is that picking these metrics is often pretty arbitrary, but it is very, very important to your benchmark’s credibility — which mine very much lacks.

Additionally, I did not want to rewrite any code afterwards. Sue me.

3. Model Interface Layer

The benchmark supports multiple LLM backends through a common interface:

class ModelInterface:
    """Interface for different LLM APIs."""
    async def generate_response(self, prompt: str) -> str:
        raise NotImplementedError

class GPT4Interface(ModelInterface):
    async def generate_response(self, prompt: str) -> str:
        ...  # GPT-4-specific implementation

class ClaudeInterface(ModelInterface):
    async def generate_response(self, prompt: str) -> str:
        ...  # Claude-specific implementation

Two models… I can’t really afford any more, please understand.
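For reference, a concrete backend might look like this sketch, assuming the current openai Python SDK; the model name and client setup are illustrative, not my exact code.

from openai import AsyncOpenAI

class GPT4Interface(ModelInterface):
    """One possible backend, subclassing the interface above."""
    def __init__(self, api_key: str):
        self.client = AsyncOpenAI(api_key=api_key)

    async def generate_response(self, prompt: str) -> str:
        response = await self.client.chat.completions.create(
            model="gpt-4",  # illustrative model name
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content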

Let’s Test It!

So after some refilling of API balances, I was ready to go. I ran the benchmark, and after totally not wasting credits on like, 5 different errors, the results were in. And they were… interesting! I used GPT-4 and Claude 3.5 Haiku (I am aware these are not similarly spec’d models) as our victims.

The Challenge: A can is stuck in a hole that almost exactly matches its volume. A combination is written on the bottom of the can that needs to be retrieved.

The inspiration

"VAC_001": {
    "problem_id": "VAC_001",
    "setup": {
        "available_items": [
            "ruler",
            "notebook",
            "suction cup",
            "UV light"
        ],
        "constraints": [
            "can fits too tightly for finger grip",
            "hole is same diameter as can",
            "can is clean and dry"
        ]
    },
    "physical_principles": [
        "vacuum suction",
        "surface adhesion",
        "force multiplication"
    ],
    "red_herrings": [
        "ruler",
        "notebook",
        "UV light"
    ],
    "solution": {
        "steps": [
            "inspect suction cup and can surface",
            "ensure can surface is clean",
            "firmly press suction cup onto can top",
            "create vacuum seal",
            "pull can out using suction cup"
        ],
        "key_insights": [
            "suction cup creates vacuum seal",
            "vacuum force allows extraction",
            "other items don't provide necessary force"
        ]
    }
}

GPT-4’s Solution

Physical Principles: The main physical principle involved in this solution is the concept of air pressure differential, which is the same principle that allows a suction cup to stick to a surface. Air pressure is higher on one side of the suction cup (outside) compared to the other (inside), and this difference in pressure causes the cup to stick to the surface.

Solution Steps:

  1. Attach the suction cup to the top of the can. The flat surface of the can should provide a good contact for the suction cup to adhere to.
  2. Once the suction cup is firmly attached, gently pull upwards. The air pressure differential should cause the suction cup (and thus the can) to be lifted out of the hole.
  3. Once the can is out of the hole, remove the suction cup from the top of the can.
  4. Use the UV light to illuminate the combination written on the bottom of the can. Some inks only become visible under UV light, which is why this step is necessary.
  5. Write down the combination on the notebook to ensure it’s not forgotten.

Required Items:

  • Suction cup: This is required to adhere to the can and pull it out of the hole, as the can fits too tightly for finger grip.
  • UV light: This is required to reveal the combination written on the bottom of the can, as it may be written in ink that only becomes visible under UV light.
  • Notebook: This is required to record the combination once it’s visible, ensuring it’s not forgotten.

Not bad at all! GPT-4 manages to figure out the physical principle behind the solution (even if it is very wordy), and manages to describe, step by step, how to solve the problem. It does end up using the UV light and notebook, but that’s more of a problem on my end for not punishing the use of unnecessary items in the prompt strongly enough. Also, the uses for these items seem decently reasonable… so why not!

Claude 3.5 Haiku’s Solution

Physical Principles: The key physical principles involved in this scenario are the concept of volume and the use of suction to extract the can from the hole. Since the can fits almost exactly into the hole, it is likely that the can is stuck due to the tight fit and the lack of any air gaps or space between the can and the hole. This tight fit creates a vacuum-like effect, making it difficult to grip and pull the can out.

Solution Steps:

  1. Measure the diameter of the can using the ruler. This will help determine the size of the hole and confirm that the can fits tightly within it.
  2. Inspect the bottom of the can to locate the combination written on it. Use the UV light to illuminate the bottom of the can and make the combination visible, if it is not already clearly visible.
  3. Attach the suction cup to the top of the can. The suction cup will create a temporary seal on the can’s surface, allowing you to grip and pull it out of the hole.
  4. Gently pull the can upwards using the suction cup. The suction force should be strong enough to overcome the vacuum-like effect and extract the can from the hole.
  5. Once the can is out, record the combination written on the bottom of the can in the notebook.

Required Items:

  • Ruler: To measure the diameter of the can and confirm the tight fit in the hole.
  • UV light: To illuminate the bottom of the can and make the combination visible, if it is not already clearly visible.
  • Suction cup: To create a temporary seal on the can’s surface and provide the necessary grip to pull the can out of the hole.
  • Notebook: To record the combination written on the bottom of the can.

On the other hand, Haiku’s solution was… certainly a solution of all time.

First, it wants to measure the diameter of the can using the ruler??? To determine the size of the hole and confirm that the can fits tightly within it? Why would we need to do this? And do we need a ruler for that?

Second, it tells us to inspect the bottom of the can to locate the combination, when the entire problem is about not being able to pull the can out of the hole conventionally. This might just be an issue of order, but now I truly understand my friends’ feelings whenever I would tell them “just fix it man” to their numerous problems.

But it eventually does get the solution. So… not the worst.

Here’s a fancy radar graph of the results!

We see that both models are pretty similar in their capabilities, with GPT-4 being slightly better in physical understanding and solution path, and Haiku being slightly better in red herring handling. Overall though, both models kind of suck. Dang.
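If you want to draw one of these yourself, matplotlib can do it with a polar plot. A sketch with placeholder scores (not my actual numbers):

import numpy as np
import matplotlib.pyplot as plt

labels = ["Physical Understanding", "Solution Path", "Red Herrings"]
gpt4 = [1.6, 1.5, 0.4]   # placeholder values for illustration
haiku = [1.4, 1.3, 0.6]  # placeholder values for illustration

angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for scores, name in [(gpt4, "GPT-4"), (haiku, "Claude 3.5 Haiku")]:
    values = scores + scores[:1]
    ax.plot(angles, values, label=name)
    ax.fill(angles, values, alpha=0.15)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.legend(loc="lower right")
plt.show()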

There are also only… 5 questions.

If you’d like to see the full breadth of questions, they’re on my GitHub.

LLM-as-a-Judge

By the way, the method I used to generate the evaluations, LLM-as-a-judge, has gained significant traction in the AI community, particularly after the work of Zheng et al. in their 2023 paper “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” [1]. The technique has proven remarkably effective, achieving over 80% agreement with human evaluators in tasks ranging from code assessment to dialogue quality evaluation!

Here’s where my experiment gets kind of cool (arguably, maybe, subjectively) — I used this methodology and had GPT-4 judge other LLMs’ physical reasoning abilities. Yes, I’m using an AI to judge other AIs.

Why does this work? Well, judging a response is actually a simpler task than generating one. When GPT-4 generates a solution to a physical puzzle, it needs to:

  • Understand the physical principles involved
  • Plan a sequence of steps
  • Consider all constraints
  • Generate a coherent explanation

But when judging, it only needs to check if specific criteria are met in an existing solution. The evaluation prompt is very focused:

def _create_evaluation_prompt(self, criteria: JudgingCriteria) -> str:
    return f"""You are an expert judge evaluating an LLM's understanding of physical reasoning puzzles.
Evaluate based on three criteria:
1. Physical Understanding Score (0-2): Does the solution correctly apply relevant physical principles?
2. Solution Path Score (0-2): Are the steps complete and feasible?
3. Red Herring Handling (0-1): Does it avoid using irrelevant items?
Scenario: {criteria.scenario}
Physical Principles Required: {criteria.correct_principles}
Solution Given: {criteria.model_response}
"""

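Getting scores back out of the judge requires some parsing. My actual handling is messier, but one clean way to do it, shown here as a sketch rather than the exact code from the repo, is to ask the judge for JSON and parse defensively:

import json

FORMAT_SUFFIX = '\nRespond with JSON only, e.g. {"pus": 2, "sps": 1, "rhh": 0}.'

def parse_judge_response(raw: str) -> dict:
    """Clamp scores to their valid ranges; zero out unparseable replies."""
    try:
        scores = json.loads(raw)
        return {
            "pus": min(max(scores["pus"], 0), 2),
            "sps": min(max(scores["sps"], 0), 2),
            "rhh": 1 if scores["rhh"] else 0,
        }
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"pus": 0, "sps": 0, "rhh": 0}  # judge went off-script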
To validate this approach, I followed the validation framework suggested by Zheng et al., performing spot-checks of GPT-4’s evaluations against my own judgments. Surprisingly (or perhaps unsurprisingly, given the broader research on LLM evaluation), it was remarkably consistent in identifying both correct physical understanding and flawed reasoning.

Is this perfect? Absolutely not. There’s something philosophically weird about using one LLM to evaluate another. But in practice, it can work surprisingly well — just like how I moan and groan about the visual presentation of a dish on Masterchef, while setting my kitchen aflame trying to microwave a hot dog.

What I Learned

Building this benchmark taught me several things about benchmark design:

Clear Metrics Matter: Even for complex tasks like physical reasoning, you need unambiguous scoring criteria.

Red Herrings Are Powerful: Including irrelevant items reveals a lot about an LLM’s reasoning process.

Context Control is Hard: Ensuring LLMs don’t “hallucinate” additional physical context is challenging.

Is this a perfect benchmark? Not even close. Please don’t rub it in. Is it scientifically rigorous? Definitely not. But it’s been a fascinating exploration into an aspect of LLM capabilities, and sometimes the best lessons come from just trying things out and seeing what happens.

Now, if you’ll excuse me, I will be sneaking in a phone with an internet connection into my next escape room, for reasons that I am legally unmotivated to disclose.

[1] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), Datasets and Benchmarks Track (2023)

[2] T. Coignion, C. Quinton, R. Rouvoy, “A Performance Study of LLM-Generated Code on Leetcode,” In 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024), Salerno, Italy (2024)

[3] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems,” In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada (2019)

[4] DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y. Wu, Z.F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao et al., “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948 (2025)

[5] Unless otherwise stated, all images are created by the author.
