Avoidable and Unavoidable Randomness in GPT-4o


Of course there is randomness in GPT-4o’s outputs. After all, the model samples from a probability distribution when choosing each token. But what I didn’t understand was that those very probabilities themselves are not deterministic. Even with consistent prompts, fixed seeds, and temperature set to zero, GPT-4o still introduces subtle, frustrating randomness.

To be clear up front about where this article is headed: there’s no fix for this, and it might not even be something OpenAI could fix if they wanted to. Along the way, we’ll examine all the sources of randomness in GPT-4o output, which will require us to break down the sampling process to a low level. We’ll pinpoint the issue (the probabilities themselves vary) and critically examine OpenAI’s official guidance on determinism.

First, though, let’s talk about why determinism matters. Determinism means that the same input always produces the same output, like a mathematical function. While LLM creativity is often desirable, determinism serves crucial purposes: researchers need it for reproducible experiments, developers for verifying reported results, and prompt engineers for debugging their changes. Without it, you’re left wondering if different outputs stem from your tweaks or just the random number generator’s mood swings.

Flipping a coin

We’re going to keep things extremely simple here and prompt the most recent version of GPT-4o (gpt-4o-2024-08-06 in the API) with this:

Flip a coin. Return Heads or Tails only.

Flipping a coin with LLMs is a fascinating topic in itself (see for example Van Koevering & Kleinberg, 2024 in the references), but here, we’ll use it as a simple binary question with which to explore determinism, or the lack thereof.

This is our first attempt.

import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

response = client.chat.completions.create(
    model='gpt-4o-2024-08-06',
    messages=[{'role': 'user', 'content': prompt}],
)

print(response.choices[0].message.content)

Running the code gave me Heads. Maybe you’ll get Tails, or if you’re really lucky, something far more interesting.

The code first initializes an OpenAI client with an API key set in the environment variable OPENAI_API_KEY (to avoid sharing billing credentials here). The main action happens with client.chat.completions.create, where we specify the model to use and send the prompt (as a part of a very simple conversation named messages) to the server. We get an object called response back from the server. This object contains a lot of information, as shown below, so we need to dig into it to extract GPT-4o’s actual response to the message, which is response.choices[0].message.content.

>>> response
ChatCompletion(id='chatcmpl-B48EqZBLfUWtp9H7cwnchGTJbBDwr', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Heads', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1740324680, model='gpt-4o-2024-08-06', object='chat.completion', service_tier='default', system_fingerprint='fp_eb9dce56a8', usage=CompletionUsage(completion_tokens=2, prompt_tokens=18, total_tokens=20, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))

Now let’s flip the coin ten times. If this were a real, fair coin, of course, we would expect roughly equal heads and tails over time thanks to the law of large numbers. But GPT-4o’s coin doesn’t work quite like that.

import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

for _ in range(10):
    response = client.chat.completions.create(
        model='gpt-4o-2024-08-06',
        messages=[{'role': 'user', 'content': prompt}],
    )
    print(response.choices[0].message.content)

Running this code gave me the following output, although you might get different output, of course.

Heads
Heads
Heads
Heads
Heads
Heads
Tails
Heads
Heads
Heads

GPT-4o’s coin is clearly biased, but so are humans. Bar-Hillel, Peer, and Acquisti (2014) found that people flipping imaginary coins choose “heads” 80% of the time. Maybe GPT-4o learned that from us. But whatever the reason, we’re just using this simple example to explore determinism.
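To put a rough number on "clearly biased", here is a minimal sketch that computes how likely a fair coin would be to produce at least 9 heads in 10 flips. The helper name `prob_at_least` is made up for illustration, and ten flips is a very small sample, so treat this as a sanity check rather than a proper test.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# The ten flips shown above: 9 Heads, 1 Tails.
flips = ["Heads"] * 6 + ["Tails"] + ["Heads"] * 3
heads = flips.count("Heads")

# Chance that a fair coin gives this many heads or more.
p_value = prob_at_least(heads, len(flips))
print(f"{heads} heads out of {len(flips)}; fair-coin probability: {p_value:.4f}")
```

A fair coin would do this only about 1% of the time, which fits the impression that GPT-4o’s coin leans heavily toward Heads.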

Just how biased is GPT-4o’s coin?

Let’s say we wanted to know precisely what percentage of GPT-4o coin flips land Heads.

Rather than the obvious (but expensive) approach of flipping it a million times, there’s a smarter way. For classification tasks with a small set of possible answers, we can extract token probabilities instead of generating full responses. With the right prompt, the first token carries all the necessary information, making these API calls incredibly cheap: around 30,000 calls per dollar, since each requires just 18 (cached) input tokens and 1 output token.
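The "30,000 calls per dollar" figure can be checked with a little arithmetic. The sketch below assumes list prices of $1.25 per million cached input tokens and $10.00 per million output tokens for gpt-4o-2024-08-06; those prices are my assumption, not something stated in this article, and they may have changed.

```python
# Assumed prices in USD per 1M tokens (an assumption; check current pricing).
PRICE_CACHED_INPUT = 1.25
PRICE_OUTPUT = 10.00


def cost_per_call(cached_input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single API call under the assumed prices."""
    total = cached_input_tokens * PRICE_CACHED_INPUT + output_tokens * PRICE_OUTPUT
    return total / 1_000_000


# 18 cached input tokens and 1 output token, as described above.
call_cost = cost_per_call(cached_input_tokens=18, output_tokens=1)
calls_per_dollar = 1 / call_cost
print(f"${call_cost:.7f} per call, about {calls_per_dollar:,.0f} calls per dollar")
```

Under these assumed prices the result lands a little above 30,000 calls per dollar, in line with the estimate in the text.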

OpenAI gives us (natural) log probabilities. These are called logprobs in the code, and we convert them to regular probabilities by exponentiation. (We’ll discuss temperature soon, but note that exponentiating logprobs directly like this corresponds to a temperature setting of 1.0, and is how we calculate probabilities throughout this article.) OpenAI lets us request logprobs for the top 20 most likely tokens, so we do that.

import os
import math
from openai import OpenAI
from tabulate import tabulate

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

response = client.chat.completions.create(
    model='gpt-4o-2024-08-06',
    max_tokens=1,
    logprobs=True,
    top_logprobs=20,
    messages=[{'role': 'user', 'content': prompt}],
)

logprobs_list = response.choices[0].logprobs.content[0].top_logprobs

data = []
total_pct = 0.0

for logprob_entry in logprobs_list:
    token = logprob_entry.token
    logprob = logprob_entry.logprob
    pct = math.exp(logprob) * 100  # Convert logprob to a percentage
    total_pct += pct
    data.append([token, logprob, pct])

print(
    tabulate(
        data,
        headers=["Token", "Log Probability", "Percentage (%)"],
        tablefmt="github",
        floatfmt=("s", ".10f", ".10f")
    )
)
print(f"\nTotal probabilities: {total_pct:.6f}%")

If you run this, you’ll get something like the following output, but actual numbers will vary.

| Token     | Log Probability   | Percentage (%)   |
|-----------|-------------------|------------------|
| Heads     | -0.0380541235     | 96.2660836887    |
| T         | -3.2880542278     | 3.7326407467     |
| Sure      | -12.5380544662    | 0.0003587502     |
| Head      | -12.7880544662    | 0.0002793949     |
| Tail      | -13.2880544662    | 0.0001694616     |
| Certainly | -13.5380544662    | 0.0001319768     |
| "T        | -14.2880544662    | 0.0000623414     |
| I'm       | -14.5380544662    | 0.0000485516     |
| heads     | -14.5380544662    | 0.0000485516     |
| Heads     | -14.9130544662    | 0.0000333690     |
| "         | -15.1630544662    | 0.0000259878     |
| _heads    | -15.1630544662    | 0.0000259878     |
| tails     | -15.5380544662    | 0.0000178611     |
| HEAD      | -15.7880544662    | 0.0000139103     |
| TAIL      | -16.2880535126    | 0.0000084370     |
| T         | -16.7880535126    | 0.0000051173     |
| ```       | -16.7880535126    | 0.0000051173     |
| Here's    | -16.9130535126    | 0.0000045160     |
| I         | -17.2880535126    | 0.0000031038     |
| As        | -17.2880535126    | 0.0000031038     |

Total probabilities: 99.999970%

Looking at these probabilities, we see Heads at ≈96% and T at ≈4%. Our prompt is doing pretty well at constraining the model’s responses. Why T and not Tails? This is the tokenizer splitting Tails into T + ails, while keeping Heads as one piece, as we can see in this Python session:

>>> import tiktoken
>>> encoding = tiktoken.encoding_for_model("gpt-4o-2024-08-06")
>>> encoding.encode('Tails')
[51, 2196]
>>> encoding.decode([51])
'T'
>>> encoding.encode('Heads')
[181043]

These probabilities are not deterministic

Run the code to display the probabilities for the top 20 tokens again, and you’ll likely get different numbers. Here’s what I got on a second run.

| Token     | Log Probability   | Percentage (%)   |
|-----------|-------------------|------------------|
| Heads     | -0.0110520627     | 98.9008786933    |
| T         | -4.5110521317     | 1.0986894433     |
| Certainly | -14.0110521317    | 0.0000822389     |
| Head      | -14.2610521317    | 0.0000640477     |
| Sure      | -14.2610521317    | 0.0000640477     |
| Tail      | -14.3860521317    | 0.0000565219     |
| heads     | -15.3860521317    | 0.0000207933     |
| Heads     | -15.5110521317    | 0.0000183500     |
| ```       | -15.5110521317    | 0.0000183500     |
| _heads    | -15.6360521317    | 0.0000161938     |
| tails     | -15.6360521317    | 0.0000161938     |
| I'm       | -15.8860521317    | 0.0000126117     |
| "T        | -15.8860521317    | 0.0000126117     |
| As        | -16.3860511780    | 0.0000076494     |
| "         | -16.5110511780    | 0.0000067506     |
| HEAD      | -16.6360511780    | 0.0000059574     |
| TAIL      | -16.7610511780    | 0.0000052574     |
| Here's    | -16.7610511780    | 0.0000052574     |
| ``        | -17.1360511780    | 0.0000036133     |
| T         | -17.6360511780    | 0.0000021916     |

Total probabilities: 99.999987%
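To make the comparison concrete, the sketch below takes three tokens that appear in both tables above and converts their log probabilities to percentages with `math.exp`. The dictionary names and the `to_percent` helper are labels chosen for illustration; they are not part of the OpenAI API.

```python
import math

# Log probabilities copied from the two tables above (same prompt, two runs).
run_1 = {"Head": -12.7880544662, "Tail": -13.2880544662, "Certainly": -13.5380544662}
run_2 = {"Head": -14.2610521317, "Tail": -14.3860521317, "Certainly": -14.0110521317}


def to_percent(logprob: float) -> float:
    """Convert a natural-log probability to a percentage."""
    return math.exp(logprob) * 100


# Per-token shift in log probability between the two runs.
shifts = {token: run_2[token] - run_1[token] for token in run_1}

for token, shift in shifts.items():
    print(
        f"{token:<10} {to_percent(run_1[token]):.10f}% -> "
        f"{to_percent(run_2[token]):.10f}%  (logprob shift {shift:+.4f})"
    )
```

Every token shifts, and by a different amount, so the second run is not simply a rescaled copy of the first.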

In their cookbook, OpenAI offers the following advice on receiving “mostly identical” outputs:

If the seed, request parameters, and system_fingerprint all match across your requests, then model outputs will mostly be identical. There is a small chance that responses differ even when request parameters and system_fingerprint match, due to the inherent non-determinism of our models.

They also give “mostly identical” advice in the reproducible outputs section of their documentation.

The request parameters that could affect randomness are temperature and seed. OpenAI also suggests we track system_fingerprint, because differences here might cause differences in output. We’ll examine each of these below, but spoiler: none of them will fix or even explain this non-determinism.

Temperature, and why it won’t fix this

Temperature controls how random the model’s responses are. Low temperatures make the output more predictable, while high temperatures (>1.5) produce gibberish. Temperature is often called the “creativity parameter”, but this is an oversimplification. In their analysis, Peeperkorn, Kouwenhoven, Brown, and Jordanous (2024) evaluated LLM outputs across four dimensions of creativity: novelty (originality), coherence (logical consistency), cohesion (how well the text flows), and typicality (how well it fits expected patterns). They observed that:

temperature is weakly correlated with novelty, and unsurprisingly, moderately correlated with incoherence, but there is no relationship with either cohesion or typicality.

But this is beside the point for coin flipping. Under the hood, the log probabilities are divided by the temperature, then exponentiated and renormalized to convert them to probabilities. This creates a non-linear effect: temperature=0.5 squares the probabilities (before renormalization), making likely tokens dominate, while temperature=2.0 applies a square root, flattening the distribution.
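Here is a minimal sketch of that arithmetic, assuming the standard softmax-with-temperature formulation. OpenAI does not publish its sampling code, so this illustrates the math rather than their implementation. It uses the Heads and T log probabilities from the second table above, and `apply_temperature` is a hypothetical helper name.

```python
import math


def apply_temperature(logprobs: dict[str, float], temperature: float) -> dict[str, float]:
    """Divide log probabilities by the temperature, exponentiate, renormalize."""
    scaled = {tok: lp / temperature for tok, lp in logprobs.items()}
    total = sum(math.exp(v) for v in scaled.values())
    return {tok: math.exp(v) / total for tok, v in scaled.items()}


# Heads and T from the second table; the remaining tokens are negligible here.
logprobs = {"Heads": -0.0110520627, "T": -4.5110521317}

for temperature in (0.5, 1.0, 2.0):
    probs = apply_temperature(logprobs, temperature)
    print(temperature, {tok: round(p, 4) for tok, p in probs.items()})
```

Lower temperatures push Heads closer to 100%, and higher temperatures give T a larger share. Either way, the function only transforms whatever log probabilities it is handed.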

What about temperature=0.0? Rather than dividing by zero, the model simply picks the highest-probability token. Sounds deterministic, right? Not quite. Here’s the catch: temperature only comes into play after the log probabilities are computed, when we convert them to probabilities.

In summary: if the logprobs aren’t deterministic, setting temperature to 0.0 won’t make the model deterministic.

In fact, since we’re just asking the model for the raw logprobs directly rather than generating full responses, the temperature setting doesn’t come into play in our code at all.

Seeds, and why they won’t fix this

After temperature is used to compute probabilities, the model samples from these probabilities to pick the next token. OpenAI gives us a little control over the sampling process by letting us set the seed parameter for the random number generator. In an ideal world, setting a seed would give us determinism at any temperature. But seeds only affect sampling, not the log probabilities before sampling.

In summary: if the logprobs aren’t deterministic, setting a seed won’t make the model deterministic.

In fact, seed only matters with non-zero temperatures. With temperature=0.0, the model always chooses the highest-probability token regardless of the seed. Again, since we’re just asking the model for the raw logprobs directly rather than sampling, neither of these settings can help us achieve determinism.
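A toy sampler illustrates the point. This is a sketch under the assumption that sampling at temperature 1 behaves like a seeded draw from a two-token distribution; `sample_token` is a hypothetical helper, not OpenAI's code. The two Heads probabilities are the rounded values from the two runs shown earlier (roughly 96% and 98.9%).

```python
import random


def sample_token(p_heads: float, seed: int) -> str:
    """Draw one token from a two-token distribution using a seeded RNG."""
    rng = random.Random(seed)
    return "Heads" if rng.random() < p_heads else "T"


# Approximate probability of Heads in the two runs above (same prompt).
p_run_1 = 0.96
p_run_2 = 0.989

# For each seed the random draw is identical across runs, but the
# distribution it is compared against is not.
changed = [
    seed for seed in range(1000)
    if sample_token(p_run_1, seed) != sample_token(p_run_2, seed)
]
print(f"{len(changed)} of 1000 seeds give a different token across the two runs")
```

A fixed seed pins down the random draw, but whenever that draw lands between the two Heads probabilities, the sampled token still changes.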

System fingerprints, our last hope

The system_fingerprint identifies the current combination of model weights, infrastructure, and configuration options in OpenAI’s backend. At least, that’s what OpenAI tells us. Variations in system fingerprints might indeed explain variations in logprobs. Except that they don’t, as we will verify below.

Nothing can get you determinism

Let’s confirm what we’ve been building toward. We’ll run the same request 10 times with every safeguard in place. Even though neither of these parameters should matter for what we’re doing, you can never be too safe, so we’ll set temperature=0.0 and seed=42. And to see if infrastructure differences explain our varying logprobs, we’ll print system_fingerprint. Here’s the code:

import os
import math
from openai import OpenAI
from tabulate import tabulate
from tqdm import tqdm

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

data = []

# Send the identical request 10 times with every knob locked down.
for _ in tqdm(range(10), desc='Generating responses'):
    response = client.chat.completions.create(
        model='gpt-4o-2024-08-06',
        temperature=0.0,
        seed=42,
        max_tokens=1,       # only the first token matters here
        logprobs=True,
        top_logprobs=20,    # return the 20 most likely first tokens
        messages=[{'role': 'user', 'content': prompt}],
    )

    fingerprint = response.system_fingerprint
    # Top-20 candidates for the first (and only) generated token.
    logprobs_list = response.choices[0].logprobs.content[0].top_logprobs
    # Pick out the entry for the token 'Heads'.
    heads_logprob = next(
        entry.logprob for entry in logprobs_list if entry.token == 'Heads'
    )
    # Convert the natural-log probability to a percentage.
    pct = math.exp(heads_logprob) * 100
    data.append([fingerprint, heads_logprob, f"{pct:.10f}%"])

headers = ["Fingerprint", "Logprob", "Probability"]
print(tabulate(data, headers=headers, tablefmt="pipe"))

Here are the logprobs and probabilities for the token Heads across the 10 requests:

| Fingerprint | Logprob | Probability |
|---------------|------------|----------------|
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.160339 | 85.1854886858% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0110521 | 98.9008786933% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
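One way to summarize a table like this is to count the distinct log probabilities seen per fingerprint. The sketch below hard-codes the ten rows above so that it runs without an API call; the variable names are illustrative, and in practice `rows` would be built from the `data` list collected in the loop.

```python
from collections import Counter

# (fingerprint, logprob) pairs copied from the table above, in order.
rows = [
    ("fp_f9f4fb6dbf", -0.0380541),
    ("fp_f9f4fb6dbf", -0.0380541),
    ("fp_f9f4fb6dbf", -0.0380541),
    ("fp_f9f4fb6dbf", -0.0380541),
    ("fp_f9f4fb6dbf", -0.160339),
    ("fp_f9f4fb6dbf", -0.0380541),
    ("fp_f9f4fb6dbf", -0.0110521),
    ("fp_f9f4fb6dbf", -0.0380541),
    ("fp_f9f4fb6dbf", -0.0380541),
    ("fp_f9f4fb6dbf", -0.0380541),
]

counts = Counter(rows)
fingerprints = {fingerprint for fingerprint, _ in rows}
distinct_logprobs = {logprob for _, logprob in rows}

print(f"{len(fingerprints)} fingerprint(s), {len(distinct_logprobs)} distinct logprobs")
for (fingerprint, logprob), n in counts.most_common():
    print(f"{fingerprint}  {logprob:>10}  seen {n} time(s)")
```

A single fingerprint with three distinct log probabilities means the fingerprint cannot be what distinguishes these responses.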

Mixture-of-experts makes determinism impossible

OpenAI is decidedly not open about the architecture behind GPT-4o. However, it’s widely believed that GPT-4o uses a mixture-of-experts (MoE) architecture with either 8 or 16 experts.

According to a paper by Google DeepMind researchers Puigcerver, Riquelme, Mustafa, and Houlsby (hat tip to user elmstedt on the OpenAI forum), mixture-of-experts architectures may add an unavoidable level of non-determinism:

Under capacity constraints, all Sparse MoE approaches route tokens in groups of a fixed size and enforce (or encourage) balance within the group. When groups contain tokens from different sequences or inputs, these tokens compete for available spots in expert buffers. Therefore, the model is no longer deterministic at the sequence-level, but only at the batch-level.

In other words, when your prompt (a sequence of tokens, in the quote above) reaches OpenAI’s servers, it gets batched with a group of other prompts (OpenAI isn’t open about how many other prompts). Each prompt in the batch is then routed to an “expert” within the model. However, since only so many prompts can be routed to the same expert, the expert your prompt gets routed to will depend on all the other prompts in the batch.

This “competition” for experts introduces a real-world randomness completely beyond our control.
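The effect can be shown with a toy router. This is only an illustration of capacity-constrained routing, not GPT-4o's actual router, which is unpublished. It assumes top-1 routing, a fixed number of slots per expert per batch, items processed in batch order, and overflow falling back to the next-preferred expert. All names are hypothetical.

```python
def route(batch: list[dict], capacity: int) -> dict[str, str]:
    """Assign each item to its most-preferred expert that still has room.

    Each item is a dict with a "name" and a "prefs" list (experts ordered
    from most to least preferred). Items earlier in the batch claim slots first.
    """
    load: dict[str, int] = {}
    assignment: dict[str, str] = {}
    for item in batch:
        for expert in item["prefs"]:
            if load.get(expert, 0) < capacity:
                load[expert] = load.get(expert, 0) + 1
                assignment[item["name"]] = expert
                break
    return assignment


ours = {"name": "our_prompt", "prefs": ["expert_A", "expert_B"]}
quiet = {"name": "other_1", "prefs": ["expert_B", "expert_A"]}
rival = {"name": "other_2", "prefs": ["expert_A", "expert_B"]}

# Same input from us, different neighbours in the batch.
print(route([quiet, ours], capacity=1))  # our_prompt lands on expert_A
print(route([rival, ours], capacity=1))  # expert_A is full, so expert_B
```

Our input is identical in both calls. Only its batch neighbours differ, and that alone changes which expert handles it.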

Non-determinism beyond mixture-of-experts

While non-determinism may be inherent to real-world mixture-of-experts models, that does not seem to be the only source of non-determinism in OpenAI’s models.

Making a few changes to our code above (switching to gpt-3.5-turbo-0125, looking for the token He since GPT-3.5’s tokenizer splits “Heads” differently, and ignoring system_fingerprint because this model doesn’t have it) reveals that GPT-3.5-turbo also exhibits non-deterministic logprobs:

| Logprob | Probability |
|-------------|----------------|
| -0.00278289 | 99.7220983436% |
| -0.00415331 | 99.5855302068% |
| -0.00258838 | 99.7414961980% |
| -0.00204034 | 99.7961735289% |
| -0.00240277 | 99.7600117933% |
| -0.00204034 | 99.7961735289% |
| -0.00204034 | 99.7961735289% |
| -0.00258838 | 99.7414961980% |
| -0.00351419 | 99.6491976144% |
| -0.00201214 | 99.7989878007% |

No one is claiming that GPT-3.5-turbo uses a mixture-of-experts architecture. Thus, there must be additional factors beyond mixture-of-experts contributing to this non-determinism.

What 10,000 GPT-4o coin flip probabilities tell us

To better understand the patterns and magnitude of this non-determinism, I conducted a more extensive experiment with GPT-4o, performing 10,000 “coin flips” while recording the probability assigned to “Heads” in each case.

The results reveal something fascinating. Across 10,000 API calls with identical parameters, GPT-4o produced not just a few different probability values, but 42 distinct probabilities. If the mixture-of-experts hypothesis were the complete explanation for non-determinism in GPT-4o, we might expect to see one distinct probability for each expert. But GPT-4o is believed to have either 8 or 16 experts, not 42.

In the output below, I clustered these probabilities so that each cluster is separated from its neighbors by at least 0.01 percentage points. This groups the output into 12 clusters.

Probability       Count  Fingerprints
------------------------------------------------------------------
85.1854379113%        5  fp_eb9dce56a8, fp_f9f4fb6dbf
85.1854455275%       74  fp_eb9dce56a8, fp_f9f4fb6dbf
85.1854886858%      180  fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
88.0662448207%       31  fp_eb9dce56a8, fp_f9f4fb6dbf
88.0678628883%        2  fp_f9f4fb6dbf
------------------------------------------------------------------
92.3997629747%        1  fp_eb9dce56a8
92.3997733012%        4  fp_eb9dce56a8
92.3997836277%        3  fp_eb9dce56a8
------------------------------------------------------------------
92.4128943690%        1  fp_f9f4fb6dbf
92.4129143363%       21  fp_eb9dce56a8, fp_f9f4fb6dbf
92.4129246643%        8  fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
93.9906837191%        4  fp_eb9dce56a8
------------------------------------------------------------------
95.2569999350%       36  fp_eb9dce56a8
------------------------------------------------------------------
96.2660836887%     3391  fp_eb9dce56a8, fp_f9f4fb6dbf
96.2661285161%     2636  fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
97.0674551052%        1  fp_eb9dce56a8
97.0674778863%        3  fp_eb9dce56a8
97.0675003058%        4  fp_eb9dce56a8
97.0675116963%        1  fp_eb9dce56a8
97.0680739932%       19  fp_eb9dce56a8, fp_f9f4fb6dbf
97.0681293191%        6  fp_eb9dce56a8, fp_f9f4fb6dbf
97.0681521003%       74  fp_eb9dce56a8, fp_f9f4fb6dbf
97.0682421405%        4  fp_eb9dce56a8
------------------------------------------------------------------
97.7008960695%        1  fp_f9f4fb6dbf
97.7011122645%        3  fp_eb9dce56a8
97.7011462953%        3  fp_eb9dce56a8
97.7018178132%        1  fp_eb9dce56a8
------------------------------------------------------------------
98.2006069902%      426  fp_eb9dce56a8, fp_f9f4fb6dbf
98.2006876548%        6  fp_f9f4fb6dbf
98.2007107019%        1  fp_eb9dce56a8
98.2009525133%        5  fp_eb9dce56a8
98.2009751945%        1  fp_eb9dce56a8
98.2009867181%        1  fp_eb9dce56a8
------------------------------------------------------------------
98.5930987656%        3  fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931104270%      235  fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931222721%        4  fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931340253%        9  fp_eb9dce56a8
98.5931571644%      159  fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931805790%      384  fp_eb9dce56a8
------------------------------------------------------------------
98.9008436920%       95  fp_eb9dce56a8, fp_f9f4fb6dbf
98.9008550214%      362  fp_eb9dce56a8, fp_f9f4fb6dbf
98.9008786933%     1792  fp_eb9dce56a8, fp_f9f4fb6dbf

(With a threshold of 0.001 there are 13 clusters, and with a threshold of 0.0001 there are 17 clusters.)
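The article does not include the clustering code, so the following is a guess at the method: a simple gap-based grouping in which a new cluster starts whenever two neighboring sorted values differ by at least the threshold. The `cluster` function is a hypothetical helper, shown here on a small sample of the values listed above.

```python
def cluster(values: list[float], threshold: float) -> list[list[float]]:
    """Group sorted values, starting a new cluster at every gap >= threshold."""
    clusters: list[list[float]] = []
    for value in sorted(set(values)):
        if clusters and value - clusters[-1][-1] < threshold:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


# A small sample of the distinct Heads probabilities (in %) listed above.
sample = [
    85.1854379113, 85.1854455275, 85.1854886858,
    88.0662448207, 88.0678628883,
    92.3997629747, 92.4128943690,
]

for threshold in (0.01, 0.001):
    print(threshold, len(cluster(sample, threshold)), "clusters")
```

On this sample, tightening the threshold from 0.01 to 0.001 splits the pair near 88.07%, which is the same split that takes the full list from 12 clusters to 13.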

As the table above demonstrates, this multitude of results cannot be explained by system_fingerprint values. Across all 10,000 calls, I received only two different system fingerprints: 4488 results with fp_f9f4fb6dbf and 5512 with fp_eb9dce56a8. For the most part, the two system fingerprints returned the same sets of probabilities, rather than each fingerprint producing its own distinct set.

It could be that these 12 clusters of probabilities represent 12 different experts. Even assuming that, the variations within the clusters remain puzzling. These don’t seem likely to be simple rounding errors, because they are too systematic and consistent. Take the giant cluster at around 96.266% with two distinct probabilities representing over half of our coin flips. The difference between these two probabilities, 0.0000448274%, is tiny but persistent.

Conclusion: Non-determinism is baked in

There is an underlying randomness in the log probabilities returned by all currently available non-thinking OpenAI models: GPT-4o, GPT-4o-mini, and the two flavors of GPT-3.5-turbo. Because this non-determinism is baked into the log probabilities, there’s no way for a user to get around it. Temperature and seed values have no effect, and system fingerprints don’t explain it.

While mixture-of-experts architectures inherently introduce some randomness in the competition for experts, the non-determinism in GPT-4o seems to go far beyond this, and the non-determinism in GPT-3.5-turbo can’t be explained by this at all, because GPT-3.5-turbo isn’t believed to be a mixture-of-experts model.

While we can’t verify this claim any more because the model isn’t being served, this behaviour wasn’t seen with GPT-3, according to user _j on the OpenAI forum:

It is a symptom that was not seen on prior GPT-3 AI models where across hundreds of trials to investigate sampling, you never had to doubt that logprobs would be the same. Even if you found a top-2 answer that returned exactly the same logprob value via the API, you would never see them switch position or return different values.

This suggests that whatever is causing this randomness first emerged in either GPT-3.5 or GPT-3.5-turbo.

But regardless of when it emerged, this non-determinism is a serious obstacle to understanding these models. If you want to study a model (how it generalizes, how it biases responses, how it assigns probabilities to different tokens), you need consistency. But as we’ve seen, even when we lock down every knob OpenAI lets us touch, we still can’t get an answer to the simplest possible question: “what is the probability that GPT-4o says a coin lands heads?”

Worse, while mixture-of-experts explains some of this non-determinism, there are clearly other, hidden sources of randomness that we can’t see, control, or understand. In an ideal world, the API would provide more transparency by telling us which expert processed our request or by offering additional parameters to control this routing process. Without such visibility, we’re left guessing at the true nature of the variability.

References

Bar-Hillel, M., Peer, E., & Acquisti, A. (2014). “Heads or tails?” – A reachability bias in binary choice. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(6), 1656–1663. https://doi.org/10.1037/xlm0000005.

Peeperkorn, M., Kouwenhoven, T., Brown, D., & Jordanous, A. (2024). Is temperature the creativity parameter of Large Language Models? In The 15th International Conference on Computational Creativity (ICCC’24). arXiv:2405.00492.

Puigcerver, J., Riquelme, C., Mustafa, B., & Houlsby, N. (2024). From sparse to soft mixtures of experts. In The Twelfth International Conference on Learning Representations (ICLR 2024). https://openreview.net/forum?id=jxpsAj7ltE. arXiv:2308.00951.

Van Koevering, K., & Kleinberg, J. (2024). How random is random? Evaluating the Randomness and humanness of LLMs’ coin flips. arXiv:2406.00092.

Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy, bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

ExxonMobil bumps up 2030 target for Permian production

ExxonMobil Corp., Houston, is looking to grow production in the Permian basin to about 2.5 MMboe/d by 2030, an increase of 200,000 boe/d from executives’ previous forecasts and a jump of more than 45% from this year’s output. Helping drive that higher target is an expected 2030 cost profile that

Cloud providers continue to push EU court to undo Broadcom-VMware merger

CISPE director of communications Ben Maynard dismissed fears that any action by the Commission could lead to a fine on VMware that would be passed on to its users, increasing prices even further. “I’m not sure that a fine is a likely consequence. This isn’t an action against Broadcom; this

P4 programming: Redefining what’s possible in network infrastructure

Real problems P4 solves Visibility that actually tells you something Traditional monitoring gives you SNMP counters (updated every 30 seconds, way too slow) or NetFlow samples (statistically useful but incomplete). Neither tells you what happened to a specific transaction at a specific moment. P4 changes this completely. Your switches and

Cybersecurity skills matter more than headcount in an AI era: ISC2 study

Organizations have experienced oversights in cybersecurity processes and procedures (26%), been forced to put underqualified or inexperienced people into roles to cover them (25%), are lacking the time or resources to train cybersecurity staff (25%), and are dealing with misconfigured systems (24%), according to this year’s study. The report also

OPEC Data Points to Balanced Global Oil Market in 2026

OPEC kept forecasts for global oil supplies and demand in 2026 steady, pointing to a balanced world market that clashes with widespread predictions of a surplus. The Organization of the Petroleum Exporting Countries and its allies will need to produce an average of 43 million barrels a day next year to balance supply and demand, roughly in line with the amount pumped last month, according to a report on OPEC’s website. This runs counter to prevailing industry expectations for a supply excess in 2026. Top trader Trafigura Group said this week it could amount to a “super glut,” and the International Energy Agency — while paring its projections in its report earlier Thursday — continues to expect a record overhang. Key OPEC+ nations led by Saudi Arabia acknowledged the fragile backdrop last month by agreeing to pause further output increases during the first quarter after rapidly ramping up production earlier this year. The outlook from OPEC’s Vienna-based secretariat has proven excessively bullish in recent years. Last year, OPEC was ultimately forced to slash demand projections by 32% over the course of six monthly downgrades. In late 2023, it forecast a record inventory deficit that never materialized. WHAT DO YOU THINK? Generated by readers, the comments included herein do not reflect the views and opinions of Rigzone. All comments are subject to editorial review. Off-topic, inappropriate or insulting comments will be removed.

Antero adds to Marcellus portfolio, Infinity picks up divested Ohio Utica interests

Antero Resources Corp., Denver, Co., has signed deals to expand its Marcellus shale footprint in West Virginia and to divest its certain Ohio Utica shale assets. Adding the Marcellus assets expands Antero Resources’ core acreage position, enhancing its position “as the premier liquids developer in the Marcellus,” and provides the company “with further dry gas optionality for local demand from data centers and natural gas fired power plants,” said Michael Kennedy, president and chief executive officer, in a release Dec. 8. Marcellus acquisition from HG Energy Through a deal to acquire the upstream assets of HG Energy II LLC, Parkersburg, WV, Antero aims to add 850 MMcfed of expected Marcellus production in 2026. The deal, expected to close in second-quarter 2026, was signed for $2.8 billion in cash plus the assumption of HG Energy’s commodity hedge book. Antero said about 90% of HG natural gas production is hedged in 2026 and 2027 at average NYMEX prices of $4.00 and $3.88, respectively. The deal adds 385,000 net acres offsetting Antero’s existing 475,000 net core Marcellus acreage position and includes over 400 additional locations that immediately compete for capital (75% liquids), the company said in a related investor presentation. Antero said it anticipates capital synergies of about $550 million inclusive of development planning optimization and drilling and completions savings. Another $400 in income-related synergies is expected. Separately, Antero Midstream agreed to acquire the midstream assets from HG Energy for $1.1 billion in cash. The deal includes about 50 miles of bi-directional dry and rich gas gathering pipelines and water assets in which Antero plans to invest about $25 million to integrate with its legacy gathering and water system. Utica sale to Infinity Natural Resources Infinity Natural Resources Inc., in a release Dec. 8, said subsidiary Infinity Natural Resources LLC will acquire upstream and

Market Focus: Oversupply takes center stage, fundamentals catch up with the market

@import url(‘https://fonts.googleapis.com/css2?family=Inter:[email protected]&display=swap’); a { color: var(–color-primary-main); } .ebm-page__main h1, .ebm-page__main h2, .ebm-page__main h3, .ebm-page__main h4, .ebm-page__main h5, .ebm-page__main h6 { font-family: Inter; } body { line-height: 150%; letter-spacing: 0.025em; font-family: Inter; } button, .ebm-button-wrapper { font-family: Inter; } .label-style { text-transform: uppercase; color: var(–color-grey); font-weight: 600; font-size: 0.75rem; } .caption-style { font-size: 0.75rem; opacity: .6; } #onetrust-pc-sdk [id*=btn-handler], #onetrust-pc-sdk [class*=btn-handler] { background-color: #c19a06 !important; border-color: #c19a06 !important; } #onetrust-policy a, #onetrust-pc-sdk a, #ot-pc-content a { color: #c19a06 !important; } #onetrust-consent-sdk #onetrust-pc-sdk .ot-active-menu { border-color: #c19a06 !important; } #onetrust-consent-sdk #onetrust-accept-btn-handler, #onetrust-banner-sdk #onetrust-reject-all-handler, #onetrust-consent-sdk #onetrust-pc-btn-handler.cookie-setting-link { background-color: #c19a06 !important; border-color: #c19a06 !important; } #onetrust-consent-sdk .onetrust-pc-btn-handler { color: #c19a06 !important; border-color: #c19a06 !important; } <!–> In this Market Focus episode of the Oil & Gas Journal ReEnterprised podcast, Conglin Xu, managing editor, economics, takes a look at the growing oversupply in global crude markets and the shift now under way as fundamentals begin overtaking sentiment and geopolitics as the primary price driver. ]–>

Aramco, ExxonMobil weigh new chemical complex for Samref refinery

Saudi Aramco and partner ExxonMobil Corp. subsidiary Mobil Yanbu Refining Co. Inc. are discussing the possibility of executing a major overhaul and expansion of 50-50 joint venture Saudi Aramco-Mobil Refinery Co. Ltd.’s (Samref) 400,000-b/d Samref refinery in Yanbu, Saudi Arabia. As part of a venture framework agreement (VFA) signed on Dec. 8, the partners will evaluate potential capital investments to expand and diversify the refinery’s existing production slate, including the addition of a grassroots petrochemical complex at the site, Aramco said in a statement. In addition to upgrading and diversifying Samref’s production to include lower-emission, high-quality distillates and high-performance chemicals, the project scope would involve works to improve the refinery’s energy efficiency and implement a sitewide integrated emissions reduction strategy, according to Aramco. With the VFA now signed, the companies said they will begin the project’s preliminary front-end engineering and design (pre-FEED) study, which will focus on opportunities to maximize the site’s operational advantage and enhance its competitiveness while meeting Saudi Arabia’s growing demand for high-quality petrochemical products. For Aramco, the proposed project—the design of which aims to increase the conversion of crude oil and other petroleum liquids into higher-value chemicals—further reinforces the company’s commitment to creating further value of its overall downstream business as well as its liquids-to-chemicals strategy, according to Mohammed Y. Al Qahtani, Aramco’s downstream president. “[The proposed expansion and integration project] will also position Samref as a key driver in the growth of [Saudi Arabia’s] petrochemical sector,” Al Qahtani added. 
Without disclosing a timeline as to when the partners expect to complete the pre-FEED study or reach final investment decision, Aramco confirmed existing plans for the potential project would remain subject to market conditions and necessary regulatory approvals. Samref previously completed modifications and renovations at the Yanbu refinery in 2014-15 related to a two-phased clean-fuels project

Harbour Energy to add North Sea assets through Waldorf acquisition

@import url(‘https://fonts.googleapis.com/css2?family=Inter:[email protected]&display=swap’); a { color: var(–color-primary-main); } .ebm-page__main h1, .ebm-page__main h2, .ebm-page__main h3, .ebm-page__main h4, .ebm-page__main h5, .ebm-page__main h6 { font-family: Inter; } body { line-height: 150%; letter-spacing: 0.025em; font-family: Inter; } button, .ebm-button-wrapper { font-family: Inter; } .label-style { text-transform: uppercase; color: var(–color-grey); font-weight: 600; font-size: 0.75rem; } .caption-style { font-size: 0.75rem; opacity: .6; } #onetrust-pc-sdk [id*=btn-handler], #onetrust-pc-sdk [class*=btn-handler] { background-color: #c19a06 !important; border-color: #c19a06 !important; } #onetrust-policy a, #onetrust-pc-sdk a, #ot-pc-content a { color: #c19a06 !important; } #onetrust-consent-sdk #onetrust-pc-sdk .ot-active-menu { border-color: #c19a06 !important; } #onetrust-consent-sdk #onetrust-accept-btn-handler, #onetrust-banner-sdk #onetrust-reject-all-handler, #onetrust-consent-sdk #onetrust-pc-btn-handler.cookie-setting-link { background-color: #c19a06 !important; border-color: #c19a06 !important; } #onetrust-consent-sdk .onetrust-pc-btn-handler { color: #c19a06 !important; border-color: #c19a06 !important; } Harbour Energy plc has agreed to acquire substantially all the subsidiaries of Waldorf Energy Partners Ltd. and Waldorf Production Ltd., currently in administration, for $170 million. The company, in a release Dec. 12, said the deal would add oil-weighted production of 20,000 boe/d and 2P reserves of 35 MMboe. In addition, the deal would increase Harbour’s interest in its operated Catcher oil and gas field to 90% from 50% and provide a new production base for Harbour in the northern North Sea with the addition of a 29.5% non-operated interest in the EnQuest plc-operated Kraken oil field. 
The deal is expected to close in second-quarter 2026, subject to regulatory approvals and full and final settlement of all creditor claims against Waldorf’s subsidiaries.

EIA: US oil inventories drop 1.8 million bbl

US commercial crude inventories for the week ended Dec. 5, excluding those in the Strategic Petroleum Reserve, dropped 1.8 million bbl from the previous week to 425.7 million bbl, which is about 4% below the average range for this time of year, according to the US Energy Information Administration’s (EIA) Weekly Petroleum Status Report. Total motor gasoline inventories gained 6.4 million bbl last week and are about 1% below the 5-year average range for this time of year. Finished gasoline inventories and blending components inventories rose. Distillate fuel inventories increased by 2.5 million bbl but are 7% below the 5-year average for this time of year. EIA reported that US crude refinery inputs last week averaged 16.9 million b/d, down 17,000 b/d from the previous week’s average. Refineries operated at 94.5% of their operable capacity. Gasoline production decreased to 9.6 million b/d, while distillate fuel production increased by 380,000 b/d, averaging 5.4 million b/d. US crude imports averaged 6.6 million b/d, up 609,000 b/d from the previous week’s average. Over the last 4 weeks, crude imports averaged 6.2 million b/d, down 7.7% from the same 4-week period last year. Total motor gasoline imports, including both finished gasoline and gasoline blending components, averaged 659,000 b/d. Distillate fuel imports averaged 181,000 b/d last week.

Executive Roundtable: Converging Disciplines in the AI Buildout

At Data Center Frontier, we rely on industry leaders to help us understand the most urgent challenges facing digital infrastructure. And in the fourth quarter of 2025, the data center industry is adjusting to a new kind of complexity. AI-scale infrastructure is redefining what “mission critical” means, from megawatt density and modular delivery to the chemistry of cooling fluids and the automation of energy systems. Every project has arguably in effect now become an ecosystem challenge, demanding that electrical, mechanical, construction, and environmental disciplines act as one. For this quarter’s Executive Roundtable, DCF convened subject matter experts from Ecolab, EdgeConneX, Rehlko and Schneider Electric – leaders spanning the full chain of facilities design, deployment, and operation. Their insights illuminate how liquid cooling, energy management, and sustainable process design in data centers are now converging to set the pace for the AI era. Our distinguished executive panelists for this quarter include: Rob Lowe, Director RD&E – Global High Tech, Ecolab Phillip Marangella, Chief Marketing and Product Officer, EdgeConneX Ben Rapp, Manager, Strategic Project Development, Rehlko Joe Reele, Vice President, Datacenter Solution Architects, Schneider Electric Today: Engineering the New Normal – Liquid Cooling at Scale Today’s kickoff article grapples with how, as liquid cooling technology transitions to default hyperscale design, the challenge is no longer if, but how to scale builds safely, repeatably, and globally. Cold plates, immersion, dielectric fluids, and liquid-to-chip loops are converging into factory-integrated building blocks, yet variability in chemistry, serviceability, materials, commissioning practices, and long-term maintenance threatens to fragment adoption just as demand accelerates. Success now hinges on shared standards and tighter collaboration across OEMs, builders, and process specialists worldwide. 
So how do developers coordinate across the ecosystem to make liquid cooling a safe, maintainable global default? What’s Ahead in the Roundtable Over the coming days, our panel

DCF Trends Summit 2025: AI for Good – How Operators, Vendors and Cooling Specialists See the Next Phase of AI Data Centers

At the 2025 Data Center Frontier Trends Summit (Aug. 26-28) in Reston, Va., the conversation around AI and infrastructure moved well past the hype. In a panel sponsored by Schneider Electric—“AI for Good: Building for AI Workloads and Using AI for Smarter Data Centers”—three industry leaders explored what it really means to design, cool and operate the new class of AI “factories,” while also turning AI inward to run those facilities more intelligently. Moderated by Data Center Frontier Editor in Chief Matt Vincent, the session brought together: Steve Carlini, VP, Innovation and Data Center Energy Management Business, Schneider Electric Sudhir Kalra, Chief Data Center Operations Officer, Compass Datacenters Andrew Whitmore, VP of Sales, Motivair Together, they traced both sides of the “AI for Good” equation: building for AI workloads at densities that would have sounded impossible just a few years ago, and using AI itself to reduce risk, improve efficiency and minimize environmental impact. From Bubble Talk to “AI Factories” Carlini opened by acknowledging the volatility surrounding AI investments, citing recent headlines and even Sam Altman’s public use of the word “bubble” to describe the current phase of exuberance. “It’s moving at an incredible pace,” Carlini noted, pointing out that roughly half of all VC money this year has flowed into AI, with more already spent than in all of the previous year. Not every investor will win, he said, and some companies pouring in hundreds of billions may not recoup their capital. But for infrastructure, the signal is clear: the trajectory is up and to the right. GPU generations are cycling faster than ever. Densities are climbing from high double-digits per rack toward hundreds of kilowatts. The hyperscale “AI factories,” as NVIDIA calls them, are scaling to campus capacities measured in gigawatts. Carlini reminded the audience that in 2024,

FinOps Foundation sharpens FOCUS to reduce cloud cost chaos

“The big change that’s really started to happen in late 2024 early 2025 is that the FinOps practice started to expand past the cloud,” Storment said. “A lot of organizations got really good at using FinOps to manage the value of cloud, and then their organizations went, ‘oh, hey, we’re living in this happily hybrid state now where we’ve got cloud, SaaS, data center. Can you also apply the FinOps practice to our SaaS? Or can you apply it to our Snowflake? Can you apply it to our data center?’” The FinOps Foundation’s community has grown to approximately 100,000 practitioners. The organization now includes major cloud vendors, hardware providers like Nvidia and AMD, data center operators and data cloud platforms like Snowflake and Databricks. Some 96 of the Fortune 100 now participate in FinOps Foundation programs. The practice itself has shifted in two directions. It has moved left into earlier architectural and design processes, becoming more proactive rather than reactive. It has also moved up organizationally, from director-level cloud management roles to SVP and COO positions managing converged technology portfolios spanning multiple infrastructure types. This expansion has driven the evolution of FOCUS beyond its original cloud billing focus. Enterprises are implementing FOCUS as an internal standard for chargeback reporting even when their providers don’t generate native FOCUS data. Some newer cloud providers, particularly those focused on AI infrastructure, are using the FOCUS specification to define their billing data structures from the ground up rather than retrofitting existing systems. The FOCUS 1.3 release reflects this maturation, addressing technical gaps that have emerged as organizations apply cost management practices across increasingly complex hybrid environments. 
FOCUS 1.3 exposes cost allocation logic for shared infrastructure

The most significant technical enhancement in FOCUS 1.3 addresses a gap in how shared infrastructure costs are allocated and

Aetherflux joins the race to launch orbital data centers by 2027

Enterprises will connect to and manage orbital workloads “the same way they manage cloud workloads today,” using optical links, the spokesperson added. The company’s approach is to “continuously launch new hardware and quickly integrate the latest architectures,” with older systems running lower-priority tasks to serve out the full useful lifetime of their high-end GPUs. The company declined to disclose pricing.

Aetherflux plans to launch about 30 satellites at a time on SpaceX Falcon 9 rockets. Before the data center launch, the company will launch a power-beaming demonstration satellite in 2026 to test transmission of one kilowatt of energy from orbit to ground stations, using infrared lasers.

Competition in the sector has intensified in recent months. In November, Starcloud launched its Starcloud-1 satellite carrying an Nvidia H100 GPU, which is 100 times more powerful than any previous GPU flown in space, according to the company, and demonstrated running Google’s Gemma AI model in orbit. In the same month, Google announced Project Suncatcher, with a 2027 demonstration mission planned.

Analysts see limited near-term applications

Despite the competitive activity, orbital data centers won’t replace terrestrial cloud regions for general hosting through 2030, said Ashish Banerjee, senior principal analyst at Gartner. Instead, they suit specific workloads, including meeting data sovereignty requirements for jurisdictionally complex scenarios, offering disaster recovery immune to terrestrial risks, and providing asynchronous high-performance computing, he said.

“Orbital centers are ideal for high-compute, low-I/O batch jobs,” Banerjee said. “Think molecular folding simulations for pharma, massive Monte Carlo financial simulations, or training specific AI model weights. If the job takes 48 hours, the 500ms latency penalty of LEO is irrelevant.”

One immediate application involves processing satellite-generated data in orbit, he said.
Earth observation satellites using synthetic aperture radar generate roughly 10 gigabytes per second, but limited downlink bandwidth creates bottlenecks. Processing data in

Here’s what Oracle’s soaring infrastructure spend could mean for enterprises

He said he had earlier told analysts in a separate call that margins for AI workloads in these data centers would be in the 30% to 40% range over the life of a customer contract. Kehring offered reassurance that there would be demand for the data centers once they were completed. He pointed to Oracle’s growing remaining performance obligations (services contracted but not yet delivered), which were up $68 billion on the previous quarter, and said Oracle has been seeing unprecedented demand for AI workloads driven by the likes of Meta and Nvidia.

Rising debt and margin risks raise flags for CIOs

For analysts, though, the swelling debt load is hard to dismiss, even with Oracle’s attempts to de-risk its spend and squeeze more efficiency out of its buildouts. Gogia sees Oracle as already under pressure. The company carries one of the largest debts in corporate history, crossing $100 billion even before this quarter’s capex spend, and the financial ecosystem around it is pricing in that risk, as is evident in the rising cost of insuring the debt and the shift in credit outlook.

“The combination of heavy capex, negative free cash flow, increasing financing cost and long-dated revenue commitments forms a structural pressure that will invariably find its way into the commercial posture of the vendor,” Gogia said, hinting at an “eventual” increase in pricing of the company’s offerings. He was equally unconvinced by Magouyrk’s assurances about the margin profile of AI workloads, as he believes that AI infrastructure, particularly GPU-heavy clusters, delivers significantly lower margins in the early years because utilisation takes time to ramp.

New Nvidia software gives data centers deeper visibility into GPU thermals and reliability

Addressing the challenge

Modern AI accelerators now draw more than 700W per GPU, and multi-GPU nodes can reach 6kW, creating concentrated heat zones, rapid power swings, and a higher risk of interconnect degradation in dense racks, according to Manish Rawat, semiconductor analyst at TechInsights. Traditional cooling methods and static power planning increasingly struggle to keep pace with these loads.

“Rich vendor telemetry covering real-time power draw, bandwidth behavior, interconnect health, and airflow patterns shifts operators from reactive monitoring to proactive design,” Rawat said. “It enables thermally aware workload placement, faster adoption of liquid or hybrid cooling, and smarter network layouts that reduce heat-dense traffic clusters.”

Rawat added that the software’s fleet-level configuration insights can also help operators catch silent errors caused by mismatched firmware or driver versions. This can improve training reproducibility and strengthen overall fleet stability. “Real-time error and interconnect health data also significantly accelerates root-cause analysis, reducing MTTR and minimizing cluster fragmentation,” Rawat said. These operational pressures can shape budget decisions and infrastructure strategy at the enterprise level.

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one ramping up its investments in AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs). In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple to devote, between them, $200 billion to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are far higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

John Deere unveils more autonomous farm machines to address skilled labor shortage

Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. Moline, Illinois-based John Deere has been in business for 187 years, yet as a non-tech company it has become a regular at the big tech trade show in Las Vegas, and it is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech.

The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually, and the agricultural work force continues to shrink. (This is my hint to the anti-immigration crowd).

John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app.

While each of these industries experiences its own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

2025 playbook for enterprise AI success, from agents to evals

2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year.

1. Agents: the next generation of automation

AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs.

At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for companies, and that recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail.

Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more.

The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks.

Going all-in on red teaming pays practical, competitive dividends

It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S.’s National Institute of Standards and Technology (NIST), which all had released red teaming frameworks.

Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find.
What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle