Stay Ahead, Stay ONMINE

How LLMs Work: Reinforcement Learning, RLHF, DeepSeek R1, OpenAI o1, AlphaGo

Welcome to part 2 of my LLM deep dive. If you’ve not read Part 1, I highly encourage you to check it out first.  Previously, we covered the first two major stages of training an LLM: Pre-training — Learning from massive datasets to form a base model. Supervised fine-tuning (SFT) — Refining the model with curated examples to make it useful. Now, we’re diving into the next major stage: Reinforcement Learning (RL). While pre-training and SFT are well-established, RL is still evolving but has become a critical part of the training pipeline. I’ve taken reference from Andrej Karpathy’s widely popular 3.5-hour YouTube. Andrej is a founding member of OpenAI, his insights are gold — you get the idea. Let’s go 🚀 What’s the purpose of reinforcement learning (RL)? Humans and LLMs process information differently. What’s intuitive for us — like basic arithmetic — may not be for an LLM, which only sees text as sequences of tokens. Conversely, an LLM can generate expert-level responses on complex topics simply because it has seen enough examples during training. This difference in cognition makes it challenging for human annotators to provide the “perfect” set of labels that consistently guide an LLM toward the right answer. RL bridges this gap by allowing the model to learn from its own experience. Instead of relying solely on explicit labels, the model explores different token sequences and receives feedback — reward signals — on which outputs are most useful. Over time, it learns to align better with human intent. Intuition behind RL LLMs are stochastic — meaning their responses aren’t fixed. Even with the same prompt, the output varies because it’s sampled from a probability distribution. We can harness this randomness by generating thousands or even millions of possible responses in parallel. Think of it as the model exploring different paths — some good, some bad. Our goal is to encourage it to take the better paths more often. To do this, we train the model on the sequences of tokens that lead to better outcomes. Unlike supervised fine-tuning, where human experts provide labeled data, reinforcement learning allows the model to learn from itself. The model discovers which responses work best, and after each training step, we update its parameters. Over time, this makes the model more likely to produce high-quality answers when given similar prompts in the future. But how do we determine which responses are best? And how much RL should we do? The details are tricky, and getting them right is not trivial. RL is not “new” — It can surpass human expertise (AlphaGo, 2016) A great example of RL’s power is DeepMind’s AlphaGo, the first AI to defeat a professional Go player and later surpass human-level play. In the 2016 Nature paper (graph below), when a model was trained purely by SFT (giving the model tons of good examples to imitate from), the model was able to reach human-level performance, but never surpass it. The dotted line represents Lee Sedol’s performance — the best Go player in the world. This is because SFT is about replication, not innovation — it doesn’t allow the model to discover new strategies beyond human knowledge. However, RL enabled AlphaGo to play against itself, refine its strategies, and ultimately exceed human expertise (blue line). Image taken from AlphaGo 2016 paper RL represents an exciting frontier in AI — where models can explore strategies beyond human imagination when we train it on a diverse and challenging pool of problems to refine it’s thinking strategies. RL foundations recap Let’s quickly recap the key components of a typical RL setup: Image by author Agent — The learner or decision maker. It observes the current situation (state), chooses an action, and then updates its behaviour based on the outcome (reward). Environment  — The external system in which the agent operates. State —  A snapshot of the environment at a given step t.  At each timestamp, the agent performs an action in the environment that will change the environment’s state to a new one. The agent will also receive feedback indicating how good or bad the action was. This feedback is called a reward, and is represented in a numerical form. A positive reward encourages that behaviour, and a negative reward discourages it. By using feedback from different states and actions, the agent gradually learns the optimal strategy to maximise the total reward over time. Policy The policy is the agent’s strategy. If the agent follows a good policy, it will consistently make good decisions, leading to higher rewards over many steps. In mathematical terms, it is a function that determines the probability of different outputs for a given state — (πθ(a|s)). Value function An estimate of how good it is to be in a certain state, considering the long term expected reward. For an LLM, the reward might come from human feedback or a reward model.  Actor-Critic architecture It is a popular RL setup that combines two components: Actor — Learns and updates the policy (πθ), deciding which action to take in each state. Critic — Evaluates the value function (V(s)) to give feedback to the actor on whether its chosen actions are leading to good outcomes.  How it works: The actor picks an action based on its current policy. The critic evaluates the outcome (reward + next state) and updates its value estimate. The critic’s feedback helps the actor refine its policy so that future actions lead to higher rewards. Putting it all together for LLMs The state can be the current text (prompt or conversation), and the action can be the next token to generate. A reward model (eg. human feedback), tells the model how good or bad it’s generated text is.  The policy is the model’s strategy for picking the next token, while the value function estimates how beneficial the current text context is, in terms of eventually producing high quality responses. DeepSeek-R1 (published 22 Jan 2025) To highlight RL’s importance, let’s explore Deepseek-R1, a reasoning model achieving top-tier performance while remaining open-source. The paper introduced two models: DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero was trained solely via large-scale RL, skipping supervised fine-tuning (SFT). DeepSeek-R1 builds on it, addressing encountered challenges. Deepseek R1 is one of the most amazing and impressive breakthroughs I’ve ever seen — and as open source, a profound gift to the world. 🤖🫡— Marc Andreessen 🇺🇸 (@pmarca) January 24, 2025 Let’s dive into some of these key points.  1. RL algo: Group Relative Policy Optimisation (GRPO) One key game changing RL algorithm is Group Relative Policy Optimisation (GRPO), a variant of the widely popular Proximal Policy Optimisation (PPO). GRPO was introduced in the DeepSeekMath paper in Feb 2024.  Why GRPO over PPO? PPO struggles with reasoning tasks due to: Dependency on a critic model.PPO needs a separate critic model, effectively doubling memory and compute.Training the critic can be complex for nuanced or subjective tasks. High computational cost as RL pipelines demand substantial resources to evaluate and optimise responses.  Absolute reward evaluationsWhen you rely on an absolute reward — meaning there’s a single standard or metric to judge whether an answer is “good” or “bad” — it can be hard to capture the nuances of open-ended, diverse tasks across different reasoning domains.  How GRPO addressed these challenges: GRPO eliminates the critic model by using relative evaluation — responses are compared within a group rather than judged by a fixed standard. Imagine students solving a problem. Instead of a teacher grading them individually, they compare answers, learning from each other. Over time, performance converges toward higher quality. How does GRPO fit into the whole training process? GRPO modifies how loss is calculated while keeping other training steps unchanged: Gather data (queries + responses)– For LLMs, queries are like questions– The old policy (older snapshot of the model) generates several candidate answers for each query Assign rewards — each response in the group is scored (the “reward”). Compute the GRPO lossTraditionally, you’ll compute a loss — which shows the deviation between the model prediction and the true label.In GRPO, however, you measure:a) How likely is the new policy to produce past responses?b) Are those responses relatively better or worse?c) Apply clipping to prevent extreme updates.This yields a scalar loss. Back propagation + gradient descent– Back propagation calculates how each parameter contributed to loss– Gradient descent updates those parameters to reduce the loss– Over many iterations, this gradually shifts the new policy to prefer higher reward responses Update the old policy occasionally to match the new policy.This refreshes the baseline for the next round of comparisons. 2. Chain of thought (CoT) Traditional LLM training follows pre-training → SFT → RL. However, DeepSeek-R1-Zero skipped SFT, allowing the model to directly explore CoT reasoning. Like humans thinking through a tough question, CoT enables models to break problems into intermediate steps, boosting complex reasoning capabilities. OpenAI’s o1 model also leverages this, as noted in its September 2024 report: o1’s performance improves with more RL (train-time compute) and more reasoning time (test-time compute). DeepSeek-R1-Zero exhibited reflective tendencies, autonomously refining its reasoning.  A key graph (below) in the paper showed increased thinking during training, leading to longer (more tokens), more detailed and better responses. Image taken from DeepSeek-R1 paper Without explicit programming, it began revisiting past reasoning steps, improving accuracy. This highlights chain-of-thought reasoning as an emergent property of RL training. The model also had an “aha moment” (below) — a fascinating example of how RL can lead to unexpected and sophisticated outcomes. Image taken from DeepSeek-R1 paper Note: Unlike DeepSeek-R1, OpenAI does not show full exact reasoning chains of thought in o1 as they are concerned about a distillation risk — where someone comes in and tries to imitate those reasoning traces and recover a lot of the reasoning performance by just imitating. Instead, o1 just summaries of these chains of thoughts. Reinforcement learning with Human Feedback (RLHF) For tasks with verifiable outputs (e.g., math problems, factual Q&A), AI responses can be easily evaluated. But what about areas like summarisation or creative writing, where there’s no single “correct” answer?  This is where human feedback comes in — but naïve RL approaches are unscalable. Image by author Let’s look at the naive approach with some arbitrary numbers. Image by author That’s one billion human evaluations needed! This is too costly, slow and unscalable. Hence, a smarter solution is to train an AI “reward model” to learn human preferences, dramatically reducing human effort.  Ranking responses is also easier and more intuitive than absolute scoring. Image by author Upsides of RLHF Can be applied to any domain, including creative writing, poetry, summarisation, and other open-ended tasks. Ranking outputs is much easier for human labellers than generating creative outputs themselves. Downsides of RLHF The reward model is an approximation — it may not perfectly reflect human preferences. RL is good at gaming the reward model — if run for too long, the model might exploit loopholes, generating nonsensical outputs that still get high scores. Do note that Rlhf is not the same as traditional RL. For empirical, verifiable domains (e.g. math, coding), RL can run indefinitely and discover novel strategies. RLHF, on the other hand, is more like a fine-tuning step to align models with human preferences. Conclusion And that’s a wrap! I hope you enjoyed Part 2 🙂 If you haven’t already read Part 1 — do check it out here. Got questions or ideas for what I should cover next? Drop them in the comments — I’d love to hear your thoughts. See you in the next article!

Welcome to part 2 of my LLM deep dive. If you’ve not read Part 1, I highly encourage you to check it out first

Previously, we covered the first two major stages of training an LLM:

  1. Pre-training — Learning from massive datasets to form a base model.
  2. Supervised fine-tuning (SFT) — Refining the model with curated examples to make it useful.

Now, we’re diving into the next major stage: Reinforcement Learning (RL). While pre-training and SFT are well-established, RL is still evolving but has become a critical part of the training pipeline.

I’ve taken reference from Andrej Karpathy’s widely popular 3.5-hour YouTube. Andrej is a founding member of OpenAI, his insights are gold — you get the idea.

Let’s go 🚀

What’s the purpose of reinforcement learning (RL)?

Humans and LLMs process information differently. What’s intuitive for us — like basic arithmetic — may not be for an LLM, which only sees text as sequences of tokens. Conversely, an LLM can generate expert-level responses on complex topics simply because it has seen enough examples during training.

This difference in cognition makes it challenging for human annotators to provide the “perfect” set of labels that consistently guide an LLM toward the right answer.

RL bridges this gap by allowing the model to learn from its own experience.

Instead of relying solely on explicit labels, the model explores different token sequences and receives feedback — reward signals — on which outputs are most useful. Over time, it learns to align better with human intent.

Intuition behind RL

LLMs are stochastic — meaning their responses aren’t fixed. Even with the same prompt, the output varies because it’s sampled from a probability distribution.

We can harness this randomness by generating thousands or even millions of possible responses in parallel. Think of it as the model exploring different paths — some good, some bad. Our goal is to encourage it to take the better paths more often.

To do this, we train the model on the sequences of tokens that lead to better outcomes. Unlike supervised fine-tuning, where human experts provide labeled data, reinforcement learning allows the model to learn from itself.

The model discovers which responses work best, and after each training step, we update its parameters. Over time, this makes the model more likely to produce high-quality answers when given similar prompts in the future.

But how do we determine which responses are best? And how much RL should we do? The details are tricky, and getting them right is not trivial.

RL is not “new” — It can surpass human expertise (AlphaGo, 2016)

A great example of RL’s power is DeepMind’s AlphaGo, the first AI to defeat a professional Go player and later surpass human-level play.

In the 2016 Nature paper (graph below), when a model was trained purely by SFT (giving the model tons of good examples to imitate from), the model was able to reach human-level performance, but never surpass it.

The dotted line represents Lee Sedol’s performance — the best Go player in the world.

This is because SFT is about replication, not innovation — it doesn’t allow the model to discover new strategies beyond human knowledge.

However, RL enabled AlphaGo to play against itself, refine its strategies, and ultimately exceed human expertise (blue line).

Image taken from AlphaGo 2016 paper

RL represents an exciting frontier in AI — where models can explore strategies beyond human imagination when we train it on a diverse and challenging pool of problems to refine it’s thinking strategies.

RL foundations recap

Let’s quickly recap the key components of a typical RL setup:

Image by author
  • Agent The learner or decision maker. It observes the current situation (state), chooses an action, and then updates its behaviour based on the outcome (reward).
  • Environment  — The external system in which the agent operates.
  • State —  A snapshot of the environment at a given step t

At each timestamp, the agent performs an action in the environment that will change the environment’s state to a new one. The agent will also receive feedback indicating how good or bad the action was.

This feedback is called a reward, and is represented in a numerical form. A positive reward encourages that behaviour, and a negative reward discourages it.

By using feedback from different states and actions, the agent gradually learns the optimal strategy to maximise the total reward over time.

Policy

The policy is the agent’s strategy. If the agent follows a good policy, it will consistently make good decisions, leading to higher rewards over many steps.

In mathematical terms, it is a function that determines the probability of different outputs for a given state — (πθ(a|s)).

Value function

An estimate of how good it is to be in a certain state, considering the long term expected reward. For an LLM, the reward might come from human feedback or a reward model. 

Actor-Critic architecture

It is a popular RL setup that combines two components:

  1. Actor — Learns and updates the policy (πθ), deciding which action to take in each state.
  2. Critic — Evaluates the value function (V(s)) to give feedback to the actor on whether its chosen actions are leading to good outcomes. 

How it works:

  • The actor picks an action based on its current policy.
  • The critic evaluates the outcome (reward + next state) and updates its value estimate.
  • The critic’s feedback helps the actor refine its policy so that future actions lead to higher rewards.

Putting it all together for LLMs

The state can be the current text (prompt or conversation), and the action can be the next token to generate. A reward model (eg. human feedback), tells the model how good or bad it’s generated text is. 

The policy is the model’s strategy for picking the next token, while the value function estimates how beneficial the current text context is, in terms of eventually producing high quality responses.

DeepSeek-R1 (published 22 Jan 2025)

To highlight RL’s importance, let’s explore Deepseek-R1, a reasoning model achieving top-tier performance while remaining open-source. The paper introduced two models: DeepSeek-R1-Zero and DeepSeek-R1.

  • DeepSeek-R1-Zero was trained solely via large-scale RL, skipping supervised fine-tuning (SFT).
  • DeepSeek-R1 builds on it, addressing encountered challenges.

Let’s dive into some of these key points. 

1. RL algo: Group Relative Policy Optimisation (GRPO)

One key game changing RL algorithm is Group Relative Policy Optimisation (GRPO), a variant of the widely popular Proximal Policy Optimisation (PPO). GRPO was introduced in the DeepSeekMath paper in Feb 2024. 

Why GRPO over PPO?

PPO struggles with reasoning tasks due to:

  1. Dependency on a critic model.
    PPO needs a separate critic model, effectively doubling memory and compute.
    Training the critic can be complex for nuanced or subjective tasks.
  2. High computational cost as RL pipelines demand substantial resources to evaluate and optimise responses. 
  3. Absolute reward evaluations
    When you rely on an absolute reward — meaning there’s a single standard or metric to judge whether an answer is “good” or “bad” — it can be hard to capture the nuances of open-ended, diverse tasks across different reasoning domains. 

How GRPO addressed these challenges:

GRPO eliminates the critic model by using relative evaluation — responses are compared within a group rather than judged by a fixed standard.

Imagine students solving a problem. Instead of a teacher grading them individually, they compare answers, learning from each other. Over time, performance converges toward higher quality.

How does GRPO fit into the whole training process?

GRPO modifies how loss is calculated while keeping other training steps unchanged:

  1. Gather data (queries + responses)
    – For LLMs, queries are like questions
    – The old policy (older snapshot of the model) generates several candidate answers for each query
  2. Assign rewards — each response in the group is scored (the “reward”).
  3. Compute the GRPO loss
    Traditionally, you’ll compute a loss — which shows the deviation between the model prediction and the true label.
    In GRPO, however, you measure:
    a) How likely is the new policy to produce past responses?
    b) Are those responses relatively better or worse?
    c) Apply clipping to prevent extreme updates.
    This yields a scalar loss.
  4. Back propagation + gradient descent
    – Back propagation calculates how each parameter contributed to loss
    – Gradient descent updates those parameters to reduce the loss
    – Over many iterations, this gradually shifts the new policy to prefer higher reward responses
  5. Update the old policy occasionally to match the new policy.
    This refreshes the baseline for the next round of comparisons.

2. Chain of thought (CoT)

Traditional LLM training follows pre-training → SFT → RL. However, DeepSeek-R1-Zero skipped SFT, allowing the model to directly explore CoT reasoning.

Like humans thinking through a tough question, CoT enables models to break problems into intermediate steps, boosting complex reasoning capabilities. OpenAI’s o1 model also leverages this, as noted in its September 2024 report: o1’s performance improves with more RL (train-time compute) and more reasoning time (test-time compute).

DeepSeek-R1-Zero exhibited reflective tendencies, autonomously refining its reasoning. 

A key graph (below) in the paper showed increased thinking during training, leading to longer (more tokens), more detailed and better responses.

Image taken from DeepSeek-R1 paper

Without explicit programming, it began revisiting past reasoning steps, improving accuracy. This highlights chain-of-thought reasoning as an emergent property of RL training.

The model also had an “aha moment” (below) — a fascinating example of how RL can lead to unexpected and sophisticated outcomes.

Image taken from DeepSeek-R1 paper

Note: Unlike DeepSeek-R1, OpenAI does not show full exact reasoning chains of thought in o1 as they are concerned about a distillation risk — where someone comes in and tries to imitate those reasoning traces and recover a lot of the reasoning performance by just imitating. Instead, o1 just summaries of these chains of thoughts.

Reinforcement learning with Human Feedback (RLHF)

For tasks with verifiable outputs (e.g., math problems, factual Q&A), AI responses can be easily evaluated. But what about areas like summarisation or creative writing, where there’s no single “correct” answer? 

This is where human feedback comes in — but naïve RL approaches are unscalable.

Image by author

Let’s look at the naive approach with some arbitrary numbers.

Image by author

That’s one billion human evaluations needed! This is too costly, slow and unscalable. Hence, a smarter solution is to train an AI “reward model” to learn human preferences, dramatically reducing human effort. 

Ranking responses is also easier and more intuitive than absolute scoring.

Image by author

Upsides of RLHF

  • Can be applied to any domain, including creative writing, poetry, summarisation, and other open-ended tasks.
  • Ranking outputs is much easier for human labellers than generating creative outputs themselves.

Downsides of RLHF

  • The reward model is an approximation — it may not perfectly reflect human preferences.
  • RL is good at gaming the reward model — if run for too long, the model might exploit loopholes, generating nonsensical outputs that still get high scores.

Do note that Rlhf is not the same as traditional RL.

For empirical, verifiable domains (e.g. math, coding), RL can run indefinitely and discover novel strategies. RLHF, on the other hand, is more like a fine-tuning step to align models with human preferences.

Conclusion

And that’s a wrap! I hope you enjoyed Part 2 🙂 If you haven’t already read Part 1 — do check it out here.

Got questions or ideas for what I should cover next? Drop them in the comments — I’d love to hear your thoughts. See you in the next article!

Shape
Shape
Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy,  bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

Shape

Four new vulnerabilities found in Ingress NGINX

NGINX is a reverse proxy/load balancer that generally acts as the front-end web traffic receiver and directs it to the application service for data transformation. Ingress NGINX is a version used in Kubernetes as the controller for traffic coming into the infrastructure. It takes care of mapping traffic to pods

Read More »

WTI, Brent Gain as Talks Ease Conflict Fears

Oil edged marginally higher after a choppy session as investors assessed the status of nuclear talks between the US and Iran. West Texas Intermediate settled above $63 a barrel, with markets reacting sharply to headlines tied to the meeting. Iranian Foreign Minister Abbas Araghchi said the talks had a “good start,” even as the Wall Street Journal reported that Tehran stood by its refusal to end enrichment of nuclear fuel, a major sticking point for the US. The escalation in the Middle East, which provides about a third of the world’s crude, has added a risk premium to benchmark oil prices. Traders have weighed the geopolitical tensions against an outlook for oversupply. Still, futures in New York notched their first weekly retreat since mid-December as the US-Iran talks helped allay concerns over a broader conflict in the region. Prices also extended gains after data showed US consumer sentiment unexpectedly improved to the highest in six months, calming some concerns over an economic slowdown in the country that could lead to weaker oil demand. Meanwhile, in trilateral negotiations with the US, Ukraine and Russia agreed to exchange prisoners for the first time in five months as they sought to end their four-year conflict. Talks were making progress, with results expected “in the coming weeks,” President Donald Trump’s special envoy said. Saudi Arabia cut prices for buyers in Asia by less than expected, signaling confidence in demand for its barrels, although prices have still been reduced to the lowest levels since late 2020. Oil Prices WTI for March delivery settled 0.4% higher at $63.55 a barrel in New York. Brent for April settlement rose 0.7% to close at $68.05 a barrel. What do you think? We’d love to hear from you, join the conversation on the Rigzone Energy Network. The Rigzone Energy

Read More »

Saudis Cut Key Oil Price for Asian Buyers

Saudi Arabia cut the price of its main oil grade for buyers in Asia to the lowest in years, a further sign that global supplies are running ahead of demand. State oil producer Saudi Aramco will reduce the price of its Arab Light grade by 30 cents a barrel to parity with the regional benchmark for March, according to a price list seen by Bloomberg. That brings pricing for the kingdom’s most plentiful crude blend to the lowest level since late 2020. Still, Aramco’s cut was not as deep as buyers expected, coming in smaller than even the most modest estimate of a reduction in a survey of refiners and traders. That offers a sign that the kingdom has faith in demand for its barrels and Aramco’s Chief Executive Officer Amin Nasser has previously said that fears of a glut are overblown. Saudi Arabia’s monthly crude pricing is keenly watched by traders across the globe as it sets the tone for other sellers in the world’s top producing regions. Asia is the biggest market for Middle Eastern crude, with the prices set for refiners determining the profitability of processing and influencing the cost of fuels like gasoline and diesel the world over. Aramco also cut pricing for its Arab Medium and Arab Heavy crude grades to Asia to the lowest levels since mid 2020, while it increased prices for the Extra Light and Super Light blends. That split reflects that dynamic in the Middle East market where prices for the heavier and more sulfurous crudes that are most plentiful in the region have trailed those for the lighter blends. The OPEC+ producers group, led by Saudi Arabia and Russia, agreed to keep production levels steady during talks on Feb. 1, maintaining an earlier decision to forgo output increases to avoid

Read More »

Shell to Pause Kazakh Oil and Gas Investments

Shell Plc will pause investment in Kazakhstan as it navigates legal claims from the OPEC+ nation against oil majors that could stretch into the billions of dollars, Chief Executive Officer Wael Sawan said. Kazakhstan is pressing multiple western oil companies for compensation across a series of cases both in the Central Asian country’s courts and in international arbitration. This month, it emerged that Shell and partners lost a dispute that could see them pay as much as $4 billion. There is also ongoing litigation about sulfur breaches and project costs. “It does impact our appetite to invest further in Kazakhstan,” Sawan said Thursday during an earnings conference call with analysts. While the company sees plenty of investment opportunities in the future, “we will hold until we have a better line of sight to where things end up.” The setbacks in Kazakhstan come as Shell seeks to ensure future production growth with a healthy inventory of projects. Acquisitions have largely filled the company’s production gap through 2030, buying time to deal with the 2030-2035 period, Sawan said in an interview on Thursday. The Kazakh energy ministry didn’t reply to an emailed request for comment sent outside normal working hours. Sawan didn’t elaborate on whether the pause would apply to new or existing projects. Shell didn’t immediately respond to a request to clarify whether the CEO was talking about new or existing investments. The latest dispute was against the Karachaganak field joint venture led by Italy’s Eni SpA and Shell, over cost deductions. Other partners include Chevron Corp., Lukoil PJSC and KazMunayGas National Co. The venture may still appeal the decision.   Last year, the companies proposed settling the dispute by building a plant that would process natural gas from the field for domestic use. WHAT DO YOU THINK? Generated by readers,

Read More »

Tankers With Russian Oil Flock to East Asia

More than a dozen tankers loaded with Russian Urals oil are sailing toward Asia or idling along the route, a sign of producers racing to get cargoes closer to China as India pulls back from the trade.  These vessels — carrying a combined 10 million to 12 million barrels of oil — are spread across the Indian Ocean, and off the coasts of Malaysia, China and Russia. Five of them are indicating ‘for orders’ or ‘China for orders’ as their status, according to data intelligence firm Kpler, a category that usually means they don’t yet have a specific buyer or discharge port. Another six are signaling Singapore and Malaysia, and are likely heading to a popular spot for ship-to-ship transfers in the South China Sea where they can wait until the crude is bought. Four are floating off Malaysia, China and Russia’s Far East, without indicating a clear destination. Urals — Russia’s flagship crude grade, which is loaded from ports in the Baltic Sea — has become the variety of choice for Indian refiners since the invasion of Ukraine in early 2022 saw it become heavily discounted. But pressure from Washington has pushed imports lower, reaching an average of 1.2 million barrels a day in January compared with a peak of more than 2 million barrels a day in mid-2024. Indian imports of the crude could be trimmed further after President Donald Trump said on Monday the country would stop buying Russian oil as part of deal to cut trade tariffs. Prime Minister Narendra Modi confirmed the agreement but didn’t comment on oil. Some refiners are holding off purchases while they seek clarification from New Delhi.  The big question is where the surplus cargoes of Urals — the bulk of which have gone to India over the last few years — will now end up. China’s

Read More »

BP, KOC Sign ETSA Extension

In a statement sent to Rigzone on Thursday, BP announced that it and Kuwait Oil Company have signed an extension of the Enhanced Technical Services Agreement (ETSA) between the companies. The agreement “paves the way for both companies to collaboratively progress Kuwait’s most strategic asset fields”, BP noted in the statement. BP added that the deal enables it to “bring expertise in enhanced oil recovery to the Greater Burgan oil field and develop local capabilities with Kuwait Oil Company to manage the development of South and East Kuwait fields through 50 secondment opportunities of BP’s technical experts”. Rigzone asked BP to disclose the deal’s value. A BP spokesperson was unable to do so. The ETSA was originally signed in 2016 for a period of 10 years, the statement highlighted, adding that it will now extend through to March 2029. BP Executive Vice President, Gas & Low Carbon Energy, William Lin, noted in the statement, “BP’s commitment to Kuwait dates back to our participation in the discovery of the Greater Burgan oil field in the 1930s, and we appreciate the trust placed in our expertise in giant oil and gas fields to continue to help develop this important strategic asset”. “This is another example of the deep relationships we’ve formed across governments, partners, and supply chains in the regions where we operate. We look forward to continuing our strong collaboration with Kuwait and to working with KOC to help support the country’s long-term energy resilience,” he added. BP notes on its website that it was one of the founders of the original Kuwait Oil Company, which it highlighted first discovered oil at Burgan in 1938. “Exportation of KOC began in 1946, in which the first export of Kuwait crude was loaded on to the bp vessel ‘Fusilier’,” BP’s site adds. BP

Read More »

IPAA Promotes Naatz to Chief Policy Officer

In a statement sent to Rigzone on Thursday, the Independent Petroleum Association of America (IPAA) announced that Dan Naatz has been promoted to Executive Vice President and Chief Policy Officer. As Chief Policy Officer, Naatz will lead IPAA’s policy priorities and oversee all government relations and advocacy efforts, the organization said in the statement, adding that he will focus “on the issues most critical to independent producers across regulatory, legislative, and permitting environments”. “He will also continue to build consensus across IPAA’s diverse membership and strengthen partnerships with aligned organizations,” the IPAA noted. Naatz also serves as Corporate Secretary on the IPAA board of directors, the statement highlighted. Prior to joining the IPAA in 2003, Naatz spent 12 years on Capitol Hill working for the late Senator Craig Thomas in various capacities, the IPAA pointed out in its statement. “Dan has led IPAA’s advocacy efforts on Capitol Hill, securing meaningful wins for independent oil and natural gas producers on issues including methane regulation, federal leasing, and permitting reform,” the statement said. “His decades of leadership and judgment have strengthened IPAA’s voice in Washington at a critical time for IPAA members,” it added. A bio on Naatz hosted on the IPAA website states that, “for more than two decades, Dan has been the public face for independent oil and natural gas producers in Washington, representing the industry in congressional hearings on Capitol Hill, roundtables, coalition meetings, and with federal regulating agencies”. “IPAA actively engages with agencies including the Environmental Protection Agency (EPA), Department of Interior (DOI), Bureau of Land Management (BLM) in support of rules and timelines that are technically feasible and cost-effective, so producers can do what they do best: provide reliable and affordable energy for Americans,” the bio page adds. IPAA President and CEO Edith Naegele said in the IPAA statement, “IPAA

Read More »

Nvidia’s $100 Billion OpenAI Bet Shrinks and Signals a New Phase in the AI Infrastructure Cycle

One of the most eye-popping figures of the AI boom – a proposed $100 billion Nvidia commitment to OpenAI and as much as 10 gigawatts of compute for the company’s Stargate AI infrastructure buildout – is no longer on the table. And that partial retreat tells the data center industry something important. According to multiple reports surfacing at the end of January, Nvidia has paused and re-scoped its previously discussed, non-binding investment framework with OpenAI, shifting from an unprecedented capital-plus-infrastructure commitment to a much smaller (though still massive) equity investment. What was once framed as a potential $100 billion alignment is now being discussed in the $20-30 billion range, as part of OpenAI’s broader effort to raise as much as $100 billion at a valuation approaching $830 billion. For data center operators, infrastructure developers, and power providers, the recalibration matters less for the headline number and more for what it reveals about risk discipline, competitive dynamics, and the limits of vertical circularity in AI infrastructure finance. From Moonshot to Measured Capital The original September 2025 memorandum reportedly contemplated not just capital, but direct alignment on compute delivery: a structure that would have tightly coupled Nvidia’s balance sheet with OpenAI’s AI-factory roadmap. By late January, however, sources indicated Nvidia executives had grown uneasy with both the scale and the structure of the deal. Speaking in Taipei on January 31, Nvidia CEO Jensen Huang pushed back on reports of friction, calling them “nonsense” and confirming Nvidia would “absolutely” participate in OpenAI’s current fundraising round. But Huang was also explicit on what had changed: the investment would be “nothing like” $100 billion, even if it ultimately becomes the largest single investment Nvidia has ever made. That nuance matters. Nvidia is not walking away from OpenAI. But it is drawing a clearer boundary around

Read More »

Data Center Jobs: Engineering, Construction, Commissioning, Sales, Field Service and Facility Tech Jobs Available in Major Data Center Hotspots

Each month Data Center Frontier, in partnership with Pkaza, posts some of the hottest data center career opportunities in the market. Here’s a look at some of the latest data center jobs posted on the Data Center Frontier jobs board, powered by Pkaza Critical Facilities Recruiting. Looking for Data Center Candidates? Check out Pkaza’s Active Candidate / Featured Candidate Hotlist Onsite Engineer – Critical FacilitiesCharleston, SC This is NOT a traveling position. Having degreed engineers seems to be all the rage these days. I can also use this type of candidate in following cities: Ashburn, VA; Moncks Corner, SC; Binghamton, NY; Dallas, TX or Indianapolis, IN. Our client is an engineering design and commissioning company that is a subject matter expert in the data center space. This role will be onsite at a customer’s data center. They will provide onsite design coordination and construction administration, consulting and management support for the data center / mission critical facilities space with the mindset to provide reliability, energy efficiency, sustainable design and LEED expertise when providing these consulting services for enterprise, colocation and hyperscale companies. This career-growth minded opportunity offers exciting projects with leading-edge technology and innovation as well as competitive salaries and benefits. Electrical Commissioning Engineer Ashburn, VA This traveling position is also available in: New York, NY; White Plains, NY;  Richmond, VA; Montvale, NJ; Charlotte, NC; Atlanta, GA; Hampton, GA; New Albany, OH; Cedar Rapids, IA; Phoenix, AZ; Salt Lake City, UT; Dallas, TX; Kansas City, MO; Omaha, NE; Chesterton, IN or Chicago, IL. *** ALSO looking for a LEAD EE and ME CxA Agents and CxA PMs *** Our client is an engineering design and commissioning company that has a national footprint and specializes in MEP critical facilities design. They provide design, commissioning, consulting and management expertise in the critical facilities space. They

Read More »

Operationalizing AI at Scale: Google Cloud on Data Infrastructure, Search, and Enterprise AI

The AI conversation has been dominated by model announcements, benchmark races, and the rapid evolution of large language models. But in enterprise environments, the harder problem isn’t building smarter models. It’s making them work reliably with real-world data. On the latest episode of the Data Center Frontier Show Podcast, Sailesh Krishnamurthy, VP of Engineering for Databases at Google Cloud, pulled back the curtain on the infrastructure layer where many ambitious AI initiatives succeed, or quietly fail. Krishnamurthy operates at the intersection of databases, search, and AI systems. His perspective underscores a growing reality across enterprise IT: AI success increasingly depends on how organizations manage, integrate, and govern data across operational systems, not just how powerful their models are. The Disconnect Between LLMs and Reality Enterprises today face a fundamental challenge: connecting LLMs to real-time operational data. Search systems handle documents and unstructured information well. Operational databases manage transactions, customer data, and financial records with precision. But combining the two remains difficult. Krishnamurthy described the problem as universal. “Inside enterprises, knowledge workers are often searching documents while separately querying operational systems,” he said. “But combining unstructured information with operational database data is still hard to do.” Externally, customers encounter the opposite issue. Portals expose personal data but struggle to incorporate broader contextual information. “You get a narrow view of your own data,” he explained, “but combining that with unstructured information that might answer your real question is still challenging.” The result: AI systems often operate with incomplete context. Vector Search Moves Into the Database Vector search has emerged as a bridge between structured and unstructured worlds. But its evolution over the past three years has changed how enterprises deploy it. Early use cases focused on semantic search, i.e. finding meaning rather than exact keyword matches. Bug tracking systems, for example, began

Read More »

Transmission at the Breaking Point: Why the Grid Is Becoming the Defining Constraint for AI Data Centers

Regions in a Position to Scale California (A- overall)California continues to lead in long-term, scenario-based transmission planning. CAISO’s most recent transmission plan identifies $4.8 billion in new projects to accommodate approximately 76 gigawatts of additional capacity by 2039, explicitly accounting for data center growth alongside broader electrification. For data center developers, California’s challenge is less about planning quality and more about execution. Permitting timelines, cost allocation debates, and political scrutiny remain significant hurdles. Plains / Southwest Power Pool (B- overall, A in regional planning)SPP stands out nationally for embracing ultra-high-voltage transmission as a backbone strategy. Its recent Integrated Transmission Plans approve more than $16 billion in new projects, including multiple 765-kV lines, with benefit-cost ratios exceeding 10:1. This approach positions the Plains region as one of the most structurally “AI-ready” grids in North America, particularly for multi-gigawatt campuses supported by wind, natural gas, and emerging nuclear resources. Midwest / MISO (B overall)MISO’s Long-Range Transmission Planning framework aligns closely with federal best practices, co-optimizing generation and transmission over long planning horizons. While challenges remain—particularly around interregional coordination—the Midwest is comparatively well positioned for sustained data center growth. Regions Facing Heightened Risk Texas / ERCOT (D- overall)Texas has approved massive new transmission investments, including 765-kV projects tied to explosive load growth in the Permian Basin. However, the report criticizes ERCOT’s planning for remaining largely siloed and reliability-driven, with limited long-term scenario analysis and narrow benefit assessments. For data centers, ERCOT still offers speed to market, but increasingly with risks tied to congestion, price volatility, and political backlash surrounding grid reliability. Southeast (F overall)The Southeast receives failing grades across all categories, with transmission development remaining fragmented, utility-driven, and largely disconnected from durable regional planning frameworks. As AI data centers increasingly target the region for its land availability and tax incentives, the lack of

Read More »

From Row-Level CDUs to Facility-Scale Cooling: DCX Ramps Liquid Cooling for the AI Factory Era

Enter the 8MW CDU Era The next evolution arrived just days later. On Jan. 20, DCX announced its second-generation facility-scale unit, the FDU V2AT2, pushing capacity into territory previously unimaginable for single CDU platforms. The system delivers up to 8.15 megawatts of heat transfer capacity with record flow rates designed to support 45°C warm-water cooling, aligning directly with NVIDIA’s roadmap for rack-scale AI systems, including Vera Rubin-class deployments. That temperature target is significant. Warm-water cooling at this level allows many facilities to eliminate traditional chillers for heat rejection, depending on climate and deployment design. Instead of relying on compressor-driven refrigeration, operators can shift toward dry coolers or other simplified heat rejection strategies. The result: • Reduced mechanical complexity• Lower energy consumption• Improved efficiency at scale• New opportunities for heat reuse According to DCX CTO Maciek Szadkowski, the goal is to avoid obsolescence in a single hardware generation: “As the datacenter industry transitions to AI factories, operators need cooling systems that won’t be obsolete in one platform cycle. The FDU V2AT2 replaces multiple legacy CDUs and enables 45°C supply water operation while simplifying cooling topology and significantly reducing both CAPEX and OPEX.” The unit incorporates a high-capacity heat exchanger with a 2°C approach temperature, N+1 redundant pump configuration, integrated water quality control, and diagnostics systems designed for predictive maintenance. In short, this is infrastructure built not for incremental density growth, but for hyperscale AI facilities where megawatts of cooling must scale as predictably as compute capacity. Liquid Cooling Becomes System Architecture The broader industry implication is clear: cooling is no longer an auxiliary mechanical function. It is becoming system architecture. DCX’s broader 2025 performance metrics underscore the speed of this transition. The company reported 600% revenue growth, expanded its workforce fourfold, and shipped or secured contracts covering more than 500 MW

Read More »

AI Infrastructure Scales Out and Up: Edge Expansion Meets the Gigawatt Campus Era

The AI infrastructure boom is often framed around massive hyperscale campuses racing to secure gigawatts of power. But an equally important shift is happening in parallel: AI infrastructure is also becoming more distributed, modular, and sovereign, extending compute far beyond traditional data center hubs. A wave of recent announcements across developers, infrastructure investors, and regional operators shows the market pursuing a dual strategy. On one end, developers are accelerating delivery of hyperscale campuses measured in hundreds of megawatts, and increasingly gigawatts, often located where power availability and energy economics offer structural advantage, and in some cases pairing compute directly with dedicated generation. On the other, providers are building increasingly capable regional and edge facilities designed to bring AI compute closer to users, industrial operations, and national jurisdictions. Taken together, these moves point toward a future in which AI infrastructure is no longer purely centralized, but built around interconnected hub-and-spoke architectures combining energy-advantaged hyperscale cores with rapidly deployable edge capacity. Recent developments across hyperscale developers, edge specialists, infrastructure investors, and regional operators illustrate how quickly this model is taking shape. Sovereign AI Moves Beyond the Core On Feb. 5, 2026, San Francisco-based Armada and European AI infrastructure builder Nscale signed a letter of intent to jointly deploy both large-scale and edge AI infrastructure worldwide. The collaboration targets enterprise and public sector customers seeking sovereign, secure, geographically distributed AI environments. Nscale is building large AI supercomputer clusters globally, offering vertically integrated capabilities spanning power, data centers, compute, and software. Armada specializes in modular deployments through its Galleon data centers and Armada Edge Platform, delivering compute and storage into remote or infrastructure-poor environments. The combined offering addresses a growing challenge: many governments and enterprises want AI capability deployed within their own jurisdictions, even where traditional hyperscale infrastructure does not yet exist. “There is

Read More »

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one that is ramping up its investments into AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs).  In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple would between them devote $200 billion to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are way higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

Read More »

John Deere unveils more autonomous farm machines to address skill labor shortage

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. The Moline, Illinois-based John Deere has been in business for 187 years, yet it’s been a regular as a non-tech company showing off technology at the big tech trade show in Las Vegas and is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech. The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually; and the agricultural work force continues to shrink. (This is my hint to the anti-immigration crowd). John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app. While each of these industries experiences their own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% percent of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

Read More »

2025 playbook for enterprise AI success, from agents to evals

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More 2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year. 1. Agents: the next generation of automation AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for companies, and that recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to

Read More »

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more. The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks. Going all-in on red teaming pays practical, competitive dividends It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S.’s National Institute of Standards and Technology (NIST), which all had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find. What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle

Read More »