How LLMs Work: Reinforcement Learning, RLHF, DeepSeek R1, OpenAI o1, AlphaGo

Welcome to part 2 of my LLM deep dive. If you’ve not read Part 1, I highly encourage you to check it out first.

Previously, we covered the first two major stages of training an LLM:

Pre-training — Learning from massive datasets to form a base model.
Supervised fine-tuning (SFT) — Refining the model with curated examples to make it useful.

Now, we’re diving into the next major stage: Reinforcement Learning (RL). While pre-training and SFT are well-established, RL is still evolving but has become a critical part of the training pipeline.

I’ve drawn on Andrej Karpathy’s widely popular 3.5-hour YouTube video. Andrej is a founding member of OpenAI, so his insights are gold. You get the idea.

Let’s go 🚀

What’s the purpose of reinforcement learning (RL)?

Humans and LLMs process information differently. What’s intuitive for us — like basic arithmetic — may not be for an LLM, which only sees text as sequences of tokens. Conversely, an LLM can generate expert-level responses on complex topics simply because it has seen enough examples during training.

This difference in cognition makes it challenging for human annotators to provide the “perfect” set of labels that consistently guide an LLM toward the right answer.

RL bridges this gap by allowing the model to learn from its own experience.

Instead of relying solely on explicit labels, the model explores different token sequences and receives feedback — reward signals — on which outputs are most useful. Over time, it learns to align better with human intent.

Intuition behind RL

LLMs are stochastic — meaning their responses aren’t fixed. Even with the same prompt, the output varies because it’s sampled from a probability distribution.
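To make the sampling step concrete, here is a minimal sketch of drawing a next token from a probability distribution. The vocabulary, the scores and the helper names (`softmax`, `sample_token`) are made up for illustration. A real LLM does the same thing over a vocabulary of many thousands of tokens.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, rng, temperature=1.0):
    """Sample one token according to the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical next-token scores after the prompt "The sky is"
vocab = ["blue", "clear", "falling"]
logits = [2.0, 1.0, 0.1]

rng = random.Random(0)  # fixed seed only so the demo is repeatable
samples = [sample_token(vocab, logits, rng) for _ in range(10)]
print(samples)  # the same prompt can produce different tokens across draws
```

Raising `temperature` flattens the distribution and makes the output more varied. Lowering it concentrates probability on the highest-scoring token.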

We can harness this randomness by generating thousands or even millions of possible responses in parallel. Think of it as the model exploring different paths — some good, some bad. Our goal is to encourage it to take the better paths more often.

To do this, we train the model on the sequences of tokens that lead to better outcomes. Unlike supervised fine-tuning, where human experts provide labeled data, reinforcement learning allows the model to learn from itself.

The model discovers which responses work best, and after each training step, we update its parameters. Over time, this makes the model more likely to produce high-quality answers when given similar prompts in the future.
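The idea of sampling, scoring and then nudging the parameters can be sketched with a toy example. This is not how an LLM is actually trained: the "model" here is just one preference score per canned response, the reward table is invented, and the update is a plain REINFORCE-style rule chosen for illustration.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate responses to "What is 2 + 2?" and a toy reward
responses = ["4", "5", "22"]
reward = {"4": 1.0, "5": 0.0, "22": 0.0}

prefs = [0.0, 0.0, 0.0]  # the "parameters": one preference score per response
learning_rate = 0.5
rng = random.Random(0)

for step in range(200):
    probs = softmax(prefs)
    # Explore: sample a response from the current distribution
    i = rng.choices(range(len(responses)), weights=probs, k=1)[0]
    r = reward[responses[i]]
    # Raise the log-probability of the sampled response in proportion to its reward
    for j in range(len(prefs)):
        grad_log_prob = (1.0 if j == i else 0.0) - probs[j]
        prefs[j] += learning_rate * r * grad_log_prob

final_probs = softmax(prefs)
print(dict(zip(responses, final_probs)))  # probability mass shifts toward "4"
```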

But how do we determine which responses are best? And how much RL should we do? The details are tricky, and getting them right is not trivial.

RL is not “new” — It can surpass human expertise (AlphaGo, 2016)

A great example of RL’s power is DeepMind’s AlphaGo, the first AI to defeat a professional Go player and later surpass human-level play.

In the 2016 Nature paper (graph below), when a model was trained purely by SFT (giving the model tons of good examples to imitate), it was able to reach human-level performance but never surpass it.

The dotted line represents the performance of Lee Sedol, one of the best Go players in the world.

This is because SFT is about replication, not innovation — it doesn’t allow the model to discover new strategies beyond human knowledge.

However, RL enabled AlphaGo to play against itself, refine its strategies, and ultimately exceed human expertise (blue line).

RL represents an exciting frontier in AI, where models can explore strategies beyond human imagination when we train them on a diverse and challenging pool of problems to refine their thinking strategies.

RL foundations recap

Let’s quickly recap the key components of a typical RL setup:

Agent — The learner or decision maker. It observes the current situation (state), chooses an action, and then updates its behaviour based on the outcome (reward).
Environment — The external system in which the agent operates.
State — A snapshot of the environment at a given step t.

At each timestep, the agent performs an action in the environment that changes the environment’s state to a new one. The agent also receives feedback indicating how good or bad the action was.

This feedback is called a reward, and is represented in a numerical form. A positive reward encourages that behaviour, and a negative reward discourages it.

By using feedback from different states and actions, the agent gradually learns the optimal strategy to maximise the total reward over time.
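The interaction loop described above can be sketched in a few lines. The `Corridor` environment, its reward values and the two policies are invented for illustration. They only show how state, action and reward fit together.

```python
import random

class Corridor:
    """Toy environment: start at position 0, finish on reaching `goal`."""

    def __init__(self, goal=3):
        self.goal = goal
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (state, reward, done)."""
        self.state = max(0, self.state + action)
        done = self.state == self.goal
        reward = 1.0 if done else -0.1  # small penalty per step, bonus at the goal
        return self.state, reward, done

def run_episode(env, policy, max_steps=50):
    """Run one episode and return the total reward collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # agent chooses an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
        if done:
            break
    return total_reward

rng = random.Random(0)

def random_policy(state):
    return rng.choice([-1, 1])

def always_right(state):
    return 1

print(run_episode(Corridor(), random_policy))
print(run_episode(Corridor(), always_right))  # approximately 0.8, the best possible here
```

A learning agent would sit between these two extremes: it starts out close to `random_policy` and uses the rewards to move toward `always_right`.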

Policy

The policy is the agent’s strategy. If the agent follows a good policy, it will consistently make good decisions, leading to higher rewards over many steps.

In mathematical terms, it is a function that gives the probability of each possible action in a given state: πθ(a|s).
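For an LLM, one common way to realise such a policy is a softmax over the scores (logits) the network assigns to each candidate token. The symbol z_θ(s, a) for the logit is notation introduced here for illustration:

```latex
\pi_\theta(a \mid s) = \frac{\exp\big(z_\theta(s, a)\big)}{\sum_{a'} \exp\big(z_\theta(s, a')\big)},
\qquad \sum_{a} \pi_\theta(a \mid s) = 1
```

Training changes θ, which changes the logits, which in turn shifts probability toward or away from each action.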

Value function

The value function is an estimate of how good it is to be in a certain state, considering the long-term expected reward. For an LLM, the reward might come from human feedback or a reward model.
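One simple way to estimate such a value is to average the discounted return over several episodes that start from the same state. The sketch below assumes a discount factor `gamma` (a standard RL ingredient that the article does not discuss) and uses invented reward sequences.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each discounted by gamma per step into the future."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def estimate_value(episodes, gamma=0.99):
    """Monte Carlo estimate: average discounted return over sampled episodes."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)

# Hypothetical reward sequences observed after starting from the same state
episodes = [
    [0.0, 0.0, 1.0],  # success after three steps
    [0.0, 1.0],       # success after two steps
    [0.0, 0.0, 0.0],  # no reward at all
]
print(estimate_value(episodes, gamma=0.9))  # approximately 0.57
```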

Actor-Critic architecture

It is a popular RL setup that combines two components:

Actor — Learns and updates the policy (πθ), deciding which action to take in each state.
Critic — Evaluates the value function (V(s)) to give feedback to the actor on whether its chosen actions are leading to good outcomes.

How it works:

The actor picks an action based on its current policy.
The critic evaluates the outcome (reward + next state) and updates its value estimate.
The critic’s feedback helps the actor refine its policy so that future actions lead to higher rewards.
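The three steps above can be sketched on a deliberately tiny problem with a single state and three actions. Everything here is an assumption made for illustration: the reward means, the noise level, the learning rates, and the choice of a scalar critic with a softmax actor.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

true_reward = [0.2, 0.8, 0.5]  # hypothetical expected reward of each action
prefs = [0.0, 0.0, 0.0]        # actor: action preferences that define the policy
value = 0.0                    # critic: estimated value of the (single) state
actor_lr, critic_lr = 0.1, 0.1
rng = random.Random(0)

for step in range(3000):
    probs = softmax(prefs)
    # 1. The actor picks an action from its current policy
    a = rng.choices(range(len(prefs)), weights=probs, k=1)[0]
    r = true_reward[a] + rng.gauss(0.0, 0.1)  # noisy reward from the environment
    # 2. The critic compares the outcome with its estimate, then updates it
    advantage = r - value
    value += critic_lr * advantage
    # 3. The actor shifts probability toward actions that beat the critic's estimate
    for j in range(len(prefs)):
        grad_log_prob = (1.0 if j == a else 0.0) - probs[j]
        prefs[j] += actor_lr * advantage * grad_log_prob

final_probs = softmax(prefs)
print(final_probs, value)
```

The critic's estimate acts as a moving baseline: an action is reinforced only when it does better than what the critic already expected.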

Putting it all together for LLMs

The state can be the current text (prompt or conversation), and the action can be the next token to generate. A reward signal (e.g. from human feedback or a reward model) tells the model how good or bad its generated text is.

The policy is the model’s strategy for picking the next token, while the value function estimates how beneficial the current text context is, in terms of eventually producing high-quality responses.
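Under this mapping, generating a response is one RL episode. The sketch below uses stand-ins (`toy_policy`, `toy_reward`, a three-token vocabulary) in place of a real LLM and reward model. It assumes the reward is given once, for the finished response.

```python
import random

def generate_episode(prompt, policy, reward_fn, max_new_tokens=5, eos="<eos>"):
    """Roll out one response token by token and score the finished text."""
    state = list(prompt)            # state: the text so far, as tokens
    for _ in range(max_new_tokens):
        action = policy(state)      # action: the next token
        state.append(action)        # the transition is simply appending the token
        if action == eos:
            break
    response = state[len(prompt):]
    return response, reward_fn(prompt, response)

# Hypothetical stand-ins for a real LLM and a real reward model
rng = random.Random(0)
toy_vocab = ["4", "5", "<eos>"]

def toy_policy(state):
    return rng.choice(toy_vocab)

def toy_reward(prompt, response):
    return 1.0 if response[:1] == ["4"] else 0.0

response, reward = generate_episode(["2", "+", "2", "="], toy_policy, toy_reward)
print(response, reward)
```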

DeepSeek-R1 (published 22 Jan 2025)

To highlight RL’s importance, let’s explore DeepSeek-R1, a reasoning model achieving top-tier performance while remaining open-source. The paper introduced two models: DeepSeek-R1-Zero and DeepSeek-R1.

DeepSeek-R1-Zero was trained solely via large-scale RL, skipping supervised fine-tuning (SFT).
DeepSeek-R1 builds on it, addressing the challenges R1-Zero ran into, such as poor readability and language mixing.

Deepseek R1 is one of the most amazing and impressive breakthroughs I’ve ever seen — and as open source, a profound gift to the world. 🤖🫡

— Marc Andreessen 🇺🇸 (@pmarca) January 24, 2025

Let’s dive into some of these key points.

1. RL algo: Group Relative Policy Optimisation (GRPO)

One key, game-changing RL algorithm is Group Relative Policy Optimisation (GRPO), a variant of the widely popular Proximal Policy Optimisation (PPO). GRPO was introduced in the DeepSeekMath paper in February 2024.

Why GRPO over PPO?

PPO struggles with reasoning tasks due to:

Dependency on a critic model
– PPO needs a separate critic model, effectively doubling memory and compute.
– Training the critic can be complex for nuanced or subjective tasks.
High computational cost
– RL pipelines demand substantial resources to evaluate and optimise responses.
Absolute reward evaluations
– When you rely on an absolute reward (a single standard or metric that judges whether an answer is “good” or “bad”), it can be hard to capture the nuances of open-ended, diverse tasks across different reasoning domains.

How GRPO addressed these challenges:

GRPO eliminates the critic model by using relative evaluation — responses are compared within a group rather than judged by a fixed standard.

Imagine students solving a problem. Instead of a teacher grading them individually, they compare answers, learning from each other. Over time, performance converges toward higher quality.
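In code, the relative evaluation boils down to normalising each response's reward against the other responses sampled for the same prompt. The sketch below follows the mean-and-standard-deviation normalisation described in the DeepSeekMath paper. The function name, the small `eps` term and the use of the population standard deviation are choices made here for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each response relative to the other responses in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four answers sampled for the same question
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))  # approximately [1, -1, -1, 1]
```

Responses above the group average get a positive advantage and are reinforced. Those below it get a negative one. If all responses score the same, every advantage is zero and the group gives no learning signal.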

How does GRPO fit into the whole training process?

GRPO modifies how loss is calculated while keeping other training steps unchanged:

Gather data (queries + responses)
– For LLMs, queries are like questions
– The old policy (older snapshot of the model) generates several candidate answers for each query
Assign rewards — each response in the group is scored (the “reward”).
Compute the GRPO loss
– Traditionally, you’ll compute a loss that shows the deviation between the model prediction and the true label.
– In GRPO, however, you measure:
a) How likely is the new policy to produce past responses?
b) Are those responses relatively better or worse?
c) Apply clipping to prevent extreme updates.
– This yields a scalar loss.
Back propagation + gradient descent
– Back propagation calculates how each parameter contributed to the loss
– Gradient descent updates those parameters to reduce the loss
– Over many iterations, this gradually shifts the new policy to prefer higher-reward responses
Update the old policy occasionally to match the new policy
– This refreshes the baseline for the next round of comparisons.
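The loss in the third step can be sketched as a clipped surrogate objective, the form GRPO shares with PPO. This is a simplified illustration, not the full GRPO objective: it scores whole responses rather than individual tokens and leaves out the KL penalty toward a reference model that the DeepSeekMath paper includes. The function name and the `clip_eps` default are assumptions.

```python
import math

def clipped_surrogate_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Simplified clipped policy loss over a group of sampled responses."""
    terms = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # a) How much more (or less) likely is the response under the new policy?
        ratio = math.exp(new_lp - old_lp)
        # c) Clip the ratio so that a single update cannot move the policy too far
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # b) Weight by the relative advantage and keep the more pessimistic value
        terms.append(min(ratio * adv, clipped * adv))
    # Negate because optimisers minimise, while we want to maximise the objective
    return -sum(terms) / len(terms)

# Hypothetical log-probabilities of two responses and their group-relative advantages
loss = clipped_surrogate_loss(
    new_logprobs=[-1.0, -2.0],
    old_logprobs=[-1.2, -1.8],
    advantages=[1.0, -1.0],
)
print(loss)
```

In a real training loop the log-probabilities come from the model, and an autodiff framework back-propagates this scalar to the parameters.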

2. Chain of thought (CoT)

Traditional LLM training follows pre-training → SFT → RL. However, DeepSeek-R1-Zero skipped SFT, allowing the model to directly explore CoT reasoning.

Like humans thinking through a tough question, CoT enables models to break problems into intermediate steps, boosting complex reasoning capabilities. OpenAI’s o1 model also leverages this, as noted in its September 2024 report: o1’s performance improves with more RL (train-time compute) and more reasoning time (test-time compute).

DeepSeek-R1-Zero exhibited reflective tendencies, autonomously refining its reasoning.

A key graph (below) in the paper showed increased thinking during training, leading to longer (more tokens), more detailed and better responses.

Without explicit programming, it began revisiting past reasoning steps, improving accuracy. This highlights chain-of-thought reasoning as an emergent property of RL training.

The model also had an “aha moment” (below) — a fascinating example of how RL can lead to unexpected and sophisticated outcomes.

Note: Unlike DeepSeek-R1, OpenAI does not show the full reasoning chains of thought in o1, as it is concerned about distillation risk: someone could imitate those reasoning traces and recover much of the reasoning performance just by imitating them. Instead, o1 shows only summaries of these chains of thought.

Reinforcement Learning from Human Feedback (RLHF)

For tasks with verifiable outputs (e.g., math problems, factual Q&A), AI responses can be easily evaluated. But what about areas like summarisation or creative writing, where there’s no single “correct” answer?

This is where human feedback comes in, but naive RL approaches are unscalable.

Let’s look at the naive approach with some arbitrary numbers.

That’s one billion human evaluations needed! This is too costly, slow and unscalable. Hence, a smarter solution is to train an AI “reward model” to learn human preferences, dramatically reducing human effort.
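The figure with the original numbers is not reproduced here, so the values below are an assumption: one illustrative combination that multiplies out to the one-billion total quoted above.

```python
# Illustrative numbers only (assumed, not taken from the article's figure)
updates = 1_000              # RL training updates
prompts_per_update = 1_000   # prompts used in each update
rollouts_per_prompt = 1_000  # responses sampled per prompt, each needing a human score

total_evaluations = updates * prompts_per_update * rollouts_per_prompt
print(f"{total_evaluations:,}")  # 1,000,000,000
```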

Ranking responses is also easier and more intuitive than absolute scoring.
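One common way to turn such rankings into a training signal for the reward model is a pairwise (Bradley-Terry style) loss, which pushes the score of the preferred response above the score of the rejected one. The article does not specify a loss, so treat this as an assumed, standard choice. The function name is hypothetical.

```python
import math

def pairwise_preference_loss(score_preferred, score_rejected):
    """-log(sigmoid(score_preferred - score_rejected)), written as log(1 + e^-d)."""
    diff = score_preferred - score_rejected
    return math.log1p(math.exp(-diff))

# Hypothetical reward-model scores for two responses to the same prompt
print(pairwise_preference_loss(2.0, 0.0))  # small loss: the ranking is respected
print(pairwise_preference_loss(0.0, 2.0))  # larger loss: the ranking is violated
```

Only the difference between the two scores matters, which is why human labellers need to provide an ordering rather than an absolute score.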

Upsides of RLHF

Can be applied to any domain, including creative writing, poetry, summarisation, and other open-ended tasks.
Ranking outputs is much easier for human labellers than generating creative outputs themselves.

Downsides of RLHF

The reward model is an approximation — it may not perfectly reflect human preferences.
RL is good at gaming the reward model — if run for too long, the model might exploit loopholes, generating nonsensical outputs that still get high scores.

Do note that RLHF is not the same as traditional RL.

For empirical, verifiable domains (e.g. math, coding), RL can run indefinitely and discover novel strategies. RLHF, on the other hand, is more like a fine-tuning step to align models with human preferences.

Conclusion

And that’s a wrap! I hope you enjoyed Part 2 🙂 If you haven’t already read Part 1 — do check it out here.

Got questions or ideas for what I should cover next? Drop them in the comments — I’d love to hear your thoughts. See you in the next article!

Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy, bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

How AWS is reinventing the telco revenue model

Consider what that means for the mobile operator and its relationship with its customers. Instead of selling a generic 5G pipe with a static SLA, a telco can now sell a dynamic, guaranteed slice for a specific use case—say, a remote robotic surgery setup or a high-density, low-latency industrial IoT

What’s the biggest barrier to AI success?

AI’s challenge starts with definition. We hear all the time about how AI raises productivity, and many have experienced that themselves. But what, exactly, does “productivity” mean? To the average person, it means they can do things with less effort, which they like, so it generates a lot of favorable

IBM proposes unified architecture for hybrid quantum-classical computing

Quantum computers and classical HPC are traditionally “disparate systems [that] operate in isolation,” IBM researchers explain in a new paper. This can be “cumbersome,” because users have to manually orchestrate workflows, coordinate scheduling, and transfer data between systems, thus hindering productivity and “severely” limiting algorithmic exploration. But a hybrid approach

FluidCloud’s Large Infrastructure Model targets the multicloud networking gap

“It’s a mixture of multiple models,” Omar told Network World. “The conversion and the core capability are not an LLM; it’s our own conditional model.” A standard LLM sits at the front end to parse user intent. The Terraform generation and cloud-to-cloud conversion work runs on custom foundation models trained

Energy Department Announces $1.9B Investment in Critical Grid Infrastructure to Reduce Electricity Costs

WASHINGTON—The U.S. Department of Energy’s Office of Electricity (OE) today announced an approximately $1.9 billion funding opportunity to accelerate urgently needed upgrades to the nation’s power grid. These investments will meet rising electricity demand and resource adequacy needs, while lowering electricity costs for American households and businesses. Projects selected through the Speed to Power through Accelerated Reconductoring and other Key Advanced Transmission Technology Upgrades (SPARK) funding opportunity will deliver fast and durable upgrades to the grid with real results. In line with President Trump’s Executive Order, Unleashing American Energy, selected projects will demonstrate how reconductoring—replacing existing power lines with higher‑capacity conductors—paired with other Advanced Transmission Technologies (ATTs) can expand grid capacity, increase operational efficiency, lower prices for consumers, and improve overall system reliability and security of the nation’s electric grid. “For too long, important grid modernization and energy addition efforts were not prioritized by past leaders,” said U.S. Secretary of Energy Chris Wright. “Thanks to President Trump, we are doing the important work of modernizing our grid so electricity costs will be lowered for American families and businesses.” “The United States must increase grid capacity to meet demand, and ensure the grid provides reliable power—day-in and day-out,” said OE Assistant Secretary Katie Jereza. “Through this SPARK funding opportunity, we will stabilize and optimize grid operations to strengthen it for rapid growth.” The SPARK opportunity builds on the Grid Resilience and Innovation Partnerships (GRIP) Program, which provided up to $10.5 billion in competitive funding over five years to states, tribes, electric utilities, and other eligible recipients to strengthen grid resilience and innovation. The previous two GRIP funding rounds covered FY 2022-2023 and FY 2023-2024 funding. 
Today’s announcement continues the mission of the GRIP Program under the SPARK funding opportunity, focusing on the rapid deployment of reconductoring and other ATTs that expand transfer capability, strengthen reliability

United States to Release 172 Million Barrels of Oil From the Strategic Petroleum Reserve

WASHINGTON—U.S. Secretary of Energy Chris Wright released the following statement regarding the International Energy Agency (IEA) and the U.S. Strategic Petroleum Reserve (SPR): “Earlier today, 32 member nations of the International Energy Agency unanimously agreed to President Trump’s request to lower energy prices with a coordinated release of 400 million barrels of oil and refined products from their respective reserves. “As part of this effort, President Trump authorized the Department of Energy to release 172 million barrels from the Strategic Petroleum Reserve, beginning next week. This will take approximately 120 days to deliver based on planned discharge rates. “President Trump promised to protect America’s energy security by managing the Strategic Petroleum Reserve responsibly and this action demonstrates his commitment to that promise. Unlike the previous administration, which left America’s oil reserves drained and damaged, the United States has arranged to more than replace these strategic reserves with approximately 200 million barrels within the next year—20% more barrels than will be drawn down—and at no cost to the taxpayer. “For 47 years, Iran and its terrorist proxies have been intent on killing Americans. They have manipulated and threatened the energy security of America and its allies. Under President Trump, those days are coming to an end. “Rest assured, America’s energy security is as strong as ever.” ###

Occidental Petroleum, 1PointFive STRATOS DAC plant nears startup in Texas Permian basin

Occidental Petroleum Corp. and its subsidiary 1PointFive expect Phase 1 of the STRATOS direct air capture (DAC) plant in Texas’ Permian basin to come online in this year’s second quarter. In a post to LinkedIn, 1PointFive said Phase 1 “is in the final stage of startup” and that Phase 2, which incorporates learnings from research and development and Phase 1 construction activities, “will also begin commissioning in Q2, with operational ramp-up continuing through the rest of the year.” Once fully operational, STRATOS is designed to capture up to 500,000 tonnes/year (tpy) of CO2. As part of the US Environmental Protection Agency (EPA) Class VI permitting process and approval, it was reported that STRATOS is expected to include three wells to store about 722,000 tpy of CO2 in saline formations at a depth of about 4,400 ft. The company said a few activities before start-up remain, including ramping up remaining pellet reactors, completing calciner final commissioning in parallel, and beginning CO2 injection. Start-up milestones achieved include: Completed wet commissioning with water circulation. Received Class VI permits to sequester CO2. Ran CO2 compression system at design pressure. Added potassium hydroxide (KOH) to capture CO2 from the atmosphere. Building pellet inventory. Burners tested on calciner.

Brava Energia weighs Phase 3 at Atlanta to extend production plateau

Just 2 months after bringing its flagship Atlanta field onstream with the new FPSO Atlanta, Brazil’s independent operator Brava Energia SA is evaluating a potential third development phase that could add roughly 25 million bbl of reserves and help sustain peak production longer than originally planned. The Phase 3 project, still at an early technical and economic evaluation stage, focuses on the Atlanta Nordeste area; a separate, shallower reservoir discovered in 2006 by Shell’s 9-SHEL-19D-RJS well. According to André Fagundes, vice-president of research (Brazil) at Welligence Energy Analytics, Phase 2 has four wells still to be developed: two expected in 2027 and two in 2029. Phase 3 would involve drilling two additional wells in 2031, bringing total development to 12 producing wells. Until recently, full-field development was understood to comprise 10 wells, but Brava has since updated guidance to reflect a 12-well development concept. Atlanta field upside The primary objective is clear. “We believe its main objective is to extend the production plateau,” Fagundes said. Welligence estimates incremental recovery could reach 25 MMbbl, increasing the field’s overall recovery factor by roughly 1.5%. Lying outside Atlanta’s main Cretaceous reservoir, Atlanta Nordeste represents a genuine upside opportunity, Fagundes explained. The field benefits from strong natural aquifer support, and no water or gas injection is anticipated. Water-handling constraints that affected early production using the Petrojarl I—limited to 11,500 b/d of water treatment—are no longer a bottleneck. FPSO Atlanta can process up to 140,000 b/d of water. Reservoir performance to date has been solid, albeit with difficulties. Recurrent electric submersible pump (ESP) failures and processing limits on the previous FPSO complicated full validation of original reservoir models. With the new 50,000-b/d FPSO in operation since late 2024, reservoir deliverability has become the main constraint. 
Phase 3 wells would also use ESPs and require additional subsea

California Resources eyes ‘measured’ capex ramp on way to 12% production growth thanks to Berry buy

The leaders of California Resources Corp., Long Beach, plan to have the company’s total production average 152,000-157,000 boe/d in 2026, with each quarter expected to be in that range. That output would equate to an increase of more than 12% from the operator’s 137,000 boe/d during fourth-quarter 2025, due mostly to the mid-December acquisition of Berry Corp. Fourth-quarter results folded in 14 days of Berry production and included 109,000 b/d of oil, with the company’s assets in the San Joaquin and Los Angeles basins accounting for 99,000 b/d of that total. 
The company drilled 31 new wells during the quarter and 76 in all of 2025, all in the San Joaquin, but that number will grow significantly to about 260 this year as state officials have resumed issuing permits following the passage last fall of a bill focused on Kern County production. Speaking to analysts after CRC reported fourth-quarter net income of $12 million on $924 million in revenues, president and chief executive officer Francisco Leon and chief financial officer Clio Crespy said the goal is to manage 2026 output decline to roughly 0.5% per quarter while operating four rigs and

Petro-Victory Energy spuds São João well in Brazil

Petro-Victory Energy Corp. has spudded the SJ‑12 well at São João field in Barreirinhas basin, on the Brazilian equatorial margin, Maranhão. Drilling and testing SJ‑12 is aimed at proving enough gas can be produced to sell locally. The well forms part of the single non‑associated gas well commitment under a memorandum of understanding signed in 2024 with Enava. São João contains 50.1 bcf (1.4 billion cu m) non‑associated gas resources. Petro‑Victory 100% owns and operates São João field.

Arista targets AI data centers with new liquid cooled pluggable optic module

To prove their point, the authors imagined a 400 MW AI datacenter with 1024 GPU racks of 128 GPUs each for a total of 128,000 GPUs. “Assume 12.8T scale-up and 1.6T scale-out bandwidth per GPU. With OSFP switch racks that have a density of 1.6 Pbps per rack, this would require more than 1,400 switch racks for scale-up and scale-out fabrics. With XPO, this would require 75% fewer racks, saving over 1,050 racks or 44% of the floor space,” Bechtolsheim and Vusirikala stated in the blog. “Eliminating 75% of switch racks translates to massive reductions in construction and infrastructure costs, including power distribution, plumbing and installation costs, while accelerating deployment timelines,” Bechtolsheim and Vusirikala stated. Arista said the liquid-cooling capability of XPO is also an important feature. “All large AI data centers will be liquid cooled and the switches that go into these data centers also need to be liquid cooled,” Bechtolsheim and Vusirikala stated. “While one can add liquid cooled cold plates on flat-top OSFP modules, this does not substantially improve thermal performance.” XPO solves this problem by integrating a liquid cold plate inside the module, with two 32-channel paddle cards sharing the common cold plate, which can cool both low-power and high-power optics such as 8x1600G-ZR/ZR+ with up to 400W of power, Bechtolsheim and Vusirikala stated. XPO modules are much simpler than OSFP modules, which improves reliability as well. “Each 32-channel paddle card has only one microcontroller and one set of voltage converters, a 75% reduction in common components versus 4 OSFPs,” Bechtolsheim and Vusirikala wrote.

Cisco grows high-end optical support for AI clusters

Cisco has also upgraded its Network Convergence System (NCS) with a 1RU, 800GE line card offering 12.8T capacity, with 32 OSFP-based ports for 100GE, 400GE, and 800GE clients and 800ZR/ZR+ WDM trunks. The NCS 1014 doubles the density of previous-generation NCS versions and now includes MACsec encryption (IEEE 802.1AE) to secure point-to-point links with hardware-based encryption, data integrity, and authentication for Ethernet traffic, Ghioni stated. It offers enhanced capacity and performance with C&L-band support, and NCS 1014 systems with the 2.4T WDM line card based on the Coherent Interconnect Module 8 now support 800GE clients, which can be mapped directly to a wavelength or inverse multiplexed across two wavelengths to maximize reach, Ghioni wrote. In the pluggable optic arena, Cisco is now offering a Quad Small Form Factor Pluggable Double Density (QSFP-DD) Pluggable Protection Switch Module that can monitor the optical link and, if it detects a fault, switch traffic in less than 50 milliseconds. The module occupies a quarter of the rack space compared to traditional protection devices, offering 90% rack space saving over available options, Ghioni wrote. It is aimed at Metro and DCI network customers where sub-50 ms failure recovery is essential and data centers needing fiber protection without bulky hardware, Ghioni stated. Cisco also added its Acacia-developed Bright QSFP28 100ZR 0 dBm coherent optical pluggable in a standard QSFP28 form factor. It is aimed at edge, access, enterprise, and campus network deployment. Cisco has been actively growing its optical portfolio, recently adding the Cisco Silicon One G300, which powers 102.4T N9000 and Cisco 8000 systems, as well as advanced 1.6T OSFP optics and 800G Linear Pluggable Optics.

Datalec targets rapid infrastructure deployment with new modular data centers

“We are engineering the data center with a new lens bringing pre-engineered system designs that are flexible and adaptable that enables a tailored solution for clients,” said John Lever, director of modular solutions at Datalec. The systems are flexible enough to cater for all types of data center, from standard server technology to AI and high-density compute. Datalec also provides “bolt-on” solutions, such as a ‘digital wrapper’ that includes digital twinning and lifecycle and global support, Lever says. Another way Datalec says it differentiates from competing modular designs is that a larger share of work is done offsite in a controlled manufacturing environment, which cuts onsite construction time, improves safety and limits disruption to live facilities, Lever says. The company competes with other modular data center vendors including Schneider Electric, Vertiv, Flex and many others. DPI says its services are aimed at colocation providers, hyperscale and AI infrastructure teams, and large enterprises that need to add capacity quickly, safely and cost effectively across multiple regions.

Study finds significant savings from direct current power for AI workloads

The result is a 50% to 80% reduction in copper usage, due to fewer conductors and less parallel cabling, and an 8% to 12% reduction in annual energy-related OpEx through lower conversion and distribution losses. By reducing conductor count, cabling, and redundant power components, 800VDC enables meaningful savings at both build-out and operational stages. AI-first facilities can see $4 million to $8 million in CapEx savings per 10 MW build by reducing upstream AC. For a one-gigawatt data center, you’re saving a couple million pounds of copper wire, he said. Burke says an all-DC data center is best done with a whole new facility rather than retrofitting old facilities. “[DC] is going to be in a lot of greenfield data centers that are going to be built, and data centers that are going to go to higher compute power are also going to DC,” he said. He did recommend all-DC retrofits for existing data centers that are going to employ high power computing with GPUs. Enteligent’s unnamed and as yet unreleased product is a converter that takes 800 volts and partitions it to 50 volts for the computing servers. The company will provide a new power supply, a power shelf that converts 800 volts DC to 50 volts DC much more efficiently than any current power supplies. Burke said the company is doing NDA-level testing and pilot programs now with its product, but it will be making a formal announcement within the next few weeks. There are a number of players in the DC arena focusing on different parts of the power supply market including Vertiv, Rutherford, Siemens, Eaton and many more.

Cisco blends Splunk analytics, security with core data center management

With the integration, data center teams can gather and act on events, alarms, health scores, and inventory through open APIs, Cisco stated. It also offers pre-built and customizable dashboards for inventory, health, fabric state, anomalies, and advisories, and it correlates telemetry across fabrics and technology tiers for actionable insights, according to Cisco. “This isn’t just another connector or API call. This is an embedded, architectural integration designed to transform how you monitor, troubleshoot, and secure your data center fabric. By bringing the power of Splunk directly into the Data Center Networking environment, we are enabling teams to solve complex problems faster, maintain strict data sovereignty, and dramatically reduce operational costs,” wrote Usha Andra, senior product marketing leader, and Anant Shah, senior product manager, both with Cisco Data Center Networking, in a blog about the integration. “Traditionally, network monitoring involves a trade-off. You either send massive amounts of raw logs to a centralized data lake, incurring high ingress and storage costs. Or you rely on sampled data that misses critical microbursts and anomalies,” Andra and Shah wrote. “Native Splunk integration changes the paradigm by running Splunk capabilities directly within the Cisco Nexus Dashboard. This allows for the streaming of high-fidelity telemetry, including anomalies, advisories, and audit logs, directly to Splunk analytics.”

Execution, Power, and Public Trust: Rich Miller on 2026’s Data Center Reality and Why He Built Data Center Richness

DCF founder Rich Miller has spent much of his career explaining how the data center industry works. Now, with his latest venture, Data Center Richness, he’s also examining how the industry learns. That thread provided the opening for the latest episode of The DCF Show Podcast, where Miller joined present Data Center Frontier Editor in Chief Matt Vincent and Senior Editor David Chernicoff for a wide-ranging discussion that ultimately landed on a simple conclusion: after two years of unprecedented AI-driven announcements, 2026 will be the year reality asserts itself. Projects will either get built, or they won’t. Power will either materialize, or it won’t. Communities will either accept data center expansion – or they’ll stop it. In other words, the industry is entering its execution phase. Why Data Center Richness Matters Now Miller launched Data Center Richness as both a podcast and a Substack publication, an effort to experiment with formats and better understand how professionals now consume industry information. Podcasts have become a primary way many practitioners follow the business, while YouTube’s discovery advantages increasingly make video versions essential. At the same time, Miller remains committed to written analysis, using Substack as a venue for deeper dives and format experimentation. One example is his weekly newsletter distilling key industry developments into just a handful of essential links rather than overwhelming readers with volume. The approach reflects a broader recognition: the pace of change has accelerated so much that clarity matters more than quantity. The topic of how people learn about data centers isn’t separate from the industry’s trajectory; it’s becoming part of it. Public perception, regulatory scrutiny, and investor expectations are now shaped by how stories are told as much as by how facilities are built. That context sets the stage for the conversation’s core theme. Execution Defines 2026 After

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one that is ramping up its investments in AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs). In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple to devote $200 billion between them to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are way higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

John Deere unveils more autonomous farm machines to address skill labor shortage

Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. The Moline, Illinois-based John Deere has been in business for 187 years, yet it has become a regular as a non-tech company showing off technology at the big tech trade show in Las Vegas, and it is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech. The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually, and the agricultural work force continues to shrink. (This is my hint to the anti-immigration crowd). John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app. While each of these industries experiences its own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

2025 playbook for enterprise AI success, from agents to evals

2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year. 1. Agents: the next generation of automation AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for companies, and that recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more. The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks. Going all-in on red teaming pays practical, competitive dividends It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S.’s National Institute of Standards and Technology (NIST), which all had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find. 
What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle