
How LLMs Work: Reinforcement Learning, RLHF, DeepSeek R1, OpenAI o1, AlphaGo


Welcome to part 2 of my LLM deep dive. If you’ve not read Part 1, I highly encourage you to check it out first.

Previously, we covered the first two major stages of training an LLM:

  1. Pre-training — Learning from massive datasets to form a base model.
  2. Supervised fine-tuning (SFT) — Refining the model with curated examples to make it useful.

Now, we’re diving into the next major stage: Reinforcement Learning (RL). While pre-training and SFT are well-established, RL is still evolving but has become a critical part of the training pipeline.

This article draws heavily on Andrej Karpathy’s widely popular 3.5-hour YouTube deep dive. Andrej is a founding member of OpenAI, so his insights are gold — you get the idea.

Let’s go 🚀

What’s the purpose of reinforcement learning (RL)?

Humans and LLMs process information differently. What’s intuitive for us — like basic arithmetic — may not be for an LLM, which only sees text as sequences of tokens. Conversely, an LLM can generate expert-level responses on complex topics simply because it has seen enough examples during training.

This difference in cognition makes it challenging for human annotators to provide the “perfect” set of labels that consistently guide an LLM toward the right answer.

RL bridges this gap by allowing the model to learn from its own experience.

Instead of relying solely on explicit labels, the model explores different token sequences and receives feedback — reward signals — on which outputs are most useful. Over time, it learns to align better with human intent.

Intuition behind RL

LLMs are stochastic — meaning their responses aren’t fixed. Even with the same prompt, the output varies because it’s sampled from a probability distribution.

We can harness this randomness by generating thousands or even millions of possible responses in parallel. Think of it as the model exploring different paths — some good, some bad. Our goal is to encourage it to take the better paths more often.

To do this, we train the model on the sequences of tokens that lead to better outcomes. Unlike supervised fine-tuning, where human experts provide labeled data, reinforcement learning allows the model to learn from itself.

The model discovers which responses work best, and after each training step, we update its parameters. Over time, this makes the model more likely to produce high-quality answers when given similar prompts in the future.
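To make that loop concrete, here is a toy sketch in Python. This is not how a real LLM is updated (real training adjusts network parameters with gradients); the canned responses, the keyword-based reward, and the weight-bumping update are all illustrative stand-ins, but they capture the explore-and-reinforce cycle described above.

```python
import random
from collections import defaultdict

# Toy illustration of the idea: sample many responses, reward the good ones,
# and shift probability mass toward them. The canned responses and the
# keyword-based reward are stand-ins, not a real LLM or reward model.
responses = ["I don't know", "The answer is 4", "It is four", "Banana"]
weights = defaultdict(lambda: 1.0)   # unnormalised "policy" over responses

def sample_response():
    total = sum(weights[r] for r in responses)
    return random.choices(responses, [weights[r] / total for r in responses])[0]

def reward(response):
    return 1.0 if "4" in response or "four" in response else 0.0

for step in range(200):              # many rounds of explore and reinforce
    samples = [sample_response() for _ in range(8)]
    for s in samples:
        weights[s] += 0.1 * reward(s)    # make rewarded responses more likely

print(max(responses, key=lambda r: weights[r]))   # ends up preferring a rewarded answer
```

After enough rounds, the rewarded answers dominate the sampling distribution, which is the same shift we want the LLM's parameters to undergo.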

But how do we determine which responses are best? And how much RL should we do? The details are tricky, and getting them right is not trivial.

RL is not “new” — It can surpass human expertise (AlphaGo, 2016)

A great example of RL’s power is DeepMind’s AlphaGo, the first AI to defeat a professional Go player and later surpass human-level play.

In the 2016 Nature paper (graph below), a model trained purely with SFT (giving it tons of good examples to imitate) was able to reach human-level performance, but never surpass it.

The dotted line represents Lee Sedol’s performance — the best Go player in the world.

This is because SFT is about replication, not innovation — it doesn’t allow the model to discover new strategies beyond human knowledge.

However, RL enabled AlphaGo to play against itself, refine its strategies, and ultimately exceed human expertise (blue line).

Image taken from AlphaGo 2016 paper

RL represents an exciting frontier in AI — models can explore strategies beyond human imagination when we train them on a diverse and challenging pool of problems that refine their thinking.

RL foundations recap

Let’s quickly recap the key components of a typical RL setup:

Image by author
  • Agent — The learner or decision maker. It observes the current situation (state), chooses an action, and then updates its behaviour based on the outcome (reward).
  • Environment — The external system in which the agent operates.
  • State — A snapshot of the environment at a given step t.

At each timestamp, the agent performs an action in the environment that will change the environment’s state to a new one. The agent will also receive feedback indicating how good or bad the action was.

This feedback is called a reward, and is represented in a numerical form. A positive reward encourages that behaviour, and a negative reward discourages it.

By using feedback from different states and actions, the agent gradually learns the optimal strategy to maximise the total reward over time.
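Here is what that agent-environment loop looks like as code. This is a minimal, self-contained sketch (a tiny made-up environment and a random policy), not any particular RL library's API; it only exists to show where state, action, and reward appear in the cycle.

```python
import random

class LineWorld:
    """Tiny made-up environment: the agent starts at position 0 and gets a
    reward of 1 for reaching +3. Reaching -3 ends the episode with no reward."""
    def reset(self):
        self.position = 0
        return self.position                  # the initial state

    def step(self, action):                   # action: -1 (step left) or +1 (step right)
        self.position += action
        reward = 1.0 if self.position == 3 else 0.0
        done = self.position in (-3, 3)
        return self.position, reward, done    # new state, reward, episode finished?

env = LineWorld()
state = env.reset()
done = False
while not done:
    action = random.choice([-1, 1])           # a (very naive) policy
    state, reward, done = env.step(action)
    # A learning agent would use (state, action, reward, next state) here
    # to update its policy toward actions that earn more reward.
```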

Policy

The policy is the agent’s strategy. If the agent follows a good policy, it will consistently make good decisions, leading to higher rewards over many steps.

In mathematical terms, it is a function that determines the probability of different outputs for a given state — (πθ(a|s)).

Value function

An estimate of how good it is to be in a certain state, considering the long-term expected reward. For an LLM, the reward might come from human feedback or a reward model. 

Actor-Critic architecture

The actor-critic architecture is a popular RL setup that combines two components:

  1. Actor — Learns and updates the policy (πθ), deciding which action to take in each state.
  2. Critic — Evaluates the value function (V(s)) to give feedback to the actor on whether its chosen actions are leading to good outcomes. 

How it works:

  • The actor picks an action based on its current policy.
  • The critic evaluates the outcome (reward + next state) and updates its value estimate.
  • The critic’s feedback helps the actor refine its policy so that future actions lead to higher rewards.
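The sketch below spells out one such update in code, using tables (dictionaries) instead of neural networks so it stays self-contained. The learning rates, the softmax actor, and the TD-error-based critic are standard textbook choices rather than details from any specific paper: the point is simply that the critic's error signal is what nudges the actor's policy.

```python
import math

# Minimal tabular actor-critic update (illustrative; real systems use neural nets).
actions = [0, 1]
preferences = {}                     # actor's parameters: preference per (state, action)
values = {}                          # critic's estimate V(s) per state
alpha_actor, alpha_critic, gamma = 0.1, 0.1, 0.99

def policy_probs(state):
    """Actor: softmax over action preferences for this state."""
    exps = [math.exp(preferences.get((state, a), 0.0)) for a in actions]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_update(state, action, reward, next_state, done):
    # Critic: TD error = how much better or worse things went than expected.
    v_next = 0.0 if done else values.get(next_state, 0.0)
    td_error = reward + gamma * v_next - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + alpha_critic * td_error
    # Actor: make the taken action more likely if the critic was pleasantly
    # surprised (td_error > 0), and less likely otherwise.
    probs = policy_probs(state)
    for a in actions:
        grad = (1.0 if a == action else 0.0) - probs[a]
        preferences[(state, a)] = preferences.get((state, a), 0.0) + alpha_actor * td_error * grad

# Example: one transition where the chosen action earned a reward of 1.
actor_critic_update(state=0, action=1, reward=1.0, next_state=1, done=False)
```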

Putting it all together for LLMs

The state can be the current text (prompt or conversation), and the action can be the next token to generate. A reward model (e.g. human feedback) tells the model how good or bad its generated text is. 

The policy is the model’s strategy for picking the next token, while the value function estimates how beneficial the current text context is, in terms of eventually producing high quality responses.
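Written as code, the mapping looks like this. The uniform "policy" and keyword-based "reward model" below are trivial placeholders (a real LLM would produce the next-token distribution and a trained reward model would do the scoring); the sketch only shows where state, action, policy, and reward live during generation.

```python
import random

vocab = ["The", "answer", "is", "4", "banana", "<eos>"]

def policy_next_token_probs(state_tokens):
    # Placeholder for the LLM policy: a real model would condition on the state.
    return {token: 1.0 / len(vocab) for token in vocab}

def reward_model_score(prompt_tokens, response_tokens):
    # Placeholder for a reward model trained on human feedback.
    return 1.0 if "4" in response_tokens else 0.0

prompt = ["What", "is", "2", "+", "2", "?"]
state = list(prompt)                          # state: all text so far
response = []
while len(response) < 10:
    probs = policy_next_token_probs(state)    # policy: distribution over next tokens
    action = random.choices(list(probs), list(probs.values()))[0]   # action: next token
    if action == "<eos>":
        break
    state.append(action)                      # taking the action changes the state
    response.append(action)

print(response, reward_model_score(prompt, response))   # reward arrives for the whole response
```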

DeepSeek-R1 (published 22 Jan 2025)

To highlight RL’s importance, let’s explore DeepSeek-R1, a reasoning model that achieves top-tier performance while remaining open-source. The paper introduced two models: DeepSeek-R1-Zero and DeepSeek-R1.

  • DeepSeek-R1-Zero was trained solely via large-scale RL, skipping supervised fine-tuning (SFT).
  • DeepSeek-R1 builds on that work, addressing the challenges it encountered.

Let’s dive into some of these key points. 

1. RL algo: Group Relative Policy Optimisation (GRPO)

One key game-changing RL algorithm is Group Relative Policy Optimisation (GRPO), a variant of the widely popular Proximal Policy Optimisation (PPO). GRPO was introduced in the DeepSeekMath paper in Feb 2024. 

Why GRPO over PPO?

PPO struggles with reasoning tasks due to:

  1. Dependency on a critic model.
    PPO needs a separate critic model, effectively doubling memory and compute.
    Training the critic can be complex for nuanced or subjective tasks.
  2. High computational cost as RL pipelines demand substantial resources to evaluate and optimise responses. 
  3. Absolute reward evaluations
    When you rely on an absolute reward — meaning there’s a single standard or metric to judge whether an answer is “good” or “bad” — it can be hard to capture the nuances of open-ended, diverse tasks across different reasoning domains. 

How GRPO addressed these challenges:

GRPO eliminates the critic model by using relative evaluation — responses are compared within a group rather than judged by a fixed standard.

Imagine students solving a problem. Instead of a teacher grading them individually, they compare answers, learning from each other. Over time, performance converges toward higher quality.
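In code, the "compare within a group" step boils down to a few lines: score each response to the same prompt, then express each reward relative to the group. The mean-and-standard-deviation normalisation below follows the description in the DeepSeekMath/GRPO papers, but treat it as a simplified sketch rather than the full objective.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Turn raw rewards for a group of responses (to the same prompt) into
    relative advantages: above-average responses get a positive advantage,
    below-average ones a negative advantage. Simplified GRPO-style normalisation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four candidate answers to one prompt, scored by some reward signal.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))   # winners > 0, losers < 0
```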

How does GRPO fit into the whole training process?

GRPO modifies how loss is calculated while keeping other training steps unchanged:

  1. Gather data (queries + responses)
    – For LLMs, queries are like questions
    – The old policy (older snapshot of the model) generates several candidate answers for each query
  2. Assign rewards — each response in the group is scored (the “reward”).
  3. Compute the GRPO loss (a simplified sketch follows this list)
    Traditionally, you’ll compute a loss — which shows the deviation between the model prediction and the true label.
    In GRPO, however, you measure:
    a) How likely is the new policy to produce past responses?
    b) Are those responses relatively better or worse?
    c) Apply clipping to prevent extreme updates.
    This yields a scalar loss.
  4. Back propagation + gradient descent
    – Back propagation calculates how each parameter contributed to loss
    – Gradient descent updates those parameters to reduce the loss
    – Over many iterations, this gradually shifts the new policy to prefer higher reward responses
  5. Update the old policy occasionally to match the new policy.
    This refreshes the baseline for the next round of comparisons.
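Putting steps 2 to 4 together, a stripped-down version of that loss might look like the sketch below. It keeps the PPO-style probability ratio and clipping, applied to the group-relative advantages from earlier; the real GRPO objective works token by token and adds a KL penalty against a reference model, both of which are omitted here for brevity.

```python
import math

def grpo_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """Simplified GRPO-style surrogate loss for one group of responses.
    logprobs_new / logprobs_old: log-probability of each response under the
    current and old policy. advantages: group-relative advantages.
    (Per-token weighting and the KL penalty are omitted in this sketch.)"""
    total = 0.0
    for lp_new, lp_old, adv in zip(logprobs_new, logprobs_old, advantages):
        ratio = math.exp(lp_new - lp_old)              # how much likelier under the new policy
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
        # Take the more pessimistic of the clipped / unclipped objectives.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)                    # negate: we minimise the loss

# Example: two responses, the first relatively better than the second.
print(grpo_loss([-4.8, -5.3], [-5.0, -5.0], [1.0, -1.0]))
```

Minimising this loss pushes the new policy to raise the probability of relatively good responses and lower it for relatively bad ones, while the clipping keeps each update small.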

2. Chain of thought (CoT)

Traditional LLM training follows pre-training → SFT → RL. However, DeepSeek-R1-Zero skipped SFT, allowing the model to directly explore CoT reasoning.

Like humans thinking through a tough question, CoT enables models to break problems into intermediate steps, boosting complex reasoning capabilities. OpenAI’s o1 model also leverages this, as noted in its September 2024 report: o1’s performance improves with more RL (train-time compute) and more reasoning time (test-time compute).

DeepSeek-R1-Zero exhibited reflective tendencies, autonomously refining its reasoning. 

A key graph (below) in the paper shows that the amount of thinking increased over the course of training, leading to longer (more tokens), more detailed, and better responses.

Image taken from DeepSeek-R1 paper

Without explicit programming, it began revisiting past reasoning steps, improving accuracy. This highlights chain-of-thought reasoning as an emergent property of RL training.

The model also had an “aha moment” (below) — a fascinating example of how RL can lead to unexpected and sophisticated outcomes.

Image taken from DeepSeek-R1 paper

Note: Unlike DeepSeek-R1, OpenAI does not show the full reasoning chains of thought in o1, as it is concerned about distillation risk — someone could imitate those reasoning traces and recover much of the reasoning performance simply by copying them. Instead, o1 shows only summaries of these chains of thought.

Reinforcement Learning from Human Feedback (RLHF)

For tasks with verifiable outputs (e.g., math problems, factual Q&A), AI responses can be easily evaluated. But what about areas like summarisation or creative writing, where there’s no single “correct” answer? 

This is where human feedback comes in — but naïve RL approaches are unscalable.

Image by author

Let’s look at the naive approach with some arbitrary numbers.

Image by author
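The exact figures in the diagram aside, the explosion comes from multiplying a few factors together. The numbers below are purely illustrative placeholders chosen to land on the one-billion figure quoted next.

```python
prompts = 1_000                # distinct prompts used during RL training
responses_per_prompt = 1_000   # candidate responses sampled for each prompt
training_iterations = 1_000    # how many times the whole process is repeated
human_judgments = prompts * responses_per_prompt * training_iterations
print(f"{human_judgments:,}")  # 1,000,000,000 evaluations for humans to make
```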

That’s one billion human evaluations needed! This is too costly, slow and unscalable. Hence, a smarter solution is to train an AI “reward model” to learn human preferences, dramatically reducing human effort. 

Ranking responses is also easier and more intuitive than absolute scoring.

Image by author
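Concretely, a reward model is usually trained on those rankings with a pairwise (Bradley-Terry-style) loss: it should assign the preferred response a higher score than the rejected one. The snippet below shows only that loss term; the scores fed into it are hypothetical numbers standing in for a neural reward model's outputs.

```python
import math

def pairwise_reward_loss(score_preferred, score_rejected):
    """Bradley-Terry style loss for training reward models from rankings:
    the loss is small when the preferred response is scored higher than the
    rejected one by a comfortable margin."""
    return -math.log(1 / (1 + math.exp(-(score_preferred - score_rejected))))

# Hypothetical scores; in practice they come from the reward model's forward
# pass on (prompt, response) pairs labelled by human rankers.
print(pairwise_reward_loss(2.0, 0.5))   # preferred already scored higher: small loss
print(pairwise_reward_loss(0.5, 2.0))   # ranking violated: large loss, big gradient
```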

Upsides of RLHF

  • Can be applied to any domain, including creative writing, poetry, summarisation, and other open-ended tasks.
  • Ranking outputs is much easier for human labellers than generating creative outputs themselves.

Downsides of RLHF

  • The reward model is an approximation — it may not perfectly reflect human preferences.
  • RL is good at gaming the reward model — if run for too long, the model might exploit loopholes, generating nonsensical outputs that still get high scores.

Do note that RLHF is not the same as traditional RL.

For empirical, verifiable domains (e.g. math, coding), RL can run indefinitely and discover novel strategies. RLHF, on the other hand, is more like a fine-tuning step to align models with human preferences.

Conclusion

And that’s a wrap! I hope you enjoyed Part 2 🙂 If you haven’t already read Part 1 — do check it out here.

Got questions or ideas for what I should cover next? Drop them in the comments — I’d love to hear your thoughts. See you in the next article!
