
This Is How LLMs Break Down the Language


Do you remember the hype when OpenAI released GPT-3 in 2020? Though not the first in its series, GPT-3 gained widespread popularity due to its impressive text generation capabilities. Since then, a diverse group of Large Language Models (LLMs) has flooded the AI landscape. The golden question is: have you ever wondered how ChatGPT or any other LLM breaks down language? If you haven't yet, this article discusses the mechanism by which LLMs process the textual input given to them during training and inference. This process is called tokenization.

This article is inspired by the YouTube video titled Deep Dive into LLMs like ChatGPT by Andrej Karpathy, former Senior Director of AI at Tesla. His general-audience video series is highly recommended for anyone who wants to take a deep dive into the intricacies behind LLMs.

Before diving into the main topic, you'll need a basic understanding of the inner workings of an LLM. In the next section, I'll break down the internals of a language model and its underlying architecture. If you're already familiar with neural networks and LLMs in general, you can skip the next section without affecting your reading experience.

Internals of large language models

LLMs are made up of transformer neural networks. Consider neural networks as giant mathematical expressions. Inputs to neural networks are a sequence of tokens that are typically processed through embedding layers, which convert the tokens into numerical representations. For now, think of tokens as basic units of input data, such as words, phrases, or characters. In the next section, we’ll explore how to create tokens from input text data in depth. When we feed these inputs to the network, they are mixed into a giant mathematical expression along with the parameters or weights of these neural networks.

Modern neural networks have billions of parameters. At the beginning, these parameters or weights are set randomly. Therefore, the neural network randomly guesses its predictions. During the training process, we iteratively update these weights so that the outputs of our neural network become consistent with the patterns observed in our training set. In a sense, neural network training is about finding the right set of weights that seem to be consistent with the statistics of the training set.

The transformer architecture was introduced in the paper titled “Attention is All You Need” by Vaswani et al. in 2017. This is a neural network with a special kind of structure designed for sequence processing. Initially intended for Neural Machine Translation, it has since become the founding building block for LLMs.

To get a sense of what production-grade transformer neural networks look like, visit https://bbycroft.net/llm. This site provides interactive 3D visualizations of generative pre-trained transformer (GPT) architectures and guides you through their inference process.

Visualization of Nano-GPT at https://bbycroft.net/llm (Image by the author)

This particular architecture, called Nano-GPT, has 85,584 parameters. We feed the inputs, which are token sequences, at the top of the network. Information then flows through the layers of the network, where the input undergoes a series of transformations, including attention mechanisms and feed-forward networks, to produce an output. The output is the model’s prediction for the next token in the sequence.
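Concretely, the model's output at each position is a vector of raw scores (logits), one per vocabulary entry; a softmax converts these into a probability distribution over the next token. A toy sketch with a made-up 4-token vocabulary (the logit values here are invented for illustration):

```python
import math

# hypothetical logits the network might emit over a toy 4-token vocabulary
logits = [2.0, 0.5, -1.0, 0.1]

# softmax: exponentiate each score, then normalize so they sum to 1
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

# greedy decoding simply picks the highest-probability token
next_token = probs.index(max(probs))
print(next_token)  # token 0 has the largest logit, so it wins
```

Real models do the same thing over a vocabulary of roughly 100,000 tokens, and usually sample from the distribution rather than always taking the argmax.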

Tokenization

Training a state-of-the-art language model like ChatGPT or Claude involves several stages arranged sequentially. In my previous article about hallucinations, I briefly explained the training pipeline for an LLM. If you want to learn more about training stages and hallucinations, you can read it here.

Now, imagine we’re at the initial stage of training, called pretraining. This stage requires a large, high-quality, web-scale dataset on the order of terabytes. The datasets used by major LLM providers are not publicly available. Therefore, we will look into an open-source dataset curated by Hugging Face, called FineWeb, distributed under the Open Data Commons Attribution License. You can read more about how they collected and created this dataset here.

FineWeb dataset curated by Hugging Face (Image by the author)

I downloaded a sample from the FineWeb dataset, selected the first 100 examples, and concatenated them into a single text file. This is just raw internet text with various patterns within it.

Sampled text from the FineWeb dataset (Image by the author)

So our goal is to feed this data to the transformer neural network so that the model learns the flow of this text. We need to train our neural network to mimic the text. Before plugging this text into the neural network, we must decide how to represent it. Neural networks expect a one-dimensional sequence of symbols. That requires a finite set of possible symbols. Therefore, we must determine what these symbols are and how to represent our data as a one-dimensional sequence of them.

What we have at this point is a one-dimensional sequence of text. Underlying this text is a representation as a sequence of raw bits: we can encode the original text with UTF-8 to obtain it. If you check the image below, you can see that the first 8 bits of the raw bit sequence correspond to the first letter ‘A’ of the original one-dimensional text sequence.
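This bit-level view is easy to reproduce yourself. A minimal Python sketch (the sample string below is a stand-in for the actual FineWeb text):

```python
# stand-in for the sampled FineWeb text
text = "A sample of raw internet text"

# UTF-8 encode the text, then render each byte as its 8-bit pattern
raw_bytes = text.encode("utf-8")
bits = "".join(f"{b:08b}" for b in raw_bytes)

# 'A' is Unicode code point 65, which UTF-8 stores as a single byte: 01000001
print(bits[:8])
```

The resulting string of 0s and 1s is exactly the "very long sequence with two symbols" described next.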

Sampled text, represented as a one-dimensional sequence of bits (Image by the author)

Now, we have a very long sequence with two symbols: zero and one. This is, in fact, what we were looking for — a one-dimensional sequence of symbols with a finite set of possible symbols. Now the problem is that sequence length is a precious resource in a neural network primarily because of computational efficiency, memory constraints, and the difficulty of processing long dependencies. Therefore, we don’t want extremely long sequences of just two symbols. We prefer shorter sequences of more symbols. So, we are going to trade off the number of symbols in our vocabulary against the resulting sequence length.

As we need to further compress or shorten our sequence, we can group every 8 consecutive bits into a single byte. Since each bit is either 0 or 1, there are exactly 256 possible combinations of 8-bit sequences. Thus, we can represent this sequence as a sequence of bytes instead.

Grouping bits to bytes (Image by the author)

This representation reduces the length by a factor of 8 while expanding the symbol set to 256 possibilities. Consequently, each value in the sequence falls within the range of 0 to 255.
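Continuing the sketch above, grouping every 8 consecutive bits back into one byte recovers a sequence of integers in 0–255 that is exactly 8 times shorter:

```python
# same stand-in text as before
text = "A sample of raw internet text"
bits = "".join(f"{b:08b}" for b in text.encode("utf-8"))

# group every 8 consecutive bits into one byte value in the range 0..255
byte_seq = [int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)]

print(byte_seq[:3])   # [65, 32, 115] -- the bytes for 'A', ' ', 's'
print(len(bits) // len(byte_seq))  # 8: the sequence is 8x shorter
```

This is the length-versus-vocabulary trade-off in miniature: fewer positions in the sequence, more distinct symbols at each position.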

Sampled text, represented as a one-dimensional sequence of bytes (Image by the author)

These numbers do not have any value in a numerical sense. They are just placeholders for unique identifiers or symbols. In fact, we could replace each of these numbers with a unique emoji and the core idea would still stand. Think of this as a sequence of emojis, each chosen from 256 unique options.

Sampled text, represented as a one-dimensional sequence of emojis (Image by the author)

This process of converting from raw text into symbols is called Tokenization. Tokenization in state-of-the-art language models goes even beyond this. We can further compress the length of the sequence in return for more symbols in our vocabulary using the Byte-Pair Encoding (BPE) algorithm. Initially developed for text compression, BPE is now widely used by transformer models for tokenization. OpenAI’s GPT series uses standard and customized versions of the BPE algorithm.

Essentially, byte pair encoding involves identifying frequent consecutive bytes or symbols. For example, consider our byte-level sequence of text.

Sequence 101, followed by 114, is quite frequent (Image by the author)

As you can see, the sequence 101 followed by 114 appears frequently. Therefore, we can replace this pair with a new symbol and assign it a unique identifier. We are going to rewrite every occurrence of 101 114 using this new symbol. This process can be repeated multiple times, with each iteration further shortening the sequence length while introducing additional symbols, thereby increasing the vocabulary size. Using this process, GPT-4 has come up with a token vocabulary of around 100,000.
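The core merge step is simple enough to sketch in a few lines of Python. This is an illustrative toy, not a production tokenizer (real implementations apply thousands of learned merges and handle ties and byte-level details more carefully); the short sequence below is invented so that the pair (101, 114) is the most frequent:

```python
from collections import Counter

def merge_most_frequent(seq, new_id):
    """Replace every occurrence of the most frequent adjacent pair with new_id."""
    pairs = Counter(zip(seq, seq[1:]))
    (a, b), _ = pairs.most_common(1)[0]
    out, i = [], 0
    while i < len(seq):
        if i < len(seq) - 1 and (seq[i], seq[i + 1]) == (a, b):
            out.append(new_id)  # the pair collapses into one new symbol
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out, (a, b)

# toy byte sequence in which (101, 114) appears three times
seq = [101, 114, 32, 101, 114, 101, 114]
merged, pair = merge_most_frequent(seq, new_id=256)
print(pair)    # (101, 114)
print(merged)  # [256, 32, 256, 256] -- shorter sequence, one extra symbol
```

Repeating this step, minting a fresh id each time (257, 258, ...), keeps shrinking the sequence while growing the vocabulary; GPT-4's tokenizer is, in essence, the result of roughly 100,000 such learned merges.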

We can further explore tokenization using Tiktokenizer. Tiktokenizer provides an interactive web-based graphical user interface where you can input text and see how it’s tokenized according to different models. Play with this tool to get an intuitive understanding of what these tokens look like.

For example, we can take the first four sentences of the text sequence and input them into the Tiktokenizer. From the dropdown menu, select the GPT-4 base model encoder: cl100k_base.

Tiktokenizer (Image by the author)

The colored text shows how the chunks of text correspond to the symbols. The following text, which is a sequence of length 51, is what GPT-4 will see at the end of the day.

11787, 499, 21815, 369, 90250, 763, 14689, 30, 7694, 1555, 279, 21542, 3770, 323, 499, 1253, 1120, 1518, 701, 4832, 2457, 13, 9359, 1124, 323, 6642, 264, 3449, 709, 3010, 18396, 13, 1226, 617, 9214, 315, 1023, 3697, 430, 1120, 649, 10379, 83, 3868, 311, 3449, 18570, 1120, 1093, 499, 0

We can now take our entire sample dataset and re-represent it as a sequence of tokens using the GPT-4 base model tokenizer, cl100k_base. Note that the original FineWeb dataset consists of a 15-trillion-token sequence, while our sample dataset contains only a few thousand tokens from the original dataset.

Sampled text, represented as a one-dimensional sequence of tokens (Image by the author)

Conclusion

Tokenization is a fundamental step in how LLMs process text, transforming raw text data into a structured format before it is fed into neural networks. As neural networks require a one-dimensional sequence of symbols, we need to strike a balance between sequence length and the number of symbols in the vocabulary, optimizing for efficient computation. Modern state-of-the-art transformer-based LLMs, including OpenAI’s GPT series, use Byte-Pair Encoding tokenization.

Breaking down tokenization helps demystify how LLMs interpret text inputs and generate coherent responses. An intuitive sense of what tokenization looks like helps in understanding the internal mechanisms behind the training and inference of LLMs. As LLMs are increasingly used as knowledge bases, a well-designed tokenization strategy is crucial for improving model efficiency and overall performance.

If you enjoyed this article, connect with me on X (formerly Twitter) for more insights.

References

Shape
Shape
Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy,  bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

Shape

Altera targets low-latency AI edge applications with new FPGA products

Support for Agilex 3 and other Agilex product lines is available through Altera’s free Quartus software suite. Quartus is a design software suite for programmable logic devices. It allows engineers to design, analyze, optimize, and program Intel FPGAs, CPLDs, and SoCs using system-level design techniques and advanced place-and-route algorithms. For

Read More »

Observe links end-user experience with back-end troubleshooting

Frontend Observability uses a capability called Browser Real User Monitoring (RUM) to enable IT and developer teams to quickly identify and diagnose performance issues across browsers, devices, and locations. For instance, RUM identifies anomalies in page load times, core web vitals, and JavaScript or HTTP errors. RUM also provides developers

Read More »

ServiceNow to pay $2.85B for Moveworks’ AI tools

ServiceNow and Moveworks will deliver a unified, end‑to‑end search and self‑service experience for all employee requestors across every workflow, according to ServiceNow. A majority of Moveworks’ current customer deployments already use ServiceNow in their environments to access enterprise AI, data, and workflows. ServiceNow said this acquisition will build upon the

Read More »

BP CEO, Defending His Reset, Says Company to Be More Focused

BP Plc Chief Executive Officer Murray Auchincloss appeared at a key US energy major conference in Houston on Tuesday, talking up his reset for the troubled British oil major. Under pressure from activist Elliott Investment Management,  a new strategy outlined in late February focused firmly on raising cash flow by focusing on oil and gas production and scaling back renewable investments — his plans have had a mixed reception. Speaking at CERAWeek by S&P Global, he argued for his vision for a leaner, more focused BP.  The company needs to “focus on fewer things, with higher returns,” he said. Auchincloss said he’s met 40% of shareholders over the past few weeks and they all seem “pretty satisfied.” He outlined plans to grow production in two key regions: the US and the Middle East. He’s bullish on prospects in the US Gulf, where BP has 10 billion barrels of resources to develop. He also talked up plans to work on Iraq’s Kirkuk oil field and expanding in Abu Dhabi. He said the company would remain in renewable energy markets because its giant trading business needed a steady supply of electrons, but by using partnerships, it could redirect capital toward oil and gas projects that offered higher rates of return. BP has already placed its offshore wind assets in a joint venture with Japanese utility JERA. Auchincloss said in Houston that it planned to bring in a partner for its Lightsource solar business. That would actually enable the business to expand its potential rate of development to 6 gigawatts to 8 gigawatts a year from around 5 gigwatts today. Since unveiling the strategy on Feb. 26, BP shares have fallen almost 6%, losing almost twice as much as larger UK rival Shell Plc. WHAT DO YOU THINK? Generated by readers, the comments included

Read More »

Amazon to expand its use of in-house AI tools to conserve water, reduce energy use

Dive Brief: AI-supported tools Amazon uses to monitor utility and building system performance in 120 of its sites globally will be expanded to more than 300 facilities by the end of the year, the company said today. We’re “innovating with AI to help us find new ways to decarbonize even faster, including inventing new solutions that … make our buildings more energy- and water-efficient,” Chief Sustainability Officer Kara Hurst said in a statement. Amazon is using the technology to analyze HVAC operational data and energy consumption, monitor refrigeration units, identify issues like leaks or clogged filters, with the goal of reaching net-zero carbon emissions across its operations by 2040, the company said. Dive Insight: The company developed the tools within its Amazon Web Services subsidiary. A tool the company calls FlowMS monitors building water systems and one called Base Building monitors HVAC systems. A third tool is being rolled out to monitor refrigeration systems in more than 150 fulfillment centers worldwide, the company said.  The tools have already helped Amazon reduce water and energy consumption at sites in Scotland, Spain and New York, the company said. In the Scotland facility, the FlowMS tool helped associates uncover a leak in an underground water line that could have caused up to 9 million gallons in annual water losses, the company said. The tool analyzed the building’s water meter data, saw that it was using more water than expected and alerted engineers, who traced the leak and repaired a faulty valve, the company said. The Base Building tool leverages Amazon’s SageMaker and Lambda machine learning capabilities to analyze HVAC operational data, energy consumption and local weather data to identify possible system anomalies, the company said.  The tool helped Amazon pinpoint a miscalibrated utility meter at a New York facility that appeared to be using

Read More »

Growing demand for electricity requires new policy solutions

Tom Falcone is the president of the Large Public Power Council, an advocacy organization that represents 29 of the largest public power systems in America, and the former CEO of the Long Island Power Authority. America wants more. More manufacturing, more innovation, more opportunities for economic growth. Technologies like AI and advanced semiconductors are meeting the moment, providing enterprises and individuals with computing power, jobs and investment opportunities. And the utilities that supply the electricity necessary to manufacture and power these technologies are working to meet rapidly rising demand, but it’s getting harder to keep up. For more than 20 years, utilities have had flat or declining electricity loads, but now, load growth is expanding faster than anyone was expecting. The Energy Information Administration anticipates that U.S. electricity demand will grow by 4.6% this year, the highest in several decades, due to the electrification of the economy and the expansion of data centers and advanced manufacturing centers. The Electric Power Research Institute projects that data centers alone could consume up to 9% of U.S. electricity generation by 2030. At the same time, America also wants more reliability, more efficiency, more convenience — all without escalating the cost of electricity. This can be particularly challenging in areas affected by extreme weather conditions like hurricanes, tornadoes and wildfires. Public power utilities are experiencing this shift firsthand. Large Public Power Council member utilities in states like Georgia, Nebraska, Texas and Arizona are scaling up their generation capacity to meet surging demand. In Omaha, for example, the local public power utility, Omaha Public Power District, is doubling its capacity by 2030. Metropolitan Omaha is now the second-largest in the U.S. for megawatt capacity dedicated to data centers, per S&P Global’s 2024 U.S. Datacenters and Energy Report. 
Similarly, Salt River Project must double its power

Read More »

Gas appliances suffer setback in Washington state

Washington state regulations encouraging the use of electric appliances over gas appliances have withstood a legal challenge from the building industry and other trade groups. U.S. District Judge Kymberly Evanson in Seattle ruled last week that the 11th Amendment, which prevents people from outside a state from suing the state in federal court, protected the defendant state officials from facing a federal lawsuit.  In dismissing the case, Judge Evanson noted that the state legislature that enacted the energy regulations lacks supervisory power over the city and county officials responsible for enforcing the Washington Energy Code.  The building industry viewed the state’s energy regulations, which established minimum performance standards and requirements for construction and construction materials, as impossibly stringent. They claimed the Energy Code improperly restricted the use of natural gas appliances in new residential and commercial construction. The groups were concerned that the regulations either outright banned the use of certain gas appliances or imposed energy efficiency standards that gas appliances cannot satisfy. They also claimed that the federal Energy Policy and Conservation Act preempted the Washington energy regulations and should have rendered them unenforceable. While the court reached its ruling solely on 11th amendment state sovereign immunity grounds, the decision could impact existing properties and represents a victory for Washington state in leaving the strict energy and efficiency regulations intact. In 2023, the Ninth Circuit struck down a city’s ban on natural gas piping in newly constructed buildings in California Restaurant Association v. City of Berkeley. The appellate court found that federal law preempted the Berkeley ordinance. However, it amended its opinion to clarify that state and local governments still have many options available for promoting clean energy.  
The Spokane Home Builders Association didn’t immediately respond to a request for comment. 

Read More »

ARC Secures LNG Sale to ExxonMobil Using Cedar LNG Offtake

ARC Resources Ltd. said Tuesday it has entered into a deal to sell all of its offtake from the under-construction Cedar LNG in Canada to Exxon Mobil Corp. The agreement entitles ExxonMobil LNG Asia-Pacific to about 1.5 million metric tons per annum (MMtpa) of liquefied natural gas (LNG). “The Agreement commences with commercial operations at the Cedar LNG Facility, expected late 2028, and continues for the term of ARC’s liquefaction tolling services agreement with Cedar LNG Partners LP”, Calgary, Canada-based ARC said in an online statement. Under the tolling services agreement, ARC is to deliver around 200 million cubic feet a day of natural gas to Cedar LNG for liquefaction. Cedar LNG, 50.1 percent owned by the Haisla Nation with Pembina Pipeline Corp. holding the remaining 49.9 percent, is planned to have a nameplate capacity of 3.3 MMtpa. It is located on tribal land on Canada’s West Coast. The project has secured all key governmental approvals and is in the early construction stage, Cedar LNG says on its website. The partners reached a final investment decision June 2024. ARC president and chief executive Terry Anderson said of the agreement with ExxonMobil, “Today, we have reached a significant milestone in our strategy to diversify and expand margins through participation in the global LNG market”. “Through this Agreement, we have achieved our target of linking approximately 25 percent of our future natural gas production to international pricing”, Anderson said. This is ARC’s third long-term LNG-related agreement in three years that provides exposure to international LNG pricing, the company said.  In 2022 it announced an agreement to supply 140,000 million British thermal units per day (MMBtud) of natural gas to Cheniere Energy Inc.’s Corpus Christi Stage III expansion with pricing linked to the Platts Japan Korea Marker. In 2023 ARC announced a second

Read More »

Decarbonising Scotland means an industrial transformation over energy transition

Scotland and the UK’s decarbonisation strategy needs to extend beyond the energy industry to become a full-scale industrial transformation. Taking carbon out of the country’s economy will be the big issue discussed at DeCarbScotland, organised by industrial decarbonisation representative body NECCUS, in Edinburgh on 13 March. Speaking to Energy Voice, NECCUS CEO Dr Philippa Parmiter said: “People have focused on the energy transition and using carbon capture and storage (CCS) on fossil fuel power plants. But we’ve moved away from that. It’s applicable across the economy, and we need to talk about an industrial transformation rather than the energy transition.” In addition to hearing from Scottish minister for climate action Alasdair Allan, the event will see representatives from several companies and bodies discussing their ongoing projects. This includes Ineos, which will talk about Project Willow, the plan to renovate Scotland’s Grangemouth refinery to contribute to the country’s net-zero ambitions. Veri Energy will also explore plans for the Sullom Voe terminal on Shetland; Storegga, which is developing the Track 2 Acorn project, National Gas, and SSE Thermal will also be presenting. Opportunity for Scotland “The key thing we’re interested in is the opportunity for Scotland,” Parmiter said. “For carbon capture, we’ve got huge storage capacity – potentially 78 gigatonnes in the UK, and Scotland has two-thirds of that.” With emissions of around 40m tonnes per year, this equates to hundreds of years of storage capacity. This presents a market to offer storage services for European consumers, creating revenue and driving global decarbonisation targets. It also represents a chance to develop a UK and Scotland-based supply chain to develop the projects. “There’s a strong feeling that local content wasn’t there in the development of the wind sector,” Parmiter noted. “We’ve got an opportunity to make sure that local content is
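The storage-lifetime claim in the quote above can be sanity-checked in a few lines. A minimal sketch, taking the piece’s own figures on trust (78 Gt of potential UK capacity, a two-thirds Scottish share, roughly 40m tonnes of annual emissions):

```python
# Rough check of the storage-lifetime claim, using only the figures
# quoted in the piece (assumed, not independently verified).
uk_capacity_gt = 78.0        # potential UK storage capacity, gigatonnes
scotland_share = 2.0 / 3.0   # Scotland's share of that capacity
annual_emissions_mt = 40.0   # Scottish emissions, million tonnes per year

scotland_capacity_mt = uk_capacity_gt * scotland_share * 1000  # Gt -> Mt
years = scotland_capacity_mt / annual_emissions_mt
print(round(years))  # 1300
```

At current emission rates the quoted capacity would last on the order of a millennium, so “hundreds of years” is, if anything, an understatement.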

Read More »

Podcast: On the Frontier of Modular Edge AI Data Centers with Flexnode’s Andrew Lindsey

The modular data center industry is undergoing a seismic shift in the age of AI, and few are as deeply embedded in this transformation as Andrew Lindsey, Co-Founder and CEO of Flexnode. In a recent episode of the Data Center Frontier Show podcast, Lindsey joined Editor-in-Chief Matt Vincent and Senior Editor David Chernicoff to discuss the evolution of modular data centers, the growing demand for high-density liquid-cooled solutions, and the industry factors driving this momentum. A Background Rooted in Innovation Lindsey’s career has been defined by the intersection of technology and the built environment. Prior to launching Flexnode, he worked at Alpha Corporation, a top 100 engineering and construction management firm founded by his father in 1979. His early career involved spearheading technology adoption within the firm, with a focus on high-security infrastructure for both government and private clients. Recognizing a massive opportunity in the data center space, Lindsey saw a need for an innovative approach to infrastructure deployment. “The construction industry is relatively uninnovative,” he explained, citing a McKinsey study that ranked construction as the second least-digitized industry—just above fishing and wildlife, which remains deliberately undigitized. Given the billions of square feet of data center infrastructure required in a relatively short timeframe, Lindsey set out to streamline and modernize the process. Founded four years ago, Flexnode delivers modular data centers with a fully integrated approach, handling everything from site selection to design, engineering, manufacturing, deployment, operations, and even end-of-life decommissioning. Their core mission is to provide an “easy button” for high-density computing solutions, including cloud and dedicated GPU infrastructure, allowing faster and more efficient deployment of modular data centers. 
The Rising Momentum for Modular Data Centers As Vincent noted, Data Center Frontier has closely tracked the increasing traction of modular infrastructure. Lindsey has been at the forefront of this

Read More »

Last Energy to Deploy 30 Microreactors in Texas for Data Centers

As the demand for data center power surges in Texas, nuclear startup Last Energy has now announced plans to build 30 microreactors in the state’s Haskell County near the Dallas-Fort Worth Metroplex. The reactors will serve a growing customer base of data center operators in the region looking for reliable, carbon-free energy. The plan marks Last Energy’s largest project to date and a significant step in advancing modular nuclear power as a viable solution for high-density computing infrastructure. Meeting the Looming Power Demands of Texas Data Centers Texas is already home to over 340 data centers, with significant expansion underway. Google is increasing its data center footprint in Dallas, while OpenAI’s Stargate has announced plans for a new facility in Abilene, just an hour south of Last Energy’s planned site. The company notes the Dallas-Fort Worth metro area alone is projected to require an additional 43 gigawatts of power in the coming years, far surpassing current grid capacity. To help meet this demand, Last Energy has secured a 200+ acre site in Haskell County, approximately three and a half hours west of Dallas. The company has also filed for a grid connection with ERCOT, with plans to deliver power via a mix of private wire and grid transmission. Additionally, Last Energy has begun pre-application engagement with the U.S. Nuclear Regulatory Commission (NRC) for an Early Site Permit, a key step in securing regulatory approval. According to Last Energy CEO Bret Kugelmass, the company’s modular approach is designed to bring nuclear energy online faster than traditional projects. “Nuclear power is the most effective way to meet Texas’ growing energy demand, but it needs to be deployed faster and at scale,” Kugelmass said. “Our microreactors are designed to be plug-and-play, enabling data center operators to bypass the constraints of an overloaded grid.” Scaling Nuclear for

Read More »

Data Center Jobs: Engineering and Technician Jobs Available in Major Markets

Each month Data Center Frontier, in partnership with Pkaza, posts some of the hottest data center career opportunities in the market. Here’s a look at some of the latest data center jobs posted on the Data Center Frontier jobs board, powered by Pkaza Critical Facilities Recruiting. Data Center Facility Engineer (Night Shift Available) Ashburn, VA. This position is also available in: Tacoma, WA (Nights); Days/Nights: Needham, MA and New York City, NY. This opportunity is working directly with a leading mission-critical data center developer / wholesaler / colo provider. This firm provides data center solutions custom-fit to the requirements of their clients’ mission-critical operational facilities. They provide reliability of mission-critical facilities for many of the world’s largest organizations, supporting enterprise clients and hyperscale companies. This opportunity provides a career-growth minded role with exciting projects with leading-edge technology and innovation as well as competitive salaries and benefits. Electrical Commissioning Engineer New Albany, OH. This traveling position is also available in: Somerset, NJ; Boydton, VA; Richmond, VA; Ashburn, VA; Charlotte, NC; Atlanta, GA; Hampton, GA; Fayetteville, GA; Des Moines, IA; San Jose, CA; Portland, OR; St Louis, MO; Phoenix, AZ; Dallas, TX; Chicago, IL; or Toronto, ON. *** ALSO looking for LEAD EE and ME CxA agents. *** Our client is an engineering design and commissioning company that has a national footprint and specializes in MEP critical facilities design. They provide design, commissioning, consulting and management expertise in the critical facilities space. They have a mindset to provide reliability, energy efficiency, sustainable design and LEED expertise when providing these consulting services for enterprise, colocation and hyperscale companies.
This career-growth minded opportunity offers exciting projects with leading-edge technology and innovation as well as competitive salaries and benefits. Switchgear Field Service Technician – Critical Facilities Nationwide Travel. This position is also available in: Charlotte, NC; Atlanta, GA; Dallas,

Read More »

Amid Shifting Regional Data Center Policies, Iron Mountain and DC Blox Both Expand in Virginia’s Henrico County

The dynamic landscape of data center developments in Maryland and Virginia exemplifies the intricate balance between fostering technological growth and addressing community and environmental concerns. Data center developers in this region find themselves both in the crosshairs of groups worried about the environment and other groups looking to drive economic growth. In some cases, the groups are different components of the same organizations, such as local governments. For data center development, meeting the needs of these competing interests often means walking a none-too-stable tightrope. Rapid Government Action Encourages Growth In May 2024, Maryland demonstrated its commitment to attracting data center investments by enacting the Critical Infrastructure Streamlining Act. This legislation provides a clear framework for the use of emergency backup power generation, addressing previous regulatory challenges that a few months earlier had hindered projects like Aligned Data Centers’ proposed 264-megawatt campus in Frederick County, causing Aligned to pull out of the project. However, just days after the Act was signed by the governor, Aligned reiterated its plans to move forward with development in Maryland. With the Quantum Loop and the related data center development making Frederick County a focal point for a balanced approach, the industry is paying careful attention to the pace of development and the relations between developers, communities and the government. In September 2024, Frederick County Executive Jessica Fitzwater revealed draft legislation that would potentially restrict where in the county data centers could be built. The legislation was based on information found in the Frederick County Data Centers Workgroup’s final report. Those bills would update existing regulations and create a floating zone for Critical Digital Infrastructure and place specific requirements on siting data centers.
Statewide, a cautious approach to environmental and community impacts has been deemed important. In January 2025, legislators introduced SB116, a bill

Read More »

New Reports Show How AI, Power, and Investment Trends Are Reshaping the Data Center Landscape

Today we provide a comprehensive roundup of the latest industry analyst reports from CBRE, PwC, and Synergy Research, offering a data-driven perspective on the state of the North American data center market.  To wit, CBRE’s latest findings highlight record-breaking growth in supply, soaring colocation pricing, and mounting power constraints shaping site selection. For its part, PwC’s analysis underscores the sector’s broader economic impact, quantifying its trillion-dollar contribution to GDP, rapid job growth, and surging tax revenues.  Meanwhile, the latest industry analysis from Synergy Research details the acceleration of cloud spending, AI’s role in fueling infrastructure demand, and an unprecedented surge in data center mergers and acquisitions.  Together, these reports paint a picture of an industry at an inflection point—balancing explosive expansion with evolving challenges in power availability, cost pressures, and infrastructure investment. Let’s examine them. CBRE: Surging Demand Fuels Record Data Center Expansion CBRE says the North American data center sector is scaling at an unprecedented pace, driven by unrelenting demand from artificial intelligence (AI), hyperscale, and cloud service providers. The latest North America Data Center Trends H2 2024 report from CBRE reveals that total supply across primary markets surged by 34% year-over-year to 6,922.6 megawatts (MW), outpacing the 26% growth recorded in 2023. This accelerating expansion has triggered record-breaking construction activity and intensified competition for available capacity. Market Momentum: Scaling Amid Power Constraints According to CBRE, data center construction activity reached historic levels, with 6,350 MW under development at the close of 2024—more than doubling the 3,077.8 MW recorded a year prior. 
Yet, the report finds the surge in development is being met with significant hurdles, including power constraints and supply chain challenges affecting critical electrical infrastructure. As a result, the vacancy rate across primary markets has plummeted to an all-time low of 1.9%, with only a handful of sites
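The headline growth figures above can be cross-checked quickly; a rough sketch using only the numbers attributed to CBRE in this summary:

```python
# Cross-check of the CBRE figures quoted above.
supply_2024_mw = 6922.6   # total primary-market supply at end of 2024, MW
yoy_growth = 0.34         # reported 34% year-over-year growth

# Implied end-of-2023 supply, consistent with the 34% growth figure.
implied_2023_mw = supply_2024_mw / (1 + yoy_growth)
print(round(implied_2023_mw))  # 5166

# Construction pipeline: 6,350 MW vs. 3,077.8 MW a year earlier.
construction_2024_mw = 6350.0
construction_2023_mw = 3077.8
print(round(construction_2024_mw / construction_2023_mw, 2))  # 2.06
```

The implied 2023 base of roughly 5,166 MW and the 2.06x construction multiple are consistent with the report’s “more than doubling” characterization.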

Read More »

Minnesota PUC Says No to Amazon’s Bid to Fast-Track 250 Diesel Generators for Data Center

Amazon is facing scrutiny and significant pushback over its plan to install 250 diesel backup generators for a proposed data center in Becker, Minnesota. Much of the concern stems from the fact that the hyperscaler is seeking an exemption from the state’s standard permitting process, a move that has sparked opposition from environmental groups and state officials. Aggregate Power that Matches Nuclear Power Generation Amazon’s proposed fleet of diesel generators would have a maximum power output almost equivalent to the 647 MW that is produced by Xcel Energy’s nuclear plant in Monticello, one of the two existing nuclear generation stations in the state. Meanwhile, as reported by Datacenter Dynamics, according to a real estate filing published with the Minnesota Department of Revenue, the land parcel assigned for the Amazon data center in Becker was previously part of Minneapolis-based utility Xcel’s coal-powered Sherco Site. Amazon argues that the diesel generators in question are essential to ensuring reliable and secure access to critical data and applications for its customers, including hospitals and first responders. However, opponents worry about the environmental impact and the precedent it may set for future large-scale data center developments in the state. The Law and Its Exception Under Minnesota state law, any power plant capable of generating 50 megawatts or more that connects to the grid via transmission lines must obtain a Certificate of Need from the Public Utilities Commission (PUC). This certification ensures that the infrastructure is necessary and that no cheaper, cleaner alternatives exist. Amazon, however, contends that its generators do not fall under this requirement because they are not connected to the larger electric grid; power generated would be strictly used by the data center suffering an outage from its primary power source. That power would be generated locally, and not transmitted over
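To put the 50 MW Certificate of Need trigger in context, a back-of-the-envelope split of the fleet’s quoted output is instructive. This sketch assumes the ~647 MW aggregate and 250-generator figures above; the article does not give individual unit ratings:

```python
# Back-of-the-envelope math on the proposed backup fleet.
total_fleet_mw = 647.0  # aggregate output quoted in the article (approximate)
generator_count = 250
threshold_mw = 50.0     # Minnesota's Certificate of Need trigger

per_unit_mw = total_fleet_mw / generator_count
print(round(per_unit_mw, 2))          # 2.59 MW per generator, on average
print(per_unit_mw < threshold_mw)     # True: each unit is below the trigger
print(total_fleet_mw > threshold_mw)  # True: the fleet far exceeds it
```

Each generator averages under 3 MW, while the fleet in aggregate is roughly thirteen times the statutory threshold, which is why the grid-connection question, rather than unit size, is the crux of the dispute.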

Read More »

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one that is ramping up its investments into AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs). In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple to devote a combined $200 billion to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are way higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

Read More »

John Deere unveils more autonomous farm machines to address skilled labor shortage

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. The Moline, Illinois-based John Deere has been in business for 187 years, yet it’s been a regular as a non-tech company showing off technology at the big tech trade show in Las Vegas and is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech. The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually, and the agricultural workforce continues to shrink. (This is my hint to the anti-immigration crowd). John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app. While each of these industries experiences their own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

Read More »

2025 playbook for enterprise AI success, from agents to evals

2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year. 1. Agents: the next generation of automation AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for companies and recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to

Read More »

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more. The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks. Going all-in on red teaming pays practical, competitive dividends It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S. National Institute of Standards and Technology (NIST), which all had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find.
What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle

Read More »