
Efficient Data Handling in Python with Arrow


1. Introduction

We’re all used to working with CSVs, JSON files… With traditional libraries and large datasets, these can be extremely slow to read, write and operate on, leading to performance bottlenecks (been there). It’s precisely with big amounts of data that handling it efficiently becomes crucial for our data science/analytics workflow, and this is exactly where Apache Arrow comes into play.

Why? The main reason resides in how the data is stored in memory. While JSON and CSVs, for example, are text-based formats, Arrow is a columnar in-memory data format (and that allows for fast data interchange between different data processing tools). Arrow is therefore designed to optimize performance by enabling zero-copy reads, reducing memory usage, and supporting efficient compression. 

Moreover, Apache Arrow is open-source and optimized for analytics. It is designed to accelerate big data processing while maintaining interoperability with various data tools, such as Pandas, Spark, and Dask. By storing data in a columnar format, Arrow enables faster read/write operations and efficient memory usage, making it ideal for analytical workloads.

Sounds great, right? What’s best is that this is all the introduction to Arrow I’ll provide. Enough theory; we want to see it in action. So, in this post, we’ll explore how to use Arrow in Python and how to make the most out of it.

2. Arrow in Python

To get started, you need to install the necessary libraries: pandas and pyarrow.

pip install pyarrow pandas

Then, as always, import them in your Python script:

import pyarrow as pa
import pandas as pd

Nothing new yet, just necessary steps to do what follows. Let’s start by performing some simple operations.

2.1. Creating and Storing a Table

The simplest thing we can do is hardcode our table’s data. Let’s create a two-column table with football data:

teams = pa.array(['Barcelona', 'Real Madrid', 'Rayo Vallecano', 'Athletic Club', 'Real Betis'], type=pa.string())
goals = pa.array([30, 23, 9, 24, 12], type=pa.int8())

team_goals_table = pa.table([teams, goals], names=['Team', 'Goals'])

The result is a pyarrow.Table, but we can easily convert it to pandas if we want:

df = team_goals_table.to_pandas()

And restore it back to Arrow using:

team_goals_table = pa.Table.from_pandas(df)

Finally, we’ll store the table in a file. We could use different formats, like Feather or Parquet… I’ll use the latter because it’s fast and memory-optimized:

import pyarrow.parquet as pq
pq.write_table(team_goals_table, 'data.parquet')

Reading a parquet file would just consist of using pq.read_table('data.parquet').
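
And since Parquet is a columnar format, we can even read back only the columns we need. A quick sketch (assuming the data.parquet file written above):

import pyarrow.parquet as pq

# Read the whole table back
team_goals_table = pq.read_table('data.parquet')

# Or read only the columns we care about
teams_only = pq.read_table('data.parquet', columns=['Team'])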

2.2. Compute Functions

Arrow has its own compute module for the usual operations. Let’s start by comparing two arrays element-wise:

import pyarrow.compute as pc
>>> a = pa.array([1, 2, 3, 4, 5, 6])
>>> b = pa.array([2, 2, 4, 4, 6, 6])
>>> pc.equal(a,b)
[
  false,
  true,
  false,
  true,
  false,
  true
]

That was easy. We can sum all elements in an array with:

>>> pc.sum(a)

And from this we could easily guess how to compute a count, a floor, an exp, a mean, a max, a multiplication… No need to go over them one by one.
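
Still, here are a few of them in action, just to confirm the pattern (a quick sketch using the same arrays as above; expected results are shown as comments):

>>> pc.count(a)        # 6
>>> pc.mean(a)         # 3.5
>>> pc.max(a)          # 6
>>> pc.multiply(a, b)  # [2, 4, 12, 16, 30, 36]

With element-wise operations covered, let’s move on to tabular ones.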

We’ll start by sorting a table:

>>> table = pa.table({'i': ['a','b','a'], 'x': [1,2,3], 'y': [4,5,6]})
>>> pc.sort_indices(table, sort_keys=[('y', 'descending')])

[
  2,
  1,
  0
]
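
Note that sort_indices returns the row order, not the sorted table itself. To actually reorder the table, we can pass those indices to take (a quick sketch):

>>> indices = pc.sort_indices(table, sort_keys=[('y', 'descending')])
>>> table.take(indices)
pyarrow.Table
i: string
x: int64
y: int64
----
i: [["a","b","a"]]
x: [[3,2,1]]
y: [[6,5,4]]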

Just like in pandas, we can group values and aggregate the data. Let’s, for example, group by “i” and compute the sum on “x” and the mean on “y”:

>>> table.group_by('i').aggregate([('x', 'sum'), ('y', 'mean')])
pyarrow.Table
i: string
x_sum: int64
y_mean: double
----
i: [["a","b"]]
x_sum: [[4,2]]
y_mean: [[5,5]]

Or we can join two tables:

>>> t1 = pa.table({'i': ['a','b','c'], 'x': [1,2,3]})
>>> t2 = pa.table({'i': ['a','b','c'], 'y': [4,5,6]})
>>> t1.join(t2, keys="i")
pyarrow.Table
i: string
x: int64
y: int64
----
i: [["a","b","c"]]
x: [[1,2,3]]
y: [[4,5,6]]

By default this is a left outer join, but we can change it with the join_type parameter.
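
For example, an inner join or a full outer join would look like this (a quick sketch; other accepted values include "left semi" and "right anti"):

>>> t1.join(t2, keys="i", join_type="inner")       # keep only matching keys
>>> t1.join(t2, keys="i", join_type="full outer")  # keep keys from both sides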

There are many more useful operations, but let’s see just one more to avoid making this too long: appending a new column to a table.

>>> t1.append_column("z", pa.array([22, 44, 99]))
pyarrow.Table
i: string
x: int64
z: int64
----
i: [["a","b","c"]]
x: [[1,2,3]]
z: [[22,44,99]]

Before ending this section, let’s see how to filter a table or array:

>>> t1.filter((pc.field('x') > 0) & (pc.field('x') < 3))
pyarrow.Table
i: string
x: int64
----
i: [["a","b"]]
x: [[1,2]]

Easy, right? Especially if you’ve been using pandas and numpy for years!

3. Working with Files

We’ve already seen how we can read and write Parquet files. But let’s check some other popular file types so that we have several options available.

3.1. Apache ORC

Informally, Apache ORC can be understood as the equivalent of Arrow in the realm of file formats (even though its origins have nothing to do with Arrow). More precisely, it’s an open-source, columnar storage format. 

Reading and writing it is as follows:

from pyarrow import orc
# Write table
orc.write_table(t1, 't1.orc')
# Read table
t1 = orc.read_table('t1.orc')

As a side note, we can compress the file while writing by using the compression parameter.
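
A quick sketch (codec availability, e.g. ZSTD or SNAPPY, can depend on how pyarrow was built):

# Write the ORC file compressed with zstd
orc.write_table(t1, 't1.orc', compression='ZSTD')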

3.2. CSV

No secret here, pyarrow has the CSV module:

from pyarrow import csv
# Write CSV
csv.write_csv(t1, "t1.csv")
# Read CSV
t1 = csv.read_csv("t1.csv")

# Write CSV compressed and without header
options = csv.WriteOptions(include_header=False)
with pa.CompressedOutputStream("t1.csv.gz", "gzip") as out:
    csv.write_csv(t1, out, options)

# Read the compressed CSV, providing the column names ourselves
# (the file was written without a header, so there is nothing to skip)
t1 = csv.read_csv("t1.csv.gz", read_options=csv.ReadOptions(
    column_names=["i", "x"]
))

3.3. JSON

PyArrow allows JSON reading but not writing. It’s pretty straightforward; let’s see an example, supposing we have our JSON data in “data.json”:

from pyarrow import json
# Read json
fn = "data.json"
table = json.read_json(fn)

# We can now convert it to pandas if we want to
df = table.to_pandas()
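
One detail worth knowing: read_json expects line-delimited JSON, i.e. one object per line. A minimal end-to-end sketch (the field names are just placeholders):

# Create a small line-delimited JSON file and read it back
with open("data.json", "w") as f:
    f.write('{"Team": "Barcelona", "Goals": 30}\n')
    f.write('{"Team": "Real Madrid", "Goals": 23}\n')

table = json.read_json("data.json")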

3.4. Feather

Feather is a portable file format for storing Arrow tables or data frames (from languages like Python or R) that uses the Arrow IPC format internally. So, unlike Apache ORC, this one was indeed created early in the Arrow project.

from pyarrow import feather
# Write feather from pandas DF
feather.write_feather(df, "t1.feather")
# Write feather from table, and compressed
feather.write_feather(t1, "t1.feather.lz4", compression="lz4")

# Read feather into table
t1 = feather.read_table("t1.feather")
# Read feather into df
df = feather.read_feather("t1.feather")

4. Advanced Features

We’ve just touched upon the most basic features, which are what most people need while working with Arrow. However, Arrow’s capabilities don’t end here; that’s where they start.

As this is quite domain-specific and not useful for everyone (nor exactly introductory), I’ll just mention some of these features, keeping code to a tiny sketch at the end:

  • We can handle memory management through the Buffer type (built on top of the C++ Buffer object). Creating a buffer with our data does not allocate any memory; it is a zero-copy view on the memory exported from the data bytes object. Staying on the memory-management theme, an instance of MemoryPool tracks all allocations and deallocations (like malloc and free in C). This allows us to track how much memory is being allocated.
  • Similarly, there are different ways to work with input/output streams in batches.
  • PyArrow comes with an abstract filesystem interface, as well as concrete implementations for various storage types. So, for example, we can write and read Parquet files from an S3 bucket using the S3FileSystem. Google Cloud and the Hadoop Distributed File System (HDFS) are also supported.
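
To make the first point slightly more tangible, a tiny sketch (nothing here is required for everyday use):

import pyarrow as pa

# A zero-copy view over an existing bytes object: no data is copied or allocated
data = b"hello arrow"
buf = pa.py_buffer(data)

# The default memory pool tracks Arrow's allocations and deallocations
pool = pa.default_memory_pool()
print(buf.size, pool.bytes_allocated())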

5. Conclusion and Key Takeaways

Apache Arrow is a powerful tool for efficient data handling in Python. Its columnar storage format, zero-copy reads, and interoperability with popular data processing libraries make it ideal for data science workflows. By integrating Arrow into your pipeline, you can significantly boost performance and optimize memory usage.

