😲 Quantifying Surprise – A Data Scientist’s Intro To Information Theory – Part 1/4: Foundations



During the telecommunication boom, Claude Shannon, in his seminal 1948 paper¹, posed a question that would revolutionise technology:

How can we quantify communication?

Shannon’s findings remain fundamental to how we quantify, store and communicate information. These insights contributed substantially to technologies ranging from signal processing and data compression (e.g., Zip files and compact discs) to the Internet and artificial intelligence. More broadly, his work has significantly impacted diverse fields such as neurobiology, statistical physics and computer science (e.g., cybersecurity, cloud computing, and machine learning).

“[Shannon’s paper is the] Magna Carta of the Information Age” – Scientific American

This is the first article in a series that explores information quantification – an essential tool for data scientists. Its applications range from enhancing statistical analyses to serving as a go-to decision heuristic in cutting-edge machine learning algorithms.

Broadly speaking, quantifying information is assessing uncertainty, which may be phrased as: “how surprising is an outcome?”

This article idea quickly grew into a series since I found this topic both fascinating and diverse. Most researchers, at one stage or another, come across commonly used metrics such as entropy, cross-entropy/KL-divergence and mutual information. Diving into this topic, I found that to fully appreciate these one needs to learn a bit about the basics, which we cover in this first article.

By reading this series you will gain an intuition and tools to quantify:

Bits/Nats – Unit measures of information.
Self-Information – The amount of information in a specific event.
Pointwise Mutual Information – The amount of information shared between two specific events.
Entropy – The average amount of information of a variable’s outcome.
Cross-entropy – The misalignment between two probability distributions (also expressed by the measure derived from it, KL-divergence, which acts as a distance-like measure although it is not symmetric).
Mutual Information – The co-dependency of two variables by their conditional probability distributions. It expresses the information gain of one variable given another.

No prior knowledge is required – just a basic understanding of probabilities.

I demonstrate using common statistical examples such as coin and dice 🎲 tosses as well as machine learning applications such as supervised classification, feature selection, model monitoring and clustering assessment. As for real-world applications, I’ll discuss a case study of quantifying DNA diversity 🧬. Finally, for fun, I also apply these tools to the popular brain twister commonly known as the Monty Hall problem 🚪🚪 🐐.

Throughout I provide Python code 🐍 and try to keep formulas as intuitive as possible. If you have access to an integrated development environment (IDE) 🖥 you might want to plug 🔌 and play 🕹 around with the numbers to gain a better intuition.

This series is divided into four articles, each exploring a key aspect of Information Theory:

😲 Quantifying Surprise: 👈 👈 👈 YOU ARE HERE
In this opening article, you’ll learn how to quantify the “surprise” of an event using _self-information_ and understand its units of measurement, such as _bits_ and _nats_. Mastering self-information is essential for building intuition about the subsequent concepts, as all later heuristics are derived from it.
🤷 Quantifying Uncertainty: Building on self-information, this article shifts focus to the uncertainty – or “average surprise” – associated with a variable, known as entropy. We’ll dive into entropy’s wide-ranging applications, from Machine Learning and data analysis to solving fun puzzles, showcasing its adaptability.
📏 Quantifying Misalignment: Here, we’ll explore how to measure the mismatch between two probability distributions using entropy-based metrics like cross-entropy and KL-divergence. These measures are particularly valuable for tasks like comparing predicted versus true distributions, as in classification loss functions and other alignment-critical scenarios.
💸 Quantifying Gain: Expanding from single-variable measures, this article investigates the relationship between two variables. You’ll discover how to quantify the information gained about one variable (e.g., target Y) by knowing another (e.g., predictor X). Applications include assessing variable associations, feature selection, and evaluating clustering performance.

Each article is crafted to stand alone while offering cross-references for deeper exploration. Together, they provide a practical, data-driven introduction to information theory, tailored for data scientists, analysts and machine learning practitioners.

Disclaimer: Unless otherwise mentioned the formulas analysed are for categorical variables with c≥2 classes (2 meaning binary). Continuous variables will be addressed in a separate article.

🚧 Articles (3) and (4) are currently under construction. I will share links once available. Follow me to be notified 🚧

Quantifying Surprise with Self-Information

Self-information is considered the building block of information quantification.

It is a way of quantifying the amount of “surprise” of a specific outcome.

Formally, self-information, also referred to as Shannon information or information content, quantifies the surprise of an event x occurring based on its probability, p(x). Here we denote it as hₓ:

hₓ = -log₂(p(x))

Self-information hₓ is the information of event x that occurs with probability p(x).

The units of measure are called bits. One bit (binary digit) is the amount of information for an event x that has probability of p(x)=½. Let’s plug in to verify: hₓ=-log₂(½)= log₂(2)=1 bit.
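As a quick check of the definition, here is a minimal sketch that computes self-information in bits. The function name `self_information` is an illustrative choice (it does not come from the original article), and only Python's standard `math` module is assumed.

```python
import math


def self_information(p: float) -> float:
    """Return the self-information, in bits, of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in the interval (0, 1]")
    return -math.log2(p)


# An event with probability 1/2 carries exactly one bit.
print(self_information(0.5))
# A rarer event, with probability 1/8, carries more.
print(self_information(0.125))
```

The first call should report 1 bit and the second 3 bits, matching the hand calculation above.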

This heuristic serves as an alternative to probabilities, odds and log-odds, with certain mathematical properties which are advantageous for information theory. We discuss these below when learning about Shannon’s axioms behind this choice.

It’s always informative to explore how an equation behaves with a graph:

Bernoulli trial self-information h(p). Key features: Monotonic, h(p=1)=0, h(p→0)→∞.

To deepen our understanding of self-information, we’ll use this graph to explore the said axioms that justify its logarithmic formulation. Along the way, we’ll also build intuition about key features of this heuristic.

To emphasise the logarithmic nature of self-information, I’ve highlighted three points of interest on the graph:

At p=1 an event is guaranteed, yielding no surprise and hence zero bits of information. A useful analogy is a trick coin (where both sides show HEAD).
Reducing the probability by a factor of two (p=½) increases the information to hₓ=1 bit. This, of course, is the case of a fair coin.
Further reducing it by a factor of four results in hₓ(p=⅛)=3 bits.

If you are interested in coding the graph here is a python script:
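The original script is not reproduced in this text. As a stand-in, the following is a minimal sketch that should produce a comparable graph, assuming `matplotlib` is installed. The function names, the number of sample points and the styling are all my assumptions rather than the author's choices.

```python
import math


def self_information_curve(n_points: int = 200):
    """Return probabilities in (0, 1] and their self-information in bits."""
    probs = [(i + 1) / n_points for i in range(n_points)]
    bits = [-math.log2(p) for p in probs]
    return probs, bits


def plot_self_information():
    """Draw h(p) = -log2(p) and mark p = 1, 1/2 and 1/8 (requires matplotlib)."""
    # Imported here so that the curve data can be computed without matplotlib.
    import matplotlib.pyplot as plt

    probs, bits = self_information_curve()
    plt.plot(probs, bits, label="h(p) = -log2(p)")
    # Highlight the three points of interest discussed in the text.
    for p in (1.0, 0.5, 0.125):
        plt.scatter([p], [-math.log2(p)], zorder=3)
    plt.xlabel("probability p")
    plt.ylabel("self-information h(p) [bits]")
    plt.legend()
    plt.show()
```

Calling `plot_self_information()` should display a curve that falls monotonically from large values near p=0 to zero at p=1.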

To summarise this section:

Self-Information hₓ=-log₂(p(x)) quantifies the amount of “surprise” of a specific outcome x.

Three Axioms

Referencing prior work by Ralph Hartley, Shannon chose -log₂(p) as a way to satisfy three axioms. We’ll use the equation and graph to examine how these are manifested:

An event with probability 100% is not surprising and hence does not yield any information.
In the trick coin case this is evident by p(x)=1 yielding hₓ=0.
Less probable events are more surprising and provide more information.
This is apparent by self-information decreasing monotonically with increasing probability.
The property of Additivity – the total self-information of two independent events equals the sum of individual contributions. This will be explored further in the upcoming fourth article on Mutual Information.
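The additivity axiom is easy to check numerically. The sketch below assumes two independent events chosen for illustration: a fair coin landing heads (p=½) and a fair eight-sided die showing one particular face (p=⅛). The variable names are mine.

```python
import math


def self_information(p: float) -> float:
    """Self-information, in bits, of an event with probability p."""
    return -math.log2(p)


p_coin = 1 / 2  # fair coin lands heads
p_die = 1 / 8   # fair eight-sided die shows a given face

# For independent events the joint probability is the product of the two.
p_joint = p_coin * p_die

h_joint = self_information(p_joint)
h_sum = self_information(p_coin) + self_information(p_die)

print(f"h(joint) = {h_joint} bits, h(coin) + h(die) = {h_sum} bits")
```

Both quantities come out as 4 bits: the logarithm turns the product of probabilities into a sum of information.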

There are mathematical proofs (which are beyond the scope of this series) that show that only the log function adheres to all three².

The application of these axioms reveals several intriguing and practical properties of self-information:

Important properties:

Minimum bound: Together with monotonicity, the first axiom hₓ(p=1)=0 establishes that self-information is non-negative, with zero as its lower bound. This is highly practical for many applications.
Monotonically decreasing: The second axiom ensures that self-information decreases monotonically with increasing probability.
No Maximum bound: At the extreme where p→0, the logarithm makes self-information grow without bound, hₓ(p→0)→∞, a feature that requires careful consideration in some contexts. However, when averaging self-information – as we will later see in the calculation of entropy – probabilities act as weights, effectively limiting the contribution of highly improbable events to the overall average. This relationship will become clearer when we explore entropy in detail.
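To see how probability weighting tames the unbounded growth, here is a small sketch that compares h(p) with the weighted term p·h(p) for shrinking probabilities. The helper name `weighted_surprise` and the chosen probabilities are illustrative assumptions.

```python
import math


def weighted_surprise(p: float) -> float:
    """Contribution p * h(p), in bits, of one outcome to a probability-weighted average."""
    return p * -math.log2(p)


for p in (0.5, 0.1, 0.01, 0.001, 0.000001):
    h = -math.log2(p)  # grows without bound as p shrinks
    print(f"p={p:<8g} h={h:6.2f} bits   p*h={weighted_surprise(p):.6f} bits")
```

While h(p) keeps climbing as p shrinks, the weighted term heads towards zero, so very rare outcomes add little to an average.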

It is useful to understand the close relationship to log-odds. To do so we define p(x) as the probability of event x to happen and p(¬x)=1-p(x) of it not to happen. log-odds(x) = log₂(p(x)/p(¬x))= h(¬x) – h(x).
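The identity follows from log₂(p/(1-p)) = log₂(p) - log₂(1-p). A minimal numerical check, with function names chosen for illustration, could look like this:

```python
import math


def self_information(p: float) -> float:
    """Self-information, in bits, of an event with probability p."""
    return -math.log2(p)


def log_odds(p: float) -> float:
    """Base-2 log-odds of an event with probability p (0 < p < 1)."""
    return math.log2(p / (1 - p))


for p in (0.5, 0.75, 0.99):
    direct = log_odds(p)
    via_information = self_information(1 - p) - self_information(p)
    print(f"p={p}: log-odds={direct:.4f}, h(not x) - h(x)={via_information:.4f}")
```

The two columns should agree up to floating-point rounding, and both are zero for p=½, where the event and its complement are equally surprising.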

The main takeaways from this section are

Axiom 1: An event with probability 100% is not surprising.

Axiom 2: Less probable events are more surprising and, when they occur, provide more information.

Self-information (1) decreases monotonically as probability increases, (2) has a minimum bound of zero and (3) has no upper bound.

In the next two sections we further discuss units of measure and choice of normalisation.

Information Units of Measure

Bits or Shannons?

A bit, as mentioned, represents the amount of information associated with an event that has a 50% probability of occurring.

This unit is also sometimes called a Shannon, a naming convention proposed by mathematician and physicist David MacKay to avoid confusion with the term ‘bit’ in the context of digital processing and storage.

After some deliberation, I decided to use ‘bit’ throughout this series for several reasons:

This series focuses on quantifying information, not on digital processing or storage, so ambiguity is minimal.
Shannon himself, encouraged by mathematician and statistician John Tukey, used the term ‘bit’ in his landmark paper.
‘Bit’ is the standard term in much of the literature on information theory.
For convenience – it’s more concise.

Normalisation: Log Base 2 vs. Natural

Throughout this series we use base 2 for logarithms, reflecting the intuitive notion of a 50% chance of an event as a fundamental unit of information.

An alternative commonly used in machine learning is the natural logarithm, which introduces a different unit of measure called nats (short for natural units of information). One nat corresponds to the information gained from an event occurring with a probability of 1/e where e is Euler’s number (≈2.71828). In other words, -ln(1/e) = 1 nat.

The relationship between bits (base 2) and nats (natural log) is as follows:

1 bit = ln(2) nats ≈ 0.693 nats.

Think of it as similar to a currency exchange rate or converting centimeters to inches.
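A short sketch of this conversion, using the factor ln(2) stated above. The constant and function names are illustrative assumptions, and the example probability of ¼ is arbitrary.

```python
import math

NATS_PER_BIT = math.log(2)  # ln(2), roughly 0.693


def bits_to_nats(bits: float) -> float:
    """Convert an amount of information from bits to nats."""
    return bits * NATS_PER_BIT


def nats_to_bits(nats: float) -> float:
    """Convert an amount of information from nats to bits."""
    return nats / NATS_PER_BIT


p = 0.25
h_bits = -math.log2(p)  # self-information using log base 2
h_nats = -math.log(p)   # the same event using the natural log

print(f"{h_bits:.3f} bits = {bits_to_nats(h_bits):.3f} nats (direct: {h_nats:.3f} nats)")
```

Converting the base-2 result and computing directly with the natural log should give the same number of nats, which is the sense in which the base is only a choice of unit.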

In his seminal publication Shannon explained that the optimal choice of base depends on the specific system being analysed (paraphrased slightly from his original work):

“A device with two stable positions […] can store one bit of information” (bit as in binary digit).
“A digit wheel on a desk computing machine that has ten stable positions […] has a storage capacity of one decimal digit.”³
“In analytical work where integration and differentiation are involved the base e is sometimes useful. The resulting units of information will be called natural units.”

Key aspects of machine learning, such as popular loss functions, often rely on integrals and derivatives. The natural logarithm is a practical choice in these contexts because it can be differentiated and integrated without introducing additional constants. This likely explains why the machine learning community frequently uses nats as the unit of information – it simplifies the mathematics by avoiding the need to account for factors like ln(2).
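To make the extra constant concrete, here is a short worked comparison (an illustration added for clarity, using the standard change-of-base identity log₂ p = ln p / ln 2). Differentiating self-information with respect to p in each base gives:

```latex
\frac{d}{dp}\bigl[-\ln p\bigr] = -\frac{1}{p},
\qquad
\frac{d}{dp}\bigl[-\log_2 p\bigr]
  = \frac{d}{dp}\Bigl[-\frac{\ln p}{\ln 2}\Bigr]
  = -\frac{1}{p\,\ln 2}
```

The base-2 version carries a 1/ln(2) factor through every derivative, which is the bookkeeping that working in nats avoids.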

As shown earlier, I personally find base 2 more intuitive for interpretation. In cases where normalisation to another base is more convenient, I will make an effort to explain the reasoning behind the choice.

To summarise this section of units of measure:

bit = amount of information to distinguish between two equally likely outcomes.

Now that we are familiar with self-information and its unit of measure let’s examine a few use cases.

Quantifying Event Information with Coins and Dice

In this section, we’ll explore examples to help internalise the self-information axioms and key features demonstrated in the graph. Gaining a solid understanding of self-information is essential for grasping its derivatives, such as entropy, cross-entropy (or KL divergence), and mutual information – all of which are averages over self-information.

The examples are designed to be simple, approachable, and lighthearted, accompanied by practical Python code to help you experiment and build intuition.

Note: If you feel comfortable with self-information, feel free to skip these examples and go straight to the Quantifying Uncertainty article.

To further explore self-information and bits, I find coin flips and dice rolls particularly effective, as they are often useful analogies for real-world phenomena. Formally, these can be described as multinomial trials with n=1 trial. Specifically:

A coin flip is a Bernoulli trial, where there are c=2 possible outcomes (e.g., heads or tails).
Rolling a die represents a categorical trial, where c≥3 outcomes are possible (e.g., rolling a six-sided or eight-sided die).

As a use case we’ll use simplistic weather reports limited to featuring sun 🌞 , rain 🌧 , and snow ⛄️.

Now, let’s flip some virtual coins 👍 and roll some funky-looking dice 🎲 …

Fair Coins and Dice

We’ll start with the simplest case of a fair coin (i.e, 50% chance for success/Heads or failure/Tails).

Imagine an area for which on any given day there is a 50:50 chance of sun or rain. We can write the probability of each event as: p(🌞 )=p(🌧 )=½.

As seen above, according to the self-information formulation, when 🌞 or 🌧 is reported we are provided with h(🌞 )=h(🌧 )=-log₂(½)=1 bit of information.

We will continue to build on this analogy later on, but for now let’s turn to a variable that has more than two outcomes (c≥3).

Before we address the standard six-sided die, to simplify the maths and intuition, let’s assume an 8-sided one (c=8) as in Dungeons & Dragons and other tabletop games. In this case each event (i.e., landing on each side) has a probability of p(🔲 ) = ⅛.

When a die lands on one side facing up, e.g, value 7️⃣, we are provided with h(🔲 =7️⃣)=-log₂(⅛)=3 bits of information.

For a standard six-sided fair die: p(🔲 ) = ⅙ → an event yields h(🔲 )=-log₂(⅙)≈2.58 bits.

Comparing the amount of information from the fair coin (1 bit), 6 sided die (2.58 bits) and 8 sided (3 bits) we identify the second axiom: The less probable an event is, the more surprising it is and the more information it yields.
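For fair trials the calculation reduces to h = -log₂(1/c) = log₂(c), where c is the number of equally likely outcomes. The following is a minimal sketch of that comparison; the function name `fair_outcome_information` is a hypothetical helper, not something defined elsewhere in this series.

```python
import math

def fair_outcome_information(c):
    """Bits conveyed by one outcome of a fair trial with c equally likely outcomes."""
    return math.log2(c)  # equivalent to -log2(1/c)

for name, c in [("fair coin", 2), ("six-sided die", 6), ("eight-sided die", 8)]:
    print(f"{name}: p = 1/{c}, h = {fair_outcome_information(c):.2f} bits")
```

The more sides a fair die has, the less probable each individual outcome is and the more bits a single roll conveys.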

Self information becomes even more interesting when probabilities are skewed to prefer certain events.

Loaded Coins and Dice

Let’s assume a region where p(🌞 ) = ¾ and p(🌧 )= ¼.

When rain is reported the amount of information conveyed is not 1 bit but rather h(🌧 )=-log₂(¼)=2 bits.

When sun is reported less information is conveyed: h(🌞 )=-log₂(¾)≈0.42 bits.

As per the second axiom, a rarer event, like p(🌧 )=¼, reveals more information than a more likely one, like p(🌞 )=¾.

To further drive this point let’s now assume a desert region where p(🌞 ) =99% and p(🌧 )= 1%.

If sunshine is reported – that is kind of expected – so nothing much is learnt (“nothing new under the sun” 🥁) and this is quantified as h(🌞 )=0.01 bits. If rain is reported, however, you can imagine being quite surprised. This is quantified as h(🌧 )=6.64 bits.

In the following Python scripts you can examine all the above examples, and I encourage you to play with your own to get a feel for it.

First let’s define the calculation and printout function:

import numpy as np

def print_events_self_information(probs):
    for ps in probs:
        print(f"Given distribution {ps}")
        for event in ps:
            if ps[event] != 0:
                self_information = -np.log2(ps[event]) #same as: -np.log(ps[event])/np.log(2) 
                text_ = f'When `{event}` occurs {self_information:0.2f} bits of information is communicated'
                print(text_)
            else:
                print(f'a `{event}` event cannot happen p=0 ')
        print("=" * 20)

Next we’ll set a few example distributions of weather frequencies

# Setting multiple probability distributions (each sums to 100%)
# Fun fact - 🐍  💚  Emojis!
probs = [{'🌞   ': 0.5, '🌧   ': 0.5},   # half-half
        {'🌞   ': 0.75, '🌧   ': 0.25},  # more sun than rain
        {'🌞   ': 0.99, '🌧   ': 0.01} , # mostly sunshine
]

print_events_self_information(probs)

This yields printout

Given distribution {'🌞   ': 0.5, '🌧   ': 0.5}
When `🌞   ` occurs 1.00 bits of information is communicated
When `🌧   ` occurs 1.00 bits of information is communicated
====================
Given distribution {'🌞   ': 0.75, '🌧   ': 0.25}
When `🌞   ` occurs 0.42 bits of information is communicated
When `🌧   ` occurs 2.00 bits of information is communicated
====================
Given distribution {'🌞   ': 0.99, '🌧   ': 0.01}
When `🌞   ` occurs 0.01 bits of information is communicated
When `🌧   ` occurs 6.64 bits of information is communicated
====================

Let’s examine the case of a loaded three-sided die, e.g., the weather in an area that reports sun, rain and snow at uneven probabilities: p(🌞 ) = 0.2, p(🌧 )=0.7, p(⛄️)=0.1.

Running the following

print_events_self_information([{'🌞 ': 0.2, '🌧 ': 0.7, '⛄️': 0.1}])

yields

Given distribution {'🌞 ': 0.2, '🌧 ': 0.7, '⛄️': 0.1}
When `🌞 ` occurs 2.32 bits of information is communicated
When `🌧 ` occurs 0.51 bits of information is communicated
When `⛄️` occurs 3.32 bits of information is communicated
====================

What we saw for the binary case applies to higher dimensions.

To summarise – we clearly see the implications of the second axiom:

When a highly expected event occurs – we do not learn much, the bit count is low.
When an unexpected event occurs – we learn a lot, the bit count is high.
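One way to see this ordering explicitly is to rank the events of the loaded three-outcome weather example by probability and by information. This is only an illustrative sketch: plain-text keys (`sun`, `rain`, `snow`) stand in for the emojis, and the variable names are assumptions.

```python
import math

# Loaded three-outcome weather distribution: p(sun)=0.2, p(rain)=0.7, p(snow)=0.1
weather = {"sun": 0.2, "rain": 0.7, "snow": 0.1}

# Self-information of each event in bits
info_bits = {event: -math.log2(p) for event, p in weather.items()}

# Events from most to least probable, and from least to most informative
by_probability = sorted(weather, key=weather.get, reverse=True)
by_information = sorted(info_bits, key=info_bits.get)

for event in by_probability:
    print(f"{event}: p = {weather[event]:.2f}, h = {info_bits[event]:.2f} bits")

# The two orderings coincide: the most expected event carries the fewest bits
print(by_probability == by_information)
```

Sorting by descending probability and by ascending information yields the same order, which is the second axiom restated as a ranking.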

Event Information Summary

In this article we embarked on a journey into the foundational concepts of information theory, defining how to measure the surprise of an event. The notions introduced here serve as the bedrock of many tools in information theory, from assessing data distributions to unraveling the inner workings of machine learning algorithms.

Through simple yet insightful examples like coin flips and dice rolls, we explored how self-information quantifies the unpredictability of specific outcomes. Expressed in bits, this measure encapsulates Shannon’s second axiom: rarer events convey more information.

While we’ve focused on the information content of specific events, this naturally leads to a broader question: what is the average amount of information associated with all possible outcomes of a variable?

In the next article, Quantifying Uncertainty, we build on the foundation of self-information and bits to explore entropy – the measure of average uncertainty. Far from being just a beautiful theoretical construct, it has practical applications in data analysis and machine learning, powering tasks like decision tree optimisation, estimating diversity and more.


About This Series

Even though I have twenty years of experience in data analysis and predictive modelling I always felt quite uneasy about using concepts in information theory without truly understanding them.

The purpose of this series was to put me more at ease with concepts of information theory and hopefully provide for others the explanations I needed.

🤷 Quantifying Uncertainty – A Data Scientist’s Intro To Information Theory – Part 2/4: Entropy. Gain intuition into Entropy and master its applications in Machine Learning and Data Analysis. Python code included. 🐍


Footnotes

¹ A Mathematical Theory of Communication, Claude E. Shannon, Bell System Technical Journal 1948.

It was republished as a book, The Mathematical Theory of Communication, in 1949.

[Shannon’s “A Mathematical Theory of Communication”] the blueprint for the digital era – Historian James Gleick

² See Wikipedia page on Information Content (i.e, self-information) for a detailed derivation that only the log function meets all three axioms.

³ The decimal-digit was later renamed to a hartley (symbol Hart), a ban or a dit. See Hartley (unit) Wikipedia page.

Credits

Unless otherwise noted, all images were created by the author.

Many thanks to Will Reynolds and Pascal Bugnion for their useful comments.

Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy, bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

Glenfarne Group secures $500 million for Texas LNG development

Glenfarne Group LLC has secured fresh capital to continue development and early construction works for the proposed Texas LNG plant to be constructed on a 625-acre site in the Port of Brownsville, Tex. HPS Investment Partners, a part of BlackRock, agreed to the $500-million investment, which serves as one of

AI workloads shake up observability market

There are 19 vendors that made the cut for Gartner’s new report. Its Leaders quadrant includes (alphabetically) Chronosphere, Coralogix, Datadog, Dynatrace, Elastic, Grafana Labs, IBM, and New Relic. The Challengers are Alibaba Cloud, Amazon Web Services, LogicMonitor, Microsoft, and Splunk. The two Visionaries are BMC Helix and Honeycomb. Those dubbed

Huawei eying possible DRAM market entry

Chinese tech giant Huawei is reportedly entering the DRAM manufacturing business in a bid to cash in on the insane profitability of memory sales. Three firms – Micron Technology, SK hynix, and Samsung Electronics — account for 95% of the DRAM on the market worldwide. The rest is small players,

IBM targets AI edge with Power server, software upgrades

IBM has bolstered its Power server portfolio with a new edge S1112 server and announced IBM Power Autonomous Operations, an AI agent that helps customers monitor Power systems and autonomously resolve issues to keep operations running smoothly. Additional software upgrades are aimed at helping customers deploy and manage AI infrastructure

S&P Global: Hormuz vessel transits fall amid heightened security risks

Vessel traffic through the Strait of Hormuz remained subdued July 10-12 as heightened regional security risks continued to weigh on movements through the strategic waterway, according to S&P Global MINT and S&P Global Commodities at Sea data. A total of 73 vessels transited the strait during the 3-day period, averaging fewer than 25 crossings/day. Transits fell to 11 on July 12, the lowest since June 14, after Iran declared the strait closed amid what the Persian Gulf Strait Authority described as “illegal movements” of US military forces in the region. No inbound crossings were recorded July 12, the first such occurrence since June 12. Six of the day’s 11 transits were assessed as compliant vessels. Total crossings were 32 on July 10 and 30 on July 11. The Joint Maritime Information Center (JMIC) said July 12 that the regional threat level remained severe. Despite Iran’s closure declaration, JMIC said the southern route remained available and had been expanded for two-way vessel traffic. Energy carriers—including oil, chemical, LPG, and LNG tankers—accounted for about 48% of transits July 10-12. About two-thirds of energy-carrier crossings involved compliant vessels, although only 10 compliant energy carriers entered the Persian Gulf, mostly without visible automatic identification system (AIS) signals. Inbound tanker capacity also softened. An average 6.5 million b/d of new oil and LPG tanker capacity entered the Gulf through Hormuz July 1-12, with VLCCs and Suezmaxes accounting for nearly 80%. Average inbound capacity fell to 6 million b/d July 10-12 from 8.5 million b/d in the first week of July. All compliant outbound energy carriers transiting Hormuz during the 3-day period did so without visible AIS signals, including ADNOC-operated LNG carrier AL HAMRA and several VLCC and product tankers. Iran-linked and US-sanctioned vessels accounted for nearly 60% of all crossings during the period.

Beyond AI Pilots: Scaling AI-Enabled Decision Making in Energy

Date: Thursday, August 6, 2026Time: 11:00 AM (GMT-04:00) Eastern Time – New YorkDuration: 60 minutes Already registered? Click here to log in now. Artificial Intelligence is rapidly becoming a strategic priority across industrial organizations, yet many companies continue to struggle with fragmented data, disconnected workflows, and AI initiatives that never move beyond pilot projects. The challenge is not access to AI—it is creating the business context, governance, and lifecycle intelligence needed to transform AI insights into measurable operational outcomes. Join Siemens Digital Industries Software to learn how Intelligence Center X, part of the Siemens Xcelerator portfolio, helps organizations connect enterprise data, workflows, and AI capabilities into a single governed environment where people and AI work together to drive faster, more informed decisions. In this session, we’ll explore how organizations can: • Move beyond isolated AI experiments to enterprise-scale deployment • Connect engineering, manufacturing, operations, supply chain, and service data into a unified intelligence framework • Enable AI agents to operate within governed, human-in-the-loop business processes • Improve operational performance through AI-assisted decision-making • Accelerate issue resolution, reduce manual effort, and increase organizational agility Attendees will also learn how Intelligence Center X combines lifecycle intelligence, industrial data models, AI orchestration, and low-code application development to create production-ready AI solutions that deliver measurable business value. Real-world examples will demonstrate how organizations have achieved significant improvements, including reductions in manual effort, faster issue resolution, improved data quality, and enhanced decision-making capabilities. 
Whether you are responsible for digital transformation, operations, manufacturing, engineering, or executive strategy, this webinar will provide practical insight into building a scalable foundation for industrial AI and creating a future where people and AI work together to drive business outcomes.

TotalEnergies lets drilling, completions contract for Suriname deepwater oil project

TotalEnergies has let contracts to Halliburton for work on the GranMorgu deepwater oil development project offshore Suriname. The workscope includes drilling and completions services for a long-term program that includes applying integrated digital workflows, real time data, and remote operations control for drilling and completions. As part of the project scope, Halliburton worked with local suppliers to upgrade its liquid mud and cement plant and supported construction of Suriname’s first completions and drilling workshop, featuring advanced maintenance and repair capabilities, the service provider said in a release July 13. The aim of the GranMorgu project is to develop resources on Block 58, which lies about 150 km off the Surinamese coast. Specifically, Sapakara and Krabdagu fields, which contain estimated recoverable reserves of nearly 760 million bbl, TotalEnergies noted on its website. The project’s floating production, storage, and offloading unit (FPSO), with a capacity of 220,000 b/d, is based on tested design principles of units in nearby Guyana and designed for potential future tie-in of satellite fields. Production start-up is expected in 2028. TotalEnergies is operator of the project with 40% interest. Partners are APA Corp. (40%) and state-owned Staatsolie Maatschappij Suriname NV (20%).

Aramco lets stimulation, completion services contract for unconventional gas development

Saudi Aramco has awarded Halliburton a multi-year contract to provide stimulation and completion services for the company’s unconventional gas development program in Saudi Arabia. Halliburton said July 15 that the award is part of a broader multibillion-dollar contract framework supporting the Kingdom’s unconventional resource expansion. Under the agreement, Halliburton will deploy intelligent fracturing automation technologies designed to optimize treatment performance in real time and support execution across multiwell development campaigns. The company said the technologies will enable greater digital integration across field operations. Development of the Jafurah unconventional gas field, the Middle East’s largest liquids-rich shale gas play, is under way. In support of the program, Halliburton plans to expand local manufacturing capacity, strengthen its supply chain network, and increase workforce development initiatives within the Kingdom as activity levels continue to grow. “Beginning in the third quarter of 2026, Halliburton will deploy the Kingdom’s first fully integrated intelligent fracturing platform through OCTIV® Auto Frac and Sensori™ fracturing monitoring services to contribute to asset value for one of the world’s largest unconventional fields,” said Rami Yassine, senior vice-president, Eastern Hemisphere, Halliburton. Jafurah background Jafurah is a key component of Aramco’s gas expansion strategy intended to help meet rising demand for natural gas in power generation and industry. In February 2026, the operator said it seeks to expand sales gas production capacity by about 80% by 2030 compared with 2021 production levels. At the time, Aramco said unconventional shale gas output from Jafurah began in December 2025. The field covers about 17,000 sq km and is estimated to contain 229 tcf of raw gas and 75 billion stb of condensate. 
Aramco expects the development to produce 2 bcfd of sales gas, 420 MMscfd of ethane, and about 630,000 b/d of high-value liquids by 2030.

Digitalization paying off for Rompetrol’s Petromidia refinery

Rompetrol Rafinare SA—jointly owned by Kazakhstan’s state-owned JSC NC KazMunayGas (KMG) subsidiary KMG International NV (54.63%) and Romania’s Ministry of Economy, Energy & Business Environment (44.7%)—is using proprietary operations management software from Emerson Electric Co. to improve alarm performance its more than 5-million tonne/year Petromidia refinery in Năvodari, Romania, on the Black Sea. To date, implementation of Emerson’s DeltaV AgileOps operations management software has helped reduce distributed control system (DCS) alarm volumes at the Petromidia refinery by more than 95%, the service provider said on July 14. Emerson said the project improved alarm performance, increased operator effectiveness, and brought alarm rates within the Engineering Equipment and Materials Users Association (EEMUA) 191 guideline recommendations. Before implementation of DeltaV AgileOps, alarm behavior at the refinery—Romania’s largest—expanded beyond recommended best practices, including high alarm volumes during plant disturbances, nuisance-chattering alarms, and alarms that remained active during normal operation. To address those issues, Rompetrol Rafinare worked with KMG International’s engineering and maintenance services provider SC Rominserv SRL to improve alarm quality and reduce nuisance alarms across the refinery. Use of DeltaV AgileOps—which pulls alarm and event data directly from the DeltaV DCS running the plant—provided continuous visibility into alarm performance, including average and peak alarm rates, recurring alarm sequences, and time spent outside recommended operating thresholds, Emerson said. Following implementation, engineering teams at the refinery used performance dashboards and historical trending to identify high-frequency alarms, stale alarms, and nuisance “bad actor” alarms responsible for disproportionate alarm activity. 
The teams evaluated alarm behavior during steady-state operation, startup conditions, and process disturbances, then assessed proposed changes to alarm limits, priorities, and suppression strategies against plant data. Emerson said the project reduced alarm generation to fewer than 50,000 alarms/month from more than 2 million alarms/month during normal operation. Emerson—which linked the outcome to EEMUA 191 guidance that

EIA: US crude inventories down 1.7 million bbl

US crude oil inventories for the week ended July 10, excluding the Strategic Petroleum Reserve, decreased by 1.7 million bbl from the previous week, according to data from the US Energy Information Administration (EIA). At 409.7 million bbl, US crude oil inventories are about 6% below the 5-year average for this time of year, the EIA report indicated. EIA said total motor gasoline inventories decreased by 1.5 million bbl from last week and are 8% below the 5-year average for this time of year. Finished gasoline inventories and blending components inventories both decreased last week. Distillate fuel inventories increased by 4.6 million bbl last week and are about 11% below the 5-year average for this time of year. Propane-propylene inventories increased by 3 million bbl from last week and are 28% above the 5-year average for this time of year, EIA said. US crude oil refinery inputs averaged 17.1 million b/d for the week ended July 10, which was 99,000 b/d more than the previous week’s average. Refineries operated at 96.2% of capacity. Gasoline production decreased, averaging 9.6 million b/d. Distillate fuel production increased, averaging 5.3 million b/d. US crude oil imports averaged 5.7 million b/d, up 60,000 b/d from the previous week. Over the last 4 weeks, crude oil imports averaged about 5.5 million b/d, 12.2% less than the same 4-week period last year. Total motor gasoline imports averaged 354,000 b/d. Distillate fuel imports averaged 93,000 b/d.

The AI Infrastructure Split Screen: Capital Rush Meets Community Resistance

It would be difficult to construct a more revealing snapshot of the AI infrastructure market than the one delivered in mid-July. In the same news cycle, Csquare completed a billion-dollar initial public offering, Switch was linked to a potential $10 billion IPO, and Databricks reached a reported valuation of $188 billion. At the project level, developers advanced or disclosed campuses measured not in tens or hundreds of megawatts, but in gigawatts—from Meta’s expanding Louisiana complex and Google’s reported Wyoming plans to new Crusoe, QTS, MARA and Tract developments. Yet the same week brought a state-level permitting pause in New York, a decisive project rejection in Palm Beach County, planned protests across more than 20 states, and fresh disputes over parkland, water availability and local control. This is the data center and AI landscape in 2026: capital is abundant but increasingly discriminating; power is more valuable than the underlying real estate; and community consent has become nearly as important as interconnection capacity. Public Markets Put Different Prices on the AI Stack The capital-market headlines illustrated how differently investors are valuing the various layers of AI infrastructure. Csquare priced 50 million shares at $21, raising approximately $1.05 billion and establishing an equity valuation of roughly $3.2 billion. The offering was substantial, but it priced below the proposed $23-to-$27 range, and the shares finished their first trading day slightly below the offer price. Brookfield retained approximately 67% of the company’s voting power following the transaction. That reception contrasts sharply with the valuation being discussed for Switch. The DigitalBridge-backed operator has reportedly engaged Goldman Sachs and JPMorgan for a potential IPO that could raise as much as $10 billion and value Switch near $80 billion, including debt. 
The transaction remains prospective, but the figure is striking when compared with the $11 billion take-private agreement

New York State just hit pause on the AI data center boom

The moratorium could result in some “border-hopping,” with enterprises hosting local servers in adjacent states like Pennsylvania, Connecticut, or New Jersey, but that’s not likely to be widespread, Kimball noted. The realistic regional impact will be “more of a slow squeeze rather than a shock,” he said. This could result in tighter colocation availability and firmer pricing in the New York Metropolitan area over the next few years. Cloud providers may also steer new AI capacity to regions like Georgia, Ohio, Texas, and Utah, where power and permitting are more predictable. An inflection point, but more trickle-down than direct impact Indeed, noted Jeremy Roberts, senior director for research and content at Info-Tech Research Group, the moratorium is an “inflection point” and a “way to placate an increasingly angry public,”.

TeraWulf’s $19B Anthropic Lease Puts Its Brownfield AI Strategy to the Test

He added that the company’s strategy is centered on owning and operating critical infrastructure, maintaining direct relationships with customers and controlling the long-term evolution of its campuses. This Model Differs Significantly from the Previous Abernathy JV TeraWulf and Fluidstack created the Abernathy venture in 2025 to develop a 168-MW critical IT load campus on approximately 120 acres near Abernathy, Texas. The project’s total utility requirement has been described as approximately 240 MW. Fluidstack committed to a 25-year lease at the campus, with Google providing approximately $1.3 billion of credit support for Fluidstack’s obligations. TeraWulf acquired a 50.1% interest in the joint venture through an investment of approximately $450 million. The project subsequently issued $1.3 billion in senior secured notes to support construction and related expenses. The Abernathy agreements were expected to produce approximately $9.5 billion in contracted revenue for the joint venture over the initial 25-year term. Construction has been advancing toward delivery during the second half of 2026. Following the sale, Fluidstack and the other purchasers will control the project. TeraWulf agreed to sell its Abernathy interest for approximately $530 million, compared with its $450 million investment in the joint venture. The consideration is scheduled to be paid in three installments through April 2027, with the proceeds expected to support investment in infrastructure opportunities that TeraWulf intends to own and operate directly. The decision does not necessarily indicate that TeraWulf has become less interested in partnerships with Fluidstack. Fluidstack remains an important tenant at TeraWulf’s Lake Mariner campus in New York, and the companies have built a substantial pipeline of AI infrastructure together. In infrastructure terms, TeraWulf is acting as both developer and capital allocator. 
It originated the Abernathy project, helped secure the customer and financing structure, advanced construction and is now monetizing its interest before the campus begins

Comparing Space-Driven Data Center Strategies: Modular Satellites vs. Integrated Rocket Nodes

In addition to developing radiation-tolerant computing, optical communications, deployable solar arrays and orbital thermal-management systems, Cowboy must successfully design, manufacture, test and license a new rocket. Its launch vehicle would require authorization from the Federal Aviation Administration in addition to the approvals needed for the satellite constellation. Cowboy nevertheless enters the race with considerably more capital than Orbital. The company announced a $275 million Series B round in May at a reported $2 billion valuation. Founded in 2024 by Robinhood co-founder Baiju Bhatt, with a focus on space-based solar power before expanding into orbital computing and launch systems. One Hundred Kilowatts Versus One Megawatt The clearest distinction between the two proposals is the capacity assigned to each node. Orbital’s production design calls for approximately 100 kilowatts of computing power per satellite. Cowboy is targeting megawatt-class spacecraft, potentially giving each Stampede node approximately 10 times the power capacity of an Orbital satellite. At their stated maximum scales, Orbital’s 100,000 satellites would provide approximately 10 gigawatts. If Cowboy ultimately achieved one megawatt across all 20,000 Stampede spacecraft, its theoretical aggregate capacity would approach 20 gigawatts. Those figures should be treated as design objectives, not capacity forecasts. Neither company has demonstrated even one operational node at its proposed production power level. Orbital’s smaller satellites may be easier to test and deploy incrementally. The company can begin with a single hosted GPU, progress to a purpose-built prototype and expand as launch economics and customer demand permit. Cowboy’s larger nodes could provide more useful computing capacity with fewer satellites and potentially fewer launches. 
Combining the rocket stage and data center would also reduce the amount of structural mass that does not directly support power generation or computing. The tradeoff is concentration risk. The failure of a megawatt Cowboy spacecraft would remove considerably more capacity than

Google Cloud configuration update disrupts VMware Engine stretched clusters

“Google made a network setting change that accidentally broke the connection between the two data center zones in VMware Engine. The virtual machines themselves kept running fine, but nobody could reach them, and there was a risk that some machines might lose the ability to save data properly. This indicates that even managed cloud infrastructure can experience failures in critical shared network components,” said Pareekh Jain, CEO at EIIRTrend & Pareekh Consulting. Neil Shah, vice president at Counterpoint Research, said the real culprit here is the SDN orchestration control plane, where a routine internal network update or configuration tweak introduced routing failure across multiple zones. “While most of the physical nodes are distributed for exactly this redundancy purpose, they are still tightly coupled to a singular shared orchestration fabric, so if that control plane crashes, then everything comes crashing down, and the physical distributed nodes become irrelevant.” Stretched clusters fall short Although the outage did not bring down virtual machines, the incident undermined the primary reason enterprises deploy stretched clusters.

AI’s Future Must Return to the Edge: How Power Constraints and Local Politics Are Redefining AI Infrastructure

Over the past two years, AI build plans have driven a sharp escalation in projected data center power demand. One recent assessment¹ found that the disclosed U.S. data center development pipeline reached roughly 241 gigawatts by the end of 2025, an increase of about 159% in a single year, illustrating the unprecedented pace at which AI infrastructure demand is expanding. Forecasts from major analysts indicate that total data center power consumption could grow at least 50% by 2027 and potentially as much as 165% by 2030, with AI training and inference responsible for most of the incremental load.² At this pace, planned AI capacity is growing faster than electric infrastructure can realistically be expanded. In many markets, available land and fiber are not the limiting factors; dependable megawatt delivery is.³ At the facility level, AI hardware is moving standard designs into new ranges. Power densities that once centered around 10–20 kW per rack are being replaced by configurations nearer 40 kW, with dense AI racks pushing toward 85 kW today and credible roadmaps to 200–250 kW per rack by 2030, though reports of even higher densities have circulated. These levels do not only affect cooling and white-space layouts; they materially change the electrical infrastructure required per room and per building, and by extension the strain on local grids. On the power-system side, constraints are now explicit. Transmission operators and regulators are stating that current generation, interconnection, and build-out timelines are not sufficient to accommodate another decade of large demand centers in their present form. Analysts tracking AI data center energy demand point to electricity, grid access, and firm capacity as the primary constraints on new builds, with grid bottlenecks and transmission limitations flagged as risks for up to 20% of planned projects.⁴, ⁵

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one ramping up its investments in AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs). In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple to devote $200 billion between them to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are far higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

John Deere unveils more autonomous farm machines to address skilled labor shortage

Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. Moline, Illinois-based John Deere has been in business for 187 years, yet this non-tech company has become a regular at the big tech trade show in Las Vegas, and it is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech. The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually, and the agricultural work force continues to shrink. (This is my hint to the anti-immigration crowd). John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app. While each of these industries experiences its own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

2025 playbook for enterprise AI success, from agents to evals

2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year. 1. Agents: the next generation of automation AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for companies, and that recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more. The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks. Going all-in on red teaming pays practical, competitive dividends It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S.’s National Institute of Standards and Technology (NIST), which all had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find. 
What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle