Stay Ahead, Stay ONMINE

How to Spot and Prevent Model Drift Before it Impacts Your Business

Despite the AI hype, many tech companies still rely heavily on machine learning to power critical applications, from personalized recommendations to fraud detection. 

I’ve seen firsthand how undetected drifts can result in significant costs — missed fraud detection, lost revenue, and suboptimal business outcomes, just to name a few. So, it’s crucial to have robust monitoring in place if your company has deployed or plans to deploy machine learning models into production.

Undetected Model Drift can lead to significant financial losses, operational inefficiencies, and even damage to a company’s reputation. To mitigate these risks, it’s important to have effective model monitoring, which involves:

  • Tracking model performance
  • Monitoring feature distributions
  • Detecting both univariate and multivariate drifts

A well-implemented monitoring system can help identify issues early, saving considerable time, money, and resources.

In this guide, I’ll provide a framework for thinking about and implementing effective model monitoring, helping you stay ahead of potential issues and ensure the stability and reliability of your models in production.

What’s the difference between feature drift and score drift?

Score drift refers to a gradual change in the distribution of model scores. If left unchecked, this could lead to a decline in model performance, making the model less accurate over time.

On the other hand, feature drift occurs when one or more features experience changes in the distribution. These changes in feature values can affect the underlying relationships that the model has learned, and ultimately lead to inaccurate model predictions.

Simulating score shifts

To model real-world fraud detection challenges, I created a synthetic dataset with five financial transaction features.

The reference dataset represents the original distribution, while the production dataset introduces shifts to simulate an increase in high-value transactions without PIN verification on newer accounts, indicating an increase in fraud.

Each feature has different underlying distributions:

  • Transaction Amount: Log-normal distribution (right-skewed with a long tail)
  • Account Age (months): Clipped normal distribution between 0 and 60 (assuming a 5-year-old company)
  • Time Since Last Transaction: Exponential distribution
  • Transaction Count: Poisson distribution
  • Entered PIN: Binomial distribution.

To approximate model scores, I randomly assigned weights to these features and applied a sigmoid function to constrain predictions between 0 and 1. This mimics how a logistic regression fraud model generates risk scores.
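The setup described above can be sketched roughly as follows. This is a minimal illustration, not the exact code behind the charts: the distribution parameters, the fixed weights (the article assigns them randomly), the bias, and the helper names `make_transactions` and `add_model_score` are all assumptions chosen for readability. Only the column names `transaction_amount`, `account_age_in_months`, and `model_score` appear elsewhere in this post.

```python
import numpy as np
import pandas as pd

def make_transactions(n, rng, amount_mu, age_mean, count_rate, pin_rate):
    """Draw n synthetic transactions; every parameter value is an assumption."""
    return pd.DataFrame({
        # Log-normal: right-skewed with a long tail
        "transaction_amount": rng.lognormal(mean=amount_mu, sigma=1.0, size=n),
        # Normal, clipped to the 0-60 month range
        "account_age_in_months": np.clip(rng.normal(age_mean, 12.0, size=n), 0, 60),
        # Exponential, identical in both datasets (the stable feature)
        "time_since_last_transaction": rng.exponential(scale=5.0, size=n),
        # Poisson counts
        "transaction_count": rng.poisson(lam=count_rate, size=n),
        # Binomial with a single trial, i.e. a 0/1 flag
        "entered_pin": rng.binomial(n=1, p=pin_rate, size=n),
    })

def add_model_score(df, weights, bias):
    """Weighted sum of the features passed through a sigmoid."""
    cols = list(weights)
    z = df[cols].to_numpy(dtype=float) @ np.array([weights[c] for c in cols]) + bias
    return df.assign(model_score=1.0 / (1.0 + np.exp(-z)))

rng = np.random.default_rng(42)
ref_data = make_transactions(5000, rng, amount_mu=3.0, age_mean=30.0,
                             count_rate=5.0, pin_rate=0.9)
# Production: higher amounts, newer accounts, more transactions, fewer PIN entries
prod_data = make_transactions(5000, rng, amount_mu=3.5, age_mean=20.0,
                              count_rate=6.0, pin_rate=0.8)

# Hypothetical weights: a positive weight raises the risk score
weights = {
    "transaction_amount": 0.01,
    "account_age_in_months": -0.03,
    "time_since_last_transaction": 0.0,
    "transaction_count": 0.05,
    "entered_pin": -1.0,
}
ref_data = add_model_score(ref_data, weights, bias=-0.5)
prod_data = add_model_score(prod_data, weights, bias=-0.5)

print(ref_data["model_score"].mean(), prod_data["model_score"].mean())
```

With these assumed shifts, every drifted feature pushes the weighted sum upward, so the production scores end up higher on average than the reference scores.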

As shown in the plot below:

  • Drifted features: Transaction Amount, Account Age, Transaction Count, and Entered PIN all experienced shifts in distribution, scale, or relationships.
Distribution of drifted features (image by author)
  • Stable feature: Time Since Last Transaction remained unchanged.
Distribution of stable feature (image by author)
  • Drifted scores: As a result of the drifted features, the distribution in model scores has also changed.
Distribution of model scores (image by author)

This setup allows us to analyze how feature drift impacts model scores in production.

Detecting model score drift using PSI

To monitor model scores, I used the population stability index (PSI) to measure how much the model score distribution has shifted over time.

PSI works by binning continuous model scores and comparing the proportion of scores in each bin between the reference and production datasets. It compares the differences in proportions and their logarithmic ratios to compute a single summary statistic to quantify the drift.
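Written out, the calculation described above looks like this. The symbols are my own notation rather than anything standard to this post: r_i and p_i are the proportions of reference and production scores falling in bin i, and B is the number of bins.

```latex
\mathrm{PSI} = \sum_{i=1}^{B} \left(r_i - p_i\right)\,\ln\!\left(\frac{r_i}{p_i}\right)
```

As a small worked example with two bins, take r = (0.5, 0.5) and p = (0.4, 0.6). Then PSI = 0.1 · ln(1.25) + (−0.1) · ln(0.8333) ≈ 0.0223 + 0.0182 ≈ 0.0405. Both terms are positive because the difference and the log ratio always share the same sign, so PSI can only grow as the two distributions move apart.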

Python implementation:

import numpy as np

# Define function to calculate PSI given two datasets
def calculate_psi(reference, production, bins=10):
  # Discretize scores into bins
  min_val, max_val = 0, 1
  bin_edges = np.linspace(min_val, max_val, bins + 1)

  # Calculate proportions in each bin
  ref_counts, _ = np.histogram(reference, bins=bin_edges)
  prod_counts, _ = np.histogram(production, bins=bin_edges)

  ref_proportions = ref_counts / len(reference)
  prod_proportions = prod_counts / len(production)
  
  # Avoid division by zero
  ref_proportions = np.clip(ref_proportions, 1e-8, 1)
  prod_proportions = np.clip(prod_proportions, 1e-8, 1)

  # Calculate PSI for each bin
  psi = np.sum((ref_proportions - prod_proportions) * np.log(ref_proportions / prod_proportions))

  return psi
  
# Calculate PSI
psi_value = calculate_psi(ref_data['model_score'], prod_data['model_score'], bins=10)
print(f"PSI Value: {psi_value}")

Below is a summary of how to interpret PSI values:

  • PSI < 0.1: No drift, or very minor drift (distributions are almost identical).
  • 0.1 ≤ PSI < 0.25: Some drift. The distributions are somewhat different.
  • 0.25 ≤ PSI < 0.5: Moderate drift. A noticeable shift between the reference and production distributions.
  • PSI ≥ 0.5: Significant drift. There is a large shift, indicating that the distribution in production has changed substantially from the reference data.
Histogram of model score distributions (image by author)

The PSI value of 0.6374 suggests a significant drift between our reference and production datasets. This aligns with the histogram of model score distributions, which visually confirms the shift towards higher scores in production — indicating an increase in risky transactions.
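For automated monitoring, the thresholds listed above can be wrapped in a small helper. This is only a sketch: the function name `interpret_psi` and the label strings are hypothetical, and the cut-offs are the rule-of-thumb values from this post, not universal constants.

```python
def interpret_psi(psi_value):
    """Map a PSI value to the drift labels used in this post."""
    if psi_value < 0.1:
        return "no or very minor drift"
    if psi_value < 0.25:
        return "some drift"
    if psi_value < 0.5:
        return "moderate drift"
    return "significant drift"

# The PSI of 0.6374 from the example falls in the top band
print(interpret_psi(0.6374))
```

In practice, you might tune these bands per model, since the same PSI can matter more for a fraud model than for a low-stakes ranking model.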

Detecting feature drift

Kolmogorov-Smirnov test for numeric features

The Kolmogorov-Smirnov (K-S) test is my preferred method for detecting drift in numeric features because it is non-parametric: it makes no assumption about the shape of the underlying distribution, such as normality.

The test compares a feature’s distribution in the reference and production datasets by measuring the maximum difference between the empirical cumulative distribution functions (ECDFs). The resulting K-S statistic ranges from 0 to 1:

  • 0 indicates no difference between the two distributions.
  • Values closer to 1 suggest a greater shift.
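To make the "maximum difference between ECDFs" idea concrete, here is a minimal sketch that computes the statistic by hand and checks it against SciPy's `ks_2samp`. The helper name `ks_statistic` and the two sample arrays are illustrative assumptions, not part of the dataset used in this post.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic(reference, production):
    """Largest vertical gap between the two empirical CDFs."""
    reference = np.sort(np.asarray(reference, dtype=float))
    production = np.sort(np.asarray(production, dtype=float))
    # Evaluate both ECDFs at every observed value from either sample
    grid = np.concatenate([reference, production])
    ref_ecdf = np.searchsorted(reference, grid, side="right") / len(reference)
    prod_ecdf = np.searchsorted(production, grid, side="right") / len(production)
    return np.max(np.abs(ref_ecdf - prod_ecdf))

rng = np.random.default_rng(0)
ref_sample = rng.normal(30, 10, size=2000)   # stand-in for a reference feature
prod_sample = rng.normal(25, 10, size=2000)  # same feature, shifted lower

ks_manual = ks_statistic(ref_sample, prod_sample)
ks_scipy, p_value = ks_2samp(ref_sample, prod_sample)
print(ks_manual, ks_scipy, p_value)
```

The two statistics should agree, which is a useful sanity check when you want to plot the ECDFs yourself and mark where the largest gap occurs.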

Python implementation:

import pandas as pd
from scipy.stats import ks_2samp

# Collect one result row per numeric feature
ks_rows = []

# Loop through all features and perform the K-S test
for col in numeric_cols:
    ks_stat, p_value = ks_2samp(ref_data[col], prod_data[col])
    drift_detected = p_value < 0.05

    # Store results for this feature
    ks_rows.append({
        'Feature': col,
        'KS Statistic': ks_stat,
        'p-value': p_value,
        'Drift Detected': drift_detected
    })

# Build the results dataframe once, after the loop
ks_results = pd.DataFrame(ks_rows, columns=['Feature', 'KS Statistic', 'p-value', 'Drift Detected'])

Below are ECDF charts of the four numeric features in our dataset:

ECDFs of four numeric features (image by author)

Let’s look at the account age feature as an example: the x-axis represents account age (0-50 months), while the y-axis shows the ECDF for both reference and production datasets. The production dataset skews towards newer accounts, as a larger proportion of its observations have lower account ages.

Chi-Square test for categorical features

To detect shifts in categorical and boolean features, I like to use the Chi-Square test.

This test compares the frequency distribution of a categorical feature in the reference and production datasets, and returns two values:

  • Chi-Square statistic: A higher value indicates a greater shift between the reference and production datasets.
  • P-value: A p-value below 0.05 suggests that the difference between the reference and production datasets is statistically significant, indicating potential feature drift.
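As a rough illustration of what the statistic measures, the sketch below computes it by hand from expected counts and compares it with SciPy's `chi2_contingency`. The counts in `observed` are made up for the example and do not come from the dataset in this post.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: reference, production. Columns: PIN entered, PIN not entered.
# These counts are hypothetical.
observed = np.array([[900, 100],
                     [800, 200]])

# Expected counts if the category mix were the same in both datasets
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-Square statistic: squared deviations scaled by the expected counts
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# correction=False turns off the Yates continuity correction that SciPy
# applies by default to 2x2 tables, so the two numbers are comparable
chi2_scipy, p_value, dof, _ = chi2_contingency(observed, correction=False)
print(chi2_manual, chi2_scipy, p_value)
```

Note that with the default `correction=True`, a boolean feature (a 2x2 table) gets a slightly smaller statistic than the hand calculation.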

Python implementation:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Collect one result row per categorical feature
chi2_rows = []

for col in categorical_cols:
    # Get raw value counts for both reference and production datasets
    ref_counts = ref_data[col].value_counts()
    prod_counts = prod_data[col].value_counts()

    # Ensure all categories are represented in both (sorted for a stable order)
    all_categories = sorted(set(ref_counts.index).union(prod_counts.index))
    ref_counts = ref_counts.reindex(all_categories, fill_value=0)
    prod_counts = prod_counts.reindex(all_categories, fill_value=0)

    # Create contingency table of observed counts
    contingency_table = np.array([ref_counts, prod_counts])

    # Perform Chi-Square test
    chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
    drift_detected = p_value < 0.05

    # Store results for this feature
    chi2_rows.append({
        'Feature': col,
        'Chi-Square Statistic': chi2_stat,
        'p-value': p_value,
        'Drift Detected': drift_detected
    })

# Build the results dataframe once, after the loop
chi2_results = pd.DataFrame(chi2_rows, columns=['Feature', 'Chi-Square Statistic', 'p-value', 'Drift Detected'])

The Chi-Square statistic of 57.31 with a p-value of 3.72e-14 confirms a large shift in our categorical feature, Entered PIN. This finding aligns with the histogram below, which visually illustrates the shift:

Distribution of categorical feature (image by author)

Detecting multivariate shifts

Spearman Correlation for shifts in pairwise interactions

In addition to monitoring individual feature shifts, it’s important to track shifts in relationships or interactions between features, known as multivariate shifts. Even if the distributions of individual features remain stable, multivariate shifts can signal meaningful differences in the data.

By default, Pandas’ .corr() function calculates Pearson correlation, which only captures linear relationships between variables. However, relationships between features are often non-linear yet still follow a consistent trend.

To capture this, we use Spearman correlation to measure monotonic relationships between features. It captures whether features change together in a consistent direction, even if their relationship isn’t strictly linear.
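A quick way to see the difference is a relationship that is perfectly monotonic but far from linear. The sketch below uses made-up data (`x` and `y = exp(x)`), not the transaction dataset from this post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=1000)

# y always increases with x, but along a curve rather than a line
df = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Spearman works on ranks, so it reports a perfect relationship here, while Pearson is pulled below 1 by the curvature.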

To assess shifts in feature relationships, we compare:

  • Reference correlation (ref_corr): Captures historical feature relationships in the reference dataset.
  • Production correlation (prod_corr): Captures new feature relationships in production.
  • Absolute difference in correlation: Measures how much feature relationships have shifted between the reference and production datasets. Higher values indicate more significant shifts.

Python implementation:

# Calculate correlation matrices
ref_corr = ref_data.corr(method='spearman')
prod_corr = prod_data.corr(method='spearman')

# Calculate correlation difference
corr_diff = abs(ref_corr - prod_corr)
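With more than a handful of features, scanning the full difference matrix gets tedious. One option is to flatten it and rank the pairs, as sketched below. The helper name `top_correlation_shifts` and the toy data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def top_correlation_shifts(ref_df, prod_df, top_n=5):
    """Rank feature pairs by absolute change in Spearman correlation."""
    ref_corr = ref_df.corr(method="spearman")
    prod_corr = prod_df.corr(method="spearman")
    corr_diff = (ref_corr - prod_corr).abs()
    # Keep each pair once by masking the diagonal and the lower triangle
    mask = np.triu(np.ones(corr_diff.shape, dtype=bool), k=1)
    pairs = corr_diff.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(top_n)

# Toy data: x and y are independent in reference but related in production
rng = np.random.default_rng(1)
ref_toy = pd.DataFrame(rng.normal(size=(2000, 3)), columns=["x", "y", "z"])
prod_toy = pd.DataFrame(rng.normal(size=(2000, 3)), columns=["x", "y", "z"])
prod_toy["y"] = prod_toy["x"] + 0.5 * prod_toy["y"]

shifts = top_correlation_shifts(ref_toy, prod_toy)
print(shifts)
```

The result is a Series indexed by feature pair, so the pair with the largest shift sits at the top.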

Example: Change in correlation

Now, let’s look at the correlation between transaction_amount and account_age_in_months:

  • In ref_corr, the correlation is 0.00095, indicating essentially no relationship between the two features.
  • In prod_corr, the correlation is -0.0325, indicating a weak negative correlation.
  • Absolute difference in the Spearman correlation is 0.0335, which is a small but noticeable shift.

The absolute difference in correlation indicates a shift in the relationship between transaction_amount and account_age_in_months.

There used to be no relationship between these two features, but the production dataset indicates that there is now a weak negative correlation, meaning that newer accounts have higher transaction amounts. This is spot on!

Autoencoder for complex, high-dimensional multivariate shifts

In addition to monitoring pairwise interactions, we can also look for shifts across more dimensions in the data.

Autoencoders are powerful tools for detecting high-dimensional multivariate shifts, where multiple features collectively change in ways that may not be apparent from looking at individual feature distributions or pairwise correlations.

An autoencoder is a neural network that learns a compressed representation of data through two components:

  • Encoder: Compresses input data into a lower-dimensional representation.
  • Decoder: Reconstructs the original input from the compressed representation.

To detect shifts, we compare the reconstructed output to the original input and compute the reconstruction loss.

  • Low reconstruction loss → The autoencoder successfully reconstructs the data, meaning the new observations are similar to what it has seen and learned.
  • High reconstruction loss → The production data deviates significantly from the learned patterns, indicating potential drift.

Unlike traditional drift metrics that focus on individual features or pairwise relationships, autoencoders capture complex, non-linear dependencies across multiple variables simultaneously.

Python implementation:

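import numpy as np
from sklearn.preprocessing import StandardScaler
# Keras imports below assume the TensorFlow-bundled Keras API
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

ref_features = ref_data[numeric_cols + categorical_cols]
prod_features = prod_data[numeric_cols + categorical_cols]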

# Normalize the data
scaler = StandardScaler()
ref_scaled = scaler.fit_transform(ref_features)
prod_scaled = scaler.transform(prod_features)

# Split reference data into train and validation
np.random.shuffle(ref_scaled)
train_size = int(0.8 * len(ref_scaled))
train_data = ref_scaled[:train_size]
val_data = ref_scaled[train_size:]

# Build autoencoder
input_dim = ref_features.shape[1]
encoding_dim = 3 
# Input layer
input_layer = Input(shape=(input_dim, ))
# Encoder
encoded = Dense(8, activation="relu")(input_layer)
encoded = Dense(encoding_dim, activation="relu")(encoded)
# Decoder
decoded = Dense(8, activation="relu")(encoded)
decoded = Dense(input_dim, activation="linear")(decoded)
# Autoencoder
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# Train autoencoder
history = autoencoder.fit(
    train_data, train_data,
    epochs=50,
    batch_size=64,
    shuffle=True,
    validation_data=(val_data, val_data),
    verbose=0
)

# Calculate reconstruction error
ref_pred = autoencoder.predict(ref_scaled, verbose=0)
prod_pred = autoencoder.predict(prod_scaled, verbose=0)

ref_mse = np.mean(np.power(ref_scaled - ref_pred, 2), axis=1)
prod_mse = np.mean(np.power(prod_scaled - prod_pred, 2), axis=1)

The charts below show the distribution of reconstruction loss for both datasets.

Distribution of reconstruction loss between actuals and predictions (image by author)

The production dataset has a higher mean reconstruction error than that of the reference dataset, indicating a shift in the overall data. This aligns with the changes in the production dataset with a higher number of newer accounts with high-value transactions.
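Comparing means is a reasonable first check. One way to turn reconstruction error into an alert, sketched below, is to set a threshold from the reference errors and track how often production exceeds it. The 95th-percentile cut-off, the helper name `exceedance_rate`, and the stand-in arrays are all assumptions. In practice you would pass in `ref_mse` and `prod_mse` from the autoencoder.

```python
import numpy as np

def exceedance_rate(ref_errors, new_errors, quantile=0.95):
    """Share of new_errors above a quantile threshold set on ref_errors."""
    threshold = np.quantile(ref_errors, quantile)
    rate = float(np.mean(np.asarray(new_errors) > threshold))
    return threshold, rate

# Stand-in reconstruction errors; replace with ref_mse and prod_mse
rng = np.random.default_rng(7)
ref_errors = rng.gamma(shape=2.0, scale=0.1, size=5000)
prod_errors = rng.gamma(shape=2.0, scale=0.2, size=5000)

threshold, ref_rate = exceedance_rate(ref_errors, ref_errors)
_, prod_rate = exceedance_rate(ref_errors, prod_errors)
print(f"threshold={threshold:.3f}, reference={ref_rate:.1%}, production={prod_rate:.1%}")
```

By construction about 5% of reference rows exceed the threshold, so a production rate well above that suggests the data has moved away from what the autoencoder learned.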

Summarizing

Model monitoring is an essential, yet often overlooked, responsibility for data scientists and machine learning engineers.

All the statistical methods led to the same conclusion, which aligns with the observed shifts in the data: they detected a trend in production towards newer accounts making higher-value transactions. This shift resulted in higher model scores, signaling an increase in potential fraud.

In this post, I covered techniques for detecting drift on three different levels:

  • Model score drift: Using Population Stability Index (PSI)
  • Individual feature drift: Using Kolmogorov-Smirnov test for numeric features and Chi-Square test for categorical features
  • Multivariate drift: Using Spearman correlation for pairwise interactions and autoencoders for high-dimensional, multivariate shifts.

These are just a few of the techniques I rely on for comprehensive monitoring — there are plenty of other equally valid statistical methods that can also detect drift effectively.

Detected shifts often point to underlying issues that warrant further investigation. The root cause could be as serious as a data collection bug, or as minor as a time change like a daylight saving time adjustment.

There are also fantastic Python packages, like evidently.ai, that automate many of these comparisons. However, I believe there’s significant value in deeply understanding the statistical techniques behind drift detection, rather than relying solely on these tools.

What’s the model monitoring process like at places you’ve worked?


Want to build your AI skills?

👉🏻 I run the AI Weekender and write weekly blog posts on data science, AI weekend projects, and career advice for professionals in data.

