
Learnings from a Machine Learning Engineer — Part 3: The Evaluation


In this third part of my series, I will explore the evaluation process, a critical piece that leads to a cleaner data set and better model performance. We will see the difference between evaluating a trained model (one not yet in production) and evaluating a deployed model (one making real-world predictions).

In Part 1, I discussed the process of labelling the image data you use in your Image Classification project. I showed how to define “good” images and create sub-classes. In Part 2, I went over various data sets beyond the usual train-validation-test sets, such as benchmark sets, plus how to handle synthetic data and duplicate images.

Evaluation of the trained model

As machine learning engineers, we look at accuracy, F1, log loss, and other metrics to decide if a model is ready to move to production. These are all important measures, but from my experience, these scores can be deceiving, especially as the number of classes grows.

Although it can be time-consuming, I find it very important to manually review the images that the model gets wrong, as well as the images that the model gives a low softmax “confidence” score to. This means adding a step immediately after your training run completes to calculate scores for all images — training, validation, test, and the benchmark sets. You only need to bring up for manual review the ones that the model had problems with. This should only be a small percentage of the total number of images. See the Double-check process below.
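
As a rough illustration, here is a minimal sketch of that post-training step. The Keras-style `model.predict` call, the data set structure, and the 0.95 cutoff are assumptions for illustration, not the author's exact implementation:

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.95  # assumed cutoff, matching the report later in this article

def build_review_queue(model, datasets):
    """datasets: dict of set name -> list of (image_path, image_array, true_label)."""
    review_queue = []
    for set_name, samples in datasets.items():
        for path, image, true_label in samples:
            probs = model.predict(image[np.newaxis, ...], verbose=0)[0]  # softmax scores
            predicted_label = int(np.argmax(probs))
            confidence = float(probs[predicted_label])
            # Flag wrong predictions and low-confidence ones for manual review
            if predicted_label != true_label or confidence < CONFIDENCE_THRESHOLD:
                review_queue.append({
                    "set": set_name,
                    "path": path,
                    "true_label": true_label,
                    "predicted_label": predicted_label,
                    "confidence": confidence,
                })
    return review_queue
```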

What you do during the manual evaluation is put yourself in a “training mindset” to ensure that the labelling standards you set up in Part 1 are being followed. Ask yourself:

  • “Is this a good image?” Is the subject front and center, and can you clearly see all the features?
  • “Is this the correct label?” Don’t be surprised if you find wrong labels.

You can either remove the bad images or fix the labels if they are wrong. Otherwise, you can keep them in the data set and force the model to do better next time. Other questions I ask are:

  • “Why did the model get this wrong?”
  • “Why did this image get a low score?”
  • “What is it about the image that caused confusion?”

Sometimes the answer has nothing to do with that specific image. Frequently, it has to do with the other images, either in the ground truth class or in the predicted class. It is worth the effort to Double-check all images in both sets if you see a consistently bad guess. Again, don’t be surprised if you find poor images or wrong labels.

Weighted evaluation

When doing the evaluation of the trained model (above), we apply a lot of subjective analysis — “Why did the model get this wrong?” and “Is this a good image?” From these, you may only get a gut feeling.

Frequently, I will decide to hold off moving a model forward to production based on that gut feel. But how can you justify to your manager that you want to hit the brakes? This is where a more objective analysis comes in: creating a weighted average of the softmax “confidence” scores.

In order to apply a weighted evaluation, we need to identify sets of classes that deserve adjustments to the score. Here is where I create a list of “commonly confused” classes.

Commonly confused classes

Certain animals at our zoo can easily be mistaken for one another. For example, African elephants and Asian elephants have different ear shapes. If your model gets these two mixed up, that is not as bad as guessing a giraffe! So perhaps you give partial credit here. You and your subject matter experts (SMEs) can come up with a list of these pairs and a weighted adjustment for each.


This weight can be factored into a modified cross-entropy loss function, shown in the equation below. The back half of this equation reduces the impact of being wrong for specific pairs of ground truth and prediction by using the “weight” function as a lookup. By default, the weighted adjustment would be 1 for all pairings, and the commonly confused classes would get something like 0.5.

In other words, it’s better to be unsure (have a lower confidence score) when you are wrong, compared to being super confident and wrong.

Modified cross-entropy loss function, image by author

Once this weighted log loss is calculated, I can compare it to previous training runs to see if the new model is ready for production.
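
Here is a minimal sketch of that calculation, assuming the weight lookup defaults to 1 for all pairings; the class names and data structures are illustrative:

```python
import numpy as np

# Commonly confused (ground_truth, predicted) pairs and their weighted adjustments
PAIR_WEIGHTS = {
    ("african_elephant", "asian_elephant"): 0.5,
    ("asian_elephant", "african_elephant"): 0.5,
}

def weight(true_label, predicted_label):
    """Look up the adjustment for a (ground truth, prediction) pair; default is 1."""
    return PAIR_WEIGHTS.get((true_label, predicted_label), 1.0)

def weighted_log_loss(records, eps=1e-12):
    """records: list of (true_label, predicted_label, softmax_prob_of_true_class)."""
    losses = [
        weight(true_label, predicted_label) * -np.log(max(prob_true, eps))
        for true_label, predicted_label, prob_true in records
    ]
    return float(np.mean(losses))
```

A confidently wrong prediction still produces a large loss, but the 0.5 adjustment softens it for the pairs you and your SMEs agreed are easy to confuse.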

Confidence threshold report

Another valuable measure that incorporates the confidence threshold (in my example, 95) is to report on accuracy and false positive rates. Recall that when we apply the confidence threshold before presenting results, we help keep false positives from being shown to the end user.

In this table, we look at the breakdown of “true positive above 95” for each data set. We get a sense that when a “good” picture comes through (like the ones from our train-validation-test set) it is very likely to surpass the threshold, thus the user is “happy” with the outcome. Conversely, the “false positive above 95” is extremely low for good pictures, thus only a small number of our users will be “sad” about the results.

Example Confidence Threshold Report, image by author
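
As a rough sketch, such a report can be computed from per-image scores for each data set (the names and structure here are assumptions):

```python
def confidence_threshold_report(scored, threshold=0.95):
    """scored: dict of set name -> list of (true_label, predicted_label, confidence)."""
    report = {}
    for set_name, rows in scored.items():
        total = len(rows)
        # "Happy" outcome: correct prediction that clears the threshold
        tp_above = sum(1 for t, p, c in rows if p == t and c >= threshold)
        # "Sad" outcome: wrong prediction that still clears the threshold
        fp_above = sum(1 for t, p, c in rows if p != t and c >= threshold)
        report[set_name] = {
            "true_positive_above": tp_above / total,
            "false_positive_above": fp_above / total,
        }
    return report
```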

We expect the train-validation-test set results to be exceptional since our data is curated. So, as long as people take “good” pictures, the model should do very well. But to get a sense of how it does in extreme situations, let’s take a look at our benchmarks.

The “difficult” benchmark has more modest true positive and false positive rates, which reflects the fact that the images are more challenging. These values are much easier to compare across training runs, so they let me set min/max targets. For example, if I target a minimum of 80% for true positives and a maximum of 5% for false positives on this benchmark, then I can feel confident moving the model to production.

The “out-of-scope” benchmark has no true positive rate because none of the images belong to any class the model can identify. Remember, we picked things like a bag of popcorn, etc., that are not zoo animals, so there cannot be any true positives. But we do get a false positive rate, which means the model gave a confident score to that bag of popcorn as some animal. And if this rate exceeds the target maximum of 10% we set for this benchmark, then we may not want to move the model to production.


Right now, you may be thinking, “Well, what animal did it pick for the bag of popcorn?” Excellent question! Now you understand the importance of doing a manual review of the images that get bad results.

Evaluation of the deployed model

The evaluation that I described above applies to a model immediately after training. Now, you want to evaluate how your model is doing in the real world. The process is similar, but requires you to shift to a “production mindset” and ask yourself, “Did the model get this correct?”, “Should it have gotten this correct?”, and “Did we tell the user the right thing?”

So, imagine that you are logging in for the morning — after sipping on your cold brew coffee, of course — and are presented with 500 images that your zoo guests took yesterday of different animals. Your job is to determine how satisfied the guests were using your model to identify the zoo animals.

Using the softmax “confidence” score for each image, we apply a threshold before presenting results. Above the threshold, we tell the guest what the model predicted. I’ll call this the “happy path”. And below the threshold is the “sad path”, where we ask them to try again.
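
The routing itself is simple; here is a minimal sketch (the message wording and 0.95 threshold are assumptions):

```python
def respond_to_guest(predicted_class, confidence, threshold=0.95):
    if confidence >= threshold:
        # "Happy path": confident enough to show the prediction
        return f"This looks like a {predicted_class}!"
    # "Sad path": ask the guest to try again
    return "Sorry, we couldn't identify that one. Please try another photo."
```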

Your review interface will first show you all the “happy path” images one at a time. This is where you ask yourself, “Did we get this right?” Hopefully, yes!

But if not, this is where things get tricky. So now you have to ask, “Why not?” Here are some things that it could be:

  • “Bad” picture — Poor lighting, bad angle, zoomed out, etc. — refer to your labelling standards.
  • Out-of-scope — It’s a zoo animal, but unfortunately one that isn’t found in this zoo. Maybe it belongs to another zoo (your guest likes to travel and try out your app). Consider adding these to your data set.
  • Out-of-scope — It’s not a zoo animal. It could be an animal in your zoo, but not one typically contained there, like a neighborhood sparrow or mallard duck. This might be a candidate to add.
  • Out-of-scope — It’s something found in the zoo. A zoo usually has interesting trees and shrubs, so people might try to identify those. Another candidate to add.
  • Prankster — Completely out-of-scope. Because people like to play with technology, there’s the possibility you have a prankster that took a picture of a bag of popcorn, or a soft drink cup, or even a selfie. These are hard to prevent, but hopefully they get a low enough score (below the threshold) so the model does not identify them as a zoo animal. If you see enough of a pattern in these, consider creating a class with special handling on the front-end.

After reviewing the “happy path” images, you move on to the “sad path” images — the ones that got a low confidence score and the app gave a “sorry, try again” message. This time you ask yourself, “Should the model have given this image a higher score?”, which would have put it in the “happy path”. If so, then you want to ensure these images are added to the training set so next time it will do better. But most of the time, the low score reflects many of the “bad” or out-of-scope situations mentioned above.

Perhaps your model performance is suffering and it has nothing to do with your model. Maybe it is the way your users are interacting with the app. Keep an eye out for non-technical problems and share your observations with the rest of your team. For example:

  • Are your users using the application in the ways you expected?
  • Are they not following the instructions?
  • Do the instructions need to be stated more clearly?
  • Is there anything you can do to improve the experience?

Collect statistics and new images

Both of the manual evaluations above open a gold mine of data. So, be sure to collect these statistics and feed them into a dashboard — your manager and your future self will thank you!


Keep track of these stats and generate reports that you and your team can reference (a sketch of this aggregation follows the list below):

  • How often is the model being called?
  • At what times of day, and on what days of the week, is it used?
  • Are your system resources able to handle the peak load?
  • What classes are the most common?
  • After evaluation, what is the accuracy for each class?
  • What is the breakdown for confidence scores?
  • How many scores are above and below the confidence threshold?
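
As an example of what that aggregation could look like, here is a sketch using pandas; the log columns and function name are assumptions:

```python
import pandas as pd

def daily_dashboard_stats(log_df: pd.DataFrame, threshold: float = 0.95) -> dict:
    """log_df columns: timestamp (datetime), predicted_class, confidence, reviewed_correct."""
    return {
        "calls_per_hour": log_df.groupby(log_df["timestamp"].dt.hour).size().to_dict(),
        "top_classes": log_df["predicted_class"].value_counts().head(10).to_dict(),
        "accuracy_by_class": log_df.groupby("predicted_class")["reviewed_correct"].mean().to_dict(),
        "share_above_threshold": float((log_df["confidence"] >= threshold).mean()),
    }
```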

The single best thing you get from a deployed model is the additional real-world images! You can add these new images to improve coverage of your existing zoo animals. But more importantly, they provide insight into other classes to add. For example, let’s say people enjoy taking a picture of the large walrus statue at the gate. Some of these may make sense to incorporate into your data set to provide a better user experience.

Creating a new class, like the walrus statue, is not a huge effort, and it avoids false positive responses. It would be more embarrassing to identify a walrus statue as an elephant! As for the prankster and the bag of popcorn, you can configure your front-end to quietly handle these. You might even get creative and have fun with it, like, “Thank you for visiting the food court.”

Double-check process

It is a good idea to double-check your image set when you suspect there may be problems with your data. I’m not suggesting a top-to-bottom check, because that would be a monumental effort! Rather, check the specific classes that you suspect contain bad data that is degrading your model performance.

Immediately after my training run completes, I have a script that will use this new model to generate predictions for my entire data set. When this is complete, it will take the list of incorrect identifications, as well as the low-scoring predictions, and automatically feed that list into the Double-check interface.

This interface will show, one at a time, the image in question, alongside an example image of the ground truth and an example image of what the model predicted. I can visually compare the three, side-by-side. The first thing I do is ensure the original image is a “good” picture, following my labelling standards. Then I check if the ground-truth label is indeed correct, or if there is something that made the model think it was the predicted label.
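
Building the queue for that interface can reuse the review records from the earlier sketch; here is a minimal example (the `example_image` helper, which returns a representative picture for a class, is an assumption):

```python
def build_double_check_items(review_queue, example_image):
    """review_queue: output of build_review_queue() from the earlier sketch."""
    items = []
    for record in review_queue:
        items.append({
            "image_under_review": record["path"],
            "ground_truth_example": example_image(record["true_label"]),
            "predicted_example": example_image(record["predicted_label"]),
            "confidence": record["confidence"],
        })
    return items
```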

At this point I can:

  • Remove the original image if the image quality is poor.
  • Relabel the image if it belongs in a different class.

During this manual evaluation, you might notice dozens of the same wrong prediction. Ask yourself why the model made this mistake when the images seem perfectly fine. The answer may be some incorrect labels on images in the ground truth, or even in the predicted class!

Don’t hesitate to add those classes and sub-classes back into the Double-check interface and step through them all. You may have 100–200 pictures to review, but there is a good chance that one or two of the images will stand out as being the culprit.

Up next…

With a different mindset for a trained model versus a deployed model, we can now evaluate performance to decide which models are ready for production, and how well a production model is going to serve the public. This relies on a solid Double-check process and a critical eye on your data. And beyond the “gut feel” of your model, we can rely on the benchmark scores to support us.

In Part 4, we kick off the training run, but there are some subtle techniques to get the most out of the process, and even ways to leverage throw-away models to expand your library of image data.
