Stay Ahead, Stay ONMINE

Learnings from a Machine Learning Engineer — Part 3: The Evaluation

In this third part of my series, I will explore the evaluation process which is a critical piece that will lead to a cleaner data set and elevate your model performance. We will see the difference between evaluation of a trained model (one not yet in production), and evaluation of a deployed model (one making real-world predictions). In Part 1, […]

In this third part of my series, I will explore the evaluation process which is a critical piece that will lead to a cleaner data set and elevate your model performance. We will see the difference between evaluation of a trained model (one not yet in production), and evaluation of a deployed model (one making real-world predictions).

In Part 1, I discussed the process of labelling your image data that you use in your Image Classification project. I showed how to define “good” images and create sub-classes. In Part 2, I went over various data sets, beyond the usual train-validation-test sets, such as benchmark sets, plus how to handle synthetic data and duplicate images.

Evaluation of the trained model

As machine learning engineers we look at accuracy, F1, log loss, and other metrics to decide if a model is ready to move to production. These are all important measures, but from my experience, these scores can be deceiving especially as the number of classes grows.

Although it can be time consuming, I find it very important to manually review the images that the model gets wrong, as well as the images that the model gives a low softmax “confidence” score to. This means adding a step immediately after your training run completes to calculate scores for all images — training, validation, test, and the benchmark sets. You only need to bring up for manual review the ones that the model had problems with. This should only be a small percentage of the total number of images. See the Double-check process below

What you do during the manual evaluation is to put yourself in a “training mindset” to ensure that the labelling standards are being followed that you setup in Part 1. Ask yourself:

  • “Is this a good image?” Is the subject front and center, and can you clearly see all the features?
  • “Is this the correct label?” Don’t be surprised if you find wrong labels.

You can either remove the bad images or fix the labels if they are wrong. Otherwise you can keep them in the data set and force the model to do better next time. Other questions I ask are:

  • “Why did the model get this wrong?”
  • “Why did this image get a low score?”
  • “What is it about the image that caused confusion?”

Sometimes the answer has nothing to do with that specific image. Frequently, it has to do with the other images, either in the ground truth class or in the predicted class. It is worth the effort to Double-check all images in both sets if you see a consistently bad guess. Again, don’t be surprised if you find poor images or wrong labels.

Weighted evaluation

When doing the evaluation of the trained model (above), we apply a lot of subjective analysis — “Why did the model get this wrong?” and “Is this a good image?” From these, you may only get a gut feeling.

Frequently, I will decide to hold off moving a model forward to production based on that gut feel. But how can you justify to your manager that you want to hit the brakes? This is where putting a more objective analysis comes in by creating a weighted average of the softmax “confidence” scores.

In order to apply a weighted evaluation, we need to identify sets of classes that deserve adjustments to the score. Here is where I create a list of “commonly confused” classes.

Commonly confused classes

Certain animals at our zoo can easily be mistaken. For example, African elephants and Asian elephants have different ear shapes. If your model gets these two mixed up, that is not as bad as guessing a giraffe! So perhaps you give partial credit here. You and your subject matter experts (SMEs) can come up with a list of these pairs and a weighted adjustment for each.

Photo by Matt Bango on Unsplash
Photo by Mathew Krizmanich on Unsplash

This weight can be factored into a modified cross-entropy loss function in the equation below. The back half of this equation will reduce the impact of being wrong for specific pairs of ground truth and prediction by using the “weight” function as a lookup. By default, the weighted adjustment would be 1 for all pairings, and the commonly confused classes would get something like 0.5.

In other words, it’s better to be unsure (have a lower confidence score) when you are wrong, compared to being super confident and wrong.

Modified cross-entropy loss function, image by author

Once this weighted log loss is calculated, I can compare to previous training runs to see if the new model is ready for production.

Confidence threshold report

Another valuable measure that incorporates the confidence threshold (in my example, 95) is to report on accuracy and false positive rates. Recall that when we apply the confidence threshold before presenting results, we help reduce false positives from being shown to the end user.

In this table, we look at the breakdown of “true positive above 95” for each data set. We get a sense that when a “good” picture comes through (like the ones from our train-validation-test set) it is very likely to surpass the threshold, thus the user is “happy” with the outcome. Conversely, the “false positive above 95” is extremely low for good pictures, thus only a small number of our users will be “sad” about the results.

Example Confidence Threshold Report, image by author

We expect the train-validation-test set results to be exceptional since our data is curated. So, as long as people take “good” pictures, the model should do very well. But to get a sense of how it does on extreme situations, let’s take a look at our benchmarks.

The “difficult” benchmark has more modest true positive and false positive rates, which reflects the fact that the images are more challenging. These values are much easier to compare across training runs, so that lets me set a min/max target. So for example, if I target a minimum of 80% for true positive, and maximum of 5% for false positive on this benchmark, then I can feel confident moving this to production.

The “out-of-scope” benchmark has no true positive rate because none of the images belong to any class the model can identify. Remember, we picked things like a bag of popcorn, etc., that are not zoo animals, so there cannot be any true positives. But we do get a false positive rate, which means the model gave a confident score to that bag of popcorn as some animal. And if we set a target maximum of 10% for this benchmark, then we may not want to move it to production.

Photo by Linus Mimietz on Unsplash

Right now, you may be thinking, “Well, what animal did it pick for the bag of popcorn?” Excellent question! Now you understand the importance of doing a manual review of the images that get bad results.

Evaluation of the deployed model

The evaluation that I described above applies to a model immediately after training. Now, you want to evaluate how your model is doing in the real world. The process is similar, but requires you to shift to a “production mindset” and asking yourself, “Did the model get this correct?” and “Should it have gotten this correct?” and “Did we tell the user the right thing?”

So, imagine that you are logging in for the morning — after sipping on your cold brew coffee, of course — and are presented with 500 images that your zoo guests took yesterday of different animals. Your job is to determine how satisfied the guests were using your model to identify the zoo animals.

Using the softmax “confidence” score for each image, we have a threshold before presenting results. Above the threshold, we tell the guest what the model predicted. I’ll call this the “happy path”. And below the threshold is the “sad path” where we ask them to try again.

Your review interface will first show you all the “happy path” images one at a time. This is where you ask yourself, “Did we get this right?” Hopefully, yes!

But if not, this is where things get tricky. So now you have to ask, “Why not?” Here are some things that it could be:

  • “Bad” picture — Poor lighting, bad angle, zoomed out, etc — refer to your labelling standards.
  • Out-of-scope — It’s a zoo animal, but unfortunately one that isn’t found in this zoo. Maybe it belongs to another zoo (your guest likes to travel and try out your app). Consider adding these to your data set.
  • Out-of-scope — It’s not a zoo animal. It could be an animal in your zoo, but not one typically contained there, like a neighborhood sparrow or mallard duck. This might be a candidate to add.
  • Out-of-scope — It’s something found in the zoo. A zoo usually has interesting trees and shrubs, so people might try to identify those. Another candidate to add.
  • Prankster — Completely out-of-scope. Because people like to play with technology, there’s the possibility you have a prankster that took a picture of a bag of popcorn, or a soft drink cup, or even a selfie. These are hard to prevent, but hopefully get a low enough score (below the threshold) so the model did not identify it as a zoo animal. If you see enough pattern in these, consider creating a class with special handling on the front-end.

After reviewing the “happy path” images, you move on to the “sad path” images — the ones that got a low confidence score and the app gave a “sorry, try again” message. This time you ask yourself, “Should the model have given this image a higher score?” which would have put it in the “happy path”. If so, then you want to ensure these images are added to the training set so next time it will do better. But most of time, the low score reflects many of the “bad” or out-of-scope situations mentioned above.

Perhaps your model performance is suffering and it has nothing to do with your model. Maybe it is the ways you users interacting with the app. Keep an eye out of non-technical problems and share your observations with the rest of your team. For example:

  • Are your users using the application in the ways you expected?
  • Are they not following the instructions?
  • Do the instructions need to be stated more clearly?
  • Is there anything you can do to improve the experience?

Collect statistics and new images

Both of the manual evaluations above open a gold mine of data. So, be sure to collect these statistics and feed them into a dashboard — your manager and your future self will thank you!

Photo by Justin Morgan on Unsplash

Keep track of these stats and generate reports that you and your can reference:

  • How often the model is being called?
  • What times of the day, what days of the week is it used?
  • Are your system resources able to handle the peak load?
  • What classes are the most common?
  • After evaluation, what is the accuracy for each class?
  • What is the breakdown for confidence scores?
  • How many scores are above and below the confidence threshold?

The single best thing you get from a deployed model is the additional real-world images! You can add these now images to improve coverage of your existing zoo animals. But more importantly, they provide you insight on other classes to add. For example, let’s say people enjoy taking a picture of the large walrus statue at the gate. Some of these may make sense to incorporate into your data set to provide a better user experience.

Creating a new class, like the walrus statue, is not a huge effort, and it avoids the false positive responses. It would be more embarrassing to identify a walrus statue as an elephant! As for the prankster and the bag of popcorn, you can configure your front-end to quietly handle these. You might even get creative and have fun with it like, “Thank you for visiting the food court.”

Double-check process

It is a good idea to double-check your image set when you suspect there may be problems with your data. I’m not suggesting a top-to-bottom check, because that would a monumental effort! Rather specific classes that you suspect could contain bad data that is degrading your model performance.

Immediately after my training run completes, I have a script that will use this new model to generate predictions for my entire data set. When this is complete, it will take the list of incorrect identifications, as well as the low scoring predictions, and automatically feed that list into the Double-check interface.

This interface will show, one at a time, the image in question, alongside an example image of the ground truth and an example image of what the model predicted. I can visually compare the three, side-by-side. The first thing I do is ensure the original image is a “good” picture, following my labelling standards. Then I check if the ground-truth label is indeed correct, or if there is something that made the model think it was the predicted label.

At this point I can:

  • Remove the original image if the image quality is poor.
  • Relabel the image if it belongs in a different class.

During this manual evaluation, you might notice dozens of the same wrong prediction. Ask yourself why the model made this mistake when the images seem perfectly fine. The answer may be some incorrect labels on images in the ground truth, or even in the predicted class!

Don’t hesitate to add those classes and sub-classes back into the Double-check interface and step through them all. You may have 100–200 pictures to review, but there is a good chance that one or two of the images will stand out as being the culprit.

Up next…

With a different mindset for a trained model versus a deployed model, we can now evaluate performances to decide which models are ready for production, and how well a production model is going to serve the public. This relies on a solid Double-check process and a critical eye on your data. And beyond the “gut feel” of your model, we can rely on the benchmark scores to support us.

In Part 4, we kick off the training run, but there are some subtle techniques to get the most out of the process and even ways to leverage throw-away models to expand your library image data.

Shape
Shape
Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy,  bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

Shape

Secretary Wright Acts to Unleash American Industry and Innovation with Newly Proposed Rules

WASHINGTON—U.S. Secretary of Energy Chris Wright directed the Federal Energy Regulatory Commission (FERC) today to initiate rulemaking procedures with a proposed rule to rapidly accelerate the interconnection of large loads, including data centers, positioning the United States to lead in AI innovation and in the revitalization of domestic manufacturing. Secretary Wright’s proposed rule allows customers to file joint, co-located load and generation interconnection requests. It will also significantly reduce study times and grid upgrade costs, while reducing the time needed for additional generation and power to come online. The proposed rule advances President Trump’s agenda to ensure all Americans and domestic industries have access to affordable, reliable, and secure electricity. Click here to read Secretary Wright’s letter and proposed rule. Secretary Wright also directed FERC today to initiate rulemaking procedures with a proposed rule to remove unnecessary burdens for preliminary hydroelectric power permits. Secretary Wright’s proposed rule clarifies that third parties do not have veto rights over the issuance of preliminary hydroelectric power permits. Click here to read Secretary Wright’s letter and proposed rule. President Trump and Secretary Wright have been clear: The United States is experiencing an unprecedented surge in electricity demand and the United States’ ability to remain at the forefront of technological innovation depends on an affordable, reliable, and secure supply of energy. ###

Read More »

Brazil Oil Auction Shows Renewed Interest in Offshore Exploration

Brazil sold five of the seven blocks offered in an oil auction on Wednesday to majors including Petrobras and Equinor ASA in a sign of renewed interest to explore in deep waters off the country’s southern coast despite low oil prices.  The auction presented surprises with Melbourne-based Karoon Energy Ltd winning the Esmeralda block by itself. Chinese oil majors CNOOC Ltd. and China Petroleum & Chemical Corp., or Sinopec, won a block without any local partners.  “The bid round was successful,” said Marcelo De Assis, a Rio de Janeiro-based independent oil consultant. “A Chinese operated block in the pre-salt is a first.” The fields are located in the so-called pre-salt region that is so productive that the single biggest project produces more oil than all of Colombia. Discoveries made in the 2000s propelled Brazil to become Latin America’s biggest producer, but exploration was lackluster for more than a decade until BP Plc announced the Bumerangue discovery in the pre-salt this year.  State-controlled Petroleo Brasileiro SA, as it is formally known, has also announced a series of oil finds at the Aram block in the pre-salt.  “What’s most important is that the pre-salt has heated up again,” said Pedro Zalan a geologist and consultant who previously worked at Petrobras. “The pre-salt, where there was little interest recently, got a new lease on life.”  Still, European majors who operate in Brazil, Shell Plc, BP Plc and TotalEnergies SE didn’t present any bids. The National Agency for Petroleum, Natural Gas and Biofuels, or ANP, received investment commitments even amid low prices that are prompting the oil companies to slash spending and downsize staff. WHAT DO YOU THINK? Generated by readers, the comments included herein do not reflect the views and opinions of Rigzone. All comments are subject to editorial review. Off-topic, inappropriate or

Read More »

Oil Futures Take Flight on Russia Sanctions

Oil posted its biggest one-day gain in more than four months after the US announced sanctions on Russia’s biggest oil companies, threatening supplies from one of the world’s top producing countries.  West Texas Intermediate jumped 5.6% to settle near $62 a barrel, the most since the start of the Israel-Iran conflict on June 13. Heating oil led the oil complex higher, ending the day up 6.8%. The US blacklisted Russian oil giants Rosneft PJSC and Lukoil PJSC in an effort to cut off revenue Russia needs for its war in Ukraine. Senior refinery executives in India — a key buyer of Russian crude — said the restrictions would make it impossible for flows to continue. The latest US sanctions are a radical change of policy, where previous efforts to pressure Russia to end the war included a Group-of-Seven price cap on Russian oil that sought to limit revenue for the Kremlin without disrupting supply and causing a spike in global prices. The step comes at a time when global supply looks plentiful. Nations inside and outside the OPEC+ producer alliance have been ramping up output amid signs of cooling demand growth. If India does drastically cut purchases the question will become whether China, the other top buyer of Russian crude, is willing to step into the void. “The latest US sanctions on Russia’s largest oil producers represent a significant and unprecedented escalation in Washington’s pressure campaign against Moscow,” said Rystad Energy’s head of geopolitical analysis, Jorge Leon. “Combined with the recent wave of attacks on Russian oil infrastructure, these sanctions raise the prospect of major disruptions to Russian crude production and exports, heightening the risk of forced production shut-ins.” The European Union also piled additional pressure on the Kremlin with a new package of sanctions targeting Russia’s energy infrastructure, including

Read More »

China Keeps Importing Russian LNG After Dodging New USA Curbs

China is pushing ahead with imports of US-sanctioned Russian liquefied natural gas, after the White House stopped short of putting additional restrictions on the trade in its latest wave of sanctions. The Iris vessel, carrying a shipment from the blacklisted Arctic LNG 2 facility in Russia, docked at the Beihai import terminal in southern China on Thursday, according to ship-tracking data compiled by Bloomberg. This is China’s 11th shipment of restricted Russian LNG since late-August. The move comes after US President Donald Trump ramped up pressure on Russia by blacklisting state-run oil giants Rosneft PJSC and Lukoil PJSC, citing Moscow’s lack of commitment to Ukrainian peace. However, the White House hasn’t yet hit companies circumventing sanctions on LNG — a growing source of revenue for Moscow, which aims to triple exports of the superchilled fuel by 2030.  The lack of new restrictions on Russian LNG is notable, given that the UK slapped sanctions on Beihai last week. Meanwhile, European Union nations have adopted a new package of sanctions aimed at Russia that will target 45 entities, including 12 companies in China and Hong Kong.   China had designated Beihai as the sole entry point for shipments from Arctic LNG 2 — a Russian project already sanctioned by the US in 2023. Arctic LNG 2 started delivering the blacklisted fuel to the Asian nation in late August, a move that coincided with a visit to Beijing by Russian President Vladimir Putin. The Iris vessel loaded an LNG shipment from a floating storage unit in eastern Russia in early October, according to ship-tracking data. The fuel in storage was sourced from the Arctic LNG 2 project. The storage facility and Iris have both been previously sanctioned by the US. At least three more vessels carrying blacklisted Russian LNG are heading to the Beihai terminal, ship data

Read More »

US slaps sanctions on Russian oil giants; EU follows with LNG ban

The Treasury Department carved out an exception to allow oilfield services for the Caspian Pipeline Consortium and the Tengizchevroil project, authorizing those transactions despite the broader sanctions in place. EU sanctions on Russia The European Union (EU) ratcheted up its own sanctions on Moscow Oct. 23, announcing a phased ban on Russian LNG imports into the 27 nations that comprise the organization, with the goal of halting all trade in Russian LNG by Jan. 1, 2027. Most Russian gas travels through pipeline to Europe, but bans on piped gas have boosted LNG exports. The EU sanctions also ban transactions with Rosneft and Lukoil and institute port bans on 117 newly designated ships in Russia’s so-called “shadow fleet,” which has sought to evade previous sanctions measures. The sanctions come a week after the UK imposed its own sanctions on Lukoil and Rosneft, the shadow fleet, a major Indian oil refinery, and four Chinese oil terminals. Russia is dependent on oil and gas sales to fuel its economy. EIA estimates that Russia’s oil exports averaged 4.3 million b/d in first-half 2005, down from 4.8 million b/d in 2024.  Trade has shifted as a result of sanctions, with the EU’s imports falling to 11% (from over 50% in 2020) and Turkey’s imports taking at least half of that oil, the US independent statistical arm said. China remains the single largest importer of Russian oil, averaging 2.0 million b/d in first-half 2025. India holds the second spot, boosting imports to 1.6 million b/d in first-half 2025 from 50,000 b/d in 2020, EIA added.

Read More »

Williams buys into Louisiana LNG, sells upstream Haynesville asset

Woodside Energy Ltd. has sold a 10% interest in Louisiana LNG LLC and and 80% interest and operatorship of Driftwood Pipeline LLC to Williams Cos. As part of the transaction, Williams will contribute its share expenditure—roughly $1.9 billion—to the LNG plant and pipeline and assumes LNG offtake obligations for 10% of produced volumes. Williams will also build and operate the 2.4-bcfd Line 200 pipeline, which will use 42-in. OD pipe to transport natural gas from the 5.4-bcfd Driftwood pipeline to the 27.6-million tonne/year (tpy) Louisiana LNG plant. Woodside chief executive officer Meg O’Neill noted that this was Williams’ first investment in LNG and that the project was “on track to deliver first LNG in 2029.” Woodside last month signed its first long-term LNG offtake agreement with Petronas, which could include production from Louisiana LNG. First deliveries under the deal are also targeted for 2029. The deal was executed at a purchase price of $250 million with an effective date of Jan. 1, 2025. The total proceeds received were $378 million, including proportionate capital reimbursement since the effective date.  Relatedly, Williams agreed to sell its minority interest in the South Mansfield upstream asset of Louisiana’s Haynesville shale to JERA Co. Inc. for $398 million plus deferred monthly payments through 2029. The total transaction value is estimated at $1.5 billion. South Mansfield currently produces 500 MMcfd and includes 200 undeveloped locations, JERA said. The company plans to increase production to 1 bcfd. GEP Haynesville II LLC is also selling its majority interest in South Mansfield upstream but will continue to operate and develop the asset on a fixed fee basis. GEP continues to own and operate other Haynesville shale assets not involved in this transaction. Under JERA’s ownership, Williams will continue to gather natural gas volumes from South Mansfield and will deliver those volumes through its Louisiana Energy Gateway (LEG) system

Read More »

How to set up an AI data center in 90 days

“Personally, I think that a brownfield is very creative way to deal with what I think is the biggest problem that we’ve got right now, which is time and speed to market,” he said. “On a brownfield, I can go into a building that’s already got power coming into the building. Sometimes they’ve already got chiller plants, like what we’ve got with the building I’m in right now.” Patmos certainly made the most of the liquid facilities in the old printing press building. The facility is built to handle anywhere from 50 to over 140 kilowatts per cabinet, a leap far beyond the 1–2 kW densities typical of legacy data centers. The chips used in the servers are Nvidia’s Grace Blackwell processors, which run extraordinarily hot. To manage this heat load, Patmos employs a multi-loop liquid cooling system. The design separates water sources into distinct, closed loops, each serving a specific function and ensuring that municipal water never directly contacts sensitive IT equipment. “We have five different, completely separated water loops in this building,” said Morgan. “The cooling tower uses city water for evaporation, but that water never mixes with the closed loops serving the data hall. Everything is designed to maximize efficiency and protect the hardware.” The building taps into Kansas City’s district chilled water supply, which is sourced from a nearby utility plant. This provides the primary cooling resource for the facility. Inside the data center, a dedicated loop circulates a specialized glycol-based fluid, filtered to extremely low micron levels and formulated to be electronically safe. Heat exchangers transfer heat from the data hall fluid to the district chilled water, keeping the two fluids separate and preventing corrosion or contamination. Liquid-to-chip and rear-door heat exchangers are used for immediate heat removal.

Read More »

INNIO and VoltaGrid: Landmark 2.3 GW Modular Power Deal Signals New Phase for AI Data Centers

Why This Project Marks a Landmark Shift The deployment of 2.3 GW of modular generation represents utility-scale capacity, but what makes it distinct is the delivery model. Instead of a centralized plant, the project uses modular gas-reciprocating “power packs” that can be phased in step with data-hall readiness. This approach allows staged energization and limits the bottlenecks that often stall AI campuses as they outgrow grid timelines or wait in interconnection queues. AI training loads fluctuate sharply, placing exceptional stress on grid stability and voltage quality. The INNIO/VoltaGrid platform was engineered specifically for these GPU-driven dynamics, emphasizing high transient performance (rapid load acceptance) and grid-grade power quality, all without dependence on batteries. Each power pack is also designed for maximum permitting efficiency and sustainability. Compared with diesel generation, modern gas-reciprocating systems materially reduce both criteria pollutants and CO₂ emissions. VoltaGrid markets the configuration as near-zero criteria air emissions and hydrogen-ready, extending allowable runtimes under air permits and making “prime-as-a-service” viable even in constrained or non-attainment markets. 2025: Momentum for Modular Prime Power INNIO has spent 2025 positioning its Jenbacher platform as a next-generation power solution for data centers: combining fast start, high transient performance, and lower emissions compared with diesel. While the 3 MW J620 fast-start lineage dates back to 2019, this year the company sharpened its data center narrative and booked grid stability and peaking projects in markets where rapid data center growth is stressing local grids. This momentum was exemplified by an 80 MW deployment in Indonesia announced earlier in October. The same year saw surging AI-driven demand and INNIO’s growing push into North American data-center markets. Specifications for the 2.3 GW VoltaGrid package highlight the platform’s heat tolerance, efficiency, and transient response, all key attributes for powering modern AI campuses. VoltaGrid’s 2025 Milestones VoltaGrid’s announcements across 2025 reflect

Read More »

Inside Google’s multi-architecture revolution: Axion Arm joins x86 in production clusters

Matt Kimball, VP and principal analyst with Moor Insights and Strategy, pointed out that AWS and Microsoft have already moved many workloads from x86 to internally designed Arm-based servers. He noted that, when Arm first hit the hyperscale datacenter market, the architecture was used to support more lightweight, cloud-native workloads with an interpretive layer where architectural affinity was “non-existent.” But now there’s much more focus on architecture, and compatibility issues “largely go away” as Arm servers support more and more workloads. “In parallel, we’ve seen CSPs expand their designs to support both scale out (cloud-native) and traditional scale up workloads effectively,” said Kimball. Simply put, CSPs are looking to monetize chip investments, and this migration signals that Google has found its performance-per-dollar (and likely performance-per-watt) better on Axion than x86. Google will likely continue to expand its Arm footprint as it evolves its Axion chip; as a reference point, Kimball pointed to AWS Graviton, which didn’t really support “scale up” performance until its v3 or v4 chip. Arm is coming to enterprise data centers too When looking at architectures, enterprise CIOs should ask themselves questions such as what instance do they use for cloud workloads, and what servers do they deploy in their data center, Kimball noted. “I think there is a lot less concern about putting my workloads on an Arm-based instance on Google Cloud, a little more hesitance to deploy those Arm servers in my datacenter,” he said. But ultimately, he said, “Arm is coming to the enterprise datacenter as a compute platform, and Nvidia will help usher this in.” Info-Tech’s Jain agreed that Nvidia is the “biggest cheerleader” for Arm-based architecture, and Arm is increasingly moving from niche and mobile use to general-purpose and AI workload execution.

Read More »

AMD Scales the AI Factory: 6 GW OpenAI Deal, Korean HBM Push, and Helios Debut

What 6 GW of GPUs Really Means The 6 GW of accelerator load envisioned under the OpenAI–AMD partnership will be distributed across multiple hyperscale AI factory campuses. If OpenAI begins with 1 GW of deployment in 2026, subsequent phases will likely be spread regionally to balance supply chains, latency zones, and power procurement risk. Importantly, this represents entirely new investment in both power infrastructure and GPU capacity. OpenAI and its partners have already outlined multi-GW ambitions under the broader Stargate program; this new initiative adds another major tranche to that roadmap. Designing for the AI Factory Era These upcoming facilities are being purpose-built for next-generation AI factories, where MI450-class clusters could drive rack densities exceeding 100 kW. That level of compute concentration makes advanced power and cooling architectures mandatory, not optional. Expected solutions include: Warm-water liquid cooling (manifold, rear-door, and CDU variants) as standard practice. Facility-scale water loops and heat-reuse systems—including potential district-heating partnerships where feasible. Medium-voltage distribution within buildings, emphasizing busway-first designs and expanded fault-current engineering. While AMD has not yet disclosed thermal design power (TDP) specifications for the MI450, a 1 GW campus target implies tens of thousands of accelerators. That scale assumes liquid cooling, ultra-dense racks, and minimal network latency footprints, pushing architectures decisively toward an “AI-first” orientation. Design considerations for these AI factories will likely include: Liquid-to-liquid cooling plants engineered for step-function capacity adders (200–400 MW blocks). Optics-friendly white space layouts with short-reach topologies, fiber raceways, and aisles optimized for module swaps. Substation adjacency and on-site generation envelopes negotiated during early land-banking phases. Networking, Memory, and Power Integration As compute density scales, networking and memory bottlenecks will define infrastructure design. Expect fat-tree and dragonfly network topologies, 800 G–1.6 T interconnects, and aggressive optical-module roadmaps to minimize collective-operation latency, aligning with recent disclosures from major networking vendors.

Read More »

Study Finds $4B in Data Center Grid Costs Shifted to Consumers Across PJM Region

In a new report spanning 2022 through 2024, the Union of Concerned Scientists (UCS) identifies a significant regulatory gap in the PJM Interconnection’s planning and rate-making process—one that allows most high-voltage (“transmission-level”) interconnection costs for large, especially AI-scale, data centers to be socialized across all utility customers. The result, UCS argues, is a multi-billion-dollar pass-through that is poised to grow as more data center projects move forward, because these assets are routinely classified as ordinary transmission infrastructure rather than customer-specific hookups. According to the report, between 2022 and 2024, utilities initiated more than 150 local transmission projects across seven PJM states specifically to serve data center connections. In 2024 alone, 130 projects were approved with total costs of approximately $4.36 billion. Virginia accounted for nearly half that total—just under $2 billion—followed by Ohio ($1.3 billion) and Pennsylvania ($492 million) in data-center-related interconnection spending. Yet only six of those 130 projects, about 5 percent, were reported as directly paid for by the requesting customer. The remaining 95 percent, representing more than $4 billion in 2024 connection costs, were rolled into general transmission charges and ultimately recovered from all retail ratepayers. How Does This Happen? When data center project costs are discussed, the focus is usually on the price of the power consumed, or megawatts multiplied by rate. What the UCS report isolates, however, is something different: the cost of physically delivering that power: the substations, transmission lines, and related infrastructure needed to connect hyperscale facilities to the grid. So why aren’t these substantial consumer-borne costs more visible? The report identifies several structural reasons for what effectively functions as a regulatory loophole in how development expenses are reported and allocated: Jurisdictional split. High-voltage facilities fall under the Federal Energy Regulatory Commission (FERC), while retail electricity rates are governed by state public utility

Read More »

OCP Global Summit 2025 Highlights: Advancing Data Center Densification and Security

With the conclusion of the 2025 OCP Global Summit, William G. Wong, Senior Content Director at DCF’s sister publications Electronic Design and Microwaves & RF, published a comprehensive roundup of standout technologies unveiled at the event. For Data Center Frontier readers, we’ve revisited those innovations through the lens of data center impact, focusing on how they reshape infrastructure design and operational strategy. This year’s OCP Summit marked a decisive shift toward denser GPU racks, standardized direct-to-chip liquid cooling, 800-V DC power distribution, high-speed in-rack fabrics, and “crypto-agile” platform security. Collectively, these advances aim to accelerate time-to-capacity, reduce power-distribution losses at megawatt rack scales, simplify retrofits in legacy halls, and fortify data center platforms against post-quantum threats. Rack Design and Cooling: From Ad-Hoc to Production-Grade Liquid Cooling NVIDIA’s Vera Rubin compute tray, newly offered to OCP for standardization, packages Rubin-generation GPUs with an integrated liquid-cooling manifold and PCB midplane. Compared with the GB300 tray, Vera Rubin represents a production-ready module delivering four times the memory and three times the memory bandwidth: a 7.5× performance factor at rack scale, with 150 TB of memory at 1.7 PB/s per rack. The system implements 45 °C liquid cooling, a 5,000-amp liquid-cooled busbar, and on-tray energy storage with power-resilience features such as flexible 100-amp whips and automatic-transfer power-supply units. NVIDIA also previewed a Kyber rack generation targeted for 2027, pivoting from 415/480 VAC to 800 V DC to support up to 576 Rubin Ultra GPUs, potentially eliminating the 200-kg copper busbars typical today. These refinements are aimed at both copper reduction and aisle-level manageability. Wiwynn’s announcements filled in the practicalities of deploying such densities. The company showcased rack- and system-level designs across NVIDIA GB300 NVL72 (72 Blackwell Ultra GPUs with 800 Gb/s ConnectX-8 SuperNICs) for large-scale inference and reasoning, and HGX B300 (eight GPUs /

Read More »

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one that is ramping up its investments into AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs).  In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple would between them devote $200 billion to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are way higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

Read More »

John Deere unveils more autonomous farm machines to address skill labor shortage

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. The Moline, Illinois-based John Deere has been in business for 187 years, yet it’s been a regular as a non-tech company showing off technology at the big tech trade show in Las Vegas and is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech. The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually; and the agricultural work force continues to shrink. (This is my hint to the anti-immigration crowd). John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app. While each of these industries experiences their own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% percent of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

Read More »

2025 playbook for enterprise AI success, from agents to evals

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More 2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year. 1. Agents: the next generation of automation AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for companies, and that recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to

Read More »

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more. The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks. Going all-in on red teaming pays practical, competitive dividends It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S.’s National Institute of Standards and Technology (NIST), which all had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find. What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle

Read More »