Learnings from a Machine Learning Engineer — Part 3: The Evaluation

In this third part of my series, I will explore the evaluation process which is a critical piece that will lead to a cleaner data set and elevate your model performance. We will see the difference between evaluation of a trained model (one not yet in production), and evaluation of a deployed model (one making real-world predictions).

In Part 1, I discussed the process of labelling your image data that you use in your Image Classification project. I showed how to define “good” images and create sub-classes. In Part 2, I went over various data sets, beyond the usual train-validation-test sets, such as benchmark sets, plus how to handle synthetic data and duplicate images.

Evaluation of the trained model

As machine learning engineers we look at accuracy, F1, log loss, and other metrics to decide if a model is ready to move to production. These are all important measures, but from my experience, these scores can be deceiving especially as the number of classes grows.

Although it can be time consuming, I find it very important to manually review the images that the model gets wrong, as well as the images that the model gives a low softmax “confidence” score to. This means adding a step immediately after your training run completes to calculate scores for all images — training, validation, test, and the benchmark sets. You only need to bring up for manual review the ones that the model had problems with, which should be a small percentage of the total number of images. See the Double-check process below.
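As a minimal sketch of that post-training step, the filter might look like this (the field names and the 0.95 threshold are my own assumptions, not from a specific framework):

```python
# Hypothetical post-training review step: score every image with the new
# model, then queue only the problem cases for manual review.
CONFIDENCE_THRESHOLD = 0.95

def flag_for_review(predictions):
    """predictions: list of dicts with 'path', 'label', 'predicted', 'score'."""
    flagged = []
    for p in predictions:
        wrong = p["predicted"] != p["label"]
        low_confidence = p["score"] < CONFIDENCE_THRESHOLD
        if wrong or low_confidence:
            flagged.append(p)
    return flagged

preds = [
    {"path": "img1.jpg", "label": "giraffe", "predicted": "giraffe", "score": 0.99},
    {"path": "img2.jpg", "label": "african_elephant", "predicted": "asian_elephant", "score": 0.97},
    {"path": "img3.jpg", "label": "zebra", "predicted": "zebra", "score": 0.61},
]
print(len(flag_for_review(preds)))  # only 2 of the 3 need a manual look
```

Only the flagged list goes into the manual review queue; everything else passed cleanly.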

What you do during the manual evaluation is put yourself in a “training mindset” to ensure that the labelling standards you set up in Part 1 are being followed. Ask yourself:

  • “Is this a good image?” Is the subject front and center, and can you clearly see all the features?
  • “Is this the correct label?” Don’t be surprised if you find wrong labels.

You can either remove the bad images or fix the labels if they are wrong. Otherwise you can keep them in the data set and force the model to do better next time. Other questions I ask are:

  • “Why did the model get this wrong?”
  • “Why did this image get a low score?”
  • “What is it about the image that caused confusion?”

Sometimes the answer has nothing to do with that specific image. Frequently, it has to do with the other images, either in the ground truth class or in the predicted class. It is worth the effort to Double-check all images in both sets if you see a consistently bad guess. Again, don’t be surprised if you find poor images or wrong labels.

Weighted evaluation

When doing the evaluation of the trained model (above), we apply a lot of subjective analysis — “Why did the model get this wrong?” and “Is this a good image?” From these, you may only get a gut feeling.

Frequently, I will decide to hold off moving a model forward to production based on that gut feel. But how can you justify to your manager that you want to hit the brakes? This is where a more objective analysis comes in: create a weighted average of the softmax “confidence” scores.

In order to apply a weighted evaluation, we need to identify sets of classes that deserve adjustments to the score. Here is where I create a list of “commonly confused” classes.

Commonly confused classes

Certain animals at our zoo can easily be mistaken for one another. For example, African elephants and Asian elephants differ mainly in their ear shapes. If your model gets these two mixed up, that is not as bad as guessing a giraffe! So perhaps you give partial credit here. You and your subject matter experts (SMEs) can come up with a list of these pairs and a weighted adjustment for each.

This weight can be factored into a modified cross-entropy loss function in the equation below. The back half of this equation will reduce the impact of being wrong for specific pairs of ground truth and prediction by using the “weight” function as a lookup. By default, the weighted adjustment would be 1 for all pairings, and the commonly confused classes would get something like 0.5.

In other words, it’s better to be unsure (have a lower confidence score) when you are wrong, compared to being super confident and wrong.

Modified cross-entropy loss function, image by author
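A sketch of that weighted log loss in code (the pair list, weights, and function names are illustrative; the 0.5 partial-credit factor matches the example above):

```python
import math

# Commonly confused pairs get partial credit; everything else defaults to 1.0.
CONFUSED_PAIRS = {
    ("african_elephant", "asian_elephant"): 0.5,
    ("asian_elephant", "african_elephant"): 0.5,
}

def weight(truth, predicted):
    """Lookup for the weighted adjustment of a (ground truth, prediction) pair."""
    return CONFUSED_PAIRS.get((truth, predicted), 1.0)

def weighted_log_loss(samples):
    """samples: list of (truth, predicted, prob_assigned_to_truth) tuples."""
    total = 0.0
    for truth, predicted, p_truth in samples:
        total += weight(truth, predicted) * -math.log(max(p_truth, 1e-12))
    return total / len(samples)

# Confusing the two elephants hurts half as much as calling one a giraffe.
samples = [
    ("african_elephant", "asian_elephant", 0.30),
    ("african_elephant", "giraffe", 0.30),
]
print(round(weighted_log_loss(samples), 3))
```

Both mistakes assign the same probability to the truth, but the elephant mix-up contributes only half the penalty.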

Once this weighted log loss is calculated, I can compare to previous training runs to see if the new model is ready for production.

Confidence threshold report

Another valuable measure incorporates the confidence threshold (95 in my example): a report on accuracy and false positive rates. Recall that when we apply the confidence threshold before presenting results, we help keep false positives from being shown to the end user.

In this table, we look at the breakdown of “true positive above 95” for each data set. We get a sense that when a “good” picture comes through (like the ones from our train-validation-test set) it is very likely to surpass the threshold, thus the user is “happy” with the outcome. Conversely, the “false positive above 95” is extremely low for good pictures, thus only a small number of our users will be “sad” about the results.

Example Confidence Threshold Report, image by author
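One way the report's rates might be computed, assuming the same prediction records as before (field names are my own):

```python
THRESHOLD = 0.95

def threshold_report(predictions):
    """One row of the confidence threshold report for a single data set."""
    total = len(predictions)
    tp_above = sum(1 for p in predictions
                   if p["score"] >= THRESHOLD and p["predicted"] == p["label"])
    fp_above = sum(1 for p in predictions
                   if p["score"] >= THRESHOLD and p["predicted"] != p["label"])
    return {"true_positive_above": tp_above / total,
            "false_positive_above": fp_above / total}
```

Running this once per data set — train, validation, test, and each benchmark — fills in the table above.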

We expect the train-validation-test set results to be exceptional since our data is curated. So, as long as people take “good” pictures, the model should do very well. But to get a sense of how it does in extreme situations, let’s take a look at our benchmarks.

The “difficult” benchmark has more modest true positive and false positive rates, which reflects the fact that the images are more challenging. These values are much easier to compare across training runs, so they let me set min/max targets. For example, if I target a minimum of 80% for true positives and a maximum of 5% for false positives on this benchmark, then I can feel confident moving this model to production.

The “out-of-scope” benchmark has no true positive rate because none of the images belong to any class the model can identify. Remember, we picked things like a bag of popcorn, etc., that are not zoo animals, so there cannot be any true positives. But we do get a false positive rate, which means the model gave a confident score to that bag of popcorn as some animal. And if we set a target maximum of 10% for this benchmark, then we may not want to move it to production.

Right now, you may be thinking, “Well, what animal did it pick for the bag of popcorn?” Excellent question! Now you understand the importance of doing a manual review of the images that get bad results.

Evaluation of the deployed model

The evaluation that I described above applies to a model immediately after training. Now, you want to evaluate how your model is doing in the real world. The process is similar, but requires you to shift to a “production mindset” and ask yourself, “Did the model get this correct?”, “Should it have gotten this correct?”, and “Did we tell the user the right thing?”

So, imagine that you are logging in for the morning — after sipping on your cold brew coffee, of course — and are presented with 500 images that your zoo guests took yesterday of different animals. Your job is to determine how satisfied the guests were using your model to identify the zoo animals.

Using the softmax “confidence” score for each image, we have a threshold before presenting results. Above the threshold, we tell the guest what the model predicted. I’ll call this the “happy path”. And below the threshold is the “sad path” where we ask them to try again.
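That routing step is simple enough to sketch in a few lines (the messages and the 0.95 threshold are placeholders for whatever your front-end actually shows):

```python
def route_prediction(predicted_class, score, threshold=0.95):
    """Return the guest-facing message: happy path above the threshold, sad path below."""
    if score >= threshold:
        return f"Looks like a {predicted_class}!"        # happy path
    return "Sorry, we couldn't tell. Please try again."  # sad path
```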

Your review interface will first show you all the “happy path” images one at a time. This is where you ask yourself, “Did we get this right?” Hopefully, yes!

But if not, this is where things get tricky. So now you have to ask, “Why not?” Here are some things that it could be:

  • “Bad” picture — Poor lighting, bad angle, zoomed out, etc. — refer to your labelling standards.
  • Out-of-scope — It’s a zoo animal, but unfortunately one that isn’t found in this zoo. Maybe it belongs to another zoo (your guest likes to travel and try out your app). Consider adding these to your data set.
  • Out-of-scope — It’s not a zoo animal. It could be an animal in your zoo, but not one typically contained there, like a neighborhood sparrow or mallard duck. This might be a candidate to add.
  • Out-of-scope — It’s something found in the zoo. A zoo usually has interesting trees and shrubs, so people might try to identify those. Another candidate to add.
  • Prankster — Completely out-of-scope. Because people like to play with technology, there’s the possibility you have a prankster that took a picture of a bag of popcorn, or a soft drink cup, or even a selfie. These are hard to prevent, but hopefully get a low enough score (below the threshold) so the model did not identify it as a zoo animal. If you see enough of a pattern in these, consider creating a class with special handling on the front-end.

After reviewing the “happy path” images, you move on to the “sad path” images — the ones that got a low confidence score and the app gave a “sorry, try again” message. This time you ask yourself, “Should the model have given this image a higher score?” which would have put it in the “happy path”. If so, then you want to ensure these images are added to the training set so next time it will do better. But most of the time, the low score reflects many of the “bad” or out-of-scope situations mentioned above.

Perhaps your model performance is suffering and it has nothing to do with the model itself. Maybe it is the way your users are interacting with the app. Keep an eye out for non-technical problems and share your observations with the rest of your team. For example:

  • Are your users using the application in the ways you expected?
  • Are they not following the instructions?
  • Do the instructions need to be stated more clearly?
  • Is there anything you can do to improve the experience?

Collect statistics and new images

Both of the manual evaluations above open a gold mine of data. So, be sure to collect these statistics and feed them into a dashboard — your manager and your future self will thank you!

Keep track of these stats and generate reports that you and your team can reference:

  • How often is the model being called?
  • What times of day, and what days of the week, is it used?
  • Are your system resources able to handle the peak load?
  • What classes are the most common?
  • After evaluation, what is the accuracy for each class?
  • What is the breakdown for confidence scores?
  • How many scores are above and below the confidence threshold?
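A few of the questions above can come straight out of a day's prediction log. Here is a minimal sketch of that aggregation (the log's field names are assumptions):

```python
from collections import Counter

def daily_stats(log, threshold=0.95):
    """Aggregate one day's prediction log into dashboard-ready numbers."""
    return {
        "calls": len(log),
        "top_classes": Counter(p["predicted"] for p in log).most_common(5),
        "above_threshold": sum(p["score"] >= threshold for p in log),
        "below_threshold": sum(p["score"] < threshold for p in log),
    }
```

Feeding the output into a dashboard each day makes trends across training runs and seasons easy to spot.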

The single best thing you get from a deployed model is the additional real-world images! You can add these new images to improve coverage of your existing zoo animals. But more importantly, they provide insight into other classes to add. For example, let’s say people enjoy taking a picture of the large walrus statue at the gate. Some of these may make sense to incorporate into your data set to provide a better user experience.

Creating a new class, like the walrus statue, is not a huge effort, and it avoids the false positive responses. It would be more embarrassing to identify a walrus statue as an elephant! As for the prankster and the bag of popcorn, you can configure your front-end to quietly handle these. You might even get creative and have fun with it like, “Thank you for visiting the food court.”

Double-check process

It is a good idea to double-check your image set when you suspect there may be problems with your data. I’m not suggesting a top-to-bottom check, because that would be a monumental effort! Rather, check the specific classes that you suspect could contain bad data that is degrading your model performance.

Immediately after my training run completes, I have a script that will use this new model to generate predictions for my entire data set. When this is complete, it will take the list of incorrect identifications, as well as the low scoring predictions, and automatically feed that list into the Double-check interface.

This interface will show, one at a time, the image in question, alongside an example image of the ground truth and an example image of what the model predicted. I can visually compare the three, side-by-side. The first thing I do is ensure the original image is a “good” picture, following my labelling standards. Then I check if the ground-truth label is indeed correct, or if there is something that made the model think it was the predicted label.
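Assembling one of those side-by-side items might look like this (`example_image` is a hypothetical helper that returns a representative image path for a class; the field names are assumptions):

```python
def build_review_item(flagged, example_image):
    """Build one side-by-side Double-check item.

    flagged: dict with 'path', 'label', 'predicted' for the image in question.
    example_image: callable mapping a class name to a representative image path.
    """
    return {
        "image": flagged["path"],
        "ground_truth_example": example_image(flagged["label"]),
        "predicted_example": example_image(flagged["predicted"]),
    }
```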

At this point I can:

  • Remove the original image if the image quality is poor.
  • Relabel the image if it belongs in a different class.

During this manual evaluation, you might notice dozens of the same wrong prediction. Ask yourself why the model made this mistake when the images seem perfectly fine. The answer may be some incorrect labels on images in the ground truth, or even in the predicted class!

Don’t hesitate to add those classes and sub-classes back into the Double-check interface and step through them all. You may have 100–200 pictures to review, but there is a good chance that one or two of the images will stand out as being the culprit.

Up next…

With a different mindset for a trained model versus a deployed model, we can now evaluate performance to decide which models are ready for production, and how well a production model is going to serve the public. This relies on a solid Double-check process and a critical eye on your data. And beyond the “gut feel” about your model, we can rely on the benchmark scores to support us.

In Part 4, we kick off the training run. There are some subtle techniques to get the most out of the process, and even ways to leverage throw-away models to expand your library of image data.

Shape
Shape
Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy,  bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

Shape

AI-driven network management gains enterprise trust

The way the full process works is that the raw data feed comes in, and machine learning is used to identify an anomaly that could be a possible incident. That’s where the generative AI agents step up. In addition to the history of similar issues, the agents also look for

Read More »

Chinese cyberspies target VMware vSphere for long-term persistence

Designed to work in virtualized environments The CISA, NSA, and Canadian Cyber Center analysts note that some of the BRICKSTORM samples are virtualization-aware and they create a virtual socket (VSOCK) interface that enables inter-VM communication and data exfiltration. The malware also checks the environment upon execution to ensure it’s running

Read More »

IBM boosts DNS protection for multicloud operations

“In addition to this DNS synchronization, you can publish DNS configurations to your Amazon Simple Storage Service (S3) bucket. As you implement DNS changes, the S3 bucket will automatically update. The ability to store multiple configurations in your S3 bucket allows you to choose the most appropriate restore point if

Read More »

North America Adds Rigs Week on Week

North America added eight rigs week on week, according to Baker Hughes’ latest North America rotary rig count, which was published on December 5. The total U.S. rig count increased by five week on week and the total Canada rig count rose by three during the same period, taking the total North America rig count up to 740, comprising 549 rigs from the U.S. and 191 rigs from Canada, the count outlined. Of the total U.S. rig count of 549, 527 rigs are categorized as land rigs, 19 are categorized as offshore rigs, and three are categorized as inland water rigs. The total U.S. rig count is made up of 413 oil rigs, 129 gas rigs, and seven miscellaneous rigs, according to Baker Hughes’ count, which revealed that the U.S. total comprises 476 horizontal rigs, 58 directional rigs, and 15 vertical rigs. Week on week, the U.S. land rig count rose by three, and its offshore and inland water rig counts each increased by one, Baker Hughes highlighted. The U.S. oil rig count rose by six week on week, its gas rig count dropped by one by week on week, and its miscellaneous rig count remained unchanged week on week, the count showed. The U.S. horizontal rig count rose by one, its directional rig count remained unchanged, and its vertical rig count increased by four, week on week, the count revealed. A major state variances subcategory included in the rig count showed that, week on week, Louisiana added four rigs and New Mexico added one rig. A major basin variances subcategory included in Baker Hughes’ rig count showed that, week on week, the Eagle Ford basin dropped one rig. Canada’s total rig count of 191 is made up of 126 oil rigs and 65 gas rigs, Baker Hughes pointed out.

Read More »

Trump Admin Backs Potential American Buy of Lukoil Iraq Field

The US government is backing Iraq’s plan to transfer Lukoil PJSC’s stake in a giant oil field to an American company, days before a sanctions waiver on the Russian firm is set to expire. Iraq’s Oil Ministry last week said it’s approaching US companies to take over the majority holding in West Qurna 2, which pumps about 10 percent of the country’s crude. The Trump administration’s preference is for the Russian firm’s global assets to be taken over by a US entity, people familiar with the matter said last month. The ministry didn’t name any companies, but US firms including Exxon Mobil Corp. and Chevron Corp. have emerged as potential suitors for Lukoil’s assets. For West Qurna-2, Iraq would prefer Exxon, which had previously operated the neighboring West Qurna 1 oil field, one person said, asking not to be identified because the information is private. Exxon recently returned to Iraq after a two-year absence, signing an initial agreement in October that could pave the way for developing the Majnoon field in the country’s south. Chevron is in discussions to enter Iraq, Chief Executive Officer Mike Wirth said at the company’s Nov. 12 investor day. The company’s officials met with Iraq’s oil minister in Baghdad this week, according to a Iraqi statement. “We are encouraged by the Iraqi Ministry of Oil’s initial agreements with Exxon and Chevron, the recent commitment to transition West Qurna-2 to a US operator,” a State Department spokesperson said in answer to questions from Bloomberg. “The United States will continue to champion the interests of American companies in Iraq.” Exxon and Chevron declined to comment. A call to Lukoil’s press service went unanswered and the company didn’t respond to an email sent outside of normal business hours in Moscow on Monday. Iraq, the second-largest producer in the Organization of

Read More »

Noble to Sell 6 Jackups, Become Pureplay Deepwater Driller

Noble Corp said Monday it had signed separate deals to sell five jackup rigs to Borr Drilling Ltd for $360 million and one jackup to Ocean Oilfield Drilling for $64 million. After the completion of the transactions, expected next year, “Noble will be a pureplay deepwater and ultra-harsh environment jackup operator”, the offshore driller said in an online statement. Borr will acquire Noble Resilient (built 2009), Noble Resolute (built 2009), Noble Mick O’Brien (built 2013), Noble Regina Allen (built 2013) and Noble Tom Prosser (built 2014). The purchase price consists of $210 million in cash and $150 million in seller notes. “The $150 million in proposed seller notes to Borr are expected to have a six-year maturity and be secured by a first lien on three jackups (Noble Tom Prosser, Noble Regina Allen and Noble Resilient)”, Noble said. “Additionally, Noble intends to operate two rigs – Noble Mick O’Brien and Noble Resolute – under a bareboat charter agreement with Borr for one year from signing of the definitive agreement”, it said. Meanwhile Ocean Oilfield Drilling will buy Noble Resolve, built 2009, after the rig’s ongoing contract ends. Noble Resolve will be freed in the first quarter of 2026, Nobel says on its online fleet inventory. The rig is currently deployed in Spain for an unnamed operator, according to Noble’s latest fleet status report, published October 27. Ocean Oilfield Drilling will pay in cash. “These transactions are expected to be immediately accretive to our shareholders based on both trailing 2025 and anticipated 2026 EBITDA and free cash flow, while also bolstering our balance sheet and sharpening the focus on our established positions in the deepwater and ultra-harsh jackup segments”, said president and chief executive Robert W. Eifler. In its quarterly report October 27, Noble said the Noble Globetrotter II drillship, built 2013, was also being sold. During the third

Read More »

Equinor Scores 2 Gas, Condensate Discoveries in Sleipner

Equinor ASA and its partners have achieved two new natural gas and condensate discoveries in the Sleipner area on Norway’s side of the North Sea. Preliminary estimates for Lofn (well 15/5-8 S) and Langemann (15/5-8 A), in production license 1140, indicate 5-18 million standard cubic meters oil-equivalent recoverable resources, or 30-110 million barrels, according to the Norwegian majority state-owned company. “These are Equinor’s largest discoveries so far this year and can be developed for the European market through existing infrastructure”, it said in an online statement. The discoveries sit between the Gudrun and Eirin fields and about 40 kilometers (24.85 miles) northwest of the Sleipner A processing, drilling and living quarters platform, according to Equinor. The platform is one of several installations serving the Sleipner gas and condensate fields Sleipner East (which started production 1993), Gungne (started up 1996) and Sleipner West (also put onstream 1996). Sleipner infrastructure also serves tie-in fields Sigyn (online since 2002), Volve (started up 2008), Gudrun (started up 2014) and Gina Krog (started up 2017). Lofn and Langemann encountered gas and condensate in the Hugin Formation, which consists of sandstones with “good reservoir properties”, Equinor said. “The discoveries reduce uncertainty in several nearby prospects, which will now be further evaluated”, it said. Kjetil Hove, executive vice president for Norwegian exploration and production at Equinor, said, “This demonstrates the importance of maintaining exploration activity on the Norwegian continental shelf. There are still significant energy resources on the shelf, and Europe needs stable oil and gas deliveries”. “Discoveries near existing fields can be developed quickly through subsea facilities, with limited environmental impact, very low CO2 emissions from production and strong profitability”, Hove said. 
“Equinor plans to accelerate such developments on the Norwegian continental shelf”. Karl Johnny Hersvik, chief executive of license co-owner Aker BP ASA, said separately the

Read More »

Crude Settles Lower

Oil eased by the most in almost three weeks as traders monitored India’s buying of Russian crude and refined products markets slumped, leading the energy complex lower. West Texas Intermediate futures fell 2% to settle near $59 a barrel, weighed down by losses in US equities, and have now been trading in a range of less than $4 since the start of November. Russian President Vladimir Putin last week promised “uninterrupted shipments” of fuel to India even as Moscow faces steeper sanctions over its war in Ukraine. The shipments will likely be a key point for discussions as US negotiators arrive in the South Asian nation for trade talks. “Oversupply concerns will eventually be realized, especially as Russian oil and refined product flows eventually circumvent existing sanctions,” said Vivek Dhar, an analyst with Commonwealth Bank of Australia. That will see Brent futures fall toward $60 a barrel through 2026, he said. Among products, gasoline futures dropped 2% in New York, after hitting the lowest level since May 2021 last week. Diesel prices also weakened in a drag on energy commodities across the board. The focus on Moscow’s flows comes as a potential peace deal between Ukraine and Russia also remained in focus. US President Donald Trump said he was disappointed in Ukrainian President Volodymyr Zelenskiy’s handling of a US proposal to end the nearly four-year-old war. Those tensions will be weighed against glut concerns, with higher supply from OPEC+ and producers outside the group — including the US, Brazil and Guyana — set to overwhelm tepid demand growth. The US’s Energy Information Administration, the International Energy Agency and the Organization of the Petroleum Exporting Countries will publish monthly market outlooks this week that may provide further insights. Both WTI and Brent remain on their longest runs below their 100-day moving

Read More »

Energy Department Announces $11 Million in Awards to Develop HALEU Transportation Packages

IDAHO FALLS, ID. —The U.S. Department of Energy (DOE) today announced $11 million in awards to five U.S. companies to develop and license new or modified transportation packages for high-assay low-enriched uranium (HALEU). The announcement was made during U.S. Secretary of Energy Chris Wright’s visit to Idaho National Laboratory (INL), marking the final stop in his ongoing tour of all 17 DOE National Laboratories. These selections advance President Trump’s recent executive orders and commitment to rebuild the Nation’s nuclear fuel cycle, strengthen domestic enrichment and fabrication capabilities, and accelerate the deployment of advanced reactors to usher in a new American nuclear renaissance. “From critical minerals to nuclear fuel, the Trump administration is fully committed to restoring the supply chains needed to secure America’s future,” said Secretary Wright. “Thanks to President Trump, the Energy Department is operating at record speeds to unleash the next American Nuclear Renaissance and to deliver more affordable, reliable, and secure energy for American families and businesses.” DOE’s $11 million in awards will support industry-led efforts to design, modify, and license transportation packages through the U.S. Nuclear Regulatory Commission (NRC). These investments will help establish long-term, economical HALEU transport capabilities that better serve domestic reactor developers and strengthen the U.S. nuclear supply chain. 
The following companies were selected to develop long-term economic solutions for the safe transport of HALEU through two topic areas: Topic Area 1: Develop new package designs that can be licensed by the NRC NAC International Westinghouse Electric Company Container Technologies Industries, LLC American Centrifuge Operating Paragon D&E Topic Area 2: Modify existing design packages for NRC certification NAC International Projects under Topic Area 1 will have performance periods of up to three years; the Topic Area 2 project will have a performance period of up to two years. Funding is provided through DOE’s

Read More »

What does Arm need to do to gain enterprise acceptance?

But in 2017, AMD released the Zen architecture, which was equal if not superior to the Intel architecture. Zen made AMD competitive, and it fueled an explosive rebirth for a company that was near death a few years prior. AMD now has about 30% market share, while Intel suffers from a loss of technology as well as corporate leadership. Now, customers have a choice of Intel or AMD, and they don’t have to worry about porting their applications to a new platform like they would have to do if they switched to Arm. Analysts weigh in on Arm Tim Crawford sees no demand for Arm in the data center. Crawford is president of AVOA, a CIO consultancy. In his role, he talks to IT professionals all the time, but he’s not hearing much interest in Arm. “I don’t see Arm really making a dent, ever, into the general-purpose processor space,” Crawford said. “I think the opportunity for Arm is special applications and special silicon. If you look at the major cloud providers, their custom silicon is specifically built to do training or optimized to do inference. Arm is kind of in the same situation in the sense that it has to be optimized.” “The problem [for Arm] is that there’s not necessarily a need to fulfill at this point in time,” said Rob Enderle, principal analyst with The Enderle Group. “Obviously, there’s always room for other solutions, but Arm is still going to face the challenge of software compatibility.” And therein lies what may be Arm’s greatest challenge: software compatibility. Software doesn’t care (usually) if it’s on Intel or AMD, because both use the x86 architecture, with some differences in extensions. But Arm is a whole new platform, and that requires porting and testing. Enterprises generally don’t like disruption —

Read More »

Intel decides to keep networking business after all

That doesn’t explain why Intel made the decision to pursue spin-off in the first place. In July, NEX chief Sachin Katti issued a memo that outlined plans to establish key elements of the Networking and Communications business as a stand-alone company. It looked like a done deal, experts said. Jim Hines, research director for enabling technologies and semiconductors at IDC, declined to speculate on whether Intel could get a decent offer but noted NEX is losing ground. IDC estimates Intel’s market share in overall semiconductors at 6.8% in Q3 2025, which is down from 7.4% for the full year 2024 and 9.2% for the full year 2023. Intel’s course reversal “is a positive for Intel in the long term, and recent improvements in its financial situation may have contributed to the decision to keep NEX in house,” he said. When Tan took over as CEO earlier this year, prioritized strengthening the balance sheet and bringing a greater focus on execution. Divest NEX was aligned with these priorities, but since then, Intel has secured investments from the US Government, Nvidia and SoftBank that have reduced the need to raise cash through other means, Hines notes. “The NEX business will prove to be a strategic asset for Intel as it looks to protect and expand its position in the AI datacenter market. Success in this market now requires processor suppliers to offer a full-stack solution, not just silicon. Scale-up and scale-out networking solutions are a key piece of the package, and Intel will be able to leverage its NEX technologies and software, including silicon photonics, to develop differentiated product offerings in this space,” Hines said.

Read More »

At the Crossroads of AI and the Edge: Inside 1623 Farnam’s Rising Role as a Midwest Interconnection Powerhouse

That was the thread that carried through our recent conversation for the DCF Show podcast, where Severn walked through the role Farnam now plays in AI-driven networking, multi-cloud connectivity, and the resurgence of regional interconnection as a core part of U.S. digital infrastructure. Aggregation, Not Proximity: The Practical Edge Severn is clear-eyed about what makes the edge work and what doesn’t. The idea that real content delivery could aggregate at the base of cell towers, he noted, has never been realistic. The traffic simply isn’t there. Content goes where the network already concentrates, and the network concentrates where carriers, broadband providers, cloud onramps, and CDNs have amassed critical mass. In Farnam’s case, that density has grown steadily since the building changed hands in 2018. At the time an “underappreciated asset,” the facility has since become a meeting point for more than 40 broadband providers and over 60 carriers, with major content operators and hyperscale platforms routing traffic directly through its MMRs. That aggregation effect feeds on itself; as more carrier and content traffic converges, more participants anchor themselves to the hub, increasing its gravitational pull. Geography only reinforces that position. Located on the 41st parallel, the building sits at the historical shortest-distance path for early transcontinental fiber routes. It also lies at the crossroads of major east–west and north–south paths that have made Omaha a natural meeting point for backhaul routes and hyperscale expansions across the Midwest. AI and the New Interconnection Economy Perhaps the clearest sign of Farnam’s changing role is the sheer volume of fiber entering the building. More than 5,000 new strands are being brought into the property, with another 5,000 strands being added internally within the Meet-Me Rooms in 2025 alone. These are not incremental upgrades—they are hyperscale-grade expansions driven by the demands of AI traffic,

Read More »

Schneider Electric’s $2.3 Billion in AI Power and Cooling Deals Sends Message to Data Center Sector

When Schneider Electric emerged from its 2025 North American Innovation Summit in Las Vegas last week with nearly $2.3 billion in fresh U.S. data center commitments, it didn't just notch a big sales win. It arguably put a stake in the ground about who controls the AI power-and-cooling stack over the rest of this decade.

Within a single news cycle, Schneider announced two such commitments, one with Switch and one with Digital Realty. Together, the deals total about $2.27 billion in U.S. data center infrastructure, a number Schneider confirmed on background with multiple outlets and which Reuters highlighted as a bellwether for AI-driven demand. For the AI data center ecosystem, these contracts function like early-stage fuel supply deals for the power and cooling systems that underpin the "AI factory."

Supply Capacity Agreements: Locking in the AI Supply Chain

Significantly, both deals are structured as supply capacity agreements (SCAs), not traditional one-off equipment purchase orders. Under the SCA model, Schneider is committing dedicated manufacturing lines and inventory to these customers, guaranteeing output of power and cooling systems over a multi-year horizon. In return, Switch and Digital Realty are providing Schneider with forecastable volume and visibility at the scale of gigawatt-class campus build-outs. A Schneider spokesperson told Reuters that the two contracts are phased across 2025 and 2026, underscoring that this arrangement is about pipeline, as opposed to a one-time backlog spike.

That structure does three important things for the market:

1. Signals confidence that AI demand is durable. You don't ring-fence billions of dollars of factory output for two customers unless you're highly confident the AI load curve runs beyond the current GPU cycle.

2. Pre-allocates power and cooling the way the industry pre-allocated GPUs. Hyperscalers and neoclouds have already spent two years locking up Nvidia and AMD capacity. These SCAs suggest power trains and thermal systems are joining chips on the list of constrained strategic resources.

Read More »

The Data Center Power Squeeze: Mapping the Real Limits of AI-Scale Growth

As we all know, the data center industry is at a crossroads. As artificial intelligence reshapes the already insatiable digital landscape, the demand for computing power is surging at a pace that outstrips the growth of the U.S. electric grid. As engines of the AI economy, an estimated 1,000 new data centers [1] are needed to process, store, and analyze the vast datasets that run everything from generative models to autonomous systems. But this transformation comes with a steep price and a new defining criterion for real estate: power. Our appetite for electricity is now the single greatest constraint on our expansion, threatening to stall the very innovation we enable.

In 2024, U.S. data centers consumed roughly 4% of the nation's total electricity, a figure that is projected to triple by 2030, reaching 12% or more. [2] For AI-driven hyperscale facilities, the numbers are even more staggering. With the largest planned data centers requiring gigawatts of power, enough to supply entire cities, the cumulative demand from all data centers is expected to reach 134 gigawatts by 2030, nearly three times the current load. [3]

This presents a systemic challenge. The U.S. power grid, built for a different era, is struggling to keep pace. Utilities are reporting record interconnection requests, with some regions seeing demand projections that exceed their total system capacity by fivefold. [4] In Virginia and Texas, the epicenters of data center expansion, grid operators are warning of tight supply-demand balances and the risk of blackouts during peak periods. [5]

The problem is not just the sheer volume of power needed, but the speed at which it must be delivered. Data center operators are racing to secure power for projects that could be online in as little as 18 months, but grid upgrades and new generation can take years, if not decades. The result

Read More »

The Future of Hyperscale: Neoverse Joins NVLink Fusion as SC25 Accelerates Rack-Scale AI Architectures

Neoverse's Expanding Footprint and the Power-Efficiency Imperative

With Neoverse deployments now approaching roughly 50% of all compute shipped into top hyperscalers in 2025 (representing more than a billion Arm cores), and with nation-scale AI campuses such as the Stargate project already anchored on Arm compute, the addition of NVLink Fusion becomes a pivotal extension of the Neoverse roadmap. Partners can now connect custom Arm CPUs to their preferred NVIDIA accelerators across a coherent, high-bandwidth, rack-scale fabric.

Arm characterized the shift as a generational inflection point in data-center architecture, noting that "power—not FLOPs—is the bottleneck," and that future design priorities hinge on maximizing "intelligence per watt." Ian Buck, vice president and general manager of accelerated computing at NVIDIA, underscored the practical impact: "Folks building their own Arm CPU, or using an Arm IP, can actually have access to NVLink Fusion—be able to connect that Arm CPU to an NVIDIA GPU or to the rest of the NVLink ecosystem—and that's happening at the racks and scale-up infrastructure."

Despite the expanded design flexibility, this is not being positioned as an open interconnect ecosystem. NVIDIA continues to control the NVLink Fusion fabric, and all connections ultimately run through NVIDIA's architecture.

For data-center planners, the SC25 announcement translates into several concrete implications:

1. NVIDIA "Grace-style" Racks Without Buying Grace

With NVLink Fusion now baked into Neoverse, hyperscalers and sovereign operators can design their own Arm-based control-plane or pre-processing CPUs that attach coherently to NVIDIA GPU domains, such as NVL72 racks or HGX B200/B300 systems, without relying on Grace CPUs. A rack-level architecture might now resemble:

- Custom Neoverse SoC for ingest, orchestration, agent logic, and pre/post-processing
- NVLink Fusion fabric
- Blackwell GPU islands and/or NVLink-attached custom accelerators (Marvell, MediaTek, others)

This decouples CPU choice from NVIDIA's GPU roadmap while retaining the full NVLink fabric. In practice, it also opens

Read More »

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn't the only one ramping up its investments in AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs).

In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple to devote $200 billion between them to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft's capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith's claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are way higher than Microsoft's 2020 capital expenditure of "just" $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads.

Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

Read More »

John Deere unveils more autonomous farm machines to address skilled labor shortage

Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction, and commercial landscaping. The Moline, Illinois-based John Deere has been in business for 187 years, yet it's been a regular as a non-tech company showing off technology at the big tech trade show in Las Vegas, and it is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech.

The message from the company is that there aren't enough skilled farm laborers to do the work that its customers need. It's been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually, and the agricultural workforce continues to shrink. (This is my hint to the anti-immigration crowd.)

John Deere's autonomous 9RX tractor. Farmers can oversee it using an app.

While each of these industries experiences its own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can't find labor to fill open positions, he said. "They have to figure out how to do

Read More »

2025 playbook for enterprise AI success, from agents to evals

2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. That makes it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize in their AI strategy this year.

1. Agents: the next generation of automation

AI agents are no longer theoretical. In 2025, they're indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved.

"Let me put it this way," said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for businesses and recently reviewed the 48 agents it built last year. "Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better." Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail.

Models are getting better and hallucinating less, and they're also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we'll cover below), companies can use three or more models to

Read More »

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams' advanced capabilities in two areas: multi-step reinforcement learning and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability, and safety of AI models with these two techniques and more.

The first paper, "OpenAI's Approach to External Red Teaming for AI Models and Systems," reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model, because in-house testing techniques may have missed them. In the second paper, "Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning," OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks.

Going all-in on red teaming pays practical, competitive dividends

It's encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, and OpenAI, as well as the U.S. National Institute of Standards and Technology (NIST), all of which had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization.

OpenAI's paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see whether knowledgeable external teams can defeat models' security perimeters and find gaps in their security, biases, and controls that prompt-based testing couldn't find.
What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle

Read More »