Learnings from a Machine Learning Engineer — Part 3: The Evaluation

In this third part of my series, I will explore the evaluation process, a critical piece that leads to a cleaner data set and elevates your model performance. We will see the difference between evaluation of a trained model (one not yet in production) and evaluation of a deployed model (one making real-world predictions).

In Part 1, I discussed the process of labelling your image data that you use in your Image Classification project. I showed how to define “good” images and create sub-classes. In Part 2, I went over various data sets, beyond the usual train-validation-test sets, such as benchmark sets, plus how to handle synthetic data and duplicate images.

Evaluation of the trained model

As machine learning engineers, we look at accuracy, F1, log loss, and other metrics to decide if a model is ready to move to production. These are all important measures, but in my experience these scores can be deceiving, especially as the number of classes grows.

Although it can be time-consuming, I find it very important to manually review the images that the model gets wrong, as well as the images that the model gives a low softmax “confidence” score to. This means adding a step immediately after your training run completes to calculate scores for all images: training, validation, test, and the benchmark sets. You only need to bring up for manual review the ones that the model had problems with. This should only be a small percentage of the total number of images. See the Double-check process below.
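One way this post-training step could look is sketched below. It assumes each scored image is available as a record holding the ground-truth label, the predicted label, and the top softmax score. The `ScoredImage` record, the `select_for_review` helper, and the 0.95 cutoff for a “low” score are illustrative names and values chosen for this sketch.

```python
from typing import NamedTuple


class ScoredImage(NamedTuple):
    image_id: str
    truth: str         # ground-truth label
    predicted: str     # class with the highest softmax score
    confidence: float  # that highest softmax score, in [0, 1]


def select_for_review(scored, low_confidence=0.95):
    """Return images the model got wrong or scored below the cutoff,
    least confident first. The cutoff value is an assumption."""
    flagged = [s for s in scored
               if s.predicted != s.truth or s.confidence < low_confidence]
    return sorted(flagged, key=lambda s: s.confidence)


# Hypothetical records, for illustration only.
scored = [
    ScoredImage("img_001", "giraffe", "giraffe", 0.99),  # correct and confident: skipped
    ScoredImage("img_002", "zebra", "zebra", 0.62),      # correct but unsure: reviewed
    ScoredImage("img_003", "african_elephant", "asian_elephant", 0.97),  # wrong: reviewed
]
review_queue = select_for_review(scored)
```

Sorting by confidence puts the images the model was least sure about at the front of the review queue.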

During the manual evaluation, put yourself in a “training mindset” to ensure that the labelling standards you set up in Part 1 are being followed. Ask yourself:

  • “Is this a good image?” Is the subject front and center, and can you clearly see all the features?
  • “Is this the correct label?” Don’t be surprised if you find wrong labels.

You can either remove the bad images or fix the labels if they are wrong. Otherwise you can keep them in the data set and force the model to do better next time. Other questions I ask are:

  • “Why did the model get this wrong?”
  • “Why did this image get a low score?”
  • “What is it about the image that caused confusion?”

Sometimes the answer has nothing to do with that specific image. Frequently, it has to do with the other images, either in the ground truth class or in the predicted class. It is worth the effort to Double-check all images in both sets if you see a consistently bad guess. Again, don’t be surprised if you find poor images or wrong labels.

Weighted evaluation

When doing the evaluation of the trained model (above), we apply a lot of subjective analysis — “Why did the model get this wrong?” and “Is this a good image?” From these, you may only get a gut feeling.

Frequently, I will decide to hold off moving a model forward to production based on that gut feel. But how can you justify to your manager that you want to hit the brakes? This is where a more objective analysis comes in: creating a weighted average of the softmax “confidence” scores.

In order to apply a weighted evaluation, we need to identify sets of classes that deserve adjustments to the score. Here is where I create a list of “commonly confused” classes.

Commonly confused classes

Certain animals at our zoo can easily be mistaken for one another. For example, African elephants and Asian elephants have different ear shapes. If your model gets these two mixed up, that is not as bad as guessing a giraffe! So perhaps you give partial credit here. You and your subject matter experts (SMEs) can come up with a list of these pairs and a weighted adjustment for each.

Photo by Matt Bango on Unsplash
Photo by Mathew Krizmanich on Unsplash

This weight can be factored into a modified cross-entropy loss function in the equation below. The back half of this equation will reduce the impact of being wrong for specific pairs of ground truth and prediction by using the “weight” function as a lookup. By default, the weighted adjustment would be 1 for all pairings, and the commonly confused classes would get something like 0.5.

In other words, it’s better to be unsure (have a lower confidence score) when you are wrong, compared to being super confident and wrong.

Modified cross-entropy loss function, image by author

Once this weighted log loss is calculated, I can compare to previous training runs to see if the new model is ready for production.
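As a rough sketch of how such a weighted log loss might be computed, the code below uses one possible form of the idea. It is an assumption, not necessarily the exact equation pictured: a standard cross-entropy term on the ground-truth class, plus a penalty on the confidence placed in a wrong prediction, scaled by a lookup weight. The function name `weighted_log_loss`, the `pair_weights` dictionary, and the sample scores are all hypothetical.

```python
import math


def weighted_log_loss(samples, pair_weights, default_weight=1.0, eps=1e-12):
    """samples: list of (truth, probs), where probs maps class name -> softmax score.
    pair_weights: {(truth, predicted): weight} for commonly confused pairs.
    This formula is an assumed form, written for illustration."""
    total = 0.0
    for truth, probs in samples:
        predicted = max(probs, key=probs.get)
        # Front half: standard cross-entropy on the ground-truth class.
        loss = -math.log(max(probs[truth], eps))
        if predicted != truth:
            # Back half: penalty that grows with confidence in the wrong class,
            # scaled by the lookup weight for this (truth, predicted) pair.
            weight = pair_weights.get((truth, predicted), default_weight)
            loss += -weight * math.log(max(1.0 - probs[predicted], eps))
        total += loss
    return total / len(samples)


# Commonly confused pairs get partial credit (0.5); everything else defaults to 1.
pair_weights = {
    ("african_elephant", "asian_elephant"): 0.5,
    ("asian_elephant", "african_elephant"): 0.5,
}
confused = [("african_elephant",
             {"african_elephant": 0.2, "asian_elephant": 0.7, "giraffe": 0.1})]
far_off = [("african_elephant",
            {"african_elephant": 0.2, "asian_elephant": 0.1, "giraffe": 0.7})]
loss_confused = weighted_log_loss(confused, pair_weights)
loss_far_off = weighted_log_loss(far_off, pair_weights)
```

With these made-up scores, mistaking an African elephant for an Asian elephant is penalized less than mistaking it for a giraffe, and a less confident wrong guess is penalized less than a very confident one.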

Confidence threshold report

Another valuable measure that incorporates the confidence threshold (in my example, 95) is to report on accuracy and false positive rates. Recall that when we apply the confidence threshold before presenting results, we help reduce false positives from being shown to the end user.

In this table, we look at the breakdown of “true positive above 95” for each data set. We get a sense that when a “good” picture comes through (like the ones from our train-validation-test set) it is very likely to surpass the threshold, thus the user is “happy” with the outcome. Conversely, the “false positive above 95” is extremely low for good pictures, thus only a small number of our users will be “sad” about the results.

Example Confidence Threshold Report, image by author
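A report like this could be computed along the following lines. This is a minimal sketch that assumes the threshold of 95 corresponds to a softmax score of 0.95, and that out-of-scope images are stored with no ground-truth label. The `Prediction` record and the `threshold_report` function are hypothetical names, and the sample predictions are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    dataset: str           # e.g. "test", "difficult", "out-of-scope"
    truth: Optional[str]   # None for out-of-scope images
    predicted: str
    confidence: float      # top softmax score in [0, 1]


def threshold_report(preds, threshold=0.95):
    """Return {dataset: (true_positive_rate, false_positive_rate)},
    counting only predictions at or above the threshold."""
    counts = {}
    for p in preds:
        total, tp, fp = counts.get(p.dataset, (0, 0, 0))
        if p.confidence >= threshold:
            if p.truth is not None and p.predicted == p.truth:
                tp += 1   # shown to the user and correct: "happy"
            else:
                fp += 1   # shown to the user and wrong: "sad"
        counts[p.dataset] = (total + 1, tp, fp)
    return {name: (tp / total, fp / total)
            for name, (total, tp, fp) in counts.items()}


# Invented predictions, for illustration only.
preds = [
    Prediction("test", "zebra", "zebra", 0.99),
    Prediction("test", "zebra", "zebra", 0.98),
    Prediction("test", "lion", "lion", 0.97),
    Prediction("test", "lion", "tiger", 0.60),           # wrong, but below threshold
    Prediction("out-of-scope", None, "giraffe", 0.96),   # confident false positive
    Prediction("out-of-scope", None, "zebra", 0.40),
]
report = threshold_report(preds)
```

Both rates are taken as a fraction of all images in the data set, so predictions below the threshold count toward neither.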

We expect the train-validation-test set results to be exceptional since our data is curated. So, as long as people take “good” pictures, the model should do very well. But to get a sense of how it does in extreme situations, let’s take a look at our benchmarks.

The “difficult” benchmark has more modest true positive and false positive rates, which reflects the fact that the images are more challenging. These values are much easier to compare across training runs, so that lets me set a min/max target. So for example, if I target a minimum of 80% for true positive, and maximum of 5% for false positive on this benchmark, then I can feel confident moving this to production.

The “out-of-scope” benchmark has no true positive rate because none of the images belong to any class the model can identify. Remember, we picked things like a bag of popcorn, etc., that are not zoo animals, so there cannot be any true positives. But we do get a false positive rate, which means the model gave a confident score to that bag of popcorn as some animal. And if we set a target maximum of 10% for this benchmark and the model exceeds it, then we may not want to move it to production.
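The min/max targets can be turned into a simple go/no-go check. The sketch below uses the example targets from the text (at least 80% true positive and at most 5% false positive on the “difficult” benchmark, at most 10% false positive on “out-of-scope”). The `check_targets` helper and the report numbers are made up for illustration.

```python
def check_targets(report, targets):
    """Return a list of target violations; an empty list means "go".

    report:  {benchmark: (true_positive_rate, false_positive_rate)}
    targets: {benchmark: (min_true_positive or None, max_false_positive)}
    """
    failures = []
    for name, (min_tp, max_fp) in targets.items():
        tp_rate, fp_rate = report[name]
        if min_tp is not None and tp_rate < min_tp:
            failures.append(
                f"{name}: true positive {tp_rate:.0%} is below {min_tp:.0%}")
        if fp_rate > max_fp:
            failures.append(
                f"{name}: false positive {fp_rate:.0%} is above {max_fp:.0%}")
    return failures


targets = {"difficult": (0.80, 0.05), "out-of-scope": (None, 0.10)}
# Hypothetical results from a training run.
report = {"difficult": (0.84, 0.03), "out-of-scope": (0.0, 0.14)}
failures = check_targets(report, targets)
```

Here the model passes on the “difficult” benchmark but misses the out-of-scope target, which gives you something concrete to show when you want to hit the brakes.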

Photo by Linus Mimietz on Unsplash

Right now, you may be thinking, “Well, what animal did it pick for the bag of popcorn?” Excellent question! Now you understand the importance of doing a manual review of the images that get bad results.

Evaluation of the deployed model

The evaluation that I described above applies to a model immediately after training. Now, you want to evaluate how your model is doing in the real world. The process is similar, but requires you to shift to a “production mindset” and ask yourself, “Did the model get this correct?”, “Should it have gotten this correct?”, and “Did we tell the user the right thing?”

So, imagine that you are logging in for the morning — after sipping on your cold brew coffee, of course — and are presented with 500 images that your zoo guests took yesterday of different animals. Your job is to determine how satisfied the guests were using your model to identify the zoo animals.

Using the softmax “confidence” score for each image, we have a threshold before presenting results. Above the threshold, we tell the guest what the model predicted. I’ll call this the “happy path”. And below the threshold is the “sad path” where we ask them to try again.
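Splitting a day's predictions into the two review queues might look like the sketch below. It assumes each logged prediction is a tuple of image id, predicted label, and top softmax score, and that the threshold is 0.95. The `split_paths` name and the sample data are hypothetical.

```python
def split_paths(predictions, threshold=0.95):
    """predictions: iterable of (image_id, predicted_label, confidence).
    Returns (happy, sad): results shown to the guest vs. "try again" cases."""
    happy, sad = [], []
    for image_id, label, confidence in predictions:
        queue = happy if confidence >= threshold else sad
        queue.append((image_id, label, confidence))
    return happy, sad


# Invented log entries, for illustration only.
yesterday = [
    ("guest_001.jpg", "giraffe", 0.99),
    ("guest_002.jpg", "zebra", 0.41),
    ("guest_003.jpg", "asian_elephant", 0.96),
]
happy_path, sad_path = split_paths(yesterday)
```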

Your review interface will first show you all the “happy path” images one at a time. This is where you ask yourself, “Did we get this right?” Hopefully, yes!

But if not, this is where things get tricky. So now you have to ask, “Why not?” Here are some things that it could be:

  • “Bad” picture — Poor lighting, bad angle, zoomed out, etc — refer to your labelling standards.
  • Out-of-scope — It’s a zoo animal, but unfortunately one that isn’t found in this zoo. Maybe it belongs to another zoo (your guest likes to travel and try out your app). Consider adding these to your data set.
  • Out-of-scope — It’s not a zoo animal. It could be an animal in your zoo, but not one typically contained there, like a neighborhood sparrow or mallard duck. This might be a candidate to add.
  • Out-of-scope — It’s something found in the zoo. A zoo usually has interesting trees and shrubs, so people might try to identify those. Another candidate to add.
  • Prankster — Completely out-of-scope. Because people like to play with technology, there’s the possibility you have a prankster that took a picture of a bag of popcorn, or a soft drink cup, or even a selfie. These are hard to prevent, but hopefully get a low enough score (below the threshold) so the model did not identify it as a zoo animal. If you see enough pattern in these, consider creating a class with special handling on the front-end.

After reviewing the “happy path” images, you move on to the “sad path” images, the ones that got a low confidence score and for which the app gave a “sorry, try again” message. This time you ask yourself, “Should the model have given this image a higher score?” which would have put it in the “happy path”. If so, then you want to ensure these images are added to the training set so next time it will do better. But most of the time, the low score reflects many of the “bad” or out-of-scope situations mentioned above.

Perhaps your model performance is suffering and it has nothing to do with your model. Maybe it is the way your users are interacting with the app. Keep an eye out for non-technical problems and share your observations with the rest of your team. For example:

  • Are your users using the application in the ways you expected?
  • Are they not following the instructions?
  • Do the instructions need to be stated more clearly?
  • Is there anything you can do to improve the experience?

Collect statistics and new images

Both of the manual evaluations above open a gold mine of data. So, be sure to collect these statistics and feed them into a dashboard — your manager and your future self will thank you!

Photo by Justin Morgan on Unsplash

Keep track of these stats and generate reports that you and your team can reference:

  • How often is the model being called?
  • What times of the day, what days of the week is it used?
  • Are your system resources able to handle the peak load?
  • What classes are the most common?
  • After evaluation, what is the accuracy for each class?
  • What is the breakdown for confidence scores?
  • How many scores are above and below the confidence threshold?
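Several of these statistics can be derived from a simple prediction log. The sketch below assumes each log entry holds a timestamp, the predicted class, the confidence score, and the reviewer's verdict (or `None` if the image has not been reviewed yet). The `summarize` function and the log layout are illustrative assumptions.

```python
from collections import Counter
from datetime import datetime


def summarize(log, threshold=0.95):
    """log: list of (timestamp, predicted_class, confidence, correct), where
    `correct` is the reviewer's verdict (True/False) or None if not reviewed."""
    reviewed, right = Counter(), Counter()
    for _, cls, _, correct in log:
        if correct is not None:
            reviewed[cls] += 1
            right[cls] += int(correct)
    above = sum(1 for _, _, conf, _ in log if conf >= threshold)
    return {
        "calls": len(log),
        "calls_by_hour": Counter(ts.hour for ts, _, _, _ in log),
        "calls_by_weekday": Counter(ts.weekday() for ts, _, _, _ in log),  # 0 = Monday
        "calls_by_class": Counter(cls for _, cls, _, _ in log),
        "accuracy_by_class": {cls: right[cls] / n for cls, n in reviewed.items()},
        "above_threshold": above,
        "below_threshold": len(log) - above,
    }


# Invented log entries, for illustration only.
log = [
    (datetime(2025, 6, 14, 10, 5), "giraffe", 0.99, True),
    (datetime(2025, 6, 14, 10, 40), "giraffe", 0.97, False),
    (datetime(2025, 6, 14, 13, 15), "zebra", 0.52, None),
]
stats = summarize(log)
```

The resulting dictionary is easy to push to whatever dashboard tool your team already uses.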

The single best thing you get from a deployed model is the additional real-world images! You can add these new images to improve coverage of your existing zoo animals. But more importantly, they give you insight into other classes to add. For example, let’s say people enjoy taking a picture of the large walrus statue at the gate. Some of these may make sense to incorporate into your data set to provide a better user experience.

Creating a new class, like the walrus statue, is not a huge effort, and it avoids the false positive responses. It would be more embarrassing to identify a walrus statue as an elephant! As for the prankster and the bag of popcorn, you can configure your front-end to quietly handle these. You might even get creative and have fun with it like, “Thank you for visiting the food court.”

Double-check process

It is a good idea to double-check your image set when you suspect there may be problems with your data. I’m not suggesting a top-to-bottom check, because that would be a monumental effort! Rather, focus on specific classes that you suspect contain bad data that is degrading your model performance.

Immediately after my training run completes, I have a script that will use this new model to generate predictions for my entire data set. When this is complete, it will take the list of incorrect identifications, as well as the low scoring predictions, and automatically feed that list into the Double-check interface.

This interface will show, one at a time, the image in question, alongside an example image of the ground truth and an example image of what the model predicted. I can visually compare the three, side-by-side. The first thing I do is ensure the original image is a “good” picture, following my labelling standards. Then I check if the ground-truth label is indeed correct, or if there is something that made the model think it was the predicted label.

At this point I can:

  • Remove the original image if the image quality is poor.
  • Relabel the image if it belongs in a different class.

During this manual evaluation, you might notice dozens of the same wrong prediction. Ask yourself why the model made this mistake when the images seem perfectly fine. The answer may be some incorrect labels on images in the ground truth, or even in the predicted class!

Don’t hesitate to add those classes and sub-classes back into the Double-check interface and step through them all. You may have 100–200 pictures to review, but there is a good chance that one or two of the images will stand out as being the culprit.
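To spot those repeated wrong predictions without relying on memory, you can count the (ground truth, predicted) pairs among the model's mistakes. This is a small sketch using Python's standard `collections.Counter`. The `top_confusions` helper and the sample results are hypothetical.

```python
from collections import Counter


def top_confusions(results, n=5):
    """results: iterable of (truth, predicted) labels.
    Returns the n most frequent wrong (truth, predicted) pairs with counts."""
    wrong = Counter((truth, predicted)
                    for truth, predicted in results if truth != predicted)
    return wrong.most_common(n)


# Invented results, for illustration only.
results = [
    ("african_elephant", "asian_elephant"),
    ("african_elephant", "asian_elephant"),
    ("african_elephant", "african_elephant"),
    ("zebra", "giraffe"),
]
confusions = top_confusions(results)
```

The pairs at the top of this list name the ground truth and predicted classes worth stepping through in the Double-check interface.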

Up next…

With a different mindset for a trained model versus a deployed model, we can now evaluate performance to decide which models are ready for production, and how well a production model is going to serve the public. This relies on a solid Double-check process and a critical eye on your data. And beyond the “gut feel” of your model, we can rely on the benchmark scores to support us.

In Part 4, we kick off the training run, but there are some subtle techniques to get the most out of the process and even ways to leverage throw-away models to expand your library of image data.

Shape
Shape
Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy,  bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

Shape

Yea or nay: Will Nvidia H200 chips go to China?

He noted, “the broader implications and potential impacts may signal to enterprise customers of Nvidia that perhaps they don’t need the latest and greatest GPUs from [them] either to achieve acceptable results across select AI workloads. It is doubtful that Nvidia would commission additional production issues for H200 without China

Read More »

Chinese AI firm trains state-of-the-art model entirely on Huawei chips

The pricing positions GLM-Image as a cost-effective option for enterprises generating marketing materials, presentations, and other text-heavy visual content at scale. Technical approach and benchmark performance GLM-Image employs a hybrid architecture combining a 9-billion-parameter autoregressive model with a 7-billion-parameter diffusion decoder, according to Zhipu’s technical report. The autoregressive component handles

Read More »

BPCL lets contracts for expansions at Bina, Mumbai refineries

Bharat Petroleum Corp. Ltd. (BPCL) has awarded separate contracts to Technip Energies NV for delivery of major works on key projects designed to support expanded production of petrochemicals at two of the operator’s Indian refineries. Under a first contract revealed on Jan. 7, Technip Energies said it will provide engineering,

Read More »

Oil Slips After Trump Signals Iran De-Escalation

Oil fell after the close, wiping out a day of gains after US President Donald Trump said he had been assured that Iran would stop killing protesters, signaling he could hold off on a threatened military response to the repression of widespread demonstrations in the nation. West Texas Intermediate was down as much as 3% after settlement on Wednesday, dropping to around $59 a barrel in a rapid reversal before paring some of those losses. Prices had settled on Wednesday at $62.02. Oil had gained in each of the last five sessions as traders awaited the US response to political upheaval in Iran, with the US moving military staff and Tehran warning neighboring countries against assisting an attack. Concerns about a disruption to Iran’s approximately 3.3 million barrel-per-day production and key shipping lanes had helped push prices to their highest since October. But prices fell sharply after US President Donald Trump told reporters in the Oval Office Wednesday, “we’ve been told that the killing in Iran is stopping – it’s stopped.” The comments lessened expectations of an immediate US military response to the demonstrations against the government of Supreme Leader Ayatollah Ali Khamenei. Trump said he would be “very upset” if the information proved untrue and the violent crackdown continued. Oil has pushed higher in the new year as turmoil in OPEC’s fourth-largest producer, along with upheaval in Venezuela, restored a premium to prices following a run of five monthly losses spurred by expectations for a glut. The bumper rally in crude over recent days had caught off guard a market that had been steeped with bearish bets, while further boosts came from bullish options wagers, where volumes soared to a record this week, and an annual commodity index rebalancing that added inflows to crude markets. On the physical front,

Read More »

Shell, Exxon Pull Planned North Sea Gas Sale To Viaro

Shell Plc and Exxon Mobil Corp. canceled a proposed deal to sell natural gas assets in the North Sea to upstart firm Viaro Energy. Shell said in a statement that the oil majors couldn’t complete the transaction to sell the strategic Bacton onshore gas terminal and 11 offshore facilities to oil tycoon Francesco Mazzagatti’s Viaro. The ending of the transaction follows a protracted regulatory review by the North Sea Transition Authority, which said it had needed further information from Viaro before any decision. “The parties have worked hard and in close alignment to try and complete this transaction over many months, but despite this being a fully funded opportunity, the completion conditions were not met as commercial and market conditions evolved and we mutually agreed not to proceed,” Mazzagatti said Wednesday. When it announced the deal in the summer of 2024, Shell said the transaction was expected to complete in 2025. The NSTA, which was recently given new powers to oversee mergers and acquisitions in the North Sea, said the regulator was “waiting to receive the additional information requested from the purchasing party to make a decision.” The deal included the Bacton terminal on the east coast of England, a site of “strategic national importance,” according to Shell. It’s the sole entry point for gas from Belgium and the Netherlands, supplying as much as one-third of the UK’s gas supply. Mazzagatti, Viaro’s founder, is facing criminal charges in Italy and civil forgery and fraud allegations in the UK. He denies all allegations made against him.  The halt to the deal has paused an acquisition streak that made Viaro the most prolific buyer of UK oil and gas assets over the past five years, according to data compiled by Bloomberg. The decision also follows a London Court of Appeal ruling over

Read More »

New York Gov. Hochul expands nuclear aspirations to 8-GW fleet

Listen to the article 5 min This audio is auto-generated. Please let us know if you have feedback. New York will target development of 5 GW of new nuclear power, vastly expanding on a 1-GW goal set last June, Gov. Kathy Hochul announced Tuesday in her State of the State address. Hochul’s speech was short on details regarding her nuclear aspirations. The state’s Climate Leadership and Community Protection Act requires New York to achieve a 100% zero-emission electricity system by 2040. And last month, the New York State Energy Planning Board adopted a new state energy plan that cast nuclear energy as key to New York’s reliability and decarbonization goals.  New York’s recently-adopted state energy plan offers “broad program and policy development direction.” Retrieved from New York State Energy Planning Board. The plan, which offers “broad” guidance rather than binding targets, also highlighted some of the challenges facing nuclear. It described nuclear projects’ “long lead times and uncertain costs,” and noted the likely need for changes to zero emission credit programs and wholesale markets to balance concerns over capacity, reliability and ratepayer impacts. A policy book released alongside Hochul’s speech says the governor plans to direct state agencies to “establish a clear pathway for additional advanced nuclear generation to support grid reliability.” A nuclear reliability “backbone” will be developed through a new Department of Public Service process “to consider, review, and facilitate a cost-effective pathway to 4 gigawatts of new nuclear energy.” If successful, the buildout would bring New York’s total nuclear fleet to more than 8 GW. The state currently has three plants with four operating reactors totaling 3.4 GW of capacity, all owned by Constellation Energy. Nuclear power supplies about 21% of New York’s electricity. “Go big or go home,” the democratic governor said during her address, adding that

Read More »

Record Offshore Wind Auction Boosts UK Hopes for 2030 Goal

Britain stepped up support for offshore wind in the latest subsidy auction, showing the government is still determined to meet its ambitious 2030 clean-power goal even as costs rise. The 8.2 gigawatts of offshore wind beat analysts’ expectations and will boost the likelihood of the government delivering on its promise to almost totally exit fossil fuels in power generation. The UK now needs around 7 gigawatts of new capacity in the next auction, which is the last realistic chance to get projects built in time.   The government will pay developers more for projects won in this auction compared with last year, a cost that’s ultimately paid for by consumers. It creates a difficult balancing act for Prime Minister Keir Starmer, who has pledged to cut household bills during the current parliament.  “With these results, Britain is taking back control of our energy sovereignty,” said Energy Secretary Ed Miliband in a statement. The results deliver the biggest single procurement of offshore wind energy in British and European history, according to the statement. The auction secured capacity at a price of £65.45 ($88) per megawatt-hour in 2012 prices, a commonly used benchmark, or £91.20 in 2024 terms, accounting for some inflation. This price, higher than in last year’s auction, still represents a “net benefit to bills over the next decade,” according to analysis from Aurora Energy Research. RWE AG was the major winner, involved in all but one of the projects that won. Separately, RWE said it has agreed a deal with KKR & Co to develop, construct and operate the Norfolk Vanguard East and Norfolk Vanguard West projects, which were awarded contracts in the auction.  Another winner, RWE’s Dogger Bank South, doesn’t yet have planning permission, which means it may not be built in time to meet the 2030 goal. RWE’s

Read More »

GasBuddy Reveals 2026 USA Gasoline Price Forecast

In a report published recently, GasBuddy revealed its average U.S. gasoline price prediction for 2026. According to this report, the company expects the U.S. gasoline price to come in at $2.97 per gallon this year and sees December as the month with the lowest average U.S. gasoline price in 2026, at $2.83 per gallon. GasBuddy expects May to see the highest average U.S. gasoline price in 2026, at $3.12 per gallon, the report outlined. “GasBuddy’s forecast projects the national average price of gas to fall to $2.97 per gallon in 2026, marking the fourth consecutive annual decline and the lowest average since 2020,” Patrick De Haan, Head of Petroleum Analysis at GasBuddy, said in the report. “This continued decrease reflects the unwinding of post-pandemic market distortions, expanding global refining capacity, and more stable supply chains,” he added. “While the drop from 2025 is modest compared to previous years, it underscores a meaningful shift toward greater overall market stability,” he continued. A statement accompanying the release of the report posted on GasBuddy’s website highlighted that the U.S. gasoline price averaged $3.102 per gallon in 2025. GasBuddy also pointed out in that statement that it is forecasting the yearly U.S. average price of gasoline to fall back below $3 per gallon for the first time since the Covid-19 pandemic. In the report, GasBuddy projected that average household spending on gasoline will come in at $2,083 in 2026. That’s the lowest figure since 2021, which saw an average household gasoline spend of $1,979, the report showed.  De Haan went on to project in the report that gasoline prices “are expected to follow a traditional seasonal pattern in 2026, with imbalances left behind by Covid and geopolitical tensions balanced for the time being”. “The national average is projected to briefly rise into the low

Read More »

DOE, NASA Advance Partnership to Enable Nuclear Power on Moon

The United States Department of Energy (DOE) and the National Aeronautics and Space Administration (NASA) on Tuesday announced a memorandum of understanding (MOU) renewing their commitment to developing a lunar power system using fission by 2030. The collaboration aims to enable sustained NASA missions on the Moon – though radioisotope systems have already powered long-term U.S. space missions for decades according to DOE. “Thanks to President Trump’s leadership and his America First Space Policy, the department is proud to work with NASA and the commercial space industry on what will be one of the greatest technical achievements in the history of nuclear energy and space exploration”, Energy Secretary Chris Wright declared. The agencies eye deploying a surface power system able to operate for years without refueling. “The deployment of a lunar surface reactor will enable future sustained lunar missions by providing continuous and abundant power, regardless of sunlight or temperature”, DOE and NASA said. “Under President Trump’s national space policy, America is committed to returning to the Moon, building the infrastructure to stay and making the investments required for the next giant leap to Mars and beyond”, said NASA Administrator Jared Isaacman. “Achieving this future requires harnessing nuclear power.  “This agreement enables closer collaboration between NASA and the Department of Energy to deliver the capabilities necessary to usher in the golden age of space exploration and discovery”. Westinghouse Contract Before the MOU, DOE and NASA had already contracted Westinghouse Electric Co LLC to develop a space microreactor design under the agencies’ Fission Surface Power (FSP) project. 
On January 7, 2025, Pennsylvania-based Westinghouse announced a new contract that builds on “the successful design work Westinghouse completed during phase 1 to optimize its contributions to the design of FSP systems and their configuration, and begin testing of critical technology elements”. “The continued progress

Read More »

Cisco’s 2026 agenda prioritizes AI-ready infrastructure, connectivity

While most of the demand for AI data center capacity today comes from hyperscalers and neocloud providers, that will change as enterprise customers delve more into the AI networking world. “The other ecosystem members and enterprises themselves are becoming responsible for an increasing proportion of the AI infrastructure buildout as inferencing and agentic AI, sovereign cloud, and edge AI become more mainstream,” Katz wrote. More enterprises will move to host AI on premises via the introduction of AI agents that are designed to inject intelligent insight into applications and help improve operations. That’s where the AI impact on enterprise network traffic will appear, suggests Nolle. “Enterprises need to host AI to create AI network impact. Just accessing it doesn’t do much to traffic. Having cloud agents access local data center resources (RAG etc.) creates a governance issue for most corporate data, so that won’t go too far either,” Nolle said.  “Enterprises are looking at AI agents, not the way hyperscalers tout agentic AI, but agents running on small models, often open-source, and are locally hosted. This is where real AI traffic will develop, and Cisco could be vulnerable if they don’t understand this point and at least raise it in dialogs where AI hosting comes up,” Nolle said. “I don’t expect they’d go too far, because the real market for enterprise AI networking is probably a couple years out.” Meanwhile, observers expect Cisco to continue bolstering AI networking capabilities for enterprise branch, campus and data centers as well as hyperscalers, including through optical support and other gear.

Read More »

Microsoft tells communities it will ‘pay its way’ as AI data center resource usage sparks backlash

It will work with utilities and public commissions to set the rates it pays high enough to cover data center electricity costs (including build-outs, additions, and active use). “Our goal is straightforward: To ensure that the electricity cost of serving our data centers is not passed on to residential customers,” Smith emphasized. For example, the company is supporting a new rate structure Wisconsin that would charge a class of “very large customers,” including data centers, the true cost of the electricity required to serve them. It will collaborate “early, closely, and transparently” with local utilities to add electricity and supporting infrastructure to existing grids when needed. For instance, Microsoft has contracted with the Midcontinent Independent System Operator (MISO) to add 7.9GW of new electricity generation to the grid, “more than double our current consumption,” Smith noted. It will pursue ways to make data centers more efficient. For example, it is already experimenting with AI to improve planning, extract more electricity from existing infrastructure, improve system resilience, and speed development of new infrastructure and technologies (like nuclear energy). It will advocate for state and national public policies that ensure electricity access that is affordable, reliable, and sustainable in neighboring communities. Microsoft previously established priorities for electricity policy advocacy, Smith noted, but “progress has been uneven. This needs to change.” Microsoft is similarly committed when it comes to data center water use, promising four actions: Reducing the overall amount of water its data centers use, initially improving it by 40% by 2030. The company is exploring innovations in cooling, including closed-loop systems that recirculate cooling liquids. It will collaborate with local utilities to map out water, wastewater, and pressure needs, and will “fully fund” infrastructure required for growth. 
For instance, in Quincy, Washington, Microsoft helped construct a water reuse utility that recirculates

Read More »

Can retired naval power plants solve the data center power crunch?

HGP’s plan includes a revenue share with the government, and the company would create a decommissioning fund, according to Bloomberg. The alternative? After a lengthy decommissioning process, the reactors are shipped to a remote storage facility in Washington state together dust along with dozens of other retired nuclear reactors. So the carrier itself isn’t going to be turned into a data center, but its power plants are being proposed for a data center on land. And even with the lengthening decommissioning process, that’s still faster than building a nuclear power plant from scratch. Don’t hold your breath, says Kristen Vosmaer, managing director, JLL Work Dynamics Data Center team. The idea of converting USS Nimitz’s nuclear reactors to power AI data centers sounds compelling but faces insurmountable obstacles, he argues. “Naval reactors use weapons-grade uranium that civilian entities cannot legally possess, and the Nuclear Regulatory Commission has no pathway to license such facilities. Even setting aside the fuel issue, these military-designed systems would require complete reconstruction to meet civilian safety standards, eliminating any cost advantages over purpose-built nuclear plants,” Vosmaer said. The maritime concept itself, however, does have some merit, said Vosmaer. “Ocean cooling can reduce energy consumption compared to land-based data centers, and floating platforms offer positioning flexibility that fixed facilities cannot match,” Vosmaer said.

What exactly is an AI factory?

Others, however, seem to use the word to mean something smaller than a data center, referring more to the servers, software, and other systems used to run AI. For example, the AWS AI Factory is a combination of hardware and software that runs on-premises but is managed by AWS and comes with AWS services such as Bedrock, networking, storage and databases, and security.  At Lenovo, AI factories appear to be packaged servers designed to be used for AI. “We’re looking at the architecture being a fixed number of racks, all working together as one design,” said Scott Tease, vice president and general manager of AI and high-performance computing at Lenovo’s infrastructure solutions group. That number of racks? Anything from a single rack to hundreds, he told Computerworld. Each rack is a little bigger than a refrigerator, comes fully assembled, and is often fully preconfigured for the customer’s use case. “Once it arrives at the customer site, we’ll have service personnel connect power and networking,” Tease said. For others, the AI factory concept is more about the software.

Meta establishes Meta Compute to lead AI infrastructure buildout

At that scale, infrastructure constraints are becoming a binding limit on AI expansion, influencing decisions like where new data centers can be built and how they are interconnected. The announcement follows Meta’s recent landmark agreements with Vistra, TerraPower, and Oklo aimed at supporting access to up to 6.6 gigawatts of nuclear energy to fuel its Ohio and Pennsylvania data center clusters.

Implications for hyperscale networking

Analysts say Meta’s approach indicates how hyperscalers are increasingly treating networking and interconnect strategy as first-order concerns in the AI race. Tulika Sheel, senior vice president at Kadence International, said that Meta’s initiative signals that hyperscale networking will need to evolve rapidly to handle massive internal data flows with high bandwidth and ultra-low latency. “As data centers grow in size and GPU density, pressure on networking and optical supply chains will intensify, driving demand for more advanced interconnects and faster fiber,” Sheel added.

Others pointed to the potential architectural shifts from this. “Meta is using Disaggregated Scheduled Fabric and Non-Scheduled Fabric, along with new 51 Tbps switches and Ethernet for Scale-Up Networking, which is intensifying pressure on switch silicon, optical modules, and open rack standards,” said Biswajeet Mahapatra, principal analyst at Forrester. “This shift is forcing the ecosystem to deliver faster optical interconnects and greater fiber capacity, as Meta targets significant backbone growth and more specialized short-reach and coherent optical technologies to support cluster expansion.” The network is no longer a secondary pipe but a primary constraint. Next-generation connectivity, Sheel said, is becoming as critical as access to compute itself, as hyperscalers look to avoid network bottlenecks in large-scale AI deployments.

AI, edge, and security: Shaping the need for modern infrastructure management

The rapidly evolving IT landscape, driven by artificial intelligence (AI), edge computing, and rising security threats, presents unprecedented challenges in managing compute infrastructure. Traditional management tools struggle to provide the necessary scalability, visibility, and automation to keep up with business demand, leading to inefficiencies and increased business risk. Yet organizations need their IT departments to be strategic business partners that enable innovation and drive growth. To realize that goal, IT leaders should rethink the status quo and free up their teams’ time by adopting a unified approach to managing infrastructure that supports both traditional and AI workloads. It’s a strategy that enables companies to simplify IT operations and improve IT job satisfaction.

5 IT management challenges of the AI era

Cisco recently commissioned Forrester Consulting to conduct a Total Economic Impact™ analysis of Cisco Intersight. This IT operations platform provides visibility, control, and automation capabilities for the Cisco Unified Computing System (Cisco UCS), including Cisco converged, hyperconverged, and AI-ready infrastructure solutions across data centers, colocation facilities, and edge environments. Intersight uses a unified policy-driven approach to infrastructure management and integrates with leading operating systems, storage providers, hypervisors, and third-party IT service management and security tools. The Forrester study first uncovered the issues IT groups are facing:

Difficulty scaling: Manual, repetitive processes cause lengthy IT compute infrastructure build and deployment times. This challenge is particularly acute for organizations that need to evolve infrastructure to support traditional and AI workloads across data centers and distributed edge environments.

Architectural specialization and AI workloads: AI is altering infrastructure requirements, Forrester found.
Companies design systems to support specific AI workloads — such as data preparation, model training, and inferencing — and each demands specialized compute, storage, and networking capabilities. Some require custom chip sets and purpose-built infrastructure, such as for edge computing and low-latency applications.

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one that is ramping up its investments in AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs). In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple to devote between them $200 billion to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are way higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

John Deere unveils more autonomous farm machines to address skill labor shortage

Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. The Moline, Illinois-based John Deere has been in business for 187 years, yet as a non-tech company it has been a regular at the big tech trade show in Las Vegas, showing off its technology, and it is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech. The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually, and the agricultural work force continues to shrink. (This is my hint to the anti-immigration crowd.)

John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app.

While each of these industries experiences its own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

2025 playbook for enterprise AI success, from agents to evals

2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year.

1. Agents: the next generation of automation

AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for companies, and that recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more. The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks.

Going all-in on red teaming pays practical, competitive dividends

It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S.’s National Institute of Standards and Technology (NIST), which all had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find.
What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle
