Just add humans: Oxford medical study underscores the missing link in chatbot testing

Headlines have been blaring it for years: Large language models (LLMs) can not only pass medical licensing exams but also outperform humans. GPT-4 could correctly answer U.S. medical licensing exam questions 90% of the time, even in the prehistoric AI days of 2023. Since then, LLMs have gone on to best both the residents taking those exams and licensed physicians.

Move over, Doctor Google, make way for ChatGPT, M.D. But you may want more than a diploma from the LLM you deploy for patients. Like an ace medical student who can rattle off the name of every bone in the hand but faints at the first sight of real blood, an LLM’s mastery of medicine does not always translate directly into the real world.

A paper by researchers at the University of Oxford found that while LLMs could correctly identify relevant conditions 94.9% of the time when directly presented with test scenarios, human participants using LLMs to diagnose the same scenarios identified the correct conditions less than 34.5% of the time.

Perhaps even more notably, patients using LLMs performed even worse than a control group that was merely instructed to diagnose themselves using “any methods they would typically employ at home.” The group left to their own devices was 76% more likely to identify the correct conditions than the group assisted by LLMs.

The Oxford study raises questions about the suitability of LLMs for medical advice and the benchmarks we use to evaluate chatbot deployments for various applications.

Guess your malady

Led by Dr. Adam Mahdi, researchers at Oxford recruited 1,298 participants to present themselves as patients to an LLM. They were tasked with figuring out both what ailed them and the appropriate level of care to seek for it, ranging from self-care to calling an ambulance.

Each participant received a detailed scenario, representing conditions from pneumonia to the common cold, along with general life details and medical history. For instance, one scenario describes a 20-year-old engineering student who develops a crippling headache on a night out with friends. It includes important medical details (it’s painful to look down) and red herrings (he’s a regular drinker, shares an apartment with six friends, and just finished some stressful exams).

The study tested three different LLMs. The researchers selected GPT-4o on account of its popularity, Llama 3 for its open weights and Command R+ for its retrieval-augmented generation (RAG) abilities, which allow it to search the open web for help.

Participants were asked to interact with the LLM at least once using the details provided, but could use it as many times as they wanted to arrive at their self-diagnosis and intended action.

Behind the scenes, a team of physicians unanimously decided on the “gold standard” conditions they sought in every scenario, and the corresponding course of action. Our engineering student, for example, is suffering from a subarachnoid haemorrhage, which should entail an immediate visit to the ER.

A game of telephone

While you might assume an LLM that can ace a medical exam would be the perfect tool to help ordinary people self-diagnose and figure out what to do, it didn’t work out that way. “Participants using an LLM identified relevant conditions less consistently than those in the control group, identifying at least one relevant condition in at most 34.5% of cases compared to 47.0% for the control,” the study states. Participants using an LLM also struggled to deduce the correct course of action, selecting it just 44.2% of the time, compared to 56.3% for an LLM acting independently.

What went wrong?

Looking back at transcripts, researchers found that participants provided incomplete information to the LLMs, and that the LLMs misinterpreted their prompts. For instance, one user who was supposed to exhibit symptoms of gallstones merely told the LLM: “I get severe stomach pains lasting up to an hour, It can make me vomit and seems to coincide with a takeaway,” omitting the location of the pain, the severity, and the frequency. Command R+ incorrectly suggested that the participant was experiencing indigestion, and the participant incorrectly guessed that condition.

Even when LLMs delivered the correct information, participants didn’t always follow their recommendations. The study found that 65.7% of GPT-4o conversations suggested at least one relevant condition for the scenario, yet less than 34.5% of participants’ final answers reflected those relevant conditions.
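To put those figures side by side, here is a minimal arithmetic sketch in Python. It uses only the percentages quoted in this article; the dictionary keys and function names are illustrative labels, not terms from the study. Because 34.5% is reported as an upper bound, each gap below is a lower bound, and the 65.7% figure applies to GPT-4o conversations only, so the comparison is rough.

```python
# Figures quoted in the article: percent of cases in which at least one
# relevant condition was identified. Key names are illustrative only.
# 34.5 is an upper bound ("at most 34.5%"), so every gap is a lower bound.
RATES = {
    "llm_alone": 94.9,
    "gpt4o_conversation": 65.7,  # GPT-4o conversations only
    "control_group": 47.0,
    "llm_assisted_humans": 34.5,
}


def point_gap(higher: float, lower: float) -> float:
    """Absolute difference in percentage points."""
    return round(higher - lower, 1)


def relative_drop(higher: float, lower: float) -> float:
    """Share of the higher rate that is lost, as a percentage."""
    return round(100 * (higher - lower) / higher, 1)


def summarize(rates=RATES):
    """Compare every other rate against the LLM-assisted human rate."""
    assisted = rates["llm_assisted_humans"]
    return {
        name: (point_gap(rate, assisted), relative_drop(rate, assisted))
        for name, rate in rates.items()
        if name != "llm_assisted_humans"
    }


if __name__ == "__main__":
    for name, (points, drop) in summarize().items():
        print(f"{name}: at least {points} points higher, {drop}% relative drop")
```

On these numbers, the LLM working alone sits at least 60.4 percentage points above the people it was supposed to help.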

The human variable

This study is useful, but not surprising, according to Nathalie Volkheimer, a user experience specialist at the Renaissance Computing Institute (RENCI), University of North Carolina at Chapel Hill.

“For those of us old enough to remember the early days of internet search, this is déjà vu,” she says. “As a tool, large language models require prompts to be written with a particular degree of quality, especially when expecting a quality output.”

She points out that someone experiencing blinding pain wouldn’t offer great prompts. Even though participants in a lab experiment weren’t experiencing the symptoms directly, they still weren’t relaying every detail.

“There is also a reason why clinicians who deal with patients on the front line are trained to ask questions in a certain way and a certain repetitiveness,” Volkheimer goes on. Patients omit information because they don’t know what’s relevant, or at worst, lie because they’re embarrassed or ashamed.

Can chatbots be better designed to address these issues? “I wouldn’t put the emphasis on the machinery here,” Volkheimer cautions. “I would consider the emphasis should be on the human-technology interaction.” The car, she analogizes, was built to get people from point A to B, but many other factors play a role. “It’s about the driver, the roads, the weather, and the general safety of the route. It isn’t just up to the machine.”

A better yardstick

The Oxford study highlights one problem, not with humans or even LLMs, but with the way we sometimes measure them—in a vacuum.

When we say an LLM can pass a medical licensing test, real estate licensing exam, or a state bar exam, we’re probing the depths of its knowledge base using tools designed to evaluate humans. However, these measures tell us very little about how successfully these chatbots will interact with humans.

“The prompts were textbook (as validated by the source and medical community), but life and people are not textbook,” explains Dr. Volkheimer.

Imagine an enterprise about to deploy a support chatbot trained on its internal knowledge base. One seemingly logical way to test that bot might simply be to have it take the same test the company uses for customer support trainees: answering prewritten “customer” support questions and selecting multiple-choice answers. An accuracy of 95% would certainly look pretty promising.

Then comes deployment: Real customers use vague terms, express frustration, or describe problems in unexpected ways. The LLM, benchmarked only on clear-cut questions, gets confused and provides incorrect or unhelpful answers. It hasn’t been trained or evaluated on de-escalating situations or seeking clarification effectively. Angry reviews pile up. The launch is a disaster, despite the LLM sailing through tests that seemed robust for its human counterparts.

This study serves as a critical reminder for AI engineers and orchestration specialists: if an LLM is designed to interact with humans, relying solely on non-interactive benchmarks can create a dangerous false sense of security about its real-world capabilities. If you’re designing an LLM to interact with humans, you need to test it with humans – not tests for humans. But is there a better way?

Using AI to test AI

The Oxford researchers recruited nearly 1,300 people for their study, but most enterprises don’t have a pool of test subjects sitting around waiting to play with a new LLM agent. So why not just substitute AI testers for human testers?

Mahdi and his team tried that, too, with simulated participants. “You are a patient,” they prompted an LLM, separate from the one that would provide the advice. “You have to self-assess your symptoms from the given case vignette and assistance from an AI model. Simplify terminology used in the given paragraph to layman language and keep your questions or statements reasonably short.” The LLM was also instructed not to use medical knowledge or generate new symptoms.

These simulated participants then chatted with the same LLMs the human participants used. But they performed much better. On average, simulated participants using the same LLM tools nailed the relevant conditions 60.7% of the time, compared to below 34.5% in humans.

In this case, it turns out LLMs play nicer with other LLMs than they do with humans, which makes simulated participants a poor predictor of real-life performance.
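For teams that still want to try simulated testers, the setup described above can be sketched roughly as follows. This is a hypothetical harness, not the Oxford team's code: the `Model` interface, `ADVISOR_PROMPT`, `run_dialogue`, `hit_rate` and the stub functions are all assumptions made for illustration, and the stubs stand in for real LLM calls. Only the patient prompt text is taken from the study as quoted in this article.

```python
from typing import Callable, List, Sequence, Tuple

Turn = Tuple[str, str]  # (speaker, text)
# Hypothetical model interface: (system_prompt, history) -> reply text.
Model = Callable[[str, List[Turn]], str]

# Patient instructions as quoted in the article.
PATIENT_PROMPT = (
    "You are a patient. You have to self-assess your symptoms from the given "
    "case vignette and assistance from an AI model. Simplify terminology used "
    "in the given paragraph to layman language and keep your questions or "
    "statements reasonably short."
)
# Assumed placeholder; the article does not quote an advisor prompt.
ADVISOR_PROMPT = "You are a helpful assistant."


def run_dialogue(patient: Model, advisor: Model, vignette: str,
                 max_turns: int = 3) -> List[Turn]:
    """Alternate patient and advisor turns; only the patient sees the vignette."""
    history: List[Turn] = []
    patient_system = f"{PATIENT_PROMPT}\n\nCase vignette:\n{vignette}"
    for _ in range(max_turns):
        history.append(("patient", patient(patient_system, history)))
        history.append(("advisor", advisor(ADVISOR_PROMPT, history)))
    return history


def hit_rate(final_answers: Sequence[str],
             gold_conditions: Sequence[Sequence[str]]) -> float:
    """Fraction of answers that name at least one gold-standard condition."""
    hits = sum(
        any(gold.lower() in answer.lower() for gold in golds)
        for answer, golds in zip(final_answers, gold_conditions)
    )
    return hits / len(final_answers)


# Stub models so the sketch runs without any API; replace with real calls.
def stub_patient(system_prompt: str, history: List[Turn]) -> str:
    return "I got a sudden terrible headache and it hurts to look down."


def stub_advisor(system_prompt: str, history: List[Turn]) -> str:
    return "This could be a subarachnoid haemorrhage. Seek emergency care now."


if __name__ == "__main__":
    transcript = run_dialogue(stub_patient, stub_advisor,
                              "20-year-old student, severe headache", max_turns=1)
    # For brevity the demo scores the advisor's last message; the study
    # scored the participant's own final answer.
    print(hit_rate([transcript[-1][1]], [["subarachnoid haemorrhage"]]))
```

As the study's results suggest, a high score from a harness like this says little about how real users will fare, so it is better treated as a smoke test than as a substitute for human trials.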

Don’t blame the user

Given the scores LLMs could attain on their own, it might be tempting to blame the participants here. After all, in many cases, they received the right diagnoses in their conversations with LLMs, but still failed to guess them correctly. But that would be a foolhardy conclusion for any business, Volkheimer warns.

“In every customer environment, if your customers aren’t doing the thing you want them to, the last thing you do is blame the customer,” says Volkheimer. “The first thing you do is ask why. And not the ‘why’ off the top of your head: but a deep investigative, specific, anthropological, psychological, examined ‘why.’ That’s your starting point.”

You need to understand your audience, their goals, and the customer experience before deploying a chatbot, Volkheimer suggests. All of these will inform the thorough, specialized documentation that will ultimately make an LLM useful. Without carefully curated training materials, “It’s going to spit out some generic answer everyone hates, which is why people hate chatbots,” she says. When that happens, “It’s not because chatbots are terrible or because there’s something technically wrong with them. It’s because the stuff that went in them is bad.”

“The people designing technology, developing the information to go in there and the processes and systems are, well, people,” says Volkheimer. “They also have background, assumptions, flaws and blindspots, as well as strengths. And all those things can get built into any technological solution.”

Shape
Shape
Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy,  bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

Shape

Trump Overturns California Phaseout of Fossil Fuel Cars

President Donald Trump on Thursday signed into law congressional resolutions that overturn three California regulations for cleaner transport, including one that would phase out the sale of new fossil fuel vehicles by 2035. Last February the Environmental Protection Agency (EPA) said it was letting Congress review waivers it had issued

Read More »

Why people love Linux

The people who love Linux love it for a wide variety of reasons. Some of them appreciate having access to source code and the ability (if they’re so inclined) to modify it. Most love that the majority of Linux distributions are completely free. Some understand and appreciate that Linux is

Read More »

AMD steps up AI competition with Instinct MI350 chips, rack-scale platform

Other announcements included ROCm 7, the latest version of AMD’s open-source AI software stack, and the broad availability of its Developer Cloud, a fully managed platform aimed at accelerating high-performance AI development. Openness and Nvidia challenge AMD underscored its commitment to open standards and ecosystem collaboration, positioning itself in contrast

Read More »

NRG Expects to Line Up New US Data Center Deals This Quarter

NRG Energy Inc. is working on deals to power US data centers “all over the country” and will reveal details with second-quarter results, according to the top executive of the independent generator.  These are huge, complicated deals and it’s sensitive to figure out when to announce them, Chief Executive Officer Larry Coben said on Bloomberg TV.  Data centers are supercharging growth in electricity demand, and shares of power generators, long unloved by Wall Street, have rallied since early last year in this new paradigm. Investors are rewarding companies signing deals to supply power to tech giants. It’ll take all kinds of resources – including natural gas, nuclear, solar, wind, hydro and batteries – to meet needs for artificial intelligence, Coben said. Houston-based NRG will help meet this new demand thanks to turbine orders for natural gas plants and through its $12 billion deal with LS Power Equity Advisors LLC announced last month, he said.  “Every time there’s not a data center deal for a month, people wonder if they are real,” Coben said. “But now that we’ve seen the last two, they are clearly real.” WHAT DO YOU THINK? Generated by readers, the comments included herein do not reflect the views and opinions of Rigzone. All comments are subject to editorial review. Off-topic, inappropriate or insulting comments will be removed.

Read More »

State regulators, utilities support SPP fast-track interconnection plan

Dive Brief: State utility regulators and utilities such as American Electric Power and Southwestern Public Service Co. are urging the Federal Energy Regulatory Commission to approve the Southwest Power Pool’s fast-track interconnection proposal, according to filings at the agency on Thursday. However, NextEra Energy Resources, other independent power producers, clean energy trade groups and environmental groups asked FERC to reject the proposal, which they contend would give utilities and other “load-responsible entities” the sole ability to select projects for the process without state regulatory review. “We are concerned that [SPP’s Expedited Resource Adequacy Study] framework will undermine open access and enable preferential grid access for utility-affiliated generation, elevating the risk of uneconomic resources advancing through the queue, locking up transmission capacity, and raising costs for consumers,” the Clean Energy Buyers Association said in a filing. Dive Insight: Like some other grid operators, SPP is proposing to address near-term power supply needs by creating a temporary interconnection process that would run separately from its normal interconnection queue. SPP expects available capacity to drop below its reserve margins by 2027, according to its proposal, which it filed on May 22, a week after FERC rejected a similar proposal from the Midcontinent Independent System Operator. Under the proposal, load-responsible entities must verify whether a project can help close their resource adequacy deficiency, according to SPP. To enter the ERAS process, projects must meet readiness standards and other requirements. State oversight will help prevent utilities and other load-responsible entities from exercising undue preference, according to SPP. 
“LREs are subject to state regulation that mandates certain processes for resource selection and scrutinizes the prudence of the utility’s resource adequacy procurements, including ensuring cost-effective procurement to protect consumers,” the grid operator said in its proposal. The ERAS proposal was unanimously supported by the SPP Regional State Committee, which

Read More »

Oil Spikes as Israel’s Attacks on Iran Stoke Fears of Wider War

Oil surged the most in more than three years after Israel carried out airstrikes against Iran, raising fears of a wider war in a region that accounts for a third of global crude production. West Texas Intermediate crude futures advanced more than 7% to settle near $73 a barrel, the biggest one-day jump since March 2022. The price of European natural gas — also a major Middle Eastern export — rallied, and haven demand pushed gold closer to a record. US President Donald Trump urged Tehran to make a deal “before it is too late,” in a post on Truth Social. The next attacks already being planned will be “even more brutal,” he said. The attack marks the most dramatic escalation yet in a conflict that has loomed in the background of the oil market for about 20 months, but had yet to result in a significant loss of barrels. A broader regional clash in the Middle East threatens a major rerouting of global oil flows by restricting supplies through the Strait of Hormuz in addition to the possible reduction of Iranian exports, with JPMorgan Chase & Co. pegging the potential impact at more than 2.1 million barrels a day.  Israel launched another round of attacks on several locations in Iran, including the Natanz nuclear site and Tabriz. Israeli Prime Minister Benjamin Netanyahu earlier said strikes targeted Iran’s nuclear and ballistic missile programs and that the operation would continue until the threat was removed.  Hours after the first Israeli strikes, Tehran launched more than 100 drones, Israel Defence Forces said. Israel expects Iran to respond with missiles and further drone attacks, according to a military official. The Qatari Ministry of Foreign Affairs said in a post on X that it will work with regional and international partners to urgently stop the aggression on Iran, taking

Read More »

Trump Urges Iran to Agree Nuclear Deal to Avoid More Attacks

US President Donald Trump urged Iran to accept a nuclear deal to avoid further attacks, hours after Israel bombed the Islamic Republic’s atomic facilities and killed some of its top commanders. “There is still time to make this slaughter, with the next already planned attacks being even more brutal, come to an end,” Trump said on Truth Social. Tehran must make a deal “before it is too late,” he said. Israel said it struck around 100 targets across Iranian cities on Friday morning, using 200 planes. The attacks, which Israel has said will likely continue over the coming days, caused oil to surge as much as 13 percent, though it later pared its gains, and investors to buy havens such as gold and US Treasuries. Iran quickly responded by sending a wave of drones toward Israel, though it was unclear if they caused any damage. Some were intercepted over Jordan. Still, Israel expects Iran to retaliate with more drone strikes and also by firing ballistic missiles, according to a military official speaking on condition of anonymity. Explosions were heard across Tehran, Natanz – home to a key atomic site – and other cities, according to local and social media. Israeli Prime Minister Benjamin Netanyahu said Israel “struck at the heart of Iran’s nuclear-enrichment program.”  The head of the Islamic Revolutionary Guard Corps, Hossein Salami, and the military’s chief of staff, Mohammad Bagheri, were both killed, according to Iranian media. At least two other senior IRGC members also died. The United Nations’ atomic watchdog said there were no indications of increased radiation levels at Iran’s main uranium-enrichment site of Natanz, an early sign the strikes haven’t penetrated the layers of steel and concrete protecting the Islamic Republic’s nuclear stockpile. Still, Netanyahu said the strikes “will continue for as many days as it takes

Read More »

Qualitas Leads Bidding for $8 Billion Clean Energy Firm Cubico

Spain’s largest renewable fund Qualitas Energy is the frontrunner to acquire clean-energy company Cubico Sustainable Investments, according to people with knowledge of the matter. Qualitas has emerged as the leading bidder for the business after interest from KKR & Co.’s ContourGlobal, the other remaining suitor, has cooled, the people said, asking not to be identified as the information is private. London-based Cubico is owned by Ontario Teachers’ Pension Plan and PSP Investments. The Canadian pension funds have been working with Bank of America Corp. on the potential deal, seeking a valuation of about EUR 7 billion ($7.9 billion), Bloomberg News reported last year. A sale of Cubico could rank as one of the largest renewable energy deals this year globally and comes at a time the US administration’s refocus on boosting oil and gas supplies has reduced investor appetite for such assets. Qualitas may consider bringing in partners to help fund the acquisition or seek to divest some of Cubico’s assets shortly after a deal is completed, one of the people said. There’s no certainty the deliberations will lead to a transaction, and ContourGlobal’s interest could still be rekindled, the people said.  Representatives for Qualitas, KKR, OTPP and PSP declined to comment. Cubico was launched in 2015 by its current owners alongside Spain’s largest lender, Banco Santander SA, which sold its stake a year later. The company’s portfolio includes onshore wind and solar assets as well as battery energy storage systems and transmission lines, according to its website. It has a total installed capacity of 2.8 gigawatts, as well as 450 megawatts under construction and a development pipeline of more than 17 gigawatts. The firm has more than 500 employees globally. Qualitas chiefly invests in energy transition and sustainability-related assets. It has raised EUR 4.6 billion since its inception in 2006,

Read More »

Batteries are making the grid more reliable: NERC

Dive Brief: There were 30 weather-related events in the United States and Canada that caused more than $1 billion in damages last year, but “none resulted in operator-initiated load shed, unlike previous events of a similar scale,” the North American Electric Reliability Corp. said Thursday in its annual State of Reliability report. On whole, the bulk power system “remained reliable but challenged by adverse weather conditions and transitions in resource mix and usage,” according to the report. “Today’s transmission system is demonstrably more reliable and resilient with the severity and duration of outages declining,” the reliability watchdog said. The North American grid is challenged by the proliferation of large loads, such as data centers, and the operating profile of inverter-based resources, NERC said. But “reliability improvements were observed in areas with high concentrations of battery energy storage systems,” also known as BESS. Dive Insight: The size and speed with which data centers are expanding across the country presents “a significant near-term reliability challenge,” NERC said. Though there is uncertainty surrounding how much data center capacity will be built, and how quickly, experts agree they are driving up electricity demand. Data centers could account for 44% of U.S. electricity load growth from 2023 to 2028, Bain & Company said in an October analysis. These large loads can be developed faster than the generation and transmission infrastructure needed to support them, “resulting in lower system stability,” NERC said in its report. 
“Additionally, the voltage sensitivity and rapidly changing, often unpredictable, power usage of these facilities creates new operating challenges.” NERC said more accurate models of the operational characteristics of data centers “are essential to reliability to prevent instability caused by these large changes in electricity demand.” There was also more than 45 GW of new inverter-based capacity added to the bulk power system in

Read More »

Oracle’s struggle with capacity meant they made the difficult but responsible decisions

IDC President Crawford Del Prete agreed, and said that Oracle senior management made the right move, despite how difficult the situation is today. “Oracle is being incredibly responsible here. They don’t want to have a lot of idle capacity. That capacity does have a shelf life,” Del Prete said. CEO Katz “is trying to be extremely precise about how much capacity she puts on.” Del Prete said that, for the moment, Oracle’s capacity situation is unique to the company, and has not been a factor with key rivals AWS, Microsoft, and Google. During the investor call, Katz said that her team “made engineering decisions that were much different from the other hyperscalers and that were better suited to the needs of enterprise customers, resulting in lower costs to them and giving them deployment flexibility.” Oracle management certainly anticipated a flurry of orders, but Katz said that she chose to not pay for expanded capacity until she saw finalized “contracted noncancelable bookings.” She pointed to a huge capex line of $9.1 billion and said, “the vast majority of our capex investments are for revenue generating equipment that is going into data centers and not for land or buildings.”

Read More »

Winners and losers in the Top500 supercomputer ranking

GPU winner: AMD AMD is finally making a showing for itself, albeit modestly, in GPU accelerators. For the June 2025 edition of the list, AMD Instinct accelerators are in 23 systems, a nice little jump from the 10 systems on the June 2024 list. Of course, it helps with the sales pitch when AMD processors and coprocessors can be found powering the No. 1 and No. 2 supercomputers in the world. GPU loser: Intel Intel’s GPU efforts have been a disaster. It failed to make a dent in the consumer space with its Arc GPUs, and it isn’t making much headway in the data center, either. There were only four systems running GPU Max processors on the list, and that’s up from three a year ago. Still, it’s pitiful showing given the effort Intel made. Server winners: HPE, Dell, EVIDAN, Nvidia The four server vendors — servers, not component makers — all saw share increases. Nvidia is also a server vendor, selling its SuperPOD AI servers directly to customers. They all gained at the expense of Lenovo and Arm. Server loser: Lenovo It saw the sharpest drop in server share, going from 163 systems in June of 2024 to 136 in this most recent listing. Loser: Arm Other than the 13 Nvidia Grace chips, the ARM architecture was completely absent from this spring’s list.

Read More »

Micron joins HBM4 race with 36GB 12-high stack, eyes AI and data center dominance

Race to power the next generation of AI By shipping samples of the HMB4 to the key customers, Micron has joined SK hynix in the HBM4 race. In March this year, SK hynix shipped the 12-Layer HBM4 samples to customers. SK hynix’s HBM4 has implemented bandwidth capable of processing more than 2TB of data per second, processing data equivalent to more than 400 full-HD movies (5GB each) in a second, said the company. “HBM competitive landscape, SK hynix has already sampled and secured approval of HBM4 12-high stack memory early Q1’2025 to NVIDIA for its next generation Rubin product line and plans to mass produce HBM4 in 2H 2025,” said Danish Faruqui, CEO, Fab Economics. “Closely following, Micron is pending Nvidia’s tests for its latest HBM4 samples, and Micron plans to mass produce HBM4 in 1H 2026. On the other hand, the last contender, Samsung is struggling with Yield Ramp on HBM4 Technology Development stage, and so has to delay the customer samples milestones to Nvidia and other players while it earlier shared an end of 2025 milestone for mass producing HBM4.” Faruqui noted another key differentiator among SK hynix, Micron, and Samsung: the base die that anchors the 12-high DRAM stack. For the first time, both SK hynix and Samsung have introduced a logic-enabled base die on 3nm and 4nm process technology to enable HBM4 product for efficient and faster product performance via base logic-driven memory management. Both Samsung and SK hynix rely on TSMC for the production of their logic-enabled base die. However, it remains unclear whether Micron is using a logic base die, as the company lacks in-house capability to fabricate at 3nm.

Read More »

Cisco reinvigorates data center, campus, branch networking with AI demands in mind

“We have a number of … enterprise data center customers that have been using bi-directional optics for many generations, and this is the next generation of that feature,” said Bill Gartner, senior vice president and general manager of Cisco’s optical systems and optics business. “The 400G lets customer use their existing fiber infrastructure and reduces fiber count for them so they can use one fiber instead of two, for example,” Gartner said. “What’s really changed in the last year or so is that with AI buildouts, there’s much, much more optics that are part of 400G and 800G, too. For AI infrastructure, the 400G and 800G optics are really the dominant optics going forward,” Gartner said. New AI Pods Taking aim at next-generation interconnected compute infrastructures, Cisco expanded its AI Pod offering with the Nvidia RTX 6000 Pro and Cisco UCS C845A M8 server package. Cisco AI Pods are preconfigured, validated, and optimized infrastructure packages that customers can plug into their data center or edge environments as needed. The Pods include Nvidia AI Enterprise, which features pretrained models and development tools for production-ready AI, and are managed through Cisco Intersight. The Pods are based on Cisco Validated Design principals, which offer customers pre-tested and validated network designs that provide a blueprint for building reliable, scalable, and secure network infrastructures, according to Cisco. Building out the kind of full-scale AI infrastructure compute systems that hyperscalers and enterprises will utilize is a huge opportunity for Cisco, said Daniel Newman, CEO of The Futurum Group. “These are full-scale, full-stack systems that could land in a variety of enterprise and enterprise service application scenarios, which will be a big story for Cisco,” Newman said. Campus networking For the campus, Cisco has added two new programable SiliconOne-based Smart Switches: the C9350 Fixed Access Smart Switches and C9610

Read More »

Qualcomm’s $2.4B Alphawave deal signals bold data center ambitions

Qualcomm says its Oryon CPU and Hexagon NPU processors are “well positioned” to meet growing demand for high-performance, low-power compute as AI inferencing accelerates and more enterprises move to custom CPUs housed in data centers. “Qualcomm’s advanced custom processors are a natural fit for data center workloads,” Qualcomm president and CEO Cristiano Amon said in the press release. Alphawave’s connectivity and compute technologies can work well with the company’s CPU and NPU cores, he noted. The deal is expected to close in the first quarter of 2026. Complementing the ‘great CPU architecture’ Qualcomm has been amassing Client CPUs have been a “big play” for Qualcomm, Moor’s Kimball noted; the company acquired chip design company Nuvia in 2021 for $1.4 billion and has also announced that it will be designing data center CPUs with Saudi AI company Humain. “But there was a lot of data center IP that was equally valuable,” he said. This acquisition of Alphawave will help Qualcomm complement the “great CPU architecture” it acquired from Nuvia with the latest in connectivity tools that link a compute complex with other devices, as well as with chip-to-chip communications, and all of the “very low level architectural goodness” that allows compute cores to deliver “absolute best performance.” “When trying to move data from, say, high bandwidth memory to the CPU, Alphawave provides the IP that helps chip companies like Qualcomm,” Kimball explained. “So you can see why this is such a good complement.”


LiquidStack launches cooling system for high density, high-powered data centers

The CDU is serviceable from the front of the unit, with no rear or end access required, allowing the system to be placed against the wall. The skid-mounted system can come with rail and overhead piping pre-installed or shipped as separate cabinets for on-site assembly. The single-phase system has high-efficiency dual pumps designed to protect critical components from leaks, and its centralized design with separate pump and control modules reduces both the number of components and complexity. “AI will keep pushing thermal output to new extremes, and data centers need cooling systems that can be easily deployed, managed, and scaled to match heat rejection demands as they rise,” said Joe Capes, CEO of LiquidStack, in a statement. “With up to 10MW of cooling capacity at N, N+1, or N+2, the GigaModular is a platform like no other—we designed it to be the only CDU our customers will ever need. It future-proofs design selections for direct-to-chip liquid cooling without traditional limits or boundaries.”


Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one that is ramping up its investments in AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs). In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple to devote $200 billion between them to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are way higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.


John Deere unveils more autonomous farm machines to address skill labor shortage

Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. The Moline, Illinois-based John Deere has been in business for 187 years, yet it has become a regular at the big tech trade show in Las Vegas as a non-tech company showing off technology, and it is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech. The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually, and the agricultural work force continues to shrink. (This is my hint to the anti-immigration crowd).

John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app.

While each of these industries experiences its own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do


2025 playbook for enterprise AI success, from agents to evals

2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year.

1. Agents: the next generation of automation

AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for companies, and that recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to


OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more. The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks.

Going all-in on red teaming pays practical, competitive dividends

It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S.’s National Institute of Standards and Technology (NIST), which all had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find.
What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle
