GPT-4.5 for enterprise: Do its accuracy and knowledge justify the cost?

The release of OpenAI's GPT-4.5 has been somewhat disappointing, with many pointing to its steep price (about 10 to 20X more expensive than Claude 3.7 Sonnet and 15 to 30X more expensive than GPT-4o).

However, given that this is OpenAI’s largest and most powerful non-reasoning model, it is worth considering its strengths and the areas where it shines.

Better knowledge and alignment

There is little detail about the model's architecture or training corpus, but rough estimates suggest it was trained with 10X more compute. The model was so large that OpenAI needed to spread training across multiple data centers to finish in a reasonable time.

Bigger models have a larger capacity for learning world knowledge and the nuances of human language (provided they have access to high-quality training data). This is evident in some of the metrics presented by the OpenAI team. For example, GPT-4.5 has a record-high ranking on PersonQA, a benchmark that evaluates hallucinations in AI models.

Practical experiments also show that GPT-4.5 is better than other general-purpose models at remaining true to facts and following user instructions.

Users have pointed out that GPT-4.5's responses feel more natural and context-aware than those of previous models. Its ability to follow tone and style guidelines has also improved.

After the release of GPT-4.5, AI scientist and OpenAI co-founder Andrej Karpathy, who had early access to the model, said he “expect[ed] to see an improvement in tasks that are not reasoning-heavy, and I would say those are tasks that are more EQ (as opposed to IQ) related and bottlenecked by e.g. world knowledge, creativity, analogy making, general understanding, humor, etc.”

However, evaluating writing quality is also very subjective. In a survey that Karpathy ran on different prompts, most people preferred the responses of GPT-4o over GPT-4.5. He wrote on X: “Either the high-taste testers are noticing the new and unique structure but the low-taste ones are overwhelming the poll. Or we’re just hallucinating things. Or these examples are just not that great. Or it’s actually pretty close and this is way too small sample size. Or all of the above.”

Better document processing

In its experiments, Box, which has integrated GPT-4.5 into its Box AI Studio product, wrote that GPT-4.5 is “particularly potent for enterprise use-cases, where accuracy and integrity are mission critical… our testing shows that GPT-4.5 is one of the best models available both in terms of our eval scores and also its ability to handle many of the hardest AI questions that we have come across.”

In its internal evaluations, Box found GPT-4.5 to be more accurate on enterprise document question-answering tasks, outperforming the original GPT-4 by about 4 percentage points on its test set.

Box’s tests also indicated that GPT-4.5 excelled at math questions embedded in business documents, which older GPT models often struggled with. For example, it was better at answering questions about financial documents that required reasoning over data and performing calculations.

GPT-4.5 also showed improved performance at extracting information from unstructured data. In a test that involved extracting fields from hundreds of legal documents, GPT-4.5 was 19% more accurate than GPT-4o.

Planning, coding, evaluating results

Given its improved world knowledge, GPT-4.5 can also be a suitable model for creating high-level plans for complex tasks. The resulting steps can then be handed to smaller, more efficient models to elaborate and execute.
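The plan-then-delegate pattern can be sketched roughly as follows. This is a minimal illustration, not any vendor's API: `plan_and_execute`, `ModelFn`, and the two stand-in model functions are hypothetical names, and in practice the planner would wrap a call to a large model such as GPT-4.5 while the executor would wrap a cheaper one.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to text.
ModelFn = Callable[[str], str]


def plan_and_execute(task: str, planner: ModelFn, executor: ModelFn) -> List[str]:
    """Ask the large model for a plan once, then run each step on a cheaper model."""
    plan_text = planner(f"Break this task into short numbered steps:\n{task}")
    # Treat every non-empty line of the plan as one step.
    steps = [line.strip() for line in plan_text.splitlines() if line.strip()]
    results = []
    for step in steps:
        results.append(executor(f"Task: {task}\nCarry out this step: {step}"))
    return results


# Stand-in models so the sketch runs without network access (assumptions only).
def fake_planner(prompt: str) -> str:
    return (
        "1. Collect the quarterly reports\n"
        "2. Extract revenue figures\n"
        "3. Summarize trends"
    )


def fake_executor(prompt: str) -> str:
    # Echo back the step text found on the last line of the prompt.
    step = prompt.splitlines()[-1].split(": ", 1)[1]
    return f"done: {step}"


outputs = plan_and_execute("Analyze quarterly revenue", fake_planner, fake_executor)
for line in outputs:
    print(line)
```

The expensive model is called once per task, while the cheaper model is called once per step, which is where most of the token volume tends to land.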

According to Constellation Research, “In initial testing, GPT-4.5 seems to show strong capabilities in agentic planning and execution, including multi-step coding workflows and complex task automation.”

GPT-4.5 can also be useful in coding tasks that require internal and contextual knowledge. GitHub now provides limited access to the model in its Copilot coding assistant and notes that GPT-4.5 “performs effectively with creative prompts and provides reliable responses to obscure knowledge queries.”

Given its deeper world knowledge, GPT-4.5 is also suitable for “LLM-as-a-Judge” tasks, where a strong model evaluates the output of smaller models. For example, a model such as GPT-4o or o3 can generate one or several responses, reason over the solution and pass the final answer to GPT-4.5 for revision and refinement.
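One simple variant of the judge pattern is to let a cheaper model propose several candidates and have the stronger model score them. The sketch below assumes exactly that setup. `pick_best`, `GenerateFn`, `JudgeFn`, and the toy scoring rule are hypothetical placeholders, and a real judge would be a prompt to a model such as GPT-4.5 that returns a score.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces, not a real library API.
GenerateFn = Callable[[str], List[str]]  # cheaper model: prompt -> candidate answers
JudgeFn = Callable[[str, str], float]    # stronger model: (prompt, answer) -> score


def pick_best(prompt: str, generate: GenerateFn, judge: JudgeFn) -> Tuple[str, float]:
    """Generate candidates with one model and keep the one the judge scores highest."""
    candidates = generate(prompt)
    if not candidates:
        raise ValueError("generator returned no candidates")
    scored = [(answer, judge(prompt, answer)) for answer in candidates]
    return max(scored, key=lambda pair: pair[1])


# Stand-ins so the sketch runs offline (assumptions only).
def fake_generate(prompt: str) -> List[str]:
    return ["Paris", "Paris is the capital of France.", "I am not sure."]


def fake_judge(prompt: str, answer: str) -> float:
    # Toy rubric: one point each for naming the city and the country.
    return float("Paris" in answer) + float("France" in answer)


best_answer, best_score = pick_best(
    "What is the capital of France?", fake_generate, fake_judge
)
print(best_answer, best_score)
```

The same structure works for the revision flow described above: replace the numeric score with a rewritten answer and return that instead.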

Is it worth the price?

Given the high cost of GPT-4.5, though, many of these use cases are hard to justify today. That may not remain the case. One constant trend in recent years has been the plummeting cost of inference, and if this trend applies to GPT-4.5, it is worth experimenting with the model and finding ways to put its power to use in enterprise applications.
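To make the trade-off concrete, here is a back-of-the-envelope sketch using the 15 to 30X multiple over GPT-4o cited at the top of this article. The baseline spend of 1,000 per month and the 10% routing share are purely hypothetical inputs, and `blended_cost` is an illustrative helper, not a pricing formula from OpenAI.

```python
def blended_cost(baseline_cost: float, premium_multiple: float, premium_share: float) -> float:
    """Total cost when `premium_share` of a workload moves to a model that costs
    `premium_multiple` times the baseline, and the rest stays on the baseline."""
    if not 0.0 <= premium_share <= 1.0:
        raise ValueError("premium_share must be between 0 and 1")
    if baseline_cost < 0 or premium_multiple < 0:
        raise ValueError("costs and multiples must be non-negative")
    return baseline_cost * ((1.0 - premium_share) + premium_share * premium_multiple)


# Hypothetical monthly spend on the cheaper model (assumption for illustration).
BASELINE_MONTHLY_SPEND = 1_000.0
# Multiples relative to GPT-4o, as cited in the article.
LOW_MULTIPLE, HIGH_MULTIPLE = 15.0, 30.0

full_switch = (
    blended_cost(BASELINE_MONTHLY_SPEND, LOW_MULTIPLE, 1.0),
    blended_cost(BASELINE_MONTHLY_SPEND, HIGH_MULTIPLE, 1.0),
)
partial_switch = (
    blended_cost(BASELINE_MONTHLY_SPEND, LOW_MULTIPLE, 0.10),
    blended_cost(BASELINE_MONTHLY_SPEND, HIGH_MULTIPLE, 0.10),
)

print("full switch:", [round(c, 2) for c in full_switch])
print("10% routed: ", [round(c, 2) for c in partial_switch])
```

Under these assumptions, moving everything to the premium model multiplies spend by 15 to 30, while routing only a tenth of the workload raises it by roughly 2.4 to 3.9 times. That gap is why selective use tends to be easier to justify than a wholesale switch.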

It is also worth noting that this new model can become the basis for future reasoning models. Per Karpathy: “Keep in mind that that GPT4.5 was only trained with pretraining, supervised finetuning and RLHF [reinforcement learning from human feedback], so this is not yet a reasoning model. Therefore, this model release does not push forward model capability in cases where reasoning is critical (math, code, etc.)… Presumably, OpenAI will now be looking to further train with reinforcement learning on top of GPT-4.5 model to allow it to think, and push model capability in these domains.”

