From terabytes to insights: Real-world AI observability architecture

Consider maintaining and developing an e-commerce platform that processes millions of transactions every minute, generating large amounts of telemetry data, including metrics, logs and traces across multiple microservices. When critical incidents occur, on-call engineers face the daunting task of sifting through an ocean of data to isolate the relevant signals and insights, the proverbial needle in a haystack.

This makes observability a source of frustration rather than insight. To alleviate this major pain point, I started exploring a solution to utilize the Model Context Protocol (MCP) to add context and draw inferences from the logs and distributed traces. In this article, I’ll outline my experience building an AI-powered observability platform, explain the system architecture and share actionable insights learned along the way.

Why is observability challenging?

In modern software systems, observability is not a luxury; it’s a basic necessity. The ability to measure and understand system behavior is foundational to reliability, performance and user trust. As the saying goes, “What you cannot measure, you cannot improve.”

Yet, achieving observability in today’s cloud-native, microservice-based architectures is more difficult than ever. A single user request may traverse dozens of microservices, each emitting logs, metrics and traces. The result is an abundance of telemetry data:


  • Tens of terabytes of logs per day
  • Tens of millions of metric data points and pre-aggregates
  • Millions of distributed traces
  • Thousands of correlation IDs generated every minute

The challenge is not only the data volume, but the data fragmentation. According to New Relic’s 2023 Observability Forecast Report, 50% of organizations report siloed telemetry data, with only 33% achieving a unified view across metrics, logs and traces.

Logs tell one part of the story, metrics another, traces yet another. Without a consistent thread of context, engineers are forced into manual correlation, relying on intuition, tribal knowledge and tedious detective work during incidents.

Because of this complexity, I started to wonder: How can AI help us get past fragmented data and offer comprehensive, useful insights? Specifically, can we make telemetry data intrinsically more meaningful and accessible for both humans and machines using a structured protocol such as MCP? This project’s foundation was shaped by that central question.

Understanding MCP: A data pipeline perspective

Anthropic defines MCP as an open standard that allows developers to create a secure two-way connection between data sources and AI tools. This structured data pipeline includes:

  • Contextual ETL for AI: Standardizing context extraction from multiple data sources.
  • Structured query interface: Allows AI queries to access data layers that are transparent and easily understandable.
  • Semantic data enrichment: Embeds meaningful context directly into telemetry signals.

This has the potential to shift platform observability away from reactive problem solving and toward proactive insights.

System architecture and data flow

Before diving into the implementation details, let’s walk through the system architecture.

Architecture diagram for the MCP-based AI observability system

In the first layer, we generate context-enriched telemetry by embedding standardized metadata in the signals: distributed traces, logs and metrics. In the second layer, the enriched data is fed into the MCP server, which indexes it, adds structure and exposes the context-enriched data to clients through APIs. Finally, the AI-driven analysis engine uses the structured, enriched telemetry for anomaly detection, correlation and root-cause analysis to troubleshoot application issues.

This layered design ensures that AI and engineering teams receive context-driven, actionable insights from telemetry data.

Implementation deep dive: A three-layer system

Let’s explore the actual implementation of our MCP-powered observability platform, focusing on the data flows and transformations at each step.

Layer 1: Context-enriched data generation

First, we need to ensure our telemetry data contains enough context for meaningful analysis. The core insight is that data correlation needs to happen at creation time, not analysis time.

import json
import logging
import uuid

from opentelemetry import trace

tracer = trace.get_tracer(__name__)
logger = logging.getLogger(__name__)


def process_checkout(user_id, cart_items, payment_method):
    """Simulate a checkout process with context-enriched telemetry."""

    # Generate correlation ids
    order_id = f"order-{uuid.uuid4().hex[:8]}"
    request_id = f"req-{uuid.uuid4().hex[:8]}"

    # Initialize the context dictionary applied to every signal
    context = {
        "user_id": user_id,
        "order_id": order_id,
        "request_id": request_id,
        "cart_item_count": len(cart_items),
        "payment_method": payment_method,
        "service_name": "checkout",
        "service_version": "v1.0.0"
    }

    # Start an OTel trace carrying the same context
    with tracer.start_as_current_span(
        "process_checkout",
        attributes={k: str(v) for k, v in context.items()}
    ) as checkout_span:

        # Logging using the same context
        logger.info("Starting checkout process", extra={"context": json.dumps(context)})

        # Context propagation into a child span
        with tracer.start_as_current_span("process_payment"):
            # Process payment logic...
            logger.info("Payment processed", extra={"context": json.dumps(context)})

Code 1. Context enrichment for logs and traces

This approach ensures that every telemetry signal (logs, metrics, traces) contains the same core contextual data, solving the correlation problem at the source.
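The payoff of creation-time correlation is that joining signals later needs no fuzzy matching. A minimal, self-contained sketch (the record shapes and the `correlate_by_request` helper are illustrative, not part of the original system):

```python
def correlate_by_request(logs, spans, request_id):
    """Return all telemetry emitted for a single request via an exact join on request_id."""
    return {
        "logs": [l for l in logs if l["context"].get("request_id") == request_id],
        "spans": [s for s in spans if s["attributes"].get("request_id") == request_id],
    }

# Sample records mimicking the enriched signals produced in Code 1
logs = [
    {"message": "Starting checkout", "context": {"request_id": "req-1a2b"}},
    {"message": "Payment processed", "context": {"request_id": "req-1a2b"}},
    {"message": "Unrelated request", "context": {"request_id": "req-9z8y"}},
]
spans = [
    {"name": "process_checkout", "attributes": {"request_id": "req-1a2b"}},
    {"name": "process_payment", "attributes": {"request_id": "req-1a2b"}},
]

bundle = correlate_by_request(logs, spans, "req-1a2b")
```

Because every signal carries the same `request_id`, a single equality filter recovers the complete picture of one request.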

Layer 2: Data access through the MCP server

Next, I built an MCP server that transforms raw telemetry into a queryable API. The core data operations here involve the following:

  1. Indexing: Creating efficient lookups across contextual fields
  2. Filtering: Selecting relevant subsets of telemetry data
  3. Aggregation: Computing statistical measures across time windows

from datetime import datetime
from typing import List

from fastapi import FastAPI

app = FastAPI()
LOG_DB: List[dict] = []  # in-memory log store populated by the ingestion layer
# LogQuery and Log are Pydantic request/response models (definitions omitted)

@app.post("/mcp/logs", response_model=List[Log])
def query_logs(query: LogQuery):
    """Query logs with specific filters."""
    results = LOG_DB.copy()

    # Apply contextual filters
    if query.request_id:
        results = [log for log in results if log["context"].get("request_id") == query.request_id]

    if query.user_id:
        results = [log for log in results if log["context"].get("user_id") == query.user_id]

    # Apply time-based filters
    if query.time_range:
        start_time = datetime.fromisoformat(query.time_range["start"])
        end_time = datetime.fromisoformat(query.time_range["end"])
        results = [log for log in results
                   if start_time <= datetime.fromisoformat(log["timestamp"]) <= end_time]

    # Sort by timestamp, newest first
    results = sorted(results, key=lambda x: x["timestamp"], reverse=True)

    return results[:query.limit] if query.limit else results

Code 2. Data transformation using the MCP server
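Code 2 references a `LogQuery` request model whose definition is not shown. A minimal sketch, with field names inferred from the handler and expressed here as a plain dataclass rather than the Pydantic model FastAPI would actually use:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class LogQuery:
    """Filters accepted by the /mcp/logs endpoint (fields inferred from Code 2)."""
    request_id: Optional[str] = None              # exact match on context.request_id
    user_id: Optional[str] = None                 # exact match on context.user_id
    time_range: Optional[Dict[str, str]] = None   # {"start": iso8601, "end": iso8601}
    limit: Optional[int] = None                   # cap on returned rows (newest first)

# Example: the last 100 logs for one request
q = LogQuery(request_id="req-1a2b", limit=100)
```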

This layer transforms our telemetry from an unstructured data lake into a structured, query-optimized interface that an AI system can efficiently navigate.
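The third operation listed above, aggregation, is not shown in Code 2. A hedged sketch of what a windowed metric aggregation could look like on the server side; the point layout (`timestamp`/`value` dicts) and the 60-second bucket size are assumptions, not the original implementation:

```python
from collections import defaultdict
from datetime import datetime

def aggregate_metric(points, window_seconds=60):
    """Bucket raw metric points into fixed time windows and compute per-window stats.

    `points` is assumed to be a list of {"timestamp": iso8601, "value": float}.
    """
    buckets = defaultdict(list)
    for p in points:
        ts = datetime.fromisoformat(p["timestamp"]).timestamp()
        # Snap each point to the start of its window
        buckets[int(ts // window_seconds) * window_seconds].append(p["value"])
    return {
        bucket: {
            "count": len(vals),
            "mean": sum(vals) / len(vals),
            "max": max(vals),
        }
        for bucket, vals in sorted(buckets.items())
    }

# Three latency points spanning two one-minute windows
points = [
    {"timestamp": "2024-01-01T00:00:05", "value": 120.0},
    {"timestamp": "2024-01-01T00:00:30", "value": 180.0},
    {"timestamp": "2024-01-01T00:01:10", "value": 300.0},
]
stats = aggregate_metric(points)
```

Pre-computing windowed statistics like these keeps the AI layer from repeatedly scanning raw points for every question it asks.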

Layer 3: AI-driven analysis engine

The final layer is an AI component that consumes data through the MCP interface, performing:

  1. Multi-dimensional analysis: Correlating signals across logs, metrics and traces.
  2. Anomaly detection: Identifying statistical deviations from normal patterns.
  3. Root cause determination: Using contextual clues to isolate likely sources of issues.
# Requires: import statistics; from datetime import datetime, timedelta
def analyze_incident(self, request_id=None, user_id=None, timeframe_minutes=30):
    """Analyze telemetry data to determine root cause and recommendations."""

    # Define the analysis time window
    end_time = datetime.now()
    start_time = end_time - timedelta(minutes=timeframe_minutes)
    time_range = {"start": start_time.isoformat(), "end": end_time.isoformat()}

    # Fetch relevant telemetry based on context
    logs = self.fetch_logs(request_id=request_id, user_id=user_id, time_range=time_range)

    # Extract services mentioned in logs for targeted metric analysis
    services = set(log.get("service", "unknown") for log in logs)

    # Get metrics for those services
    metrics_by_service = {}
    for service in services:
        for metric_name in ["latency", "error_rate", "throughput"]:
            metric_data = self.fetch_metrics(service, metric_name, time_range)

            # Calculate statistical properties
            values = [point["value"] for point in metric_data["data_points"]]
            metrics_by_service[f"{service}.{metric_name}"] = {
                "mean": statistics.mean(values) if values else 0,
                "median": statistics.median(values) if values else 0,
                "stdev": statistics.stdev(values) if len(values) > 1 else 0,
                "min": min(values) if values else 0,
                "max": max(values) if values else 0
            }

    # Identify anomalies using z-score
    anomalies = []
    for metric_name, stats in metrics_by_service.items():
        if stats["stdev"] > 0:  # Avoid division by zero
            z_score = (stats["max"] - stats["mean"]) / stats["stdev"]
            if z_score > 2:  # More than 2 standard deviations
                anomalies.append({
                    "metric": metric_name,
                    "z_score": z_score,
                    "severity": "high" if z_score > 3 else "medium"
                })

    # The summary and recommendation come from an LLM pass over the correlated
    # logs and anomalies; generate_ai_insights is a placeholder for that call
    ai_summary, ai_recommendation = self.generate_ai_insights(logs, anomalies)

    return {
        "summary": ai_summary,
        "anomalies": anomalies,
        "impacted_services": list(services),
        "recommendation": ai_recommendation
    }

Code 3. Incident analysis, anomaly detection and inferencing method
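The z-score check at the heart of Code 3 can be exercised in isolation. A minimal sketch using the same thresholds; the helper name and the sample latency series are invented for illustration:

```python
import statistics

def zscore_anomaly(values, z_threshold=2.0):
    """Flag a metric series whose peak deviates strongly from its mean,
    mirroring the z-score check in Code 3."""
    if len(values) < 2:
        return None
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:  # Flat series: nothing to flag
        return None
    z = (max(values) - mean) / stdev
    if z <= z_threshold:
        return None
    return {"z_score": z, "severity": "high" if z > 3 else "medium"}

# A latency series (ms) with one pronounced spike
latencies = [100, 102, 98, 101, 99, 100, 250]
anomaly = zscore_anomaly(latencies)
```

Note that the peak-based z-score flags a single spike well, but a sustained shift in the mean would warrant a windowed comparison instead.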

Impact of MCP-enhanced observability

Integrating MCP with observability platforms could improve the management and comprehension of complex telemetry data. The potential benefits include:

  • Faster anomaly detection, resulting in reduced mean time to detect (MTTD) and mean time to resolve (MTTR).
  • Easier identification of root causes for issues.
  • Less noise and fewer unactionable alerts, thus reducing alert fatigue and improving developer productivity.
  • Fewer interruptions and context switches during incident resolution, resulting in improved operational efficiency for an engineering team.

Actionable insights

Here are some key insights from this project that will help teams with their observability strategy.

  • Contextual metadata should be embedded early in the telemetry generation process to facilitate downstream correlation.
  • Structured data interfaces create API-driven, structured query layers to make telemetry more accessible.
  • Context-aware AI focuses analysis on context-rich data to improve accuracy and relevance.
  • Context enrichment and AI methods should be refined on a regular basis using practical operational feedback.

Conclusion

The amalgamation of structured data pipelines and AI holds enormous promise for observability. By leveraging structured protocols such as MCP and AI-driven analysis, we can transform vast telemetry data into actionable insights, resulting in proactive rather than reactive systems. The three pillars of observability that Lumigo identifies (logs, metrics and traces) are essential, but without integration, engineers are forced to manually correlate disparate data sources, slowing incident response.

Extracting meaning from telemetry requires structural changes to how we generate it, not just better analytical techniques.

Pronnoy Goswami is an AI and data scientist with more than a decade in the field.

Shape
Shape
Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy,  bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

Shape

Mapping Trump’s tariffs by trade balance and geography

U.S. importers may soon see costs rise for many imported goods, as tariffs on foreign goods are set to rise. On July 31, President Donald Trump announced country-specific reciprocal tariffs would finally be implemented on Aug. 7, after a monthslong pause. The news means more than 90 countries will see

Read More »

JF Expands in Southwest with Maverick Acquisition

The JF Group (JF) has acquired Arizona-based Maverick Petroleum Services. JF, a fueling infrastructure, petroleum equipment distribution, service, general contracting, and construction services provider, said in a media release that Maverick brings expertise in the installation, maintenance, and repair of petroleum handling equipment, Point-of-Sale (POS) systems, and environmental testing. As

Read More »

Glencore Shakes Up Trading Team

The head of Glencore Plc’s huge coal-trading operation is leaving in the biggest shake-up of the trading unit in years, at a time when the commodity giant is struggling to revive its share price.  The company, which traces its roots to the legendary commodity trader Marc Rich, has been reviewing its mining and smelting assets and recently unveiled a $1 billion cost cutting target. On Wednesday, it disappointed investors with weak results for the first half of the year that included one of the worst performances from its energy- and coal-trading unit on record. Glencore is reshuffling its trading team as Ruan van Schalkwyk, 42, a longstanding executive who runs coal trading, is retiring, according to a memo from Chief Executive Officer Gary Nagle that was seen by Bloomberg News. Glencore is the world’s largest shipper of coal.  Jyothish George, currently head of copper and cobalt trading, will take on a wider role as head of metals, iron ore and coal trading, according to the memo. Several trading executives who currently report to Nagle will now report to him, including Peter Hill, the head of iron ore, and Robin Scheiner, head of alumina and aluminum.  Alex Sanna, who runs oil, gas and power trading, will continue to report to Nagle. Under George, trading responsibilities are being reassigned. David Thomas, currently head of ferroalloys trading, will take over thermal coal. Paymahn Seyed-Safi will have responsibility for chrome as well as nickel; and Hill will take over responsibility for metallurgical coal, vanadium and manganese as well as iron ore. The changes come after Glencore’s trading teams reported starkly different results for the first half of the year. The company’s metals traders notched up their best half-yearly performance on record, while their energy and coal-trading peers struggled to even turn a profit.  Van Schalkwyk ran ferroalloys trading

Read More »

AI could cut disaster infrastructure losses by 15%, new research finds

Dive Brief: AI applications such as predictive maintenance and digital twins could prevent 15% of projected natural disaster losses to power grids, water systems and transportation infrastructure, amounting to $70 billion in savings worldwide by 2050, according to a recently released Deloitte Center for Sustainable Progress report. Governments and other stakeholders need to overcome technological limitations, financial constraints, regulatory uncertainty, data availability and security concerns before AI-enabled resilience can be widely adopted for infrastructure systems, according to the report. “Investing in AI can help deliver less frequent or shorter power outages, faster system recovery after storms, or fewer damaged or non-usable roads and bridges,” Jennifer Steinmann, Deloitte Global Sustainability Business leader, said in an email. Dive Insight: Natural disasters have caused nearly $200 billion in average annual losses to infrastructure around the world over the past 15 years, according to Deloitte. The report projects that could increase to approximately $460 billion by 2050. Climate change is expected to increase the frequency and intensity of these events, leading to higher losses, according to the report.   “Investing in AI has the greatest near-term potential to help reduce damages from storms, which include tropical cyclones, tornados, thunderstorms, hailstorms, and blizzards,” Steinmann said. “These natural disasters drive the largest share of infrastructure losses, due to their high frequency, wide geographic reach, and increasing intensity.” The AI for Infrastructure Resilience report uses empirical case studies, probabilistic risk modeling and economic forecasting to show how AI can help leaders fortify infrastructure so they can plan, respond and recover more quickly from natural disasters. 
“AI technologies can offer preventative, detective and responsive solutions to help address natural disasters — but some interventions are more impactful than others,” Steinmann said. Investing in AI while infrastructure is in planning stages accounts for roughly two-thirds of AI’s potential to prevent

Read More »

Lawmaker, AARP call for nationwide utility commission reforms to stop rising electric bills

Dive Brief: Utility commissions across the nation are “broken” and must be reformed to stop rising electric costs, a coalition of lawmakers, consumer and environmental advocates said Tuesday during a joint press conference. Speakers at the press conference said it seemed likely that the Florida Public Service Commission would “rubber stamp” a $9.8 billion base rate increase proposed by Florida Power & Light, and argued that regulators have become too deferential to utility requests. U.S. Rep. Kathy Castor, D-Fla., reintroduced legislation on Tuesday that would prohibit utility companies from using ratepayer dollars to fund political lobbying and advertising. Dive Insight: What started as a Tuesday morning press conference drawing attention to the plight of Floridians struggling to pay rising electric bills quickly escalated to calls for legislative reforms of utility commissions across the nation. “When we have a public service commission that does not look out for the best interest of the customers but rather the utility company itself, there is a problem,” Zayne Smith, senior director of advocacy at AARP Florida, said. “This is a canary in the coal mine if the current ask is granted.” Advocates on the call argued that FPL’s request to increase base rates by 2.5% would harm Florida residents who are already struggling to pay their bills amid other rising costs. Hearings in the case are set to begin this month. Documents obtained during the rate case discovery period suggest that up to a fifth of FPL customers had their power shut off between March 2024 and February 2025 due to unpaid bills, according to Bradley Marshall, an Earthjustice attorney who is representing Florida Rising, the League of United Latin American Citizens and the Environmental Confederation of Southwest Florida in the upcoming rate case proceedings. While FPL argues that the increase would maintain base rates

Read More »

Crude Steadies After Volatile Session

Oil closed unchanged after a choppy session as investors assessed whether a prospective deal by the US and Russia to halt the war in Ukraine would receive international support and materially affect Russian crude flows. West Texas Intermediate swung in a roughly $1.80 range before ending the day flat below $64 a barrel, narrowly breaking a six-session losing streak. The US and Russia are aiming to reach a deal that would lock in Russia’s occupation of territory seized during its invasion, according to people familiar with the matter. Washington is working to get buy-in from Ukraine and its European allies on the agreement, which is far from certain. The US and the European Union have targeted Russia’s oil revenues in response to its invasion of Ukraine, with President Donald Trump just this week doubling levies on all Indian imports to 50% as a penalty for the nation taking Russian crude and threatening similar measures against China. Though investors remain skeptical that Europe would support a deal representing a major victory for Russian President Vladimir Putin, the renewed collaboration between Washington and Moscow has lifted expectations that the nation’s crude will continue to flow freely to its two biggest buyers. Still, the market’s focus has shifted to whether US sanctions on Russia — which have crimped Russia’s ability to sell oil and replenish the Kremlin’s war chest in recent months — will remain in place. “A possible truce would be only modestly bearish crude — assuming there is no lifting of EU and US sanctions against Russian energy — since the market does not currently price in much disruption risk,” said Bob McNally, founder of the Rapidan Energy Group and a former White House official. The proposed deal resembles a ceasefire, not a full-fledged peace agreement, he added. At the same

Read More »

SLB, AIQ Join Forces to Boost ADNOC’s Energy Efficiency with Agentic AI

Schlumberger N.V., the energy tech company doing business as SLB, will team up with AIQ, the Abu Dhabi-based AI specialist for the energy sector. SLB said in a media release that the two companies will collaborate to advance AIQ’s development and deployment of its ENERGYai agentic AI solution across ADNOC’s subsurface operations. Built on 70 years of proprietary data and expertise, ENERGYai integrates large language model (LLM) technology with advanced agentic AI, SLB said. This AI is tailored for specific workflows across ADNOC’s upstream value chain. Initial tests using 15 percent of ADNOC’s data, focusing on two fields, showed a seismic agent that boosted seismic interpretation speed by 10 times and improved accuracy by 70 percent, it said. In partnership, SLB and AIQ will design and deploy new agentic AI workflows across ADNOC’s subsurface operations, including geology, seismic exploration, and reservoir modelling. SLB will provide support with its Lumi data and AI platform, and other digital technologies. A scalable version of ENERGYai is under development, which will include AI agents covering tasks within subsurface operations. Deployment will commence in the fourth quarter of 2025, SLB said. “This partnership reflects our vision to harness AI for energy optimization, and we are enthusiastic that SLB shares this outlook. The collaboration between AIQ and SLB enables the development of sophisticated AI workflows that integrate seamlessly with ADNOC’s infrastructure, driving efficiency, scalability, and innovation at every stage of the energy lifecycle”, Dennis Jol, CEO of AIQ, said. “Our ENERGYai agentic AI solution is pioneering in its sheer scale and impact, and we are proud to involve other significant industry technology players in its development and evolution”. ENERGYai will power agentic AI to automate complex, high-impact tasks, increasing efficiency, enhancing decision-making and optimizing production across ADNOC’s operations, SLB said. 
The partnership between AIQ and SLB demonstrates a mutual

Read More »

Diamondback Energy Narrows Production Guidance as Net Income Dips in Q2

Diamondback Energy, Inc. reported a net income of $699 million for the second quarter of 2025, well below the $837 million reported in the corresponding quarter of 2024. However, the first half net income of $2.1 billion surged past the $1.6 billion reported in H1 2024. The company said in its report that production for the quarter averaged 919,000 barrels of oil equivalent per day (boe/d). Oil production averaged 495,700 barrels per day (mbo/d). Diamondback said it put 108 wells into production in the Midland basin, and a further eight wells into production in the Delaware Basin. During the first half of the year, Diamondback said that 224 operated wells entered production in the Midland Basin with 15 more wells entering production in the Delaware Basin. In the second quarter of 2025, Diamondback said it had invested $707 million in operated drilling and completions, $90 million in capital workovers, non-operated drilling, completions, and science, and $67 million in infrastructure, environmental, and midstream projects, totaling $864 million in cash capital expenditures. For the first half of 2025, the company spent $1.6 billion on operated drilling and completions, $111 million on capital workovers, non-operated drilling, completions, and science, and $124 million on infrastructure, environmental, and midstream activities, amounting to a total of $1.8 billion in cash capital expenditures, it said. Diamondback has also narrowed its full-year oil production guidance to 485 – 492 mbo/d and increased annual boe guidance by 2 percent to 890 – 910 Mboe/d, it said. Furthermore,  Diamondback noted that the guidance does not reflect the pending acquisition by its publicly traded subsidiary, Viper Energy, Inc., of Sitio Royalties Corp., which is expected to close in the third quarter of 2025, subject to stockholder approval and the fulfillment or waiver of other typical closing conditions. To contact the author,

Read More »

Stargate’s slow start reveals the real bottlenecks in scaling AI infrastructure

The CFO emphasized that SoftBank remains committed to its original target of $346 billion (JPY 500 billion) over 4 years for the Stargate project, noting that major sites have been selected in the US and preparations are taking place simultaneously across multiple fronts. Requests for comment to Stargate partners Nvidia, OpenAI, and Oracle remain unanswered. Infrastructure reality check for CIOs These challenges offer important lessons for enterprise IT leaders facing similar AI infrastructure decisions. Sanchit Vir Gogia, chief analyst and CEO at Greyhound Research, said that Goto’s confirmation of delays “reflects a challenge CIOs see repeatedly” in partner onboarding delays, service activation slips, and revised delivery commitments from cloud and datacenter providers. Oishi Mazumder, senior analyst at Everest Group, noted that “SoftBank’s Stargate delays show that AI infrastructure is not constrained by compute or capital, but by land, energy, and stakeholder alignment.” The analyst emphasized that CIOs must treat AI infrastructure “as a cross-functional transformation, not an IT upgrade, demanding long-term, ecosystem-wide planning.” “Scaling AI infrastructure depends less on the technical readiness of servers or GPUs and more on the orchestration of distributed stakeholders — utilities, regulators, construction partners, hardware suppliers, and service providers — each with their own cadence and constraints,” Gogia said.

Read More »

Incentivizing the Digital Future: Inside America’s Race to Attract Data Centers

Across the United States, states are rolling out a wave of new tax incentives aimed squarely at attracting data centers, one of the country’s fastest-growing industries. Once clustered in only a handful of industry-friendly regions, today’s data-center boom is rapidly spreading, pushed along by profound shifts in federal policy, surging demand for artificial intelligence, and the drive toward digital transformation across every sector of the economy. Nowhere is this transformation more visible than in the intensifying state-by-state competition to land massive infrastructure investments, advanced technology jobs, and the alluring prospect of long-term economic growth. The past year alone has seen a record number of states introducing or expanding incentives for data centers, from tax credits to expedited permitting, reflecting a new era of proactive, tech-focused economic development policy. Behind these moves, federal initiatives and funding packages underscore the essential role of digital infrastructure as a national priority, encouraging states to lower barriers for data center construction and operation. As states watch their neighbors reap direct investment and job creation benefits, a real “domino effect” emerges: one state’s success becomes another’s blueprint, heightening the pressure and urgency to compete. Yet, this wave of incentives also exposes deeper questions about the local impact, community costs, and the evolving relationship between public policy and the tech industry. From federal levels to town halls, there are notable shifts in both opportunities and challenges shaping the landscape of digital infrastructure advancement. Industry Drivers: the Federal Push and Growth of AI The past year has witnessed a profound federal policy shift aimed squarely at accelerating U.S. digital infrastructure, especially for data centers in direct response both to the explosive growth of artificial intelligence and to intensifying international competition. 
In July 2025, the administration unveiled “America’s AI Action Plan,” accompanied by multiple executive orders that collectively redefined

Read More »

AI Supercharges Hyperscale: Capacity, Geography, and Design Are Being Redrawn

From Cloud to GenAI, Hyperscalers Cement Role as Backbone of Global Infrastructure Data center capacity is undergoing a major shift toward hyperscale operators, which now control 44 percent of global capacity, according to Synergy Research Group. Non-hyperscale colocations account for another 22 percent of capacity and is expected to continue, but hyperscalers projected to hold 61 percent of the capacity by 2030. That swing also reflects the dominance of hyperscalers geographically. In a separate Synergy study revealing the world’s top 20 hyperscale data center locations, just 20 U.S. state or metro markets account for 62 percent of the world’s hyperscale capacity.  Northern Virginia and the Greater Beijing areas alone make up 20 percent of the total. They’re followed by the U.S. states of Oregon and Iowa, Dublin, the U.S. state of Ohio, Dallas, and then Shanghai. Of the top 20 markets, 14 are in the U.S., five in APAC region, and only one is in Europe. This rapid shift is fueled by the explosive growth of cloud computing, artificial intelligence (AI), and especially generative AI (GenAI)—power-intensive technologies that demand the scale, efficiency, and specialized infrastructure only hyperscalers can deliver. What’s Coming for Capacity The capacity research shows on-premises data centers with 34 percent of the total capacity, a significant drop from the 56 percent capacity they accounted for just six years ago.  Synergy projects that by 2030, hyperscale operators such as Google Cloud, Amazon Web Services, and Microsoft Azure will claim 61 percent of all capacity, while on-premises share will drop to just 22 percent. So, it appears on-premises data centers are both increasing and decreasing. That’s one way to put it, but it’s about perspective. Synergy’s capacity study indicates they’re growing as the volume of enterprise GPU servers increases. The shrinkage refers to share of the market: Hyperscalers are growing

Read More »

In crowded observability market, Gartner calls out AI capabilities, cost optimization, DevOps integration

Support for OpenTelemetry and open standards is another differentiator for Gartner. Vendors that embrace these frameworks are better positioned to offer extensibility, avoid vendor lock-in, and enable broader ecosystem integration. This openness is paired with a growing focus on cost optimization, an increasingly important concern as telemetry data volumes increase. Leaders offer granular data retention controls, tiered storage, and usage-based pricing models to help customers manage costs.

Gartner also highlights the importance of the developer experience and DevOps integration. Observability leaders provide "integration with other operations, service management, and software development technologies, such as IT service management (ITSM), configuration management databases (CMDB), event and incident response management, orchestration and automation, and DevOps tools." On the automation front, observability platforms should support initiating changes to application and infrastructure code to optimize cost, capacity or performance, or to take corrective action to mitigate failures, Gartner says. Leaders must also include application security functionality to identify known vulnerabilities and block attempts to exploit them.

Gartner identifies observability leaders

This year's report highlights eight vendors in the leaders category, all of which have demonstrated strong product capabilities, solid technology execution, and innovative strategic vision. Read on to learn what Gartner thinks makes these eight vendors (listed in alphabetical order) stand out as leaders in observability:

Chronosphere: Strengths include cost optimization capabilities, with a control plane that closely manages the ingestion, storage, and retention of incoming telemetry using granular policy controls. The platform requires no agents and relies largely on open protocols such as OpenTelemetry and Prometheus. Gartner cautions that Chronosphere has not emphasized AI capabilities in its observability platform and currently offers digital experience monitoring via partnerships.

Datadog: Strengths include extensive capabilities for managing service-level objectives across data types and providing deep visibility into system and application behavior without the need for instrumentation. Gartner notes the vendor's licensing
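The granular, policy-driven retention controls described above can be illustrated with a short sketch. All class and field names here are hypothetical stand-ins, not any vendor's actual API; the idea is simply that a rule matches a telemetry namespace and assigns it a retention period, with the most specific rule winning:

```python
from dataclasses import dataclass

# Hypothetical sketch of granular, policy-driven telemetry retention:
# each rule matches a metric-name prefix and assigns a retention period.

@dataclass
class RetentionRule:
    prefix: str  # metric-name prefix the rule applies to
    tier: str    # storage tier, e.g. "hot", "warm", "cold"
    days: int    # how long to keep matching telemetry

def resolve_retention(metric_name: str, rules: list[RetentionRule],
                      default_days: int = 7) -> int:
    """Return retention days for a metric; longest matching prefix wins."""
    matches = [r for r in rules if metric_name.startswith(r.prefix)]
    if not matches:
        return default_days
    return max(matches, key=lambda r: len(r.prefix)).days

rules = [
    RetentionRule("checkout.", "hot", 30),        # business-critical metrics
    RetentionRule("checkout.debug.", "cold", 3),  # noisy debug series
]

print(resolve_retention("checkout.latency_p99", rules))  # 30
print(resolve_retention("checkout.debug.cache", rules))  # 3 (more specific rule)
print(resolve_retention("batch.job_runtime", rules))     # 7 (default)
```

The longest-prefix-wins rule is what makes the policy "granular": teams can keep a broad namespace hot while aging out its noisy sub-namespaces aggressively.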

Read More »

LiquidStack CEO Joe Capes on GigaModular, Direct-to-Chip Cooling, and AI’s Thermal Future

In this episode of the Data Center Frontier Show, Editor-in-Chief Matt Vincent speaks with LiquidStack CEO Joe Capes about the company's breakthrough GigaModular platform, the industry's first scalable, modular Coolant Distribution Unit (CDU) purpose-built for direct-to-chip liquid cooling. With rack densities accelerating beyond 120 kW and headed toward 600 kW, LiquidStack is targeting the real-world requirements of AI data centers while streamlining complexity and future-proofing thermal design. "AI will keep pushing thermal output to new extremes," Capes tells DCF. "Data centers need cooling systems that can be easily deployed, managed, and scaled to match heat rejection demands as they rise." LiquidStack's new GigaModular CDU, unveiled at the 2025 Datacloud Global Congress in Cannes, delivers up to 10 MW of scalable cooling capacity. It's designed to support single-phase direct-to-chip liquid cooling, a shift from the company's earlier two-phase immersion roots, via a skidded modular design with a pay-as-you-grow approach. The platform's flexibility enables deployments at N, N+1, or N+2 resiliency. "We designed it to be the only CDU our customers will ever need," Capes says.

From Immersion to Direct-to-Chip

LiquidStack first built its reputation on two-phase immersion cooling, which Capes describes as "the highest performing, most sustainable cooling technology on Earth." But with the launch of GigaModular, the company is now expanding into high-density, direct-to-chip cooling, helping hyperscale and colocation providers upgrade their thermal strategies without overhauling entire facilities. "What we're trying to do with GigaModular is simplify the deployment of liquid cooling at scale, especially for direct-to-chip," Capes explains. "It's not just about immersion anymore. The flexibility to support future AI workloads and grow from 2.5 MW to 10 MW of capacity in a modular way is absolutely critical." GigaModular's components, including IE5 pump modules, dual BPHx heat exchangers, and intelligent control systems,
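The capacity and rack-density figures quoted above can be sanity-checked with quick back-of-the-envelope arithmetic. This is illustrative only, not LiquidStack sizing guidance; real CDU sizing depends on approach temperatures, flow rates and redundancy:

```python
# Rough rack counts a CDU of a given capacity can serve at the rack
# densities quoted in the interview. Illustrative arithmetic only.

def racks_supported(cdu_kw: float, rack_kw: float) -> int:
    """Whole racks a CDU of cdu_kw capacity can cool at rack_kw per rack."""
    return int(cdu_kw // rack_kw)

print(racks_supported(10_000, 120))  # 83 racks at today's 120 kW densities
print(racks_supported(10_000, 600))  # 16 racks at a projected 600 kW
print(racks_supported(2_500, 120))   # 20 racks for the 2.5 MW entry config
```

The drop from roughly 83 racks to 16 at the same 10 MW capacity shows why rising per-rack density, not floor space, is becoming the binding constraint in AI halls.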

Read More »

Oracle’s Global AI Infrastructure Strategy Takes Shape with Bloom Energy and Digital Realty

Bloom Energy: A Leading Force in On-Site Power

As of mid-2025, Bloom Energy has deployed over 400 MW of capacity at data centers worldwide, working with partners including Equinix, American Electric Power (AEP), and Quanta Computing. In total, Bloom has delivered more than 1.5 GW of power across 1,200+ global installations, a tripling of its customer base in recent years. Several key partnerships have driven this rapid adoption. A decade-long collaboration with Equinix, for instance, began with a 1 MW pilot in 2015 and has since expanded to more than 100 MW deployed across 19 IBX data centers in six U.S. states, providing supplemental power at scale. Even public utilities are leaning in: in late 2024, AEP signed a deal to procure up to 1 GW of Bloom's solid oxide fuel cell (SOFC) systems for fast-track deployments aimed at large data centers and commercial users facing grid connection delays. More recently, on July 24, 2025, Bloom and Oracle Cloud Infrastructure (OCI) announced a strategic partnership to deploy SOFC systems at select U.S. Oracle data centers. The deployments are designed to support OCI's gigawatt-scale AI infrastructure, delivering clean, uninterrupted electricity for high-density compute workloads. Bloom has committed to providing sufficient on-site power to fully support an entire data center within 90 days of contract signing. With scalable, modular, and low-emissions energy solutions, Bloom Energy has emerged as a key enabler of next-generation data center growth. Through its strategic partnerships with Oracle, Equinix, and AEP, and backed by a rapidly expanding global footprint, Bloom is well-positioned to meet the escalating demand for multi-gigawatt on-site generation as the AI era accelerates.

Oracle and Digital Realty: Accelerating the AI Stack

Oracle, which continues to trail hyperscale cloud providers like Google, AWS, and Microsoft in overall market share, is clearly betting big on AI to drive its next phase of infrastructure growth.

Read More »

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one ramping up its investments in AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs). In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple to devote a combined $200 billion to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are far higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.
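The spending figures above are easier to compare as a growth multiple. A trivial illustrative calculation using only the numbers quoted in the article:

```python
# Capex figures quoted in the article, in billions of USD.
capex = {
    "Microsoft FY2020": 17.6,
    "Microsoft FY2025 (company figure)": 80.0,
    "Microsoft CY2025 (Bloomberg estimate)": 62.4,
}

# Growth multiple from 2020 to the company's stated fiscal-2025 figure.
multiple = capex["Microsoft FY2025 (company figure)"] / capex["Microsoft FY2020"]
print(f"{multiple:.1f}x")  # 4.5x — a roughly four-and-a-half-fold rise in five years
```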

Read More »

John Deere unveils more autonomous farm machines to address skilled labor shortage

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More

Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. The Moline, Illinois-based company has been in business for 187 years, yet this non-tech stalwart has become a regular at the big tech trade show in Las Vegas and is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech. The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will arrive this fall and beyond. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually, and the agricultural workforce continues to shrink. (This is my hint to the anti-immigration crowd.)

John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app.

While each of these industries experiences its own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

Read More »

2025 playbook for enterprise AI success, from agents to evals

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More

2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year.

1. Agents: the next generation of automation

AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for businesses and recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to
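The LLM-as-a-judge pattern mentioned above, where several cheaper models vote on whether an output is acceptable, can be sketched minimally as follows. The judge functions here are trivial stand-ins; in practice each would be an API call to a different model:

```python
from collections import Counter
from typing import Callable

# Minimal sketch of the LLM-as-a-judge pattern: several judge models
# evaluate a candidate answer and a majority vote decides acceptance.

Judge = Callable[[str, str], bool]  # (question, answer) -> pass/fail

def majority_accepts(question: str, answer: str, judges: list[Judge]) -> bool:
    """Accept the answer if more than half of the judges approve it."""
    votes = Counter(judge(question, answer) for judge in judges)
    return votes[True] > len(judges) / 2

# Stand-in judges with toy heuristics, purely for illustration.
judges = [
    lambda q, a: len(a) > 0,          # rejects empty answers
    lambda q, a: "http://" not in a,  # rejects suspicious plain-HTTP URLs
    lambda q, a: len(a) < 500,        # rejects rambling answers
]

print(majority_accepts("What is 2+2?", "4", judges))                   # True
print(majority_accepts("Cite a source", "http://spam " * 50, judges))  # False
```

Using an odd number of judges avoids ties, and because the judges run independently, cheaper models can be swapped in as pricing falls, which is exactly the cost dynamic the article points to.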

Read More »

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More

OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more. The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model, because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks.

Going all-in on red teaming pays practical, competitive dividends

It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S. National Institute of Standards and Technology (NIST), all of which had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see whether knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find. What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle

Read More »