Overcome Failing Document Ingestion & RAG Strategies with Agentic Knowledge Distillation




Introduction

Many generative AI use cases still revolve around Retrieval Augmented Generation (RAG), yet consistently fall short of user expectations. Despite the growing body of research on RAG improvements and even adding Agents into the process, many solutions still fail to return exhaustive results, miss information that is critical but infrequently mentioned in the documents, require multiple search iterations, and generally struggle to reconcile key themes across multiple documents. To top it all off, many implementations still rely on cramming as much “relevant” information as possible into the model’s context window alongside detailed system and user prompts. Reconciling all this information often exceeds the model’s cognitive capacity and compromises response quality and consistency.

This is where our Agentic Knowledge Distillation + Pyramid Search Approach comes into play. Instead of chasing the best chunking strategy, retrieval algorithm, or inference-time reasoning method, my team (Jim Brown, Mason Sawtell, Sandi Besen, and I) takes an agentic approach to document ingestion.

We leverage the full capability of the model at ingestion time to focus exclusively on distilling and preserving the most meaningful information from the document dataset. This fundamentally simplifies the RAG process by allowing the model to direct its reasoning abilities toward addressing the user/system instructions rather than struggling to understand formatting and disparate information across document chunks.

We specifically target high-value questions that are often difficult to evaluate because they have multiple correct answers or solution paths. These are the cases where traditional RAG solutions struggle most, and existing RAG evaluation datasets are largely insufficient for testing this problem space. For our research implementation, we downloaded annual and quarterly reports from the last year for the 30 companies in the Dow Jones Industrial Average. These documents can be found through the SEC EDGAR website. The information on EDGAR can be downloaded for free or queried through EDGAR public searches. See the SEC privacy policy for additional details: information on the SEC website is “considered public information and may be copied or further distributed by users of the web site without the SEC’s permission”. We selected this dataset for two key reasons: first, it falls outside the knowledge cutoff for the models evaluated, ensuring that the models cannot respond to questions based on their knowledge from pre-training; second, it’s a close approximation for real-world business problems while allowing us to discuss and share our findings using publicly available data.

While typical RAG solutions excel at factual retrieval where the answer is easily identified in the document dataset (e.g., “When did Apple’s annual shareholder’s meeting occur?”), they struggle with nuanced questions that require a deeper understanding of concepts across documents (e.g., “Which of the DOW companies has the most promising AI strategy?”). Our Agentic Knowledge Distillation + Pyramid Search Approach addresses these types of questions with much greater success than the other standard approaches we tested, and it overcomes limitations associated with using knowledge graphs in RAG systems.

In this article, we’ll cover how our knowledge distillation process works, key benefits of this approach, examples, and an open discussion on the best way to evaluate these types of systems where, in many cases, there is no singular “right” answer.

Building the pyramid: How Agentic Knowledge Distillation works

AI-generated image by author and team depicting the pyramid structure for document ingestion, with labelled sections. The robots represent agents building the pyramid.

Overview

Our knowledge distillation process creates a multi-tiered pyramid of information from the raw source documents. Our approach is inspired by the pyramids used in deep learning-based computer vision tasks, which allow a model to analyze an image at multiple scales. We take the contents of the raw document, convert it to Markdown, and distill the content into a list of atomic insights, related concepts, document abstracts, and general recollections/memories. During retrieval, it’s possible to access any or all levels of the pyramid to respond to the user request.
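To make the tiers concrete, here is a minimal sketch of how the pyramid could be represented in memory. The class names, fields, and the `at_levels` helper are our own illustrative assumptions, not the schema used in the implementation described in this article.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class Level(IntEnum):
    """Pyramid tiers, ordered from raw text (bottom) to cross-document memory (top)."""
    PAGE = 0
    INSIGHT = 1
    CONCEPT = 2
    ABSTRACT = 3
    RECOLLECTION = 4


@dataclass
class PyramidItem:
    # Hypothetical record layout: one row per stored piece of text.
    level: Level
    text: str
    source_doc: str = ""   # empty for recollections that span documents
    page: int = -1         # -1 when the item is not tied to a single page


@dataclass
class Pyramid:
    items: List[PyramidItem] = field(default_factory=list)

    def add(self, item: PyramidItem) -> None:
        self.items.append(item)

    def at_levels(self, *levels: Level) -> List[PyramidItem]:
        """Return items from the requested tiers; with no arguments, return all tiers."""
        wanted = set(levels) or set(Level)
        return [item for item in self.items if item.level in wanted]
```

Calling `at_levels(Level.CONCEPT)` would restrict retrieval to one tier, while calling `at_levels()` with no arguments exposes every tier, which mirrors the "any or all levels" behaviour described above.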

How to distill documents and build the pyramid:

Convert documents to Markdown: Convert all raw source documents to Markdown. We’ve found that models process Markdown better than other formats such as JSON for this task, and that it is more token efficient. We used Azure Document Intelligence to generate the Markdown for each page of the document, but there are also open-source libraries such as MarkItDown that do the same thing. Our dataset included 331 documents and 16,601 pages.
Extract atomic insights from each page: We process documents using a two-page sliding window, which allows each page to be analyzed twice. This gives the agent the opportunity to correct any mistakes it made when it first processed the page. We instruct the model to create a numbered list of insights that grows as it processes the pages in the document. Because the agent sees each page twice, it can overwrite insights from the previous page if they were incorrect. We instruct the model to extract insights in simple sentences following the subject-verb-object (SVO) format and to write sentences as if English is the second language of the user. This significantly improves performance by encouraging clarity and precision. Rolling over each page multiple times and using the SVO format also solves the disambiguation problem, which is a huge challenge for knowledge graphs. The insight generation step is also particularly helpful for extracting information from tables, since the model captures the facts from the table in clear, succinct sentences. Our dataset produced 216,931 total insights, about 13 insights per page and 655 insights per document.
Distilling concepts from insights: From the detailed list of insights, we identify higher-level concepts that connect related information about the document. This step significantly reduces noise and redundant information in the document while preserving essential information and themes. Our dataset produced 14,824 total concepts, about 1 concept per page and 45 concepts per document.
Creating abstracts from concepts: Given the insights and concepts in the document, the LLM writes an abstract that appears both better than any abstract a human would write and more information-dense than any abstract present in the original document. The LLM-generated abstract provides comprehensive knowledge about the document in a small number of tokens. We produce one abstract per document, 331 total.
Storing recollections/memories across documents: At the top of the pyramid we store critical information that is useful across all tasks. This can be information that the user shares about the task or information the agent learns about the dataset over time by researching and responding to tasks. For example, we can store the current 30 companies in the DOW as a recollection since this list is different from the 30 companies in the DOW at the time of the model’s knowledge cutoff. As we conduct more and more research tasks, we can continuously improve our recollections and maintain an audit trail of which documents these recollections originated from. For example, we can keep track of AI strategies across companies, where companies are making major investments, etc. These high-level connections are super important since they reveal relationships and information that are not apparent in a single page or document.
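The two-page sliding window from the insight extraction step can be sketched as follows. This is an illustration under stated assumptions rather than our production code: the `Extractor` callable is a hypothetical stand-in for the model call, and in this sketch the first and last pages fall into only one window unless the caller pads the page list.

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical signature for the model call: it receives the insights gathered
# so far plus the text of the current window, and returns the revised list.
Extractor = Callable[[List[str], str], List[str]]


def two_page_windows(pages: List[str]) -> Iterator[Tuple[int, str]]:
    """Yield (index of the first page, joined text) for each two-page window.

    With a stride of one, every interior page appears in two windows.
    """
    if len(pages) == 1:
        yield 0, pages[0]
        return
    for i in range(len(pages) - 1):
        yield i, pages[i] + "\n\n" + pages[i + 1]


def distill_insights(pages: List[str], extract: Extractor) -> List[str]:
    """Run the extractor over every window, carrying the insight list forward."""
    insights: List[str] = []
    for _, window_text in two_page_windows(pages):
        # The extractor may append new insights or rewrite earlier ones,
        # which is how a second look at a page can correct a first pass.
        insights = extract(insights, window_text)
    return insights
```

Passing the running list back into every call is the design choice that lets the numbered list grow across the document while still allowing earlier entries to be overwritten.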

Sample subset of insights extracted from IBM 10Q, Q3 2024 (page 4)

We store the text and embeddings for each layer of the pyramid (pages and up) in Azure PostgreSQL. We originally used Azure AI Search, but switched to PostgreSQL for cost reasons. This required us to write our own hybrid search function since PostgreSQL doesn’t yet natively support this feature. This implementation would work with any vector database or vector index of your choosing. The key requirement is to store and efficiently retrieve both text and vector embeddings at any level of the pyramid.
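We don't reproduce our hybrid search function here. As an illustration of the general idea, the sketch below merges a keyword ranking and a vector ranking with reciprocal rank fusion (RRF), a commonly used fusion method. RRF is a technique we are swapping in for the example, and the function name and the default constant `k = 60` are assumptions, not details of our implementation.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per ID."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for stable output.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Illustrative IDs only: one ranking from full-text search, one from vector search.
keyword_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))  # ['b', 'a', 'd', 'c']
```

Because the fusion works on ranks rather than raw scores, the keyword and vector queries can run as two separate database queries whose scores are not on a comparable scale.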

This approach essentially creates the essence of a knowledge graph, but stores information in natural language, the way an LLM natively wants to interact with it, and is more efficient on token retrieval. We also let the LLM pick the terms used to categorize each level of the pyramid. This seemed to let the model decide for itself the best way to describe and differentiate between the information stored at each level. For example, the LLM preferred “insights” to “facts” as the label for the first level of distilled knowledge. Our goal in doing this was to better understand how an LLM thinks about the process by letting it decide how to store and group related information.

Using the pyramid: How it works with RAG & Agents

At inference time, both traditional RAG and agentic approaches benefit from the pre-processed, distilled information ingested in our knowledge pyramid. The pyramid structure allows for efficient retrieval both in the traditional RAG case, where only the top X related pieces of information are retrieved, and in the Agentic case, where the Agent iteratively plans, retrieves, and evaluates information before returning a final response.

The benefit of the pyramid approach is that information at any and all levels of the pyramid can be used during inference. For our implementation, we used PydanticAI to create a search agent that takes in the user request, generates search terms, explores ideas related to the request, and keeps track of information relevant to the request. Once the search agent determines there’s sufficient information to address the user request, the results are re-ranked and sent back to the LLM to generate a final reply. Our implementation allows a search agent to traverse the information in the pyramid as it gathers details about a concept/search term. This is similar to walking a knowledge graph, but in a way that’s more natural for the LLM since all the information in the pyramid is stored in natural language.

Depending on the use case, the Agent could access information at all levels of the pyramid or only at specific levels (e.g. only retrieve information from the concepts). For our experiments, we did not retrieve raw page-level data since we wanted to focus on token efficiency and found the LLM-generated information for the insights, concepts, abstracts, and recollections was sufficient for completing our tasks. In theory, the Agent could also have access to the page data; this would provide additional opportunities for the agent to re-examine the original document text; however, it would also significantly increase the total tokens used.
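The plan, retrieve, and evaluate loop can be sketched in plain Python. This does not use the PydanticAI API, and it omits the re-ranking and final generation steps. The three callables, the level names, and `max_rounds` are hypothetical stand-ins for the LLM and database calls.

```python
from typing import Callable, List, Sequence, Set

# All three callables are hypothetical stand-ins for LLM or database calls.
SearchFn = Callable[[str, Sequence[str]], List[str]]   # (term, levels) -> texts
ProposeFn = Callable[[str, List[str]], List[str]]      # (request, notes) -> search terms
EnoughFn = Callable[[str, List[str]], bool]            # (request, notes) -> stop?


def research(
    request: str,
    search: SearchFn,
    propose_terms: ProposeFn,
    is_sufficient: EnoughFn,
    levels: Sequence[str] = ("insight", "concept", "abstract", "recollection"),
    max_rounds: int = 5,
) -> List[str]:
    """Collect notes until the agent judges them sufficient or runs out of ideas."""
    notes: List[str] = []
    seen_terms: Set[str] = set()
    for _ in range(max_rounds):
        # Plan: ask for search terms, skipping any that were already tried.
        fresh: List[str] = []
        for term in propose_terms(request, notes):
            if term not in seen_terms:
                seen_terms.add(term)
                fresh.append(term)
        if not fresh:
            break
        # Retrieve: query only the allowed pyramid levels and keep new texts.
        for term in fresh:
            for text in search(term, levels):
                if text not in notes:
                    notes.append(text)
        # Evaluate: stop as soon as the notes are judged sufficient.
        if is_sufficient(request, notes):
            break
    return notes
```

Leaving the page level out of the default `levels` matches the choice described above of not retrieving raw page data. Passing a single level, such as `("concept",)`, restricts the agent to that tier.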

Here is a high-level visualization of our Agentic approach to responding to user requests:

Image created by author and team providing an overview of the agentic research & response process

Results from the pyramid: Real-world examples

To evaluate the effectiveness of our approach, we tested it against a variety of question categories, including typical fact-finding questions and complex cross-document research and analysis tasks.

Fact-finding (spear fishing):

These tasks require identifying specific information or facts that are buried in a document. These are the types of questions typical RAG solutions target, but answering them correctly often requires many searches and consumes lots of tokens.

Example task: “What was IBM’s total revenue in the latest financial reporting?”

Example response using pyramid approach: “IBM’s total revenue for the third quarter of 2024 was $14.968 billion [ibm-10q-q3-2024.pdf, pg. 4]”

Total tokens used to research and generate response (screenshot)

This result is correct (human-validated) and was generated using only 9,994 total tokens, with 1,240 tokens in the generated final response.

Complex research and analysis:

These tasks involve researching and understanding multiple concepts to gain a broader understanding of the documents and make inferences and informed assumptions based on the gathered facts.

Example task: “Analyze the investments Microsoft and NVIDIA are making in AI and how they are positioning themselves in the market. The report should be clearly formatted.”

Example response:

Response generated by the agent analyzing AI investments and positioning for Microsoft and NVIDIA (screenshot)

The result is a comprehensive report that was generated quickly and contains detailed information about each of the companies. 26,802 total tokens were used to research and respond to the request, with a significant percentage of them used for the final response (2,893 tokens or ~11%). These results were also reviewed by a human to verify their validity.

Snippet indicating total token usage for the task (screenshot)

Example task: “Create a report on analyzing the risks disclosed by the various financial companies in the DOW. Indicate which risks are shared and unique.”

Example response:

Part 1 of response generated by the agent on disclosed risks (screenshot)

Part 2 of response generated by the agent on disclosed risks (screenshot)

Similarly, this task was completed in 42.7 seconds and used 31,685 total tokens, with 3,116 tokens used to generate the final report.

Snippet indicating total token usage for the task (screenshot)

These results for both fact-finding and complex analysis tasks demonstrate that the pyramid approach efficiently creates detailed reports with low latency using a minimal number of tokens. The tokens used for the tasks carry dense meaning with little noise, allowing for high-quality, thorough responses across tasks.
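As a quick check on the token figures reported for the three example tasks, the share of each total that went to the final response can be computed directly. The short task labels and the helper name `response_share` are ours for illustration; the token counts are the ones stated above.

```python
# (total tokens, final-response tokens) as reported for each example task.
TASKS = {
    "IBM total revenue (fact-finding)": (9_994, 1_240),
    "Microsoft and NVIDIA AI investments": (26_802, 2_893),
    "DOW financial company risks": (31_685, 3_116),
}


def response_share(total_tokens, final_tokens):
    """Fraction of all tokens that were spent on the final response."""
    return final_tokens / total_tokens


for name, (total, final) in TASKS.items():
    share = response_share(total, final)
    print(f"{name}: {final:,} of {total:,} tokens ({share:.1%}) in the final response")
```

The shares come out to roughly 12%, 11%, and 10%, so in all three cases about nine tenths of the tokens were spent on research rather than on writing the answer.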

Benefits of the pyramid: Why use it?

Overall, we found that our pyramid approach provided a significant boost in response quality and overall performance for high-value questions.

Some of the key benefits we observed include:

Reduced model’s cognitive load: When the agent receives the user task, it retrieves pre-processed, distilled information rather than the raw, inconsistently formatted, disparate document chunks. This fundamentally improves the retrieval process since the model doesn’t waste its cognitive capacity on trying to break down the page/chunk text for the first time.
Superior table processing: By breaking down table information and storing it in concise but descriptive sentences, the pyramid approach makes it easier to retrieve relevant information at inference time through natural language queries. This was particularly important for our dataset since financial reports contain lots of critical information in tables.
Improved response quality to many types of requests: The pyramid enables more comprehensive, context-aware responses to both precise, fact-finding questions and broad, analysis-based tasks that involve many themes across numerous documents.
Preservation of critical context: Since the distillation process identifies and keeps track of key facts, important information that might appear only once in the document is easier to maintain. For example, a document may note only once that all tables are represented in millions of dollars or in a particular currency. Traditional chunking methods often cause this type of information to slip through the cracks.
Optimized token usage, memory, and speed: By distilling information at ingestion time, we significantly reduce the number of tokens required during inference, are able to maximize the value of information put in the context window, and improve memory use.
Scalability: Many solutions struggle to perform as the size of the document dataset grows. This approach provides a much more efficient way to manage a large volume of text by only preserving critical information. This also allows for a more efficient use of the LLM’s context window by only sending it useful, clear information.
Efficient concept exploration: The pyramid enables the agent to explore related information similar to navigating a knowledge graph, but does not require ever generating or maintaining relationships in the graph. The agent can use natural language exclusively and keep track of important facts related to the concepts it’s exploring in a highly token-efficient and fluid way.
Emergent dataset understanding: An unexpected benefit of this approach emerged during our testing. When asking questions like “what can you tell me about this dataset?” or “what types of questions can I ask?”, the system is able to respond and suggest productive search topics because it has a more robust understanding of the dataset context by accessing higher levels in the pyramid like the abstracts and recollections.
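To make the table-processing and context-preservation points concrete, here is a minimal sketch of the output shape we have in mind: each table cell becomes a standalone sentence that carries the unit note with it. In our approach an LLM writes these sentences; the deterministic function `table_to_sentences` below is a hypothetical stand-in, and the company and values are invented.

```python
def table_to_sentences(company, period, unit_note, rows):
    """Turn (label, value) table rows into standalone sentences that carry their units."""
    return [
        f"For {period}, {company} reported {label} of {value} ({unit_note})."
        for label, value in rows
    ]


# Invented example values, not taken from any real filing.
rows = [("total revenue", "1,250"), ("net income", "310")]
sentences = table_to_sentences("Acme", "Q3 2024", "in millions of US dollars", rows)
for sentence in sentences:
    print(sentence)
```

Because the unit note is repeated in every sentence, a single retrieved sentence can be interpreted correctly even when the table header or footnote that defined the units is not retrieved with it.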

Beyond the pyramid: Evaluation challenges & future directions

Challenges

While the results we’ve observed when using the pyramid search approach have been nothing short of amazing, finding ways to establish meaningful metrics to evaluate the entire system both at ingestion time and during information retrieval is challenging. Traditional RAG and Agent evaluation frameworks often fail to address nuanced questions and analytical responses where many different responses are valid.

Our team plans to write a research paper on this approach in the future, and we are open to any thoughts and feedback from the community, especially when it comes to evaluation metrics. Many of the existing datasets we found were focused on evaluating RAG use cases within one document or precise information retrieval across multiple documents rather than robust concept and theme analysis across documents and domains.

The main use cases we are interested in relate to broader questions that are representative of how businesses actually want to interact with GenAI systems. For example, “tell me everything I need to know about customer X” or “how do the behaviors of Customer A and B differ? Which am I more likely to have a successful meeting with?” These types of questions require a deep understanding of information across many sources. The answers to these questions typically require a person to synthesize data from multiple areas of the business and think critically about it. As a result, the answers to these questions are rarely written or saved anywhere, which makes it impossible to simply store and retrieve them through a vector index in a typical RAG process.

Another consideration is that many real-world use cases involve dynamic datasets where documents are consistently being added, edited, and deleted. This makes it difficult to evaluate and track what a “correct” response is since the answer will evolve as the available information changes.

Future directions

In the future, we believe that the pyramid approach can address some of these challenges by enabling more effective processing of dense documents and storing learned information as recollections. However, tracking and evaluating the validity of the recollections over time will be critical to the system’s overall success and remains a key focus area for our ongoing work.

When applying this approach to organizational data, the pyramid process could also be used to identify and assess discrepancies across areas of the business. For example, uploading all of a company’s sales pitch decks could surface where certain products or services are being positioned inconsistently. It could also be used to compare insights extracted from various line of business data to help understand if and where teams have developed conflicting understandings of topics or different priorities. This application goes beyond pure information retrieval use cases and would allow the pyramid to serve as an organizational alignment tool that helps identify divergences in messaging, terminology, and overall communication.

Conclusion: Key takeaways and why the pyramid approach matters

The knowledge distillation pyramid approach is significant because it leverages the full power of the LLM at both ingestion and retrieval time. Our approach allows you to store dense information in fewer tokens, which has the added benefit of reducing noise in the dataset at inference. Our approach also runs very quickly and is incredibly token efficient: we are able to generate responses within seconds, explore potentially hundreds of searches, and on average use (this includes all the search iterations!).

We find that the LLM is much better at writing atomic insights as sentences and that these insights effectively distill information from both text-based and tabular data. This distilled information written in natural language is very easy for the LLM to understand and navigate at inference since it does not have to expend unnecessary energy reasoning about and breaking down document formatting or filtering through noise.

The ability to retrieve and aggregate information at any level of the pyramid also provides significant flexibility to address a variety of query types. This approach offers promising performance for large datasets and enables high-value use cases that require nuanced information retrieval and analysis.

Note: The opinions expressed in this article are solely my own and do not necessarily reflect the views or policies of my employer.

Interested in discussing further or collaborating? Reach out on LinkedIn!

Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy, bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

Ovintiv raises 2026 guidance on productivity gains

Speaking to analysts and investors on July 24 after Ovintiv reported its second-quarter results, McCracken and his team said the efficiency gains stem from a cocktail of innovations around well designs, development patterns, and the usage of proppants and surfactants, among other things. “It starts with the culture, that relentless

Cisco, AMD bring enterprise-level security, visibility to Ryzen AI Halo systems

“Running more AI locally can help improve responsiveness, keep sensitive data closer to users, and reduce dependence on cloud-only approaches, but enterprises also need a way to monitor and manage these systems at scale. AMD and Cisco are addressing that gap by collaborating to pair high-performance local AI compute with

Cloudflare Internal DNS puts public and private DNS on one policy engine

“Instead of operating two separate DNS systems, customers use one API, one audit trail, one dashboard, and one policy engine for every DNS query—whether it is for a public website or an internal application,” Somoza said. Policy first. The resolver sits ahead of every lookup, not behind it. “Architecturally, Cloudflare Gateway

Atos launches sovereign cloud service to power comeback

Atos has launched a new sovereign cloud platform aimed squarely at European public sector bodies, healthcare providers and defense organizations. It’s the latest effort by European companies in their fight back against US dominance. Atos Sovereign Cloud offers a range of controls for data management, providing customers with resilience and

Magnolia expands Giddings position with $4-billion WildFire Energy acquisition

In the filing, Magnolia said WildFire’s second-quarter 2025 production is expected to average 53,000 boe/d, about 70% oil, primarily from the Eagle Ford, Austin Chalk, and Woodbine formations. Magnolia said the acquisition would strengthen its position in the Eagle Ford/Austin Chalk trend by expanding its inventory of high-return drilling locations, adding development flexibility and longer laterals, and leveraging its technical expertise to improve well performance and lower costs. “WildFire has a large, low-decline oily PDP base with historic development centered on the Eagle Ford. While there are significant future Eagle Ford development opportunities, our technical teams see extensive future potential in the Austin Chalk with further upside in the Woodbine as well as other appraisal opportunities that should expand on our success in Giddings since 2018,” said Chris Stavros, Magnolia’s chairman, president, and chief executive officer. The deal is expected to result in a pro forma position in Giddings of more than 1.25 million net acres, add more than 500 miles of gas-gathering pipelines, and offer various cost savings, the company said. “Magnolia is guiding to $100 million in run rate synergies by the end of 2027, with savings coming from the chance to deploy long laterals, shared facilities and infrastructure and additional sand sourcing for operations from WildFire’s in-basin mine. As always, successful execution will be key for the longer-term success of the deal,” Enverus’ Dittmar said. Total consideration consists of $2.65 billion in cash, 32.2 million shares of Magnolia Class A common stock, and the assumption of $600 million of outstanding debt.

Vår Energi inks deal to acquire BlueNord

Vår Energi ASA has agreed to buy BlueNord ASA as part of a proposed merger that, if completed, will expand Vår Energi’s presence beyond the Norwegian Continental Shelf (NCS), positioning the operator as Europe’s largest independent oil and gas producer. Acqusition of BlueNord would add producing assets on the Danish Continental Shelf (DCS) to Vår Energi’s current holdings, with the combined post-merger portfolio anticipated to lift long-term production to about 450,000 boe/d, with about 2.4 billion boe of reserves and resources and an estimated reserve and resource life of about 15 years. BlueNord’s portfolio includes interests in the Tyra, Halfdan, Dan, and Gorm hub areas, which are part of the Danish Underground Consortium operated by TotalEnergies SE. The assets are expected to contribute about 45,000 boe/d of net production beginning in 2026 and include about 195 million boe of net 2P reserves and 2C contingent resources, extending production beyond 2040. “The transaction marks a significant milestone in Vår Energi’s growth journey, creating the largest independent producer of oil and gas in Europe with a long-term production target of [about 450,000 b/d] and reinforcing our role as a reliable and secure supplier of energy to Europe,” said Nick Walker, Vår Energi’s chief executive officer. Vår Energi said the DCS assets complement its existing North Sea operations because of their geological, operational, and fiscal similarities to the NCS. The combination also expands the company’s exposure to European natural gas markets through access to the Nybro and Den Helder gas delivery points. The combined portfolio would maintain a production mix of about 65% oil and 35% natural gas, with operating costs projected to remain at $10-11/boe. The proposed merger remains subject to approval by BlueNord shareholders, regulatory and governmental approvals, license and partner consent, and other customary conditions. If approved, the companies said

Bahrain’s GPIC enlists Fluor for new unit at Sitra complex

Gulf Petrochemical Industries Co. (GPIC) has awarded Fluor Corp. a contract to execute front-end engineering and design (FEED) for a proposed aromatics plant to be built at GPIC’s petrochemicals complex located across 60 hectares of reclaimed land in Sitra, Bahrain. As part of the contract, Fluor will deliver a FEED study based on commercially proven process technologies for the plant’s targeted production of 1.2 million tonnes/year (tpy) of paraxylene and 500,000 tpy of benzene, the service provider said on July 21. Critical building blocks for plastics, polyester fibers, and packaging materials, paraxylene and benzene production from the plant would help meet global demand for high‑performance consumer and industrial products, as well as expand capabilities of GPIC’s current operations at Sitra, Fluor said. GPIC’s existing complex currently uses a feedstock of natural gas domestically produced in Bahrain to produce about 1.2 million tonnes/day of ammonia, 1.2 million tonnes/day of methanol, and 1.7 million tonnes/day of urea. Neither Fluor nor GPIC revealed details regarding a timeline for completion of the proposed aromatics plant. GPIC is a joint venture of Bahrain Petroleum Co. (33.3%), SABIC Agri-Nutrients Investment Co. (33.3%), and Kuwait’s Petrochemical Industries Co. (PIC; 33.3%).

Oil prices surge as Hormuz, Bab el-Mandeb risks escalate amid renewed US–Iran tensions

Oil prices jumped on Wednesday, July 22, with escalating geopolitical tensions and mounting risks to key maritime chokepoints driving the rally. International Brent crude rose nearly 5% to above $95/bbl, its highest level in almost 6 weeks, while US crude climbed more than 4% to above $88/bbl. The gains extend a strong upward trend, with prices up about 30% since the start of the month and more than 55% year to date, reversing declines seen after a mid-June memorandum of understanding (MOU) between the US and Iran. Stay updated on oil price volatility, shipping disruptions, LNG market analysis, and production output through OGJ’s Iran war content hub. The earlier agreement, aimed at de-escalating conflict and reopening the Strait of Hormuz, was declared “over” on July 8 by President Donald Trump. Since then, hostilities have intensified, with US forces carrying out an 11th consecutive night of strikes on Iran. Comments from US Secretary of State Marco Rubio further dampened expectations for near-term diplomacy, noting that while Washington remains open to talks, Iran does not appear to be engaging seriously. At the same time, security risks to global shipping have increased. The UK Maritime Trade Operations (UKMTO) has reported multiple recent attacks on vessels in the region, including incidents that forced crews to abandon ships. As a result, traffic through the Strait of Hormuz has fallen sharply, with just 13 vessels transiting Monday and 9 on Tuesday, according to MarineTraffic data. Concerns are also growing at the Bab el-Mandeb Strait, another critical oil transit route linking the Red Sea to the Gulf of Aden. Iranian-backed Houthi forces in Yemen have threatened a maritime blockade targeting Saudi Arabia, raising fears of broader supply disruptions. While vessel traffic through Bab el-Mandeb remains relatively steady—73 ships transited Tuesday—it has edged lower and signs of hesitation among

Global LNG trade hits record in 2025 as 2026 tests market resilience

Global LNG trade reached a record 437 million tonnes in 2025, up 6.3% year on year (y-o-y) and marking the fastest growth since 2022, according to the International Gas Union’s (IGU) World LNG Report 2026. The increase of roughly 25 million tonnes was driven primarily by rising US supply, alongside higher exports from Qatar, Malaysia, Angola, and Nigeria. Canada and the Mauritania–Senegal project also shipped their first LNG cargoes, expanding the pool of exporting countries. Investment kept pace with market growth. Developers sanctioned 68.4 million tonnes/year (tpy) of new liquefaction capacity in 2025—the highest annual total since 2019—bringing approvals over the 2021–25 period to about 206 million tpy, roughly double the volume sanctioned in the previous 5-year cycle. Much of the new capacity was concentrated in US Gulf Coast projects. The outlook for 2026, however, is more uncertain. The Middle East conflict has knocked Qatar and the UAE—together about 16% of global liquefaction capacity—off the market for periods this year, and missile strikes on Qatar’s Ras Laffan complex are expected to keep roughly 12.8 million tpy of capacity offline for 3-5 years. Shell PLC’s separately published LNG Outlook 2026 is blunter about the near-term picture: Depending on how quickly the Strait of Hormuz reopens, 2026 could see global LNG trade contract year-on-year—something that’s never happened before in the past decade of rapid growth Shell has tracked. The Asia Pacific has absorbed most of the supply shock so far, responding through storage draws, fuel switching, demand curtailment and increased spot buying, while a wave of US cargoes has been rerouted from Europe toward Asia to fill the gap. Despite near-term volatility, both reports highlight a strong long-term trajectory. IGU expects global LNG supply capacity, including existing and under-construction projects, to exceed 700 million tonnes by 2030, a roughly 40% increase from

INA discovers gas in northern Adriatic

Croatia’s Industrija Nafte DD (INA) has discovered more gas as part of a five-well drilling campaign in the northern Adriatic Sea offshore Croatia. The first well in the campaign, Ana-4 DIR in the existing North Adriatic field, has been completed after reaching a total depth of 1,282 m. Drilling operations were carried out by the Labin drilling rig, operated by the crew of INA’s service company CROSCO. Initial testing across three reservoirs delivered a total gas flow rate of about 160,000 cu m/d. Preparations are under way to tie the well into the surface production system to carry out an extended well test aimed at reservoir clean-up, detailed characterization of production potential, and collection of key data to support the preparation of the reserves report. The successful completion of the first well confirmed the remaining gas potential of offshore fields that INA will continue to develop in the coming years, the company said. The Labin rig is now being moved to the location of IKA JZ-6 DIR, the next well in the campaign. Construction investment in the five wells is expected to total about EUR 65 million.

When Buildability Breaks: What Prince William and New York Signal for Data Center Development

For several years, the Prince William Digital Gateway represented data center ambition at its largest scale: a proposed 2,100-acre technology corridor near Gainesville, Virginia, capable of accommodating tens of millions of square feet of digital infrastructure. Its location also made it uniquely contentious. The corridor bordered Manassas National Battlefield Park and other historic, environmental and residential resources, drawing the data center development debate beyond its usual industry and land-use constituencies. Opposition increasingly centered not only on the project’s scale, but on whether development of that magnitude belonged alongside one of the country’s most significant Civil War landscapes. In July 2026, that vision effectively ended. QTS Data Centers terminated its participation in the Digital Gateway and withdrew its remaining petitions before the Supreme Court of Virginia. The decision followed Compass Datacenters’ withdrawal in April, leaving neither of the project’s original developers pursuing the corridor. QTS said it reached the decision after “careful consideration,” while emphasizing that Virginia remains an important market for the company. From Proposed Capacity to Executable Capacity The collapse of the Digital Gateway is more than the cancellation of one unusually large development. It comes as the data center industry confronts a widening gap between announced capacity and executable capacity. Power remains the most visible constraint. But permitting discipline, environmental review, community acceptance and the durability of political support are increasingly determining whether a project can progress from land control and conceptual capacity to construction and operation. A separate development in New York underscored that shift less than two weeks after QTS withdrew. On July 14, Gov. 
Kathy Hochul issued Executive Order 62, establishing what the state describes as the nation’s first statewide moratorium on new hyperscale data centers. The order temporarily holds in abeyance certain incomplete state environmental permit applications for data centers capable of drawing at

Q&A: Google’s AI and computing chief talks about its shapeshifting data centers

Mark Lohmeyer: We’ve seen the rise of agents and agentic use cases. Years ago, it was the chat phase: Ask a question, get an answer. Now we’re in the agentic era, where you express your intent, agents spin off multiple sub-agents, working in parallel, preserving state. This is a radical shift in what infrastructure needs to do; make them fast, cost effective, secure, reliable. We’re delivering infrastructure optimized for the age of agents. NW: What’s the goal of the infrastructure buildout, and what should customers expect regarding costs? ML: Ultimately, it’s about enabling customers with leading-edge capabilities and models at scale cost-effectively. With agents, inference transactions increase by 50x, 100x versus non-agentic workloads. We’re driving the cost per transaction down exponentially. In our latest platforms, we reduce the cost by almost 2x for the same work. Customers serve twice the number of users at the same cost, directly driving profitability.

Google transforms its data center architecture for agent era

Google adjusted the Google Kubernetes Engine into an agent-native environment, where agents could be quickly spun up in sandboxes and containers. “From an infrastructure perspective, you need to spin up a bunch of TPUs or GPUs very rapidly. Then you need to be able to run them and spin them back down,” Lohmeyer said. Google also made drastic improvements to its silicon to support its middleware changes. It recently introduced new AI chips, with the TPU-8t for training, and TPU-8i for inference. The 8t chip has three times more computing power than the previous-generation Ironwood chip. The 8i chip has 384 megabytes of SRAM and 288GB of HBM3e memory, which is 50% more than the previous-generation chip. The platform is optimized for KV cache (key-value cache), which stores important contextual information needed by agents to make decisions, which reduces the round trips to other memory and storage systems. “Being able to store more of the KV cache directly on the chip allows you to respond much more rapidly and cost-effectively,” Lohmeyer said.

10 Reasons You Cannot Afford to Miss DCF Trends Summit 2026

The data center industry has no shortage of AI infrastructure ambition. What it lacks is certainty. Power is harder to secure. Designs are advancing faster than facilities can be built. Supply chains remain vulnerable. Liquid cooling is adding operational demands. Projects that look viable on paper can still stall on permitting, commissioning or community opposition. The question in 2026 is no longer how large the AI opportunity may become. It is what can actually be delivered, and who has learned how to deliver it. That question defines the 2026 Data Center Frontier Trends Summit, August 4–6 at the Hyatt Regency Reston. Across three days, the people building, powering, financing and operating next-generation infrastructure will examine what is working, where execution is failing and how the market is responding. This is not a conference about whether AI will create demand. It is about who will be able to meet it. The advantage will belong to those who join the conversation before its conclusions become market consensus. Here are 10 reasons to be in the room. 1. The industry has entered the execution era For several years, the market has been defined by projected demand, capacity, density and investment. The next phase will be defined by execution. AI data center announcements remain abundant. Energized, commissioned and operational capacity is harder to find. DCFTS begins with a live editorial calibration, followed by “The New Geography of AI,” featuring EdgeCore CEO Lee Kestler, Data Center Frontier founder Rich Miller and DCF Editor in Chief Matt Vincent. The focus: how power, entitled land, utility partnerships and execution speed are determining where AI capacity can be built—and who can deliver it. Demand creates opportunity. Execution determines who captures it. 2. Power will be treated as the foundation of AI strategy Power is no longer one workstream

Time to Power: Sage Geosystems CEO Cindy Taff on Geothermal’s AI Infrastructure Moment

Three years ago, the data center industry’s energy conversation was largely framed around emissions. Hyperscale operators were setting carbon-free energy targets, signing renewable power agreements, and aligning their expanding infrastructure portfolios with corporate sustainability commitments. The arrival of generative AI has not eliminated those priorities. But it has reordered them. “Three years ago, data center energy, they were really focused on low emissions, no emissions,” said Cindy Taff, CEO of Sage Geosystems. “Now the primary challenge is just enough energy.” Speaking on the Data Center Frontier Show podcast, Taff described an energy market being reshaped by the speed and physical scale of AI infrastructure development. After decades of relatively flat U.S. electricity demand, AI has introduced a new class of concentrated, rapidly arriving industrial load. The result is a shift away from thinking only about how much generating capacity exists in aggregate and toward a harder question: Can usable power be delivered at a specific site, on a predictable schedule, in the quantities an AI campus requires? For hyperscalers, neocloud providers, data center developers, utilities, and energy companies, that distinction is becoming central to project execution. “I think time to power is the most precious metric right now versus cost or total capacity,” Taff said. Capacity on Paper Is Not Power at the Site Announcements of new generation can create the appearance of an energy system capable of meeting rising data center demand. But a megawatt located far from a planned campus, trapped behind a transmission constraint, or unavailable until the next decade has limited value to a developer trying to energize an AI facility within several years. “Aggregate capacity is not going to solve the problem if the power really isn’t where and when you need it,” Taff said. Data centers are large physical facilities tied to specific parcels,

Tech Explainer: Data Center Cooling – Air, Evaporative, Liquid, and Hybrid Approaches

Data Center Cooling Glossary The following definitions reflect common terminology used in Department of Energy guidance, ASHRAE TC 9.9 materials, Berkeley Lab resources and Green Grid efficiency metrics. Adiabatic Cooling — A cooling process that uses water evaporation to lower the temperature of air before it reaches a heat exchanger or cooling coil. It can reduce compressor demand but consumes water when evaporative assistance is active. Air-Cooled Data Center — A facility in which heat is removed from IT equipment primarily by moving conditioned air through servers, even if that heat is later transferred to water or refrigerant elsewhere in the cooling system. Air Handler — Equipment that moves, filters and conditions air before delivering it to a data hall or other controlled space. Air-Side Economizer — A system that uses suitable outdoor air, either directly or mixed with return air, to reduce or avoid compressor-based refrigeration. Airflow Management — The practice of delivering conditioned air where it is needed while preventing hot exhaust air from recirculating into server inlets. Approach Temperature — The temperature difference between the two fluids leaving a heat exchanger at their closest thermal point. In a cooling tower, it commonly refers to the difference between leaving-water temperature and entering-air wet-bulb temperature. A smaller approach generally indicates more effective heat transfer. ASHRAE TC 9.9 — The ASHRAE technical committee focused on mission-critical facilities, data centers, technology spaces and electronic equipment. It is a major source of environmental and thermal guidance for data center operators and equipment manufacturers. Blanking Panel — A panel installed in unused rack spaces to prevent hot exhaust air from recirculating to server intakes. British Thermal Unit, or BTU — A unit of heat energy commonly used to express the heating or cooling capacity of equipment. Cabinet — An enclosure, also commonly called

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one that is ramping up its investments into AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs). In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple to devote $200 billion between them to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are way higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

John Deere unveils more autonomous farm machines to address skilled labor shortage

Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. The Moline, Illinois-based John Deere has been in business for 187 years, yet it has become a regular at the big tech trade show in Las Vegas as a non-tech company showing off technology, and it is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech. The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually, and the agricultural workforce continues to shrink. (This is my hint to the anti-immigration crowd.) John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app. While each of these industries experiences its own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

2025 playbook for enterprise AI success, from agents to evals

2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year. 1. Agents: the next generation of automation AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for companies, and that recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more. The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks. Going all-in on red teaming pays practical, competitive dividends It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S.’s National Institute of Standards and Technology (NIST), which all had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find.
What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle