No More Tableau Downtime: Metadata API for Proactive Data Health

Stay Ahead, Stay ONMINE

No More Tableau Downtime: Metadata API for Proactive Data Health

In today’s world, the reliability of data solutions is everything. When we build dashboards and reports, one expects that the numbers reflected there are correct and up-to-date. Based on these numbers, insights are drawn and actions are taken. For any unforeseen reason, if the dashboards are broken or if the numbers are incorrect — then it becomes a fire-fight to fix everything. If the issues are not fixed in time, then it damages the trust placed on the data team and their solutions. But why would dashboards be broken or have wrong numbers? If the dashboard was built correctly the first time, then 99% of the time the issue comes from the data that feeds the dashboards — from the data warehouse. Some possible scenarios are: Few ETL pipelines failed, so the new data is not yet in A table is replaced with another new one Some columns in the table are dropped or renamed Schemas in data warehouse have changed And many more. There is still a chance that the issue is on the Tableau site, but in my experience, most of the times, it is always due to some changes in data warehouse. Even though we know the root cause, it’s not always straightforward to start working on a fix. There is no central place where you can check which Tableau data sources rely on specific tables. If you have the Tableau Data Management add-on, it could help, but from what I know, its hard to find dependencies of custom sql queries used in data sources. Nevertheless, the add-on is too expensive and most companies don’t have it. The real pain begins when you have to go through all the data sources manually to start fixing it. On top of it, you have a string of users on your head impatiently waiting for a quick-fix. The fix itself might not be difficult, it would just be a time-consuming one. What if we could anticipate these issues and identify impacted data sources before anyone notices a problem? Wouldn’t that just be great? Well, there is a way now with the Tableau Metadata API. The Metadata API uses GraphQL, a query language for APIs that returns only the data that you’re interested in. For more info on what’s possible with GraphQL, do check out GraphQL.org. In this blog post, I’ll show you how to connect to the Tableau Metadata API using Python’s Tableau Server Client (TSC) library to proactively identify data sources using specific tables, so that you can act fast before any issues arise. Once you know which Tableau data sources are affected by a specific table, you can make some updates yourself or alert the owners of those data sources about the upcoming changes so they can be prepared for it. Connecting to the Tableau Metadata API Lets connect to the Tableau Server using TSC. We need to import in all the libraries we would need for the exercise! ### Import all required libraries import tableauserverclient as t import pandas as pd import json import ast import re In order to connect to the Metadata API, you will have to first create a personal access token in your Tableau Account settings. Then update the & with the token you just created. Also update with your Tableau site. If the connection is established successfully, then “Connected” will be printed in the output window. ### Connect to Tableau server using personal access token tableau_auth = t.PersonalAccessTokenAuth(“”, “”, site_id=””) server = t.Server(“https://dub01.online.tableau.com/”, use_server_version=True) with server.auth.sign_in(tableau_auth): print(“Connected”) Lets now get a list of all data sources that are published on your site. There are many attributes you can fetch, but for the current use case, lets keep it simple and only get the id, name and owner contact information for every data source. This will be our master list to which we will add in all other information. ############### Get all the list of data sources on your Site all_datasources_query = “”” { publishedDatasources { name id owner { name email } } }””” with server.auth.sign_in(tableau_auth): result = server.metadata.query( all_datasources_query ) Since I want this blog to be focussed on how to proactively identify which data sources are affected by a specific table, I’ll not be going into the nuances of Metadata API. To better understand how the query works, you can refer to a very detailed Tableau’s own Metadata API documentation. One thing to note is that the Metadata API returns data in a JSON format. Depending on what you are querying, you’ll end up with multiple nested json lists and it can get very tricky to convert this into a pandas dataframe. For the above metadata query, you will end up with a result which would like below (this is mock data just to give you an idea of what the output looks like): { “data”: { “publishedDatasources”: [ { “name”: “Sales Performance DataSource”, “id”: “f3b1a2c4-1234-5678-9abc-1234567890ab”, “owner”: { “name”: “Alice Johnson”, “email”: “[email protected]” } }, { “name”: “Customer Orders DataSource”, “id”: “a4d2b3c5-2345-6789-abcd-2345678901bc”, “owner”: { “name”: “Bob Smith”, “email”: “[email protected]” } }, { “name”: “Product Returns and Profitability”, “id”: “c5e3d4f6-3456-789a-bcde-3456789012cd”, “owner”: { “name”: “Alice Johnson”, “email”: “[email protected]” } }, { “name”: “Customer Segmentation Analysis”, “id”: “d6f4e5a7-4567-89ab-cdef-4567890123de”, “owner”: { “name”: “Charlie Lee”, “email”: “[email protected]” } }, { “name”: “Regional Sales Trends (Custom SQL)”, “id”: “e7a5f6b8-5678-9abc-def0-5678901234ef”, “owner”: { “name”: “Bob Smith”, “email”: “[email protected]” } } ] } } We need to convert this JSON response into a dataframe so that its easy to work with. Notice that we need to extract the name and email of the owner from inside the owner object. ### We need to convert the response into dataframe for easy data manipulation col_names = result[‘data’][‘publishedDatasources’][0].keys() master_df = pd.DataFrame(columns=col_names) for i in result[‘data’][‘publishedDatasources’]: tmp_dt = {k:v for k,v in i.items()} master_df = pd.concat([master_df, pd.DataFrame.from_dict(tmp_dt, orient=’index’).T]) # Extract the owner name and email from the owner object master_df[‘owner_name’] = master_df[‘owner’].apply(lambda x: x.get(‘name’) if isinstance(x, dict) else None) master_df[‘owner_email’] = master_df[‘owner’].apply(lambda x: x.get(’email’) if isinstance(x, dict) else None) master_df.reset_index(inplace=True) master_df.drop([‘index’,’owner’], axis=1, inplace=True) print(‘There are ‘, master_df.shape[0] , ‘ datasources in your site’) This is how the structure of master_df would look like: Sample output of code Once we have the main list ready, we can go ahead and start getting the names of the tables embedded in the data sources. If you are an avid Tableau user, you know that there are two ways to selecting tables in a Tableau data source — one is to directly choose the tables and establish a relation between them and the other is to use a custom sql query with one or more tables to achieve a new resultant table. Therefore, we need to address both the cases. Processing of Custom SQL query tables Below is the query to get the list of all custom SQLs used in the site along with their data sources. Notice that I have filtered the list to get only first 500 custom sql queries. In case there are more in your org, you will have to use an offset to get the next set of custom sql queries. There is also an option of using cursor method in Pagination when you want to fetch large list of results (refer here). For the sake of simplicity, I just use the offset method as I know, as there are less than 500 custom sql queries used on the site. # Get the data sources and the table names from all the custom sql queries used on your Site custom_table_query = “”” { customSQLTablesConnection(first: 500){ nodes { id name downstreamDatasources { name } query } } } “”” with server.auth.sign_in(tableau_auth): custom_table_query_result = server.metadata.query( custom_table_query ) Based on our mock data, this is how our output would look like: { “data”: { “customSQLTablesConnection”: { “nodes”: [ { “id”: “csql-1234”, “name”: “RegionalSales_CustomSQL”, “downstreamDatasources”: [ { “name”: “Regional Sales Trends (Custom SQL)” } ], “query”: “SELECT r.region_name, SUM(s.sales_amount) AS total_sales FROM ecommerce.sales_data.Sales s JOIN ecommerce.sales_data.Regions r ON s.region_id = r.region_id GROUP BY r.region_name” }, { “id”: “csql-5678”, “name”: “ProfitabilityAnalysis_CustomSQL”, “downstreamDatasources”: [ { “name”: “Product Returns and Profitability” } ], “query”: “SELECT p.product_category, SUM(s.profit) AS total_profit FROM ecommerce.sales_data.Sales s JOIN ecommerce.sales_data.Products p ON s.product_id = p.product_id GROUP BY p.product_category” }, { “id”: “csql-9101”, “name”: “CustomerSegmentation_CustomSQL”, “downstreamDatasources”: [ { “name”: “Customer Segmentation Analysis” } ], “query”: “SELECT c.customer_id, c.location, COUNT(o.order_id) AS total_orders FROM ecommerce.sales_data.Customers c JOIN ecommerce.sales_data.Orders o ON c.customer_id = o.customer_id GROUP BY c.customer_id, c.location” }, { “id”: “csql-3141”, “name”: “CustomerOrders_CustomSQL”, “downstreamDatasources”: [ { “name”: “Customer Orders DataSource” } ], “query”: “SELECT o.order_id, o.customer_id, o.order_date, o.sales_amount FROM ecommerce.sales_data.Orders o WHERE o.order_status = ‘Completed'” }, { “id”: “csql-3142”, “name”: “CustomerProfiles_CustomSQL”, “downstreamDatasources”: [ { “name”: “Customer Orders DataSource” } ], “query”: “SELECT c.customer_id, c.customer_name, c.segment, c.location FROM ecommerce.sales_data.Customers c WHERE c.active_flag = 1” }, { “id”: “csql-3143”, “name”: “CustomerReturns_CustomSQL”, “downstreamDatasources”: [ { “name”: “Customer Orders DataSource” } ], “query”: “SELECT r.return_id, r.order_id, r.return_reason FROM ecommerce.sales_data.Returns r” } ] } } } Just like before when we were creating the master list of data sources, here also we have nested json for the downstream data sources where we would need to extract only the “name” part of it. In the “query” column, the entire custom sql is dumped. If we use regex pattern, we can easily search for the names of the table used in the query. We know that the table names always come after FROM or a JOIN clause and they generally follow the format … The is optional and most of the times not used. There were some queries I found which used this format and I ended up only getting the database and schema names, and not the complete table name. Once we have extracted the names of the data sources and the names of the tables, we need to merge the rows per data source as there can be multiple custom sql queries used in a single data source. ### Convert the custom sql response into dataframe col_names = custom_table_query_result[‘data’][‘customSQLTablesConnection’][‘nodes’][0].keys() cs_df = pd.DataFrame(columns=col_names) for i in custom_table_query_result[‘data’][‘customSQLTablesConnection’][‘nodes’]: tmp_dt = {k:v for k,v in i.items()} cs_df = pd.concat([cs_df, pd.DataFrame.from_dict(tmp_dt, orient=’index’).T]) # Extract the data source name where the custom sql query was used cs_df[‘data_source’] = cs_df.downstreamDatasources.apply(lambda x: x[0][‘name’] if x and ‘name’ in x[0] else None) cs_df.reset_index(inplace=True) cs_df.drop([‘index’,’downstreamDatasources’], axis=1,inplace=True) ### We need to extract the table names from the sql query. We know the table name comes after FROM or JOIN clause # Note that the name of table can be of the format .. # Depending on the format of how table is called, you will have to modify the regex expression def extract_tables(sql): # Regex to match database.schema.table or schema.table, avoid alias pattern = r'(?:FROM|JOIN)s+((?:[w+]|w+).(?:[w+]|w+)(?:.(?:[w+]|w+))?)b’ matches = re.findall(pattern, sql, re.IGNORECASE) return list(set(matches)) # Unique table names cs_df[‘customSQLTables’] = cs_df[‘query’].apply(extract_tables) cs_df = cs_df[[‘data_source’,’customSQLTables’]] # We need to merge datasources as there can be multiple custom sqls used in the same data source cs_df = cs_df.groupby(‘data_source’, as_index=False).agg({ ‘customSQLTables’: lambda x: list(set(item for sublist in x for item in sublist)) # Flatten & make unique }) print(‘There are ‘, cs_df.shape[0], ‘datasources with custom sqls used in it’) After we perform all the above operations, this is how the structure of cs_df would look like: Sample output of code Processing of regular Tables in Data Sources Now we need to get the list of all the regular tables used in a datasource which are not a part of custom SQL. There are two ways to go about it. Either use the publishedDatasources object and check for upstreamTables or use DatabaseTable and check for upstreamDatasources. I’ll go by the first method because I want the results at a data source level (basically, I want some code ready to reuse when I want to check a specific data source in further detail). Here again, for the sake of simplicity, instead of going for pagination, I’m looping through each datasource to ensure I have everything. We get the upstreamTables inside of the field object so that has to be cleaned out. ############### Get the data sources with the regular table names used in your site ### Its best to extract the tables information for every data source and then merge the results. # Since we only get the table information nested under fields, in case there are hundreds of fields # used in a single data source, we will hit the response limits and will not be able to retrieve all the data. data_source_list = master_df.name.tolist() col_names = [‘name’, ‘id’, ‘extractLastUpdateTime’, ‘fields’] ds_df = pd.DataFrame(columns=col_names) with server.auth.sign_in(tableau_auth): for ds_name in data_source_list: query = “”” { publishedDatasources (filter: { name: “”””+ ds_name + “””” }) { name id extractLastUpdateTime fields { name upstreamTables { name } } } } “”” ds_name_result = server.metadata.query( query ) for i in ds_name_result[‘data’][‘publishedDatasources’]: tmp_dt = {k:v for k,v in i.items() if k != ‘fields’} tmp_dt[‘fields’] = json.dumps(i[‘fields’]) ds_df = pd.concat([ds_df, pd.DataFrame.from_dict(tmp_dt, orient=’index’).T]) ds_df.reset_index(inplace=True) This is how the structure of ds_df would look: Sample output of code We can need to flatten out the fields object and extract the field names as well as the table names. Since the table names will be repeating multiple times, we would have to deduplicate to keep only the unique ones. # Function to extract the values of fields and upstream tables in json lists def extract_values(json_list, key): values = [] for item in json_list: values.append(item[key]) return values ds_df[“fields”] = ds_df[“fields”].apply(ast.literal_eval) ds_df[‘field_names’] = ds_df.apply(lambda x: extract_values(x[‘fields’],’name’), axis=1) ds_df[‘upstreamTables’] = ds_df.apply(lambda x: extract_values(x[‘fields’],’upstreamTables’), axis=1) # Function to extract the unique table names def extract_upstreamTable_values(table_list): values = set()a for inner_list in table_list: for item in inner_list: if ‘name’ in item: values.add(item[‘name’]) return list(values) ds_df[‘upstreamTables’] = ds_df.apply(lambda x: extract_upstreamTable_values(x[‘upstreamTables’]), axis=1) ds_df.drop([“index”,”fields”], axis=1, inplace=True) Once we do the above operations, the final structure of ds_df would look something like this: Sample output of code We have all the pieces and now we just have to merge them together: ###### Join all the data together master_data = pd.merge(master_df, ds_df, how=”left”, on=[“name”,”id”]) master_data = pd.merge(master_data, cs_df, how=”left”, left_on=”name”, right_on=”data_source”) # Save the results to analyse further master_data.to_excel(“Tableau Data Sources with Tables.xlsx”, index=False) This is our final master_data: Sample Output of code Table-level Impact Analysis Let’s say there were some schema changes on the “Sales” table and you want to know which data sources will be impacted. Then you can simply write a small function which checks if a table is present in either of the two columns — upstreamTables or customSQLTables like below. def filter_rows_with_table(df, col1, col2, target_table): “”” Filters rows in df where target_table is part of any value in either col1 or col2 (supports partial match). Returns full rows (all columns retained). “”” return df[ df.apply( lambda row: (isinstance(row[col1], list) and any(target_table in item for item in row[col1])) or (isinstance(row[col2], list) and any(target_table in item for item in row[col2])), axis=1 ) ] # As an example filter_rows_with_table(master_data, ‘upstreamTables’, ‘customSQLTables’, ‘Sales’) Below is the output. You can see that 3 data sources will be impacted by this change. You can also alert the data source owners Alice and Bob in advance about this so they can start working on a fix before something breaks on the Tableau dashboards. Sample output of code You can check out the complete version of the code in my Github repository here. This is just one of the potential use-cases of the Tableau Metadata API. You can also extract the field names used in custom sql queries and add to the dataset to get a field-level impact analysis. One can also monitor the stale data sources with the extractLastUpdateTime to see if those have any issues or need to be archived if they are not used any more. We can also use the dashboards object to fetch information at a dashboard level. Final Thoughts If you have come this far, kudos. This is just one use case of automating Tableau data management. It’s time to reflect on your own work and think which of those other tasks you could automate to make your life easier. I hope this mini-project served as an enjoyable learning experience to understand the power of Tableau Metadata API. If you liked reading this, you might also like another one of my blog posts about Tableau, on some of the challenges I faced when dealing with big . Also do check out my previous blog where I explored building an interactive, database-powered app with Python, Streamlit, and SQLite. Before you go… Follow me so you don’t miss any new posts I write in future; you will find more of my articles on my . You can also connect with me on LinkedIn or Twitter!

In today’s world, the reliability of data solutions is everything. When we build dashboards and reports, one expects that the numbers reflected there are correct and up-to-date. Based on these numbers, insights are drawn and actions are taken. For any unforeseen reason, if the dashboards are broken or if the numbers are incorrect — then it becomes a fire-fight to fix everything. If the issues are not fixed in time, then it damages the trust placed on the data team and their solutions.

But why would dashboards be broken or have wrong numbers? If the dashboard was built correctly the first time, then 99% of the time the issue comes from the data that feeds the dashboards — from the data warehouse. Some possible scenarios are:

Few ETL pipelines failed, so the new data is not yet in
A table is replaced with another new one
Some columns in the table are dropped or renamed
Schemas in data warehouse have changed
And many more.

There is still a chance that the issue is on the Tableau site, but in my experience, most of the times, it is always due to some changes in data warehouse. Even though we know the root cause, it’s not always straightforward to start working on a fix. There is no central place where you can check which Tableau data sources rely on specific tables. If you have the Tableau Data Management add-on, it could help, but from what I know, its hard to find dependencies of custom sql queries used in data sources.

Nevertheless, the add-on is too expensive and most companies don’t have it. The real pain begins when you have to go through all the data sources manually to start fixing it. On top of it, you have a string of users on your head impatiently waiting for a quick-fix. The fix itself might not be difficult, it would just be a time-consuming one.

What if we could anticipate these issues and identify impacted data sources before anyone notices a problem? Wouldn’t that just be great? Well, there is a way now with the Tableau Metadata API. The Metadata API uses GraphQL, a query language for APIs that returns only the data that you’re interested in. For more info on what’s possible with GraphQL, do check out GraphQL.org.

In this blog post, I’ll show you how to connect to the Tableau Metadata API using Python’s Tableau Server Client (TSC) library to proactively identify data sources using specific tables, so that you can act fast before any issues arise. Once you know which Tableau data sources are affected by a specific table, you can make some updates yourself or alert the owners of those data sources about the upcoming changes so they can be prepared for it.

Connecting to the Tableau Metadata API

Lets connect to the Tableau Server using TSC. We need to import in all the libraries we would need for the exercise!

### Import all required libraries
import tableauserverclient as t
import pandas as pd
import json
import ast
import re

In order to connect to the Metadata API, you will have to first create a personal access token in your Tableau Account settings. Then update the & with the token you just created. Also update with your Tableau site. If the connection is established successfully, then “Connected” will be printed in the output window.

### Connect to Tableau server using personal access token
tableau_auth = t.PersonalAccessTokenAuth("", "", 
                                           site_id="")
server = t.Server("https://dub01.online.tableau.com/", use_server_version=True)

with server.auth.sign_in(tableau_auth):
        print("Connected")

Lets now get a list of all data sources that are published on your site. There are many attributes you can fetch, but for the current use case, lets keep it simple and only get the id, name and owner contact information for every data source. This will be our master list to which we will add in all other information.

############### Get all the list of data sources on your Site

all_datasources_query = """ {
  publishedDatasources {
    name
    id
    owner {
    name
    email
    }
  }
}"""
with server.auth.sign_in(tableau_auth):
    result = server.metadata.query(
        all_datasources_query
    )

Since I want this blog to be focussed on how to proactively identify which data sources are affected by a specific table, I’ll not be going into the nuances of Metadata API. To better understand how the query works, you can refer to a very detailed Tableau’s own Metadata API documentation.

One thing to note is that the Metadata API returns data in a JSON format. Depending on what you are querying, you’ll end up with multiple nested json lists and it can get very tricky to convert this into a pandas dataframe. For the above metadata query, you will end up with a result which would like below (this is mock data just to give you an idea of what the output looks like):

{
  "data": {
    "publishedDatasources": [
      {
        "name": "Sales Performance DataSource",
        "id": "f3b1a2c4-1234-5678-9abc-1234567890ab",
        "owner": {
          "name": "Alice Johnson",
          "email": "[email protected]"
        }
      },
      {
        "name": "Customer Orders DataSource",
        "id": "a4d2b3c5-2345-6789-abcd-2345678901bc",
        "owner": {
          "name": "Bob Smith",
          "email": "[email protected]"
        }
      },
      {
        "name": "Product Returns and Profitability",
        "id": "c5e3d4f6-3456-789a-bcde-3456789012cd",
        "owner": {
          "name": "Alice Johnson",
          "email": "[email protected]"
        }
      },
      {
        "name": "Customer Segmentation Analysis",
        "id": "d6f4e5a7-4567-89ab-cdef-4567890123de",
        "owner": {
          "name": "Charlie Lee",
          "email": "[email protected]"
        }
      },
      {
        "name": "Regional Sales Trends (Custom SQL)",
        "id": "e7a5f6b8-5678-9abc-def0-5678901234ef",
        "owner": {
          "name": "Bob Smith",
          "email": "[email protected]"
        }
      }
    ]
  }
}

We need to convert this JSON response into a dataframe so that its easy to work with. Notice that we need to extract the name and email of the owner from inside the owner object.

### We need to convert the response into dataframe for easy data manipulation

col_names = result['data']['publishedDatasources'][0].keys()
master_df = pd.DataFrame(columns=col_names)

for i in result['data']['publishedDatasources']:
    tmp_dt = {k:v for k,v in i.items()}
    master_df = pd.concat([master_df, pd.DataFrame.from_dict(tmp_dt, orient='index').T])

# Extract the owner name and email from the owner object
master_df['owner_name'] = master_df['owner'].apply(lambda x: x.get('name') if isinstance(x, dict) else None)
master_df['owner_email'] = master_df['owner'].apply(lambda x: x.get('email') if isinstance(x, dict) else None)

master_df.reset_index(inplace=True)
master_df.drop(['index','owner'], axis=1, inplace=True)
print('There are ', master_df.shape[0] , ' datasources in your site')

This is how the structure of master_df would look like:

Once we have the main list ready, we can go ahead and start getting the names of the tables embedded in the data sources. If you are an avid Tableau user, you know that there are two ways to selecting tables in a Tableau data source — one is to directly choose the tables and establish a relation between them and the other is to use a custom sql query with one or more tables to achieve a new resultant table. Therefore, we need to address both the cases.

Processing of Custom SQL query tables

Below is the query to get the list of all custom SQLs used in the site along with their data sources. Notice that I have filtered the list to get only first 500 custom sql queries. In case there are more in your org, you will have to use an offset to get the next set of custom sql queries. There is also an option of using cursor method in Pagination when you want to fetch large list of results (refer here). For the sake of simplicity, I just use the offset method as I know, as there are less than 500 custom sql queries used on the site.

# Get the data sources and the table names from all the custom sql queries used on your Site

custom_table_query = """  {
  customSQLTablesConnection(first: 500){
    nodes {
        id
        name
        downstreamDatasources {
        name
        }
        query
    }
  }
}
"""

with server.auth.sign_in(tableau_auth):
    custom_table_query_result = server.metadata.query(
        custom_table_query
    )

Based on our mock data, this is how our output would look like:

{
  "data": {
    "customSQLTablesConnection": {
      "nodes": [
        {
          "id": "csql-1234",
          "name": "RegionalSales_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Regional Sales Trends (Custom SQL)"
            }
          ],
          "query": "SELECT r.region_name, SUM(s.sales_amount) AS total_sales FROM ecommerce.sales_data.Sales s JOIN ecommerce.sales_data.Regions r ON s.region_id = r.region_id GROUP BY r.region_name"
        },
        {
          "id": "csql-5678",
          "name": "ProfitabilityAnalysis_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Product Returns and Profitability"
            }
          ],
          "query": "SELECT p.product_category, SUM(s.profit) AS total_profit FROM ecommerce.sales_data.Sales s JOIN ecommerce.sales_data.Products p ON s.product_id = p.product_id GROUP BY p.product_category"
        },
        {
          "id": "csql-9101",
          "name": "CustomerSegmentation_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Customer Segmentation Analysis"
            }
          ],
          "query": "SELECT c.customer_id, c.location, COUNT(o.order_id) AS total_orders FROM ecommerce.sales_data.Customers c JOIN ecommerce.sales_data.Orders o ON c.customer_id = o.customer_id GROUP BY c.customer_id, c.location"
        },
        {
          "id": "csql-3141",
          "name": "CustomerOrders_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Customer Orders DataSource"
            }
          ],
          "query": "SELECT o.order_id, o.customer_id, o.order_date, o.sales_amount FROM ecommerce.sales_data.Orders o WHERE o.order_status = 'Completed'"
        },
        {
          "id": "csql-3142",
          "name": "CustomerProfiles_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Customer Orders DataSource"
            }
          ],
          "query": "SELECT c.customer_id, c.customer_name, c.segment, c.location FROM ecommerce.sales_data.Customers c WHERE c.active_flag = 1"
        },
        {
          "id": "csql-3143",
          "name": "CustomerReturns_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Customer Orders DataSource"
            }
          ],
          "query": "SELECT r.return_id, r.order_id, r.return_reason FROM ecommerce.sales_data.Returns r"
        }
      ]
    }
  }
}

Just like before when we were creating the master list of data sources, here also we have nested json for the downstream data sources where we would need to extract only the “name” part of it. In the “query” column, the entire custom sql is dumped. If we use regex pattern, we can easily search for the names of the table used in the query.

We know that the table names always come after FROM or a JOIN clause and they generally follow the format ..

. The is optional and most of the times not used. There were some queries I found which used this format and I ended up only getting the database and schema names, and not the complete table name. Once we have extracted the names of the data sources and the names of the tables, we need to merge the rows per data source as there can be multiple custom sql queries used in a single data source.

### Convert the custom sql response into dataframe
col_names = custom_table_query_result['data']['customSQLTablesConnection']['nodes'][0].keys()
cs_df = pd.DataFrame(columns=col_names)

for i in custom_table_query_result['data']['customSQLTablesConnection']['nodes']:
    tmp_dt = {k:v for k,v in i.items()}

    cs_df = pd.concat([cs_df, pd.DataFrame.from_dict(tmp_dt, orient='index').T])

# Extract the data source name where the custom sql query was used
cs_df['data_source'] = cs_df.downstreamDatasources.apply(lambda x: x[0]['name'] if x and 'name' in x[0] else None)
cs_df.reset_index(inplace=True)
cs_df.drop(['index','downstreamDatasources'], axis=1,inplace=True)

### We need to extract the table names from the sql query. We know the table name comes after FROM or JOIN clause
# Note that the name of table can be of the format ..
# Depending on the format of how table is called, you will have to modify the regex expression

def extract_tables(sql):
    # Regex to match database.schema.table or schema.table, avoid alias
    pattern = r'(?:FROM|JOIN)s+((?:[w+]|w+).(?:[w+]|w+)(?:.(?:[w+]|w+))?)b'
    matches = re.findall(pattern, sql, re.IGNORECASE)
    return list(set(matches))  # Unique table names

cs_df['customSQLTables'] = cs_df['query'].apply(extract_tables)
cs_df = cs_df[['data_source','customSQLTables']]

# We need to merge datasources as there can be multiple custom sqls used in the same data source
cs_df = cs_df.groupby('data_source', as_index=False).agg({
    'customSQLTables': lambda x: list(set(item for sublist in x for item in sublist))  # Flatten & make unique
})

print('There are ', cs_df.shape[0], 'datasources with custom sqls used in it')

After we perform all the above operations, this is how the structure of cs_df would look like:

Processing of regular Tables in Data Sources

Now we need to get the list of all the regular tables used in a datasource which are not a part of custom SQL. There are two ways to go about it. Either use the publishedDatasources object and check for upstreamTables or use DatabaseTable and check for upstreamDatasources. I’ll go by the first method because I want the results at a data source level (basically, I want some code ready to reuse when I want to check a specific data source in further detail). Here again, for the sake of simplicity, instead of going for pagination, I’m looping through each datasource to ensure I have everything. We get the upstreamTables inside of the field object so that has to be cleaned out.

############### Get the data sources with the regular table names used in your site

### Its best to extract the tables information for every data source and then merge the results.
# Since we only get the table information nested under fields, in case there are hundreds of fields 
# used in a single data source, we will hit the response limits and will not be able to retrieve all the data.

data_source_list = master_df.name.tolist()

col_names = ['name', 'id', 'extractLastUpdateTime', 'fields']
ds_df = pd.DataFrame(columns=col_names)

with server.auth.sign_in(tableau_auth):
    for ds_name in data_source_list:
        query = """ {
            publishedDatasources (filter: { name: """"+ ds_name + """" }) {
            name
            id
            extractLastUpdateTime
            fields {
                name
                upstreamTables {
                    name
                }
            }
            }
        } """
        ds_name_result = server.metadata.query(
        query
        )
        for i in ds_name_result['data']['publishedDatasources']:
            tmp_dt = {k:v for k,v in i.items() if k != 'fields'}
            tmp_dt['fields'] = json.dumps(i['fields'])
        ds_df = pd.concat([ds_df, pd.DataFrame.from_dict(tmp_dt, orient='index').T])

ds_df.reset_index(inplace=True)

This is how the structure of ds_df would look:

We can need to flatten out the fields object and extract the field names as well as the table names. Since the table names will be repeating multiple times, we would have to deduplicate to keep only the unique ones.

# Function to extract the values of fields and upstream tables in json lists
def extract_values(json_list, key):
    values = []
    for item in json_list:
        values.append(item[key])
    return values

ds_df["fields"] = ds_df["fields"].apply(ast.literal_eval)
ds_df['field_names'] = ds_df.apply(lambda x: extract_values(x['fields'],'name'), axis=1)
ds_df['upstreamTables'] = ds_df.apply(lambda x: extract_values(x['fields'],'upstreamTables'), axis=1)

# Function to extract the unique table names 
def extract_upstreamTable_values(table_list):
    values = set()a
    for inner_list in table_list:
        for item in inner_list:
            if 'name' in item:
                values.add(item['name'])
    return list(values)

ds_df['upstreamTables'] = ds_df.apply(lambda x: extract_upstreamTable_values(x['upstreamTables']), axis=1)
ds_df.drop(["index","fields"], axis=1, inplace=True)

Once we do the above operations, the final structure of ds_df would look something like this:

We have all the pieces and now we just have to merge them together:

###### Join all the data together
master_data = pd.merge(master_df, ds_df, how="left", on=["name","id"])
master_data = pd.merge(master_data, cs_df, how="left", left_on="name", right_on="data_source")

# Save the results to analyse further
master_data.to_excel("Tableau Data Sources with Tables.xlsx", index=False)

This is our final master_data:

Table-level Impact Analysis

Let’s say there were some schema changes on the “Sales” table and you want to know which data sources will be impacted. Then you can simply write a small function which checks if a table is present in either of the two columns — upstreamTables or customSQLTables like below.

def filter_rows_with_table(df, col1, col2, target_table):
    """
    Filters rows in df where target_table is part of any value in either col1 or col2 (supports partial match).
    Returns full rows (all columns retained).
    """
    return df[
        df.apply(
            lambda row: 
                (isinstance(row[col1], list) and any(target_table in item for item in row[col1])) or
                (isinstance(row[col2], list) and any(target_table in item for item in row[col2])),
            axis=1
        )
    ]
# As an example 
filter_rows_with_table(master_data, 'upstreamTables', 'customSQLTables', 'Sales')

Below is the output. You can see that 3 data sources will be impacted by this change. You can also alert the data source owners Alice and Bob in advance about this so they can start working on a fix before something breaks on the Tableau dashboards.

You can check out the complete version of the code in my Github repository here.

This is just one of the potential use-cases of the Tableau Metadata API. You can also extract the field names used in custom sql queries and add to the dataset to get a field-level impact analysis. One can also monitor the stale data sources with the extractLastUpdateTime to see if those have any issues or need to be archived if they are not used any more. We can also use the dashboards object to fetch information at a dashboard level.

Final Thoughts

If you have come this far, kudos. This is just one use case of automating Tableau data management. It’s time to reflect on your own work and think which of those other tasks you could automate to make your life easier. I hope this mini-project served as an enjoyable learning experience to understand the power of Tableau Metadata API. If you liked reading this, you might also like another one of my blog posts about Tableau, on some of the challenges I faced when dealing with big .

Also do check out my previous blog where I explored building an interactive, database-powered app with Python, Streamlit, and SQLite.

Before you go…

Follow me so you don’t miss any new posts I write in future; you will find more of my articles on my . You can also connect with me on LinkedIn or Twitter!

Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy, bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

US lets China buy semiconductor design software again

The reversal marks a dramatic shift from the aggressive stance the Trump administration took in May, when it imposed sweeping restrictions on electronic design automation (EDA) software — the critical tools needed to design advanced semiconductors. A short-lived stoppage The restrictions had targeted what analysts called the “upstream” of chip

Hardcoded root credentials in Cisco Unified CM trigger max-severity alert

The affected products-Cisco Unified CM and Unified CM SME–are core components of enterprise telephony infrastructure, widely deployed across government agencies, financial institutions, and large corporations to manage voice, video, and messaging at scale. A flaw in these systems could allow attackers to compromise an organization’s communications, letting them log in

HPE finalizes Juniper acquisition, forms new AI-centric networking unit

“We have agreed with the DOJ to offer a license, through an auction, to specific aspects of Juniper Mist, which is just the AI operations part,” HPE CEO Antonio Neri explained during the press conference. The distinction is technically significant. Competitors will gain access to Mist’s anomaly detection and predictive failure

Solving multicloud networking challenges to scale businesses with AI

Artificial Intelligence (AI) and its demand for massive computational power have spurred the growth of cloud and edge computing. As AI pilot projects take off and scale, there has been an increasing demand for more flexible and high-performing infrastructure. This has led to the rise of multi-cloud networking as organisations

Oil Slips as US Plans Iran Talks

Oil declined after Axios reported the US plans to restart nuclear talks with Iran, reducing the risk of another flare-up in the Middle East conflict. West Texas Intermediate crude slumped 0.7% to settle at $67 a barrel, while Brent settled below $69 after the news service said US Middle East envoy Steven Witkoff plans to meet with Iranian Foreign Minister Abbas Araghchi in Oslo next week. That followed a statement from Iran’s top diplomat that the country would continue to engage with the UN’s nuclear watchdog. Crude prices have been buffeted by geopolitical events in recent weeks, first surging after the escalation that included direct US strikes in Iran then declining after Tehran’s retaliation was dismissed as largely symbolic. Renewed negotiations over Iran’s nuclear program would further reduce oil’s already-diminished risk premium. Oil’s slump on Thursday also may have been amplified by low liquidity ahead of Friday’s July Fourth holiday in the US. The Middle East developments squelched some earlier strength in prices that was driven by US jobs data showing stronger-than-expected additions in June. Equity markets rose and the dollar gained, making commodities priced in the currency less appealing. The US also took fresh steps to restrict the trade of Iranian oil, including sanctions on companies and a “shadow fleet” of vessels that help Iran export its crude. Oil had rallied on Wednesday against the backdrop of a market flashing pockets of strength. Diesel’s premium to crude in the US earlier hit the biggest in 15 months after stockpiles of the fuel continued to decline. Spreads on the nearest crude contracts are also pointing to tight supplies, with stockpiles at the key storage hub of Cushing, Oklahoma, sliding. The continued outlook for supply dynamics, however, depends on a meeting between the Organization of the Petroleum Exporting Countries and its

Crude’s Drop, Strong Ruble Cut Russian Oil Revenue to 2-Year Low

Russia’s oil revenue in June slumped to a two-year low as global crude prices fell and the ruble strengthened, meaning each barrel brought fewer rubles to the Kremlin. Proceeds from the oil industry shrank by almost 30% to 415.6 billion rubles ($5.27 billion), according to Bloomberg calculations based on Finance Ministry data published Thursday. That’s the lowest since June 2023. Russia’s combined revenue from oil and gas taxes fell by a third compared with a year earlier, to almost 495 billion rubles, the calculations show. That’s the lowest since January 2023. A stronger currency means Russia and its oil producers get fewer rubles for every barrel they pump and sell. That erodes the profitability of the companies and strains the federal budget, which depends on oil and gas taxes for about a third of its revenues. Any substantial decline in the tax take from the industry directly affects the nation’s coffers, which are already burdened by multibillion-dollar spending on the war in Ukraine. A stronger ruble also reduces the incentive to export. At the end of April, the Finance Ministry revised its expectation for this year’s budget deficit, forecasting a shortfall much deeper than previously estimated as US tariff policies and OPEC+ supply hikes caused oil prices to nosedive. To cover deficits, the nation taps into its wealth fund, designed to stabilize the economy. The ministry calculated Russia’s June oil revenue based on a Urals price of $52.08 a barrel in May. That’s the lowest price for the nation’s key export grade since March 2023, data compiled by Bloomberg show. The currency traded at an average 80.4603 rubles per US dollar in May, the strongest in two years, driven by record-high interest rates and expectations of an improvement in relations with Washington. As a result, the country’s oil companies received only 4,190 rubles for

House passes Senate version of megabill, sending it to Trump’s desk

The Republican budget megabill, which makes steep cuts to the Inflation Reduction Act’s clean energy tax credits, now heads to President Donald Trump’s desk after passing both houses of Congress. The House passed the Senate’s version of the bill 218-214 on Thursday, after Republicans debated through the night and House Minority Leader Hakeem Jeffries, D-N.Y., used the leadership prerogative of a “magic minute” to speak in opposition to the bill for a record-breaking 8 hours and 44 minutes. Jeffries at one point referenced a June letter 13 House Republicans sent to the Senate, in which the Republicans stated they were “proud to have worked to ensure that the bill did not include a full repeal of the clean energy tax credits, [but] remain deeply concerned by several provisions” cutting back those credits the House version did include. “Every single one of these signatories voted for a House bill that undermines the clean energy economy, noted that it would hurt their own constituents, voted for the bill anyway, then begged the Senate to make a difference,” Jeffries said. “That is not how we should be legislating in the United States House of Representatives … It limped out of the House by a single vote, so every single signatory on this letter could have stopped the bill.” The bill restricts the ability of projects to qualify for the tech-neutral clean electricity 45Y production tax credit and 48E investment tax credit, shortens the timeline for those credits, and ends the 25D residential solar credit after this year. The 25E, 30D, 30C and 45W electric vehicle credits will terminate after Sept. 30. While clean energy advocates and congressional Democrats maintain that the final version of the bill goes too far in slashing IRA credits, some Republicans wanted to see more significant cuts. Before the House

Airloom Energy to pilot novel wind power tech at Wyoming site

Dive Brief: Airloom Energy has broken ground on a utility-scale pilot of the novel wind power technology that it says offers better energy density and siting flexibility at significantly lower capital cost. The Bill Gates-backed startup says the southeastern Wyoming project will validate its low-slung turbine design, which captures wind energy using 30-foot, vertically oriented airfoils that move around an ovular trackway about 80 feet above the ground. Airloom plans to begin generating power at the site near Rock River later this year, and begin commercial-scale demonstrations in 2027, according to a timeline posted on its website. Dive Insight: The Wyoming Energy Authority awarded Airloom $5 million in November to “design, build and test a 1 MW demonstration device that validates the company’s innovative, low-profile design.” That description matches what Airloom said about the pilot project in its announcement last week, but Airloom CEO Neal Rickner told TechCrunch on June 25 that the pilot would generate about 150 kW of electricity while using the same parts as future megawatt-scale turbines. Airloom did not respond to requests for comment. The key difference between the pilot- and commercial-scale plants, Rickner told TechCrunch, is the physical footprint: The straightaways run about 100 meters on the former and 500 meters on the latter. Despite the relatively large area inside the trackway, Airloom says its technology offers higher energy densities than traditional wind, which the U.S. Department of Agriculture says requires about 10 times less land per megawatt than solar PV. Whereas a traditional wind turbine has a circular swept area, Airloom’s design sweeps a rectangular area, capturing more wind and conserving more energy, it says. Airloom touts the undisturbed land inside the trackway as fit for complementary uses like livestock grazing, crop production or solar arrays. The design has other advantages over traditional turbines,

Trinidad Aims to Boost Gas Output with BP, Woodside Projects

Trinidad and Tobago is working to reverse a major slump in gas production and revive a key export industry in the Caribbean nation to boost its economic growth. Energy Minister Roodal Moonilal sees strong potential in the country’s offshore waters at a time when auditors said more gas output must be added to satisfy demand. During the opening of the Society of Petroleum Engineers 2025 Mature Basin Energy Symposium Tuesday, Moonilal said Woodside Energy’s Calypso offshore project is expected to add 700 million standard cubic feet per day to domestic gas production. “The current and upcoming gas projects will, however, only provide relief in the short to medium term,” he added. Calypso, with reserves of 3.5 trillion cubic feet, is “one of the main projects we would be seeking to accelerate,” Moonilal told Bloomberg in an interview on June 27. Trinidad and Tobago has seen a 33 percent drop in gas output since 2015. The need to ramp up production comes as the country aims to fill a supply gap and deals with a crippling shortage of foreign exchange. Natural gas reserves auditors DeGoyler and MacNaughton estimated in its latest report that 58.25 trillion cubic feet of prospective resources were offshore at the end of 2023. The country’s gas production has been declining, and it produced 3.8 billion cubic feet in 2015, Moonilal said. Trinidad and Tobago produced around 2.5 billion cubic feet in the first quarter of 2025, 46 percent of which is exported as liquefied natural gas, the energy minister told Bloomberg Wednesday. The auditors advised the country to fast track gas projects and exploration efforts to “convert prospective resources into reserves and contingent resources, to meet gas demand,” he added. Several projects from BP’s subsidiary in Trinidad and Tobago are planned with the oil and gas major set to

Mozambique President Urges Total LNG Restart Despite Risks

Mozambican President Daniel Chapo said his government and private companies will have to collectively ensure the necessary security is in place to enable TotalEnergies SE to restart construction of a $20 billion gas project that has stalled due to a militant insurgency — and even then risks will remain. The project in the northern Cabo Delgado province along with others that are at earlier stages of development are seen as crucial to the future of the southern African nation, which ranks among the world’s poorest. The French oil major halted work, evacuated workers and declared force majeure in 2021 following an escalation in attacks by Islamic State-linked militants. “Regarding security, it’s relatively stable compared to this past four years, but continuity of this stability doesn’t depend only on the government of the republic but on all partners in that area,” Chapo said in an interview with Bloomberg in Seville, Spain, on Tuesday. “If we’re waiting for Cabo Delgado to be a heaven, we won’t lift force majeure.” Total’s facility, which will take another four years to complete, will liquefy and export the extensive gas reserves off northeast Mozambique that were discovered 15 years ago. Since then, only one floating plant operated by Eni SpA has come on line. A final investment decision on a third planned venture, Exxon Mobil Corp.’s $27 billion Rovuma LNG project, is expected next year. Mozambique called on nations in the region to help secure Cabo Delgado after its attempts to use mercenaries to halt the violence failed. Rwandan troops, brought in months after the attacks that led to Total’s evacuation, provided an effective response. The European Council gave the Rwanda Defence Force additional funding in November to help fight the insurgency, but the terms of its deployment remain uncertain. “How long will depend a lot on security

CoreWeave achieves a first with Nvidia GB300 NVL72 deployment

The deployment, Kimball said, “brings Dell quality to the commodity space. Wins like this really validate what Dell has been doing in reshaping its portfolio to accommodate the needs of the market — both in the cloud and the enterprise.” Although concerns were voiced last year that Nvidia’s next-generation Blackwell data center processors had significant overheating problems when they were installed in high-capacity server racks, he said that a repeat performance is unlikely. Nvidia, said Kimball “has been very disciplined in its approach with its GPUs and not shipping silicon until it is ready. And Dell almost doubles down on this maniacal quality focus. I don’t mean to sound like I have blind faith, but I’ve watched both companies over the last several years be intentional in delivering product in volume. Especially as the competitive market starts to shape up more strongly, I expect there is an extremely high degree of confidence in quality.” CoreWeave ‘has one purpose’ He said, “like Lambda Labs, Crusoe and others, [CoreWeave] seemingly has one purpose (for now): deliver GPU capacity to the market. While I expect these cloud providers will expand in services, I think for now the type of customer employing services is on the early adopter side of AI. From an enterprise perspective, I have to think that organizations well into their AI journey are the consumers of CoreWeave.” “CoreWeave is also being utilized by a lot of the model providers and tech vendors playing in the AI space,” Kimball pointed out. “For instance, it’s public knowledge that Microsoft, OpenAI, Meta, IBM and others use CoreWeave GPUs for model training and more. It makes sense. These are the customers that truly benefit from the performance lift that we see from generation to generation.”

Oracle to power OpenAI’s AGI ambitions with 4.5GW expansion

“For CIOs, this shift means more competition for AI infrastructure. Over the next 12–24 months, securing capacity for AI workloads will likely get harder, not easier. Though cost is coming down but demand is increasing as well, due to which CIOs must plan earlier and build stronger partnerships to ensure availability,” said Pareekh Jain, CEO at EIIRTrend & Pareekh Consulting. He added that CIOs should expect longer wait times for AI infrastructure. To mitigate this, they should lock in capacity through reserved instances, diversify across regions and cloud providers, and work with vendors to align on long-term demand forecasts. “Enterprises stand to benefit from more efficient and cost-effective AI infrastructure tailored to specialized AI workloads, significantly lower their overall future AI-related investments and expenses. Consequently, CIOs face a critical task: to analyze and predict the diverse AI workloads that will prevail across their organizations, business units, functions, and employee personas in the future. This foresight will be crucial in prioritizing and optimizing AI workloads for either in-house deployment or outsourced infrastructure, ensuring strategic and efficient resource allocation,” said Neil Shah, vice president at Counterpoint Research. Strategic pivot toward AI data centers The OpenAI-Oracle deal comes in stark contrast to developments earlier this year. In April, AWS was reported to be scaling back its plans for leasing new colocation capacity — a move that AWS Vice President for global data centers Kevin Miller described as routine capacity management, not a shift in long-term expansion plans. Still, these announcements raised questions around whether the hyperscale data center boom was beginning to plateau. “This isn’t a slowdown, it’s a strategic pivot. The era of building generic data center capacity is over. The new global imperative is a race for specialized, high-density, AI-ready compute. Hyperscalers are not slowing down; they are reallocating their capital to

Arista Buys VeloCloud to reboot SD-WANs amid AI infrastructure shift

What this doesn’t answer is how Arista Networks plans to add newer, security-oriented Secure Access Service Edge (SASE) capabilities to VeloCloud’s older SD-WAN technology. Post-acquisition, it still has only some of the building blocks necessary to achieve this. Mapping AI However, in 2025 there is always more going on with networking acquisitions than simply adding another brick to the wall, and in this case it’s the way AI is changing data flows across networks. “In the new AI era, the concepts of what comprises a user and a site in a WAN have changed fundamentally. The introduction of agentic AI even changes what might be considered a user,” wrote Arista Networks CEO, Jayshree Ullal, in a blog highlighting AI’s effect on WAN architectures. “In addition to people accessing data on demand, new AI agents will be deployed to access data independently, adapting over time to solve problems and enhance user productivity,” she said. Specifically, WANs needed modernization to cope with the effect AI traffic flows are having on data center traffic. Sanjay Uppal, now VP and general manager of the new VeloCloud Division at Arista Networks, elaborated. “The next step in SD-WAN is to identify, secure and optimize agentic AI traffic across that distributed enterprise, this time from all end points across to branches, campus sites, and the different data center locations, both public and private,” he wrote. “The best way to grab this opportunity was in partnership with a networking systems leader, as customers were increasingly looking for a comprehensive solution from LAN/Campus across the WAN to the data center.”

Data center capacity continues to shift to hyperscalers

However, even though colocation and on-premises data centers will continue to lose share, they will still continue to grow. They just won’t be growing as fast as hyperscalers. So, it creates the illusion of shrinkage when it’s actually just slower growth. In fact, after a sustained period of essentially no growth, on-premises data center capacity is receiving a boost thanks to genAI applications and GPU infrastructure. “While most enterprise workloads are gravitating towards cloud providers or to off-premise colo facilities, a substantial subset are staying on-premise, driving a substantial increase in enterprise GPU servers,” said John Dinsdale, a chief analyst at Synergy Research Group.

Oracle inks $30 billion cloud deal, continuing its strong push into AI infrastructure.

He pointed out that, in addition to its continued growth, OCI has a remaining performance obligation (RPO) — total future revenue expected from contracts not yet reported as revenue — of $138 billion, a 41% increase, year over year. The company is benefiting from the immense demand for cloud computing largely driven by AI models. While traditionally an enterprise resource planning (ERP) company, Oracle launched OCI in 2016 and has been strategically investing in AI and data center infrastructure that can support gigawatts of capacity. Notably, it is a partner in the $500 billion SoftBank-backed Stargate project, along with OpenAI, Arm, Microsoft, and Nvidia, that will build out data center infrastructure in the US. Along with that, the company is reportedly spending about $40 billion on Nvidia chips for a massive new data center in Abilene, Texas, that will serve as Stargate’s first location in the country. Further, the company has signaled its plans to significantly increase its investment in Abu Dhabi to grow out its cloud and AI offerings in the UAE; has partnered with IBM to advance agentic AI; has launched more than 50 genAI use cases with Cohere; and is a key provider for ByteDance, which has said it plans to invest $20 billion in global cloud infrastructure this year, notably in Johor, Malaysia. Ellison’s plan: dominate the cloud world CTO and co-founder Larry Ellison announced in a recent earnings call Oracle’s intent to become No. 1 in cloud databases, cloud applications, and the construction and operation of cloud data centers. He said Oracle is uniquely positioned because it has so much enterprise data stored in its databases. He also highlighted the company’s flexible multi-cloud strategy and said that the latest version of its database, Oracle 23ai, is specifically tailored to the needs of AI workloads. Oracle

Datacenter industry calls for investment after EU issues water consumption warning

CISPE’s response to the European Commission’s report warns that the resulting regulatory uncertainty could hurt the region’s economy. “Imposing new, standalone water regulations could increase costs, create regulatory fragmentation, and deter investment. This risks shifting infrastructure outside the EU, undermining both sustainability and sovereignty goals,” CISPE said in its latest policy recommendation, Advancing water resilience through digital innovation and responsible stewardship. “Such regulatory uncertainty could also reduce Europe’s attractiveness for climate-neutral infrastructure investment at a time when other regions offer clear and stable frameworks for green data growth,” it added. CISPE’s recommendations are a mix of regulatory harmonization, increased investment, and technological improvement. Currently, water reuse regulation is directed towards agriculture. Updated regulation across the bloc would encourage more efficient use of water in industrial settings such as datacenters, the asosciation said. At the same time, countries struggling with limited public sector budgets are not investing enough in water infrastructure. This could only be addressed by tapping new investment by encouraging formal public-private partnerships (PPPs), it suggested: “Such a framework would enable the development of sustainable financing models that harness private sector innovation and capital, while ensuring robust public oversight and accountability.” Nevertheless, better water management would also require real-time data gathered through networks of IoT sensors coupled to AI analytics and prediction systems. To that end, cloud datacenters were less a drain on water resources than part of the answer: “A cloud-based approach would allow water utilities and industrial users to centralize data collection, automate operational processes, and leverage machine learning algorithms for improved decision-making,” argued CISPE.

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one that is ramping up its investments into AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs). In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple would between them devote $200 billion to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are way higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

John Deere unveils more autonomous farm machines to address skill labor shortage

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. The Moline, Illinois-based John Deere has been in business for 187 years, yet it’s been a regular as a non-tech company showing off technology at the big tech trade show in Las Vegas and is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech. The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually; and the agricultural work force continues to shrink. (This is my hint to the anti-immigration crowd). John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app. While each of these industries experiences their own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% percent of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

2025 playbook for enterprise AI success, from agents to evals

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More 2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year. 1. Agents: the next generation of automation AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for companies, and that recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more. The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks. Going all-in on red teaming pays practical, competitive dividends It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S.’s National Institute of Standards and Technology (NIST), which all had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find. What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle

Stay Ahead, Stay ONMINE