SQL Troubles: record count

Showing posts with label record count. Show all posts

01 March 2025

💎🤖💫SQL Reloaded: Copilot Stories (Part VII: Using the sp_MSForEachTable Undocumented Stored Procedures in SQL Server)

Probably few know that SQL Server makes available a set of undocumented stored procedures that can be useful in certain scenarios that don't involve code that will deployed into production. One of them is sp_MSForEachTable on which I wrote several posts until now. The stored procedure executes one or more scripts based on a set of constraints against the subset of available tables. The first surprising fact is that Copilot does know the respective stored procedure and seems to correctly generate code for it!

Disclaimer: The below code was generated by Copilot via prompts used for demonstration purposes!

Please note that the following code is provided as it is and you can use it on your own risk! Don't run the code in a productive environment and make sure that you have a usable backup! Please consider all caution when you use a feature marked as undocumented!

Let's start with a simple script and increase the complexity of the queries as it goes!

Prompt:
Write a query that uses the sp_MSForEachTable stored procedure in SQL Server

Output:

-- query that uses this stored procedure to update a specific column in all tables:

EXEC sp_MSForEachTable 'UPDATE ? SET YourColumn = ''NewValue'' WHERE YourCondition'

/*
? is a placeholder for the table name.
YourColumn is the column you want to update.
NewValue is the new value you want to set.
YourCondition is the condition that must be met for the update to occur.*/

As the description says, "?" acts as placeholder for table's name while the other values need to be provided explicitly. The code can be used to overwrite the value of a column that exists in some of the tables. For example, when the data come over an interface from various system, it may happen that there is a mix of upper and lower case values that designate the same value. This might not be an issue for Dynamics AX but can create issues further on.

Here's the actual code for performing the update:

-- update DataAreaId for DMEMF company
EXEC sp_MSForEachTable 'UPDATE ? SET DataAreaId = Upper(DataAreaId) WHERE DataAreaId = ''demf'''

Executing the code blindly can create serious damage, so always test first the code in a test environment and have a backup available! One needs to be careful if there are fields which have same name(s) but different meaning! It's important to perform a thorough analysis in this direction.

Prompt:
Write a query that uses the sp_MSForEachTable stored procedure to select records from a set of tables.

Output:

-- return all columns
EXEC sp_MSForEachTable 'SELECT * FROM ?'

-- return specific columns based on condition
EXEC sp_MSForEachTable 'SELECT Column1, Column2 FROM ? WHERE YourCondition'

The second query will result in errors if the columns don't exist, though the execution runs further! The errors can be filtered out in Excel by sorting them accordingly.
The query works in general for combinations of columns available in many tables. It's the case of DataAreaId and RecId available in Dynamics 365 for Finance and Operations for a high percentage of the tables. RecId should be available for all the tables.

Prompt:
Write a query that uses the sp_MSForEachTable stored procedure to count all the records from a table

Output:

-- retrieving the number of records for all tables
EXEC sp_MSForEachTable 'SELECT ''?'' AS TableName, COUNT(*) AS RecordCount FROM ?'

Prompt:
Write a query that uses the sp_MSForEachTable stored procedure to select all the records from tables containing the word invent.

Output:

DECLARE @sql NVARCHAR(MAX)

SET @sql = ''
EXEC sp_MSForEachTable 
    @command1 = 'IF ''?'' LIKE ''%invent%'' BEGIN SET @sql = @sql + ''SELECT * FROM '' + ''?'' + ''; '' END',
    @replacechar = '?',
    @command2 = 'EXEC sp_executesql @sql'

PRINT @sql

Here's the working version of the query:

-- retrieve the records for tables having 'invent' intheir name
EXEC sp_MSForEachTable 
    @command1 = 'DECLARE  @sql nvarchar(max) = ''''
		IF ''?'' LIKE ''%invent%'' BEGIN SET @sql = @sql + ''SELECT * FROM '' + ''?'' + ''; '' END;
                    EXEC sp_executesql @sql;'

A few days later, using the same prompt in Edge I got the following answer, simpler and correct:

-- second version run on Edge Copilot
EXEC sp_MSForEachTable 
    @command1 = 'IF ''?'' LIKE ''%invent%'' 
                BEGIN 
                    PRINT ''?'' 
                    EXEC(''SELECT * FROM ?'') 
                END'

Prompt:
Write a query that uses the sp_MSForEachTable stored procedure when the DataAreaId field exists in the table.

Output:

DECLARE @sql NVARCHAR(MAX)

SET @sql = ''
EXEC sp_MSForEachTable 
    @command1 = 'IF EXISTS (SELECT 1 FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = PARSENAME(''?'', 1) AND COLUMN_NAME = ''DataAreaId'') 
                 BEGIN 
                     SET @sql = @sql + ''SELECT * FROM '' + ''?'' + ''; '' 
                 END',
    @replacechar = '?',
    @command2 = 'EXEC sp_executesql @sql'

PRINT @sql

The query uses the metadata available to retrieve the tables in scope.

EXEC sp_MSForEachTable 
    @command1 = ' DECLARE  @sql nvarchar(max) = ''''
	IF EXISTS (SELECT 1 FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = PARSENAME(''?'', 1) AND COLUMN_NAME = ''DataAreaId'') 
        BEGIN 
            SET @sql = @sql + ''SELECT * FROM '' + ''?'' + ''; '' 
			EXEC sp_executesql @sql
        END'

Here's the answer in Edge:

EXEC sp_MSForEachTable 
    @command1 = 'IF EXISTS (SELECT * FROM sys.columns WHERE object_id = OBJECT_ID(''?'' ) AND name = ''DataAreaId'') 
                BEGIN 
                    PRINT ''?'' 
                    EXEC(''SELECT * FROM ?'') 
                END'

Prompt:
Write a query that uses the sp_MSForEachTable stored procedure to select records from a set of tables while it checks whether the fields exist in the table.

Output:

DECLARE @sql NVARCHAR(MAX)

SET @sql = ''
EXEC sp_MSForEachTable 
    @command1 = 'DECLARE @table NVARCHAR(128)
                 SET @table = PARSENAME(''?'', 1)
                 IF EXISTS (SELECT 1 FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = @table AND COLUMN_NAME = ''Field1'')
                 AND EXISTS (SELECT 1 FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = @table AND COLUMN_NAME = ''Field2'')
                 BEGIN
                     SET @sql = @sql + ''SELECT Field1, Field2 FROM '' + ''?'' + ''; ''
                 END',
    @replacechar = '?',
    @command2 = 'EXEC sp_executesql @sql'

PRINT @sql

The query used the metadata available in the INFORMATION_SCHHEMA.COLUMNS. Unfortunately, the code doesn't work as expected. The second command is not necessary for performing the task, so the code can be written as follows, and now it works:

EXEC sp_MSForEachTable 
    @command1 = 'DECLARE @table NVARCHAR(128), @sql nvarchar(max) = ''''
                 SET @table = PARSENAME(''?'', 1)
                 IF EXISTS (SELECT 1 FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = @table AND COLUMN_NAME = ''DataAreaId'')
                 AND EXISTS (SELECT 1 FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = @table AND COLUMN_NAME = ''ItemId'')
                 BEGIN
                     SET @sql = @sql + ''SELECT DataAreaId, ItemId FROM '' + ''?'' + ''; ''
                 END
		 EXEC sp_executesql @sql'

The query is easy to modify to include more columns after the same pattern, if such tables exist after all (e.g. in Dynamics 365 one can search for the combination ItemId, DataAreaId and one or more Inventory dimension like InventSiteId, InventLocationId, etc.).

Prompt:
Write a query that uses the sp_MSForEachTable stored procedure to count all the records for a subset of tables.

Output:

DECLARE @sql NVARCHAR(MAX)

SET @sql = ''
EXEC sp_MSForEachTable 
    @command1 = 'IF ''?'' LIKE ''%subset%'' 
                 BEGIN 
                     SET @sql = @sql + ''SELECT ''''?'''' AS TableName, COUNT(*) AS RecordCount FROM '' + ''?'' + ''; '' 
                 END',
    @replacechar = '?',
    @command2 = 'EXEC sp_executesql @sql'

PRINT @sql

Unfortunately, that's not what I was expecting. So, let's give it a second try:

Prompt:
Write a query that uses the sp_MSForEachTable stored procedure to count all the records for a subset of the tables available in the SYS.TABLES.

Output:

EXEC sp_MSForEachTable 
    @command1 = 'IF OBJECT_ID(''?'' , ''U'') IN (SELECT OBJECT_ID FROM SYS.TABLES WHERE NAME IN (''YourTable1'', ''YourTable2'', ''YourTable3'')) 
                BEGIN 
                    PRINT ''?'' 
                    EXEC(''SELECT COUNT(*) FROM ?'') 
                END'

Replacing the tables names accordingly, the query becomes

EXEC sp_MSForEachTable 
    @command1 = 'IF OBJECT_ID(''?'' , ''U'') IN (SELECT OBJECT_ID FROM SYS.TABLES WHERE NAME IN (''InventTable'', ''InventDim'', ''InventDimCombination'')) 
                BEGIN 
                    PRINT ''?'' 
                    EXEC(''SELECT COUNT(*) FROM ?'') 
                END'

Comments:
1) Edge Copilot returned the expected solution directly and I wonder what the reason could be? It can be the fact that I run the tests on different machines at different points in time, though more likely the reason is that I used two different Copilots - from Edge, respectively Microsoft 365. The expectation was to get better answers in M365 Copilot, which wasn't the case in the above tests!

The two Copilots have different architectures and were trained with different sets of data, respectively have different sources. One important difference is that Edge Copilot primarily relies on web content and users' interactions within the browser, while M365 Copilot relies on data from the D365 services. Both use LLM and NLP, while M365 Copilot uses Microsoft Graph and Syntex.

One thing I remarked during the tests is that sometimes I get different answers to the same question also within the same Copilot. However, that's expected given that there are multiple ways of addressing the same problem! It would still be interesting to understand how a solution is chosen.

2) I seldom used sp_MSForEachTable with more than one parameter, preferring to prepare the needed statement beforehand, as in the working solutions above.

Happy coding!

Refer4ences:
[1] SQLShack (2017) An introduction to sp_MSforeachtable; run commands iteratively through all tables in a database [link]

Acronyms:
LLM - large language model
M365 - Microsoft 365
NLP - natural language processing

Previous Post <<||>> Next Post

08 February 2025

🌌🏭KQL Reloaded: First Steps (Part X: Translating SQL to KQL - Correlated Subqueries)

In SQL Server and other RDBMS databases there are many scenarios in which one needs information from a fact table based on a dimension table without requiring information from the dimension table.

Correlated Subquery via EXISTS

Before considering the main example, let's start with a simple subquery:

// subquery in SQL
--
explain
SELECT CustomerKey
, ProductKey
, ProductName
FROM NewSales 
WHERE ProductKey IN (
    SELECT DISTINCT ProductKey
    FROM Products 
    WHERE ProductSubcategoryName = 'MP4&MP3'
    ) 

// subquery in KQL
NewSales
| where ProductKey in (
    (Products
    | where (ProductSubcategoryName == "MP4&MP3")
    | project ProductKey
    | distinct *))
| project CustomerKey, ProductKey, ProductName

Of course, the ProductKey is unique by design, though there can be dimension, fact tables or subqueries in which the value is not unique.

Now let's consider the correlated subquery pattern, which should provide the same outcome as above, though in RDBMS there are scenarios in which it provides better performance, especially when the number of values from subquery is high.

// correlated subquery in SQL
--
explain
SELECT CustomerKey
, ProductKey
, ProductName
FROM NewSales 
WHERE EXISTS (
    SELECT Products.ProductKey
    FROM Products 
    WHERE NewSales.ProductKey = Products.ProductKey)

Unfortunately, trying to translate the code via explain leads to the following error, which confirms that the syntax is not supported in KQL (see [1]):

"Error: Reference to missing column 'NewSales.ProductKey'"

Fortunately, in this case one can use the first version of the query.

Correlated Subquery via CROSS APPLY

Before creating the main query, let's look at the inner query and check whether it gets correctly translate to KQL:

// subquery logic
--
explain
SELECT sum(TotalCost) TotalCost 
FROM NewSales 
WHERE DateKey > '20240101' and DateKey <'20240201'

Now, let's bring the logic within the CROSS APPLY:

// correlated subquery in SQL
--
explain
SELECT ProductKey
, ProductName
, TotalCost
FROM Products
    CROSS APPLY (
        SELECT sum(TotalCost) TotalCost 
        FROM NewSales 
        WHERE DateKey > '20240101' and DateKey <'20240201'
          AND Products.ProductKey = NewSales.ProductKey
    ) DAT

Running the above code leads to the following error:

"Sql node of type 'Microsoft.SqlServer.TransactSql.ScriptDom.UnqualifiedJoin' is not implemented"

Unfortunately, many SQL queries are written following this pattern, especially when an OUTER CROSS APPLY is used, retrieving thus all the records from the dimension table.

In this case one can rewrite the query via a RIGHT JOIN:

// correlated subquery in SQL
--
explain
SELECT PRD.ProductKey
, PRD.ProductName
, SAL.TotalCost
FROM Products PRD
    LEFT JOIN (
        SELECT ProductKey
        , sum(TotalCost) TotalCost 
        FROM NewSales 
        WHERE DateKey > '20240101' and DateKey <'20240201'
        GROUP BY ProductKey
    ) SAL
      ON PRD.ProductKey = SAL.ProductKey

// direct translation of the query
Products
| join kind=leftouter 
    (NewSales
        | where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
        | summarize TotalCost=sum(TotalCost) by ProductKey
        | project ProductKey, TotalCost
    ) on ($left.ProductKey == $right.ProductKey)
| project ProductKey, ProductName, TotalCost
//| where isnull(TotalCost)
//| summarize record_number = count()


// query after restructuring
NewSales
| where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
| summarize TotalCost=sum(TotalCost) by ProductKey
| join kind=rightouter
    (
        Products
        | project ProductKey, ProductName
    ) on ($left.ProductKey == $right.ProductKey)
| project ProductKey, ProductName, TotalCost
//| where isnull(TotalCost)
//| summarize record_number = count()

During transformations it's important to check whether the number of records changes between the various versions of the query (including the most general version in which filtering constraints were applied).

Especially when SQL solutions are planned to be migrated to KQL, it's important to know which query patterns can be used in KQL.

Happy coding!

Previous Post <<||>> Next Post

References:
[1] GitHib (2024) Module Ex-01 - Advanced KQL [link]

06 February 2025

🌌🏭KQL Reloaded: First Steps (Part VI: Actual vs. Estimated Count)

More examples are available nowadays for developers, at least compared with 10-20 years ago when besides the scarce documentation, the blogs and source code from books were the only ways to find how a function or a piece of standard functionality works. Copying code without understanding it may lead to unexpected results, with all the consequences resulting from this.

A recent example in this direction in KQL are the dcount and dcountif functions, which according to the documentation calculates an estimate of the number of distinct values that are taken by a scalar expression in the summary group. An estimate is not the actual number of records, trading performance for accuracy. The best example are the following pieces of code:

// counting records 
NewSales
| summarize record_count = count() // values availanle 
    , aprox_distinct_count = dcount(CustomerKey) // estimated disting values
    , distinct_count = count_distinct(CustomerKey) // actual number of records
    , aprox_distict_count_by_value  = dcountif(CustomerKey, SalesAmount <> 0) //estimated count of records with not null amounts
    , distict_count_by_value  = count_distinctif(CustomerKey, SalesAmount <> 0) // count of records with not null amounts
    , aprox_distict_count_by_value2  = dcountif(CustomerKey, SalesAmount == 0) //estimated count of records with null amounts
    , distict_count_by_value2  = count_distinctif(CustomerKey, SalesAmount == 0) // count of records with not amounts
| extend error_aprox_distinct_count = distinct_count - aprox_distinct_count
    , error_aprox_distict_count_by_value = distict_count_by_value - aprox_distict_count_by_value

Output:

record_count	aprox_distinct_count	distinct_count	aprox_distict_count_by_value	distict_count_by_value	aprox_distict_count_by_value2	distict_count_by_value2
2832193	18497	18484	18497	18484	10251	10219
error_aprox_distinct_count		error_aprox_distict_count_by_value
-13		-13

It's interesting that the same difference is observable also when a narrower time interval is chosen (e.g. 1 month). When using estimate it's important to understand also how big is the error between the actual value and the estimate, and that's the purpose of the last two lines added to the query. In many scenarios the difference might be neglectable until is not.

One can wonder whether the two functions are deterministic, in other words whether they return the same results if given the same input values. It would be also useful to understand what's the performance of the two estimative functions especially when further constraints are applied.

Moreover, the functions accept a third parameter which allows control over the trade between speed and accuracy (see provided table).

// counting records 
NewSales
| summarize record_count = count() // values availanle 
    , distinct_count = count_distinct(CustomerKey) // actual number of records
    , aprox_distinct_count = dcount(CustomerKey) // estimated disting values (default)
    , aprox_distinct_count0 = dcount(CustomerKey, 0) // 0-based accuracy
    , aprox_distinct_count1 = dcount(CustomerKey, 1) // 1-based accuracy (default)
    , aprox_distinct_count2 = dcount(CustomerKey, 2) // 2-based accuracy
    , aprox_distinct_count3 = dcount(CustomerKey, 3) // 3-based accuracy
    , aprox_distinct_count4 = dcount(CustomerKey, 4) // 4-based accuracy

Output:

record_count	distinct_count	aprox_distinct_count	aprox_distinct_count0	aprox_distinct_count1	aprox_distinct_count2	aprox_distinct_count3	aprox_distinct_count4
2832193	18484	18497	18793	18497	18500	18470	18487

It will be interesting to see which one of these parameters are used in practice. The problems usually start when different approximation parameters are used alternatively with no previous agreement. How could one argument in the favor of one parameter over the others?

A natural question: how big will be the error introduced by each parameter? Usually, when approximating values, one needs to specify also the expected error somehow. The documentation provide some guiding value, though are these values enough? Do similar estimate functions make sense also for the other aggregate functions?

In exchange, the count_distinct and count_distinctif seem to be still in preview, with all the consequences derived from this. They are supposed to be more resource-intensive than the estimative counterparts. Conversely, the values returned can be still rounded in dashboards up to the meaningful unit (e.g. thousands), and this usually depends on the context. The question whether the values can be rounded can be put also in the context and the estimative counterparts. It would be interesting to check how far away are the rounded values from each other in the context of the two sets of functions.

In practice, counting is useful for calculating percentages (e.g. how many customers come from a certain zone compared to the total), which are more useful and easier to grasp than big numbers:

// calculating percentages from totals
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey)
    , count_customers_US = count_distinctif(CustomerKey, RegionCountryName == 'United States')
    , count_customers_CA = count_distinctif(CustomerKey, RegionCountryName == 'Canada')
    , count_customers_other = count_distinctif(CustomerKey, not(RegionCountryName in ('United States', 'Canada')))
| extend percent_customers_US = iif(count_customers<>0, round(100.00 * count_customers_US/count_customers, 2), 0.00)
    , percent_customers_CA = iif(count_customers<>0, round(100.00 * count_customers_CA/count_customers, 2), 0.00)
    , percent_customers_other = iif(count_customers<>0, round(100.00 * count_customers_other/count_customers,2), 0.00)

Output:

count_customers	count_customers_US	count_customers_CA	count_customers_other	percent_customers_US	percent_customers_CA	percent_customers_other
10317	3912	789	5616	37.92	7.65	54.43

Note:
When showing percentages it's important to provide also the "context", the actual count or amount. This allows to understand the scale associated with the percentages.

Happy coding!

Previous Post <<||>> Next Post

Resources:
[R1] Arcane Code (2022) Fun With KQL – DCount, by R C Cain [link]

[R2] M Morowczynsk et al (2024) "The Definitive Guide to KQL" [sample]
[R3] M Zorich (2022) Too much noise in your data? Summarize it! [link]

04 February 2025

🌌🏭KQL Reloaded: First Steps (Part III: Basic Joins)

As data is usually dispersed over multiple tables, in any query languages is quintessential to know how to bring the data together for further analysis, joins allowing to achieve this is SQL as well in KQL. One usually starts with the main table and joins the further tables to it. In a data warehouse, the main table is a fact table to which the dimension tables are further joined to add details, respectively to filter on the dimensions.

Before starting to build any logic, it's important to get an idea of tables' cardinality and which are the keys on which the data to be join. Some models, like the one in the ContosoSales model, are straightforward and the columns participating in the joins have the same names, or at least similar names.

When writing and testing queries, one can start with a small subset of the data, join the tables together, and at the end validate the logic as needed. Selecting a small interval (e.g. a month, week, or even a day) and the records for 1-2 representative records from a dimensional table (e.g. Customers) helps in the process for minimizing the data movement and

// check how many records are available 
// Sales table: 28980 based on the filter
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()

// Products: 2517 records
Products
| summarize record_count = count()

// Customers: 18484 records
Customers 
| summarize record_count = count()

In a second step one can look at the individual values, identify uniqque indentifiers, the columns that participate in joins, respectively other columns of interest. If needed, one can five deeper by checking the number for individual values. Selecting 10-100 records is usually enough for getting an idea about the dataset, though there are also many exceptions:

// review the data
// Sales table: 28980 based on the filter
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| take 10

// Products: 2517 records
Products
| take 10

// Customers: 18484 records
Customers 
| take 10

In KQL one must start with the fact table (e.g NewSales) and add the dimensions accordingly. Alternatively, it's useful to group only by ProductKey, respectively by CustomerKey.

// review the basis output by Products & Customers
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by ProductKey, CustomerKey
| order by TotalCost desc
| summarize result_record_count = count()

Now let's join the Product dimension table to the NewSales fact table via the join operator:

// total by product via join
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by ProductKey
| join kind=inner (
    Products 
    | where  ProductCategoryName == 'TV and Video'
    )
    on ProductKey
| project ProductName, TotalCost
//| summarize record_count = count() 
| order by TotalCost desc

One can join now also the second dimension table and eventually select only the columns in scope, respectively do further calculations (see CustomerName):

// total by product & customer via joins
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by ProductKey, CustomerKey
| join kind=inner (
    Products 
    | where  ProductCategoryName == 'TV and Video'
    | project ProductKey, ProductName 
    )
    on ProductKey
| join kind=inner (
    Customers
    | project CustomerKey, CustomerName = strcat(FirstName, ' ', LastName)
    )
    on CustomerKey
| project CustomerName, ProductName, TotalCost
//| summarize record_count = count() 
| order by CustomerName asc, ProductName asc

Once the query built, one can go ahead and remove the pieces of logic not needed, an reevaluate the number of records with a count (see commented line).

Alternatively, one can rewrite the above query as follows via the lookup operators:

// total by Product & Customer via lookup
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by ProductKey, CustomerKey
| lookup Products on ProductKey
| lookup Customers on CustomerKey 
//| summarize record_count = count()

| project CustomerName = strcat(FirstName, ' ', LastName), ProductName, TotalCost

The two last queries should produce the same results. The first query allows more flexibility in what concerns the constraints applied, while the second is more compact while it also allows to easier test how the number of records changes by added each join. This set of tests is done to make sure that duplicates aren't created in the process, respectively that records are not removed unnecessarily from the logic.

If the fields used in joins don't have the same name, which happens in data many models, one can use the $left and $right to refer to the table participating in the join. This syntax makes the query a bit more verbose, though it brings more clarity. Even if some vendors provide similar syntax (e.g. USING in Oracle), one may prefer the syntax that offers clarity.

// total by product via lookup with table alias
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by ProductKey, CustomerKey
| lookup Products on $left.ProductKey == $right.ProductKey
| lookup Customers on $left.CustomerKey == $right.CustomerKey
| project CustomerName = strcat(FirstName, ' ', LastName), ProductName, TotalCost

Notes:
1) For readability purposes, it might be useful to use indentation for the lines that are continuation of the previous line. Therefore, if the line doesn't start with a pipe ("|"), the next line starts a few characters (a tab) further.

2) According to the documentation (see [1]), the "lookup" operator optimizes the performance of queries where a fact table is enriched with data from a dimension table.

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Kusto Tutorial: Join data from multiple tables [link]

23 January 2025

💎SQL Reloaded: Number of Records VI (via sp_MSForEachTable Undocumented Stored Procedure)

Starting with SQL Server 2000 it's possible to execute a command via the undocumented stored procedure sp_MSForEachTable for each table available in a database, respectively for subsets of the tables. In a previous post I shown how the stored procedure can be used in several scenarios, including how to get the total number of records in each set of tables. However, the code used generates a result set for each table, which makes it difficult to aggregate the information for further processing. In many scenarios, it would be useful to store the result as a temporary or even persisted table.

-- dropping the tables
DROP TABLE IF EXISTS #Tables
DROP TABLE IF EXISTS #TablesRecordCount

-- create a temporary table to store the input list
SELECT TableName
INTO #Tables 
FROM (VALUES ('Person.Address')
, ('Person.AddressType')
, ('Person.BusinessEntity')) DAT(TableName)


-- create a temporary table to store the results
CREATE TABLE dbo.#TablesRecordCount (
  table_name nvarchar(150) NOT NULL
, number_records bigint
, run_date datetime2(0)
, comment nvarchar(255)
)

-- getting the number of records for the list of tables into the result table
INSERT INTO #TablesRecordCount
EXEC sp_MSForEachTable @command1='SELECT ''?'' [Table], COUNT(*) numer_records, GetDate() run_date, ''testing round 1'' comment FROM ?'
, @whereand = ' And Object_id In (Select Object_id(TableName) FROM #Tables)'

-- reviewing the result
SELECT *
FROM #TablesRecordCount
ORDER BY number_records DESC

The above solution uses two temporary tables, though it can be easily adapted to persist the result in a standard table: just replace the "#" with the schema part (e.g. "dbo."). This can be useful in troubleshooting scenarios, when the code is run at different points in time, eventually for different sets of tables.

The code is pretty simple and can be extended as needed. Unfortunately, there's no guarantee that the sp_MSForEachTable stored procedure will be supported in the next versions of the SQL Server. For example, the stored procedure is not available in SQL databases, respectively in Fabric warehouses. In SQL databases the following error is thrown:

"Msg 2812, Level 16, State 62, Line 1, Could not find stored procedure 'sys.sp_MSForEachTable'."

To test whether the feature works in your environment, it's enough to run a call to the respective stored procedure:

-- retrieve the record count for all tables
EXEC sp_MSForEachTable @command1='SELECT ''?'' [Table], COUNT(*) numer_records FROM ?'

Or, you can check whether it works for one table (replace the Person.AddressType table with one from your environment):

-- getting the number of records for the list of tables into another table
EXEC sp_MSForEachTable @command1='SELECT ''?'' [Table], COUNT(*) numer_records FROM ?'
, @whereand = ' And Object_id = Object_id(''Person.AddressType'')'

The solution could prove to be useful in multiple scenarios, though one should consider also the risk of being forced to rewrite the code when the used stored procedure becomes unavailable. Even if it takes more time to write, a solution based on cursors can be more feasible (see previous post).

Update 29-Jan-2025: Probably, despite their usefulness, the undocumented features will not be brought to SQL databases (see [1], 47:30). So, be careful about using the respective features as standard solutions in production environments!

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Reactor (2025) Ask The Expert - Fabric Edition - Fabric Databases [link]

SQL Troubles

Pages

01 March 2025

💎🤖💫SQL Reloaded: Copilot Stories (Part VII: Using the sp_MSForEachTable Undocumented Stored Procedures in SQL Server)

08 February 2025

🌌🏭KQL Reloaded: First Steps (Part X: Translating SQL to KQL - Correlated Subqueries)

06 February 2025

🌌🏭KQL Reloaded: First Steps (Part VI: Actual vs. Estimated Count)

04 February 2025

🌌🏭KQL Reloaded: First Steps (Part III: Basic Joins)

23 January 2025

💎SQL Reloaded: Number of Records VI (via sp_MSForEachTable Undocumented Stored Procedure)

About Me