SQL Troubles

09 February 2025

🏭🗒️Microsoft Fabric: Data Pipelines [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 9-Feb-2025

[Microsoft Fabric] Data pipeline

{def} a logical sequence of activities that orchestrate a process and perform together a task [1]

usually by extracting data from one or more sources and loading it into a destination;

⇐ often transforming it along the way [1]

⇐ allows managing the activities as a set instead of each one individually [2]
⇐ used to automate ETL processes that ingest transactional data from operational data stores into an analytical data store [1]
e.g. lakehouse or data warehouse

{concept} activity

{def} an executable task in a pipeline

a flow of activities can be defined by connecting them in a sequence [1]
its outcome (success, failure, or completion) can be used to direct the flow to the next activity in the sequence [1]

{type} data movement activities

copies data from a source data store to a sink data store [2]

{type} data transformation activities

encapsulate data transfer operations

incl. simple Copy Data activities that extract data from a source and load it to a destination
incl. complex Data Flow activities that encapsulate dataflows (Gen2) that apply transformations to the data as it is transferred
incl. notebook activities to run a Spark notebook
incl. stored procedure activities to run SQL code
incl. delete data activities to delete existing data

{type} control flow activities

used to

implement loops
implement conditional branching
manage variables
manage parameter values
enable implementing complex pipeline logic to orchestrate the data ingestion and transformation flow [1]

can be parameterized

⇐ enables providing specific values to be used each time a pipeline is run [1]

when executed, a run is initiated (aka data pipeline run)

runs can be initiated on-demand or scheduled to start at a specific frequency
use the unique run ID to review run details to confirm they completed successfully and investigate the specific settings used for each execution [1]

{benefit} increases pipelines’ reusability

{concept} pipeline template

predefined pipeline that can be used and customized as required

{concept} data pipeline run

occurs when a data pipeline is executed
the activities in the data pipeline are executed to completion [3]
can be triggered one of two ways

on-demand
on a schedule

the scheduled pipeline will be able to run based on the time and frequency set [3]
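The notes above (activities connected in a sequence, the outcome of an activity directing the flow, parameter values, and a unique run ID) can be illustrated with a minimal plain-Python sketch. It is not a Fabric API: the Activity class, run_pipeline and the activity names are hypothetical and only mimic the concepts.

```python
import uuid
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Activity:
    """Hypothetical stand-in for a pipeline activity (not a Fabric API)."""
    name: str
    task: Callable[[dict], None]        # receives the run's parameter values
    on_success: Optional[str] = None    # next activity when the task succeeds
    on_failure: Optional[str] = None    # next activity when the task raises


def run_pipeline(activities: dict, start: str, parameters: dict) -> dict:
    """Execute activities from `start`, letting each outcome pick the next one."""
    run = {"run_id": str(uuid.uuid4()), "parameters": parameters, "log": []}
    current = start
    while current is not None:
        activity = activities[current]
        try:
            activity.task(parameters)
            outcome, current = "success", activity.on_success
        except Exception:
            outcome, current = "failure", activity.on_failure
        run["log"].append((activity.name, outcome))
    return run


def copy_data(params: dict) -> None:
    # Fails when no source was provided via the parameters.
    if not params.get("source"):
        raise ValueError("no source configured")


def transform(params: dict) -> None:
    pass  # placeholder for a transformation step


def notify(params: dict) -> None:
    pass  # placeholder for a failure notification


activities = {
    "copy": Activity("copy", copy_data, on_success="transform", on_failure="notify"),
    "transform": Activity("transform", transform),
    "notify": Activity("notify", notify),
}

# Two on-demand runs of the same pipeline with different parameter values.
ok_run = run_pipeline(activities, "copy", {"source": "sales.csv"})
failed_run = run_pipeline(activities, "copy", {"source": ""})
print(ok_run["run_id"], ok_run["log"])
print(failed_run["run_id"], failed_run["log"])
```

Each call returns its own run ID together with the parameters used, which is the kind of information one would review to confirm that a run completed and with which settings.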


References:
[1] Microsoft Learn (2023) Use Data Factory pipelines in Microsoft Fabric [link]
[2] Microsoft Learn (2024) Microsoft Fabric: Activity overview [link]
[3] Microsoft Learn (2024) Microsoft Fabric Concept: Data pipeline Runs [link]

Resources
[R1] Metadata Driven Pipelines for Microsoft Fabric (link)
[R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

🌌🏭KQL Reloaded: First Steps (Part XI: Window Functions)

Window functions are one of the powerful features available in RDBMS, as they allow operating across several rows of a result set and returning a result for each row. The good news is that KQL supports several window functions, which cover several scenarios. However, the support is limited and needs improvement.

One of the basic scenarios that takes advantage of window functions is the creation of a running sum or of a (dense) rank across a whole dataset. For this purpose, in Kusto one can use row_cumsum for the running sum, respectively row_number, row_rank_dense and row_rank_min for ranking. In addition, one can refer to the values of a field from the previous (prev) and the next (next) record.

// rank window functions within dataset (not partitioned)
NewSales
| where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
| where CustomerKey in (12617, 12618)
| project CustomerKey, DateKey, ProductName, TotalCost
| sort by CustomerKey asc, DateKey asc 
| extend Rank1 = row_rank_dense(DateKey)
    , Rank2 = row_number(0)
    , Rank3 = row_rank_min(DateKey)
    , Sum = row_cumsum(TotalCost)
    , NextDate = next(DateKey)
    , PrevDate = prev(DateKey)

Often, one needs to operate only inside a partition and not across the whole dataset. Some of the functions accept an additional restart parameter for this purpose:

// rank window functions within partitions
NewSales
| where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
| where CustomerKey in (12617, 12618)
| project CustomerKey, DateKey, ProductName, TotalCost
| sort by CustomerKey asc, DateKey asc 
| extend RowRank1 = row_rank_dense(DateKey, prev(CustomerKey) != CustomerKey)
    , RowRank2 = row_number(0, prev(CustomerKey) != CustomerKey)
    , Sum = row_cumsum(TotalCost, prev(CustomerKey) != CustomerKey)
    , NextDate = iif(CustomerKey == next(CustomerKey), next(DateKey), datetime(null))
    , PrevDate = iif(CustomerKey == prev(CustomerKey), prev(DateKey),  datetime(null))
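To make the restart logic more tangible, here is a minimal plain-Python sketch that imitates a running sum and a row number which reset whenever the customer changes. The helper names only mirror the KQL functions for readability; the sample rows are made up, and treating the first row as a restart is an assumption of the sketch, not a statement about how Kusto evaluates prev() on the first row.

```python
def row_cumsum(values, restart):
    """Running sum that starts over wherever the restart flag is True."""
    total, out = 0, []
    for value, is_restart in zip(values, restart):
        total = value if is_restart else total + value
        out.append(total)
    return out


def row_number(restart, start=1):
    """Row counter that starts over wherever the restart flag is True."""
    number, out = start - 1, []
    for is_restart in restart:
        number = start if is_restart else number + 1
        out.append(number)
    return out


# Made-up rows, already sorted by (CustomerKey, DateKey).
rows = [
    (12617, "2024-01-05", 10.0),
    (12617, "2024-01-09", 5.0),
    (12618, "2024-01-03", 7.0),
    (12618, "2024-01-20", 3.0),
]

# Restart flag: True on the first row and whenever the customer changes.
restart = [i == 0 or rows[i][0] != rows[i - 1][0] for i in range(len(rows))]

running = row_cumsum([cost for _, _, cost in rows], restart)
numbers = row_number(restart, start=0)

for row, total, number in zip(rows, running, numbers):
    print(row, total, number)
```

The sketch only works because the rows are sorted by the partition key first, which is also why the KQL queries sort by CustomerKey before calling the window functions.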

In addition, the partitions can be defined explicitly via the partition operator:

// creating explicit partitions
NewSales
| where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
| where CustomerKey in (12617, 12618)
| project CustomerKey, DateKey, ProductName, TotalCost
| partition by CustomerKey
(
    order by DateKey asc
    | extend prev_cost = prev(TotalCost, 1)
)
| order by CustomerKey asc, DateKey asc
| extend DifferenceCost = TotalCost - prev_cost
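The same idea (process each partition separately, look at the previous record, then compute a difference) can be sketched in plain Python. The data is made up and the column names are borrowed from the query only for readability.

```python
from itertools import groupby
from operator import itemgetter

# Made-up sample rows, deliberately unsorted.
sales = [
    {"CustomerKey": 12618, "DateKey": "2024-01-20", "TotalCost": 3.0},
    {"CustomerKey": 12617, "DateKey": "2024-01-05", "TotalCost": 10.0},
    {"CustomerKey": 12618, "DateKey": "2024-01-03", "TotalCost": 7.0},
    {"CustomerKey": 12617, "DateKey": "2024-01-09", "TotalCost": 5.0},
]

# Sort by partition key and date, then handle each customer on its own.
ordered = sorted(sales, key=itemgetter("CustomerKey", "DateKey"))

result = []
for customer, group in groupby(ordered, key=itemgetter("CustomerKey")):
    prev_cost = None  # no previous record at the start of a partition
    for row in group:
        diff = None if prev_cost is None else row["TotalCost"] - prev_cost
        result.append({**row, "prev_cost": prev_cost, "DifferenceCost": diff})
        prev_cost = row["TotalCost"]

for row in result:
    print(row)
```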

It will be interesting to see whether Microsoft plans to introduce further window functions in KQL to bridge the gap. Experience has shown that such functions are quite useful in data analytics, developers sometimes needing to go to great lengths to achieve the respective behavior (see visual calcs in Power BI). For example, SQL Server provides several types of such functions, though it took Microsoft several versions to make this progress. Moreover, developers can introduce their own functions, even if this involves .NET programming.

Happy coding!


References:
[1] Microsoft Learn (2024) Kusto: Window functions overview [link]

🏭🗒️Microsoft Fabric: Kusto Query Language (KQL) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 9-Feb-2025

[Microsoft Fabric] Kusto Query Language (KQL)

{def} a read-only request to process data and return results [1]

designed for data exploration and summarization [1]

very similar to SQL

the explain command can be used to transform SQL into KQL code

⇐ not all the SQL syntax can be translated

statements are sequenced being executed in the order of their arrangement

funnel like processing where data is piped from one operator to the next

data is filtered, rearranged or summarized at each step and then fed into the following step
operators are sequenced by a pipe (|)

returns data in a tabular or graph format
designed and developed to take advantage of cloud computing through clustering and scaling compute [2]

ideal engine to power fast, real-time dashboards

case-sensitive in general
named after the undersea pioneer Jacques Cousteau [2]
operation sequence

filter data
aggregate data
order data
modify column output
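The funnel-like processing and the operation sequence above (filter, aggregate, order, modify column output) can be imitated in plain Python, where the output of each step feeds the next one. This is only an illustration of the data flow with made-up rows, not KQL.

```python
from collections import defaultdict

# Made-up input rows.
sales = [
    {"CustomerKey": 12617, "TotalCost": 10.0},
    {"CustomerKey": 12617, "TotalCost": 5.0},
    {"CustomerKey": 12618, "TotalCost": 7.0},
    {"CustomerKey": 12618, "TotalCost": 3.0},
]

# 1) filter data
filtered = [row for row in sales if row["TotalCost"] > 4]

# 2) aggregate data
totals = defaultdict(float)
for row in filtered:
    totals[row["CustomerKey"]] += row["TotalCost"]

# 3) order data
ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)

# 4) modify column output
output = [{"Customer": key, "Total": total} for key, total in ordered]

print(output)
```

Filtering first means that the later steps work on fewer rows, which is the main reason for the funnel shape.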

supports standard data types

string

a sequence of zero or more Unicode characters
characters are encoded in UTF-8.

int

signed 32-bit whole-number integer

long

signed 64-bit whole-number integer

real (aka double)

64-bit double-precision floating-point number
provides high precision for values with decimal points

decimal

a 128-bit decimal number
provides the highest precision of decimal points
{recommendation} if precision is not needed, use the real type instead [2]

bool

a boolean value that can be true (1), false (0), or null

datetime

represents a date in the UTC zone

timespan

represents a time interval

days, hours, minutes, seconds, milliseconds, microseconds, ticks

if no time frame is specified, it will default to day

dynamic

a special data type that can take

any value from the other data types
arrays
a {name = value} property bag

guid

a 128-bit globally unique value

statement types

tabular expression statement
let statement

used to

set variable names equal to an expression
create views

⇐ used mostly to

help break complex expressions into multiple parts, each represented by a variable
set constants outside the query to aid in readability

set statement

used to set a request property for the duration of the query

{tool} Microsoft Sentinel

{def} a cloud native SIEM and SOAR that provides cyberthreat detection, investigation, response, and proactive hunting, with a bird's-eye view across your enterprise [3]

{tool} Kusto Explorer

{def} user-friendly interface to query and analyze data with KQL [4]

{tool} Azure Data Studio

{def} lightweight, cross-platform data management and development tool for data professionals [5]


References:
[1] Microsoft (2024) Real-time Analytics: End-to-End Workshop
[2] Mark Morowczynski et al (2024) The Definitive Guide to KQL: Using Kusto Query Language for Operations, Defending, and Threat Hunting
[3] Microsoft Learn (2024) Azure: What is Microsoft Sentinel [link]

[4] Microsoft Learn (2024) Kusto: Kusto.Explorer installation and user interface [link]
[5] Microsoft Learn (2024) SQL: What is Azure Data Studio? [link]

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:

KQL - Kusto Query Language
SIEM - security information and event management

SOAR - security orchestration, automation, and response
SQL - Structured Query Language

UTC - Coordinated Universal Time

🏭🗒️Microsoft Fabric: Sharding [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 9-Feb-2025

[Microsoft Fabric] Data Partitioning (aka Sharding)

{definition} "a process where small chunks of the database are isolated and can be updated independently of other shards" [2]
allows a logical database to be partitioned across multiple physical servers [1]

each partition is referred to as a shard
the largest tables are partitioned across multiple database servers [1]

when operating on a record, the application must determine which shard will contain the data and then send the SQL to the appropriate server [1]

partitioning is based on a Key Value

e.g. a user ID
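A minimal plain-Python sketch of key-based routing: the application hashes the key value (here a user ID) and uses the result to pick the shard that should receive the SQL statement. The shard names and the shard_for helper are hypothetical, and hashing with a modulo is just one of several possible routing schemes.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]  # hypothetical server names


def shard_for(user_id: int, shards=SHARDS) -> str:
    """Map a key value to a shard via a stable hash and a modulo."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


for user_id in (1, 2, 3, 42):
    print(user_id, "->", shard_for(user_id))
```

With a modulo-based scheme, changing the number of shards changes the target shard for many keys, which hints at why adding shards requires rebalancing the data.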

proven technique for achieving data processing on a massive scale [1]

solution used at the largest websites

e.g. Facebook, Twitter
usually associated with rapid growth

⇒ the approach needs to be dynamic [1]

the only way to scale a relational database to massive web use [1]

together with caching and replication [1]

{drawback} involves significant operational complexities and compromises [1]

the application must contain logic that understands the location of any particular piece of data and the logic to route requests to the correct shard [1]
requests that can only be satisfied by accessing more than one shard thus need complex coding as well, whereas on a nonsharded database a single SQL statement might suffice.

{drawback} high operational costs [1]
{drawback} application complexity

it’s up to the application code to route SQL requests to the correct shard [1]

⇒ a dynamic routing layer must be implemented

⇐ most massive websites are adding shards as they grow [1]
layer required to maintain Memcached object copies and to differentiate between the master database and read-only replicas [1]

{drawback} crippled SQL

[sharded database] it is not possible to issue a SQL statement that operates across shards [1]

⇒ usually SQL statements are limited to row-level access [1]
⇒ only programmers can query the database as a whole [1]
joins across shards cannot be implemented, nor can aggregate GROUP BY operations [1]
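When a single SQL statement cannot span shards, an aggregate has to be assembled in application code: each shard is queried separately and the partial results are merged. The following plain-Python sketch imitates that scatter-gather step with in-memory lists standing in for the shards; the data and the names are made up.

```python
from collections import Counter

# Each list stands in for the rows (country, amount) stored on one shard.
shards = {
    "shard-0": [("DE", 10.0), ("US", 5.0)],
    "shard-1": [("US", 7.0)],
    "shard-2": [("DE", 1.0), ("FR", 2.0)],
}


def total_by_country(shards: dict) -> dict:
    """Aggregate per shard, then merge the partial results."""
    totals = Counter()
    for rows in shards.values():
        partial = Counter()  # what a per-shard GROUP BY would return
        for country, amount in rows:
            partial[country] += amount
        totals.update(partial)  # merge step done by the application
    return dict(totals)


print(total_by_country(shards))
```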

{drawback} loss of transactional integrity

ACID transactions against multiple shards are not possible and/or not practical [1]

⇐ {exception} there are database systems that support 2PC

involves considerable troubleshooting as conflicts and bottlenecks can occur [1]

{drawback} operational complexity

load balancing across shards becomes extremely problematic

adding new shards requires a complex rebalancing of data [1]
changing the database schema requires a rolling operation across all the shards [1]

⇒ can lead to transitory inconsistencies in the schema [1]

a sharded database entails a huge amount of operational effort and administrator skill [1]

{concept} CAP (Consistency, Availability, and Partition tolerance) theorem

in a distributed database system, one can have at most two of the three properties (consistency, availability, partition tolerance) [1]
consistency

every user of the database has an identical view of the data at any given instant [1]

availability

in the event of a failure, the database remains operational [1]

partition tolerance

the database can maintain operations in the event of the network’s failing between two segments of the distributed system [1]

{concept} partitioning

{def} core pattern of building scalable services by dividing state (data) and compute into smaller accessible units to improve scalability and performance [5]

⇐ determines that a particular service partition is responsible for a portion of the complete state of the service.

a partition is a set of replicas

{type} [stateless services] a logical unit that contains one or more instances of a service [5]

partitioning a stateless service is a very rare scenario
scalability and availability are normally achieved by adding more instances
{subtype} externally persisted state

persists its state externally [5]

e.g. databases in Azure SQL Database

{subtype} computation-only services

services that do not manage any persistent state, e.g. a calculator or image thumbnailing [5]

{type} scalable stateful services

partition state (data)
a partition of a stateful service as a scale unit that is highly reliable through replicas that are distributed and balanced across the nodes in a cluster
the state must be accessed and stored

⇒ bound by

network bandwidth limits
system memory limits
disk storage limits

{scenario} run into resource constraints in a running cluster

{recommendation} scale out the cluster to accommodate the new requirements [4]

{concept} Service Fabric

{def} distributed systems platform used to build hyper-scalable, reliable and easily managed applications for the cloud [6]

⇐ addresses the significant challenges in developing and managing cloud applications
places the partitions on different nodes [5]

allows partitions to grow to a node's resource limit

⇐ partitions are rebalanced across nodes [5]

{benefit} ensures the continued efficient use of hardware resources [5]

{default} makes sure that there is about the same number of primary and secondary replicas on each node

⇒ some nodes may hold replicas that serve more traffic, while others hold replicas that serve less traffic [5]
hot and cold spots may appear in a cluster

⇐ it should be preferably avoided

{recommendation} partition the state so that it is evenly distributed across all partitions [5]
{recommendation} report load from each of the replicas for the service [5]

provides the capability to report load consumed by services [5]

e.g. amount of memory, number of records
detects which partitions serve higher loads than others [5]

⇐ based on the metrics reported

rebalances the cluster by moving replicas to more suitable nodes, so that overall no node is overloaded [5]
⇐ it's not always possible to know how much data will be in a given partition

{recommendation} adopt a partitioning strategy that spreads the data evenly across the partitions [5]

{benefit} prevents situations described in the voting example [5]

{recommendation} report load

{benefit} helps smooth out temporary differences in access or load over time [5]

{recommendation} choose an optimal number of partitions to begin with

⇐ nothing prevents one from starting out with a higher number of partitions than anticipated [5]

⇐ assuming the maximum number of partitions is a valid approach [5]

⇒ one may end up needing more partitions than initially considered [5]

⇐ {constraint} the partition count can't be changed after the fact [5]

⇒ apply more advanced partition approaches

e.g. creating a new service instance of the same service type
e.g. implement client-side logic that routes the requests to the correct service instance


References:

[1] Guy Harrison (2015) Next Generation Databases: NoSQL, NewSQL, and Big Data

[2] DAMA International (2017) "The DAMA Guide to the Data Management Body of Knowledge" 2nd Ed

[3] Microsoft Fabric (2024) External data sharing in Microsoft Fabric [link]
[4] Microsoft Fabric (2024) Data sharding policy [link]
[5] Microsoft Fabric (2024) Partition Service Fabric reliable services [link]
[6] MSDN (2015) Microsoft Azure - Azure Service Fabric and the Microservices Architecture [link]

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:

ACID - atomicity, consistency, isolation, durability

2PC - Two Phase Commit

CAP - Consistency, Availability, Partition tolerance

08 February 2025

🌌🏭KQL Reloaded: First Steps (Part X: Translating SQL to KQL - Correlated Subqueries)

In SQL Server and other RDBMS databases there are many scenarios in which one needs information from a fact table based on a dimension table without requiring information from the dimension table.

Correlated Subquery via EXISTS

Before considering the main example, let's start with a simple subquery:

// subquery in SQL
--
explain
SELECT CustomerKey
, ProductKey
, ProductName
FROM NewSales 
WHERE ProductKey IN (
    SELECT DISTINCT ProductKey
    FROM Products 
    WHERE ProductSubcategoryName = 'MP4&MP3'
    ) 

// subquery in KQL
NewSales
| where ProductKey in (
    (Products
    | where (ProductSubcategoryName == "MP4&MP3")
    | project ProductKey
    | distinct *))
| project CustomerKey, ProductKey, ProductName

Of course, the ProductKey is unique by design, though there can be dimension, fact tables or subqueries in which the value is not unique.

Now let's consider the correlated subquery pattern, which should provide the same outcome as above, though in RDBMS there are scenarios in which it provides better performance, especially when the number of values returned by the subquery is high.

// correlated subquery in SQL
--
explain
SELECT CustomerKey
, ProductKey
, ProductName
FROM NewSales 
WHERE EXISTS (
    SELECT Products.ProductKey
    FROM Products 
    WHERE NewSales.ProductKey = Products.ProductKey)

Unfortunately, trying to translate the code via explain leads to the following error, which confirms that the syntax is not supported in KQL (see [1]):

"Error: Reference to missing column 'NewSales.ProductKey'"

Fortunately, in this case one can use the first version of the query.
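Why the two versions are interchangeable can be seen in a small plain-Python sketch: filtering on membership in a set of keys (the IN pattern) and checking per row whether a matching record exists (the EXISTS pattern) return the same rows. The data is made up and the tables are reduced to the key columns.

```python
# Made-up sample data reduced to the relevant columns.
new_sales = [
    {"CustomerKey": 1, "ProductKey": 10},
    {"CustomerKey": 2, "ProductKey": 11},
    {"CustomerKey": 3, "ProductKey": 99},  # no matching product
]
products = [{"ProductKey": 10}, {"ProductKey": 11}, {"ProductKey": 12}]

# IN pattern: build the distinct set of keys once, then test membership.
product_keys = {p["ProductKey"] for p in products}
via_in = [s for s in new_sales if s["ProductKey"] in product_keys]

# EXISTS pattern: for each sales row, look for at least one matching product.
via_exists = [
    s for s in new_sales
    if any(p["ProductKey"] == s["ProductKey"] for p in products)
]

print(via_in == via_exists, [s["CustomerKey"] for s in via_in])
```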

Correlated Subquery via CROSS APPLY

Before creating the main query, let's look at the inner query and check whether it gets correctly translated to KQL:

// subquery logic
--
explain
SELECT sum(TotalCost) TotalCost 
FROM NewSales 
WHERE DateKey > '20240101' and DateKey <'20240201'

Now, let's bring the logic within the CROSS APPLY:

// correlated subquery in SQL
--
explain
SELECT ProductKey
, ProductName
, TotalCost
FROM Products
    CROSS APPLY (
        SELECT sum(TotalCost) TotalCost 
        FROM NewSales 
        WHERE DateKey > '20240101' and DateKey <'20240201'
          AND Products.ProductKey = NewSales.ProductKey
    ) DAT

Running the above code leads to the following error:

"Sql node of type 'Microsoft.SqlServer.TransactSql.ScriptDom.UnqualifiedJoin' is not implemented"

Unfortunately, many SQL queries are written following this pattern, especially when an OUTER APPLY is used, retrieving thus all the records from the dimension table.

In this case one can rewrite the query via a LEFT JOIN against the aggregated sales:

// correlated subquery in SQL
--
explain
SELECT PRD.ProductKey
, PRD.ProductName
, SAL.TotalCost
FROM Products PRD
    LEFT JOIN (
        SELECT ProductKey
        , sum(TotalCost) TotalCost 
        FROM NewSales 
        WHERE DateKey > '20240101' and DateKey <'20240201'
        GROUP BY ProductKey
    ) SAL
      ON PRD.ProductKey = SAL.ProductKey

// direct translation of the query
Products
| join kind=leftouter 
    (NewSales
        | where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
        | summarize TotalCost=sum(TotalCost) by ProductKey
        | project ProductKey, TotalCost
    ) on ($left.ProductKey == $right.ProductKey)
| project ProductKey, ProductName, TotalCost
//| where isnull(TotalCost)
//| summarize record_number = count()


// query after restructuring
NewSales
| where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
| summarize TotalCost=sum(TotalCost) by ProductKey
| join kind=rightouter
    (
        Products
        | project ProductKey, ProductName
    ) on ($left.ProductKey == $right.ProductKey)
| project ProductKey, ProductName, TotalCost
//| where isnull(TotalCost)
//| summarize record_number = count()

During transformations it's important to check whether the number of records changes between the various versions of the query (including the most general version in which filtering constraints were applied).
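The record-count check can be illustrated with a plain-Python sketch that computes the totals in two ways: per product, as the correlated subquery does, and via a single aggregation followed by a left join. The data is made up; both variants should return one row per product, with an empty total for products without sales.

```python
# Made-up sample data.
products = [
    {"ProductKey": 10, "ProductName": "A"},
    {"ProductKey": 11, "ProductName": "B"},
    {"ProductKey": 12, "ProductName": "C"},  # product without sales
]
new_sales = [(10, 5.0), (10, 2.5), (11, 4.0)]  # (ProductKey, TotalCost)


def total_for(product_key):
    """Correlated variant: scan the sales again for every product."""
    costs = [cost for key, cost in new_sales if key == product_key]
    return sum(costs) if costs else None


via_apply = [
    (p["ProductKey"], p["ProductName"], total_for(p["ProductKey"]))
    for p in products
]

# Join variant: aggregate the sales once, then look the totals up per product.
totals = {}
for key, cost in new_sales:
    totals[key] = totals.get(key, 0.0) + cost

via_join = [
    (p["ProductKey"], p["ProductName"], totals.get(p["ProductKey"]))
    for p in products
]

# Same number of records and same content in both variants.
print(len(via_apply) == len(via_join) == len(products), via_apply == via_join)
```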

Especially when SQL solutions are planned to be migrated to KQL, it's important to know which query patterns can be used in KQL.

Happy coding!


References:
[1] GitHub (2024) Module Ex-01 - Advanced KQL [link]

🌌🏭KQL Reloaded: First Steps (Part IX: Translating SQL to KQL - More Joins)

The last post exemplified the use of "explain" to translate queries from SQL to KQL. The current post attempts to test the feature based on the various join constructs available in SQL by using the NewSales and Products tables. The post presumes that the reader has a basic understanding of the join types from SQL-based environments.

Inner Join

// inner join in SQL
--
explain
SELECT NewSales.CustomerKey
, NewSales.ProductKey
, Products.ProductName
FROM NewSales 
     JOIN Products 
      ON NewSales.ProductKey = Products.ProductKey 

// inner join in KQL
NewSales
| join kind=inner (
    Products
    | project ProductKey, ProductName 
    )
    on ($left.ProductKey == $right.ProductKey)
| project CustomerKey, ProductKey, ProductName
| limit 10

An inner join (referred to as a full join in this series) is probably the most used type of join, given that fact tables presume the existence of dimensions, even if poor data warehouse design can also lead to exceptions. The join retrieves all the matching data from both tables, including any duplicates from both sides of the join.

Left Join

// left join in SQL
--
explain
SELECT NewSales.CustomerKey
, NewSales.ProductKey
, Products.ProductName
FROM NewSales 
     LEFT JOIN Products 
      ON NewSales.ProductKey = Products.ProductKey 

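// left join in KQL
NewSales
| join kind=leftouter (
    Products
    | project ProductKey
        , Product = ProductName 
    )
    on ($left.ProductKey == $right.ProductKey)
// optional check: keep only the sales without a matching product
// (isempty is used because isnull always returns false for strings)
| where isempty(Product) 
| project CustomerKey
    , ProductKey
    , ProductName
| limit 10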

A left join retrieves all the records from the left table, typically the fact table, regardless of whether matching records were found in the dimension table. One can check whether mismatches exist by retrieving the records where no match was found.

Right Join

// right join in SQL
--
explain
SELECT NewSales.CustomerKey
, Products.ProductKey
, Products.ProductName
FROM NewSales 
     RIGHT JOIN Products 
      ON NewSales.ProductKey = Products.ProductKey 

// right join in KQL
NewSales
| join kind=rightouter (
    Products
    | project DimProductKey = ProductKey
    , DimProductName = ProductName 
    )
    on ($left.ProductKey == $right.DimProductKey)
| where isnull(ProductKey) 
| project CustomerKey
    , DimProductKey
    , DimProductName
| limit 10

A right join retrieves all the records from the dimension table together with the matches from the fact table, regardless of whether a match was found in the fact table.

Full Outer Join

// full outer join in SQL
--
explain
SELECT NewSales.CustomerKey
, Coalesce(NewSales.ProductKey, Products.ProductKey) ProductKey
, Coalesce(NewSales.ProductName, Products.ProductName) ProductName
FROM NewSales 
     FULL OUTER JOIN Products 
      ON NewSales.ProductKey = Products.ProductKey 


// full outer join in KQL
NewSales
| join kind=fullouter (
    Products
    | project DimProductKey = ProductKey
    , DimProductName = ProductName 
    )
    on ($left.ProductKey == $right.DimProductKey)
//| where isnull(ProductKey) 
| project CustomerKey
    , ProductKey = coalesce(ProductKey, DimProductKey)
    , ProductName = coalesce(ProductName, DimProductName)
| limit 10

A full outer join retrieves all the data from both sides of the join regardless of whether a match is found. In RDBMS this type of join performs poorly, especially when further joins are considered or when many records are involved on both sides of the join. Therefore, it should be avoided when possible, though in many cases it might be the only feasible solution. There are also alternatives that involve a UNION between a LEFT JOIN and a RIGHT JOIN, the latter retrieving only the records which are not found in the fact table (see last query from a previous post). This can be a feasible solution when data sharding is involved.
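The alternative mentioned above (a left join combined with the dimension records that have no match in the fact table) can be sketched in plain Python. The data is made up, and the lookup dictionary assumes that ProductKey is unique in the Products table.

```python
# Made-up sample data.
new_sales = [
    {"CustomerKey": 1, "ProductKey": 10},
    {"CustomerKey": 2, "ProductKey": 99},  # sale without a matching product
]
products = [
    {"ProductKey": 10, "ProductName": "A"},
    {"ProductKey": 11, "ProductName": "B"},  # product without sales
]

names = {p["ProductKey"]: p["ProductName"] for p in products}
sold = {s["ProductKey"] for s in new_sales}

# Left join: every sale, with the product name when one is found.
left_join = [
    (s["CustomerKey"], s["ProductKey"], names.get(s["ProductKey"]))
    for s in new_sales
]

# Right side, restricted to the products not found in the fact table.
right_only = [
    (None, p["ProductKey"], p["ProductName"])
    for p in products
    if p["ProductKey"] not in sold
]

# The union of the two sets corresponds to the full outer join.
full_outer = left_join + right_only
print(full_outer)
```

Because the second set contains only the unmatched products, the two sets are disjoint and no deduplication is needed when combining them.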

Notes:
1) If one ignores the unnecessary logic introduced by the translation via explain, the tool is excellent for learning KQL. It would be interesting to understand why the tool used a certain complex translation over another, especially when there's a performance benefit in the use of a certain piece of code.
2) Also in SQL-based queries it's recommended to start with the fact table, respectively with the table having the highest cardinality and/or the lowest level of detail, though the database engine might find an optimal plan independently of which table was written first.

Happy coding!


🌌🏭KQL Reloaded: First Steps (Part VIII: Translating SQL to KQL - Full Joins)

One of the great features of KQL is the possibility of translating SQL code to KQL via the "explain" keyword, which allows porting SQL code to KQL and helps transfer knowledge from one language to the other.

Let's start with a basic example:

// transform SQL to KQL code (to be run only the first part from --)
--
explain
SELECT top(10) CustomerKey, FirstName, LastName, CityName, CompanyName 
FROM Customers 
ORDER BY CityName DESC

// output: translated KQL code 
Customers
| project CustomerKey, FirstName, LastName, CityName, CompanyName
| sort by CityName desc nulls first
| take int(10)

The most interesting part of the translation is how "explain" translates joins from SQL to KQL. Let's start with an inner join (a full join in the terminology of this series) from the set of patterns considered in a previous post on SQL joins:

--
explain
SELECT CST.CustomerKey
, CST.FirstName + ' ' + CST.LastName CustomerName
, Cast(SAL.DateKey as Date) DateKey
, SAL.TotalCost
FROM NewSales SAL
    JOIN Customers CST
      ON SAL.CustomerKey = CST.CustomerKey 
WHERE SAL.DateKey > '20240101' AND SAL.DateKey < '20240201'
ORDER BY CustomerName, DateKey, TotalCost DESC

And, here's the translation:

// translated code
NewSales
| project-rename ['SAL.DateKey']=DateKey
| join kind=inner (Customers
| project-rename ['CST.CustomerKey']=CustomerKey
    , ['CST.CityName']=CityName
    , ['CST.CompanyName']=CompanyName
    , ['CST.ContinentName']=ContinentName
    , ['CST.Education']=Education
    , ['CST.FirstName']=FirstName
    , ['CST.Gender']=Gender
    , ['CST.LastName']=LastName
    , ['CST.MaritalStatus']=MaritalStatus
    , ['CST.Occupation']=Occupation
    , ['CST.RegionCountryName']=RegionCountryName
    , ['CST.StateProvinceName']=StateProvinceName) 
    on ($left.CustomerKey == $right.['CST.CustomerKey'])
| where ((['SAL.DateKey'] > todatetime("20240101")) 
    and (['SAL.DateKey'] < todatetime("20240201")))
| project ['CST.CustomerKey']

    , CustomerName=__sql_add(__sql_add(['CST.FirstName']
    , " "), ['CST.LastName'])
    , DateKey=['SAL.DateKey']
    , TotalCost
| sort by CustomerName asc nulls first
    , DateKey asc nulls first
    , TotalCost desc nulls first
| project-rename CustomerKey=['CST.CustomerKey']

The code was slightly formatted to facilitate its reading. Unfortunately, the tool doesn't work well with table aliases, it also introduces all the fields available from the dimension table (which can become a nightmare for big dimension tables), the concatenation seems strange, and if one looks deeper, further issues can be identified. So, the challenge is how to write a query in SQL so that it minimizes the further changes needed in KQL.

Probably, one approach is to write the backbone of the query in SQL and add the further logic after translation.

--
explain
SELECT NewSales.CustomerKey
, NewSales.DateKey 
, NewSales.TotalCost
FROM NewSales 
    INNER JOIN Customers 
      ON NewSales.CustomerKey = Customers.CustomerKey 
WHERE DateKey > '20240101' AND DateKey < '20240201'
ORDER BY NewSales.CustomerKey
, NewSales.DateKey

And the translation looks simpler:

// transformed query
NewSales
| join kind=inner 
(Customers
| project-rename ['Customers.CustomerKey']=CustomerKey
    , ['Customers.CityName']=CityName
    , ['Customers.CompanyName']=CompanyName
    , ['Customers.ContinentName']=ContinentName
    , ['Customers.Education']=Education
    , ['Customers.FirstName']=FirstName
    , ['Customers.Gender']=Gender
    , ['Customers.LastName']=LastName
    , ['Customers.MaritalStatus']=MaritalStatus
    , ['Customers.Occupation']=Occupation
    , ['Customers.RegionCountryName']=RegionCountryName
    , ['Customers.StateProvinceName']=StateProvinceName) 
    on ($left.CustomerKey == $right.['Customers.CustomerKey'])
| where ((DateKey > todatetime("20240101")) 
    and (DateKey < todatetime("20240201")))
| project CustomerKey, DateKey, TotalCost
| sort by CustomerKey asc nulls first
, DateKey asc nulls first

I would have written the query as follows:

// transformed final query
NewSales
| where (DateKey > todatetime("20240101")) 
    and (DateKey < todatetime("20240201"))
| join kind=inner (
    Customers
    | project CustomerKey
        , FirstName
        , LastName 
    ) on $left.CustomerKey == $right.CustomerKey
| project CustomerKey
    , CustomerName = strcat(FirstName, ' ', LastName)
    , DateKey
    , TotalCost
| sort by CustomerName asc nulls first
    , DateKey asc nulls first

So, it makes sense to create the backbone of a query, translate it to KQL via explain, remove the unnecessary columns and formatting, respectively add what's missing. Once the patterns are mastered, there's probably no need to use the translation tool, though there might be some exceptions. Anyway, the translation tool helps considerably in learning. Big kudos to the development team!

Notes:
1) The above queries ignore the fact that Customer information is available also in the NewSales table, making thus the joins obsolete. The joins were considered only for exemplification purposes. Similar joins might be still needed for checking the "quality" of the data (e.g. for dimensions that change over time). Even if such "surprises" shouldn't appear by design, real life designs continue to surprise...
2) Queries should be less verbose by design (aka simplicity by design)! The more unnecessary code is added, the higher the chances for errors to be overlooked, respectively the more time is needed to understand and validate the queries!

Happy coding!


References
[1] Microsoft Learn (2024) Azure: Query data using T-SQL [link]
