05 February 2025

🌌🏭KQL Reloaded: First Steps (Part IV: Left, Right, Anti-Joins and Unions)

In a standard scenario there is a fact table and multiple dimension tables (see previous post), though one can look at the same data from multiple perspectives. In KQL it's recommended to start with the fact table; however, in some reports one needs the records from the dimension table independently of whether there are any matching records in the fact table.

For example, it would be useful to show all the customers, independently of whether they bought something or not. That's what the below query does via a RIGHT JOIN between the fact table and the Customer dimension:

// totals by customer via right join
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by CustomerKey
| join kind=rightouter (
    Customers
    | where RegionCountryName in ('Canada', 'Australia')
    | project CustomerKey, RegionCountryName, CustomerName = strcat(FirstName, ' ', LastName)
    )
    on CustomerKey
| project RegionCountryName, CustomerName, TotalCost
//| summarize record_count = count() 
| order by CustomerName asc

The Product details can then be added via a LEFT JOIN between the fact table and the Product dimension:

// total by customer via right join with product information
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost)
    , FirstPurchaseDate = min(DateKey)
    , LastPurchaseDate = max(DateKey) by CustomerKey, ProductKey
| join kind=rightouter (
    Customers
    | where RegionCountryName in ('Canada', 'Australia')
    | project CustomerKey, RegionCountryName, CustomerName = strcat(FirstName, ' ', LastName)
    )
    on CustomerKey
| join kind=leftouter (
    Products 
    | where  ProductCategoryName == 'TV and Video'
    | project ProductKey, ProductName 
    )
    on ProductKey
| project RegionCountryName, CustomerName, ProductName, TotalCost, FirstPurchaseDate, LastPurchaseDate, record_count
//| where record_count>1 // multiple purchases
| order by CustomerName asc, ProductName asc

These kinds of queries need adequate validation, and for this it might be necessary to restructure them:

// validating the multiple records (detailed)
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| lookup Products on ProductKey
| lookup Customers on CustomerKey 
| where  FirstName == 'Alexandra' and LastName == 'Sanders'
| project CustomerName = strcat(FirstName, ' ', LastName), ProductName, TotalCost, DateKey, ProductKey, CustomerKey 

// validating the multiple records against the fact table (detailed)
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| where CustomerKey == 14912

Mixing RIGHT and LEFT joins in this way increases the complexity of the queries and makes validating the logic more burdensome. In SQL one would probably prefer to start with the Customer table and then add the summary data and the other dimensions. In this way one can start with a total count of the Customer table and review how the record count changes with each join added (see the sketch below).
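
For comparison, here's a sketch of that customer-first approach in KQL (same tables and filters as above; not the pattern recommended by the documentation, but handy for validating record counts step by step):

// sketch: starting from the Customers dimension and adding the fact summary
Customers
| where RegionCountryName in ('Canada', 'Australia')
| project CustomerKey, RegionCountryName, CustomerName = strcat(FirstName, ' ', LastName)
| join kind=leftouter (
    NewSales
    | where SalesAmount != 0 and ProductCategoryName == 'TV and Video'
    | where DateKey >= datetime(2023-02-01) and DateKey < datetime(2023-03-01)
    | summarize record_count = count()
        , TotalCost = sum(TotalCost) by CustomerKey
    )
    on CustomerKey
//| summarize count() // the record count should stay that of the Customers subset
| project RegionCountryName, CustomerName, TotalCost
| order by CustomerName asc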

In special scenarios, instead of a RIGHT JOIN one could use a FULL JOIN, respectively add the Customers without any orders via a UNION. In some scenarios this approach can even offer better performance. For this, one needs to get the Customers without orders via an anti-join, more exactly a rightanti:

// customers without orders
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost)
    , FirstPurchaseDate = min(DateKey)
, LastPurchaseDate = max(DateKey) by CustomerKey, ProductKey
| join kind=rightanti (
    Customers
    | project CustomerKey, RegionCountryName, CustomerName = strcat(FirstName, ' ', LastName)
    )
    on CustomerKey
| project RegionCountryName, CustomerName, ProductName = '', TotalCost = 0
//| summarize count()
| order by CustomerName asc, ProductName asc

And now, combining via a union the Customers with orders and the ones without orders gives an overview of all the customers:

// all customers (with and without orders) via union
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by CustomerKey //, ProductKey 
//| lookup Products on ProductKey 
| lookup Customers on CustomerKey
| project RegionCountryName
    , CustomerName = strcat(FirstName, ' ', LastName)
    //, ProductName
    , TotalCost
//| summarize count()
| union withsource=SourceTable kind=outer (
    // customers without orders
    NewSales
    | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
    | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
    | summarize record_count = count()
        , TotalCost = sum(TotalCost)
        , FirstPurchaseDate = min(DateKey)
        , LastPurchaseDate = max(DateKey) by CustomerKey
        //, ProductKey
    | join kind=rightanti (
        Customers
        | project CustomerKey
            , RegionCountryName
            , CustomerName = strcat(FirstName, ' ', LastName)
        )
        on CustomerKey
    | project RegionCountryName
        , CustomerName
        //, ProductName = ''
        , TotalCost = 0
    //| summarize count()
)
//| summarize count()
| order by CustomerName asc
//, ProductName asc

And, of course, the total number of records returned by the queries must match. The information related to the Product was left out for this version, though it can be added as needed. Unfortunately, there are queries even more complex than this, which makes them more difficult to read, understand and troubleshoot. Inline views could be useful to structure the logic as needed:

let T_customers_with_orders = view () { 
    NewSales
    | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
    | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
    | summarize record_count = count()
        , TotalCost = sum(TotalCost) 
        by CustomerKey
         //, ProductKey 
         //, DateKey
    //| lookup Products on ProductKey 
    | lookup Customers on CustomerKey
    | project RegionCountryName
        , CustomerName = strcat(FirstName, ' ', LastName)
        //, ProductName
        //, ProductCategoryName
        , TotalCost
        //, DateKey
};
let T_customers_without_orders = view () { 
   // customers without orders
    NewSales
    | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
    | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
    | summarize record_count = count()
        , TotalCost = sum(TotalCost)
        , FirstPurchaseDate = min(DateKey)
        , LastPurchaseDate = max(DateKey) by CustomerKey
        //, ProductKey
    | join kind=rightanti (
        Customers
        | project CustomerKey
            , RegionCountryName
            , CustomerName = strcat(FirstName, ' ', LastName)
        )
        on CustomerKey
    | project RegionCountryName
        , CustomerName
        //, ProductName = ''
        , TotalCost = 0
    //| summarize count()
};
T_customers_with_orders
| union withsource=SourceTable kind=outer T_customers_without_orders
| summarize count()

In this way the queries should be easier to troubleshoot, and the logic can be restructured into more manageable pieces.

It would be useful to save the definition of the view (aka a stored view); however, it's not possible to create views in the database against which these tests were run:

"Principal 'aaduser=...' is not authorized to write database 'ContosoSales'."

Happy coding!


04 February 2025

🌌🏭KQL Reloaded: First Steps (Part III: Basic Joins)

As data is usually dispersed over multiple tables, in any query language it is essential to know how to bring the data together for further analysis; joins allow achieving this in SQL as well as in KQL. One usually starts with the main table and joins the further tables to it. In a data warehouse, the main table is a fact table to which the dimension tables are joined to add details, respectively to filter on the dimensions.

Before starting to build any logic, it's important to get an idea of the tables' cardinality and of the keys on which the data is to be joined. Some models, like the ContosoSales model, are straightforward, the columns participating in the joins having the same names, or at least similar ones.
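
To identify the join candidates, one can inspect the tables' schemas, for example via the getschema operator (a quick sketch against the ContosoSales tables used below):

// review the columns and their data types to identify the join keys
NewSales
| getschema
| project ColumnName, ColumnType

Products
| getschema
| project ColumnName, ColumnType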

When writing and testing queries, one can start with a small subset of the data, join the tables together, and at the end validate the logic as needed. Selecting a small interval (e.g. a month, week, or even a day) and 1-2 representative records from a dimension table (e.g. Customers) helps minimize the data movement and keeps the intermediate results easy to validate:

// check how many records are available 
// Sales table: 28980 based on the filter
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()

// Products: 2517 records
Products
| summarize record_count = count()

// Customers: 18484 records
Customers 
| summarize record_count = count()

In a second step one can look at the individual values, identify unique identifiers, the columns that participate in joins, respectively other columns of interest. If needed, one can dive deeper by checking the number of records for individual values (see the sketch after the next queries). Selecting 10-100 records is usually enough for getting an idea about the dataset, though there are also many exceptions:
 
// review the data
// Sales table: 28980 based on the filter
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| take 10

// Products: 2517 records
Products
| take 10

// Customers: 18484 records
Customers 
| take 10
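
As mentioned above, a quick way to dive deeper is to check whether the presumed identifiers are unique, respectively how many records exist per value (a sketch based on the same tables):

// check whether ProductKey is unique within the Products dimension (should return no records)
Products
| summarize record_count = count() by ProductKey
| where record_count > 1

// number of sales records per customer (top 10)
NewSales
| where SalesAmount != 0 and ProductCategoryName == 'TV and Video'
| where DateKey >= datetime(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count() by CustomerKey
| top 10 by record_count desc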

In KQL one must start with the fact table (e.g. NewSales) and add the dimensions accordingly. Alternatively, it's useful to group only by ProductKey, respectively only by CustomerKey.

// review the basis output by Products & Customers
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by ProductKey, CustomerKey
| order by TotalCost desc
| summarize result_record_count = count()

Now let's join the Product dimension table to the NewSales fact table via the join operator:

// total by product via join
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by ProductKey
| join kind=inner (
    Products 
    | where  ProductCategoryName == 'TV and Video'
    )
    on ProductKey
| project ProductName, TotalCost
//| summarize record_count = count() 
| order by TotalCost desc

One can now join the second dimension table as well, select only the columns in scope, respectively do further calculations (see CustomerName):

// total by product & customer via joins
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by ProductKey, CustomerKey
| join kind=inner (
    Products 
    | where  ProductCategoryName == 'TV and Video'
    | project ProductKey, ProductName 
    )
    on ProductKey
| join kind=inner (
    Customers
    | project CustomerKey, CustomerName = strcat(FirstName, ' ', LastName)
    )
    on CustomerKey
| project CustomerName, ProductName, TotalCost
//| summarize record_count = count() 
| order by CustomerName asc, ProductName asc

Once the query is built, one can go ahead and remove the pieces of logic that aren't needed, respectively reevaluate the number of records with a count (see the commented line).

Alternatively, one can rewrite the above query as follows via the lookup operator:

// total by Product & Customer via lookup
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by ProductKey, CustomerKey
| lookup Products on ProductKey
| lookup Customers on CustomerKey 
//| summarize record_count = count() 
| project CustomerName = strcat(FirstName, ' ', LastName), ProductName, TotalCost

The last two queries should produce the same results. The first query allows more flexibility in what concerns the constraints applied, while the second is more compact and also makes it easier to test how the number of records changes as each join is added. This set of tests is done to make sure that duplicates aren't created in the process, respectively that records are not removed unnecessarily from the logic (see the sketch below).
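
A minimal sketch of such a duplicate check, re-counting the records per combination of keys after the lookups (any value above 1 would indicate duplicates introduced by the joins):

// check for duplicates introduced by the lookups/joins (should return no records)
NewSales
| where SalesAmount != 0 and ProductCategoryName == 'TV and Video'
| where DateKey >= datetime(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by ProductKey, CustomerKey
| lookup Products on ProductKey
| lookup Customers on CustomerKey
| summarize output_count = count() by ProductKey, CustomerKey
| where output_count > 1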

If the fields used in the joins don't have the same name, which happens in many data models, one can use $left and $right to refer to the tables participating in the join. This syntax makes the query a bit more verbose, though it brings more clarity. Even if some vendors provide alternative syntax (e.g. USING in Oracle), one may prefer the syntax that offers more clarity.

// total by product & customer via lookup with $left/$right
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by ProductKey, CustomerKey
| lookup Products on $left.ProductKey == $right.ProductKey
| lookup Customers on $left.CustomerKey == $right.CustomerKey
| project CustomerName = strcat(FirstName, ' ', LastName), ProductName, TotalCost

Notes:
1) For readability purposes, it might be useful to indent the lines that are a continuation of the previous line. Therefore, if a line doesn't start with a pipe ("|"), it is indented by a few characters (a tab).
2) According to the documentation (see [1]), the "lookup" operator optimizes the performance of queries where a fact table is enriched with data from a dimension table.

Happy coding!


References:
[1] Microsoft Learn (2024) Kusto Tutorial: Join data from multiple tables [link]

🧭Business Intelligence: Perspectives (Part XXVI: Monitoring - A Cockpit View)

Business Intelligence Series

The monitoring of business imperatives is sometimes compared metaphorically with piloting an airplane, where pilots look at the cockpit instruments to verify whether everything is under control and the flight proceeds according to the expectations. The use of a cockpit is supported by the fact that an airplane is an almost "closed" system in which the components were developed under strict requirements and tested thoroughly under specific technical conditions. Many instruments were engineered and evolved over decades to operate as such. The processes are standardized, inputs and outputs are under strict control, otherwise the whole edifice would crumble under its own complexity.

In organizational setups, a similar approach is attempted for monitoring the most important aspects of a business. A few dashboards and reports are thus built to monitor and control what’s happening in the areas which were identified as critical for the organization. The various gauges and other visuals were designed to provide similar perspectives as the ones provided by an airplane’s cockpit. At first sight the cockpit metaphor makes sense, though at careful analysis, there are major differences. 

Probably, the main difference is that businesses don't necessarily have standardized processes that were brought under control (and thus have variation). Secondly, the data used doesn't necessarily have the needed quality and occasionally isn't fit for use in the business processes, including supporting processes like reporting or decision making. Thirdly, the chances are high that the monitoring within the BI infrastructure doesn't address the critical aspects of the business, at least not at the needed level of focus, detail or frequency. The interplay between these three main aspects can lead to complex issues and a muddy ground for a business to build a stable edifice upon.

The comparison with an airplane's cockpit was chosen because the number of instruments available for monitoring is somewhat comparable with the number of visuals existing in an organization. In contrast, cars have a smaller number of controls, simple enough to help the one(s) behind the wheel. A car's monitoring capabilities can probably reflect the needs of single departments or teams, though each unit needs its own gauges with a specific business focus. The parallel is however limited because the areas of focus in organizations can change and shift in other directions, some topics may have a periodic character while others can regain momentum after a long time.

There are further important aspects. At a high level, the expectation is for software products and processes, including the ones related to BI topics, to have the same stability and quality as the mass production of automobiles, airplanes or other artifacts that have similar complexity and manufacturing characteristics. Even if the design process of software and manufacturing may share many characteristics, the similarities diverge as soon as the production processes start, respectively progress, and these are the areas where most of the differences lie. Starting from the requirements and ending with the overall goals, everything resembles quick-shifting sands on which it is challenging to build any stable edifice.

At the micro level, in manufacturing each piece is carefully designed and produced according to a set of characteristics that were proven to work. Everything must fit perfectly into the grand design, and there are many tests and steps to make sure that happens. To some degree the same is attempted when building software products, though the processes break along the way under the many changes attempted, respectively the many cost, time and quality constraints. At some point the overall complexity kicks back; it might still be manageable, though the overall effort is higher than what organizations bargained for.

02 February 2025

🏭 💠Data Warehousing: Microsoft Fabric (Part VIII: More on SQL Databases)

Business Intelligence Series

Last week Microsoft had a great session [1] on the present and future of SQL databases, a “light” version of Azure SQL databases designed for Microsoft Fabric environments, respectively workloads. SQL databases are currently available for testing, and after the first tests the product looks promising. Even if there are several feature gaps, it’s expected that Microsoft will bridge the gaps over time. Conversely, there might be features that don’t make sense in Fabric, respectively new features that need to be considered for facilitating the work in OneLake and its ecosystem.

During the session several Microsoft professionals answered the audience's questions, and they did a great job. Even if the answers and questions barely scratched the surface, they offered some insight into what Microsoft wants to do. Probably the expectation is that SQL databases won't need any administration - indexes being maintained automatically, infrastructure scaling as needed - however, everything sounds too good to be true if one considers the general experience with RDBMS: the devil usually hides in the details.

Even if the solutions built follow the best practices in the field, which frankly seldom happens, transferring the existing knowledge to Fabric may encounter some important challenges revolving around performance, flexibility, accessibility and probably costs. Even if SQL databases are expected to fill some minor gaps, considering the lessons of the past, such solutions can easily grow. Even if a lot of processing power is thrown at the SQL queries and the various functionality, customers still need to write quality code and refactor it, otherwise the costs will explode sooner or later.

As practice has proven so many times while troubleshooting performance issues, sometimes one needs to use all the arsenal available - DBCC, DMVs and sometimes even undocumented features - to get a better understanding of what's happening. Even if there are some voices stating that developers don't need to know how the SQL engine works, just applying solutions blindly after a recipe may occasionally add value, though most likely it doesn't exploit the full potential available. Unfortunately, this is a subjective topic without hard numbers to support it, and the stories told by developers and third parties usually don't tell the whole story.

It's also true that diving deep into a database's internal workings requires time, which is quite often not available, and the value of such an effort doesn't necessarily pay off. Beyond this, there's a software engineer's aim of understanding how things work. Otherwise, one should drop the engineering word and just call it coding. Conversely, the data citizen just needs a high-level knowledge of how things work, though the past 20-30 years proved that that's often not enough. The more people lack the required knowledge, the higher the chances that code needs refactoring. Just remember the past issues organizations had with MS Access and Excel when people started to create their own solutions, the whole infrastructure being invaded by poorly designed solutions that continue to haunt some organizations even today.

Even if a lot of technical knowledge can be transferred to Microsoft Fabric, the new environments may still require adequate tools for monitoring and troubleshooting. Microsoft seems to work in this direction, though from the information available the tools don't and can't offer the whole perspective. It will be interesting to see how much the current, respectively the future dashboards and various reports can help, respectively what important gaps will surface. Until the gaps are addressed, the SQL professional probably must rely on SQL scripts and the DMVs available. All this can be summarized in a few words: it will not be boring!


References:
[1] Microsoft Reactor (2025) Ask The Expert - Fabric Edition - Fabric Databases [link]

31 January 2025

🌌🏭KQL Reloaded: First Steps (Part II: Working with Dates)

Knowing how to work with dates and manipulate them accordingly is probably one of the most important pieces of knowledge in any query language. Fortunately, KQL has many date-related functions that can help developers in the process, and the available documentation is detailed enough to support users in learning their use. The current post focuses on the main uses of dates.

Please note that the scripts consider only the main time units. Please check the documentation of the respective functions for other time units.

Create dates via make_datetime, now and ago functions:

// creating dates from parts
print datetime1 = make_datetime(2025,1,1)
, datetime2 = make_datetime(2025,1,1,0,0,0)
, datetime3 = make_datetime(2025,1,1,1,2,3)

// creating dates from values
print date1 = datetime(2024-02-01)
, datetime(2024-02-01 00:00:00)
, datetime(2024-02-01 23:59:59)
, datetime(2024-02-01 23:59:59.123)

// now vs todatetime (return the same value)
print datetime1 = now()
, datetime2 = todatetime(now())
, datetime3 = ago(0h)

Extract information from dates via datetime_part, substring and individual functions: getyear, week_of_year, monthofyear, dayofyear, hourofday, dayofmonth, dayofweek, endofyear, endofweek, endofmonth, endofday, startofyear, startofweek, startofmonth, startofday:

// date parts from date
print now = now()
, year = datetime_part('year', now())
, month = datetime_part('month', now())
, day = datetime_part('day', now())
, hour = datetime_part('hour', now())
, minute = datetime_part('minute', now())
, second = datetime_part('second', now())

// date parts from string 
let t = datetime("2024-12-31 10:35:59");
print year = substring(t, 0, 4) 
, month = substring(t, 5, 2) 
, day = substring(t, 8, 2) 
, hour = substring(t, 11, 2) 
, minute = substring(t, 14, 2) 
, second = substring(t, 17, 2) 

// date parts via functions
print year = getyear(now())
, week = week_of_year(now())//ISO 8601 compliant
, month = monthofyear(now())
, day = dayofyear(now())
, hour = hourofday(now())

// day functions
print day_of_year = dayofyear(now())
, day_of_month = dayofmonth(now())
, day_of_week = dayofweek(now())

// end of time dates
print year = endofyear(now())
, week = endofweek(now())
, month = endofmonth(now())
, day = endofday(now())

// start of time dates
print year = startofyear(now())
, week = startofweek(now())
, month = startofmonth(now())
, day = startofday(now())

//time units
print hours1 = time(1hour)
, minutes1 = time(1min)
, seconds1 = time(1second)
, hours2 = totimespan("0.01:00:00")
, minutes2 = totimespan("0.00:01:00")
, seconds2 = totimespan("0.00:00:01")

Working with dates via datetime_add and datetime_diff functions:

// adding time units
print year = datetime_add('year',1,now())
, month = datetime_add('month',1,now())
, day = datetime_add('day',1,now())
, hour = datetime_add('hour',1,now())
, minutes = datetime_add('minute',1,now())
, seconds =  datetime_add('second',1,now())

// date differences
print years = datetime_diff('year', now(), ago(1h))
, months = datetime_diff('month', now(), ago(1h))
, days = datetime_diff('day', now(), ago(1h))
, hours = datetime_diff('hour', now(), ago(1h))
, minutes = datetime_diff('minute', now(), ago(1h))
, seconds = datetime_diff('second', now(), ago(1h))

Working with time zones via datetime_local_to_utc and datetime_utc_to_local functions:

// local time to UTC across time zones
print ParisTime = datetime_local_to_utc(now(),'Europe/Paris')
, LondonTime = datetime_local_to_utc(now(),'Europe/London')
, BudapestTime = datetime_local_to_utc(now(),'Europe/Budapest')
, AthensTime = datetime_local_to_utc(now(),'Europe/Athens')

// UTC to local time across time zones
print ParisTime = datetime_utc_to_local(now(),'Europe/Paris')
, LondonTime = datetime_utc_to_local(ago(1h),'Europe/London')
, BudapestTime = datetime_utc_to_local(ago(-1h),'Europe/Budapest')
, AthensTime = datetime_utc_to_local(ago(-2h),'Europe/Athens')

Applying different formatting with tostring, format_timespan and format_datetime functions:

// date and time to string
print date1 = tostring(datetime(2024-02-01 00:00:00))
, time1 = tostring(totimespan("0.01:02:03"))
, hours1 = tostring(time(1hour))

// formatting timespans
let t = time("25.10:35:59.123456");
print date1 = format_timespan(t, 'dd.hh:mm:ss:FF')
, date2 = format_timespan(t, 'ddd.h:mm:ss [ffff]')

// formatting dates
let t = datetime("2024-12-31 10:35:59");
print date1 = format_datetime(t,'yyyy-MM-dd [HH:mm:ss]')
, date2 = format_datetime(t,'yy-MM-dd HH:mm:ss')
, date3 = format_datetime(t,'yy-MM-dd [HH:mm:ss tt]')

Considering the above functions, one has a good basis for working with dates. 
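
As a closing sketch, the functions can be combined with ordinary queries, for example to bucket data per month or to compute the time elapsed since an event (the table and column names below are the ones from the ContosoSales database used elsewhere in this series):

// monthly totals, using startofmonth for bucketing
NewSales
| where DateKey >= datetime(2023-01-01) and DateKey < datetime(2024-01-01)
| summarize TotalCost = sum(TotalCost) by Month = startofmonth(DateKey)
| order by Month asc

// days elapsed since each customer's last purchase
NewSales
| summarize LastPurchaseDate = max(DateKey) by CustomerKey
| extend DaysSinceLastPurchase = datetime_diff('day', now(), LastPurchaseDate)
| top 10 by DaysSinceLastPurchase asc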

Happy coding!


29 January 2025

🌌🏭KQL Reloaded: First Steps (Part I: Simple Queries)

If one has followed the Microsoft training, webcasts and other resources, it becomes clear that the Kusto Query Language (KQL) can't be ignored, as it's useful in various scenarios that deal with large volumes of data.

Even if KQL was created to query big tables, at least for the first steps it's recommended to start with a small data model, and once the basics are understood one can move to bigger databases. Moreover, it's easier to validate the logic when using small datasets.

Probably the most important aspect before starting is that KQL is case-sensitive, and this applies to everything - table and column names, operators, functions, etc. The below code can be tried in the online explorer (see [2], [3]), which makes several databases available for playing around. Users might need to register beforehand.

The following queries are based on the ContosoSales database (available via the above-mentioned links). First, here are some simple projections. Each query is preceded by a short description commented via "//" and must be run individually.

// selecting all records
Customers 

// selecting the first 10 records
Customers 
| take 10

// multiple filters
Customers 
| where CityName == 'Berkeley'
| where Occupation != 'Professional'
| take 10

// multiple filters on the same column
Customers 
| where Occupation == 'Professional' or Occupation == 'Clerical'
| take 10

// multiple filters on the same column
Customers 
| where Occupation in ('Professional','Clerical')
| take 10

// multiple filters on the same column
Customers 
| where not(Occupation in ('Professional','Clerical'))
| take 10

//subset of columns
Customers
| take 5
| project ContinentName, CityName, FirstName, LastName, Gender

Here are some example for the selection of unique values, the equivalent of SELECT DISTINCT from SQL:

//distinct values used
Customers
| distinct  Occupation

//distinct values used sorted ascendingly 
Customers
| distinct  Occupation
| sort by Occupation asc

//combinations of values used
Customers
| distinct  Occupation, Education
| sort by Occupation asc, Education asc

When aggregated data is needed, one resorts to grouping values, the equivalent of GROUP BY:

// record count
Customers 
| count

// record count for constraint
Customers 
| where CityName == 'Berkeley'
| count

// record count for constraint: returns 0 records (KQL is case sensitive)
Customers 
| where CityName == 'BERKELEY'
| count

// number of records by occupation
Customers
| summarize occupations_count = count() by Occupation

// number of records by occupation with bar chart visual
Customers
| summarize occupations_count = count() by Occupation
| render barchart

The last query renders the data directly as a bar chart, which is a cool feature, especially when one needs to understand the distribution of values. Executing the query without the last line returns the underlying dataset.

Azure Data Explorer - Chart Example

Here are some first impressions:
1) The language is relatively simple, though the transition from SQL to KQL requires time, even if the thinking process behind writing the code is not that different. For those with no SQL knowledge, the transition might be a bit more challenging, though practice makes perfect!
2) One can try to run the code line by line to understand the steps used.
3) There are also some online code converters from SQL to KQL (see for example msticpy).
4) The intellisense capabilities facilitate the overall experience. 
5) Unfortunately, there seems to be no code formatter to HTML for KQL, so one needs to compromise in one way or another.
6) For a comparison between SQL and KQL see [4].

Happy coding!


References:
[1] Microsoft Learn (2024) Kusto Query Language learning resources [link]
[2] TestingSpot Blog (2021) Learning Kusto Query Language - A tool for performance test engineers [link]
[3] Azure Data Explorer (2025) Create a free cluster to explore your data [link]
[4] Microsoft Learn (2024) SQL to Kusto Query Language cheat sheet [link]

Resources:
[R1] GitHub repository (2022) Sentinel Queries [link]
[R2] GitHub repository (2022) Kusto Query Language Grammar [link]
[R3] GitHub repository (2024) The Definitive Guide to KQL source code [link]
[R4] GitHub repository (2022) Learning Kijo [link]
[R5] Kusto gamification [link]
