
11 March 2025

🏭🎗️🗒️Microsoft Fabric: Real-Time Dashboards (RTD) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 10-Mar-2025

Real-Time Intelligence architecture [5]

[Microsoft Fabric] Real-Time Dashboard

  • [def]
    • a collection of tiles
      • optionally organized in pages
        • act as containers of tiles
        • organize tiles into logical groups
          • e.g. by data source or by subject area
        • used to create a dashboard with multiple views
          • e.g. dashboard with a drillthrough from a summary page to a details page [1]
      • each tile has 
        • an underlying query
        • a visual representation
    • exists within the context of a workspace [1]
      • always associated with the workspace used to create it [1]
  • {concept} tile
    • uses KQL snippets to retrieve data and render visuals [1]
    • can be added directly from queries written in a KQL queryset [1]
  • {concept} data source
    • reusable reference to a specific database in the same workspace as the dashboard [1]
  • {concept} parameters 
    • significantly improve dashboard rendering performance [1]
    • allow filter values to be applied as early as possible in the query (see the sketch after these notes)
      • filtering is enabled when the parameter is included in the query associated with the tiles [1]
  • {concept} cloud connection
    • uses dashboard owner's identity to give access to the underlying data source to other users [2]
    • when not used for 90 days, it will expire [2]
      • ⇒ a new gateway connection must be set up [2]
      • via Manage connections >> Gateways page >> Edit credentials and verify the user again
    • a separate connection is needed for each data source [2]
  • {feature} natively export KQL queries to a dashboard as visuals and later modify their underlying queries and visual formatting as needed [1]
    • the fully integrated dashboard experience provides improved query and visualization performance [1]
  • {feature} encrypted at rest
    • dashboards and dashboard-related metadata about users are encrypted at rest using Microsoft-managed keys [1]
  • {feature} auto refresh
    • allows the data on a dashboard to be updated automatically without manually reloading the page or clicking a refresh button [1]
    • can be set by a database editor
      • both editors and viewers can change the actual rate of auto refresh while viewing a dashboard [1]
    • database editors can limit the minimum refresh rate that any viewer can set
      • ⇐ reduces the cluster load
      • when set, database users can't set a refresh rate lower than the minimum [1]
  • {feature} explore data 
    • enables users to extend the exploration of dashboards beyond the data displayed in the tiles [3]
      • begins with viewing the data and its corresponding visualization as they appear on the tile [3]
      • users can add or remove filters and aggregations, and use further visualizations [3]
        • ⇐ no knowledge of KQL is needed [3]
  • {feature} conditional formatting
    • allows users to format data points based on their values, utilizing 
      • colors
        • {rule} color by condition
          • allows setting one or more logical conditions that must be met for a value to be colored [4]
          • available for table, stat, and multi stat visuals [4]
        • {rule} color by value
          • allows visualizing values on a color gradient [4]
          • available for table visuals [4]
      • tags
      • icons
    • can be applied either 
      • to a specific set of cells within a designated column [4]
      • to entire rows [4]
    • one or more conditional formatting rules can be applied for each visual [4]
      • when multiple rules conflict, the last rule defined takes precedence over any previous ones [4]
  • {action} export dashboard
    • dashboards can be exported to a JSON file
    • can be useful in several scenarios 
      • {scenario} version control
        • the file can be used to restore the dashboard to a previous version [1]
      • {scenario} dashboard template
        • the file can be used as template for creating new dashboards [1]
      • {scenario} manual editing
        • the file can be edited to modify the dashboard and then imported back [1]
    • ADX dashboards can be exported and imported as RT dashboards [6]
  • {action} share dashboard
    • one can specify if the user can view, edit, or share [2]
    • ⇐ the permissions are not for the underlying data [2]
      • permissions are set by defining the identity that the dashboard uses for accessing data from each data source [2]
      • {type|default} pass-through identity
        • used when authenticating to access the underlying data source [2]
        • the user is only able to view the data in the tiles [2]
      • {type} dashboard editor’s identity
        • allows the user to use the editor’s identity, and thus permissions [2]
          • the editor defines a cloud connection that the dashboard uses to connect to the relevant data source [2]
          • only editors can define cloud connections and permissions for a specific real-time dashboard [2]
            • each editor that modifies the real-time dashboard needs to set up their own cloud connection [2]
            • if a valid connection doesn't exist, the user is able to view the real-time dashboard but will only see data if they themselves have access to it [2]
  • {action} revoke a user’s access permissions
    • remove access from the dashboard [2]
    • remove the cloud connection.
      • via Settings >> Manage connections and gateways >> Options >> Remove
    • remove the user from the cloud connection.
      • via Settings >> Manage connections and gateways >> Options >> Manage users >> {select User} >> Delete
    • edit the Data source access permissions.
      • via Data source >> New data source >> edit >> Data source access >> Pass-through identity
      • ⇐ the user uses their own identity to access the data source [2]
  • {prerequisite} a workspace with a Microsoft Fabric-enabled capacity [1]
  • {prerequisite} a KQL database with data [1]
  • {setting} Users can create real-time dashboards [1]
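
A minimal sketch of a parameterized tile query follows; the _startTime/_endTime range parameters and the NewSales table are illustrative assumptions, not taken from the cited documentation:

// hypothetical tile query: _startTime/_endTime are assumed dashboard parameters
NewSales
| where DateKey between (_startTime .. _endTime)  // apply the filter as early as possible
| summarize TotalCost = sum(TotalCost) by bin(DateKey, 1d)
| render timechart
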
    References:
    [1] Microsoft Learn (2024) Fabric: Create a Real-Time Dashboard [link]
    [2] Microsoft Learn (2024) Fabric: Real-Time Dashboard permissions (preview) [link]
    [3] Microsoft Learn (2024) Fabric: Explore data in Real-Time Dashboard tiles [link]
    [4] Microsoft Learn (2024) Fabric: Apply conditional formatting in Real-Time Dashboard visuals [link]
    [5] Microsoft Learn (2025) Fabric: Real Time Intelligence L200 Pitch Deck [link]
    [6] Microsoft Fabric Updates Blog (2024) Easily recreate your ADX dashboards as Real-Time Dashboards in Fabric, by Michal Bar [link]
    [7] Microsoft Learn (2025) Create Real-Time Dashboards with Microsoft Fabric [link]

    Resources:
    [R1] Microsoft Learn (2024) Microsoft Fabric exercises [link]
    [R2] Microsoft Fabric Updates Blog (2024) Announcing Real-Time Dashboards generally available [link]
    [R3] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

    Acronyms:
    ADX - Azure Data Explorer
    KQL  - Kusto Query Language
    MF - Microsoft Fabric
    RT - Real-Time

    09 March 2025

    🏭🎗️🗒️Microsoft Fabric: Eventhouses [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 9-Mar-2025

    Real-Time Intelligence architecture [4]

    [Microsoft Fabric] Eventhouses

    • [def] 
    • a service that empowers users to extract insights and visualize data in motion
      • offers an end-to-end solution for 
        • event-driven scenarios
          • ⇐ rather than schedule-driven solutions  [1]
      • a workspace of databases
        • can be shared across projects [1]
    • allows managing multiple databases at once
      • sharing capacity and resources to optimize performance and cost
      • provides unified monitoring and management across all databases and per database [1]
    • provide a solution for handling and analyzing large volumes of data
      • particularly in scenarios requiring real-time analytics and exploration [1]
      • designed to handle real-time data streams efficiently [1]
        • lets organizations ingest, process, and analyze data in near real-time [1]
    • provide a scalable infrastructure that allows organizations to handle growing volumes of data, ensuring optimal performance and resource use.
      • preferred engine for semistructured and free text analysis
      • tailored to time-based, streaming events with structured, semistructured, and unstructured data [1]
      • allows getting data
        • from multiple sources, 
        • in multiple pipelines
          • e.g. Eventstream, SDKs, Kafka, Logstash, data flows, etc.
        • in multiple data formats [1]
      • data is automatically indexed and partitioned based on ingestion time
    • designed to optimize cost by suspending the service when not in use [1]
      • reactivating the service can lead to a latency of a few seconds [1]
        • for highly time-sensitive systems that can't tolerate this latency, use Minimum consumption setting [1] 
          • enables the service to be always available at a selected minimum level [1]
            • customers pay for 
              • the minimum compute level selected [1]
              • the actual consumption when the compute level is above the minimum set [1]
          • the specified compute is available to all the databases within the eventhouse [1]
      • {scenario} solutions that include event-based data
        • e.g. telemetry and log data, time series and IoT data, security and compliance logs, or financial records [1]
    • KQL databases 
      • can be created within an eventhouse [1]
      • can either be a standard database, or a database shortcut [1]
      • an exploratory query environment is created for each KQL Database, which can be used for exploration and data management [1]
      • data availability in OneLake can be enabled on a database or table level [1]
    • Eventhouse page 
      • serves as the central hub for all your interactions within the Eventhouse environment [1]
      • Eventhouse ribbon
        • provides quick access to essential actions within the Eventhouse
      • explorer pane
        • provides an intuitive interface for navigating between Eventhouse views and working with databases [1]
      • main view area 
        • displays the system overview details for the eventhouse [1]
    • {feature} Eventhouse monitoring
      • offers comprehensive insights into the usage and performance of the eventhouse by collecting end-to-end metrics and logs for all aspects of an Eventhouse [2]
      • part of workspace monitoring that allows you to monitor Fabric items in your workspace [2]
      • provides a set of tables that can be queried to get insights into the usage and performance of the eventhouse [2]
        • can be used to optimize the eventhouse and improve the user experience [2]
    • {feature} query logs table
      • contains the list of queries run on an Eventhouse KQL database
        • for each query, a log event record is stored in the EventhouseQueryLogs table [3]
      • can be used to
        • analyze query performance and trends [3]
        • troubleshoot slow queries [3]
        • identify heavy queries consuming large amounts of system resources [3]
        • identify the users/applications running the highest number of queries [3] (see the sketch after these notes)
    • {feature} OneLake availability
      • {benefit} allows creating one logical copy of a KQL database's data in an eventhouse by turning on the feature [4]
        • users can query the data in the KQL database in Delta Lake format via other Fabric engines [4]
          • e.g. Direct Lake mode in Power BI, Warehouse, Lakehouse, Notebooks, etc.
      • {prerequisite} a workspace with a Microsoft Fabric-enabled capacity [4]
      • {prerequisite} a KQL database with editing permissions and data [4]
      • {constraint} tables can't be renamed
      • {constraint} table schemas can't be altered
      • {constraint} RLS can't be applied to tables
      • {constraint} data can't be deleted, truncated, or purged
      • when turned on, a mirroring policy is enabled
        • can be used to monitor data latency or alter it to partition delta tables [4]
    • {feature} robust adaptive mechanism
      • intelligently batches incoming data streams into one or more Parquet files, structured for analysis [4]
      • ⇐ important when dealing with trickling data [4]
        • ⇐ writing many small Parquet files into the lake can be inefficient resulting in higher costs and poor performance [4]
      • delays write operations if there isn't enough data to create optimal Parquet files [4]
        • ensures Parquet files are optimal in size and adhere to Delta Lake best practices [4]
        • ensures that the Parquet files are primed for analysis and balances the need for prompt data availability with cost and performance considerations [4]
        • {default} the write operation can take up to 3 hours or until files of sufficient size are created [4]
          • typically the files are 200-256 MB in size
          • the value can be adjusted between 5 minutes and 3 hours [4]
            • {warning} adjusting the delay to a shorter period might result in a suboptimal delta table with a large number of small files [4]
              • can lead to inefficient query performance [4]
          • {restriction} the resultant table in OneLake is read-only and can't be optimized after creation [4]
      • delta tables can be partitioned to improve query speed [4]
        • each partition is represented as a separate column using the PartitionName listed in the Partitions list [4]
          • ⇒ OneLake copy has more columns than the source table [4]
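
    A minimal sketch of a query over the query logs follows; the EventhouseQueryLogs table is named above, but the column names used here are illustrative assumptions, so check the actual schema before use:

    // hypothetical: find the slowest queries over the last day
    // Timestamp, Identity, DurationMs and Text are assumed column names
    EventhouseQueryLogs
    | where Timestamp > ago(1d)
    | top 10 by DurationMs desc
    | project Timestamp, Identity, DurationMs, Text
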
    References:
    [1] Microsoft Learn (2025) Microsoft Fabric: Eventhouse overview [link]
    [2] Microsoft Learn (2025) Microsoft Fabric: Eventhouse monitoring [link]
    [3] Microsoft Learn (2025) Microsoft Fabric: Query logs [link]  
    [4] Microsoft Learn (2025) Microsoft Fabric: Eventhouse OneLake Availability [link]
    [5] Microsoft Learn (2025) Real Time Intelligence L200 Pitch Deck [link]

    Resources:
    [R1] Microsoft Learn (2024) Microsoft Fabric exercises [link]
    [R2] Eventhouse Monitoring (Preview) [link]
    [R3] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

    Acronyms:
    KQL - Kusto Query Language
    SDK - Software Development Kit
    RLS - Row Level Security 
    RTI - Real-Time Intelligence

    09 February 2025

    🌌🏭KQL Reloaded: First Steps (Part XI: Window Functions)

    Window functions are one of the powerful features available in RDBMS, as they allow one to operate across several rows of a result set and return a result for each row. The good news is that KQL supports several window functions, which allow one to address several scenarios. However, the support is limited and needs improvement.

    One of the basic scenarios that takes advantage of window functions is the creation of a running sum or of a (dense) rank across a whole dataset. For this purpose, in Kusto one can use row_cumsum for running sums, respectively row_number, row_rank_dense and row_rank_min for ranking. In addition, one can refer to the values of a field from the previous (prev) and next (next) records.

    // rank window functions within dataset (not partitioned)
    NewSales
    | where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
    | where CustomerKey in (12617, 12618)
    | project CustomerKey, DateKey, ProductName, TotalCost
    | sort by CustomerKey asc, DateKey asc 
    | extend Rank1 = row_rank_dense(DateKey)
        , Rank2 = row_number(0)
        , Rank3 = row_rank_min(DateKey)
        , Sum = row_cumsum(TotalCost)
        , NextDate = next(DateKey)
        , PrevDate = prev(DateKey)

    Often, one needs to operate only within a partition and not across the whole dataset. Some of the functions provide additional parameters for this:

    // rank window functions within partitions
    NewSales
    | where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
    | where CustomerKey in (12617, 12618)
    | project CustomerKey, DateKey, ProductName, TotalCost
    | sort by CustomerKey asc, DateKey asc 
    | extend RowRank1 = row_rank_dense(DateKey, prev(CustomerKey) != CustomerKey)
        , RowRank2 = row_number(0, prev(CustomerKey) != CustomerKey)
        , Sum = row_cumsum(TotalCost, prev(CustomerKey) != CustomerKey)
        , NextDate = iif(CustomerKey == next(CustomerKey), next(DateKey), datetime(null))
        , PrevDate = iif(CustomerKey == prev(CustomerKey), prev(DateKey),  datetime(null))
    

    In addition, the partitions can be defined explicitly via the partition operator:

    // creating explicit partitions
    NewSales
    | where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
    | where CustomerKey in (12617, 12618)
    | project CustomerKey, DateKey, ProductName, TotalCost
    | partition by CustomerKey
    (
        order by DateKey asc
        | extend prev_cost = prev(TotalCost, 1)
    )
    | order by CustomerKey asc, DateKey asc
    | extend DifferenceCost = TotalCost - prev_cost
    

    It will be interesting to see whether Microsoft plans to introduce further window functions in KQL to bridge the gap. Experience has shown that such functions are quite useful in data analytics, with developers sometimes going to great lengths to achieve the respective behavior (see visual calcs in Power BI). For example, SQL Server provides several types of such functions, though it took Microsoft more than a few versions to make this progress. Moreover, developers can introduce their own functions, even if this involves .NET programming.

    Happy coding!

    Previous Post <<||>> Next Post

    References:
    [1] Microsoft Learn (2024) Kusto: Window functions overview [link]

    🏭🗒️Microsoft Fabric: Kusto Query Language (KQL) [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 9-Feb-2025

    [Microsoft Fabric] Kusto Query Language (KQL)

    • {def} a read-only request to process data and return results [1]
      • designed for data exploration and summarization [1]
        • very similar to SQL
          • the explain command can be used to transform SQL into KQL code
            • ⇐ not all the SQL syntax can be translated
      • statements are executed in the order of their arrangement
        • funnel-like processing where data is piped from one operator to the next
          • data is filtered, rearranged or summarized at each step and then fed into the following step
          • operators are sequenced by a pipe (|)
      • returns data in a tabular or graph format
      • designed and developed to take advantage of cloud computing through clustering and scaling compute [2]
        • ideal engine to power fast, real-time dashboards
      • case-sensitive in general 
      • named after the undersea pioneer Jacques Cousteau [2]
      • operation sequence
        • filter data
        • aggregate data
        • order data
        • modify column output
    • supports standard data types 
      • string
        • a sequence of zero or more Unicode characters
        • characters are encoded in UTF-8.
      • int 
        • 32-bit whole-number integer
      • long
        • signed 64-bit whole-number integer
      • real (aka double)
        • a 64-bit double-precision floating-point number
        • provides high precision with decimal points
      • decimal
        • a 128-bit decimal number
        • provides the highest precision of decimal points
        • {recommendation} if precision is not needed, use the real type instead [2]
      • bool 
        • a boolean value that can be true (1), false (0), or null
      • datetime 
        • represents a date in the UTC zone
      • timespan 
        • represents a time interval
          • days, hours, minutes, seconds, milliseconds, microseconds, ticks
        • if no unit is specified, it defaults to days
      • dynamic
        • a special data type that can take 
          • any value from the other data types
          • arrays
          • a {name = value} property bag
      • guid
        • a 128-bit globally unique value
    • statement types
      • tabular expression statement
      • let statement
        • used to 
          • set variable names equal to an expression
          • create views
        • ⇐ used mostly to 
          • help break complex expressions into multiple parts, each represented by a variable
          • set constants outside the query to aid in readability
      • set statement
        • used to set request properties for the duration of the query (see the sketch after these notes)
    • {tool} Microsoft Sentinel
      • {def} a cloud native SIEM and SOAR that provides cyberthreat detection, investigation, response, and proactive hunting, with a bird's-eye view across your enterprise [3]
    • {tool} Kusto Explorer
      • {def} user-friendly interface to query and analyze data with KQL [4]
    • {tool} Azure Data Studio 
      • {def} lightweight, cross-platform data management and development tool for data professionals [5]
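
    The following is a minimal sketch tying the notes together: let statements for constants and a view-like expression, a set statement, and the funnel-like pipe sequence (filter, aggregate, order, modify column output). The NewSales table and its columns are assumptions borrowed from the examples in the related posts:

    // minimal sketch: let/set statements and the pipe-based funnel
    set notruncation;  // request property, valid for the duration of the query
    let startDate = todatetime("20240101");  // constant, aids readability
    let endDate = todatetime("20240201");
    let JanSales = NewSales  // view-like expression (assumed table)
        | where DateKey > startDate and DateKey < endDate;  // filter
    JanSales
    | summarize TotalCost = sum(TotalCost) by CustomerKey  // aggregate
    | sort by TotalCost desc  // order
    | project CustomerKey, TotalCost  // modify column output
    | take 10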

    References:
    [1] Microsoft (2024) Real-time Analytics: End-to-End Workshop
    [2] Mark Morowczynski et al (2024) The Definitive Guide to KQL: Using Kusto Query Language for Operations, Defending, and Threat Hunting
    [3] Microsoft Learn (2024) Azure: What is Microsoft Sentinel [link]
    [4] Microsoft Learn (2024) Kusto: Kusto.Explorer installation and user interface [link]
    [5] Microsoft Learn (2024) SQL: What is Azure Data Studio? [link]

    Resources:
    [R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

    Acronyms:
    KQL - Kusto Query Language
    SIEM - security information and event management
    SOAR - security orchestration, automation, and response
    SQL - Structured Query Language
    UTC - Universal Time Coordinated

    08 February 2025

    🌌🏭KQL Reloaded: First Steps (Part X: Translating SQL to KQL - Correlated Subqueries)

    In SQL Server and other RDBMS databases there are many scenarios in which one needs information from a fact table based on a dimension table without requiring information from the dimension table.

    Correlated Subquery via EXISTS

    Before considering the main example, let's start with a simple subquery:

    // subquery in SQL
    --
    explain
    SELECT CustomerKey
    , ProductKey
    , ProductName
    FROM NewSales 
    WHERE ProductKey IN (
        SELECT DISTINCT ProductKey
        FROM Products 
        WHERE ProductSubcategoryName = 'MP4&MP3'
        ) 
    
    // subquery in KQL
    NewSales
    | where ProductKey in (
        (Products
        | where (ProductSubcategoryName == "MP4&MP3")
        | project ProductKey
        | distinct *))
    | project CustomerKey, ProductKey, ProductName
    

    Of course, the ProductKey is unique by design, though there can be dimension tables, fact tables or subqueries in which the value is not unique.

    Now let's consider the correlated subquery pattern, which should provide the same outcome as above, though in RDBMS there are scenarios in which it provides better performance, especially when the number of values returned by the subquery is high.

    // correlated subquery in SQL
    --
    explain
    SELECT CustomerKey
    , ProductKey
    , ProductName
    FROM NewSales 
    WHERE EXISTS (
        SELECT Products.ProductKey
        FROM Products 
        WHERE NewSales.ProductKey = Products.ProductKey)
    

    Unfortunately, trying to translate the code via explain leads to the following error, which confirms that the syntax is not supported in KQL (see [1]):

    "Error: Reference to missing column 'NewSales.ProductKey'"

     Fortunately, in this case one can use the first version of the query. 

    Correlated Subquery via CROSS APPLY

    Before creating the main query, let's look at the inner query and check whether it gets correctly translated to KQL:

    // subquery logic
    --
    explain
    SELECT sum(TotalCost) TotalCost 
    FROM NewSales 
    WHERE DateKey > '20240101' and DateKey <'20240201'
    

    Now, let's bring the logic within the CROSS APPLY:

    // correlated subquery in SQL
    --
    explain
    SELECT ProductKey
    , ProductName
    , TotalCost
    FROM Products
        CROSS APPLY (
            SELECT sum(TotalCost) TotalCost 
            FROM NewSales 
            WHERE DateKey > '20240101' and DateKey <'20240201'
              AND Products.ProductKey = NewSales.ProductKey
        ) DAT
    

    Running the above code leads to the following error:

    "Sql node of type 'Microsoft.SqlServer.TransactSql.ScriptDom.UnqualifiedJoin' is not implemented"

    Unfortunately, many SQL queries are written following this pattern, especially when an OUTER APPLY is used, thus retrieving all the records from the dimension table.

    In this case one can rewrite the query via a LEFT JOIN (translated to a right outer join in the restructured KQL below):

    // correlated subquery in SQL
    --
    explain
    SELECT PRD.ProductKey
    , PRD.ProductName
    , SAL.TotalCost
    FROM Products PRD
        LEFT JOIN (
            SELECT ProductKey
            , sum(TotalCost) TotalCost 
            FROM NewSales 
            WHERE DateKey > '20240101' and DateKey <'20240201'
            GROUP BY ProductKey
        ) SAL
          ON PRD.ProductKey = SAL.ProductKey
    
    // direct translation of the query
    Products
    | join kind=leftouter 
        (NewSales
            | where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
            | summarize TotalCost=sum(TotalCost) by ProductKey
            | project ProductKey, TotalCost
        ) on ($left.ProductKey == $right.ProductKey)
    | project ProductKey, ProductName, TotalCost
    //| where isnull(TotalCost)
    //| summarize record_number = count()
    
    
    // query after restructuring
    NewSales
    | where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
    | summarize TotalCost=sum(TotalCost) by ProductKey
    | join kind=rightouter
        (
            Products
            | project ProductKey, ProductName
        ) on ($left.ProductKey == $right.ProductKey)
    | project ProductKey, ProductName, TotalCost
    //| where isnull(TotalCost)
    //| summarize record_number = count()
    

    During transformations it's important to check whether the number of records changes between the various versions of the query (including against the most general version, before the filtering constraints were applied).

    Especially when SQL solutions are planned to be migrated to KQL, it's important to know which query patterns can be used in KQL. 

    Happy coding!

    Previous Post <<||>>  Next Post

    References:
    [1] GitHub (2024) Module Ex-01 - Advanced KQL [link]

    🌌🏭KQL Reloaded: First Steps (Part IX: Translating SQL to KQL - More Joins)

    The last post exemplified the use of "explain" to translate queries from SQL to KQL. The current post attempts to test the feature on the various join constructs available in SQL by using the NewSales and Products tables. The post presumes that the reader has a basic understanding of the join types from SQL-based environments.

    Inner Join

    // inner join in SQL
    --
    explain
    SELECT NewSales.CustomerKey
    , NewSales.ProductKey
    , Products.ProductName
    FROM NewSales 
         JOIN Products 
          ON NewSales.ProductKey = Products.ProductKey 
    
    // inner join in KQL
    NewSales
    | join kind=inner (
        Products
        | project ProductKey, ProductName 
        )
        on ($left.ProductKey == $right.ProductKey)
    | project CustomerKey, ProductKey, ProductName
    | limit 10
    

    An inner join is probably the most used type of join given that fact tables presume the existence of dimensions, even if poor data warehousing design can also lead to exceptions. The join retrieves all the matching data from both tables, including the eventual duplicates from both sides of the join.

    Left Join

    // left join in SQL
    --
    explain
    SELECT NewSales.CustomerKey
    , NewSales.ProductKey
    , Products.ProductName
    FROM NewSales 
         LEFT JOIN Products 
          ON NewSales.ProductKey = Products.ProductKey 
    
    // left join in KQL
    NewSales
    | join kind=leftouter (
        Products
        | project ProductKey
            , Product = ProductName 
        )
        on ($left.ProductKey == $right.ProductKey)
    | where isnull(Product) 
    | project CustomerKey
        , ProductKey
        , Product
    | limit 10
    

    A left join retrieves all the records from the left table, typically the fact table, independently of whether matches were found in the dimension table. One can check whether mismatches exist by retrieving the records where no match was found.

    Right Join

    // right join in SQL
    --
    explain
    SELECT NewSales.CustomerKey
    , Products.ProductKey
    , Products.ProductName
    FROM NewSales 
         RIGHT JOIN Products 
          ON NewSales.ProductKey = Products.ProductKey 
    
    // right join in KQL
    NewSales
    | join kind=rightouter (
        Products
        | project DimProductKey = ProductKey
        , DimProductName = ProductName 
        )
        on ($left.ProductKey == $right.DimProductKey)
    | where isnull(ProductKey) 
    | project CustomerKey
        , DimProductKey
        , DimProductName
    | limit 10
    

    A right join retrieves the records from the dimension together with the matches from the fact table, independently of whether a match was found in the fact table.

    Full Outer Join

    // full outer join in SQL
    --
    explain
    SELECT NewSales.CustomerKey
    , Coalesce(NewSales.ProductKey, Products.ProductKey) ProductKey
    , Coalesce(NewSales.ProductName, Products.ProductName) ProductName
    FROM NewSales 
         FULL OUTER JOIN Products 
          ON NewSales.ProductKey = Products.ProductKey 
    
    
    // full outer join in KQL
    NewSales
    | join kind=fullouter (
        Products
        | project DimProductKey = ProductKey
        , DimProductName = ProductName 
        )
        on ($left.ProductKey == $right.DimProductKey)
    //| where isnull(ProductKey) 
    | project CustomerKey
        , ProductKey = coalesce(ProductKey, DimProductKey)
        , ProductName = coalesce(ProductName, DimProductName)
    | limit 10
    

    A full outer join retrieves all the data from both sides of the join independently of whether a match is found. In RDBMS this type of join performs poorly, especially when further joins are considered, respectively when many records are involved on both sides of the join. Therefore it should be avoided when possible, though in many cases it might be the only feasible solution. There are also alternatives that involve a UNION between a LEFT JOIN and a RIGHT JOIN, the latter retrieving only the records which are not found in the fact table (see the last query from a previous post and the sketch below). This can be a feasible solution when data sharding is involved.
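
    For illustration, here's a minimal sketch of that UNION alternative in KQL, reusing the NewSales and Products tables from above; the rightanti join kind retrieves the dimension records without a match in the fact table:

    // sketch: emulate a full outer join via a union of a left outer join
    // and the unmatched dimension records (rightanti)
    let matched = NewSales
        | join kind=leftouter (
            Products
            | project DimProductKey = ProductKey, DimProductName = ProductName
            ) on $left.ProductKey == $right.DimProductKey;
    let unmatched = NewSales
        | join kind=rightanti (
            Products
            | project DimProductKey = ProductKey, DimProductName = ProductName
            ) on $left.ProductKey == $right.DimProductKey;
    matched
    | union unmatched
    | project CustomerKey
        , ProductKey = coalesce(ProductKey, DimProductKey)
        , ProductName = coalesce(ProductName, DimProductName)
    | limit 10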

    Notes:
    1) If one ignores the unnecessary logic introduced by the translation via explain, the tool is excellent for learning KQL. It would be interesting to understand why the tool used a certain complex translation over another, especially when there's a performance benefit in the use of a certain piece of code.
    2) As in SQL-based queries, it's recommended to start with the fact table, i.e. the table having the highest cardinality and/or the lowest level of detail, though the database engine might find an optimal plan independently of which table is written first.

    Happy coding!

    Previous Post <<||>> Next Post

    🌌🏭KQL Reloaded: First Steps (Part VIII: Translating SQL to KQL - Full Joins)

    One of the great features of KQL is the possibility of translating SQL code to KQL via the "explain" keyword, thus allowing one to port SQL code to KQL, respectively to transfer knowledge from one query language to another.

    Let's start with a basic example:

    // transform SQL to KQL code (run only the part starting with --)
    --
    explain
    SELECT top(10) CustomerKey, FirstName, LastName, CityName, CompanyName 
    FROM Customers 
    ORDER BY CityName DESC
    
    // output: translated KQL code 
    Customers
    | project CustomerKey, FirstName, LastName, CityName, CompanyName
    | sort by CityName desc nulls first
    | take int(10)
    

    The most interesting part of the translation is how "explain" translates joins from SQL to KQL. Let's start with a FULL JOIN from the set of patterns considered in a previous post on SQL joins:

    --
    explain
    SELECT CST.CustomerKey
    , CST.FirstName + ' ' + CST.LastName CustomerName
    , Cast(SAL.DateKey as Date) DateKey
    , SAL.TotalCost
    FROM NewSales SAL
        JOIN Customers CST
          ON SAL.CustomerKey = CST.CustomerKey 
    WHERE SAL.DateKey > '20240101' AND SAL.DateKey < '20240201'
    ORDER BY CustomerName, DateKey, TotalCost DESC

    And, here's the translation:

    // translated code
    NewSales
    | project-rename ['SAL.DateKey']=DateKey
    | join kind=inner (Customers
    | project-rename ['CST.CustomerKey']=CustomerKey
        , ['CST.CityName']=CityName
        , ['CST.CompanyName']=CompanyName
        , ['CST.ContinentName']=ContinentName
        , ['CST.Education']=Education
        , ['CST.FirstName']=FirstName
        , ['CST.Gender']=Gender
        , ['CST.LastName']=LastName
        , ['CST.MaritalStatus']=MaritalStatus
        , ['CST.Occupation']=Occupation
        , ['CST.RegionCountryName']=RegionCountryName
        , ['CST.StateProvinceName']=StateProvinceName) 
        on ($left.CustomerKey == $right.['CST.CustomerKey'])
    | where ((['SAL.DateKey'] > todatetime("20240101")) 
        and (['SAL.DateKey'] < todatetime("20240201")))
    | project ['CST.CustomerKey']
        , CustomerName=__sql_add(__sql_add(['CST.FirstName']
        , " "), ['CST.LastName'])
        , DateKey=['SAL.DateKey']
        , TotalCost
    | sort by CustomerName asc nulls first
        , DateKey asc nulls first
        , TotalCost desc nulls first
    | project-rename CustomerKey=['CST.CustomerKey']
    

    The code was slightly formatted to facilitate its reading. Unfortunately, the tool doesn't work well with table aliases, and it also introduces all the fields available from the dimension table, which can become a nightmare for big dimension tables; the concatenation seems strange, and if one looks deeper, further issues can be identified. So, the challenge is how to write a query in SQL so as to minimize the further changes needed in KQL.

    Probably, one approach is to write the backbone of the query in SQL and add the further logic after translation. 

    --
    explain
    SELECT NewSales.CustomerKey
    , NewSales.DateKey 
    , NewSales.TotalCost
    FROM NewSales 
        INNER JOIN Customers 
          ON NewSales.CustomerKey = Customers.CustomerKey 
    WHERE DateKey > '20240101' AND DateKey < '20240201'
    ORDER BY NewSales.CustomerKey
    , NewSales.DateKey
    

    And the translation looks simpler:

    // transformed query
    NewSales
    | join kind=inner 
    (Customers
    | project-rename ['Customers.CustomerKey']=CustomerKey
        , ['Customers.CityName']=CityName
        , ['Customers.CompanyName']=CompanyName
        , ['Customers.ContinentName']=ContinentName
        , ['Customers.Education']=Education
        , ['Customers.FirstName']=FirstName
        , ['Customers.Gender']=Gender
        , ['Customers.LastName']=LastName
        , ['Customers.MaritalStatus']=MaritalStatus
        , ['Customers.Occupation']=Occupation
        , ['Customers.RegionCountryName']=RegionCountryName
        , ['Customers.StateProvinceName']=StateProvinceName) 
        on ($left.CustomerKey == $right.['Customers.CustomerKey'])
    | where ((DateKey > todatetime("20240101")) 
        and (DateKey < todatetime("20240201")))
    | project CustomerKey, DateKey, TotalCost
    | sort by CustomerKey asc nulls first
    , DateKey asc nulls first
    

    I would have written the query as follows:

    // transformed final query
    NewSales
    | where (DateKey > todatetime("20240101")) 
        and (DateKey < todatetime("20240201"))
    | join kind=inner (
        Customers
        | project CustomerKey
            , FirstName
            , LastName 
        ) on $left.CustomerKey == $right.CustomerKey
    | project CustomerKey
        , CustomerName = strcat(FirstName, ' ', LastName)
        , DateKey
        , TotalCost
    | sort by CustomerName asc nulls first
        , DateKey asc nulls first
    

    So, it makes sense to create the backbone of a query, translate it to KQL via explain, remove the unnecessary columns and formatting, respectively add what's missing. Once the patterns are mastered, there's probably no need to use the translation tool, though there may also prove to be some exceptions. Anyway, the translation tool helps considerably in learning. Big kudos to the development team!

    Notes:
    1) The above queries ignore the fact that Customer information is available also in the NewSales table, thus making the joins redundant. The joins were considered only for exemplification purposes. Similar joins might still be needed for checking the "quality" of the data (e.g. for dimensions that change over time). Even if such "surprises" shouldn't appear by design, real-life designs continue to surprise...
    2) Queries should be less verbose by design (aka simplicity by design)! The more unnecessary code is added, the higher the chances for errors to be overlooked, respectively the more time is needed to understand and validate the queries!

    Happy coding!

    Previous Post <<||>> Next Post

    References:
    [1] Microsoft Learn (2024) Azure: Query data using T-SQL [link]

