
23 February 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part X: Templates for Database Objects)

One of the new features noticed in SQL databases while working on the previous post is the availability of templates for database objects. The functionality is useful even if it's kept to a minimum. Probably, more value can be obtained when it's used in combination with Copilot, which requires at least an F12 capacity [1].

Schemas

Schemas are used to create a logical grouping of objects such as tables, stored procedures, and functions. From a structural and security point of view, it makes sense to create additional schemas to manage the various database objects and use the default dbo schema only occasionally (e.g. for globally created objects).

-- generated template - schema
CREATE SCHEMA SchemaName

-- create schema
CREATE SCHEMA Test

One can query sys.schemas to retrieve all the schemas available:

-- retrieve all schemas
SELECT schema_id
, name
, principal_id
FROM sys.schemas
ORDER BY schema_id

Tables

Tables, as the database objects that contain all the data in a database, are probably the elements that need the greatest attention in design and data processing. In some cases a table can be denormalized and store all the data needed, much like in MS Excel, respectively normalized into fact and dimension tables.

Tables can be created explicitly by defining their structure in advance (see Option 1), respectively on the fly (see Option 2). 

-- Option 1
-- create the table manually (alternative to creating it on the fly)
CREATE TABLE [Test].[Customers](
	[CustomerId] [int] NOT NULL,
	[AddressID] [int] NULL,
	[Title] [nvarchar](8) NULL,
	[FirstName] [nvarchar](50) NULL,
	[LastName] [nvarchar](50) NULL,
	[CompanyName] [nvarchar](128) NULL,
	[SalesPerson] [nvarchar](256) NULL
) ON [PRIMARY]
GO

-- insert records
INSERT INTO Test.Customers (CustomerId, Title, FirstName, LastName, CompanyName, SalesPerson)
SELECT CustomerId
, Title
, FirstName 
, LastName
, CompanyName
, SalesPerson
FROM SalesLT.Customer

-- checking the output
SELECT top 100 *
FROM Test.Customers
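
Alternatively (Option 2), the table can be created on the fly via SELECT ... INTO. A minimal sketch, in which the Test.CustomersOTF name is used only for illustration:

-- Option 2
-- create the table on the fly (sketch; the target name is only for illustration)
-- DROP TABLE IF EXISTS Test.CustomersOTF
SELECT CustomerId
, Title
, FirstName 
, LastName
, CompanyName
, SalesPerson
INTO Test.CustomersOTF
FROM SalesLT.Customer

The table created this way inherits the columns' data types from the source; if created, it should be dropped as well before dropping the Test schema at the end.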

One can query sys.tables to retrieve all the tables available:

-- retrieve all tables
SELECT schema_name(schema_id) schema_name
, object_id
, name
FROM sys.tables
ORDER BY schema_name
, name

Views

Views are much like virtual tables based on the result set of an SQL statement that combines data from one or multiple tables. They can be used to encapsulate logic, respectively to project a subset of the data horizontally or vertically. 

-- create view
CREATE OR ALTER VIEW Test.vCustomers
-- Customers 
AS
SELECT CST.CustomerId 
, CST.Title
, CST.FirstName 
, IsNull(CST.MiddleName, '') MiddleName
, CST.LastName 
, CST.CompanyName 
, CST.SalesPerson 
FROM SalesLT.Customer CST

-- test the view 
SELECT *
FROM Test.vCustomers
WHERE CompanyName = 'A Bike Store'

One can query sys.views to retrieve all the views available:

-- retrieve all views
SELECT schema_name(schema_id) schema_name
, object_id
, name
FROM sys.views
ORDER BY schema_name
, name

User-Defined Functions

A user-defined function (UDF) allows one to create a function based on a SQL expression. It can be used alone or as part of a query, as in the example below.

-- generated template - user defined function
CREATE FUNCTION [dbo].[FunctionName] (
    @param1 INT,
    @param2 INT
)
RETURNS INT AS BEGIN RETURN
    @param1 + @param2
END

-- user-defined function: 
CREATE OR ALTER FUNCTION Test.GetFirstMiddleLastName (
    @FirstName nvarchar(50),
    @MiddleName nvarchar(50),
    @LastName nvarchar(50)
)
RETURNS nvarchar(150) AS 
BEGIN 
   RETURN IsNull(@FirstName, '') + IsNull(' ' + @MiddleName, '') + IsNull(' ' + @LastName, '') 
END

-- test UDF on single values
SELECT Test.GetFirstMiddleLastName ('Jack', NULL, 'Sparrow')
SELECT Test.GetFirstMiddleLastName ('Jack', 'L.', 'Sparrow')

-- test UDF on a whole table
SELECT TOP 100 Test.GetFirstMiddleLastName (FirstName, MiddleName, LastName)
FROM SalesLT.Customer

One can query sys.objects to retrieve all the scalar functions available:

-- retrieve all scalar functions
SELECT schema_name(schema_id) schema_name
, name
, object_id
FROM sys.objects 
WHERE type_desc = 'SQL_SCALAR_FUNCTION'
ORDER BY schema_name
, name

However, UDFs prove especially useful when they mix the capabilities of functions with the ones of views, allowing one to create a "parameterized view" (see the next example) or even to encapsulate a multi-statement body that returns a dataset (see the sketch further below). Currently, there seems to be no template available for creating such functions.

-- table-valued function
CREATE OR ALTER FUNCTION Test.tvfGetCustomers (
    @CompanyName nvarchar(50) NULL
)
RETURNS TABLE
-- Customers by Company
AS
RETURN (
	SELECT CST.CustomerId 
	, CST.CompanyName
	, CST.Title
	, IsNull(CST.FirstName, '') + IsNull(' ' + CST.MiddleName, '') + IsNull(' ' + CST.LastName, '') FullName
	, CST.FirstName 
	, CST.MiddleName 
	, CST.LastName 
	FROM SalesLT.Customer CST
	WHERE CST.CompanyName = IsNull(@CompanyName, CST.CompanyName)
);

-- test function for values
SELECT *
FROM Test.tvfGetCustomers ('A Bike Store')
ORDER BY CompanyName
, FullName

-- test function for retrieving all values
SELECT *
FROM Test.tvfGetCustomers (NULL)
ORDER BY CompanyName
, FullName

One can query sys.objects to retrieve all the table-valued functions available:

-- retrieve all table-valued functions
SELECT schema_name(schema_id) schema_name
, name
, object_id
FROM sys.objects 
WHERE type_desc = 'SQL_INLINE_TABLE_VALUED_FUNCTION'
ORDER BY schema_name
, name
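
For the multi-statement case mentioned above there seems to be no template either. Below is a minimal hand-written sketch, assuming the engine supports multi-statement table-valued functions (as SQL Server and Azure SQL Database do); the Test.mtvfGetCustomersByCompany name is only for illustration:

-- multi-statement table-valued function (sketch)
-- DROP FUNCTION IF EXISTS Test.mtvfGetCustomersByCompany
CREATE OR ALTER FUNCTION Test.mtvfGetCustomersByCompany (
    @CompanyName nvarchar(128)
)
RETURNS @Customers TABLE (
    CustomerId int NOT NULL,
    CompanyName nvarchar(128) NULL,
    FullName nvarchar(160) NULL
)
AS
BEGIN
    -- the body can contain several statements before returning the table variable
    INSERT INTO @Customers (CustomerId, CompanyName, FullName)
    SELECT CST.CustomerId
    , CST.CompanyName
    , IsNull(CST.FirstName, '') + IsNull(' ' + CST.MiddleName, '') + IsNull(' ' + CST.LastName, '')
    FROM SalesLT.Customer CST
    WHERE CST.CompanyName = IsNull(@CompanyName, CST.CompanyName);

    RETURN;
END
GO

-- test the function
SELECT *
FROM Test.mtvfGetCustomersByCompany('A Bike Store')

Such functions appear in sys.objects with the type_desc 'SQL_TABLE_VALUED_FUNCTION' instead of 'SQL_INLINE_TABLE_VALUED_FUNCTION', and, if created, they should be dropped as well before dropping the Test schema.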

Stored Procedures

A stored procedure is a prepared SQL statement that is stored as a database object and precompiled. Typically, the statements considered in SQL functions can also be implemented as stored procedures; however, the latter don't allow reusing the output directly (see the workaround sketched below).

-- get customers by company
CREATE OR ALTER PROCEDURE Test.spGetCustomersByCompany (
    @CompanyName nvarchar(50) NULL
)
AS
BEGIN
	SELECT CST.CustomerId 
	, CST.CompanyName
	, CST.Title
	, IsNull(CST.FirstName, '') + IsNull(' ' + CST.MiddleName, '') + IsNull(' ' + CST.LastName, '') FullName
	, CST.FirstName 
	, CST.MiddleName 
	, CST.LastName 
	FROM SalesLT.Customer CST
	WHERE CST.CompanyName = IsNull(@CompanyName, CST.CompanyName)
	ORDER BY CST.CompanyName
	, FullName
END 

-- test the procedure 
EXEC Test.spGetCustomersByCompany NULL -- all customers
EXEC Test.spGetCustomersByCompany 'A Bike Store' -- individual customer
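
If the procedure's output needs to be reused further, a common workaround is to capture it into a temporary table via INSERT ... EXEC. A minimal sketch, in which the column definitions mirror the procedure's result set:

-- capture the procedure's output into a temporary table
CREATE TABLE #Customers (
    CustomerId int NOT NULL,
    CompanyName nvarchar(128) NULL,
    Title nvarchar(8) NULL,
    FullName nvarchar(160) NULL,
    FirstName nvarchar(50) NULL,
    MiddleName nvarchar(50) NULL,
    LastName nvarchar(50) NULL
)

INSERT INTO #Customers
EXEC Test.spGetCustomersByCompany 'A Bike Store'

-- reuse the output
SELECT *
FROM #Customers

-- cleanup
DROP TABLE IF EXISTS #Customers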

One can query sys.objects to retrieve all the stored procedures available:

-- retrieve all stored procedures
SELECT schema_name(schema_id) schema_name
, name
, object_id
FROM sys.objects 
WHERE type_desc = 'SQL_STORED_PROCEDURE'
ORDER BY schema_name
, name

In the end, don't forget to drop the objects created above (note the order of the dependencies):

-- drop function 
DROP FUNCTION IF EXISTS Test.GetFirstMiddleLastName

-- drop function 
DROP FUNCTION IF EXISTS Test.tvfGetCustomers 
-- drop procedure 
DROP PROCEDURE IF EXISTS Test.spGetCustomersByCompany

-- drop view 
DROP VIEW IF EXISTS Test.vCustomers

-- drop table 
DROP TABLE IF EXISTS Test.Customers

-- drop schema 
DROP SCHEMA IF EXISTS Test

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Microsoft Fabric: Overview of Copilot in Fabric [link]

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part IX: From OLTP to OLAP Data Models)

With SQL databases, Microsoft brought OLTP to Microsoft Fabric, which allows addressing a wider range of requirements, though this also involves some challenges that are usually addressed by the transition from OLTP to OLAP architectures. Typically, an abstraction layer is built on top of the OLTP data models to address the various OLAP requirements. As soon as OLTP and OLAP models are mixed together, this opens the door to design and data quality issues that impact the adoption of solutions by users. Probably, those who worked with MS Access or even MS Excel, directly or in combination with SQL Server, can still remember the issues they ran into.

Ideally, there should be a separation layer between the OLTP and the OLAP data. This can be easily achieved in SQL databases by using two different schemas that mimic the interaction between the two types of architectures. So, supposing that the SalesLT schema holds the data as maintained by the OLTP layer, one can add an additional schema Test in which the OLAP logic is modelled. This scenario is not ideal, though it allows modelling the two aspects of the topic considered. The following steps are to be performed in the environment in which the SalesLT database was created. 

Independently of the layer in which one works, it's useful to create a set of views that abstract the logic and simplify the processing of data. So, in a first step it's recommended to abstract the data from the source by creating a set of views like the one below:

-- drop view (cleaning)
-- DROP VIEW IF EXISTS SalesLT.vCustomerLocations 

-- create view
CREATE VIEW SalesLT.vCustomerLocations
-- Customers with main office
AS
SELECT CST.CustomerId 
, CSA.AddressID
, CST.Title
, CST.FirstName 
, IsNull(CST.MiddleName, '') MiddleName
, CST.LastName 
, CST.CompanyName 
, CST.SalesPerson 
, IsNull(CSA.AddressType, '') AddressType
, IsNull(ADR.City, '') City
, IsNull(ADR.StateProvince, '') StateProvince
, IsNull(ADR.CountryRegion, '') CountryRegion
, IsNull(ADR.PostalCode, '') PostalCode
FROM SalesLT.Customer CST
	 LEFT JOIN SalesLT.CustomerAddress CSA
	   ON CST.CustomerID = CSA.CustomerID
	  AND CSA.AddressType = 'Main Office'
	 	LEFT JOIN SalesLT.Address ADR
		  ON CSA.AddressID = ADR.AddressID

The view uses LEFT instead of FULL joins because this allows more flexibility, respectively identifying the gaps existing between entities (e.g. customers without addresses). In these abstractions, the number of transformations is kept to a minimum so that the data reflect the source. One may choose to minimize the occurrence of NULL values, as this simplifies the logic for comparisons (see the use of IsNull).
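
For example, such gaps can be identified directly on top of the view, as the unmatched records carry a NULL AddressID:

-- customers without a main office address
SELECT count(*) NoRecords
FROM SalesLT.vCustomerLocations
WHERE AddressID IS NULL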

Once the abstraction from the OLTP layer was built, one can make the data available in the OLAP layer:

-- create schema
CREATE SCHEMA Test

-- dropping the target table (for cleaning)
-- DROP TABLE IF EXISTS Test.CustomerLocations

-- Option 1
-- create the table on the fly
SELECT *
INTO Test.CustomerLocations
FROM SalesLT.vCustomerLocations

-- Option 2
-- create the table manually (alternative to the previous step)
CREATE TABLE [Test].[CustomerLocations](
	[CustomerId] [int] NOT NULL,
	[AddressID] [int] NULL,
	[Title] [nvarchar](8) NULL,
	[FirstName] [nvarchar](50) NULL,
	[MiddleName] [nvarchar](50) NULL,
	[LastName] [nvarchar](50) NULL,
	[CompanyName] [nvarchar](128) NULL,
	[SalesPerson] [nvarchar](256) NULL,
	[AddressType] [nvarchar](50) NULL,
	[City] [nvarchar](30) NULL,
	[StateProvince] [nvarchar](50) NULL,
	[CountryRegion] [nvarchar](50) NULL,
	[PostalCode] [nvarchar](15) NULL
) ON [PRIMARY]
GO

-- insert records
INSERT INTO Test.CustomerLocations
SELECT *
FROM SalesLT.vCustomerLocations


-- checking the output (both scenarios)
SELECT top 100 *
FROM Test.CustomerLocations


-- drop the view (for cleaning)
-- DROP VIEW IF EXISTS Test.vCustomerLocations

-- create view
CREATE VIEW Test.vCustomerLocations
-- Customer locations
AS
SELECT CSL.CustomerId 
, CSL.AddressID
, CSL.Title
, CSL.FirstName 
, CSL.MiddleName 
, CSL.LastName 
, Concat(CSL.FirstName, ' ' + CSL.MiddleName, ' ', CSL.LastName) FullName
, CSL.CompanyName 
, CSL.SalesPerson 
, CSL.AddressType
, CSL.City
, CSL.StateProvince
, CSL.CountryRegion 
, CSL.PostalCode
FROM Test.CustomerLocations CSL

-- test the view
SELECT top 100 *
FROM Test.vCustomerLocations

Further on, one can create additional objects as required. Usually, a set of well-designed views is enough, offering the needed flexibility with a minimum of code duplication. In addition, one can build stored procedures and table-valued functions as needed:

-- drop the function (for cleaning)
-- DROP FUNCTION IF EXISTS Test.tvfGetCustomerAddresses

-- generated template - function
CREATE FUNCTION Test.tvfGetCustomerAddresses (
    @CountryRegion nvarchar(50) NULL,
    @StateProvince nvarchar(50) NULL
)
RETURNS TABLE
-- Customers by Country & State province
AS
RETURN (
SELECT CSL.CustomerId 
, CSL.AddressID
, CSL.Title
, CSL.FirstName 
, CSL.MiddleName 
, CSL.LastName 
, CSL.FullName
, CSL.CompanyName 
, CSL.SalesPerson 
, CSL.AddressType 
, CSL.City
, CSL.StateProvince 
, CSL.CountryRegion 
, CSL.PostalCode
FROM Test.vCustomerLocations CSL
WHERE CSL.CountryRegion = IsNull(@CountryRegion, CSL.CountryRegion)
  AND CSL.StateProvince = IsNull(@StateProvince, CSL.StateProvince)
);

-- retrieving all records
SELECT *
FROM Test.tvfGetCustomerAddresses(NULL, NULL)

-- providing parameters
SELECT *
FROM Test.tvfGetCustomerAddresses('United States', 'Utah')

-- filtering on non-parameterized columns
SELECT *
FROM Test.tvfGetCustomerAddresses('United States', 'Utah')
WHERE City = 'Salt Lake City'



-- drop the procedure (for cleaning)
-- DROP PROCEDURE IF EXISTS Test.spGetCustomerAddresses 

-- generated template - stored procedure
CREATE PROCEDURE Test.spGetCustomerAddresses (
    @CountryRegion nvarchar(50) NULL,
    @StateProvince nvarchar(50) NULL
)
-- Customers by Country & State province
AS
BEGIN
	SELECT CSL.CustomerId 
	, CSL.AddressID
	, CSL.Title
	, CSL.FirstName 
	, CSL.MiddleName 
	, CSL.LastName 
	, CSL.FullName
	, CSL.CompanyName 
	, CSL.SalesPerson 
	, CSL.AddressType 
	, CSL.City
	, CSL.StateProvince 
	, CSL.CountryRegion 
	, CSL.PostalCode
	FROM Test.vCustomerLocations CSL
	WHERE CSL.CountryRegion = IsNull(@CountryRegion, CSL.CountryRegion)
	AND CSL.StateProvince = IsNull(@StateProvince, CSL.StateProvince)
END 

-- retrieving all records
EXEC Test.spGetCustomerAddresses NULL, NULL

-- providing parameters
 EXEC Test.spGetCustomerAddresses 'United States', 'Utah'

These steps can be repeated for each entity in scope.
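
For example, a similar abstraction can be sketched for the products, assuming the standard SalesLT.Product and SalesLT.ProductCategory tables from the AdventureWorksLT sample (the view name is only for illustration):

-- drop the view (for cleaning)
-- DROP VIEW IF EXISTS SalesLT.vProductCategories

-- create view (sketch for a further entity)
CREATE VIEW SalesLT.vProductCategories
-- Products with their category
AS
SELECT PRD.ProductID
, PRD.Name ProductName
, PRD.ProductNumber
, IsNull(PRD.Color, '') Color
, PRD.ListPrice
, IsNull(PRC.Name, '') Category
FROM SalesLT.Product PRD
	 LEFT JOIN SalesLT.ProductCategory PRC
	   ON PRD.ProductCategoryID = PRC.ProductCategoryID

The persisted table and the OLAP-side objects would then follow the same pattern as for the customer locations.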

This separation between OLTP and OLAP is usually necessary given that business processes need a certain amount of time until they are correctly reflected to match the reporting needs. Otherwise, the gaps can negatively impact the quality of the data used for reporting. For some reports such deviations might be acceptable, though there will probably also be (many) exceptions. Independently of the solution used, one still needs to make sure that the data are appropriate for the processes and reporting. 

If no physical separation is needed between the two types of layers, one can remove the persisted tables from the logic and keep the objects as they are.
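
In that case the OLAP-side view can read directly from the abstraction view; a minimal sketch based on the objects created above:

-- create view directly on the source abstraction (no persisted table)
CREATE OR ALTER VIEW Test.vCustomerLocations
-- Customer locations (non-persisted variant)
AS
SELECT CSL.CustomerId 
, CSL.AddressID
, CSL.Title
, CSL.FirstName 
, CSL.MiddleName 
, CSL.LastName 
, Concat(CSL.FirstName, ' ' + CSL.MiddleName, ' ', CSL.LastName) FullName
, CSL.CompanyName 
, CSL.SalesPerson 
, CSL.AddressType
, CSL.City
, CSL.StateProvince
, CSL.CountryRegion 
, CSL.PostalCode
FROM SalesLT.vCustomerLocations CSL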

Independently of which architecture is chosen, one shouldn't forget to validate one's presumptions in what concerns the data model (e.g. customers without addresses, address types, etc.).
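
A few simple checks on top of the objects created above can already surface such gaps:

-- customers without a main office address (empty string after the IsNull replacement)
SELECT count(*) NoRecords
FROM Test.vCustomerLocations
WHERE AddressType = ''

-- address types available at the source
SELECT AddressType
, count(*) NoRecords
FROM SalesLT.CustomerAddress
GROUP BY AddressType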

Previous Post <<||>> Next Post

14 February 2025

🏭🧊🗒️Microsoft Fabric: Partitions in Lakehouses [Notes]

 Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 14-Feb-2025

[Microsoft Fabric] Partitions

  • {def} a data organization technique used to split a large dataset into smaller, more manageable nonoverlapping subsets (aka partitions, shards)
    • a partition is defined based on one or more fields
    • each partition contains a subset of the data
    • each partition can be stored and processed independently
  • {goal} improve performance, scalability, and manageability of large data tables
  • {benefit} allows to split large tables into smaller, manageable partitions based on specific criteria [2]
    • e.g., date ranges, regions, categories, entities
  • {benefit} allows to improve queries' performance as they can target specific partitions [2]
    • reduces the amount of data scanned [2]
    • improves queries' performance [2]
  • {benefit} allows for more efficient data loading [2]
  • {benefit} facilitates the management of big tables [2]
    • maintenance tasks can be performed on individual partitions  [2]
    • obsolete data partitions can be removed with no overhead, adding new partitions on a need basis [2]
  • applies to 
    • backups
    • indexing
      • allows optimizing query performance for specific subsets of data
    • statistics
  • performance can be affected by
    • the choice of partition columns for a delta table [1]
    • the number and size of partitions of the partition column [1]
    • a column with high cardinality (mostly or entirely made of unique values) results in a large number of partitions [1]
      • ⇐ negatively impacts performance of the metadata discovery scan for changes [1]
      • {recommendation} if the cardinality of a column is high, choose another column for partitioning [1]
    • the size of each partition can also affect performance
      • {recommendation} use a column that would result in a partition of at least (or close to) 1 GB [1]
      • {recommendation} follow the best practices for delta tables maintenance [1]
      • a large volume of small-sized parquet files increases the time it takes to sync the changes [1]
        • ⇒ leads to large number of parquet files in a delta table due to [1]
          • over-partitioning
            • partitions with high number of unique values [1]
            • {recommendation} choose a partition column that 
              • doesn't have a high cardinality [1]
              • results in individual partition size of at least 1 GB [1]
          • many small files
            • batch and streaming data ingestion rates might result in small files
              • depends on the frequency and size of changes being written to a lakehouse [1]
            • {recommendation} implement regular lakehouse table maintenance [1] 

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Microsoft Fabric: SQL analytics endpoint performance considerations [link]
[2] Kenneth A Omorodion (2024) Partitioning Data in Microsoft Fabric to Improve Performance [link]
[3] Microsoft Learn (2024) Microsoft Fabric: Loading Fabric Lakehouse Tables with partitions [link]

Resources
[R1] Microsoft Learn (2024) Microsoft Fabric: Load data to Lakehouse using partition in a Data pipeline [link]

13 February 2025

🏭💠🗒️Microsoft Fabric: SQL Analytics Endpoint [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 12-Feb-2025

[Microsoft Fabric] SQL Analytics Endpoint

  • {def} a service that listens actively for requests providing a SQL-based experience for lakehouse Delta tables [1]
    • enables to query data in the lakehouse using T-SQL language and TDS protocol [1]
    • created by default for each lakehouse and mirrored database provisioned in a workspace
      • every lakehouse has one SQL analytics endpoint [1]
    • manages the automatically generated tables so the workspace users can't modify them [1]
  • {feature} a background process is responsible for 
    • scanning lakehouse for changes [1]
      • automatically generates a table in the appropriate schema for every Delta table in the Lakehouse
        • tables are created with a minor delay [1]
        • data is refreshed automatically
          • the amount of time it takes to refresh the table is related to how optimized the Delta tables are [1]
        • {alternative} one can manually force a refresh of the automatic metadata scanning in the Fabric portal [1]
    • keeping SQL analytics endpoint up-to-date for all the changes committed to lakehouses in a workspace [1]
      • {characteristic} transparently managed
        • ⇐ the sync process is transparently managed by Microsoft Fabric platform [1]
        • when a change is detected in the lakehouse
          • a background process updates metadata [1]
          • the SQL analytics endpoint reflects the changes committed to lakehouse tables [1]
      • {characteristic} low latency
        • under normal operating conditions, the lag between a lakehouse and SQL analytics endpoint is less than one minute [1]
        • the actual length of time can vary from a few seconds to minutes [1]
      • the metadata synchronization is automatically triggered when the SQL Endpoint is opened [3]
        • ensures the SQL Analytics Endpoint remains up to date without the need for manual syncs [3]
          •  ⇐ provides a seamless experience [3]
  • {feature} can be enriched by adding database objects (see the sketch after these notes)
    • schemas
    • views
    • procedures
    • other database objects
  • {feature} automatic metadata discovery
    • tracks changes committed to lakehouses [1]
      • is a single instance per Fabric workspace [1]
      • {issue} increased latency for changes to sync between lakehouses and the SQL analytics endpoint; it could be due to a large number of lakehouses in one workspace [1]
        • {resolution} migrate each lakehouse to a separate workspace [1]
          • allows automatic metadata discovery to scale [1]
      • {issue} changes committed to a lakehouse are not visible in the associated SQL analytics endpoint
        • e.g. a table newly created in the lakehouse [1]
        • many committed rows may not be visible in the endpoint [1]
        • ⇐ may be impacted by ETL processing that generates large volumes of changes
        • {recommendation} initiate an on-demand metadata sync, triggered from the SQL query editor Refresh ribbon option [1]
          • forces an on-demand metadata sync, rather than waiting on the background metadata sync to finish [1]
      • {issue} if there's no maintenance scheduled for the parquet files, this can result in read overhead and impacts the time it takes to sync changes to the SQL analytics endpoint [1]
        • {recommendation} schedule regular lakehouse table maintenance operations [1]
      • {limitation} not all Delta features are understood by the automatic sync process [1]
  • {feature} endpoint reprovisioning
    • if the initial provisioning attempt fails, users have the option to try again [2]
      • ⇐ without the need to create an entirely new lakehouse [2]
      • {benefit} empowers users to self-mitigate provisioning issues in a convenient way in the UI, avoiding the need for complete lakehouse re-creation [2]
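
As noted above, the SQL analytics endpoint can be enriched with T-SQL objects such as schemas and views. A minimal sketch, assuming a lakehouse Delta table exposed in the endpoint as dbo.Sales with Year, Region and SalesAmount columns (the names are only for illustration):

-- create a schema for the reporting objects
CREATE SCHEMA Rpt;
GO

-- create a view over an automatically generated lakehouse table
CREATE VIEW Rpt.vSalesByRegion
AS
SELECT [Year]
, Region
, Sum(SalesAmount) SalesAmount
FROM dbo.Sales
GROUP BY [Year]
, Region;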

References:
[1] Microsoft Learn (2024) Microsoft Fabric: SQL analytics endpoint performance considerations [link]
[2] Microsoft Learn (2024) Microsoft Fabric: What is the SQL analytics endpoint for a lakehouse? [link]
[3] Microsoft Learn (2024) Microsoft Fabric: What’s new in the Fabric SQL Analytics Endpoint? [link]

09 February 2025

🏭🗒️Microsoft Fabric: Data Pipelines [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 9-Feb-2025

[Microsoft Fabric] Data pipeline

  • {def} a logical sequence of activities that orchestrate a process and perform together a task [1]
    • usually by extracting data from one or more sources and loading it into a destination; 
      • ⇐ often transforming it along the way [1]
      • ⇐ allows to manage the activities as a set instead of each one individually [2]
      • ⇐ used to automate ETL processes that ingest transactional data from operational data stores into an analytical data store [1]
      • e.g. lakehouse or data warehouse 
  • {concept} activity
    • {def} an executable task in a pipeline
      • a flow of activities can be defined by connecting them in a sequence [1]
      • its outcome (success, failure, or completion) can be used to direct the flow to the next activity in the sequence [1]
    • {type} data movement activities
      • copies data from a source data store to a sink data store [2]
    • {type} data transformation activities
      • encapsulate data transfer operations
        • incl. simple Copy Data activities that extract data from a source and load it to a destination
        • incl. complex Data Flow activities that encapsulate dataflows (Gen2) that apply transformations to the data as it is transferred
        • incl. notebook activities to run a Spark notebook
        • incl. stored procedure activities to run SQL code
        • incl. delete data activities to delete existing data
    • {type} control flow activities
        • used to 
          • implement loops
          • implement conditional branching
          • manage variables
          • manage parameter values
          • enable to implement complex pipeline logic to orchestrate data ingestion and transformation flow  [1]
        • can be parameterized
          • ⇐ enabling one to provide specific values to be used each time a pipeline is run [1]
        • when executed, a run is initiated (aka data pipeline run)
          • runs can be initiated on-demand or scheduled to start at a specific frequency
          • use the unique run ID to review run details to confirm they completed successfully and investigate the specific settings used for each execution [1]
      • {benefit} increases pipelines’ reusability
    • {concept} pipeline template
      • predefined pipeline that can be used and customize as required
    • {concept} data pipeline run
      • occurs when a data pipeline is executed
      • the activities in the data pipeline are executed to completion [3] 
      • can be triggered one of two ways
        • on-demand
        • on a schedule
          • the scheduled pipeline will be able to run based on the time and frequency set [3]

    Previous Post <<||>> Next Post

    References:
    [1] Microsoft Learn (2023) Use Data Factory pipelines in Microsoft Fabric [link]
    [2] Microsoft Learn (2024) Microsoft Fabric: Activity overview [link]
    [3] Microsoft Learn (2024) Microsoft Fabric Concept: Data pipeline Runs [link]

    Resources
    [R1] Metadata Driven Pipelines for Microsoft Fabric (link)

    🏭🗒️Microsoft Fabric: Kusto Query Language (KQL) [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 9-Feb-2025

    [Microsoft Fabric] Kusto Query Language (KQL)

    • {def} a read-only request to process data and return results [1]
      • designed for data exploration and summarization [1]
        • very similar to SQL
          • the explain command can be used to transform SQL into KQL code
            • ⇐ not all the SQL syntax can be translated
      • statements are sequenced and executed in the order of their arrangement
        • funnel like processing where data is piped from one operator to the next
          • data is filtered, rearranged or summarized at each step and then fed into the following step
          • statements are sequenced by a pipe (|)
      • returns data in a tabular or graph format
      • designed and developed to take advantage of cloud computing through clustering and scaling compute [2]
        • ideal engine to power fast, real-time dashboards
      • case-sensitive in general 
      • named after the undersea pioneer Jacques Cousteau [2]
      • operation sequence
        • filter data
        • aggregate data
        • order data
        • modify column output
    • supports standard data types 
      • string
        • a sequence of zero or more Unicode characters
        • characters are encoded in UTF-8.
      • int 
        • 32-bit whole-number integer
      • long
        • signed 64-bit whole-number integer
      • real (aka double)
        • 64-bit decimal-based number 
        • provides high precision with decimal points
      • decimal
        • a 128-bit decimal number
        • provides the highest precision of decimal points
        • {recommendation} if precision is not needed, use the real type instead [2]
      • bool 
        • a boolean value that can be a true (1), false (0), or null
      • datetime 
        • represents a date in the UTC zone
      • timespan 
        • represents a time interval
          •  days, hours, minutes, seconds, milliseconds, microseconds, tick
        • if no time frame is specified, it will default to day
      • dynamic
        • a special data type that can take 
          • any value from the other data types
          • arrays
          • a {name = value} property bag
      • guid
        • a 128-bit globally unique value
    • statement types
      • tabular expression statement
      • let statement
        • used to 
          • set variable names equal to an expression
          • create views
        • ⇐ used mostly to 
          • help break complex expressions into multiple parts, each represented by a variable
          • set constants outside the query to aid in readability
      • set statement
        • used to set the query duration
    • {tool} Microsoft Sentinel
      • {def} a cloud native SIEM and SOAR that provides cyberthreat detection, investigation, response, and proactive hunting, with a bird's-eye view across your enterprise [3]
    • {tool} Kusto Explorer
      • {def} user-friendly interface to query and analyze data with KQL [4]
    • {tool} Azure Data Studio 
      • {def} lightweight, cross-platform data management and development tool for data professionals [5]

    References:
    [1] Microsoft (2024) Real-time Analytics: End-to-End Workshop
    [2] Mark Morowczynski et al (2024) The Definitive Guide to KQL: Using Kusto Query Language for Operations, Defending, and Threat Hunting
    [3] Microsoft Learn (2024) Azure: What is Microsoft Sentinel [link]
    [4] Microsoft Learn (2024) Kusto: Kusto.Explorer installation and user interface [link]
    [5] Microsoft Learn (2024) SQL: What is Azure Data Studio? [link]

    Acronyms:
    KQL - Kusto Query Language
    SIEM - security information and event management
    SOAR - security orchestration, automation, and response
    SQL - Structured Query Language
    UTC - Universal Time Coordinated

    🏭🗒️Microsoft Fabric: Sharding [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 9-Feb-2025

     [Microsoft Fabric] Data Partitioning (aka Sharding)

    • {definition} "a process where small chunks of the database are isolated and can be updated independently of other shards" [2]
    • allows a logical database to be partitioned across multiple physical servers [1]
      • each partition is referred to as a shard
      • the largest tables are partitioned across multiple database servers [1]
        • when operating on a record, the application must determine which shard will contain the data and then send the SQL to the appropriate server [1]
      • partitioning is based on a Key Value
        • e.g. such as a user ID
    • proven technique for achieving data processing on a massive scale [1]
      • solution used at the largest websites 
        • e.g. Facebook, Twitter
        • usually associated with rapid growth
          •  ⇒ the approach needs to be dynamic [1]
      • the only way to scale a relational database to massive web use [1]
        • together with caching and replication [1]
    • {drawback} involves significant operational complexities and compromises [1]
      • the application must contain logic that understands the location of any particular piece of data and the logic to route requests to the correct shard [1]
      • requests that can only be satisfied by accessing more than one shard thus need complex coding as well, whereas on a nonsharded database a single SQL statement might suffice.
    • {drawback} high operational costs [1]
    • {drawback} application complexity
      • it’s up to the application code to route SQL requests to the correct shard [1]
        • ⇒ a dynamic routing layer must be implemented
          • ⇐ most massive websites are adding shards as they grow [1]
          • layer required to maintain Memcached object copies and to differentiate between the master database and read-only replicas [1]
    • {drawback} crippled SQL
      • [sharded database] it is not possible to issue a SQL statement that operates across shards [1]
        • ⇒ usually SQL statements are limited to row-level access [1]
        • ⇒ only programmers can query the database as a whole [1]
        • joins across shards cannot be implemented, nor can aggregate GROUP BY operations [1]
    • {drawback} loss of transactional integrity
      • ACID transactions against multiple shards are not possible and/or not practical [1]
        • ⇐ {exception} there are database systems that support 2PC
          • involves considerable troubleshooting as conflicts and bottlenecks can occur [1]
    • {drawback} operational complexity. 
      • load balancing across shards becomes extremely problematic
        • adding new shards requires a complex rebalancing of data [1]
        • changing the database schema requires a rolling operation across all the shards [1]
          • ⇒ can lead to transitory inconsistencies in the schema [1]
      • a sharded database entails a huge amount of operational effort and administrator skill [1]
    • {concept} CAP (Consistency, Availability, Partition tolerance) theorem 
      • in a distributed database system, one can have at most two of the three properties [1]
      • consistency
        • every user of the database has an identical view of the data at any given instant [1]
      • availability
        • in the event of a failure, the database remains operational [1]
      • partition tolerance
        • the database can maintain operations in the event of the network’s failing between two segments of the distributed system [1]
    • {concept} partitioning
      • {def} core pattern of building scalable services by dividing state (data) and compute into smaller accessible units to improve scalability and performance [5]
        • ⇐ determines that a particular service partition is responsible for a portion of the complete state of the service
          • a partition is a set of replicas
      • {type} [stateless services] a logical unit that contains one or more instances of a service [5]
        • partitioning a stateless service is a very rare scenario
        • scalability and availability are normally achieved by adding more instances
        • {subtype} externally persisted state
          • persists its state externally [5]
            • e.g. databases in Azure SQL Database
        • {subtype} computation-only services
          • services that do not manage any persistent state (e.g. calculator or image thumbnailing) [5]
      • {type} scalable stateful services
        • partition state (data)
        • a partition of a stateful service as a scale unit that is highly reliable through replicas that are distributed and balanced across the nodes in a cluster
        • the state must be accessed and stored
          • ⇒ bound by
            • network bandwidth limits
            • system memory limits
            • disk storage limits
          • {scenario} run into resource constraints in a running cluster
            • {recommendation} scale out the cluster to accommodate the new requirements [4]
    • {concept} Service Fabric: a distributed systems platform used to build hyper-scalable, reliable and easily managed applications for the cloud [6]
      • ⇐ addresses the significant challenges in developing and managing cloud applications
      • places the partitions on different nodes [5]
        • allows partitions to grow to a node's resource limit
          • ⇐ partitions are rebalances across nodes [5]
        • {benefit} ensures the continued efficient use of hardware resources [5]
      • {default} makes sure that there is about the same number of primary and secondary replicas on each node
        • ⇒ nodes that hold replicas can serve more traffic and others that serve less traffic [5]
        • hot and cold spots may appear in a cluster
            • ⇐ this should preferably be avoided
          • {recommendation} partition the state so is evenly distributed across all partitions [5]
          • {recommendation} report load from each of the replicas for the service [5]
      • provides the capability to report load consumed by services [5]
        • e.g.  amount of memory, number of records
        • detects which partitions serve higher loads than others [5]
          • ⇐ based on the metrics reported
        • rebalances the cluster by moving replicas to more suitable nodes, so that overall no node is overloaded  [5]
        • ⇐ it's not always possible to know how much data will be in a given partition
          • {recommendation} adopt a partitioning strategy that spreads the data evenly across the partitions [5]
            • {benefit} prevents situations described in the voting example [5]
          • {recommendation} report load
            • {benefit} helps smooth out temporary differences in access or load over time [5]
      • {recommendation} choose an optimal number of partitions to begin with
        • ⇐ there's nothing that prevents from starting out with a higher number of partitions than anticipated [5]
          • ⇐ assuming the maximum number of partitions is a valid approach [5]
        • ⇒ one may end up needing more partitions than initially considered [5]
          • ⇐ {constraint} the partition count can't be changed after the fact [5]
            •  ⇒ apply more advanced partition approaches
              • e.g. creating a new service instance of the same service type
              • e.g. implement client-side logic that routes the requests to the correct service instance

    References:
    [1] Guy Harrison (2015) Next Generation Databases: NoSQL, NewSQL, and Big Data
    [2] DAMA International (2017) "The DAMA Guide to the Data Management Body of Knowledge" 2nd Ed
    [3] Microsoft Fabric (2024) External data sharing in Microsoft Fabric [link]
    [4] Microsoft Fabric (2024) Data sharding policy [link]
    [5] Microsoft Fabric (2024) Partition Service Fabric reliable services [link]
    [6] MSDN (2015) Microsoft Azure - Azure Service Fabric and the Microservices Architecture [link]

    Acronyms:
    ACID - atomicity, consistency, isolation, durability
    2PC - Two Phase Commit
    CAP - Consistency, Availability, Partition
