Showing posts with label Microsoft Fabric.

14 February 2025

🏭🧊🗒️Microsoft Fabric: Partitions in Lakehouses [Notes]

 Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 14-Feb-2025

[Microsoft Fabric] Partitions

  • {def} a data organization technique used to split a large dataset into smaller, more manageable, nonoverlapping subsets (aka partitions, shards)
    • a partition is defined based on one or more fields
    • each partition contains a subset of the data
    • each partition can be stored and processed independently
  • {goal} improve performance, scalability, and manageability of large data tables
  • {benefit} allows to split large tables into smaller, manageable partitions based on specific criteria [2]
    • e.g., date ranges, regions, categories, entities (see the sketch after these notes)
  • {benefit} allows to improve queries' performance as they can target specific partitions [2]
    • reduces the amount of data scanned [2]
    • improves queries' performance [2]
  • {benefit} allows for more efficient data loading [2]
  • {benefit} facilitates the management of big tables [2]
    • maintenance tasks can be performed on individual partitions  [2]
    • obsolete data partitions can be removed with no overhead, and new partitions can be added on an as-needed basis [2]
  • applies to 
    • backups
    • indexing
      • allows optimizing query performance for specific subsets of data
    • statistics
  • performance can be affected by
    • the choice of partition columns for a delta table [1]
    • the number and size of partitions of the partition column [1]
    • a column with high cardinality (mostly or entirely made of unique values) results in a large number of partitions [1]
      • ⇐ negatively impacts performance of the metadata discovery scan for changes [1]
      • {recommendation} if the cardinality of a column is high, choose another column for partitioning [1]
    • the size of each partition can also affect performance
      • {recommendation} use a column that would result in a partition of at least (or close to) 1 GB [1]
      • {recommendation} follow the best practices for delta tables maintenance [1]
      • a large volume of small-sized parquet files increases the time it takes to sync the changes [1]
        • ⇐ a delta table can end up with a large number of parquet files due to [1]
          • over-partitioning
            • ⇐ the partition column has a high number of unique values [1]
            • {recommendation} choose a partition column that 
              • doesn't have a high cardinality [1]
              • results in individual partition size of at least 1 GB [1]
          • many small files
            • batch and streaming data ingestion rates might result in small files
              • depends on the frequency and size of changes being written to a lakehouse [1]
            • {recommendation} implement regular lakehouse table maintenance [1] 
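
A minimal Spark SQL sketch of how such a partitioned delta table could be created in a lakehouse is shown below; the table, its columns and the choice of OrderYear as partition key are illustrative assumptions, not a prescription:

-- delta table partitioned by a low-cardinality, date-derived column
CREATE TABLE sales_orders (
    OrderID BIGINT,
    OrderDate DATE,
    Region STRING,
    Amount DECIMAL(18,2),
    OrderYear INT
)
USING DELTA
PARTITIONED BY (OrderYear);

-- each distinct OrderYear value lands in its own partition folder
INSERT INTO sales_orders VALUES
    (1, DATE'2024-03-15', 'EMEA', 100.50, 2024),
    (2, DATE'2025-01-10', 'APAC',  75.00, 2025);

-- compacting small parquet files as part of regular table maintenance
OPTIMIZE sales_orders;

Choosing a column like OrderYear instead of a high-cardinality one like OrderID keeps the number of partitions low and the individual partitions large, in line with the recommendations above.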


References:
[1] Microsoft Learn (2024) Microsoft Fabric: SQL analytics endpoint performance considerations [link]
[2] Kenneth A Omorodion (2024) Partitioning Data in Microsoft Fabric to Improve Performance [link]
[3] Microsoft Learn (2024) Microsoft Fabric: Loading Fabric Lakehouse Tables with partitions [link]

Resources
[R1] Microsoft Learn (2024) Microsoft Fabric: Load data to Lakehouse using partition in a Data pipeline [link]

13 February 2025

🏭💠🗒️Microsoft Fabric: SQL Analytics Endpoint [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 12-Feb-2025

[Microsoft Fabric] SQL Analytics Endpoint

  • {def} a service that actively listens for requests, providing a SQL-based experience for lakehouse Delta tables [1]
    • enables querying the data in the lakehouse using the T-SQL language and the TDS protocol [1] (see the sketch after these notes)
    • created by default for each lakehouse and mirrored database provisioned in a workspace
      • every lakehouse has one SQL analytics endpoint [1]
    • manages the automatically generated tables so the workspace users can't modify them [1]
  • {feature} a background process is responsible for 
    • scanning lakehouse for changes [1]
      • automatically generates a table in the appropriate schema for every Delta table in the Lakehouse
        • tables are created with a minor delay [1]
        • data is refreshed automatically
          • the amount of time it takes to refresh the table is related to how optimized the Delta tables are [1]
        • {alternative} one can manually force a refresh of the automatic metadata scanning in the Fabric portal [1]
    • keeping SQL analytics endpoint up-to-date for all the changes committed to lakehouses in a workspace [1]
      • {characteristic} transparently managed
        • ⇐ the sync process is transparently managed by Microsoft Fabric platform [1]
        • when a change is detected in the lakehouse
          • a background process updates metadata [1]
          • the SQL analytics endpoint reflects the changes committed to lakehouse tables [1]
      • {characteristic} low latency
        • under normal operating conditions, the lag between a lakehouse and SQL analytics endpoint is less than one minute [1]
        • the actual length of time can vary from a few seconds to minutes [1]
      • the metadata synchronization is automatically triggered when the SQL Endpoint is opened [3]
        • ensures the SQL Analytics Endpoint remains up to date without the need for manual syncs [3]
          •  ⇐ provides a seamless experience [3]
  • {feature} can be enriched by adding database objects
    • schemas
    • views
    • procedures
    • other database objects
  • {feature} automatic metadata discovery
    • tracks changes committed to lakehouses [1]
      • is a single instance per Fabric workspace [1]
      • {issue} increased latency for changes to sync between lakehouses and the SQL analytics endpoint; it could be due to a large number of lakehouses in one workspace [1]
        • {resolution} migrate each lakehouse to a separate workspace [1]
          • allows automatic metadata discovery to scale [1]
      • {issue} changes committed to a lakehouse are not visible in the associated SQL analytics endpoint [1]
        • e.g. a newly created table in the lakehouse is not visible in the endpoint [1]
        • e.g. a large number of committed rows is not visible in the endpoint [1]
        • ⇐ may be caused by ETL processing that generates large volumes of changes
        • {recommendation} initiate an on-demand metadata sync, triggered from the SQL query editor Refresh ribbon option [1]
          • forces an on-demand metadata sync, rather than waiting on the background metadata sync to finish [1]
      • {issue} if there's no maintenance scheduled for the parquet files, this can result in read overhead, which impacts the time it takes to sync changes to the SQL analytics endpoint [1]
        • {recommendation} schedule regular lakehouse table maintenance operations [1]
      • {limitation} not all Delta features are understood by the automatic sync process [1]
  • {feature} endpoint reprovisioning
    • if the initial provisioning attempt fails, users have the option to try again [2]
      • ⇐ without the need to create an entirely new lakehouse [2]
      • {benefit} empowers users to self-mitigate provisioning issues in a convenient way in the UI, avoiding the need for complete lakehouse re-creation [2]
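
As a minimal illustration of the points above, the sketch below enriches the SQL analytics endpoint with a T-SQL view over an auto-generated lakehouse table and then queries it over the TDS endpoint; the table and column names are assumptions made for the example:

-- enriching the endpoint with a view over a (read-only) lakehouse table
CREATE VIEW dbo.vw_OrderTotals
AS
SELECT SalesOrderID,
       OrderDate,
       SUM(LineTotal) AS OrderTotal
FROM dbo.SalesOrderDetail
GROUP BY SalesOrderID, OrderDate;
GO

-- querying the view, e.g. from SSMS or the SQL query editor
SELECT TOP (10) *
FROM dbo.vw_OrderTotals
ORDER BY OrderTotal DESC;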

References:
[1] Microsoft Learn (2024) Microsoft Fabric: SQL analytics endpoint performance considerations [link]
[2] Microsoft Learn (2024) Microsoft Fabric: What is the SQL analytics endpoint for a lakehouse? [link]
[3] Microsoft Learn (2024) Microsoft Fabric: What’s new in the Fabric SQL Analytics Endpoint? [link]

09 February 2025

🏭🗒️Microsoft Fabric: Data Pipelines [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 9-Feb-2025

[Microsoft Fabric] Data pipeline

  • {def} a logical sequence of activities that orchestrate a process and perform together a task [1]
    • usually by extracting data from one or more sources and loading it into a destination; 
      • ⇐ often transforming it along the way [1]
      • ⇐ allows to manage the activities as a set instead of each one individually [2]
      • ⇐ used to automate ETL processes that ingest transactional data from operational data stores into an analytical data store [1]
      • e.g. lakehouse or data warehouse 
  • {concept} activity
    • {def} an executable task in a pipeline
      • a flow of activities can be defined by connecting them in a sequence [1]
      • its outcome (success, failure, or completion) can be used to direct the flow to the next activity in the sequence [1]
    • {type} data movement activities
      • copies data from a source data store to a sink data store [2]
    • {type} data transformation activities
      • encapsulate data transfer operations
        • incl. simple Copy Data activities that extract data from a source and load it to a destination
        • incl. complex Data Flow activities that encapsulate dataflows (Gen2) that apply transformations to the data as it is transferred
        • incl. notebook activities to run a Spark notebook
        • incl. stored procedure activities to run SQL code (see the sketch after these notes)
        • incl. delete data activities to delete existing data
    • {type} control flow activities
        • used to 
          • implement loops
          • implement conditional branching
          • manage variables
          • manage parameter values
          • ⇐ enables implementing complex pipeline logic to orchestrate the data ingestion and transformation flow [1]
        • can be parameterized
          • ⇐ enabling specific values to be provided each time a pipeline is run [1]
          • {benefit} increases pipelines’ reusability
        • when executed, a run is initiated (aka data pipeline run)
          • runs can be initiated on-demand or scheduled to start at a specific frequency
          • use the unique run ID to review run details, confirm they completed successfully, and investigate the specific settings used for each execution [1]
    • {concept} pipeline template
      • predefined pipeline that can be used and customized as required
    • {concept} data pipeline run
      • occurs when a data pipeline is executed
      • the activities in the data pipeline are executed to completion [3] 
      • can be triggered in one of two ways
        • on-demand
        • on a schedule
          • the scheduled pipeline will be able to run based on the time and frequency set [3]
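
To make the stored procedure activity above more concrete, here is a minimal T-SQL sketch of a procedure such an activity could call, e.g. to log the outcome of a run; the schema, table and procedure names are assumed placeholders:

-- hypothetical logging procedure invoked from a pipeline's Stored procedure activity
CREATE PROCEDURE etl.LogPipelineRun
    @PipelineName VARCHAR(200),
    @RunId        VARCHAR(100),
    @Status       VARCHAR(50)
AS
BEGIN
    -- one row per pipeline run, for later monitoring and troubleshooting
    INSERT INTO etl.PipelineRunLog (PipelineName, RunId, Status, LoggedAt)
    VALUES (@PipelineName, @RunId, @Status, SYSUTCDATETIME());
END;

The pipeline would typically pass its own name and run ID as parameter values, so the logged rows can be correlated with the run details mentioned above.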


    References:
    [1] Microsoft Learn (2023) Use Data Factory pipelines in Microsoft Fabric [link]
    [2] Microsoft Learn (2024) Microsoft Fabric: Activity overview [link]
    [3] Microsoft Learn (2024) Microsoft Fabric Concept: Data pipeline Runs [link]

    Resources
    [R1] Metadata Driven Pipelines for Microsoft Fabric [link]

    🏭🗒️Microsoft Fabric: Kusto Query Language (KQL) [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 9-Feb-2025

    [Microsoft Fabric] Kusto Query Language (KQL)

    • {def} a read-only request to process data and return results [1]
      • designed for data exploration and summarization [1]
        • very similar to SQL
          • the explain command can be used to transform SQL into KQL code
            • ⇐ not all the SQL syntax can be translated
      • statements are executed in the order of their arrangement
        • funnel like processing where data is piped from one operator to the next
          • data is filtered, rearranged or summarized at each step and then fed into the following step
          • operators are sequenced by a pipe (|)
      • returns data in a tabular or graph format
      • designed and developed to take advantage of cloud computing through clustering and scaling compute [2]
        • ideal engine to power fast, real-time dashboards
      • case-sensitive in general 
      • named after the undersea pioneer Jacques Cousteau [2]
      • operation sequence
        • filter data
        • aggregate data
        • order data
        • modify column output
    • supports standard data types 
      • string
        • a sequence of zero or more Unicode characters
        • characters are encoded in UTF-8.
      • int 
        • 32-bit whole-number integer
      • long
        • signed 64-bit whole-number integer
      • real (aka double)
        • a 64-bit floating-point number
        • provides high precision with decimal points
      • decimal
        • a 128-bit decimal number
        • provides the highest precision of decimal points
        • {recommendation} if precision is not needed, use the real type instead [2]
      • bool 
        • a boolean value that can be a true (1), false (0), or null
      • datetime 
        • represents a date in the UTC zone
      • timespan 
        • represents a time interval
          •  days, hours, minutes, seconds, milliseconds, microseconds, tick
        • if no time frame is specified, it will default to day
      • dynamic
        • a special data type that can take 
          • any value from the other data types
          • arrays
          • a {name = value} property bag
      • guid
        • a 128-bit globally unique value
    • statement types
      • tabular expression statement
      • let statement
        • used to 
          • set variable names equal to an expression
          • create views
        • ⇐ used mostly to 
          • help break complex expressions into multiple parts, each represented by a variable
          • set constants outside the query to aid in readability
      • set statement
        • used to set a request property for the duration of the query
    • {tool} Microsoft Sentinel
      • {def} a cloud native SIEM and SOAR that provides cyberthreat detection, investigation, response, and proactive hunting, with a bird's-eye view across your enterprise [3]
    • {tool} Kusto Explorer
      • {def} user-friendly interface to query and analyze data with KQL [4]
    • {tool} Azure Data Studio 
      • {def} lightweight, cross-platform data management and development tool for data professionals [5]

    References:
    [1] Microsoft (2024) Real-time Analytics: End-to-End Workshop
    [2] Mark Morowczynski et al (2024) The Definitive Guide to KQL: Using Kusto Query Language for Operations, Defending, and Threat Hunting
    [3] Microsoft Learn (2024) Azure: What is Microsoft Sentinel [link]
    [4] Microsoft Learn (2024) Kusto: Kusto.Explorer installation and user interface [link]
    [5] Microsoft Learn (2024) SQL: What is Azure Data Studio? [link]

    Acronyms:
    KQL - Kusto Query Language
    SIEM - security information and event management
    SOAR - security orchestration, automation, and response
    SQL - Structured Query Language
    UTC - Universal Time Coordinated

    🏭🗒️Microsoft Fabric: Sharding [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 9-Feb-2025

     [Microsoft Fabric] Data Partitioning (aka Sharding)

    • {definition} "a process where small chunks of the database are isolated and can be updated independently of other shards" [2]
    • allows a logical database to be partitioned across multiple physical servers [1]
      • each partition is referred to as a shard
      • the largest tables are partitioned across multiple database servers [1]
        • when operating on a record, the application must determine which shard will contain the data and then send the SQL to the appropriate server [1]
      • partitioning is based on a Key Value
        • e.g. a user ID (see the routing sketch after these notes)
    • proven technique for achieving data processing on a massive scale [1]
      • solution used at the largest websites 
        • e.g. Facebook, Twitter
        • usually associated with rapid growth
          •  ⇒ the approach needs to be dynamic [1]
      • the only way to scale a relational database to massive web use [1]
        • together with caching and replication [1]
    • {drawback} involves significant operational complexities and compromises [1]
      • the application must contain logic that understands the location of any particular piece of data and the logic to route requests to the correct shard [1]
      • requests that can only be satisfied by accessing more than one shard thus need complex coding as well, whereas on a nonsharded database a single SQL statement might suffice.
    • {drawback} high operational costs [1]
    • {drawback} application complexity
      • it’s up to the application code to route SQL requests to the correct shard [1]
        • ⇒ a dynamic routing layer must be implemented
          • ⇐ most massive websites are adding shards as they grow [1]
          • layer required to maintain Memcached object copies and to differentiate between the master database and read-only replicas [1]
    • {drawback} crippled SQL
      • [sharded database] it is not possible to issue a SQL statement that operates across shards [1]
        • ⇒ usually SQL statements are limited to row-level access [1]
        • ⇒ only programmers can query the database as a whole [1]
        • joins across shards cannot be implemented, nor can aggregate GROUP BY operations [1]
    • {drawback} loss of transactional integrity
      • ACID transactions against multiple shards are not possible and/or not practical [1]
        • ⇐ {exception} there are database systems that support 2PC
          • involves considerable troubleshooting as conflicts and bottlenecks can occur [1]
    • {drawback} operational complexity. 
      • load balancing across shards becomes extremely problematic
        • adding new shards requires a complex rebalancing of data [1]
        • changing the database schema requires a rolling operation across all the shards [1]
          • ⇒ can lead to transitory inconsistencies in the schema [1]
      • a sharded database entails a huge amount of operational effort and administrator skill [1]
    • {concept} CAP (Consistency, Availability, Partition tolerance) theorem 
      • in a distributed database system, one can have at most two of the three properties [1]
      • consistency
        • every user of the database has an identical view of the data at any given instant [1]
      • availability
        • in the event of a failure, the database remains operational [1]
      • partition tolerance
        • the database can maintain operations in the event of the network’s failing between two segments of the distributed system [1]
    • {concept} partitioning
      • {def} core pattern of building scalable services by dividing state (data) and compute into smaller accessible units to improve scalability and performance [5]
        • ⇐ determines that a particular service partition is responsible for a portion of the complete state of the service.
          • a partition is a set of replicas
      • {type} [stateless services] a logical unit that contains one or more instances of a service [5]
        • partitioning a stateless service is a very rare scenario
        • scalability and availability are normally achieved by adding more instances
        • {subtype} externally persisted state
          • persists its state externally [5]
            • e.g. databases in Azure SQL Database
        • {subtype} computation-only services
          • services that do not manage any persistent state, e.g. calculator or image thumbnailing [5]
      • {type} scalable stateful services
        • partition state (data)
        • a partition of a stateful service acts as a scale unit that is highly reliable through replicas that are distributed and balanced across the nodes in a cluster
        • the state must be accessed and stored
          • ⇒ bound by
            • network bandwidth limits
            • system memory limits
            • disk storage limits
          • {scenario} run into resource constraints in a running cluster
            • {recommendation} scale out the cluster to accommodate the new requirements [4]
    • {concept} [Azure Service Fabric] distributed systems platform used to build hyper-scalable, reliable and easily managed applications for the cloud [6]
      • ⇐ addresses the significant challenges in developing and managing cloud applications
      • places the partitions on different nodes [5]
        • allows partitions to grow to a node's resource limit
          • ⇐ partitions are rebalanced across nodes [5]
        • {benefit} ensures the continued efficient use of hardware resources [5]
      • {default} makes sure that there is about the same number of primary and secondary replicas on each node
        • ⇒ some nodes may end up serving more traffic than others [5]
          • ⇐ hot and cold spots may appear in a cluster, which should preferably be avoided
          • {recommendation} partition the state so it is evenly distributed across all partitions [5]
          • {recommendation} report load from each of the replicas for the service [5]
      • provides the capability to report load consumed by services [5]
        • e.g.  amount of memory, number of records
        • detects which partitions serve higher loads than others [5]
          • ⇐ based on the metrics reported
        • rebalances the cluster by moving replicas to more suitable nodes, so that overall no node is overloaded  [5]
        • ⇐ it's not always possible to know how much data will be in a given partition
          • {recommendation} adopt a partitioning strategy that spreads the data evenly across the partitions [5]
            • {benefit} prevents situations described in the voting example [5]
          • {recommendation} report load
            • {benefit} helps smooth out temporary differences in access or load over time [5]
      • {recommendation} choose an optimal number of partitions to begin with
        • ⇐ there's nothing that prevents one from starting out with a higher number of partitions than anticipated [5]
          • ⇐ assuming the maximum number of partitions is a valid approach [5]
        • ⇒ one may end up needing more partitions than initially considered [5]
          • ⇐ {constraint} the partition count can't be changed after the fact [5]
            •  ⇒ apply more advanced partition approaches
              • e.g. creating a new service instance of the same service type
              • e.g. implement client-side logic that routes the requests to the correct service instance

    References:
    [1] Guy Harrison (2015) Next Generation Databases: NoSQL, NewSQL, and Big Data
    [2] DAMA International (2017) "The DAMA Guide to the Data Management Body of Knowledge" 2nd Ed
    [3] Microsoft Fabric (2024) External data sharing in Microsoft Fabric [link]
    [4] Microsoft Fabric (2024) Data sharding policy [link]
    [5] Microsoft Fabric (2024) Partition Service Fabric reliable services [link]
    [6] MSDN (2015) Microsoft Azure - Azure Service Fabric and the Microservices Architecture [link]

    Acronyms:
    ACID - atomicity, consistency, isolation, durability
    2PC - Two Phase Commit
    CAP - Consistency, Availability, Partition

    02 February 2025

    🏭 💠Data Warehousing: Microsoft Fabric (Part VIII: More on SQL Databases)

    Business Intelligence Series

    Last week Microsoft had a great session [1] on the present and future of SQL databases, a “light” version of Azure SQL databases designed for Microsoft Fabric environments and workloads. SQL databases are currently available for testing, and after the first tests the product looks promising. Even if there are several feature gaps, it’s expected that Microsoft will bridge them over time. Conversely, there might be features that don’t make sense in Fabric, as well as new features that need to be considered for facilitating the work in OneLake and its ecosystem.

    During the session several Microsoft professionals answered the audience’s questions, and they did a great job. Even if the answers and questions barely scratched the surface, they offered some insight into what Microsoft wants to do. Probably the expectation is that SQL databases won’t need any administration - indexes being maintained automatically, infrastructure scaling as needed. However, everything sounds too good to be true if one considers the general experience with RDBMS – the devil usually hides in the details.

    Even if the solutions built follow the best practices in the field, which frankly seldom happens, transferring the existing knowledge to Fabric may encounter some important challenges revolving around performance, flexibility, accessibility and probably costs. Even if SQL databases are expected to fill some minor gaps, considering the lessons of the past, such solutions can easily grow. Even if a lot of processing power is thrown at the SQL queries and the various functionality, customers still need to write quality code and refactor; otherwise the costs will explode sooner or later. 

    As practice has proven so many times while troubleshooting performance issues, sometimes one needs to use all the arsenal available – DBCC commands, DMVs and sometimes even undocumented features – to get a better understanding of what’s happening. Even if there are some voices stating that developers don’t need to know how the SQL engine works, blindly applying solutions from a recipe can occasionally improve the code, though most likely it doesn’t exploit the full potential available. Unfortunately, this is a subjective topic without hard numbers to support it, and the stories told by developers and third parties usually don’t tell the whole story. 

    It’s also true that diving deep into a database’s internal workings requires time that’s quite often not available, and the value of such an effort doesn’t necessarily pay off. Above this, there’s the software engineer’s aim of understanding how things work; otherwise, one should drop the engineering word and just call it coding. Conversely, the data citizen just needs a high-level knowledge of how things work, though the past 20-30 years proved that that’s often not enough. The more people lack the required knowledge, the higher the chances that the code will need refactoring. Just remember the past issues organizations had with MS Access and Excel when people started to create their own solutions, the whole infrastructure being invaded by poorly designed solutions that continue to haunt some organizations even today.

    Even if a lot of technical knowledge can be transported to Microsoft Fabric, the new environments may still require adequate tools for monitoring and troubleshooting. Microsoft seems to work in this direction, though from the information available the tools don’t and can’t offer the whole perspective. It will be interesting to see how much the current and future dashboards and various reports can help, and what important gaps will surface. Until the gaps are addressed, the SQL professional must probably rely on SQL scripts and the DMVs available. All this can be summarized in a few words: it will not be boring!


    References:
    [1] Microsoft Reactor (2025) Ask The Expert - Fabric Edition - Fabric Databases [link]

    29 January 2025

    🌌🏭KQL Reloaded: First Steps (Part I: Simple Queries)

    If one has followed the Microsoft training, webcasts and other resources, it becomes clear that the Kusto Query Language (KQL) can't be ignored, as it's useful in various scenarios that deal with large volumes of data. 

    Even if KQL was created to query big tables, at least for the first steps it's recommended to start with a small data model, and once the basics are understood one can move to bigger databases. Moreover, it's easier to validate the logic when using small datasets.

    Probably, the most important aspect before starting is that KQL is case-sensitive and this applies to everything – table and column names, operators, functions, etc. The below code can be tried in the online explorer (see [2], [3]), which makes available several databases for playing around. The users might need to register beforehand.

    The following queries are based on the ContosoSales database (available in the above-mentioned link). First, here are some simple queries. Each query is preceded by a short description, commented via "//", and must be run individually. 

    // selecting all records
    Customers 
    
    // selecting 10 records
    Customers 
    | take 10
    
    // multiple filters
    Customers 
    | where CityName == 'Berkeley'
    | where Occupation != 'Professional'
    | take 10
    
    // multiple filters on the same column
    Customers 
    | where Occupation == 'Professional' or Occupation == 'Clerical'
    | take 10
    
    // multiple filters on the same column
    Customers 
    | where Occupation in ('Professional','Clerical')
    | take 10
    
    // multiple filters on the same column
    Customers 
    | where not(Occupation in ('Professional','Clerical'))
    | take 10
    
    //subset of columns
    Customers
    | take 5
    | project ContinentName, CityName, FirstName, LastName, Gender
    

    Here are some examples for the selection of unique values, the equivalent of SELECT DISTINCT from SQL:

    //distinct values used
    Customers
    | distinct  Occupation
    
    //distinct values used sorted ascendingly 
    Customers
    | distinct  Occupation
    | sort by Occupation asc
    
    //combinations of values used
    Customers
    | distinct  Occupation, Education
    | sort by Occupation asc, Education asc
    

    When aggregated data is needed, one can resort to grouping values, the equivalent of GROUP BY:

    // record count
    Customers 
    | count
    
    // record count for constraint
    Customers 
    | where CityName == 'Berkeley'
    | count
    
    // record count for constraint: returns 0 records (KQL is case sensitive)
    Customers 
    | where CityName == 'BERKELEY'
    | count
    
    // number of records by occupation 
    Customers
    | summarize occupations_count = count() by Occupation
    
    // number of records by occupation with bar chart visual
    Customers
    | summarize occupations_count = count() by Occupation
    | render barchart
    

    The last query renders the data directly as a bar chart, which is a cool feature, especially when one needs to understand the distribution of values. Executing the query without the last line returns the dataset in tabular form.

    Azure Data Explorer - Chart Example

    Here are some first impressions:
    1) The language is relatively simple, though the transition from SQL to KQL requires time, even if the thinking process of writing the code is not that far away. For those with no SQL knowledge, the transition might be a bit more challenging, though practice makes perfect!
    2) One can try to run the code line by line to understand the steps used.
    3) There are also some online code converters from SQL to KQL (see for example msticpy).
    4) The intellisense capabilities facilitate the overall experience. 
    5) Unfortunately, there seems to be no code formatter to HTML for KQL, so one needs to compromise in one way or another.
    6) For a comparison between SQL and KQL see [4].

    Happy coding!


    References:
    [1] Microsoft Learn (2024) Kusto Query Language learning resources [link]
    [2] TestingSpot Blog (2021) Learning Kusto Query Language - A tool for performance test engineers [link]
    [3] Azure Data Explorer (2025) Create a free cluster to explore your data [link]
    [4] Microsoft Learn (2024) SQL to Kusto Query Language cheat sheet [link]

    Resources:
    [R1] GitHub repository (2022) Sentinel Queries [link]
    [R2] GitHub repository (2022) Kusto Query Language Grammar [link]
    [R3] GitHub repository (2024) The Definitive Guide to KQL source code [link]
    [R4] GitHub repository (2022) Learning Kijo [link]
    [R5] Kusto gamification [link]

    22 January 2025

    🏭🗒️Microsoft Fabric: Clone Tables in Warehouses [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 22-Jan-2025

    [Microsoft Fabric] Zero-copy Clone

    • {def} a replica of an existing OneLake table created by copying existing table's metadata and referencing its data files [1]
      • the metadata is copied while the underlying data of the table stored as parquet files is not copied [1]
      • its creation is like creating a delta table [1]
      • DML/DDL changes
        • made on the source are not reflected in the clone table [1]
        • made on the clone are not reflected in the source table [1]
      • can be created within or across schemas in a warehouse [1]
      • created based on either:
        • current point-in-time
          • based on the present state of the table [1]
        • previous point-in-time
          • based on a point-in-time up to seven days in the past
            • the table clone contains the data as it appeared at a desired past point in time
            • all CRUD operations are retained for seven calendar days
          • created with a timestamp based on UTC
    • {characteristic} autonomous existence
      • the original source and the clones can be deleted without any constraints [1]
      • once a clone is created, it remains in existence until deleted by the user [1]
    • {characteristic} inherits 
      • object-level SQL security from the source table of the clone [1]
        • DENY permission can be set on the table clone if desired [1]
          • the workspace roles provide read access by default [1]
      • all attributes that exist at the source table, whether the clone was created within the same schema or across different schemas in a warehouse [1]
      • the primary and unique key constraints defined in the source table [1]
    • a read-only delta log is created for every table clone that is created within the Warehouse [1]
    • {benefit} facilitates development and testing processes 
      • by creating copies of tables in lower environments [1]
    • {benefit} provides consistent reporting and zero-copy duplication of data for analytical workloads and ML modeling and testing [1]
    • {benefit} provides the capability of data recovery in the event of a failed release or data corruption by retaining the previous state of data [1]
    • {benefit} helps create historical reports that reflect the state of data as it existed as of a specific point-in-time in the past [1]
    • {limitation} table clones across warehouses in a workspace are not currently supported [1]
    • {limitation} table clones across workspaces are not currently supported [1]
    • {limitation} clone table is not supported on the SQL analytics endpoint of the Lakehouse [1]
    • {limitation} clone of a warehouse or schema is currently not supported [1]
    • {limitation} a clone based on a point in time prior to the seven-day retention period cannot be created [1]
    • {limitation} cloned tables do not currently inherit row-level security or dynamic data masking [1]
    • {limitation} changes to the table schema prevent a clone from being created prior to the table schema change [1]
    • {best practice} create the clone tables in dedicated schema(s)
    • [syntax] CREATE TABLE <schema.clone_table_name> AS CLONE OF <schema.table_name>
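
    For illustration, here is a minimal sketch of both clone variants based on the syntax above; the schema, the table names and the timestamp are assumptions, and the point-in-time form is sketched with an AT '<UTC timestamp>' clause (see [4] for the exact syntax):

    -- clone based on the current point-in-time
    CREATE TABLE dbo.DimCustomer_Test AS CLONE OF dbo.DimCustomer;

    -- clone based on a previous point-in-time (UTC), within the seven-day retention window
    CREATE TABLE dbo.DimCustomer_Snapshot AS CLONE OF dbo.DimCustomer
        AT '2025-01-20T10:00:00.000';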


    References:
    [1] Microsoft Learn (2023) Clone table in Microsoft Fabric [link]
    [2] Microsoft Learn (2024) Tutorial: Clone tables in the Fabric portal [link]
    [3] Microsoft Learn (2024) Tutorial: Clone a table with T-SQL in a Warehouse [link]
    [4] Microsoft Learn (2024) SQL: CREATE TABLE AS CLONE OF [link]

    🏭🗒️Microsoft Fabric: Folders [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 22-Jan-2025

    [Microsoft Fabric] Folders

    • {def} organizational units inside a workspace that enable users to efficiently organize and manage artifacts in the workspace [1]
    • identifiable by its name
      • {constraint} must be unique in a folder or at the root level of the workspace
      • {constraint} can’t include certain special characters [1]
        • C0 and C1 control codes [1]
        • leading or trailing spaces [1]
        • characters: ~"#.&*:<>?/{|} [1]
      • {constraint} can’t have system-reserved names
        • e.g. $recycle.bin, recycled, recycler.
      • {constraint} its length can't exceed 255 characters
    • {operation} create folder
      • can be created in
        • an existing folder (aka nested subfolder) [1]
          • {restriction} a maximum of 10 levels of nested subfolders can be created [1]
          • up to 10 folders can be created in the root folder [1]
          • {benefit} provide a hierarchical structure for organizing and managing items [1]
        • the root
    • {operation} move folder
    • {operation} rename folder
      • the same rules apply as for folder creation [1]
    • {operation} delete folder
      • {restriction} currently only empty folders can be deleted [1]
        • {recommendation} make sure the folder is empty [1]
    • {operation} create item in a folder
      • {restriction} certain items can’t be created in a folder
        • dataflows gen2
        • streaming semantic models
        • streaming dataflows
      • ⇐ items created from the home page or the Create hub, are created at the root level of the workspace [1]
    • {operation} move file(s) between folders [1]
    • {operation} publish to folder [1]
      •   Power BI reports can be published to specific folders
        • {restriction} reports' names must be unique throughout an entire workspace, regardless of their location [1]
          • when publishing a report to a workspace that has another report with the same name in a different folder, the report will publish to the location of the already existing report [1]
    • {limitation} may not be supported by certain features
      •   e.g. Git
    • {recommendation} use folders to organize workspaces [1]
    • {permissions}
      • inherit the permissions of the workspace where they're located [1] [2]
      • workspace admins, members, and contributors can create, modify, and delete folders in the workspace [1]
      • viewers can only view folder hierarchy and navigate in the workspace [1]
    • [deployment pipelines] when deploying items in folders to a different stage, the folder hierarchy is automatically applied [2]


    References:
    [1] Microsoft Fabric (2024) Create folders in workspaces [link]
    [2] Microsoft Fabric (2024) The deployment pipelines process [link]
    [3] Microsoft Fabric Updates Blog (2025) Define security on folders within a shortcut using OneLake data access roles [link]
    [4] Microsoft Fabric Updates Blog (2025) Announcing the General Availability of Folder in Workspace [link]
    [5] Microsoft Fabric Updates Blog (2025) Announcing Folder in Workspace in Public Preview [link]
    [6] Microsoft Fabric Updates Blog (2025) Getting the size of OneLake data items or folders [link]

    21 January 2025

    🧊🗒️Data Warehousing: Extract, Transform, Load (ETL) [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

    Last updated: 21-Jan-2025

    [Data Warehousing] Extract, Transform, Load (ETL)  

    • {def} automated process which takes raw data, extracts the data required for further processing, transforms it into a format that addresses business' needs, and loads it into the destination repository (e.g. data warehouse)
      • includes 
        • the transportation of data
        • overlaps between stages
        • changes in flow 
          • due to 
            • new technologies
            • changing requirements
        • changes in scope
        • troubleshooting
          • due to data mismatches
    • {step} extraction
      • data is extracted directly from the source systems or intermediate repositories
        • data may be made available in intermediate repositories, when the direct access to the source system is not possible
          •  this approach can add a complexity layer
      •  {substep} data validation
        • an automated process that validates whether data pulled from sources has the expected values
        • relies on a validation engine
          • rejects data if it falls outside the validation rules
          • analyzes rejected records on an ongoing basis to
            • identify what went wrong
            • correct the source data
            • modify the extraction to resolve the problem in the next batches
    • {step} transform
      • transforms the data, removing extraneous or erroneous data
      • applies business rules 
      • checks data integrity
        • ensures that the data is not corrupted in the source or corrupted by ETL
        • may ensure no data was dropped in previous stages
      • aggregates the data if necessary
    • {step} load
      • {substep} store the data into a staging layer
        • transformed data are not loaded directly into the target but staged into an intermediate layer (e.g. database)
        • {advantage} makes it easier to roll back, if something went wrong
        • {advantage} allows to develop the logic iteratively and publish the data only when needed 
        • {advantage} can be used to generate audit reports for 
          • regulatory compliance 
          • diagnose and repair of data problems
        • modern ETL processes perform transformations in place, instead of in staging areas
      • {substep} publish the data to the target
        • loads the data into the target table(s) (see the staging-to-target sketch at the end of this post)
        • {scenario} the existing data are overwritten every time the ETL pipeline loads a new batch
          • this might happen daily, weekly, or monthly
        • {scenario} add new data without overwriting the existing data
          • a timestamp can indicate which data is new
        • {recommendation} prevent the loading process from erroring out due to disk space and performance limitations
    • {approach} building an ETL infrastructure
      • involves integrating data from one or more data sources and testing the overall processes to ensure the data is processed correctly
        • recurring process
          • e.g. data used for reporting
        • one-time process
          • e.g. data migration
      • may involve 
        • multiple source or destination systems 
        • different types of data
          • e.g. reference, master and transactional data
          • ⇐ may have complex dependencies
        • different level of structuredness
          • e.g. structured, semistructured, nonstructured 
        • different data formats
        • data of different quality
        • different ownership 
    • {recommendation} consider ETL best practices
      • {best practice} define requirements upfront in a consolidated and consistent manner
        • allows to set clear expectations, consolidate the requirements, estimate the effort and costs, and get the sign-off
        • the requirements may involve all the aspects of the process
          • e.g. data extraction, data transformation, standard formatting, etc.
      • {best practice} define a high level strategy
        • allows to define the road ahead, risks and other aspects 
        • allows to provide transparency
        • this may be part of a broader strategy that can be referenced 
      • {best practice} align the requirements and various aspects to the strategies existing in the organization
        • allows to consolidate the various efforts and make sure that the objectives, goals and requirements are aligned
        • e.g. IT, business, Information Security, Data Management strategies
      • {best practice} define the scope upfront
        • allows to better estimate the effort and validate the outcomes
        • even if the scope may change in time, this allows to provide transparency and can be used as a basis for the time and cost estimations
      • {best practice} manage the effort as a project and use a suitable Project Management methodology
        • allows to apply structured well-established PM practices 
        • it might be suitable to adapt the methodology to the project's particularities 
      • {best practice} convert data to standard formats to standardize data processing
        • allows to reduce the volume of issues resulting from data type mismatches
        • applies mainly to dates, numeric or other values for which can be defined standard formats
      • {best practice} clean the data in the source systems, when  cost-effective
        • allows to reduce the overall effort, especially when this is done in advance
        • this should be based ideally on the scope
      • {best practice} define and enforce data ownership
        • allows to enforce clear responsibilities across the various processes
        • allows to reduce the overall effort
      • {best practice} document data dependencies
        • document the dependencies existing in the data at the various levels
      • {best practice} log the data movement from source(s) to destination(s) in terms of data volume
        • allows to provide transparency into the data movement process
        • allows to identify gaps in the data or broader issues
        • can be used for troubleshooting and understanding the overall data growth
    • {recommendation} consider proven systems, architectures and methodologies
      • allows to minimize the overall effort and costs associated with the process
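
    As a minimal sketch of the publish substep referenced above, the statement below upserts new and changed rows from a staging table into a target table; all object and column names are assumptions, and the pattern needs to be adapted to the actual platform (some targets don't support MERGE and require separate UPDATE/INSERT statements):

    -- staging-to-target publish: upsert new and changed rows
    MERGE dbo.DimCustomer AS tgt
    USING stg.Customer AS src
        ON tgt.CustomerID = src.CustomerID
    WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN
        UPDATE SET tgt.Name = src.Name,
                   tgt.Country = src.Country,
                   tgt.RowHash = src.RowHash,
                   tgt.ModifiedDate = SYSUTCDATETIME()
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerID, Name, Country, RowHash, ModifiedDate)
        VALUES (src.CustomerID, src.Name, src.Country, src.RowHash, SYSUTCDATETIME());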

    20 January 2025

    🏭🗒️Microsoft Fabric: [Azure] Service Principals (SPN) [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 20-Jan-2025

    [Azure] Service Principal (SPN)  

    • {def} a non-human, application-based security identity used by applications or automation tools to access specific Azure resources [1]
      • can be assigned precise permissions, making them perfect for automated processes or background services
        • allows to minimize the risks of human error and identity-based vulnerabilities
        • supported in datasets, Gen1/Gen2 dataflows, datamarts [2]
        • authentication type 
          • supported only by [2]
            • Azure Data Lake Storage
            • Azure Data Lake Storage Gen2
            • Azure Blob Storage
            • Azure Synapse Analytics
            • Azure SQL Database
            • Dataverse
            • SharePoint online
          • doesn’t support
            • SQL data source with Direct Query in datasets [2]
    • when registering a new application in Microsoft Entra ID, a SPN is automatically created for the app registration [4]
      • the access to resources is restricted by the roles assigned to the SPN
        • ⇒ gives control over which resources can be accessed and at which level [4]
      • {recommendation} use SPN with automated tools [4]
        • rather than allowing them to sign in with a user identity  [4]
      • {prerequisite} an active Microsoft Entra user account with sufficient permissions to 
        • register an application with the tenant [4]
        • assign to the application a role in the Azure subscription [4]
        •  requires Application.ReadWrite.All permission [4]
    • extended to support Fabric Data Warehouses [1]
      • {benefit} automation-friendly API Access
        • allows to create, update, read, and delete Warehouse items via Fabric REST APIs using service principals [1]
        • enables to automate repetitive tasks without relying on user credentials [1]
          • e.g. provisioning or managing warehouses
          • increases security by limiting human error
        • the warehouses thus created will be displayed in the Workspace list view in the Fabric UI, with the SPN's name as the Owner [1]
        • applicable to users with administrator, member, or contributor workspace role [3]
        • minimizes risk
          • the warehouses created with delegated account or fixed identity (owner’s identity) will stop working when the owner leaves the organization [1]
            • Fabric requires the user to login every 30 days to ensure a valid token is provided for security reasons [1]
      • {benefit} seamless integration with Client Tools: 
        • tools like SSMS can connect to the Fabric DWH using SPN [1]
        • SPN provides secure access for developers to 
          • run COPY INTO
            • with and without firewall enabled storage [1]
          • run any T-SQL query programmatically on a schedule with ADF pipelines [1]
      • {benefit} granular access control
        • Warehouses can be shared with an SPN through the Fabric portal [1]
          • once shared, administrators can use T-SQL commands to assign specific permissions to SPN [1]
              • allows to control precisely which data and operations an SPN has access to [1] (see the expanded sketch after these notes)
              • GRANT SELECT ON <table name> TO <Service principal name>
        • warehouses' ownership can be changed from an SPN to user, and vice-versa [3]
      • {benefit} improved DevOps and CI/CD Integration
        • SPN can be used to automate the deployment and management of DWH resources [1]
          •  ensures faster, more reliable deployment processes while maintaining strong security postures [1]
      • {limitation} default semantic models are not supported for SPN created warehouses [3]
        • ⇒ features such as listing tables in dataset view, creating report from the default dataset don’t work [3]
      • {limitation} SPN for SQL analytics endpoints is not currently supported
      • {limitation} SPNs are currently not supported for COPY INTO error files [3]
        • ⇐ Entra ID credentials are not supported as well [3]
      • {limitation} SPNs are not supported for GIT APIs. SPN support exists only for Deployment pipeline APIs [3]
      • monitoring tools
        • [DMV] sys.dm_exec_sessions.login_name column [3] 
        • [Query Insights] queryinsights.exec_requests_history.login_name [3]
        • Query activity
          • submitter column in Fabric query activity [3]
        • Capacity metrics app: 
          • compute usage for warehouse operations performed by SPN appears as the Client ID under the User column in Background operations drill through table [3]
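
    Expanding on the GRANT example above, here is a minimal sketch of granular T-SQL permissions for a service principal the warehouse has been shared with; 'MyAutomationApp' and the object names are assumed placeholders:

    -- read access on a specific table for the service principal
    GRANT SELECT ON dbo.FactSales TO [MyAutomationApp];

    -- allow executing procedures in a dedicated schema
    GRANT EXECUTE ON SCHEMA::etl TO [MyAutomationApp];

    -- explicitly deny access to sensitive data
    DENY SELECT ON dbo.DimEmployee TO [MyAutomationApp];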

    References:
    [1] Microsoft Fabric Updates Blog (2024) Service principal support for Fabric Data Warehouse [link]
    [2] Microsoft Fabric Learn (2024) Service principal support in Data Factory [link]
    [3] Microsoft Fabric Learn (2024) Service principal in Fabric Data Warehouse [link]
    [4] Microsoft Fabric Learn (2024) Register a Microsoft Entra app and create a service principal [link]
    [5] Microsoft Fabric Updates Blog (2024) Announcing Service Principal support for Fabric APIs [link]
     
    Acronyms:
    ADF - Azure Data Factory
    API - Application Programming Interface
    CI/CD - Continuous Integration/Continuous Deployment
    DMV - Dynamic Management View
    DWH - Data Warehouse
    SPN - service principal
    SSMS - SQL Server Management Studio

    06 January 2025

    💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part VII: Things That Don't Work) [new feature]

    Microsoft does a relatively good job of documenting what doesn't work in Microsoft Fabric's SQL databases. There's a good overview available already in the documentation, though beyond this the current post lists my findings from testing code previously written on this blog.

    USE Database

    The standard syntax allows changing the database context via USE to the specified database or database snapshot. Unfortunately, this syntax doesn't seem to be supported currently, and many scripts rely heavily on it. Thus, the following lines of code throw an error:

    -- changing the context
    USE master;
    GO
    USE tempdb;
    

    "Msg 40508, Level 16, State 1, Line 1, USE statement is not supported to switch between databases. Use a new connection to connect to a different database"

    However, one can use the 3-part naming convention to reference the various objects:

    -- sys metadata - retrieving the database files
    
    SELECT *
    FROM tempdb.sys.database_files dbf
    ORDER BY name;

    Even if the tempdb is not listed in the sys.databases table, it's still available for querying, which can prove helpful for troubleshooting. 

    DBCC commands 

    The documentation warns that some DBCC commands won't work, though in some cases there are also alternatives. For example:

    -- clearing the procedure cache via DBCC
    DBCC FREEPROCCACHE;
    
    Output:
    "Msg 2571, Level 14, State 9, Line 1, User '<user>' does not have permission to run DBCC freeproccache."

    Alternatively, one can use the following command, which seems to work:

    -- clearing the procedure cache via ALTER
    ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE;

    CHECKDB, which checks the logical and physical integrity of all the objects in the specified database, can't be used either:
     
    -- Checking the logical and physical integrity of a database
    DBCC CHECKDB();
    
    Output:
    "Msg 916, Level 14, State 2, Line 1, The server principal "..." is not able to access the database "..." under the current security context."
    The same error message is received for CHECKTABLE, a utility which checks the integrity of all the pages and structures that make up the table (or indexed view):

    -- checking a table's integrity
    DBCC CHECKTABLE ('SalesLT.Address');
    
    Output:
    "Msg 916, Level 14, State 2, Line 2, The server principal "..." is not able to access the database "..." under the current security context."
    A similar error message is received for SQLPERF, which provides transaction log space usage statistics for all databases:

    -- retrieving the LOGSPACE information for all databases
    DBCC SQLPERF (LOGSPACE);
    
    Output: 
    "Msg 297, Level 16, State 10, Line 1, The user does not have permission to perform this action."
    There are however DBCC commands like SHOW_STATISTICS or SHRINKDATABASE which do work. 
     
    -- current query optimization statistics
    DBCC SHOW_STATISTICS('SalesLT.Address','PK_Address_AddressID');
    
    Output:
    Name: PK_Address_AddressID
    Updated: Dec 21 2024 3:02AM
    Rows: 450
    Rows Sampled: 450
    Steps: 197
    Density: 1
    Average key length: 4
    String Index: NO
    Filter Expression: NULL
    Unfiltered Rows: 450
    Persisted Sample Percent: 0

    SHRINKDATABASE shrinks the size of the data and log files in the specified database:

    -- shrinking database
    DBCC SHRINKDATABASE([AdventureWorks01-...]) WITH NO_INFOMSGS;
    

    Update 29-Jan-2025: According to an answer from the Ask the Expert session on Fabric Databases [3], Microsoft seems to be working on bringing more DBCC features to SQL databases.

    Happy coding!


    References:
    [1] Microsoft Learn (2024) SQL Server: USE <database> [link]
    [2] Microsoft Learn (2024) Database console commands [link]
    [3] Microsoft Reactor (2025) Ask The Expert - Fabric Edition - Fabric Databases [link]

