
17 March 2025

🏭🗒️Microsoft Fabric: Z-Order [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 17-Mar-2025

[Microsoft Fabric] Z-Order
  • {def} technique to collocate related information in the same set of files [2]
    • ⇐ reorganizes the layout of each data file so that similar column values are strategically collocated near one another for maximum efficiency [1]
    • {benefit} efficient query performance
      • reduces the amount of data to read [2] for certain queries
        • when the data is appropriately ordered, more files can be skipped [3]
        • particularly important for the ordering of multiple columns [3]
    • {benefit} data skipping
      • automatically skips irrelevant data, further enhancing query speeds
        • via data-skipping algorithms [2]
    • {benefit} flexibility
      • can be applied to multiple columns, making it versatile for various data schemas
    • aims to produce evenly-balanced data files with respect to the number of tuples
      • ⇐ but not necessarily data size on disk [2]
        • ⇐ the two measures are most often correlated [2]
          • ⇐ but there can be situations when that is not the case, leading to skew in optimize task times [2]
    • via the ZORDER BY clause (see the sketch after this list)
      • applicable to columns with high cardinality commonly used in query predicates [2]
      • multiple columns can be specified as a comma-separated list
        • {warning} the effectiveness of the locality drops with each extra column [2]
          • has tradeoffs
            • it’s important to analyze query patterns and select the right columns when Z Ordering data [3]
        • {warning} using columns that do not have statistics collected on them is ineffective and wastes resources [2]
          • statistics collection can be configured on certain columns by reordering columns in the schema, or by increasing the number of columns to collect statistics on [2]
      • {characteristic} not idempotent
        • every time it is executed, it will try to create a new clustering of data in all files in a partition [2]
          • it includes new and existing files that were part of previous z-ordering [2]
      • compatible with v-order
    • {concept} [Databricks] liquid clustering 
      • replaces table partitioning and ZORDER to simplify data layout decisions and optimize query performance [4] [6]
        • not compatible with the respective features [4] [6]
      • tables created with liquid clustering enabled have numerous Delta table features enabled at creation [4] [6]
      • provides flexibility to redefine clustering keys without rewriting existing data [4] [6]
        • ⇒ allows data layout to evolve alongside analytic needs over time [4] [6]
      • applies to 
        • streaming tables 
        • materialized views
      • {scenario} tables often filtered by high cardinality columns [4] [6]
      • {scenario} tables with significant skew in data distribution [4] [6]
      • {scenario} tables that grow quickly and require maintenance and tuning effort [4] [6]
      • {scenario} tables with concurrent write requirements [4] [6]
      • {scenario} tables with access patterns that change over time [4] [6]
      • {scenario} tables where a typical partition key could leave the table with too many or too few partitions [4] [6]
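
    A minimal sketch of how the above can look in Spark SQL from a Fabric notebook; the table and column names (sales_orders, CustomerId) are illustrative rather than taken from the references, and the liquid clustering example assumes a runtime that supports the CLUSTER BY clause:

    -- z-ordering a Delta table on a high-cardinality column used in query predicates
    OPTIMIZE sales_orders
    ZORDER BY (CustomerId);

    -- [Databricks] liquid clustering defined at table creation (not to be combined with ZORDER)
    CREATE TABLE sales_orders_lc (
      OrderId BIGINT
    , CustomerId BIGINT
    , OrderDate DATE
    ) USING DELTA
    CLUSTER BY (CustomerId);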

    References:
    [1] Bennie Haelen & Dan Davis (2024) Delta Lake Up & Running: Modern Data Lakehouse Architectures with Delta Lake
    [2] Delta Lake (2023) Optimizations [link]
    [3] Delta Lake (2023) Delta Lake Z Order, by Matthew Powers [link]
    [4] Delta Lake (2025) Use liquid clustering for Delta tables [link]
    [5] Databricks (2025) Delta Lake table format interoperability [link]
    [6] Microsoft Learn (2025) Use liquid clustering for Delta tables [link]

    Resources:
    [R1] Azure Guru (2024) Z Order in Delta Lake - Part 1 [link]
    [R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

    Acronyms:
    MF - Microsoft Fabric 

    🏭🗒️Microsoft Fabric: V-Order [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 17-Mar-2025

    [Microsoft Fabric] V-Order
    • {def} write time optimization to the parquet file format that enables fast reads under the MF compute engine [2]
      • all parquet engines can read the files as regular parquet files [2]
      • results in smaller and therefore faster-to-read files [5]
        • {benefit} improves read performance 
        • {benefit} decreases storage requirements
        • {benefit} optimizes resources' usage
          • reduces the compute resources required for reading data
            • e.g. network bandwidth, disk I/O, CPU usage
      • still conforms to the open-source Parquet file format [5]
        • they can be read by non-Fabric tools [5]
      • delta tables created and loaded by Fabric items automatically apply V-Order
        • e.g. data pipelines, dataflows, notebooks [5]
      • delta tables and their features are orthogonal to V-Order [2]
        •  e.g. Z-Order, compaction, vacuum, time travel
        • table properties and optimization commands can be used to control the v-order of the partitions [2]
      • compatible with Z-Order [2]
      • not all files have this optimization applied [5]
        • e.g. Parquet files uploaded to a Fabric lakehouse, or that are referenced by a shortcut 
        • the files can still be read, though the read performance likely won't be as fast as for an equivalent Parquet file that's had V-Order applied [5]
      • required by certain features
        • [hash encoding] to assign a numeric identifier to each unique value contained in the column [5]
      • {command} OPTIMIZE 
        • optimizes a Delta table to coalesce smaller files into larger ones [5]
        • can apply V-Order to compact and rewrite the Parquet files (see the sketch after this list) [5]
    • [warehouse] 
      • works by applying certain operations on Parquet files
        • special sorting
        • row group distribution
        • dictionary encoding
        • compression 
      • enabled by default
      •  ⇒ compute engines require less network, disk, and CPU resources to read data from storage [1]
        • provides cost efficiency and performance [1]
          • the effect of V-Order on performance can vary depending on tables' schemas, data volumes, query, and ingestion patterns [1]
        • fully compliant with the open-source parquet format [1]
          • ⇐ all parquet engines can read it as regular parquet files [1]
      • required by certain features
        • [Direct Lake mode] depends on V-Order
      • {operation} disable V-Order
        • causes any new Parquet files produced by the warehouse engine to be created without V-Order optimization [3]
        • irreversible operation
          •  once disabled, it cannot be enabled again [3]
        • {scenario} write-intensive warehouses
          • warehouses dedicated to staging data as part of a data ingestion process [1]
        • {warning} consider the effect of V-Order on performance before deciding to disable it [1]
          • {recommendation} test how V-Order affects the performance of data ingestion and queries before deciding to disable it [1]
        • via ALTER DATABASE CURRENT SET VORDER = OFF; [3]
      • {operation} check current status
        • via  SELECT name, is_vorder_enabled FROM sys.databases; [post]
    • {feature} [lakehouse] Load to Table
      • allows to load a single file or a folder of files to a table [6]
      • tables are always loaded using the Delta Lake table format with V-Order optimization enabled [6]
    • [Direct Lake semantic model] 
      • data is prepared for fast loading into memory [5]
        • makes less demands on capacity resources [5]
        • results in faster query performance [5]
          • because less memory needs to be scanned [5]
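
    A minimal sketch for checking and controlling V-Order; the warehouse statements are the ones referenced in the notes above [1] [3], while the lakehouse-side table property and OPTIMIZE ... VORDER syntax follow [2] at the time of writing (the table name sales_orders is illustrative; verify the exact property names against the current documentation):

    -- [warehouse] check the current status, then disable V-Order (irreversible operation!)
    SELECT name, is_vorder_enabled 
    FROM sys.databases;

    ALTER DATABASE CURRENT SET VORDER = OFF;

    -- [lakehouse, Spark SQL] enable V-Order at table level and rewrite the files with it applied
    ALTER TABLE sales_orders 
    SET TBLPROPERTIES('delta.parquet.vorder.enabled' = 'true');

    OPTIMIZE sales_orders VORDER;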

    References:
    [1] Microsoft Learn (2024) Fabric: Understand V-Order for Microsoft Fabric Warehouse [link]
    [2] Microsoft Learn (2024) Delta Lake table optimization and V-Order [link]
    [3] Microsoft Learn (2024) Disable V-Order on Warehouse in Microsoft Fabric [link]
    [4] Miles Cole (2024) To V-Order or Not: Making the Case for Selective Use of V-Order in Fabric Spark [link]
    [5] Microsoft Learn (2024) Understand storage for Direct Lake semantic models [link]
    [6] Microsoft Learn (2025) Fabric: Load to Delta Lake table [link]

    Resources:
    [R1] Serverless.SQL (2024) Performance Analysis of V-Ordering in Fabric Warehouse: On or Off?, by Andy Cutler [link]
    [R2] Redgate (2023) Microsoft Fabric: Checking and Fixing Tables V-Order Optimization, by Dennes Torres [link]
    [R3] Sandeep Pawar (2023) Checking If Delta Table in Fabric is V-order Optimized [link]
    [R4] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

    Acronyms:
    MF - Microsoft Fabric

    15 March 2025

    💫🗒️ERP Systems: Microsoft Dynamics 365's Business Performance Analytics (BPA) [notes]

    Disclaimer: This is work in progress intended to consolidate information from the various sources and not to provide a complete overview of all the features. Please refer to the documentation for a complete overview!

    Last updated: 15-Mar-2025

    [Dynamics 365] Business Performance Analytics (BPA)

    • {def} centralized reporting hub within D365 F&O designed to streamline insights and help organizations make faster, data driven decisions [3]
      • solution designed to transform organization's data into actionable insights [1]
      • provides an intuitive out-of-box data model along with familiar tools like Microsoft Excel and Power BI for self-service analytics [4]
        • data extracted from D365 is classified in BPA in the form of value chains
          • ⇐ a group of business processes on top of the value chain [4]
    • {benefit} allows to simplify data insights by providing a unified view of business data across entities in near real time [4]
    • {benefit} allows to streamline financial and operations reporting to reduce the cycle times [4]
    • {benefit} allows users of all technical abilities to quickly access and analyze data to facilitate data driven decisions [4]
    • {benefit} provides auditors with direct access to financial data, making the audit process more efficient
    • {benefit} enables ease of use through familiar apps like Excel and Power BI, in addition to AI driven insights and automation in this platform that can be scalable and extendable [4]
    • {feature} extends into Microsoft Fabric
      • {benefit} provide a scalable, secure environment for handling large data sets and ensuring insights are always powered by the latest technology [3]
    • {feature} ETL process 
      • involves extracting data from finance and operations database, transforming and loading it into Dataverse [4]
        • each of the entities required to generate the dimensional model for the previously mentioned value chains is backed by underlying tables in the finance and operations database [4]
      • once installed in Dataverse, the virtual entities that are created pull the data into the managed data lake [4]
      • the data is then transformed to generate the dimensional model, which is pushed into the embedded Power BI workspace in the form of analytical tables [4]
      • BPA consumes this data from Power BI workspace to render the power BI reports [4]
      • this data can also be extended to Fabric if there is a need to consolidate data from multiple sources [4]
    • {feature} reports 
      • designed to provide a detailed overview of an organization's financial health [8]
      • further reports will be added to expand the coverage for the value chains [8]
      • out-of-box reports can't be modified
        • ⇐ users cannot rename, delete or edit these types of reports [8]
        • there’s the option to duplicate the base report and edit the version thus created [8]
      • can be shared with other users who have access to BPA 
        • ⇐ they can receive an in-app notification [8]
        • can be shared over email with another user by entering user’s email address [8] 
        • one can configure whether the recipient can edit or view the report [8]
      •   {feature} allows to create a new Power BI or Excel report from scratch [8]
        • {option} start with a blank report or duplicate an existing report [8]
    • {feature} data refresh
      • automatic data refreshes currently run two times a day [4]
        • at 12:00 AM and 12:00 PM UTC
        • the volume of data is also constrained by the storage capacity of the A3 SKU for Power BI Embedded [1]
          • future release, may support additional data reporting capacity [1]
            • ⇐ so that larger data sets can be reported and analyzed [1]
        • the target is to have refreshes every hour or less [3]
      • initially, the data volume will cover about eight quarters of data [4]
      • extensibility will be supported with bring your own Fabric [4]
    • architecture
      • SaaS solution
        • {capability} immediate deployment 
          • businesses can start to analyze data and generate insights with minimal setup [1]
        • {capability} comprehensive reporting and dashboards
          • provides access to a wide range of preconfigured reports that cover multiple business functions [1]
        • {capability} near-real-time analytics 
          • future releases will offer more frequent data refreshes to enable near-real-time data analysis and reporting
        • {capability} predictive insights 
          • future releases will introduce predictive analytics capabilities that enable businesses to 
            • forecast trends
            • identify risks
            • seize opportunities [1]
        • {capability} user-friendly interface 
          • intuitive design ⇒ minimal training
            • fosters broader adoption 
            • enables a data-driven culture across the organization [1]
        • {capability} cost-effectiveness
          • available as part of D365 license
            • ⇒ provides advanced analytics without requiring significant investments in IT infrastructure [1]
      • DaaS solution
        • {capability} organizations can integrate its data models with their existing data warehousing infrastructure in Microsoft Fabric [1]
          • maximizes the value of existing data solutions [1]
          • positions businesses for future enhancements [1]
        • {capability} unified and scalable data models
          • customers can build custom models on top of a unified framework
            • ensures consistency and scalability across data sets [1]
        • {capability} future-proofing with automatic upgrades
          • data models integrate seamlessly with future D365 updates
            • reduces manual maintenance and ensures access to the latest features [1]
        • {capability} consistency and standardization
          • data models provide consistency and standardization across data sources
            • ensure high data quality and integrity [1]
        • {capability} advanced analytics and AI 
          • by customizing the data models, organizations can take advantage of advanced analytics and AI capabilities [1]
            • deeper insights without having to develop them from scratch [1]
        • {capability} enhanced data governance
          • unified data models support better data governance by providing standardized data definitions, relationships, and hierarchies [1]
            • ensure consistency and quality across the organization [1]
      • requires an integrated Power Platform environment [5]
        • must be integrated with the Microsoft Entra tenant [5]
      • uses shared Dataverse entitlements [1]
        • includes access to the data lake [1]
    • setup
      • dimensions
        • the selection of dimensions might affect the dimension groups that are created using these dimensions and the users who are assigned to them [7]
          • e.g. legal entity, business unit
      • dimension groups
        • users can select specific values for the legal entity, or add a range of values [7]
          • if an invalid combination of dimension values is selected, the dimension group will filter out all the records on the report [7]
        • {warning} assigning too many dimension groups to a user slows the load for that user [7]
      • roles
        • determine which reports the user can access [7]
    • security
      • secure data through role-based access control on top of the value chains [7]
      • the first user who signs into the app is assigned the BPA admin role [7]
        • allows a user to access the administrator section of the BPA [7]
          • where the security can be set up [7]
        • has the following automatically assigned
          • Microsoft report viewer role 
          • the All Access Dimension group [7]
            • allows the admin to see the data in all the reports across all the dimensions [7]
      • {feature} dimension-based role-level security
        • ensures that users only see the data relevant to them based on their role
          •  confidently share reports without duplicating them
            • ⇐ data is automatically filtered by organization's security policies [3]
        • simple but powerful way to maintain control while providing access for teams that love working in Excel [3]
    • accessibility
      • can be accessed through either 
        • Power Platform
          • admins can access the BPA app through the Power Apps maker portal [6]
        • Dynamics 365
          • through the BPA preview shortcut in the homepage or the default dashboard [6]
          • for end users, the BPA preview shortcut is provided when they have certain duties associated with their role(s) [6]
    • licensing
      • included in D365 F&O license [4]
    • requirements
      • requires a Tier 2 environment and Dynamics 365 Finance version 10.0.38 or later [5]
    • {project} timeline
      • [2025 wave 1] backup and restore custom reports and analytics
        • {benefit} support better lifecycle management and empower customers to develop on sandbox instances before publishing to production [3]
      • 2025: available in all regions where F&O is available [3]
      • Oct-2024: GA

    References:
    [1] Microsoft Learn (2024) Dynamics 365 Finance: What is Business performance analytics? [link]
    [2] Microsoft Learn (2025) Business performance analytics (BPA) with Dynamics 365 Finance [link]
    [3] Dynamics 365 Finance - Business Performance Analytics 2025 Release Wave 1 Release Highlights [link]
    [4] Dynamics 365 Community (2024) Dynamics 365 Bites: Business Performance Analytics Part 1 [link]
    [5] Dynamics 365 Community (2024) Dynamics 365 Bites: Business Performance Analytics Part 2 [link]
    [6] Dynamics 365 Community (2024) Dynamics 365 Bites: Business Performance Analytics Part 3 [link]
    [7] Dynamics 365 Community (2024) Dynamics 365 Bites: Business Performance Analytics Part 4 [link]   
    [8] Dynamics 365 Community (2024) Dynamics 365 Bites: Business Performance Analytics Part 5 [link]
    [9] Microsoft Learn (2024) Dynamics 365: Business performance analytics introduction [link]

    Acronyms:
    AI - Artificial Intelligence
    BPA - Business Performance Analytics
    D365 F&O - Dynamics 365 for Finance and Operations
    DaaS - Data-as-a-Service
    ETL - Extract, Transform, Load
    GA - General Availability
    MF - Microsoft Fabric
    PP - Public Preview
    SaaS - Software-as-a-Service
    SKU - Stock Keeping Unit
    UTC - Coordinated Universal Time

    14 February 2025

    🏭🧊🗒️Microsoft Fabric: Partitions in Lakehouses [Notes]

     Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 14-Feb-2025

    [Microsoft Fabric] Partitions

    • {def} a data organization technique used to split a large dataset into smaller, more manageable nonoverlapping subsets (aka partitions, shards)
      • a partition is defined based on one or more fields
      • each partition contains a subset of the data
      • each partition can be stored and processed independently
    • {goal} improve performance, scalability, and manageability of large data tables
    • {benefit} allows splitting large tables into smaller, manageable partitions based on specific criteria (see the sketch after this list) [2]
      • e.g., date ranges, regions, categories, entities
    • {benefit} allows to improve queries' performance as they can target specific partitions [2]
      • reduces the amount of data scanned [2]
      • improves queries' performance [2]
    • {benefit} allows for more efficient data loading [2]
    • {benefit} facilitates the management of big tables [2]
      • maintenance tasks can be performed on individual partitions  [2]
      • obsolete data partitions can be removed with no overhead, adding new partitions on a need basis [2]
    • applies to 
      • backups
      • indexing
      • allows optimizing query performance for specific subsets of data
      • statistics
    • performance can be affected by
      • the choice of partition columns for a delta table [1]
      • the number and size of partitions of the partition column [1]
      • a column with high cardinality (mostly or entirely made of unique values) results in a large number of partitions [1]
        • ⇐ negatively impacts performance of the metadata discovery scan for changes [1]
        • {recommendation} if the cardinality of a column is high, choose another column for partitioning [1]
      • the size of each partition can also affect performance
        • {recommendation} use a column that would result in a partition of at least (or close to) 1 GB [1]
        • {recommendation} follow the best practices for delta tables maintenance [1]
        • a large volume of small-sized parquet files increases the time it takes to sync the changes [1]
          • ⇒ leads to large number of parquet files in a delta table due to [1]
            • over-partitioning
              • partitions with high number of unique values [1]
              • {recommendation} choose a partition column that 
                • doesn't have a high cardinality [1]
                • results in individual partition size of at least 1 GB [1]
            • many small files
              • batch and streaming data ingestion rates might result in small files
                • depends on the frequency and size of changes being written to a lakehouse [1]
              • {recommendation} implement regular lakehouse table maintenance [1] 
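
    A minimal sketch of creating a partitioned Delta table in a lakehouse via Spark SQL, following the recommendations above (a low-cardinality partition column); the table and column names are illustrative:

    -- partitioning by a low-cardinality column instead of a near-unique key
    CREATE TABLE sales_orders (
      OrderId BIGINT
    , CustomerId BIGINT
    , Region STRING
    , OrderDate DATE
    , Amount DECIMAL(18,2)
    )
    USING DELTA
    PARTITIONED BY (Region);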


    References:
    [1] Microsoft Learn (2024) Microsoft Fabric: SQL analytics endpoint performance considerations [link]
    [2] Kenneth A Omorodion (2024) Partitioning Data in Microsoft Fabric to Improve Performance [link]
    [3] Microsoft Learn (2024) Microsoft Fabric: Loading Fabric Lakehouse Tables with partitions [link]

    Resources
    [R1] Microsoft Learn (2024) Microsoft Fabric: Load data to Lakehouse using partition in a Data pipeline [link]
    [R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

    13 February 2025

    🧊💠🗒️Data Warehousing: Table Partitioning in SQL Server [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

    Last updated: 13-Feb-2025

    [Data Warehousing] Table Partitioning

    • {def} the spreading of data across multiple tables based on a set of rules to balance large amounts of data across disks or nodes
      • data is distributed based on a function that defines a range of values for each partition [2] 
        • the table is partitioned by applying the partition scheme to the values in a specified column [2]
    • {operation} partition creation (see the sketch after this list)
      • [large partitioned table]
        • two auxiliary nonindexed empty tables should be created with the same structure, including constraints and data compression options [4]
          • first table: create a check constraint that guarantees that all data from the table fits exactly within one empty partition of the fact table
            • the constraint must be created on the partitioning column [4]
            • a columnstore index can be created on the fact table, as long as it is aligned with the table [4]
            • after truncation of <table 2> the <table 1> is prepared to accept the next partition from your fact table for the next minimally logged deletion [4]
          • second table: for minimally logged deletions of large portions of data, a partition from the fact table can be switched to the empty table version without the check constraint [4]
            • then the table can be truncated
        • for minimally logged inserts, new data should be bulk inserted into the auxiliary table that has the check constraint [4]
          • INSERT operation can be minimally logged because the table is empty [4]
          • create a columnstore index on this auxiliary table, using the same structure as the columnstore index on your fact table [4]
          • switch data from this auxiliary table to a partition of your fact table [4]
          • drop the columnstore index on the auxiliary table, and change the check constraint to guarantee that all of the data for the next load can be switched to the next empty partition of the fact table [4]
          • the second auxiliary table is prepared for new bulk loads again [4]
    • {operation} [Query Optimizer] partition elimination 
      • process in which SQL Server accesses only those partitions needed to satisfy query filters [4]
    • {operation} partition switching 
      • {def} process that switches a block of data from one table or partition to another table or partition [4]
      • types of switches
        • reassign all data from a nonpartitioned table to an empty existing partition of a partitioned table [4]
        • switch a partition of one partitioned table to a partition of another partitioned table [4]
        • reassign all data from a partition of a partitioned table to an existing empty nonpartitioned table [4]
    • {benefit} improves query performance [1]
      • by partitioning a table across filegroups [1]
        • specific ranges of data can be placed on different disk spindles [1]
          • can improve I/O performance [1]
            • ⇐ if the disk storage is already configured as a RAID 10 or RAID 5 array [1]
              • ⇒ this usually has little benefit [1]
        • using a mix of fast solid state storage for recent, frequently accessed data, and mechanical disks for older, less queried rows [1]
          • use partitioning to balance disk performance against storage costs [1]
        • biggest performance gain from partitioning in a data warehouse is realized when queries return a range of rows that are filtered on the partitioning key [1]
          • the query optimizer can eliminate partitions that are not within the filter range [1]
            • dramatically reduce the number of rows that need to be read [1]
      • reduces contention [3]
        • can reduce the number of rows included in a table scan [3]
    • {benefit} more granular manageability [1]
      • some maintenance operations can be performed at partition level instead of on the whole table [1]
        • e.g. indexes can be created and rebuilt on a per-partition basis [1]
        • e.g. compression can be applied to individual partitions [1]
        • e.g. by mapping partitions to filegroups, partitions can be backed up and restored independently [1]
          • enables to back up older data once and then configure the backed up partitions as read-only [1]
          • future backups can be limited to the partitions that contain new or updated data [1]
    • {benefit} improved data load performance
      • enables loading many rows very quickly by switching a staging table with a partition
        • can dramatically reduce the time taken by ETL data loads [1]
          • with the right planning, it can be achieved with minimal requirements to drop or rebuild indexes [1]
    • {best practice} partition large fact tables
      • tables of around 50 GB or more
      • ⇐ in general, fact tables benefit from partitioning more than dimension tables [1]
    • {best practice} partition on an incrementing date key [1]
      • assures that the most recent data are in the last partition and the earliest data are in the first partition [1]
    • {best practice} design the partition scheme for ETL and manageability [1]
      • the query performance gains realized by partitioning are small compared to the manageability and data load performance benefits [1]
        • ideally partitions should reflect the ETL load frequency
          • because this simplifies the load process [1]
          • merge partitions periodically to reduce the overall number of partitions [1]
            • e.g. at the start of each year, the monthly partitions for the previous year could be merged into a single partition for the whole year [1]
    • {best practice} maintain an empty partition at the start and end of the table [1]
      • simplifies the loading of new rows [1]
      • when new records need to be added, split the empty partition 
        • ⇐ to create two empty partitions
      • then switch the staged data with the first empty partition [1]
        • ⇐ loads the data into the table and leaves the second empty partition you created at the end of the table, ready for the next load [1]
      • a similar technique can be used to archive or delete obsolete data at the beginning of the table [1]
    • {best practice} choose the proper granularity
      • it should be aligned to the business requirements [2]
    • {best practice} create at least one filegroup in addition to the primary one
      • set it as the default filegroup
        • data tables are thus separated from system tables [2]
      • creating dedicated filegroups for extremely large fact tables [2]
        • place the fact tables on their own logical disks [2]
      • use a file and a filegroup separate from the fact and dimension tables [2]
        • {exception} staging tables that will be switched with partitions to perform fast loads [2]
          • staging tables must be created on the same filegroup as the partition with which they will be switched [2]
    • {def} partition scheme 
      • a scheme that maps partitions to filegroups
    • {def} partition function 
      • object that maps rows to partitions by using values from specific columns (aka  partitioning columns)
      • performs logical mapping
    • {def} aligned index 
      • index built on the same partition scheme as its base table [4]
        • if all indexes are aligned with their base table, switching a partition is a metadata operation only [4]
          • ⇒ it’s very fast [4]
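
    A minimal T-SQL sketch that ties the above concepts together (partition function, partition scheme, a table created on the scheme, and a partition switch); the object names, boundary values and filegroup mapping are illustrative only:

    -- partition function: logical mapping of date values to partitions (RANGE RIGHT)
    CREATE PARTITION FUNCTION PF_OrderDate (date)
    AS RANGE RIGHT FOR VALUES ('2025-01-01', '2025-02-01', '2025-03-01');

    -- partition scheme: maps the partitions to filegroups (here all to PRIMARY)
    CREATE PARTITION SCHEME PS_OrderDate
    AS PARTITION PF_OrderDate ALL TO ([PRIMARY]);

    -- fact table created on the partition scheme, partitioned by the date key
    CREATE TABLE dbo.FactSales (
      SalesID bigint NOT NULL
    , OrderDate date NOT NULL
    , Amount money NOT NULL
    ) ON PS_OrderDate (OrderDate);

    -- switching a loaded, aligned staging table into an empty partition (metadata-only operation)
    -- ALTER TABLE dbo.FactSales_Staging SWITCH TO dbo.FactSales PARTITION 2;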

    References:
    [1] 20467A - Designing Business Intelligence Solutions with Microsoft SQL Server 2012
    [2] 20463C - Implementing a Data Warehouse with Microsoft SQL Server
    [3] 10777A - Implementing a Data Warehouse with Microsoft SQL Server 2012
    [4] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)
    [5] Microsoft Learn (2009) How to Implement an Automatic Sliding Window in a Partitioned Table on SQL Server 2005 [link]

    🏭💠🗒️Microsoft Fabric: SQL Analytics Endpoint [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 12-Feb-2025

    [Microsoft Fabric] SQL Analytics Endpoint

    • {def} a service that listens actively for requests providing a SQL-based experience for lakehouse Delta tables [1]
      • enables to query data in the lakehouse using T-SQL language and TDS protocol [1]
      • created by default for each lakehouse and mirrored database provisioned in a workspace
        • every lakehouse has one SQL analytics endpoint [1]
      • manages the automatically generated tables so the workspace users can't modify them [1]
    • {feature} a background process is responsible for 
      • scanning lakehouse for changes [1]
        • automatically generates a table in the appropriate schema for every Delta table in the Lakehouse
          • tables are created with a minor delay [1]
          • data is refreshed automatically
            • the amount of time it takes to refresh the table is related to how optimized the Delta tables are [1]
          • {alternative} one can manually force a refresh of the automatic metadata scanning in the Fabric portal [1]
      • keeping SQL analytics endpoint up-to-date for all the changes committed to lakehouses in a workspace [1]
        • {characteristic} transparently managed
          • ⇐ the sync process is transparently managed by Microsoft Fabric platform [1]
          • when a change is detected in the lakehouse
            • a background process updates metadata [1]
            • the SQL analytics endpoint reflects the changes committed to lakehouse tables [1]
        • {characteristic} low latency
          • under normal operating conditions, the lag between a lakehouse and SQL analytics endpoint is less than one minute [1]
          • the actual length of time can vary from a few seconds to minutes [1]
        • the metadata synchronization is automatically triggered when the SQL Endpoint is opened [3]
          • ensures the SQL Analytics Endpoint remains up to date without the need for manual syncs [3]
            •  ⇐ provides a seamless experience [3]
    • {feature} can be enriched by adding database objects (see the sketch after this list)
      • schemas
      • views
      • procedures
      • other database objects
    • {feature} automatic metadata discovery
      • tracks changes committed to lakehouses [1]
        • is a single instance per Fabric workspace [1]
        • {issue} increased latency for changes to sync between lakehouses and the SQL analytics endpoint can be caused by a large number of lakehouses in one workspace [1]
          • {resolution} migrate each lakehouse to a separate workspace [1]
            • allows automatic metadata discovery to scale [1]
        • {issue} changes committed to a lakehouse are not visible in the associated SQL analytics endpoint
          • e.g. a table newly created in the lakehouse may not be visible in the endpoint [1]
          • many committed rows may not be visible in the endpoint [1]
          • ⇐ may be impacted by ETL processing that generates large volumes of changes
          • {recommendation} initiate an on-demand metadata sync, triggered from the SQL query editor Refresh ribbon option [1]
            • forces an on-demand metadata sync, rather than waiting on the background metadata sync to finish [1]
        • {issue} if there's no maintenance scheduled for the parquet files, this can result in read overhead and impacts the time it takes to sync changes to the SQL analytics endpoint [1]
          • {recommendation} schedule regular lakehouse table maintenance operations [1]
        • {limitation} not all Delta features are understood by the automatic sync process [1]
    • {feature} endpoint reprovisioning
      • if the initial provisioning attempt fails, users have the option to try again [2]
        • ⇐ without the need to create an entirely new lakehouse [2]
        • {benefit} empowers users to self-mitigate provisioning issues in convenient way in the UI avoiding the need for complete lakehouse re-creation [2]
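
    A minimal sketch of enriching the SQL analytics endpoint with a T-SQL view on top of an auto-generated lakehouse table; the schema, table and column names are illustrative:

    -- view created in the SQL analytics endpoint over an auto-generated Delta table
    CREATE VIEW dbo.vSalesByYear
    AS
    SELECT Year(OrderDate) OrderYear
    , Sum(Amount) TotalAmount
    FROM dbo.SalesOrders
    GROUP BY Year(OrderDate);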

    References:
    [1] Microsoft Learn (2024) Microsoft Fabric: SQL analytics endpoint performance considerations [link]
    [2] Microsoft Learn (2024) Microsoft Fabric: What is the SQL analytics endpoint for a lakehouse? [link]
    [3] Microsoft Learn (2024) Microsoft Fabric: What’s new in the Fabric SQL Analytics Endpoint? [link]

    Resources:
    [R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

    06 January 2025

    💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part VII: Things That Don't Work) [new feature]

    Microsoft does a relatively good job in documenting what doesn't work in Microsoft Fabric's SQL databases. There's already a good overview available in the documentation, though beyond this the current post lists my findings while testing code previously written on this blog.

    USE Database

    The standard syntax allows changing the database context via USE to the specified database or database snapshot. Unfortunately, this syntax doesn't seem to be supported currently, and many scripts rely heavily on it. Thus, the following lines of code throw an error:

    -- changing the context
    USE master;
    GO
    USE tempdb;
    

    "Msg 40508, Level 16, State 1, Line 1, USE statement is not supported to switch between databases. Use a new connection to connect to a different database"

    However, one can use the 3-part naming convention to reference the various objects:

    -- sys metadata - retrieving the database files
    
    SELECT *
    FROM tempdb.sys.database_files dbf
    ORDER BY name;

    Even if the tempdb is not listed in the sys.databases table, it's still available for querying, which can prove helpful for troubleshooting. 

    DBCC commands 

    The documentation warns that some DBCC commands won't work, though in some cases there are also alternatives. For example:

    -- clearing the procedure cache via DBCC
    DBCC FREEPROCCACHE;
    
    Output:
    "Msg 2571, Level 14, State 9, Line 1, User '<user>' does not have permission to run DBCC freeproccache."

    Alternatively, one can use the following command, which seems to work:

    -- clearing the procedure cache via ALTER
    ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE;

    CHECKDB, which checks the logical and physical integrity of all the objects in the specified database, can't be used either:
     
    -- Checking the logical and physical integrity of a database
    DBCC CHECKDB();
    
    Output:
    "Msg 916, Level 14, State 2, Line 1, The server principal "..." is not able to access the database "..." under the current security context."
    The same error message is received for CHECKTABLE, a utility that checks the integrity of all the pages and structures that make up a table (or indexed view):

    -- checking a table's integrity
    DBCC CHECKTABLE ('SalesLT.Address');
    
    Output:
    "Msg 916, Level 14, State 2, Line 2, The server principal "..." is not able to access the database "..." under the current security context."
    A similar error message is received for SQLPERF, which provides transaction log space usage statistics for all databases:

    -- retrieving the LOGSPACE information for all databases
    DBCC SQLPERF (LOGSPACE);
    
    Output: 
    "Msg 297, Level 16, State 10, Line 1, The user does not have permission to perform this action."
    There are however DBCC commands like SHOW_STATISTICS or SHRINKDATABASE which do work. 
     
    -- current query optimization statistics
    DBCC SHOW_STATISTICS('SalesLT.Address','PK_Address_AddressID');
    
    Output:
    Name Updated Rows Rows Sampled Steps Density Average key length String Index Filter Expression Unfiltered Rows Persisted Sample Percent
    PK_Address_AddressID Dec 21 2024 3:02AM 450 450 197 1 4 NO 450 0

    SHRINKDATABASE shrinks the size of the data and log files in the specified database:

    -- shrinking database
    DBCC SHRINKDATABASE([AdventureWorks01-...]) WITH NO_INFOMSGS;
    

    Update 29-Jan-2025: According to an answer from Ask the Expert session on Fabric Database [3], Microsoft seems to be working in bringing more DBCC features to SQL databases.

    Happy coding!


    References:
    [1] Microsoft Learn (2024) SQL Server: USE <database> [link]
    [2] Microsoft Learn (2024) Database console commands [link]
    [3] Microsoft Reactor (2025) Ask The Expert - Fabric Edition - Fabric Databases [link]

    💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part VI: Index Usage Analysis) [new feature]

    There are several system dynamic management views (DMVs) available in SQL Server, Azure SQL Database and now in SQL databases that allow gathering more information about indexes' fragmentation and usage. Let's look at the most important information available based on the indexes created in the previous posts. As the data were probably purged from the views, the SELECT queries based on SalesLT.Product from the previous post need to be run first. This step is important, otherwise the DMVs might return no records!

    One starting point is to use the sys.dm_db_index_physical_stats DMV to look at the indexes' size and fragmentation information for a given table (or view). This DMV is usually used as a starting point for analyzing indexes' fragmentation and then defragmenting the indexes with high fragmentation.

    -- sys metadata - index & data size and fragmentation information for the data and indexes of the specified table or view
    SELECT --db_name() db_name
    --, object_name(IND.object_id) table_name
     IND.name index_name
    , IND.type_desc
    , IPS.page_count
    , IPS.record_count
    , IPS.index_level
    , Cast(IPS.avg_fragmentation_in_percent as decimal(10,2)) avg_fragmentation_perc
    , Cast(IPS.avg_page_space_used_in_percent as decimal(10,2)) space_used_perc
    --, IPS.*
    FROM sys.indexes IND
         CROSS APPLY sys.dm_db_index_physical_stats(DB_ID(), IND.object_id, IND.index_id, NULL, 'DETAILED') IPS
    WHERE IND.object_id = OBJECT_ID(N'SalesLT.Product');
    
    Output:
    index_name type_desc page_count record_count index_level avg_fragmentation_perc space_used_perc
    PK_Product_ProductID CLUSTERED 101 295 0 0.99 87.90
    PK_Product_ProductID CLUSTERED 1 101 1 0.00 16.20
    AK_Product_rowguid NONCLUSTERED 2 295 0 50.00 74.69
    AK_Product_rowguid NONCLUSTERED 1 2 1 0.00 0.59
    AK_Product_ProductNumber NONCLUSTERED 2 295 0 50.00 85.79
    AK_Product_ProductNumber NONCLUSTERED 1 2 1 0.00 0.49
    AK_Product_Name NONCLUSTERED 3 295 0 33.33 87.32
    AK_Product_Name NONCLUSTERED 1 3 1 0.00 1.67
    IX_SalesLT_Product_Color NONCLUSTERED 1 295 0 0.00 79.24
    IX_SalesLT_Product_Color_Size NONCLUSTERED 1 295 0 0.00 94.12
    IX_SalesLT_Product_ListPrice_IC NONCLUSTERED 4 295 0 0.00 86.60
    IX_SalesLT_Product_ListPrice_IC NONCLUSTERED 1 4 1 0.00 1.01

    In a second step one can look at the sys.dm_db_index_usage_stats DMV which provides the counts of the different types of index operations and the time each type of operation was last performed:

    -- sys metadata - counts of different types of index operations and the time each type of operation was last performed.
    SELECT -- db_name() db_name
    --, object_name(IND.object_id) table_name
     IND.name
    , IND.type_desc
    , IUS.user_seeks 
    , IUS.user_scans
    , IUS.user_lookups 
    , IUS.user_updates
    , IUS.last_user_seek
    , IUS.last_user_scan 
    , IUS.last_user_lookup
    , IUS.last_user_update
    FROM sys.dm_db_index_usage_stats IUS
         JOIN sys.indexes IND
           ON IUS.index_id = IND.index_id
    WHERE IND.object_id = OBJECT_ID(N'SalesLT.Product');
    
    Output:
    name type_desc user_seeks user_scans user_lookups user_updates last_user_seek last_user_scan last_user_lookup
    PK_Product_ProductID CLUSTERED 0 10 15 0 2025-01-06T14:23:54 2025-01-06T14:23:54
    IX_SalesLT_Product_Color_Size NONCLUSTERED 11 0 0 0 2025-01-06T14:23:54
    IX_SalesLT_Product_ListPrice_IC NONCLUSTERED 8 0 0 0 2025-01-06T13:38:03

    Finally, it might also be useful to look at the sys.dm_db_index_operational_stats DMV, which returns the current lower-level I/O, locking, latching, and access method activity for each partition of a table or index in the database (see the documentation for the full list of attributes):

    -- sys metadata - index operations stats
    SELECT -- db_name() db_name
    --, object_name(IND.object_id) table_name
     IND.name index_name
    , IND.type_desc
    , IOS.range_scan_count
    , IOS.singleton_lookup_count
    , IOS.leaf_insert_count
    , IOS.leaf_delete_count
    , IOS.leaf_update_count
    , IOS.nonleaf_insert_count
    , IOS.nonleaf_delete_count
    , IOS.nonleaf_update_count
    FROM sys.indexes IND
         CROSS APPLY sys.dm_db_index_operational_stats(DB_ID(), IND.object_id, IND.index_id, NULL) IOS
    WHERE IND.object_id = OBJECT_ID(N'SalesLT.Product')
     AND IOS.range_scan_count<>0
    ORDER BY IND.name;
    
    Output:
    index_name type_desc range_scan_count singleton_lookup_count leaf_insert_count leaf_delete_count leaf_update_count nonleaf_insert_count nonleaf_delete_count nonleaf_update_count
    IX_SalesLT_Product_Color_Size NONCLUSTERED 11 0 0 0 0 0 0 0
    IX_SalesLT_Product_ListPrice_IC NONCLUSTERED 8 0 0 0 0 0 0 0
    PK_Product_ProductID CLUSTERED 10 64 0 0 0 0 0 0

    For more information on these DMVs check the documentation.

    Happy coding!


    References:
    [1] Microsoft Learn (2024) SQL Server: sys.dm_db_index_physical_stats [link]
    [2] Microsoft Learn (2024) SQL Server: sys.dm_db_index_usage_stats [link]
    [3] Microsoft Learn (2024) SQL Server: sys.dm_db_index_operational_stats [link]

    04 January 2025

    💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part V: Manual Index Maintenance) [new feature]

    Index maintenance in Microsoft Fabric's SQL databases is supposed to happen automatically in the background via the automatic tuning options feature, though the whole functionality is still in its early phases, and therefore many questions regarding the process may arise. Probably the most important question is whether indexes can still be created and maintained manually. That's useful for temporary or even periodic workloads, for which organizations might still want to manage indexes themselves.

    The tests made below are based on the SalesLT.Product table from the AdventureWorks database available in Microsoft Fabric. The target was to create several indexes that could be used for various testing purposes. Each set of the scripts below was run 5-10 times until records appeared in the sys.dm_db_missing_index_details DMV for each test case (see further below):

    -- batch 1: filter on single column (to be run 5-10 times)
    SELECT *
    FROM SalesLT.Product 
    WHERE Color = 'Red'
    
    SELECT *
    FROM SalesLT.Product 
    WHERE Color = 'Black'
    
    SELECT *
    FROM SalesLT.Product 
    WHERE Color = 'White'
    
    -- batch 2: filter on two columns (to be run 5-10 times)
    SELECT *
    FROM SalesLT.Product 
    WHERE Color = 'Red'
      AND Size = '58'
    
    SELECT *
    FROM SalesLT.Product 
    WHERE Color = 'Black'
      AND Size = '58'
    
    SELECT *
    FROM SalesLT.Product 
    WHERE Color = 'White'
         AND Size = '58'
    
    -- batch 3: filter with column selection (to be run 5-10 times)
    SELECT ProductNumber, Name, Color, ListPrice
    FROM SalesLT.Product 
    WHERE ListPrice BETWEEN 50 AND 55
    
    SELECT ProductNumber, Name, Color, ListPrice
    FROM SalesLT.Product 
    WHERE ListPrice BETWEEN 100 and 105
    

    Once the scripts run, one can look at the records created in the above considered dynamic management view:

    -- sys metadata -  missing indexes
    SELECT MID.statement AS table_name
    , MID.equality_columns
    , MID.inequality_columns
    , MID.included_columns
    --, MIG.index_group_handle
    --, MIG.index_handle
    FROM sys.dm_db_missing_index_details MID 
        JOIN sys.dm_db_missing_index_groups MIG 
         ON MID.index_handle =  MIG.index_handle
    ORDER BY MIG.index_group_handle
    , MIG.index_handle
    
    Output:
    table_name equality_columns inequality_columns included_columns
    [AdventureWorks01-...].[SalesLT].[Product] [Color]
    [AdventureWorks01-...].[SalesLT].[Product] [Color], [Size]
    [AdventureWorks01-...].[SalesLT].[Product] [ListPrice] [Name], [ProductNumber], [Color]

    The next step is to create one of the indexes (please note that the database's name must be replaced accordingly, or only the 2-part naming convention of schema and table name should be used):

    -- create index on Color
    CREATE INDEX IX_SalesLT_Product_Color 
    ON [AdventureWorks01-...].[SalesLT].[Product] (Color);
    

    Once the script has run, all the records related to SalesLT.Product disappear from the dynamic management view. Therefore, it might be a good idea to take a snapshot of the view's data before creating any indexes manually. The same behavior should probably be expected when the indexes are created by the system.

    -- create index on Color & Size
    CREATE INDEX IX_SalesLT_Product_Color_Size
    ON [SalesLT].[Product] (Color, Size);
    
    -- create index on ListPrice with included columns
    CREATE INDEX IX_SalesLT_Product_ListPrice_IC
    ON [SalesLT].[Product] (ListPrice) INCLUDE(ProductNumber, Name, Color);
    

    One can use the following query based on the meta.vIndexes (created in a previous post) to look at the indexes created:

    -- sys metadata - index columns
    SELECT IND.db_name
    , IND.schema_name
    , IND.table_name
    , IND.index_name
    , IND.index_type
    , IND.principal_type
    , IND.auto_created
    FROM meta.vIndexes IND
    WHERE IND.schema_name = 'SalesLT'
      AND IND.table_name = 'Product'
      AND IND.index_name IN ('IX_SalesLT_Product_Color ','IX_SalesLT_Product_Color_Size'
    ,'IX_SalesLT_Product_ListPrice_IC')
    ORDER BY IND.table_name
    , IND.index_name
    
    Output:
    db_name schema_name table_name index_name index_type principal_type auto_created
    AdventureWorks01-... SalesLT Product IX_SalesLT_Product_Color NONCLUSTERED S False
    AdventureWorks01-... SalesLT Product IX_SalesLT_Product_Color_Size NONCLUSTERED S False
    AdventureWorks01-... SalesLT Product IX_SalesLT_Product_ListPrice_IC NONCLUSTERED S False

    Following this model, further indexes can be created as needed. It's always a good idea to keep a "copy" of the indexes created (or a history of the scripts run for index maintenance). This best practice is now even more important, as the system can drop indexes as it sees fit.

    Don't forget to clean up the changes made if the indexes aren't needed anymore:

    -- cleaning after
    DROP INDEX IF EXISTS SalesLT.IX_SalesLT_Product_Color;
    DROP INDEX IF EXISTS SalesLT.IX_SalesLT_Product_Color_Size;
    DROP INDEX IF EXISTS SalesLT.IX_SalesLT_Product_ListPrice_IC;
    

    So, after these tests, the standard syntax for index maintenance seems to work on SQL databases as well, with all the implications deriving from this (e.g. porting of scripts, database objects, etc.).

    Happy coding!

