
26 March 2025

💠🏭🗒️Microsoft Fabric: Polaris SQL Pool [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!

Unfortunately, besides the referenced papers, there's almost no other material that could be used to deepen the understanding of the concepts presented.

Last updated: 26-Mar-2025

Read and Write Operations in Polaris [2]

[Microsoft Fabric] Polaris SQL Pool

  • {def} distributed SQL query engine that powers Microsoft Fabric's data warehousing capabilities
    • designed to unify data warehousing and big data workloads while separating compute and state for seamless cloud-native operations
    • based on a robust DCP 
      • designed to execute read-only queries in a scalable, dynamic and fault-tolerant way [1]
      • a highly-available micro-service architecture with well-defined responsibilities [2]
        • data and query processing is packaged into units (aka tasks) 
          • can be readily moved across compute nodes and re-started at the task level
        • widely-partitioned data with a flexible distribution model [2]
        • a task-level "workflow-DAG" that is novel in spanning multiple queries [2]
        • a framework for fine-grained monitoring and flexible scheduling of tasks [2]
  • {component} SQL Server Front End (SQL-FE)
    • responsible for 
      • compilation
      • authorization
      • authentication
      • metadata
        • used by the compiler to 
          • {operation} generate the search space (aka MEMO) for incoming queries
          • {operation} bind metadata to data cells
          • leveraged to ensure the durability of the transaction manifests at commit [2]
            • only transactions that successfully commit need to be actively tracked to ensure consistency [2]
            • any manifests and data associated with aborted transactions are systematically garbage-collected from OneLake through specialized system tasks [2]
  • {component} SQL Server Backend (SQL-BE)
    • used to perform write operations on the LST [2]
      • inserting data into a LST creates a set of Parquet files that are then recorded in the transaction manifest [2]
      • a transaction is represented by a single manifest file that is modified concurrently by (one or more) SQL BEs [2]
        • SQL BE leverages the Block Blob API provided by ADLS to coordinate the concurrent writes  [2]
        • each SQL BE instance serializes the information about the actions it performed, either adding a Parquet file or removing it [2]
          • the serialized information is then uploaded as a block to the manifest file
          • uploading the block does not yet make any visible changes to the file [2]
            • each block is identified by a unique ID generated on the writing SQL BE [2]
        • after completion, each SQL BE returns the ID of the block(s) it wrote to the Polaris DCP [2]
          • the block IDs are then aggregated by the Polaris DCP and returned to the SQL FE as the result of the query [2]
      • the SQL FE further aggregates the block IDs and issues a Commit Block operation against storage with the aggregated block IDs [2]
        • at this point, the changes to the file on storage will become effective [2]
      • changes to the manifest file are not visible until the Commit operation on the SQL FE
        • the Polaris DCP can freely restart any part of the operation in case there is a failure in the node topology [2]
      • the IDs of any blocks written by previous attempts are not included in the final list of block IDs and are discarded by storage [2]
    • [read operations] SQL BE is responsible for reconstructing the table snapshot based on the set of manifest files managed in the SQL FE
      • the result is the set of Parquet data files and deletion vectors that represent the snapshot of the table [2]
        • queries over these are processed by the SQL Server query execution engine [2]
        • the reconstructed state is cached in memory and organized in such a way that the table state can be efficiently reconstructed as of any point in time [2]
          • enables the cache to be used by different operations operating on different snapshots of the table [2]
          • enables the cache to be incrementally updated as new transactions commit [2]
  • {feature} supports explicit user transactions
    • can execute multiple statements within the same transaction in a consistent way
      • the manifest file associated with the current transaction captures all the (reconciled) changes performed by the transaction [2]
        • changes performed by prior statements in the current transaction need to be visible to any subsequent statement inside the transaction (but not outside of the transaction) [2]
    • [multi-statement transactions] in addition to the committed set of manifest files, the SQL BE reads the manifest file of the current transaction and then overlays these changes on the committed manifests [1]
    • {write operations} the behavior of the SQL BE depends on the type of the operation.
      • insert operations 
        • only add new data and have no dependency on previous changes [2]
        • the SQL BE can serialize the metadata blocks holding information about the newly created data files just like before [2]
        • the SQL FE, instead of committing only the IDs of the blocks written by the current operation, will instead append them to the list of previously committed blocks
          • ⇐ effectively appends the data to the manifest file [2]
    • {update|delete operations} 
      • handled differently 
        • ⇐ since they can potentially further modify data already modified by a prior statement in the same transaction [2]
          • e.g. an update operation can be followed by another update operation touching the same rows
        • the final transaction manifest should not contain any information about the parts from the first update that were made obsolete by the second update [2]
      • SQL BE leverages the partition assignment from the Polaris DCP to perform a distributed rewrite of the transaction manifest to reconcile the actions of the current operation with the actions recorded by the previous operation [2]
        • the resulting block IDs are sent again to the SQL FE where the manifest file is committed using the (rewritten) block IDs [2]
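    From the user's perspective, such explicit transactions surface as ordinary T-SQL. A minimal sketch of a multi-statement transaction against a Fabric warehouse (hypothetical table and values, not taken from the papers):

    -- explicit multi-statement transaction (hypothetical example)
    BEGIN TRANSACTION;

    INSERT INTO dbo.Orders (OrderId, Amount)
    VALUES (1001, 250.00);

    -- the insert above is visible to subsequent statements within the same transaction, but not outside it until commit
    UPDATE dbo.Orders
    SET Amount = 275.00
    WHERE OrderId = 1001;

    COMMIT TRANSACTION;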
  • {concept} Distributed Query Processor (DQP)
    • responsible for 
      • distributed query optimization
      • distributed query execution
      • query execution topology management
  • {concept} Workload Management (WLM)
    •  consists of a set of compute servers that are, simply, an abstraction of a host provided by the compute fabric, each with a dedicated set of resources (disk, CPU and memory) [2]
      • each compute server runs two micro-services
        • {service} Execution Service (ES) 
          • responsible for tracking the life span of tasks assigned to a compute container by the DQP [2]
        • {service} SQL Server instance
          • used as the back-bone for execution of the template query for a given task  [2]
            • ⇐ holds a cache on top of local SSDs 
              • in addition to in-memory caching of hot data
            • data can be transferred from one compute server to another
              • via dedicated data channels
              • the data channel is also used by the compute servers to send results to the SQL FE that returns the results to the user [2]
              • the life cycle of a query is tracked via control flow channels from the SQL FE to the DQP, and the DQP to the ES [2]
  • {concept} cell data abstraction
    • the key building block that enables abstracting data stores
      • abstracts DQP from the underlying store [1]
      • any dataset can be mapped to a collection of cells [1]
      • allows distributing query processing over data in diverse formats [1]
      • tailored for vectorized processing when the data is stored in columnar formats [1] 
      • further improves relational query performance
    • 2-dimensional
      • distributions (data alignment)
      • partitions (data pruning)
    • each cell is self-contained with its own statistics [1]
      • used for both global and local QO [1]
      • cells can be grouped physically in storage [1]
      • queries can selectively reference either cell dimension or even individual cells depending on predicates and type of operations present in the query [1]
    • {concept} distributed query processing (DQP) framework
      • operates at the cell level 
      • agnostic to the details of the data within a cell
        • data extraction from a cell is the responsibility of the (single node) query execution engine, which is primarily SQL Server, and is extensible for new data types [1], [2]
  • {concept} dataset
    • logically abstracted as a collection of cells [1] 
    • can be arbitrarily assigned to compute nodes to achieve parallelism [1]
    • uniformly distributed across a large number of cells 
      • [scale-out processing] each dataset must be distributed across thousands of buckets or subsets of data objects, such that they can be processed in parallel across nodes
  • {concept} session
    • supports a spectrum of consumption models, ranging from serverless ad-hoc queries to long-standing pools or clusters [1]
    • all data are accessible from any session [1]
      • multiple sessions can access all underlying data concurrently  [1]
  • {concept} Physical Metadata layer
    • new layer introduced in the SQL Server storage engine [2]
See also: Polaris

References:
[1] Josep Aguilar-Saborit et al (2020) POLARIS: The Distributed SQL Engine in Azure Synapse, Proceedings of the VLDB Endowment PVLDB 13(12) [link]
[2] Josep Aguilar-Saborit et al (2024), Extending Polaris to Support Transactions [link]
[3] Gjnana P Duvvuri (2024) Microsoft Fabric Warehouse Deep Dive into Polaris Analytic Engine [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
[R2] Patrick Pichler (2023) Data Warehouse (Polaris) vs. Data Lakehouse (Spark) in Microsoft Fabric [link]
[R3] Tiago Balabuch (2023) Microsoft Fabric Data Warehouse - The Polaris engine [link]

Acronyms:
CPU - Central Processing Unit
DAG - Directed Acyclic Graph
DB - Database
DCP - Distributed Computation Platform 
DQP - Distributed Query Processing 
DWH - Data Warehouses 
ES - Execution Service
LST - Log-Structured Table
SQL BE - SQL Backend
SQL FE - SQL Frontend
SSD - Solid State Disk
WAL - Write-Ahead Log
WLM - Workload Management

25 March 2025

🏭🗒️Microsoft Fabric: Security in Warehouse [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 25-Mar-2025

[Microsoft Fabric] Security in Warehouse
  • {def} suite of technologies aimed at safeguarding sensitive information in Fabric [1]
    • leverages SQL engine’s security features [1]
      • allows for security mechanism at the warehouse level [1]
      • ⇐ the warehouse and SQL analytics endpoint items also allow for the defining of native SQL security [4]
        • the permissions configured only apply to the queries executed against the respective surfaces [4]
      • the access to OneLake data is controlled separately through OneLake data access roles [4]
        • {recommendation} to ensure users with SQL specific permissions don't see data they don't have SQL access to, don't include those users in a OneLake data access role [4]
    • supports a range of data protection features that enable administrators to shield sensitive data from unauthorized access [1]
      • ⇐ across warehouses and SQL analytics endpoints without necessitating changes to applications [1]
    • {type} object-level security (OLS)
      • permissions governing DML operations [1]
        • applies to tables and views
        • ⇐ when denied, the user will be prevented from performing the respective operation
        • SELECT
          • allows users to view the data within the object [1]
        • INSERT
          • allows users to insert data in the object [1]
        • UPDATE
          • allows users to update data within the object [1]
        • DELETE
          • allows users to delete the data within the object [1]
      • permissions can be granted, revoked or denied on database objects [1]
        •  tables and views
        • GRANT
          • permission is granted to a user or role [1]
        • DENY
          • permission is denied to a user or role [1]
        • REVOKE
          • permission is revoked from a user or role [1]
        • ALTER
          • grants the user the ability to change the definition of the object [1]
        • CONTROL
          • grants the user all rights to the object [1]
      • {principle} least privilege
        • users and applications should only be given the permissions needed in order for them to complete the task
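      A minimal T-SQL sketch of the object-level permissions above, with hypothetical principal and table names:

      -- object-level security (hypothetical names)
      GRANT SELECT, INSERT ON dbo.SalesOrders TO [joe@contoso.com];
      DENY DELETE ON dbo.SalesOrders TO [joe@contoso.com];
      REVOKE INSERT ON dbo.SalesOrders FROM [joe@contoso.com];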
    • {type} column-level security (CLS)
      • allows to restrict column access to sensitive data [1]
        • provides granular control over who can access specific pieces of data [1]
          •  enhances the overall security of the data warehouse [1]
      • steps
        • identify the sensitive columns [1]
        • define access roles [1]
        • assign roles to users [1]
        • implement access control [1]
          • restrict access to a column based on the user's role [1]
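      A minimal column-level security sketch following the steps above (hypothetical names); columns not listed in the grant remain inaccessible to the grantee:

      -- column-level security (hypothetical names)
      GRANT SELECT ON dbo.Customers (CustomerId, CustomerName, Country) TO [joe@contoso.com];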
    • {type} row-level security (RLS)
      • provides granular control over access to rows in a table based on group membership or execution context [1]
        • using WHERE clause filters [1]
      • works by associating a function (aka security predicate) with a table [1]
        • defined to return true or false based on certain conditions [1]
          • ⇐ typically involving the values of one or more columns in the table [1]
          • when a user attempts to access data in the table, the security predicate function is invoked [1]
            • if the function returns true, the row is accessible to the user; otherwise, the row doesn't show up in the query results [1]
        • the predicate can be as simple/complex as required [1]
        • the process is transparent to the user and is enforced automatically by SQL Server
          • ⇐ ensures consistent application of security rules [1]
      • implemented in two main steps:
        • filter predicates 
          • an inline table-valued function that filters the results based on the predicate defined [1]
        • security policy
          • invokes an inline table-valued function to protect access to the rows in a table [1]
            • because access control is configured and applied at the warehouse level, application changes are minimal - if any [1]
            • users can directly have access to the tables and can query their own data [1]
      • {recommendation} create a separate schema for predicate functions, and security policies [1]
      • {recommendation} avoid type conversions in predicate functions [1]
      • {recommendation} to maximize performance, avoid using excessive table joins and recursion in predicate functions [1]
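      A minimal RLS sketch following the two steps above, with hypothetical schema, function, and table names:

      -- 1) filter predicate as an inline table-valued function, kept in a dedicated schema
      CREATE SCHEMA Security;
      GO
      CREATE FUNCTION Security.fn_SalesFilter(@SalesRep AS varchar(128))
      RETURNS TABLE
      WITH SCHEMABINDING
      AS
      RETURN SELECT 1 AS fn_result
             WHERE @SalesRep = USER_NAME();
      GO
      -- 2) security policy binding the filter predicate to the table
      CREATE SECURITY POLICY Security.SalesFilterPolicy
      ADD FILTER PREDICATE Security.fn_SalesFilter(SalesRep)
      ON dbo.SalesOrders
      WITH (STATE = ON);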
    • {type} dynamic data masking (DDM) 
      • limits data exposure to nonprivileged users by obscuring sensitive data [1]
        • e.g. email addresses 
      • {benefit} enhance the security and manageability of the data [1]
      • {feature} real-time masking
        • when querying sensitive data, DDM applies dynamic masking to it in real time [1]
          • the actual data is never exposed to unauthorized users, thus enhancing the security of your data [1]
        • straightforward to implement [1]
        • doesn’t require complex coding, making it accessible for users of all skill levels [1]
        • {benefit} the data in the database isn’t changed when DDM is applied
          •   the actual data remains intact and secure, while nonprivileged users only see a masked version of the data [1]
      • {operation} define masking rule
        • set up at column level [1]
        • offers a suite of features [1]
          • comprehensive and partial masking capabilities [1]
          • supports several masking types
            • help prevent unauthorized viewing of sensitive data [1]
              • by enabling administrators to specify how much sensitive data to reveal [1]
                •   minimal effect on the application layer [1]
            • applied to query results, so the data in the database isn't changed 
              •   allows many applications to mask sensitive data without modifying existing queries  [1]
          • random masking function designed for numeric data [1]
        • {risk} unprivileged users with query permissions can infer the actual data since the data isn’t physically obfuscated [1]
      • {recommendation} DDM should be used as part of a comprehensive data security strategy [1]
        • should include
          • the proper management of object-level security with SQL granular permissions [1]
          • adherence to the principle of minimal required permissions [1]
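      A minimal DDM sketch with masking rules defined at column level (hypothetical table and principal; masking functions as provided by the SQL engine):

      -- dynamic data masking (hypothetical names)
      CREATE TABLE dbo.Customers (
        CustomerId int NOT NULL
      , CustomerName varchar(100)
      , Email varchar(256) MASKED WITH (FUNCTION = 'email()')
      , CreditLimit decimal(18,2) MASKED WITH (FUNCTION = 'random(1, 100)')
      );

      -- privileged users can be exempted from masking
      GRANT UNMASK TO [auditor@contoso.com];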
    • {concept} Dynamic SQL 
      • allows T-SQL statements to be generated within a stored procedure or a query itself [1]
        • executed via sp_executesql stored procedure
      • {risk} SQL injection attacks
        • use  QUOTENAME to sanitize inputs [1]
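      A minimal sketch of dynamic SQL with sanitized identifiers (hypothetical input):

      -- dynamic SQL with QUOTENAME (hypothetical names)
      DECLARE @table_name nvarchar(128) = N'SalesOrders';
      DECLARE @sql nvarchar(max) = N'SELECT count(*) FROM dbo.' + QUOTENAME(@table_name) + N';';

      EXEC sp_executesql @sql;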
  • write access to a warehouse or SQL analytics endpoint
    • {approach} granted through the Fabric workspace roles
      • the role automatically translates to a corresponding role in SQL that grants equivalent write access [4]
      • {recommendation} if a user needs write access to all warehouses and endpoints, assign the user to a workspace role [4]
        • use the Contributor role unless the user needs to assign other users to workspace roles [4]
      • {recommendation} grant direct access through SQL permissions if the user only needs to write to specific warehouses or endpoints [4]
    • {approach} grant read access to the SQL engine, and grant custom SQL permissions to write to some or all the data [4]
  • read access to a warehouse or SQL analytics endpoint
    • {approach} grant read access through the ReadData permission, granted as part of the Fabric workspace roles [4]
      •  ReadData permission maps the user to a SQL role that gives SELECT permissions on all tables in the warehouse or lakehouse
        • helpful if the user needs to see all or most of the data in the lakehouse or warehouse [4]
        • any SQL DENY permissions set on a particular lakehouse or warehouse still apply and limit access to tables [4]
        • row and column level security can be set on tables to restrict access at a granular level [4]
    • {approach} grant read access to the SQL engine, and grant custom SQL permissions to read to some or all the data [4]
    • if the user needs access only to a specific lakehouse or warehouse, the share feature provides access to only the shared item [4]
      • during the share, users can choose to give only Read permission or Read + ReadData 
        • granting Read permission allows the user to connect to the warehouse or SQL analytics endpoint but gives no table access [4]
        • granting users the ReadData permissions gives them full read access to all tables in the warehouse or SQL analytics endpoint
      • ⇐ additional SQL security can be configured to grant or deny access to specific tables [4]

    References:
    [1] Microsoft Learn (2024) Secure a Microsoft Fabric data warehouse [link]
    [2] Data Mozart (2025) Lock Up! Understanding Data Access Options in Microsoft Fabric, by Nikola Ilic [link]
    [3] Microsoft Learn (2024) Security in Microsoft Fabric [link]
    [4] Microsoft Learn (2024) Microsoft Fabric: How to secure a lakehouse for Data Warehousing teams [link]

    Resources:
    [R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
    [R2] Microsoft Learn (2025) Fabric: Security for data warehousing in Microsoft Fabric [link]
    [R3] Microsoft Learn (2025) Fabric: Share your data and manage permissions [link]

    Acronyms:
    CLS - Column-Level Security
    DDM - Dynamic Data Masking
    DML - Data Manipulation Language 
    MF - Microsoft Fabric
    OLS - Object-Level Security
    RLS - Row-Level Security
    SQL - Structured Query Language

    17 March 2025

    🏭🗒️Microsoft Fabric: Z-Order [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 17-Mar-2025

    [Microsoft Fabric] Z-Order
    • {def} technique to collocate related information in the same set of files [2]
      • ⇐ reorganizes the layout of each data file so that similar column values are strategically collocated near one another for maximum efficiency [1]
      • {benefit} efficient query performance
        • reduces the amount of data to read [2] for certain queries
          • when the data is appropriately ordered, more files can be skipped [3]
          • particularly important for the ordering of multiple columns [3]
      • {benefit} data skipping
        • automatically skips irrelevant data, further enhancing query speeds
          • via data-skipping algorithms [2]
      • {benefit} flexibility
        • can be applied to multiple columns, making it versatile for various data schemas
      • aims to produce evenly-balanced data files with respect to the number of tuples
        • ⇐ but not necessarily data size on disk [2]
          • ⇐ the two measures are most often correlated [2]
            • ⇐ but there can be situations when that is not the case, leading to skew in optimize task times [2]
      • via ZORDER BY clause 
        • applicable to columns with high cardinality commonly used in query predicates [2]
        • multiple columns can be specified as a comma-separated list
          • {warning} the effectiveness of the locality drops with each extra column [2]
            • has tradeoffs
              • it’s important to analyze query patterns and select the right columns when Z Ordering data [3]
          • {warning} using columns that do not have statistics collected on them is  ineffective and wastes resources [2] 
            • statistics collection can be configured on certain columns by reordering columns in the schema, or by increasing the number of columns to collect statistics on [2]
        • {characteristic} not idempotent
          • every time it is executed, it will try to create a new clustering of data in all files in a partition [2]
            • it includes new and existing files that were part of previous z-ordering [2]
        • compatible with v-order
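      A minimal sketch of applying Z-ordering during compaction via Spark SQL (e.g. in a Fabric notebook), with hypothetical table and column names:

      -- rewrite the table's files so that values of the listed columns are co-located
      OPTIMIZE sales_transactions
      ZORDER BY (CustomerId, TransactionDate);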
      • {concept} [Databricks] liquid clustering 
        • replaces table partitioning and ZORDER to simplify data layout decisions and optimize query performance [4] [6]
          • not compatible with the respective features [4] [6]
        • tables created with liquid clustering enabled have numerous Delta table features enabled at creation [4] [6]
        • provides flexibility to redefine clustering keys without rewriting existing data [4] [6]
          • ⇒ allows data layout to evolve alongside analytic needs over time [4] [6]
        • applies to 
          • streaming tables 
          • materialized views
        • {scenario} tables often filtered by high cardinality columns [4] [6]
        • {scenario} tables with significant skew in data distribution [4] [6]
        • {scenario} tables that grow quickly and require maintenance and tuning effort [4] [6]
        • {scenario} tables with concurrent write requirements [4] [6]
        • {scenario} tables with access patterns that change over time [4] [6]
        • {scenario} tables where a typical partition key could leave the table with too many or too few partitions [4] [6]

      References:
      [1] Bennie Haelen & Dan Davis (2024) Delta Lake Up & Running: Modern Data Lakehouse Architectures with Delta Lake
      [2] Delta Lake (2023) Optimizations [link]
      [3] Delta Lake (2023) Delta Lake Z Order, by Matthew Powers [link]
      [4] Delta Lake (2025) Use liquid clustering for Delta tables [link]
      [5] Databricks (2025) Delta Lake table format interoperability [link]
      [6] Microsoft Learn (2025) Use liquid clustering for Delta tables [link]

      Resources:
      [R1] Azure Guru (2024) Z Order in Delta Lake - Part 1 [link]
      [R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

      Acronyms:
      MF - Microsoft Fabric 

      🏭🗒️Microsoft Fabric: V-Order [Notes]

      Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

      Last updated: 17-Mar-2025

      [Microsoft Fabric] V-Order
      • {def} write time optimization to the parquet file format that enables fast reads under the MF compute engine [2]
        • all parquet engines can read the files as regular parquet files [2]
        • results in smaller and therefore faster files to read [5]
          • {benefit} improves read performance 
          • {benefit} decreases storage requirements
          • {benefit} optimizes resources' usage
            • reduces the compute resources required for reading data
              • e.g. network bandwidth, disk I/O, CPU usage
        • still conforms to the open-source Parquet file format [5]
          • they can be read by non-Fabric tools [5]
        • delta tables created and loaded by Fabric items automatically apply V-Order
          • e.g. data pipelines, dataflows, notebooks [5]
        • delta tables and their features are orthogonal to V-Order [2]
          •  e.g. Z-Order, compaction, vacuum, time travel
          • table properties and optimization commands can be used to control the v-order of the partitions [2]
        • compatible with Z-Order [2]
        • not all files have this optimization applied [5]
          • e.g. Parquet files uploaded to a Fabric lakehouse, or that are referenced by a shortcut 
          • the files can still be read, though the read performance likely won't be as fast as for an equivalent Parquet file that's had V-Order applied [5]
        • required by certain features
          • [hash encoding] to assign a numeric identifier to each unique value contained in the column [5]
        • {command} OPTIMIZE 
          • optimizes a Delta table to coalesce smaller files into larger ones [5]
          • can apply V-Order to compact and rewrite the Parquet files [5]
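      A minimal sketch of applying V-Order during compaction via Spark SQL (e.g. in a Fabric notebook), based on the syntax documented in [2], with a hypothetical table name:

      -- compact the Delta table and write the resulting Parquet files with V-Order
      OPTIMIZE sales_transactions VORDER;

      -- Z-Order and V-Order can be combined in the same rewrite
      OPTIMIZE sales_transactions ZORDER BY (CustomerId) VORDER;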
      • [warehouse] 
        • works by applying certain operations on Parquet files
          • special sorting
          • row group distribution
          • dictionary encoding
          • compression 
        • enabled by default
        •  ⇒ compute engines require less network, disk, and CPU resources to read data from storage [1]
          • provides cost efficiency and performance [1]
            • the effect of V-Order on performance can vary depending on tables' schemas, data volumes, query, and ingestion patterns [1]
          • fully-compliant to the open-source parquet format [1]
            • ⇐ all parquet engines can read it as regular parquet files [1]
        • required by certain features
          • [Direct Lake mode] depends on V-Order
        • {operation} disable V-Order
          • causes any new Parquet files produced by the warehouse engine to be created without V-Order optimization [3]
          • irreversible operation
            •  once disabled, it cannot be enabled again [3]
          • {scenario} write-intensive warehouses
            • warehouses dedicated to staging data as part of a data ingestion process [1]
          • {warning} consider the effect of V-Order on performance before deciding to disable it [1]
            • {recommendation} test how V-Order affects the performance of data ingestion and queries before deciding to disable it [1]
          • via ALTER DATABASE CURRENT SET VORDER = OFF; [3]
        • {operation} check current status
          • via  SELECT name, is_vorder_enabled FROM sys.databases; [post]
      • {feature} [lakehouse] Load to Table
        • allows loading a single file or a folder of files into a table [6]
        • tables are always loaded using the Delta Lake table format with V-Order optimization enabled [6]
      • [Direct Lake semantic model] 
        • data is prepared for fast loading into memory [5]
          • makes less demands on capacity resources [5]
          • results in faster query performance [5]
            • because less memory needs to be scanned [5]

      References:
      [1] Microsoft Learn (2024) Fabric: Understand V-Order for Microsoft Fabric Warehouse [link]
      [2] Microsoft Learn (2024) Delta Lake table optimization and V-Order [link]
      [3] Microsoft Learn (2024) Disable V-Order on Warehouse in Microsoft Fabric [link]
      [4] Miles Cole (2024) To V-Order or Not: Making the Case for Selective Use of V-Order in Fabric Spark [link]
      [5] Microsoft Learn (2024) Understand storage for Direct Lake semantic models [link]
      [6] Microsoft Learn (2025) Fabric: Load to Delta Lake table [link]

      Resources:
      [R1] Serverless.SQL (2024) Performance Analysis of V-Ordering in Fabric Warehouse: On or Off?, by Andy Cutler [link]
      [R2] Redgate (2023) Microsoft Fabric: Checking and Fixing Tables V-Order Optimization, by Dennes Torres [link]
      [R3] Sandeep Pawar (2023) Checking If Delta Table in Fabric is V-order Optimized [link]
      [R4] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

      Acronyms:
      MF - Microsoft Fabric

      🏭🗒️Microsoft Fabric: Caching in Warehouse [Notes]

      Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

      Last updated: 17-Mar-2025

      [Microsoft Fabric] Caching
      • {def} technique that improves the performance of data processing by storing frequently accessed data and metadata in a faster storage layer [1]
        • e.g. local memory, local SSD disk
        • ⇐ subsequent requests can be served faster, directly from the cache [1]
          • if a set of data has been previously accessed by a query, any subsequent queries will retrieve that data directly from the in-memory cache [1]
          • local memory operations are notably faster compared to fetching data from remote storage [1]
          • ⇐ significantly diminishes IO latency [1]
        • fully transparent to the user
        • consistently active and operates seamlessly in the background [1]
        • orchestrated and maintained by MF
          • it doesn't offer users the capability to manually clear the cache [1] 
        • provides transactional consistency
          • ensures that any modifications to the data in storage after it has been initially loaded into the in-memory cache, will result in consistent data [1]
        • when the cache reaches its capacity threshold and fresh data is being read for the first time, objects that have remained unused for the longest duration will be removed from the cache [1]
          • process is enacted to create space for the influx of new data and maintain an optimal cache utilization strategy [1] 
      • {type} in-memory cache
        • data in cache is organized in a compressed columnar format (aka columnar storage) [1]
          • ⇐ optimized for analytical queries
          • each column of data is stored separately [1]
            • {benefit} allows for better compression [1]
              • since similar data values are stored together [1]
            • {benefit} reduces the memory footprint
        • when queries need to perform operations on a specific column, the engine can work more efficiently speeding up queries' execution [1]
          • it doesn't have to process unnecessary data from other columns
          • can perform operations on multiple columns simultaneously [1]
            • taking advantage of modern multi-core processors [1]
        • when the engine retrieves data from storage, the data is transformed from its original file-based format into highly optimized structures in the in-memory cache [1]
        • {scenario} analytical workloads where queries involve scanning large amounts of data to perform aggregations, filtering, and other data manipulations [1]
      • {type} disk cache
        • complementary extension to the in-memory cache
        • any data loaded into the in-memory cache is also serialized to the SSD cache [1]
          • data removed from the in-memory cache remains within the SSD cache for an extended period
          • when subsequent query requests the data, it is retrieved from the SSD cache into the in-memory cache quicker [1]
      • {issue} cold run (aka cold cache) performance
        • the first 1-3 executions of a query perform noticeably slower than subsequent executions [2]
          • if the first run's performance is crucial, try manually creating statistics (aka pre-warming the cache) [2]
          • otherwise, one can rely on the automatic statistics generated in the first query run and leveraged in subsequent runs [2]
          • as long as underlying data does not change significantly [2]
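        A minimal sketch of manually creating statistics to mitigate the cold-run effect, with hypothetical table and column names (CREATE/UPDATE STATISTICS as supported by the warehouse engine):

        -- create statistics on a column frequently used in joins/filters
        CREATE STATISTICS stats_Trip_MedallionID
        ON dbo.Trip (MedallionID)
        WITH FULLSCAN;

        -- refresh the statistics after significant data changes
        UPDATE STATISTICS dbo.Trip (stats_Trip_MedallionID) WITH FULLSCAN;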
      • differentiated from
        • [Kusto] caching policy [link]
        • [Apache Spark] intelligent cache [link]
        • [Power BI] query caching [link]
        • [Azure] caching [link]
        References:
        [1] Microsoft Learn (2024) Fabric: Caching in Fabric data warehousing [link]
        [2] Microsoft Learn (2024) Fabric Data Warehouse performance guidelines [link]
        [3] Sandeep Pawar (2023) Pre-Warming The Direct Lake Dataset For Warm Cache Import-Like Performance [link]

        Resources:
        [R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

        Acronyms:
        IO - Input/Output
        MF - Microsoft Fabric
        SSD - Solid State Drive

        16 March 2025

        💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part XII: Databases)

        After reviewing the server and database properties, the next step is to review the database configurations using the sys.databases metadata. Such an object exists even for warehouses and lakehouses, respectively for mirrored databases.

        Except for the last attribute selected in the query, all the others are supported by all the above-mentioned repositories.

        -- database metadata
        SELECT db.name
        , db.database_id
        --, db.source_database_id
        , db.owner_sid
        , db.create_date
        , db.compatibility_level
        , db.collation_name
        --, db.user_access
        , db.user_access_desc
        --, db.is_read_only
        --, db.is_auto_close_on
        --, db.is_auto_shrink_on
        --, db.state
        , db.state_desc
        --, db.is_in_standby
        --, db.is_cleanly_shutdown
        --, db.is_supplemental_logging_enabled
        --, db.snapshot_isolation_state
        , db.snapshot_isolation_state_desc
        --, db.is_read_committed_snapshot_on
        --, db.recovery_model
        , db.recovery_model_desc
        , db.page_verify_option
        , db.page_verify_option_desc
        , db.is_auto_create_stats_on
        --, db.is_auto_create_stats_incremental_on
        , db.is_auto_update_stats_on
        --, db.is_auto_update_stats_async_on
        --, db.is_ansi_null_default_on
        --, db.is_ansi_nulls_on
        --, db.is_ansi_padding_on
        --, db.is_ansi_warnings_on
        --, db.is_arithabort_on
        --, db.is_concat_null_yields_null_on
        --, db.is_numeric_roundabort_on
        --, db.is_quoted_identifier_on
        --, db.is_recursive_triggers_on
        --, db.is_cursor_close_on_commit_on
        --, db.is_local_cursor_default
        , db.is_fulltext_enabled
        --, db.is_trustworthy_on
        , db.is_db_chaining_on
        --, db.is_parameterization_forced
        , db.is_master_key_encrypted_by_server
        --, db.is_query_store_on
        --, db.is_published
        --, db.is_subscribed
        --, db.is_merge_published
        --, db.is_distributor
        --, db.is_sync_with_backup
        , db.service_broker_guid
        --, db.is_broker_enabled
        --, db.log_reuse_wait
        , db.log_reuse_wait_desc
        --, db.is_date_correlation_on
        --, db.is_cdc_enabled
        --, db.is_encrypted
        --, db.is_honor_broker_priority_on
        --, db.replica_id
        --, db.group_database_id
        --, db.resource_pool_id
        --, db.default_language_lcid
        --, db.default_language_name
        --, db.default_fulltext_language_lcid
        --, db.default_fulltext_language_name
        --, db.is_nested_triggers_on
        --, db.is_transform_noise_words_on
        --, db.two_digit_year_cutoff
        --, db.containment
        --, db.containment_desc
        , db.target_recovery_time_in_seconds
        --, db.delayed_durability
        --, db.delayed_durability_desc
        --, db.is_memory_optimized_elevate_to_snapshot_on
        --, db.is_federation_member
        --, db.is_remote_data_archive_enabled
        --, db.is_mixed_page_allocation_on
        , db.is_temporal_history_retention_enabled
        --, db.catalog_collation_type
        , db.catalog_collation_type_desc
        , db.physical_database_name
        --, db.is_result_set_caching_on
        , db.is_accelerated_database_recovery_on
        --, db.is_tempdb_spill_to_remote_store
        --, db.is_stale_page_detection_on
        , db.is_memory_optimized_enabled
        , db.is_data_retention_enabled
        --, db.is_ledger_on
        --, db.is_change_feed_enabled
        --, db.is_data_lake_replication_enabled
        --, db.is_change_streams_enabled
        --, db.data_lake_log_publishing
        , db.data_lake_log_publishing_desc
        , db.is_vorder_enabled
        --, db.is_optimized_locking_on
        FROM sys.databases db
        

        Output (consolidated):

        Attribute | Warehouse | Mirrored Db | Lakehouse | SQL database
        name | Warehouse Test 001 | MirroredDatabase_1 | Lakehouse_DWH | SQL DB Test...
        database_id | 5 | 6 | 7 | 28
        owner_sid | 0x01010000000000051... | 0x01010000000000051... | x01010000000000051... | AAAAAAWQAAAAAA...
        create_date | 2025-02-22T18:56:28.700 | 2025-03-16T14:57:56.600 | 2025-03-16T15:07:59.563 | 2025-02-22T03:01:53.7130000
        compatibility_level | 160 | 160 | 160 | 160
        collation_name | Latin1_General_100_BIN2_UTF8 | Latin1_General_100_BIN2_UTF8 | Latin1_General_100_BIN2_UTF8 | SQL_Latin1_General_CP1_CI_AS
        user_access_desc | MULTI_USER | MULTI_USER | MULTI_USER | MULTI_USER
        state_desc | ONLINE | ONLINE | ONLINE | ONLINE
        snapshot_isolation_state_desc | ON | ON | ON | ON
        recovery_model_desc | SIMPLE | SIMPLE | SIMPLE | FULL
        page_verify_option | 0 | 0 | 0 | 2
        page_verify_option_desc | NONE | NONE | NONE | CHECKSUM
        is_auto_create_stats_on | 1 | 1 | 1 | 1
        is_auto_update_stats_on | 1 | 1 | 1 | 1
        is_fulltext_enabled | 1 | 1 | 1 | 1
        is_db_chaining_on | 0 | 0 | 0 | 0
        is_master_key_encrypted_by_server | 0 | 0 | 0 | 0
        service_broker_guid | 1F2261FC-5031-... | 7D882362-567E-... | 0D8938AB-BA79-... | 2b74bed3-4405-...
        log_reuse_wait_desc | NOTHING | NOTHING | NOTHING | NOTHING
        target_recovery_time_in_seconds | 60 | 60 | 60 | 60
        is_temporal_history_retention_enabled | 1 | 1 | 1 | 1
        catalog_collation_type_desc | DATABASE_DEFAULT | DATABASE_DEFAULT | DATABASE_DEFAULT | SQL_Latin1_General_CP1_CI_AS
        physical_database_name | Warehouse Test 001 | MirroredDatabase_1 | Lakehouse_DWH | 3f4a3e79-e53e-...
        is_accelerated_database_recovery_on | 1 | 1 | 1 | 1
        is_memory_optimized_enabled | 1 | 1 | 1 | 1
        is_data_retention_enabled | 1 | 1 | 1 | 1
        data_lake_log_publishing_desc | AUTO | UNSUPPORTED | UNSUPPORTED | UNSUPPORTED
        is_vorder_enabled | 1 | 1 | 1 | 0

        The output for the SQL database was formatted slightly differently and is_vorder_enabled is not available. Otherwise, the above query can be used for all environments.

        All attributes except the last two are known from the earlier versions of SQL Server. is_vorder_enabled reflects the current status of V-Order [2], while data_lake_log_publishing_desc reflects the current state of Delta Lake log publishing [3].

        Consolidating the output from different sources helps identify the differences and one can easily use Excel formulas for this.

        Previous Post <<||>> Next Post

        References:
        [1] Microsoft Learn (2025) SQL Server 2022: sys.databases (Transact-SQL) [link]
        [2] Microsoft Learn (2025) Fabric: Disable V-Order on Warehouse in Microsoft Fabric [link]
        [3] Microsoft Learn (2025) Fabric: Delta Lake logs in Warehouse in Microsoft Fabric [link]

        25 February 2025

        🏭💠🗒️Microsoft Fabric: T-SQL Notebook [Notes]

        Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

        Last updated: 25-Feb-2025

        [Microsoft Fabric] T-SQL notebook

        • {def} notebook that enables writing and running T-SQL code [1]
        • {feature} allows managing complex queries and writing better markdown documentation [1]
        • {feature} allows the direct execution of T-SQL on
          • connected warehouse
          • SQL analytics endpoint
          • ⇐ queries can be run directly on the connected endpoint [1]
            • multiple connections are allowed [1]
        • allows running cross-database queries to gather data from multiple warehouses and SQL analytics endpoints [1]
        • the code is run by the primary warehouse
          • used as default in commands that support three-part naming when no warehouse is provided [1]
          • three-part naming consists of 
            • database name
              • the name of the warehouse or SQL analytics endpoint [1]
            • schema name
            • table name
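        A minimal sketch of a cross-warehouse query using three-part naming, with hypothetical warehouse, schema, and table names:

        -- join data across two connected warehouses/endpoints
        SELECT s.SalesOrderId, c.CustomerName
        FROM SalesWarehouse.dbo.SalesOrders s
        JOIN CrmWarehouse.dbo.Customers c
          ON c.CustomerId = s.CustomerId;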
        • {feature} autogenerate T-SQL code using the code template from the object explorer's context [1] menu
        • {concept} code cells
          • allow to create and run T-SQL code
            • each code cell is executed in a separate session [1]
              • {limitation} the variables defined in one cell are not available in another cell [1]
              • one can check the execution summary after the code is executed [1]
            • cells can be run individually or together [1]
            • one cell can contain multiple lines of code [1]
              • users can select and run subparts of a cell’s code [1]
          • {feature} Table tab
            • lists the records from the returned result set
              • if the execution contains multiple result sets, you can switch from one to another via the dropdown menu [1]
        • a query can be saved as 
          • view
            • via 'Save as' view
            • {limitation} does not support three-part naming [1]
              • the view is always created in the primary warehouse [1]
                • by setting the warehouse as the primary warehouse [1]
          • table
            • via 'Save as' table
            • saved as CTAS 
          • ⇐ 'Save as' is only available for the selected query text
            • the query text must be selected before using the Save as options
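        A minimal sketch of the CTAS statement that 'Save as' table conceptually generates from the selected query text (hypothetical names):

        -- create a table from the selected query
        CREATE TABLE dbo.TopCustomers
        AS
        SELECT CustomerId
        , SUM(Amount) AS TotalAmount
        FROM dbo.SalesOrders
        GROUP BY CustomerId;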
        • {limitation} doesn’t support 
          • parameter cell
            • the parameter passed from pipeline or scheduler can't be used [1]
          • {feature} Recent Run 
            • {workaround} use the current data warehouse monitoring feature to check the execution history of the T-SQL notebook [1]
          • {feature} the monitor URL inside the pipeline execution
          • {feature} snapshot 
          • {feature} Git support 
          • {feature} deployment pipeline support 

        References:
        [1] Microsoft Learn (2025) T-SQL support in Microsoft Fabric notebooks [link]
        [2] Microsoft Learn (2025) Create and run a SQL Server notebook [link]
        [3] Microsoft Learn (2025) T-SQL surface area in Microsoft Fabric [link]
        [4] Microsoft Fabric Updates Blog (2024) Announcing Public Preview of T-SQL Notebook in Fabric [link]

        Resources:
        [R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

        Acronyms
        CTAS - Create Table as Select
        T-SQL - Transact SQL

        23 January 2025

        💎🏭SQL Reloaded: Number of Records V (via Cursors, a Solution for Warehouses in Microsoft Fabric)

        After deploying the sample warehouse available in Microsoft Fabric, I tried to check the number of records available in the deployed tables under the dbo schema. Surprisingly, the sys.partitions.count column has 0 values for all the tables associated with the respective schema (see post). 

        There are only a few tables available, and taking a record count for each table should be enough, which is relatively simple with the undocumented sp_MSForEachTable. Unfortunately, this approach doesn't work either, so one needs to revert to the use of old-fashioned cursors (as I used to do in SQL Server 2000):

        -- number of records via cursor
        DECLARE @table_name nvarchar(150)
        DECLARE @sql nvarchar(250)
        DECLARE @number_records bigint 
        DECLARE @number_tables int, @iterator int
        
        DROP TABLE IF EXISTS dbo.#tables;
        
        CREATE TABLE dbo.#tables (
          ranking int NOT NULL
        , table_name nvarchar(150) NOT NULL
        , number_records bigint
        )
        
        INSERT INTO #tables
        SELECT row_number() OVER(ORDER BY object_id) ranking
        , concat(schema_name(schema_id),'.', name) table_name
        , NULL number_records
        FROM sys.tables obj
        WHERE obj.schema_id = schema_id('dbo')
        ORDER BY table_name
        
        SET @iterator = 1
        SET @number_tables = IsNull((SELECT count(*) FROM #tables), 0)
        
        WHILE (@iterator <= @number_tables)
        BEGIN 
            SET @table_name = (SELECT table_name FROM #tables WHERE ranking = @iterator)
            SET @sql = CONCAT(N'SELECT @NumberRecords = count(*) FROM ', @table_name)
        
            BEGIN TRY
                -- get the number of records
                EXEC sp_executesql @Query = @sql
                , @params = N'@NumberRecords bigint OUTPUT'
                , @NumberRecords = @number_records OUTPUT
        
                IF IsNull(@number_records, 0) > 0  
                BEGIN
                    SET @sql = 'UPDATE #tables' 
                        + ' SET number_records = ' + Str(@number_records)
                        + ' WHERE table_name = ''' + @table_name + '''';
        
                    EXEC(@sql)
                END 
            END TRY
            BEGIN CATCH  
                -- no action needed in case of error
            END CATCH;
        
            SET @iterator = @iterator + 1
        END
        
        SELECT *
        FROM dbo.#tables;
        
        --DROP TABLE IF EXISTS dbo.#tables;
        
        Results:
        ranking table_name number_records
        1 dbo.Date 5844
        2 dbo.Geography 305179
        3 dbo.HackneyLicense 42958
        4 dbo.Time 86400
        5 dbo.Weather 526330
        6 dbo.Trip 2838927
        7 dbo.Medallion 13668

        Comments:
        1) It's a lot of code for a simple task, though the code can be easily duplicated and adapted for similar requirements. Unfortunately, over time it can also lead to many instances of the same code. When possible, one should consider encapsulating the logic in a stored procedure.
        2) It's usually a good idea to check how many records are available in the tables used for testing, as this can impact queries' performance and tables' appropriateness for the tests performed. Moreover, it's a good idea to understand the volume of data when taking over or working with a database. 
        3) If one removes the row_number function, the code should run also in SQL Server 2000. Similar solutions were used then for retrieving the record count.
        4) Microsoft recommends not dropping temporary tables explicitly, but letting SQL Server handle the cleanup automatically, thus taking advantage of the Optimistic Latching Algorithm, which helps prevent contention on TempDB [2].
        5) There are others who stumbled over this issue (see [1]).
        6) The solution has been tested successfully also in SQL databases.
        7) The whole code must be run together because the temporary table seems to have only a transitory scope! An attempt to rerun the last SELECT from #tables raises the error: "Invalid object name '#tables'"

        Happy coding!

        Previous Post <<||>> Next Post

        References:
        [1] Koen Verbeeck (2024) Get row counts of all tables in a Microsoft Fabric warehouse [link]
        [2] Haripriya SB (2024) Do NOT drop #temp tables [link]
