Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!
Unfortunately, besides the referenced papers, there's almost no material that could be used to enhance the understanding of the concepts presented.
Last updated: 26-Mar-2025
Read and Write Operations in Polaris [2]
[Microsoft Fabric] Polaris SQL Pool
- {def} distributed SQL query engine that powers Microsoft Fabric's data warehousing capabilities
- designed to unify data warehousing and big data workloads while separating compute and state for seamless cloud-native operations
- based on a robust DCP
- designed to execute read-only queries in a scalable, dynamic and fault-tolerant way [1]
- a highly-available micro-service architecture with well-defined responsibilities [2]
- data and query processing are packaged into units (aka tasks)
- can be readily moved across compute nodes and re-started at the task level
- widely-partitioned data with a flexible distribution model [2]
- a task-level "workflow-DAG" that is novel in spanning multiple queries [2]
- a framework for fine-grained monitoring and flexible scheduling of tasks [2]
- {component} SQL Server Front End (SQL-FE)
- responsible for
- compilation
- authorization
- authentication
- metadata
- used by the compiler to
- {operation} generate the search space (aka MEMO) for incoming queries
- {operation} bind metadata to data cells
- leveraged to ensure the durability of the transaction manifests at commit [2]
- only transactions that successfully commit need to be actively tracked to ensure consistency [2]
- any manifests and data associated with aborted transactions are systematically garbage-collected from OneLake through specialized system tasks [2]
- {component} SQL Server Backend (SQL-BE)
- used to perform write operations on the LST [2]
- inserting data into a LST creates a set of Parquet files that are then recorded in the transaction manifest [2]
- a transaction is represented by a single manifest file that is modified concurrently by (one or more) SQL BEs [2]
- SQL BE leverages the Block Blob API provided by ADLS to coordinate the concurrent writes [2]
- each SQL BE instance serializes the information about the actions it performed, either adding a Parquet file or removing it [2]
- the serialized information is then uploaded as a block to the manifest file
- uploading the block does not yet make any visible changes to the file [2]
- each block is identified by a unique ID generated on the writing SQL BE [2]
- after completion, each SQL BE returns the ID of the block(s) it wrote to the Polaris DCP [2]
- the block IDs are then aggregated by the Polaris DCP and returned to the SQL FE as the result of the query [2]
- the SQL FE further aggregates the block IDs and issues a Commit Block operation against storage with the aggregated block IDs [2]
- at this point, the changes to the file on storage become effective [2]
- changes to the manifest file are not visible until the Commit operation on the SQL FE
- the Polaris DCP can freely restart any part of the operation in case there is a failure in the node topology [2]
- the IDs of any blocks written by previous attempts are not included in the final list of block IDs and are discarded by storage [2]
- ⇐ the stage-and-commit sequence is sketched in code below
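The stage-and-commit sequence above maps onto the public Azure Block Blob API. A minimal sketch in Python with the azure-storage-blob SDK, assuming a hypothetical manifest blob name and a simplified JSON action format (the actual Polaris manifest layout is internal):

    import base64, json, uuid
    from azure.storage.blob import BlobClient, BlobBlock

    # hypothetical manifest blob; account, container and path are assumptions
    blob = BlobClient.from_connection_string(
        conn_str="<connection-string>",
        container_name="onelake",
        blob_name="tx/manifest-0001.json")

    def stage_action(action: dict) -> str:
        # each SQL BE serializes the action it performed (adding or removing
        # a Parquet file) and uploads it as an uncommitted block
        block_id = base64.b64encode(uuid.uuid4().hex.encode()).decode()
        blob.stage_block(block_id=block_id, data=json.dumps(action).encode())
        return block_id  # staged only: no visible change to the file yet

    # block IDs flow SQL BE -> DCP -> SQL FE; the SQL FE commits them all at once
    ids = [stage_action({"op": "add", "file": f"part-{i}.parquet"}) for i in range(3)]
    blob.commit_block_list([BlobBlock(block_id=i) for i in ids])
    # only now does the manifest change become effective; blocks staged by
    # failed or restarted attempts are never committed and storage discards them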
- [read operations] SQL BE is responsible for reconstructing the table snapshot based on the set of manifest files managed in the SQL FE
- the result is the set of Parquet data files and deletion vectors that represent the snapshot of the table [2]
- queries over these are processed by the SQL Server query execution engine [2]
- the reconstructed state is cached in memory and organized in such a way that the table state can be efficiently reconstructed as of any point in time [2]
- enables the cache to be used by different operations operating on different snapshots of the table [2]
- enables the cache to be incrementally updated as new transactions commit [2]
- ⇐ the replay idea behind the reconstruction is sketched below
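A hedged sketch of the reconstruction: replaying simplified add/remove actions (the format from the previous sketch) into the set of live Parquet files; deletion vectors and the point-in-time cache organization are omitted:

    def replay(manifests: list[list[dict]]) -> set[str]:
        # fold committed manifests (oldest first) into the set of live files
        live: set[str] = set()
        for manifest in manifests:
            for action in manifest:
                if action["op"] == "add":
                    live.add(action["file"])
                elif action["op"] == "remove":
                    live.discard(action["file"])
        return live  # the Parquet files forming the table snapshot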
- {feature} supports explicit user transactions
- can execute multiple statements within the same transaction in a consistent way
- the manifest file associated with the current transaction captures all the (reconciled) changes performed by the transaction [2]
- changes performed by prior statements in the current transaction need to be visible to any subsequent statement inside the transaction (but not outside of the transaction) [2]
- [multi-statement transactions] in addition to the committed set of manifest files, the SQL BE reads the manifest file of the current transaction and then overlays these changes on the committed manifests [1] (see the overlay sketch below)
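Reusing the replay helper from the previous sketch, the overlay can be pictured as replaying the current transaction's manifest on top of the committed ones; the inputs here are placeholders in the same simplified action format:

    # placeholder inputs in the simplified action format of the sketches above
    committed_manifests = [[{"op": "add", "file": "part-0.parquet"}]]
    current_txn_manifest = [{"op": "remove", "file": "part-0.parquet"},
                            {"op": "add", "file": "part-0-v2.parquet"}]

    committed_state = replay(committed_manifests)   # what other sessions see
    txn_state = replay(committed_manifests + [current_txn_manifest])
    # subsequent statements inside the transaction see txn_state; the overlay
    # stays invisible outside the transaction until commit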
- {write operations} the behavior of the SQL BE depends on the type of the operation
- insert operations
- only add new data and have no dependency on previous changes [2]
- the SQL BE can serialize the metadata blocks holding information about the newly created data files just like before [2]
- the SQL FE, instead of committing only the IDs of the blocks written by the current operation, appends them to the list of previously committed blocks
- ⇐ effectively appends the data to the manifest file [2] (sketched below)
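Continuing the Block Blob sketch, an insert-only commit can be modeled by re-committing the previously committed block IDs together with the new ones; get_block_list is a real SDK call, the remaining names reuse the hypothetical setup from above:

    new_ids = [stage_action({"op": "add", "file": "part-new.parquet"})]
    committed_blocks, _ = blob.get_block_list(block_list_type="committed")
    all_ids = [b.id for b in committed_blocks] + new_ids
    blob.commit_block_list([BlobBlock(block_id=i) for i in all_ids])
    # the new metadata blocks are appended to the manifest without rewriting
    # the previously committed ones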
- {update|delete operations}
- handled differently
- ⇐ since they can potentially further modify data already modified by a prior statement in the same transaction [2]
- e.g. an update operation can be followed by another update operation touching the same rows
- the final transaction manifest should not contain any information about the parts from the first update that were made obsolete by the second update [2]
- SQL BE leverages the partition assignment from the Polaris DCP to perform a distributed rewrite of the transaction manifest to reconcile the actions of the current operation with the actions recorded by the previous operation [2]
- the resulting block IDs are sent again to the SQL FE where the manifest file is committed using the (rewritten) block IDs [2]
- ⇐ a toy reconciliation is sketched below
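A toy reconciliation over the simplified action records from the earlier sketches: actions of the current operation supersede prior 'add' actions for files the current operation rewrote. The real rewrite is distributed across SQL BEs via the DCP's partition assignment:

    def reconcile(prev: list[dict], curr: list[dict]) -> list[dict]:
        # files removed (rewritten) by the current operation make the prior
        # operation's 'add' actions for those files obsolete
        rewritten = {a["file"] for a in curr if a["op"] == "remove"}
        kept = [a for a in prev
                if not (a["op"] == "add" and a["file"] in rewritten)]
        return kept + curr  # the rewritten transaction manifest content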
- {concept} Distributed Query Processor (DQP)
- responsible for
- distributed query optimization
- distributed query execution
- query execution topology management
- {concept} Workload Management (WLM)
-
consists of a set of compute servers that are, simply, an
abstraction of a host provided by the compute fabric, each with a
dedicated set of resources (disk, CPU and memory) [2]
- each compute server runs two micro-services
- {service} Execution Service (ES)
- responsible for tracking the life span of tasks assigned to a compute container by the DQP [2]
- {service} SQL Server instance
- used as the backbone for execution of the template query for a given task [2]
- ⇐ holds a cache on top of local SSDs
- in addition to in-memory caching of hot data
- data can be transferred from one compute server to another
- via dedicated data channels
- the data channel is also used by the compute servers to send results to the SQL FE, which returns the results to the user [2]
- the life cycle of a query is tracked via control flow channels from the SQL FE to the DQP, and from the DQP to the ES [2]
- {concept} cell data abstraction
- the key building block that enables abstracting the data stores
- abstracts DQP from the underlying store [1]
- any dataset can be mapped to a collection of cells [1]
- allows distributing query processing over data in diverse formats [1]
- tailored for vectorized processing when the data is stored in columnar formats [1]
- further improves relational query performance
- 2-dimensional
- distributions (data alignment)
- partitions (data pruning)
- each cell is self-contained with its own statistics [1]
- used for both global and local QO [1]
- cells can be grouped physically in storage [1]
- queries can selectively reference either cell dimension or even individual cells depending on predicates and type of operations present in the query [1]
- ⇐ a minimal data-structure sketch follows this list
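A minimal data-structure sketch of the two-dimensional cell abstraction; all field names are assumptions for illustration, not Polaris internals:

    from dataclasses import dataclass, field

    @dataclass
    class Cell:
        distribution: int          # hash bucket, used for data alignment
        partition: str             # partition value, used for data pruning
        files: list[str] = field(default_factory=list)  # self-contained data
        row_count: int = 0         # per-cell statistics for global/local QO

    # a dataset as a collection of cells, addressable by either dimension
    dataset: dict[tuple[int, str], Cell] = {}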
- {concept} distributed query processing (DQP) framework
- operates at the cell level
- agnostic to the details of the data within a cell
- data extraction from a cell is the responsibility of the (single node) query execution engine, which is primarily SQL Server, and is extensible for new data types [1], [2]
- {concept} dataset
- logically abstracted as a collection of cells [1]
- can be arbitrarily assigned to compute nodes to achieve parallelism [1]
- uniformly distributed across a large number of cells
- [scale-out processing] each dataset must be distributed across thousands of buckets or subsets of data objects, such that they can be processed in parallel across nodes (see the bucketing sketch below)
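A sketch of deterministic hash bucketing for scale-out; the bucket count and key encoding are illustrative choices, not documented Polaris constants:

    import hashlib

    def bucket_of(distribution_key: str, n_buckets: int = 1000) -> int:
        # stable hash (unlike built-in hash(), which varies per process),
        # so every node computes the same bucket for the same key
        digest = hashlib.sha256(distribution_key.encode()).digest()
        return int.from_bytes(digest[:8], "big") % n_buckets

    # cells in the same bucket can be assigned to the same compute node and
    # processed in parallel with all other buckets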
- {concept} session
- supports a spectrum of consumption models, ranging from serverless ad-hoc queries to long-standing pools or clusters [1]
- all data are accessible from any session [1]
- multiple sessions can access all underlying data concurrently [1]
- {concept} Physical Metadata layer
- new layer introduced in the SQL Server storage engine [2]
References:
[1] Josep Aguilar-Saborit et al (2020) POLARIS: The Distributed SQL Engine in Azure Synapse, Proceedings of the VLDB Endowment, PVLDB 13(12) [link]
[2] Josep Aguilar-Saborit et al (2024) Extending Polaris to Support Transactions [link]
[3] Gjnana P Duvvuri (2024) Microsoft Fabric Warehouse Deep Dive into Polaris Analytic Engine [link]
Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
[R2] Patrick Pichler (2023) Data Warehouse (Polaris) vs. Data Lakehouse (Spark) in Microsoft Fabric [link]
[R3] Tiago Balabuch (2023) Microsoft Fabric Data Warehouse - The Polaris engine [link]
Acronyms:
ADLS - Azure Data Lake Storage
CPU - Central Processing Unit
DAG - Directed Acyclic Graph
DB - Database
DCP - Distributed Computation Platform
DQP - Distributed Query Processing
DWH - Data Warehouse
ES - Execution Service
LST - Log-Structured Table
QO - Query Optimization
SQL BE - SQL Backend
SQL FE - SQL Frontend
SSD - Solid State Disk
WAL - Write-Ahead Log
WLM - Workload Management