Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!
Unfortunately, besides the referenced papers, there's almost no material that could be used to enhance the understanding of the concepts presented.
Last updated: 26-Mar-2025
Figure: Read and Write Operations in Polaris [2]
[Microsoft Fabric] Polaris SQL Pool
- {def} distributed SQL query engine that powers Microsoft Fabric's data warehousing capabilities
- designed to unify data warehousing and big data workloads while separating compute and state for seamless cloud-native operations
- based on a robust DCP
- designed to execute read-only queries in a scalable, dynamic and fault-tolerant way [1]
- a highly-available micro-service architecture with well-defined responsibilities [2]
- data and query processing are packaged into units (aka tasks)
- can be readily moved across compute nodes and re-started at the task level
- widely-partitioned data with a flexible distribution model [2]
- a task-level "workflow-DAG" that is novel in spanning multiple queries [2]
- a framework for fine-grained monitoring and flexible scheduling of tasks [2] (see the sketch below)
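The task-level restart behavior can be pictured with a small, purely illustrative sketch in Python: tasks form a DAG, and a failed task is retried in isolation without re-running its already completed upstream tasks. The `Task`/`run_dag` names and the retry policy are assumptions for illustration, not Polaris internals.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """A unit of data + query processing (hypothetical fields, for illustration)."""
    task_id: str
    deps: list = field(default_factory=list)   # IDs of upstream tasks in the DAG
    attempts: int = 0

def run_dag(tasks: dict, execute, max_attempts: int = 3) -> None:
    """Run tasks in dependency order; a failed task is simply restarted,
    without re-running its already completed upstream tasks."""
    done: set = set()
    while len(done) < len(tasks):
        ready = [t for t in tasks.values()
                 if t.task_id not in done and all(d in done for d in t.deps)]
        if not ready:
            raise RuntimeError("cycle in task DAG")
        for task in ready:
            task.attempts += 1
            try:
                execute(task)                   # e.g. run the task's template query
                done.add(task.task_id)
            except Exception:
                if task.attempts >= max_attempts:
                    raise                       # give up only after repeated failures
                # otherwise the task stays pending and is retried on the next pass
```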
- {component} SQL Server Front End (SQL-FE)
- responsible for
- compilation
- authorization
- authentication
- metadata
- used by the compiler to
- {operation} generate the search space (aka MEMO) for incoming queries
- {operation} bind metadata to data cells
- leveraged to ensure the durability of the transaction manifests at commit [2]
- only transactions that successfully commit need to be actively tracked to ensure consistency [2]
- any manifests and data associated with aborted transactions are systematically garbage-collected from OneLake through specialized system tasks [2]
- {component} SQL Server Backend (SQL-BE)
- used to perform write operations on the LST [2]
- inserting data into a LST creates a set of Parquet files that are then recorded in the transaction manifest [2]
- a transaction is represented by a single manifest file that is modified concurrently by (one or more) SQL BEs [2]
- SQL BE leverages the Block Blob API provided by ADLS to coordinate the concurrent writes [2] (a sketch of this protocol follows this list)
- each SQL BE instance serializes the information about the actions it performed, either adding a Parquet file or removing it [2]
- the serialized information is then uploaded as a block to the manifest file
- uploading the block does not yet make any visible changes to the file [2]
- each block is identified by a unique ID generated on the writing SQL BE [2]
- after completion, each SQL BE returns the ID of the block(s) it wrote to the Polaris DCP [2]
- the block IDs are then aggregated by the Polaris DCP and returned to the SQL FE as the result of the query [2]
- the SQL FE further aggregates the block IDs and issues a Commit Block operation against storage with the aggregated block IDs [2]
- at this point, the changes to the file on storage will become effective [2]
- changes to the manifest file are not visible until the Commit operation on the SQL FE
- the Polaris DCP can freely restart any part of the operation in case there is a failure in the node topology [2]
- the IDs of any blocks written by previous attempts are not included in the final list of block IDs and are discarded by storage [2]
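The staged-block protocol above maps fairly directly onto the public Azure Blob Storage block API. Below is a minimal sketch using the `azure-storage-blob` Python SDK; the connection string, container, and manifest names are placeholders, and the real SQL BE/FE code path is internal to the engine.

```python
# Staged-block writes: staging a block changes nothing visible; only the final
# Commit Block List call makes exactly the listed blocks part of the manifest.
import uuid
from azure.storage.blob import BlobClient, BlobBlock

manifest = BlobClient.from_connection_string(
    conn_str="<connection-string>",            # placeholder
    container_name="onelake-table",            # hypothetical container
    blob_name="txn-000123.manifest")           # hypothetical manifest file

def stage_action(action_bytes: bytes) -> str:
    """What each SQL BE conceptually does: upload its serialized action as a block.
    Staging does not alter the visible content of the manifest blob."""
    block_id = uuid.uuid4().hex                # unique ID generated on the writer
    manifest.stage_block(block_id=block_id, data=action_bytes)
    return block_id                            # returned to the DCP, then to the SQL FE

def commit_manifest(block_ids: list[str]) -> None:
    """What the SQL FE conceptually does at commit: one Commit Block List call makes
    exactly the listed blocks visible; blocks from failed attempts are discarded."""
    manifest.commit_block_list([BlobBlock(block_id=b) for b in block_ids])
```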
- [read operations] SQL BE is responsible for reconstructing the table snapshot based on the set of manifest files managed in the SQL FE
- the result is the set of Parquet data files and deletion vectors that represent the snapshot of the table [2]
- queries over these are processed by the SQL Server query execution engine [2]
- the reconstructed state is cached in memory and organized in such a way that the table state can be efficiently reconstructed as of any point in time [2]
- enables the cache to be used by different operations operating on different snapshots of the table [2]
- enables the cache to be incrementally updated as new transactions commit [2]
- {feature} supports explicit user transactions
- can execute multiple statements within the same transaction in a consistent way
- the manifest file associated with the current transaction captures all the (reconciled) changes performed by the transaction [2]
- changes performed by prior statements in the current transaction need to be visible to any subsequent statement inside the transaction (but not outside of the transaction) [2]
- [multi-statement transactions] in addition to the committed set of manifest files, the SQL BE reads the manifest file of the current transaction and overlays these changes on the committed manifests [1] (see the sketch below)
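To make the overlay concrete, here is a simplified model in Python of snapshot reconstruction: each manifest is reduced to lists of added and removed Parquet files (deletion vectors omitted). The dictionary layout is an assumption for illustration, not the engine's actual manifest format.

```python
def reconstruct_snapshot(committed_manifests, current_txn_manifest=None):
    """Replay committed manifests in commit order, then overlay the (uncommitted)
    manifest of the current transaction so its own prior statements are visible."""
    files = set()
    manifests = list(committed_manifests)
    if current_txn_manifest is not None:
        manifests.append(current_txn_manifest)   # visible inside the transaction only
    for m in manifests:
        files |= set(m.get("add", []))           # newly written Parquet files
        files -= set(m.get("remove", []))        # files superseded or deleted
    return files

committed = [{"add": ["part-001.parquet", "part-002.parquet"]},
             {"add": ["part-003.parquet"], "remove": ["part-001.parquet"]}]
in_flight = {"add": ["part-004.parquet"]}
print(reconstruct_snapshot(committed, in_flight))
# -> part-002, part-003, part-004 (set order may vary)
```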
- {write operations} the behavior of the SQL BE depends on the type of the operation.
- {insert operations}
- only add new data and have no dependency on previous changes [2]
- the SQL BE can serialize the metadata blocks holding information about the newly created data files just like before [2]
- the SQL FE, instead of committing only the IDs of the blocks written by the current operation, appends them to the list of previously committed blocks (see the sketch below)
- ⇐ effectively appends the data to the manifest file [2]
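Continuing the block-blob sketch above, the append effect of an insert can be imitated by committing the previously committed blocks plus the newly staged ones. The `commit_insert` helper and the use of `get_block_list` are illustrative assumptions, not the engine's actual code.

```python
from azure.storage.blob import BlobBlock

def commit_insert(manifest, new_block_ids: list[str]) -> None:
    """Append-style commit: re-list the blocks already committed to the manifest
    and add the blocks staged by the current insert at the end."""
    committed, _uncommitted = manifest.get_block_list(block_list_type="committed")
    all_ids = [b.id for b in committed] + new_block_ids
    manifest.commit_block_list([BlobBlock(block_id=b) for b in all_ids])
```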
- {update|delete operations}
- handled differently
- ⇐ since they can potentially further modify data already modified by a prior statement in the same transaction [2]
- e.g. an update operation can be followed by another update operation touching the same rows
- the final transaction manifest should not contain any information about the parts from the first update that were made obsolete by the second update [2]
- SQL BE leverages the partition assignment from the Polaris DCP to perform a distributed rewrite of the transaction manifest to reconcile the actions of the current operation with the actions recorded by the previous operation [2]
- the resulting block IDs are sent again to the SQL FE, where the manifest file is committed using the (rewritten) block IDs [2] (a sketch of this reconciliation follows)
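A toy reconciliation of two updates in the same transaction, in Python: files added by the first update and rewritten by the second must not survive in the final transaction manifest. The action format is invented for illustration and ignores deletion vectors and partition assignment.

```python
def reconcile(previous_actions, current_actions):
    """Merge the actions of the current operation with those of the previous one,
    dropping anything the current operation made obsolete."""
    superseded = set(current_actions.get("remove", []))
    kept_adds = [f for f in previous_actions.get("add", []) if f not in superseded]
    return {
        "add": kept_adds + current_actions.get("add", []),
        # removals of files that existed before the transaction still matter;
        # removals of files the transaction itself added can simply be dropped
        "remove": [f for f in superseded if f not in previous_actions.get("add", [])]
                  + previous_actions.get("remove", []),
    }

first_update = {"add": ["part-010.parquet"], "remove": ["part-003.parquet"]}
second_update = {"add": ["part-011.parquet"], "remove": ["part-010.parquet"]}
print(reconcile(first_update, second_update))
# {'add': ['part-011.parquet'], 'remove': ['part-003.parquet']}
```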
- {concept} Distributed Query Processor (DQP)
- responsible for
- distributed query optimization
- distributed query execution
- query execution topology management
- {concept} Workload Management (WLM)
- consists of a set of compute servers, each an abstraction of a host provided by the compute fabric with a dedicated set of resources (disk, CPU and memory) [2]
- each compute server runs two micro-services
- {service} Execution Service (ES)
- responsible for tracking the life span of tasks assigned to a compute container by the DQP [2]
- {service} SQL Server instance
- used as the backbone for execution of the template query for a given task [2]
- ⇐ holds a cache on top of local SSDs
- in addition to in-memory caching of hot data
- data can be transferred from one compute server to another
- via dedicated data channels
- the data channel is also used by the compute servers to send results to the SQL FE that returns the results to the user [2]
- the life cycle of a query is tracked via control flow channels from the SQL FE to the DQP, and the DQP to the ES [2]
- {concept} cell data abstraction
- the key building block used to abstract data stores
- abstracts DQP from the underlying store [1]
- any dataset can be mapped to a collection of cells [1]
- allows distributing query processing over data in diverse formats [1]
- tailored for vectorized processing when the data is stored in columnar formats [1]
- further improves relational query performance
- 2-dimensional
- distributions (data alignment)
- partitions (data pruning)
- each cell is self-contained with its own statistics [1]
- used for both global and local QO [1]
- cells can be grouped physically in storage [1]
- queries can selectively reference either cell dimension or even individual cells, depending on the predicates and type of operations present in the query [1] (see the sketch below)
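A minimal Python model of the two cell dimensions and per-cell statistics, showing partition pruning and distribution alignment; all names and fields are illustrative assumptions rather than the engine's data structures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    distribution: int          # hash bucket used for data alignment (e.g. joins)
    partition: str             # partition value used for data pruning
    row_count: int             # per-cell statistics usable by the optimizer

def prune(cells, wanted_partitions):
    """Partition pruning: keep only cells whose partition satisfies the predicate."""
    return [c for c in cells if c.partition in wanted_partitions]

def align(cells):
    """Data alignment: group cells by distribution so matching buckets of two
    datasets can be processed on the same node without reshuffling."""
    buckets = {}
    for c in cells:
        buckets.setdefault(c.distribution, []).append(c)
    return buckets

cells = [Cell(0, "2024-01", 10_000), Cell(1, "2024-01", 9_500), Cell(0, "2024-02", 8_000)]
print(align(prune(cells, {"2024-01"})))
```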
- {concept} distributed query processing (DQP) framework
- operates at the cell level
- agnostic to the details of the data within a cell
- data extraction from a cell is the responsibility of the (single node) query execution engine, which is primarily SQL Server, and is extensible for new data types [1], [2]
- {concept} dataset
- logically abstracted as a collection of cells [1]
- can be arbitrarily assigned to compute nodes to achieve parallelism [1]
- uniformly distributed across a large number of cells
- [scale-out processing] each dataset must be distributed across thousands of buckets or subsets of data objects,
- such that they can be processed in parallel across nodes (a sketch follows)
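A toy sketch in Python of hash-distributing data into buckets and assigning buckets to compute nodes; the hash function, bucket count, and node names are placeholders, not the engine's actual distribution scheme.

```python
import zlib

def bucket_of(key: str, num_buckets: int = 1000) -> int:
    """Deterministic stand-in for the engine's hash distribution of a key."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def assign_buckets(num_buckets: int, nodes: list[str]) -> dict[int, str]:
    """Round-robin assignment of buckets to compute nodes; any assignment works,
    since cells can be arbitrarily mapped to nodes to achieve parallelism."""
    return {b: nodes[b % len(nodes)] for b in range(num_buckets)}

assignment = assign_buckets(8, ["node-a", "node-b", "node-c"])
print(bucket_of("customer-42", 8), assignment)
```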
- {concept} session
- supports a spectrum of consumption models, ranging from serverless ad-hoc queries to long-standing pools or clusters [1]
- all data are accessible from any session [1]
- multiple sessions can access all underlying data concurrently [1]
- {concept} Physical Metadata layer
- new layer introduced in the SQL Server storage engine [2]
References:
[1] Josep Aguilar-Saborit et al (2020) POLARIS: The Distributed SQL Engine in Azure Synapse, Proceedings of the VLDB Endowment PVLDB 13(12) [link]
[2] Josep Aguilar-Saborit et al (2024) Extending Polaris to Support Transactions [link]
[3] Gjnana P Duvvuri (2024) Microsoft Fabric Warehouse Deep Dive into Polaris Analytic Engine [link]
[R2] Patrick Pichler (2023) Data Warehouse (Polaris) vs. Data Lakehouse (Spark) in Microsoft Fabric [link]
[R3] Tiago Balabuch (2023) Microsoft Fabric Data Warehouse - The Polaris engine [link]
DAG - Directed Acyclic Graph
DB - Database
DCP - Distributed Computation Platform
DQP - Distributed Query Processing
ES - Execution Service
LST - Log-Structured Table
SSD - Solid State Disk
WAL - Write-Ahead Log