Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!
Unfortunately, besides the referenced papers, there's almost no material that could be used to enhance the understanding of the concepts presented.
Last updated: 26-Mar-2025
Read and Write Operations in Polaris [2]
[Microsoft Fabric] Polaris SQL Pool
- {def} distributed SQL query engine that powers Microsoft Fabric's data warehousing capabilities
- designed to unify data warehousing and big data workloads while separating compute and state for seamless cloud-native operations
- based on a robust DCP
- designed to execute read-only queries in a scalable, dynamic and fault-tolerant way [1]
- a highly-available micro-service architecture with well-defined responsibilities [2]
- data and query processing are packaged into units (aka tasks)
- can be readily moved across compute nodes and re-started at the task level
- widely-partitioned data with a flexible distribution model [2]
- a task-level "workflow-DAG" that is novel in spanning multiple queries [2]
- a framework for fine-grained monitoring and flexible scheduling of tasks [2]
- {component} SQL Server Front End (SQL-FE)
- responsible for
- compilation
- authorization
- authentication
- metadata
- used by the compiler to
- {operation} generate the search space (aka MEMO) for incoming queries
- {operation} bind metadata to data cells
- leveraged to ensure the durability of the transaction manifests at commit [2]
- only transactions that successfully commit need to be actively tracked to ensure consistency [2]
- any manifests and data associated with aborted transactions are systematically garbage-collected from OneLake through specialized system tasks [2]
- {component} SQL Server Backend (SQL-BE)
- used to perform write operations on the LST [2]
- inserting data into a LST creates a set of Parquet files that are then recorded in the transaction manifest [2]
- a transaction is represented by a single manifest file that is modified concurrently by (one or more) SQL BEs [2]
- SQL BE leverages the Block Blob API provided by ADLS to coordinate the concurrent writes [2]
- each SQL BE instance serializes the information about the actions it performed, either adding a Parquet file or removing it [2]
- the serialized information is then uploaded as a block to the manifest file
- uploading the block does not yet make any visible changes to the file [2]
- each block is identified by a unique ID generated on the writing SQL BE [2]
- after completion, each SQL BE returns the ID of the block(s) it wrote to the Polaris DCP [2]
- the block IDs are then aggregated by the Polaris DCP and returned to the SQL FE as the result of the query [2]
- the SQL FE further aggregates the block IDs and issues a Commit Block operation against storage with the aggregated block IDs [2]
- at this point, the changes to the file on storage become effective [2]
- changes to the manifest file are not visible until the Commit operation on the SQL FE
- the Polaris DCP can freely restart any part of the operation in case there is a failure in the node topology [2]
- the IDs of any blocks written by previous attempts are not included in the final list of block IDs and are discarded by storage [2]
- ⇐ the stage-and-commit sequence is sketched in code below
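The stage-and-commit sequence above maps onto the public Azure Block Blob API. A minimal sketch in Python with the azure-storage-blob SDK, assuming a hypothetical manifest blob name and a simplified JSON action format (the actual Polaris manifest layout is internal):

    import base64, json, uuid
    from azure.storage.blob import BlobClient, BlobBlock

    # hypothetical manifest blob; account, container and path are assumptions
    blob = BlobClient.from_connection_string(
        conn_str="<connection-string>",
        container_name="onelake",
        blob_name="tx/manifest-0001.json")

    def stage_action(action: dict) -> str:
        # each SQL BE serializes the action it performed (adding or removing
        # a Parquet file) and uploads it as an uncommitted block
        block_id = base64.b64encode(uuid.uuid4().hex.encode()).decode()
        blob.stage_block(block_id=block_id, data=json.dumps(action).encode())
        return block_id  # staged only: no visible change to the file yet

    # block IDs flow SQL BE -> DCP -> SQL FE; the SQL FE commits them all at once
    ids = [stage_action({"op": "add", "file": f"part-{i}.parquet"}) for i in range(3)]
    blob.commit_block_list([BlobBlock(block_id=i) for i in ids])
    # only now does the manifest change become effective; blocks staged by
    # failed or restarted attempts are never committed and storage discards them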
- [read operations] SQL BE is responsible for reconstructing the table snapshot based on the set of manifest files managed in the SQL FE
- the result is the set of Parquet data files and deletion vectors that represent the snapshot of the table [2]
- queries over these are processed by the SQL Server query execution engine [2]
- the reconstructed state is cached in memory and organized in such a way that the table state can be efficiently reconstructed as of any point in time [2]
- enables the cache to be used by different operations operating on different snapshots of the table [2]
- enables the cache to be incrementally updated as new transactions commit [2]
- ⇐ the replay idea behind the reconstruction is sketched below
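A hedged sketch of the reconstruction: replaying simplified add/remove actions (the format from the previous sketch) into the set of live Parquet files; deletion vectors and the point-in-time cache organization are omitted:

    def replay(manifests: list[list[dict]]) -> set[str]:
        # fold committed manifests (oldest first) into the set of live files
        live: set[str] = set()
        for manifest in manifests:
            for action in manifest:
                if action["op"] == "add":
                    live.add(action["file"])
                elif action["op"] == "remove":
                    live.discard(action["file"])
        return live  # the Parquet files forming the table snapshot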
- {feature} supports explicit user transactions
- can execute multiple statements within the same transaction in a consistent way
- the manifest file associated with the current transaction captures all the (reconciled) changes performed by the transaction [2]
- changes performed by prior statements in the current transaction need to be visible to any subsequent statement inside the transaction (but not outside of the transaction) [2]
- [multi-statement transactions] in addition to the committed set of manifest files, the SQL BE reads the manifest file of the current transaction and then overlays these changes on the committed manifests [1] (see the overlay sketch below)
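Reusing the replay helper from the previous sketch, the overlay can be pictured as replaying the current transaction's manifest on top of the committed ones; the inputs here are placeholders in the same simplified action format:

    # placeholder inputs in the simplified action format of the sketches above
    committed_manifests = [[{"op": "add", "file": "part-0.parquet"}]]
    current_txn_manifest = [{"op": "remove", "file": "part-0.parquet"},
                            {"op": "add", "file": "part-0-v2.parquet"}]

    committed_state = replay(committed_manifests)   # what other sessions see
    txn_state = replay(committed_manifests + [current_txn_manifest])
    # subsequent statements inside the transaction see txn_state; the overlay
    # stays invisible outside the transaction until commit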
- {write operations} the behavior of the SQL BE depends on the type of the operation
- insert operations
- only add new data and have no dependency on previous changes [2]
- the SQL BE can serialize the metadata blocks holding information about the newly created data files just like before [2]
- the SQL FE, instead of committing only the IDs of the blocks written by the current operation, appends them to the list of previously committed blocks
- ⇐ effectively appends the data to the manifest file [2] (sketched below)
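Continuing the Block Blob sketch, an insert-only commit can be modeled by re-committing the previously committed block IDs together with the new ones; get_block_list is a real SDK call, the remaining names reuse the hypothetical setup from above:

    new_ids = [stage_action({"op": "add", "file": "part-new.parquet"})]
    committed_blocks, _ = blob.get_block_list(block_list_type="committed")
    all_ids = [b.id for b in committed_blocks] + new_ids
    blob.commit_block_list([BlobBlock(block_id=i) for i in all_ids])
    # the new metadata blocks are appended to the manifest without rewriting
    # the previously committed ones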
- {update|delete operations}
- handled differently
- ⇐ since they can potentially further modify data already modified by a prior statement in the same transaction [2]
- e.g. an update operation can be followed by another update operation touching the same rows
- the final transaction manifest should not contain any information about the parts from the first update that were made obsolete by the second update [2]
- SQL BE leverages the partition assignment from the Polaris DCP to perform a distributed rewrite of the transaction manifest to reconcile the actions of the current operation with the actions recorded by the previous operation [2]
- the resulting block IDs are sent again to the SQL FE where the manifest file is committed using the (rewritten) block IDs [2]
- ⇐ a toy reconciliation is sketched below
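A toy reconciliation over the simplified action records from the earlier sketches: actions of the current operation supersede prior 'add' actions for files the current operation rewrote. The real rewrite is distributed across SQL BEs via the DCP's partition assignment:

    def reconcile(prev: list[dict], curr: list[dict]) -> list[dict]:
        # files removed (rewritten) by the current operation make the prior
        # operation's 'add' actions for those files obsolete
        rewritten = {a["file"] for a in curr if a["op"] == "remove"}
        kept = [a for a in prev
                if not (a["op"] == "add" and a["file"] in rewritten)]
        return kept + curr  # the rewritten transaction manifest content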
- {concept} Distributed Query Processor (DQP)
- responsible for
- distributed query optimization
- distributed query execution
- query execution topology management
- {concept} Workload Management (WLM)
-
consists of a set of compute servers that are, simply, an
abstraction of a host provided by the compute fabric, each with a
dedicated set of resources (disk, CPU and memory) [2]
- each compute server runs two micro-services
- {service} Execution Service (ES)
- responsible for tracking the life span of tasks assigned to a compute container by the DQP [2]
- {service} SQL Server instance
- used as the backbone for execution of the template query for a given task [2]
- ⇐ holds a cache on top of local SSDs
- in addition to in-memory caching of hot data
- data can be transferred from one compute server to another
- via dedicated data channels
- the data channel is also used by the compute servers to send results to the SQL FE, which returns the results to the user [2]
- the life cycle of a query is tracked via control flow channels from the SQL FE to the DQP, and from the DQP to the ES [2]
- {concept} cell data abstraction
- the key building block that enables abstracting the data stores
- abstracts DQP from the underlying store [1]
- any dataset can be mapped to a collection of cells [1]
- allows distributing query processing over data in diverse formats [1]
- tailored for vectorized processing when the data is stored in columnar formats [1]
- further improves relational query performance
- 2-dimensional
- distributions (data alignment)
- partitions (data pruning)
- each cell is self-contained with its own statistics [1]
- used for both global and local QO [1]
- cells can be grouped physically in storage [1]
- queries can selectively reference either cell dimension or even individual cells depending on predicates and type of operations present in the query [1]
- ⇐ a minimal data-structure sketch follows this list
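A minimal data-structure sketch of the two-dimensional cell abstraction; all field names are assumptions for illustration, not Polaris internals:

    from dataclasses import dataclass, field

    @dataclass
    class Cell:
        distribution: int          # hash bucket, used for data alignment
        partition: str             # partition value, used for data pruning
        files: list[str] = field(default_factory=list)  # self-contained data
        row_count: int = 0         # per-cell statistics for global/local QO

    # a dataset as a collection of cells, addressable by either dimension
    dataset: dict[tuple[int, str], Cell] = {}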
- {concept} distributed query processing (DQP) framework
- operates at the cell level
- agnostic to the details of the data within a cell
- data extraction from a cell is the responsibility of the (single node) query execution engine, which is primarily SQL Server, and is extensible for new data types [1], [2]
- {concept} dataset
- logically abstracted as a collection of cells [1]
- can be arbitrarily assigned to compute nodes to achieve parallelism [1]
- uniformly distributed across a large number of cells
- [scale-out processing] each dataset must be distributed across thousands of buckets or subsets of data objects, such that they can be processed in parallel across nodes (see the bucketing sketch below)
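A sketch of deterministic hash bucketing for scale-out; the bucket count and key encoding are illustrative choices, not documented Polaris constants:

    import hashlib

    def bucket_of(distribution_key: str, n_buckets: int = 1000) -> int:
        # stable hash (unlike built-in hash(), which varies per process),
        # so every node computes the same bucket for the same key
        digest = hashlib.sha256(distribution_key.encode()).digest()
        return int.from_bytes(digest[:8], "big") % n_buckets

    # cells in the same bucket can be assigned to the same compute node and
    # processed in parallel with all other buckets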
- {concept} session
- supports a spectrum of consumption models, ranging from serverless ad-hoc queries to long-standing pools or clusters [1]
- all data are accessible from any session [1]
- multiple sessions can access all underlying data concurrently [1]
- {concept} Physical Metadata layer
- new layer introduced in the SQL Server storage engine [2]
References:
[1] Josep Aguilar-Saborit et al (2020) POLARIS: The Distributed SQL Engine in Azure Synapse, Proceedings of the VLDB Endowment, PVLDB 13(12) [link]
[2] Josep Aguilar-Saborit et al (2024) Extending Polaris to Support Transactions [link]
[3] Gjnana P Duvvuri (2024) Microsoft Fabric Warehouse Deep Dive into Polaris Analytic Engine [link]
Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
[R2] Patrick Pichler (2023) Data Warehouse (Polaris) vs. Data Lakehouse (Spark) in Microsoft Fabric [link]
[R3] Tiago Balabuch (2023) Microsoft Fabric Data Warehouse - The Polaris engine [link]
Acronyms:
ADLS - Azure Data Lake Storage
CPU - Central Processing Unit
DAG - Directed Acyclic Graph
DB - Database
DCP - Distributed Computation Platform
DQP - Distributed Query Processing
DWH - Data Warehouse
ES - Execution Service
LST - Log-Structured Table
QO - Query Optimization
SQL BE - SQL Backend
SQL FE - SQL Frontend
SSD - Solid State Disk
WAL - Write-Ahead Log
WLM - Workload Management