SQL Troubles: isolation

Showing posts with label isolation. Show all posts

31 March 2024

🏭🗒️Microsoft Fabric: Polaris [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!

Last updated: 31-Mar-2024

[Microsoft Fabric] Polaris

{def} cloud-native analytical query engine over the data lake that follows a stateless micro-service architecture and is designed to execute queries in a scalable, dynamic and fault-tolerant way [1], [2]

the engine behind the serverless SQL pool [1] and Microsoft Fabric [2]
petabyte-scale execution [1]
highly-available micro-service architecture

data and query processing is packaged into units (aka tasks) [1]

can be readily moved across compute nodes and re-started at the task level [1]

can run directly over data in HDFS and in managed transactional stores [1]

[Azure Synapse] designed initially to execute read-only queries [1]

⇐ the architecture behind serverless SQL pool
uses a completely new scale-out framework based on a distributed SQL Server query engine [1]

fully compatible with T-SQL
leverages SQL Server single-node runtime and QO [1]

[Microsoft Fabric] extended with a complete transaction manager that executes general CRUD transactions [2]

incl. updates, deletes and bulk loads [2]
based on [delta tables] and [delta lake]

the delta lake supports currently only transactions within one table [4]

⇐ the architecture behind lakehouses

{goal} converge DWH and big data workloads [1]

the query engine scales-out for relational data and heterogeneous datasets stored in DFSs[1]

needs a clean abstraction over the underlying data type and format, capturing just what’s needed for efficiently parallelizing data processing

{goal} separate compute and state for cloud-native execution [1]

all services within a pool are stateless

data is stored durably in remote storage and is abstracted via data cells [1]

⇐ data is naturally decoupled from compute nodes

the metadata and transactional log state is off-loaded to centralized services [[1]
multiple compute pools can transactionally access the same logical database [1]

{goal} cloud-first [2]

{benefit} leverages elasticity
transactions need to be resilient to node failures on dynamically changing topologies [2]

⇒ the storage engine disaggregates the source of truth for execution state (including data, metadata and transactional state) from compute nodes [2]

must ensure disaggregation of metadata and transactional state from compute nodes [2]

⇐ to ensure that the life span of a transaction is resilient to changes in the backend compute topology [2]

⇐ can change dynamically to take advantage of the elastic nature of the cloud or to handle node failures [2]

{goal} use optimized native columnar, immutable and open storage format [2]

uses delta format

⇐ optimized to handle read-heavy workloads with low contention [2]

{goal} leverage the full potential of vectorized query processing for SQL [2]
{goal} support zero-copy data sharing with other services in the lake [2]
{goal} support read-heavy workloads with low contention [2]
{goal} support lineage-based features [2]

by taking advantage of delta table capabilities

{goal} provide full SQL SI transactional support [2]

{benefit} all traditional DWH requirements are met [2]

incl. multi-table and multi-statement transactions [2]

⇐ Polaris is the only system that supports this [2]
the design is optimized for analytics, specifically read- and insert-intensive workloads [2]
mixes of transactions are supported as well

{objective} no cross-component state sharing [2]

{principle} encapsulation of state within each component to avoid sharing state across nodes [2]
SI and the isolation of state across components allows to execute transactions as if they were queries [2]

⇒ makes read and write transactions indistinguishable [2]

⇒ allows to fully leverage its optimized distributed execution framework [2]

{objective} support snapshot Isolation (SI) semantics [2]

implemented over versioned data
allows reads (R) and writes (W) to proceed concurrently over their own data snapshot

R/W never conflict, and W/W of active transactions only conflict if they modify the same data [2]

⇐ all W transactions are serializable, leading to a serial schedule in increasing order of log record IDs [4]

follows from the commit protocol for write transactions, where only one transaction can write the record with each record ID [4]

⇐ R transactions at the snapshot isolation level create no contention

⇒ any number of R transactions can run concurrently [4]

the immutable data representation in LSTs allows dealing with failures by simply discarding data and metadata files that represent uncommitted changes [2]

similar to how temporary tables are discarded during query processing failures [2]

{feature} resize live workloads [1]

scales resources with the workloads automatically

{feature} deliver predictable performance at scale [1]

scales computational resources based on workloads' needs

{feature} efficiently handle both relational and unstructured data [1]
{feature} flexible, fine-grained task monitoring

a task is the finest grain of execution

{feature} global resource-aware scheduling

enables much better resource utilization and concurrency than traditional DWHs

capable of handling partial query restarts
maintains a global view of multiple queries

it is planned to build on this a global view with autonomous workload management features

{feature} multi-layered data caching model

leverages

SQL Server buffer pools for cashing columnar data
SSD caching

the delta table and its log are are immutable, they can be safely cached on cluster nodes [4]

{feature} tracks data lineage natively

the transaction log can also be used to audit logging based on the commit Info records [4]

{feature} versioning

maintain all versions as data is updated [1]

{feature} time-travel

{benefit} allows users query point-in-time snapshots
{benefit)} allows to roll back erroneous updates to the data.

{feature} table cloning

{benefit} allows to create a point-in-time snapshot of the data based on its metadata

{concept} state

allows to drive the end-to-end life cycle of a SQL statement with transactional guarantees and top tier performance [1]
comprised of

cache
metadata
transaction logs
data

[on-premises architecture] all state is in the compute layer

relies on small, highly stable and homogenous clusters with dedicated hardware for Tier-1 performance
{downside} expensive
{downside} hard to maintain
{downside} limited scalability

cluster capacity is bounded by machine sizes because of the fixed topology

{concept}[stateful architecture]

the state of inflight transactions is stored in the compute node and is not hardened into persistent storage until the transaction commits [1]

⇒ when a compute node fails, the state of non-committed transactions is lost [1]

⇒ the in-flight transactions fail as well [1]

often also couples metadata describing data distributions and mappings to compute nodes [1]

⇒ a compute node effectively owns responsibility for processing a subset of the data [1]

its ownership cannot be transferred without a cluster restart [1]

{downside} resilience to compute node failure and elastic assignment of data to compute are not possible [1]

{concept} stateless compute architecture

requires that compute nodes hold no state information [1]

⇒ all data, transactional logs and metadata need to be externalized [1]

{benefit} allows applications to

partially restart the execution of queries in the event of compute node failures [1]
adapt to online changes of the cluster topology without failing in-flight transactions [1]

caches need to be as close to the compute as possible [1]

since they can be lazily reconstructed from persisted data they don’t necessarily need to be decoupled from compute [1]

the coupling of caches and compute does not make the architecture stateful [1]

{concept} [cloud] decoupling of compute and storage

provides more flexible resource scaling

the 2 layers can scale up and down independently adapting to user needs [1]
customers pay for the compute needed to query a working subset of the data [1]

is not the same as decoupling compute and state [1]

if any of the remaining state held in compute cannot be reconstructed from external services, then compute remains stateful [1]

Previous Post <<||>> Next Post

References:
[1] Josep Aguilar-Saborit et al (2020) POLARIS: The Distributed SQL Engine in Azure Synapse, Proceedings of the VLDB Endowment PVLDB 13(12) (link)
[2] Josep Aguilar-Saborit et al (2024), Extending Polaris to Support Transactions (link)
[3] Advancing Analytics (2021) Azure Synapse Analytics - Polaris Whitepaper Deep-Dive (link)
[4] Michael Armbrust et al (2020) Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proceedings of the VLDB Endowment 13(12) (link)

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:

ADLS - Azure Data Lake Storage

CRUD - Create, Read, Update, Delete

DCP - Distributed Computation Platform

DFS - Distributed File System

DWH - data warehouse

HDFS - Hadoop DFS
R/W - read/write

SI - Semantic Isolation

SSD - Solid-State Drive

SQL Troubles

Pages

31 March 2024

🏭🗒️Microsoft Fabric: Polaris [Notes]

About Me