Disclaimer: This is a work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!
Last updated: 31-Mar-2024
Polaris
- {definition} cloud-native analytical query engine over the data lake that follows a stateless micro-service architecture and is designed to execute queries in a scalable, dynamic and fault-tolerant way [1], [2]
- the engine behind the serverless SQL pool [1] and Microsoft Fabric [2]
- petabyte-scale execution [1]
- highly available micro-service architecture
- data and query processing are packaged into units (aka tasks) [1]
- can be readily moved across compute nodes and re-started at the task level [1]
- can run directly over data in HDFS and in managed transactional stores [1]
- [Azure Synapse] designed initially to execute read-only queries [1]
- ⇐ the architecture behind serverless SQL pool
- uses a completely new scale-out framework based on a distributed SQL Server query engine [1]
- fully compatible with T-SQL
- leverages the SQL Server single-node runtime and query optimizer (QO) [1]
- [Microsoft Fabric] extended with a complete transaction manager that executes general CRUD transactions [2]
- incl. updates, deletes and bulk loads [2]
- based on [delta tables] and [delta lake]
- delta lake currently supports transactions only within a single table [4]
- ⇐ the architecture behind lakehouses
- {goal} converge DWH and big data workloads [1]
- the query engine scales out over both relational data and heterogeneous datasets stored in distributed file systems (DFSs) [1]
- needs a clean abstraction over the underlying data type and format, capturing just what’s needed for efficiently parallelizing data processing
- {goal} separate compute and state for cloud-native execution [1]
- all services within a pool are stateless
- data is stored durably in remote storage and is abstracted via data cells [1] (see the sketch below)
- ⇐ data is naturally decoupled from compute nodes
- the metadata and transactional log state is off-loaded to centralized services [1]
- multiple compute pools can transactionally access the same logical database [1]
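A minimal Python sketch of the data cell abstraction from [1]: data is addressed by a hash distribution and a user partition, so a task's input is a set of cells that can be handed to any compute node. All names here are illustrative, not the actual Polaris API.

```python
from dataclasses import dataclass

# Illustrative only: Polaris abstracts data as "cells" addressed by a hash
# distribution i and a user partition j; tasks process sets of cells, so
# cells (not nodes) are the unit of data-to-compute assignment [1].

@dataclass(frozen=True)
class DataCell:
    distribution: int   # hash bucket i over the distribution column
    partition: int      # user/storage partition j (e.g. a date range)
    uri: str            # location of the cell's files in remote storage

def cells_for_task(cells: list[DataCell], distribution: int) -> list[DataCell]:
    """All cells of one hash distribution: the input of a single task,
    which can be re-assigned to any compute node without data movement."""
    return [c for c in cells if c.distribution == distribution]

# Hypothetical layout: 2 distributions x 2 partitions.
cells = [DataCell(i, j, f"abfss://lake/t1/d{i}/p{j}")
         for i in range(2) for j in range(2)]
task_input = cells_for_task(cells, distribution=0)  # movable unit of work
```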
- {goal} cloud-first [2]
- {benefit} leverages elasticity
- transactions need to be resilient to node failures on dynamically changing topologies [2]
- ⇒ the storage engine disaggregates the source of truth for execution state (including data, metadata and transactional state) from compute nodes [2]
- must ensure disaggregation of metadata and transactional state from compute nodes [2]
- ⇐ to ensure that the life span of a transaction is resilient to changes in the backend compute topology [2]
- ⇐ can change dynamically to take advantage of the elastic nature of the cloud or to handle node failures [2]
- {goal} use optimized native columnar, immutable and open storage format [2]
- uses delta format
- ⇐ optimized to handle read-heavy workloads with low contention [2]
- {goal} leverage the full potential of vectorized query processing for SQL [2]
- {goal} support zero-copy data sharing with other services in the lake [2]
- {goal} support read-heavy workloads with low contention [2]
- {goal} support lineage-based features [2]
- by taking advantage of delta table capabilities
- {goal} provide full SQL SI transactional support [2]
- {benefit} all traditional DWH requirements are met [2]
- incl. multi-table and multi-statement transactions [2]
- ⇐ Polaris is the only system that supports this [2]
- the design is optimized for analytics, specifically read- and insert-intensive workloads [2]
- mixes of transactions are supported as well
- {objective} no cross-component state sharing [2]
- {principle} encapsulation of state within each component to avoid sharing state across nodes [2]
- SI and the isolation of state across components allow transactions to be executed as if they were queries [2]
- ⇒ makes read and write transactions indistinguishable [2]
- ⇒ allows the engine to fully leverage its optimized distributed execution framework [2]
- {objective} support Snapshot Isolation (SI) semantics [2]
- implemented over versioned data
- allows reads (R) and writes (W) to proceed concurrently over their own data snapshot
- R/W never conflict, and W/W of active transactions only conflict if they modify the same data [2]
- ⇐ all W transactions are serializable, leading to a serial schedule in increasing order of log record IDs [4]
- follows from the commit protocol for write transactions, where only one transaction can write the record with each record ID [4] (see the commit sketch below)
- ⇐ R transactions at the snapshot isolation level create no contention
- ⇒ any number of R transactions can run concurrently [4]
- the immutable data representation in log-structured tables (LSTs) allows dealing with failures by simply discarding data and metadata files that represent uncommitted changes [2]
- similar to how temporary tables are discarded during query processing failures [2]
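A hedged Python approximation of the optimistic, SI-style commit described in [2] and [4], not Polaris code: writers stage immutable files against their own snapshot and then try to claim the next log record ID. Exactly one writer wins each ID, yielding a serial schedule of writes in increasing log-record order, and W/W conflicts are detected by comparing the data touched by later commits.

```python
import threading

class TransactionLog:
    def __init__(self):
        self._lock = threading.Lock()
        self.records: list[dict] = []         # committed log records

    def try_commit(self, read_version: int, txn: dict) -> bool:
        # The lock stands in for an atomic "put-if-absent" of log
        # record number len(self.records), as in the delta log [4].
        with self._lock:
            later = self.records[read_version:]
            # W/W conflict only if a later commit touched the same data.
            if any(r["files"] & txn["files"] for r in later):
                return False                  # abort: discard staged files
            self.records.append(txn)          # win the next record ID
            return True

log = TransactionLog()
txn = {"files": {"part-0001.parquet"}}        # data this txn (re)writes
ok = log.try_commit(read_version=len(log.records), txn=txn)
```

On failure the staged files are simply never referenced by a committed record, which is why uncommitted changes can be discarded like temporary tables.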
- {feature} resize live workloads [1]
- scales resources with the workloads automatically
- {feature} deliver predictable performance at scale [1]
- scales computational resources based on workloads' needs
- {feature} efficiently handle both relational and unstructured data [1]
- {feature} flexible, fine-grained task monitoring
- a task is the finest grain of execution
- {feature} global resource-aware scheduling
- enables much better resource utilization and concurrency than traditional DWHs
- capable of handling partial query restarts
- maintains a global view of multiple queries
- it is planned to build autonomous workload management features on top of this global view
- {feature} multi-layered data caching model
- leverages
- SQL Server buffer pools for caching columnar data
- SSD caching
- since the delta table and its log are immutable, they can be safely cached on cluster nodes [4] (see the sketch below)
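A toy two-tier cache in the spirit of the model above; the tiers and names are assumptions for illustration, not the actual implementation. The key point is that, because cached delta files are immutable [4], both tiers can be lazily refilled from remote storage without any invalidation logic.

```python
class LayeredCache:
    def __init__(self, fetch_remote):
        self.memory: dict[str, bytes] = {}    # hot tier (cf. buffer pool)
        self.ssd: dict[str, bytes] = {}       # warm tier on local SSD
        self.fetch_remote = fetch_remote      # cold path: remote storage

    def read(self, key: str) -> bytes:
        if key in self.memory:
            return self.memory[key]
        if key in self.ssd:                   # promote SSD hit into memory
            self.memory[key] = self.ssd[key]
            return self.memory[key]
        data = self.fetch_remote(key)         # lazily reconstruct the cache;
        self.ssd[key] = data                  # safe: files are immutable
        self.memory[key] = data
        return data

cache = LayeredCache(fetch_remote=lambda key: b"...column segment bytes...")
segment = cache.read("table_a/part-0001.parquet")   # hypothetical key
```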
- {feature} tracks data lineage natively
- the transaction log can also be used for audit logging based on the commitInfo records [4] (see the sketch below)
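Per [4], the delta log stores one JSON action per line in `_delta_log/<version>.json` files, and the commitInfo action carries operation metadata. A small sketch of scanning it for audit purposes (the table path is hypothetical):

```python
import json
from pathlib import Path

def commit_history(table_path: str):
    """Yield (version, commitInfo) for each commit in the delta log."""
    log_dir = Path(table_path) / "_delta_log"
    for log_file in sorted(log_dir.glob("*.json")):
        for line in log_file.read_text().splitlines():
            action = json.loads(line)
            if "commitInfo" in action:
                yield int(log_file.stem), action["commitInfo"]

for version, info in commit_history("/lake/tables/sales"):
    # Typical commitInfo fields include a timestamp and the operation name.
    print(version, info.get("timestamp"), info.get("operation"))
```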
- {feature} versioning
- maintains all versions as data is updated [1]
- {feature} time-travel
- {benefit} allows users to query point-in-time snapshots
- {benefit} allows rolling back erroneous updates to the data
- {feature} table cloning
- {benefit} allows creating a point-in-time snapshot of the data based on its metadata (see the example below)
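A hedged example of time travel using the open Delta Lake API [4] via PySpark; the table path, version, and timestamp are hypothetical, and a Spark session configured with delta support is assumed rather than any Fabric-specific API.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the table as of an earlier log version (time travel).
df_v3 = (spark.read.format("delta")
         .option("versionAsOf", 3)
         .load("/lake/tables/sales"))

# ...or as of a wall-clock timestamp, e.g. to inspect data before a bad update.
df_then = (spark.read.format("delta")
           .option("timestampAsOf", "2024-03-01 00:00:00")
           .load("/lake/tables/sales"))

# A pinned snapshot like this is also the essence of metadata-only cloning:
# a clone references the same immutable files instead of copying them.
df_v3.createOrReplaceTempView("sales_as_of_v3")
```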
- {concept} state
- allows driving the end-to-end life cycle of a SQL statement with transactional guarantees and top-tier performance [1]
- comprised of
- cache
- metadata
- transaction logs
- data
- [on-premises architecture] all state is in the compute layer
- relies on small, highly stable and homogeneous clusters with dedicated hardware for Tier-1 performance
- {downside} expensive
- {downside} hard to maintain
- {downside} limited scalability
- cluster capacity is bounded by machine sizes because of the fixed topology
- {concept} [stateful architecture]
- the state of inflight transactions is stored in the compute node and is not hardened into persistent storage until the transaction commits [1]
- ⇒ when a compute node fails, the state of non-committed transactions is lost [1]
- ⇒ the in-flight transactions fail as well [1]
- often also couples metadata describing data distributions and mappings to compute nodes [1]
- ⇒ a compute node effectively owns responsibility for processing a subset of the data [1]
- its ownership cannot be transferred without a cluster restart [1]
- {downside} resilience to compute node failure and elastic assignment of data to compute are not possible [1]
- {concept} [stateless compute architecture]
- requires that compute nodes hold no state information [1]
- ⇒ all data, transactional logs and metadata need to be externalized [1]
- {benefit} allows applications to
- partially restart the execution of queries in the event of compute node failures [1] (see the sketch below)
- adapt to online changes of the cluster topology without failing in-flight transactions [1]
- caches need to be as close to the compute as possible [1]
- since they can be lazily reconstructed from persisted data, they don't necessarily need to be decoupled from compute [1]
- the coupling of caches and compute does not make the architecture stateful [1]
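A minimal sketch of why statelessness enables partial restarts: if a task's inputs live in external storage and its output is idempotent, the scheduler can simply re-run a failed task on another node instead of failing the in-flight query. Everything below is illustrative structure, not the Polaris scheduler.

```python
import random

def run_task(task_id: int, cells: list[str]) -> str:
    """Reads externalized input cells; produces an idempotent result."""
    if random.random() < 0.2:                 # simulated node failure
        raise RuntimeError(f"node running task {task_id} failed")
    return f"result-of-{task_id}"

def run_query(tasks: dict[int, list[str]], max_retries: int = 5) -> dict[int, str]:
    results: dict[int, str] = {}
    for task_id, cells in tasks.items():
        for _ in range(max_retries):
            try:
                results[task_id] = run_task(task_id, cells)
                break                         # restart at task granularity,
            except RuntimeError:              # not query granularity
                continue                      # reschedule on another node
        else:
            raise RuntimeError(f"task {task_id} exhausted its retries")
    return results

print(run_query({0: ["cell-0-0"], 1: ["cell-1-0"]}))
```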
- {concept} [cloud] decoupling of compute and storage
- provides more flexible resource scaling
- the two layers can scale up and down independently, adapting to user needs [1]
- customers pay for the compute needed to query a working subset of the data [1]
- is not the same as decoupling compute and state [1]
- if any of the remaining state held in compute cannot be reconstructed from external services, then compute remains stateful [1]
References:
[1] Josep Aguilar-Saborit et al. (2020) POLARIS: The Distributed SQL Engine in Azure Synapse, Proceedings of the VLDB Endowment 13(12) (link)
[2] Josep Aguilar-Saborit et al. (2024) Extending Polaris to Support Transactions (link)
[3] Advancing Analytics (2021) Azure Synapse Analytics - Polaris Whitepaper Deep-Dive (link)
[4] Michael Armbrust et al. (2020) Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proceedings of the VLDB Endowment 13(12) (link)