Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!
Last updated: 9-Feb-2024
[Microsoft Fabric] Data Partitioning (aka Sharding)
- {definition} "a process where small chunks of the database are isolated and can be updated independently of other shards" [2]
- allows a logical database to be partitioned across multiple physical servers [1]
- each partition is referred to as a shard
- the largest tables are partitioned across multiple database servers [1]
- when operating on a record, the application must determine which shard will contain the data and then send the SQL to the appropriate server [1]
- partitioning is based on a Key Value
- e.g. a user ID
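As an illustration of key-based shard selection, a minimal routing helper can hash the key and take the result modulo the shard count. The host names and scheme below are invented for this sketch, not taken from the cited sources:

```python
# Hypothetical key-based shard routing: map a user ID to the database
# server that holds its shard. Host names are illustrative assumptions.
import hashlib

SHARD_HOSTS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(user_id: str) -> str:
    """Return the host holding the shard for the given user ID."""
    # Use a stable hash (not Python's built-in hash(), which is
    # randomized per process) so routing survives application restarts.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return SHARD_HOSTS[int(digest, 16) % len(SHARD_HOSTS)]

# The application then sends its SQL to the returned host.
host = shard_for("user-42")
```

The stable hash matters: Python's built-in `hash()` changes between interpreter runs, which would silently reroute every key.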
- proven technique for achieving data processing on a massive scale [1]
- solution used at the largest websites
- e.g. Facebook, Twitter
- usually associated with rapid growth
- ⇒ the approach needs to be dynamic [1]
- the only way to scale a relational database to massive web use [1]
- together with caching and replication [1]
- {drawback} involves significant operational complexities and compromises [1]
- the application must contain logic that understands the location of any particular piece of data and the logic to route requests to the correct shard [1]
- requests that can only be satisfied by accessing more than one shard need complex coding as well, whereas on a nonsharded database a single SQL statement might suffice [1]
- {drawback} high operational costs [1]
- {drawback} application complexity
- it’s up to the application code to route SQL requests to the correct shard [1]
- ⇒ a dynamic routing layer must be implemented
- ⇐ most massive websites are adding shards as they grow [1]
- layer required to maintain Memcached object copies and to differentiate between the master database and read-only replicas [1]
- {drawback} crippled SQL
- [sharded database] it is not possible to issue a SQL statement that operates across shards [1]
- ⇒ usually SQL statements are limited to row-level access [1]
- ⇒ only programmers, via application code, can query the database as a whole [1]
- joins across shards cannot be implemented, nor can aggregate GROUP BY operations [1]
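Because no single SQL statement can span shards, a cross-shard GROUP BY has to be assembled in application code: fan the query out to every shard and merge the partial results. A toy sketch, with shards stubbed as in-memory lists:

```python
# Illustrative scatter-gather, emulating
#   SELECT country, COUNT(*) FROM users GROUP BY country
# across shards in application code. Shards are stubbed as lists of rows.
from collections import Counter

shards = [
    [{"user": 1, "country": "US"}, {"user": 2, "country": "DE"}],
    [{"user": 3, "country": "US"}],
]

def group_count_by_country(shards):
    total = Counter()
    for shard in shards:                                  # one query per shard
        partial = Counter(row["country"] for row in shard)
        total.update(partial)                             # merge partial aggregates
    return dict(total)

# group_count_by_country(shards) returns {"US": 2, "DE": 1}
```

On a nonsharded database this is one SQL statement; here it is bespoke application logic that must also handle per-shard failures and retries.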
- {drawback} loss of transactional integrity
- ACID transactions against multiple shards are not possible and/or not practical [1]
- ⇐ {exception} there are database systems that support 2PC (two-phase commit)
- involves considerable troubleshooting as conflicts and bottlenecks can occur [1]
- {drawback} operational complexity
- load balancing across shards becomes extremely problematic
- adding new shards requires a complex rebalancing of data [1]
- changing the database schema requires a rolling operation across all the shards [1]
- ⇒ can lead to transitory inconsistencies in the schema [1]
- a sharded database entails a huge amount of operational effort and administrator skill [1]
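The rebalancing cost can be made concrete: under naive modulo routing, growing from 4 to 5 shards remaps roughly 80% of all keys, which is why adding shards is so operationally expensive. A small sketch (the numbers are illustrative):

```python
# Why adding a shard forces a large rebalance under modulo routing:
# count how many keys land on a different shard when the shard count
# grows from old_n to new_n.
def moved_fraction(n_keys: int, old_n: int, new_n: int) -> float:
    moved = sum(1 for k in range(n_keys) if k % old_n != k % new_n)
    return moved / n_keys

# With 10,000 keys, 80% change shards when going from 4 to 5 shards.
frac = moved_fraction(10_000, 4, 5)
```

Schemes such as consistent hashing exist precisely to shrink this moved fraction, at the cost of extra routing machinery.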
- {concept} CAP (Consistency, Availability, Partition tolerance) theorem
- in a distributed database system, one can have at most two of consistency, availability, and partition tolerance [1]
- consistency
- every user of the database has an identical view of the data at any given instant [1]
- availability
- in the event of a failure, the database remains operational [1]
- partition tolerance
- the database can maintain operations in the event of the network’s failing between two segments of the distributed system [1]
- {concept} partitioning
- {def} core pattern of building scalable services by dividing state (data) and compute into smaller accessible units to improve scalability and performance [5]
- ⇐ determines that a particular service partition is responsible for a portion of the complete state of the service.
- a partition is a set of replicas
- {type} [stateless services] a logical unit that contains one or more instances of a service [5]
- partitioning a stateless service is a very rare scenario
- scalability and availability are normally achieved by adding more instances
- {subtype} externally persisted state
- persists its state externally [5]
- e.g. databases in Azure SQL Database
- {subtype} computation-only services
- services that do not manage any persistent state, e.g. a calculator or image thumbnailing service [5]
- {type} scalable stateful services
- partition state (data)
- a partition of a stateful service is a scale unit that is kept highly reliable through replicas distributed and balanced across the nodes in a cluster [5]
- the state must be accessed and stored
- ⇒ bound by
- network bandwidth limits
- system memory limits
- disk storage limits
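A back-of-envelope way to see these bounds: if each partition's state must fit within one node's resource budget, the minimum partition count follows by division. The figures below are assumptions for illustration, not limits from the documentation:

```python
# Sizing sketch with assumed numbers: total state of 500 GB and a
# per-node budget of 64 GB imply at least ceil(500/64) = 8 partitions.
import math

def min_partitions(total_state_gb: float, per_node_budget_gb: float) -> int:
    """Smallest partition count so each partition fits one node's budget."""
    return math.ceil(total_state_gb / per_node_budget_gb)

needed = min_partitions(500, 64)  # 8
```

The same division applies to memory and network budgets; the binding constraint is whichever resource yields the largest count.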
- {scenario} run into resource constraints in a running cluster
- {recommendation} scale out the cluster to accommodate the new requirements [4]
- {concept} [Azure Service Fabric] distributed systems platform used to build hyper-scalable, reliable and easily managed applications for the cloud [6]
- ⇐ addresses the significant challenges in developing and managing cloud applications
- places the partitions on different nodes [5]
- allows partitions to grow to a node's resource limit
- ⇐ partitions are rebalanced across nodes [5]
- {benefit} ensures the continued efficient use of hardware resources [5]
- {default} makes sure that there is about the same number of primary and secondary replicas on each node
- ⇒ some nodes end up holding replicas that serve more traffic, while others hold replicas that serve less [5]
- hot and cold spots may appear in a cluster
- ⇐ this should preferably be avoided
- {recommendation} partition the state so it is evenly distributed across all partitions [5]
- {recommendation} report load from each of the replicas for the service [5]
- provides the capability to report load consumed by services [5]
- e.g. amount of memory, number of records
- detects which partitions serve higher loads than others [5]
- ⇐ based on the metrics reported
- rebalances the cluster by moving replicas to more suitable nodes, so that overall no node is overloaded [5]
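The mechanism can be sketched in miniature: replicas report a load metric, and a balancer moves a replica from the hottest node toward the coldest. In Service Fabric this is the job of the Cluster Resource Manager; the function, data shapes, and policy below are invented purely for illustration:

```python
# Toy metric-based rebalancing sketch (NOT the Service Fabric API):
# each node maps replica -> reported load; move one replica from the
# most-loaded node to the least-loaded node.
def rebalance_once(nodes: dict) -> dict:
    load = {n: sum(r.values()) for n, r in nodes.items()}
    hot = max(load, key=load.get)
    cold = min(load, key=load.get)
    if hot == cold or not nodes[hot]:
        return nodes
    # move the lightest replica so the hot node is relieved without
    # immediately overloading the cold node
    replica = min(nodes[hot], key=nodes[hot].get)
    nodes[cold][replica] = nodes[hot].pop(replica)
    return nodes

cluster = {"node-A": {"p1": 90, "p2": 40}, "node-B": {"p3": 10}}
rebalance_once(cluster)  # p2 (load 40) moves from node-A to node-B
```

Real placement also weighs multiple metrics, capacities, and placement constraints, which is why reporting representative loads (memory, record counts) matters.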
- ⇐ it's not always possible to know how much data will be in a given partition
- {recommendation} adopt a partitioning strategy that spreads the data evenly across the partitions [5]
- {benefit} prevents situations described in the voting example [5]
- {recommendation} report load
- {benefit} helps smooth out temporary differences in access or load over time [5]
- {recommendation} choose an optimal number of partitions to begin with
- ⇐ there's nothing that prevents one from starting out with a higher number of partitions than anticipated [5]
- ⇐ assuming the maximum number of partitions is a valid approach [5]
- ⇒ one may end up needing more partitions than initially considered [5]
- ⇐ {constraint} the partition count can't be changed after the fact [5]
- ⇒ apply more advanced partitioning approaches
- e.g. creating a new service instance of the same service type
- e.g. implement client-side logic that routes the requests to the correct service instance
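Such client-side routing might look like the following sketch, where a fixed per-instance partition count is extended by standing up a second instance of the same service type. The service URIs and counts are assumptions invented for the example:

```python
# Hypothetical client-side routing across two service instances when one
# instance's fixed partition count is exhausted. URIs/counts are assumed.
PARTITIONS_PER_INSTANCE = 8
INSTANCES = ["fabric:/app/svc-0", "fabric:/app/svc-1"]  # illustrative URIs

def route(key_hash: int):
    """Map a key hash to a (service instance, partition id) pair."""
    slot = key_hash % (len(INSTANCES) * PARTITIONS_PER_INSTANCE)
    return INSTANCES[slot // PARTITIONS_PER_INSTANCE], slot % PARTITIONS_PER_INSTANCE

# route(3)  -> ("fabric:/app/svc-0", 3)
# route(11) -> ("fabric:/app/svc-1", 3)
```

Adding instances this way re-introduces the same rebalancing problem described for sharded databases above, so the total slot count should also be chosen generously up front.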
[4] Microsoft Learn (2024) Data sharding policy [link]
[5] Microsoft Learn (2024) Partition Service Fabric reliable services [link]
[6] MSDN (2015) Microsoft Azure - Azure Service Fabric and the Microservices Architecture [link]