09 February 2025

🏭🗒️Microsoft Fabric: Sharding [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 9-Feb-2024

 [Microsoft Fabric] Data Partitioning (aka Sharding)

  • {definition} "a process where small chunks of the database are isolated and can be updated independently of other shards" [2]
  • allows a logical database to be partitioned across multiple physical servers [1]
    • each partition is referred to as a shard
    • the largest tables are partitioned across multiple database servers [1]
      • when operating on a record, the application must determine which shard will contain the data and then send the SQL to the appropriate server [1]
    • partitioning is based on a Key Value
      • e.g. such as a user ID
  • proven technique for achieving data processing on a massive scale [1]
    • solution used at the largest websites 
      • e.g. Facebook, Twitter
      • usually associated with rapid growth
        •  ⇒ the approach needs to be dynamic [1]
    • the only way to scale a relational database to massive web use [1]
      • together with caching and replication [1]
  • {drawback} involves significant operational complexities and compromises [1]
    • the application must contain logic that understands the location of any particular piece of data and the logic to route requests to the correct shard [1]
    • requests that can only be satisfied by accessing more than one shard thus need complex coding as well, whereas on a nonsharded database a single SQL statement might suffice.
  • {drawback} high operational costs [1]
  • {drawback} application complexity
    • it’s up to the application code to route SQL requests to the correct shard [1]
      • ⇒ a dynamic routing layer must be implemented
        • ⇐ most massive websites are adding shards as they grow [1]
        • layer required to maintain Memcached object copies and to differentiate between the master database and read-only replicas [1]
  • {drawback} crippled SQL
    • [sharded database] it is not possible to issue a SQL statement that operates across shards [1]
      • ⇒ usually SQL statements are limited to row-level access [1]
      • ⇒ only programmers can query the database as a whole [1]
      • joins across shards cannot be implemented, nor can aggregate GROUP BY operations [1]
  • {drawback} loss of transactional integrity
    • ACID transactions against multiple shards are not possible and/or not practical [1]
      • ⇐ {exception} there are database systems that support 2PC
        • involves considerable troubleshooting as conflicts and bottlenecks can occur [1]
  • {drawback} operational complexity. 
    • load balancing across shards becomes extremely problematic
      • adding new shards requires a complex rebalancing of data [1]
      • changing the database schema requires a rolling operation across all the shards [1]
        • ⇒ can lead to transitory inconsistencies in the schema [1]
    • a sharded database entails a huge amount of operational effort and administrator skill [1]
  • {concept} CAP (Consistency, Availability, and Partition) theorem 
    • in a distributed database system, one can have at most only two of CAP tolerance [1]
    • consistency
      • every user of the database has an identical view of the data at any given instant [1]
    • availability
      • in the event of a failure, the database remains operational [1]
    • partition tolerance
      • the database can maintain operations in the event of the network’s failing between two segments of the distributed system [1]
  • {concept} partitioning
    • {def} core pattern of building scalable services by dividing state (data) and compute into smaller accessible units to improve scalability and performance [5]
      • ⇐ determines that a particular service partition is responsible for a portion of the complete state of the service.
        • a partition is a set of replicas)
    • {type} [stateless services] a logical unit that contains one or more instances of a service [5]
      • partitioning a stateless service is a very rare scenario
      • scalability and availability are normally achieved by adding more instances
      • {subtype} externally persisted state
        • persists its state externally [5]
          • e.g. databases in Azure SQL Database
      • {subtype} computation-only services
        • service that do not manage any persistent state e.g. calculator or image thumbnailing [5]
    • {type} scalable stateful services
      • partition state (data)
      • a partition of a stateful service as a scale unit that is highly reliable through replicas that are distributed and balanced across the nodes in a cluster
      • the state must be accessed and stored
        • ⇒ bound by
          • network bandwidth limits
          • system memory limits
          • disk storage limits
        • {scenario} run into resource constraints in a running cluster
          • {recommendation} scale out the cluster to accommodate the new requirements [4]
  • {concept}distributed systems platform used to build hyper-scalable, reliable and easily managed applications for the cloud [6]
    • ⇐ addresses the significant challenges in developing and managing cloud applications
    • places the partitions on different nodes [5]
      • allows partitions to grow to a node's resource limit
        • ⇐ partitions are rebalances across nodes [5]
      • {benefit} ensures the continued efficient use of hardware resources [5]
    • {default} makes sure that there is about the same number of primary and secondary replicas on each node
      • ⇒ nodes that hold replicas can serve more traffic and others that serve less traffic [5]
      • hot and cold spots may appear in a cluster
          • ⇐ it should be preferably avoided
        • {recommendation} partition the state so is evenly distributed across all partitions [5]
        • {recommendation} report load from each of the replicas for the service [5]
    • provides the capability to report load consumed by services [5]
      • e.g.  amount of memory, number of records
      • detects which partitions server higher loads than others [5]
        • ⇐ based on the metrics reported
      • rebalances the cluster by moving replicas to more suitable nodes, so that overall no node is overloaded  [5]
      • ⇐ it's not always possible to know how much data will be in a given partition
        • {recommendation} adopt a partitioning strategy that spreads the data evenly across the partitions [5]
          • {benefit} prevents situations described in the voting example [5]
        • {recommendation} report load
          • {benefit} helps smooth out temporary differences in access or load over time [5]
    • {recommendation} choose an optimal number of partitions to begin with
      • ⇐ there's nothing that prevents from starting out with a higher number of partitions than anticipated [5]
        • ⇐ assuming the maximum number of partitions is a valid approach [5]
      • ⇒ one may end up needing more partitions than initially considered [5]
        • ⇐ {constraint} the partition count can't be changed after the fact [5]
          •  ⇒ apply more advanced partition approaches
            • e.g. creating a new service instance of the same service type
            • e.g. implement client-side logic that routes the requests to the correct service instance

References:
[1] Guy Harrison (2015) Next Generation Databases: NoSQL, NewSQL, and Big Data
[2] DAMA International (2017) "The DAMA Guide to the Data Management Body of Knowledge" 2nd Ed
[3] Microsoft Fabric (2024) External data sharing in Microsoft Fabric [link]
[4] Microsoft Fabric (2024) Data sharding policy [link]
[5] Microsoft Fabric (2024) Partition Service Fabric reliable services [link]
[6] MSDN (2015) Microsoft Azure - Azure Service Fabric and the Microservices Architecture [link]

Acronyms:
ACID - atomicity, consistency, isolation, durability
2PC - Two Phase Commit
CAP - Consistency, Availability, Partition

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.