
09 February 2025

🏭🗒️Microsoft Fabric: Sharding [Notes]

Disclaimer: This is a work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 9-Feb-2025

 [Microsoft Fabric] Data Partitioning (aka Sharding)

  • {definition} "a process where small chunks of the database are isolated and can be updated independently of other shards" [2]
  • allows a logical database to be partitioned across multiple physical servers [1]
    • each partition is referred to as a shard
    • the largest tables are partitioned across multiple database servers [1]
      • when operating on a record, the application must determine which shard will contain the data and then send the SQL to the appropriate server [1]
    • partitioning is based on a Key Value
      • e.g. a user ID
  • proven technique for achieving data processing on a massive scale [1]
    • solution used at the largest websites 
      • e.g. Facebook, Twitter
      • usually associated with rapid growth
        •  ⇒ the approach needs to be dynamic [1]
    • the only way to scale a relational database to massive web use [1]
      • together with caching and replication [1]
  • {drawback} involves significant operational complexities and compromises [1]
    • the application must contain logic that understands the location of any particular piece of data and the logic to route requests to the correct shard [1]
    • requests that can only be satisfied by accessing more than one shard need complex coding as well, whereas on a nonsharded database a single SQL statement might suffice [1] (see the SQL sketch after this list)
  • {drawback} high operational costs [1]
  • {drawback} application complexity
    • it’s up to the application code to route SQL requests to the correct shard [1]
      • ⇒ a dynamic routing layer must be implemented
        • ⇐ most massive websites are adding shards as they grow [1]
        • the layer is also required to maintain Memcached object copies and to differentiate between the master database and read-only replicas [1]
  • {drawback} crippled SQL
    • [sharded database] it is not possible to issue a SQL statement that operates across shards [1]
      • ⇒ usually SQL statements are limited to row-level access [1]
      • ⇒ only programmers can query the database as a whole [1]
      • joins across shards cannot be implemented, nor can aggregate GROUP BY operations [1]
  • {drawback} loss of transactional integrity
    • ACID transactions against multiple shards are not possible and/or not practical [1]
      • ⇐ {exception} there are database systems that support 2PC
        • involves considerable troubleshooting as conflicts and bottlenecks can occur [1]
  • {drawback} operational complexity. 
    • load balancing across shards becomes extremely problematic
      • adding new shards requires a complex rebalancing of data [1]
      • changing the database schema requires a rolling operation across all the shards [1]
        • ⇒ can lead to transitory inconsistencies in the schema [1]
    • a sharded database entails a huge amount of operational effort and administrator skill [1]
  • {concept} CAP (Consistency, Availability, Partition tolerance) theorem
    • in a distributed database system, one can have at most two of the three properties [1]
    • consistency
      • every user of the database has an identical view of the data at any given instant [1]
    • availability
      • in the event of a failure, the database remains operational [1]
    • partition tolerance
      • the database can maintain operations in the event of the network’s failing between two segments of the distributed system [1]
  • {concept} partitioning
    • {def} core pattern of building scalable services by dividing state (data) and compute into smaller accessible units to improve scalability and performance [5]
      • ⇐ determines that a particular service partition is responsible for a portion of the complete state of the service.
        • a partition is a set of replicas
    • {type} [stateless services] a logical unit that contains one or more instances of a service [5]
      • partitioning a stateless service is a very rare scenario
      • scalability and availability are normally achieved by adding more instances
      • {subtype} externally persisted state
        • persists its state externally [5]
          • e.g. databases in Azure SQL Database
      • {subtype} computation-only services
        • services that do not manage any persistent state, e.g. a calculator or image thumbnailing [5]
    • {type} scalable stateful services
      • partition state (data)
      • a partition of a stateful service is a scale unit that is kept highly reliable through replicas that are distributed and balanced across the nodes in a cluster
      • the state must be accessed and stored
        • ⇒ bound by
          • network bandwidth limits
          • system memory limits
          • disk storage limits
        • {scenario} run into resource constraints in a running cluster
          • {recommendation} scale out the cluster to accommodate the new requirements [4]
  • {concept} [Azure Service Fabric] distributed systems platform used to build hyper-scalable, reliable and easily managed applications for the cloud [6]
    • ⇐ addresses the significant challenges in developing and managing cloud applications
    • places the partitions on different nodes [5]
      • allows partitions to grow to a node's resource limit
        • ⇐ partitions are rebalanced across nodes [5]
      • {benefit} ensures the continued efficient use of hardware resources [5]
    • {default} makes sure that there is about the same number of primary and secondary replicas on each node
      • ⇒ one can end up with nodes that hold replicas that serve more traffic and others that serve less traffic [5]
      • hot and cold spots may appear in a cluster
          • ⇐ this should preferably be avoided
        • {recommendation} partition the state so it is evenly distributed across all partitions [5]
        • {recommendation} report load from each of the replicas for the service [5]
    • provides the capability to report load consumed by services [5]
      • e.g.  amount of memory, number of records
      • detects which partitions serve higher loads than others [5]
        • ⇐ based on the metrics reported
      • rebalances the cluster by moving replicas to more suitable nodes, so that overall no node is overloaded  [5]
      • ⇐ it's not always possible to know how much data will be in a given partition
        • {recommendation} adopt a partitioning strategy that spreads the data evenly across the partitions [5]
          • {benefit} prevents situations described in the voting example [5]
        • {recommendation} report load
          • {benefit} helps smooth out temporary differences in access or load over time [5]
    • {recommendation} choose an optimal number of partitions to begin with
      • ⇐ there's nothing that prevents one from starting out with a higher number of partitions than anticipated [5]
        • ⇐ assuming the maximum number of partitions is a valid approach [5]
      • ⇒ one may end up needing more partitions than initially considered [5]
        • ⇐ {constraint} the partition count can't be changed after the fact [5]
          • ⇒ apply more advanced partitioning approaches
            • e.g. creating a new service instance of the same service type
            • e.g. implement client-side logic that routes the requests to the correct service instance
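
To make the routing and cross-shard limitations above more concrete, the sketch below contrasts a single-statement aggregate on a nonsharded database with what a sharded deployment requires; the users table, the country column and the user_id modulo 2 routing rule are hypothetical and serve only to illustrate the pattern described in [1].

-- nonsharded database: a single statement answers the question
SELECT country, COUNT(*) AS user_count
FROM users
GROUP BY country;

-- sharded by user ID (e.g. shard = user_id % 2): for a single-record access
-- the application must compute the shard and route the statement accordingly,
-- e.g. user_id = 1007 => 1007 % 2 = 1 => send to the server hosting shard 1
SELECT *
FROM users
WHERE user_id = 1007;

-- the aggregate must be sent to each shard separately ...
SELECT country, COUNT(*) AS user_count   -- sent to shard 0
FROM users
GROUP BY country;

SELECT country, COUNT(*) AS user_count   -- sent to shard 1
FROM users
GROUP BY country;

-- ... and the application must sum the partial counts per country itself;
-- joins and GROUP BY operations that span shards can't be expressed in a single SQL statement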

References:
[1] Guy Harrison (2015) Next Generation Databases: NoSQL, NewSQL, and Big Data
[2] DAMA International (2017) "The DAMA Guide to the Data Management Body of Knowledge" 2nd Ed
[3] Microsoft Fabric (2024) External data sharing in Microsoft Fabric [link]
[4] Microsoft Fabric (2024) Data sharding policy [link]
[5] Microsoft Fabric (2024) Partition Service Fabric reliable services [link]
[6] MSDN (2015) Microsoft Azure - Azure Service Fabric and the Microservices Architecture [link]

Acronyms:
ACID - atomicity, consistency, isolation, durability
2PC - Two Phase Commit
CAP - Consistency, Availability, Partition tolerance

08 February 2025

🌌🏭KQL Reloaded: First Steps (Part IX: Translating SQL to KQL - More Joins)

The last post exemplified the use of "explain" to translate queries from SQL to KQL. The current post attempts to test the feature on the various join constructs available in SQL by using the NewSales and Products tables. The post presumes that the reader has a basic understanding of the join types from SQL-based environments.

Inner Join

// inner join in SQL
--
explain
SELECT NewSales.CustomerKey
, NewSales.ProductKey
, Products.ProductName
FROM NewSales 
     JOIN Products 
      ON NewSales.ProductKey = Products.ProductKey 

// inner join in KQL
NewSales
| join kind=inner (
    Products
    | project ProductKey, ProductName 
    )
    on ($left.ProductKey == $right.ProductKey)
| project CustomerKey, ProductKey, ProductName
| limit 10

An inner join is probably the most used type of join, given that fact tables presume the existence of dimensions, even if poor data warehouse design can also lead to exceptions. The join retrieves all the records that match between the two tables, including any duplicates from both sides of the join.

Left Join

// left join in SQL
--
explain
SELECT NewSales.CustomerKey
, NewSales.ProductKey
, Products.ProductName
FROM NewSales 
     LEFT JOIN Products 
      ON NewSales.ProductKey = Products.ProductKey 

// left join in KQL
NewSales
| join kind=leftouter (
    Products
    | project ProductKey
        , Product = ProductName 
    )
    on ($left.ProductKey == $right.ProductKey)
| where isempty(Product) 
| project CustomerKey
    , ProductKey
    , ProductName
| limit 10

A left join retrieves all the records from the left table, typically the fact table, regardless of whether a match was found in the dimension table. One can check whether mismatches exist by retrieving the records where no match was found, as in the query above.

Right Join

// right join in SQL
--
explain
SELECT NewSales.CustomerKey
, Products.ProductKey
, Products.ProductName
FROM NewSales 
     RIGHT JOIN Products 
      ON NewSales.ProductKey = Products.ProductKey 

// right join in KQL
NewSales
| join kind=rightouter (
    Products
    | project DimProductKey = ProductKey
    , DimProductName = ProductName 
    )
    on ($left.ProductKey == $right.DimProductKey)
| where isnull(ProductKey) 
| project CustomerKey
    , DimProductKey
    , DimProductName
| limit 10

A right join retrieves all the records from the dimension together with the matches from the fact table, regardless of whether a match was found in the fact table.

Full Outer Join

// full outer join in SQL
--
explain
SELECT NewSales.CustomerKey
, Coalesce(NewSales.ProductKey, Products.ProductKey) ProductKey
, Coalesce(NewSales.ProductName, Products.ProductName) ProductName
FROM NewSales 
     FULL OUTER JOIN Products 
      ON NewSales.ProductKey = Products.ProductKey 


// full outer join in KQL
NewSales
| join kind=fullouter (
    Products
    | project DimProductKey = ProductKey
    , DimProductName = ProductName 
    )
    on ($left.ProductKey == $right.DimProductKey)
//| where isnull(ProductKey) 
| project CustomerKey
    , ProductKey = coalesce(ProductKey, DimProductKey)
    , ProductName = coalesce(ProductName, DimProductName)
| limit 10

A full outer join retrieves all the data from both sides of the join, regardless of whether a match is found. In RDBMS this type of join performs poorly, especially when further joins are considered or when many records are involved on both sides of the join. Therefore it should be avoided when possible, though in many cases it might be the only feasible solution. There are also alternatives that involve a UNION between a LEFT JOIN and a RIGHT JOIN, the latter retrieving only the records that are not found in the fact table (see the last query from a previous post and the sketch below). This can be a feasible solution when data sharding is involved.
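
As a rough sketch of this alternative, and assuming the same NewSales and Products tables as above, the fullouter join can be approximated by a union between a leftouter join and a rightanti join, the latter returning only the dimension records without a match in the fact table:

// full outer join via union of leftouter and rightanti (sketch)
NewSales
| join kind=leftouter (
    Products
    | project DimProductKey = ProductKey
    , DimProductName = ProductName 
    )
    on ($left.ProductKey == $right.DimProductKey)
| union (
    NewSales
    | join kind=rightanti (
        Products
        | project DimProductKey = ProductKey
        , DimProductName = ProductName 
        )
        on ($left.ProductKey == $right.DimProductKey)
    )
| project CustomerKey
    , ProductKey = coalesce(ProductKey, DimProductKey)
    , ProductName = coalesce(ProductName, DimProductName)
| limit 10

The price of avoiding the fullouter join is that NewSales is read twice, so whether the rewrite pays off depends on the volumes involved.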

Notes:
1) If one ignores the unnecessary logic introduced by the translation via explain, the tool is excellent for learning KQL. It would be interesting to understand why the tool used a certain complex translation over another, especially when there's a performance benefit in the use of a certain piece of code.
2) Also in SQL-based queries it's recommended to start with the fact table, that is, the table having the highest cardinality and/or the lowest level of detail, though the database engine might find an optimal plan regardless of which table is written first.

Happy coding!

Previous Post <<||>> Next Post

