SQL Troubles: partitions

Showing posts with label partitions. Show all posts

14 February 2025

🏭🧊🗒️Microsoft Fabric: Partitions in Lakehouses [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 14-Feb-2024

[Microsoft Fabric] Partitions

{def} a data organization technique used to split a large dataset into smaller, more manageable nonoverlapping subsets (aka partitions, shards)

a pattition is defined based on one or more fields
each partition contains a subset of the data
each partitions can be stored and processed independently

{goal} improve performance, scalability, and manageability of large data tables
{benefit} allows to split large tables into smaller, manageable partitions based on specific criteria [2]

e.g., date ranges, regions, categories, entities

{benefit} allows to improve queries' performance as they can target specific partitions [2]

reduces the amount of data scanned [2]
improves queries' performance [2]

{benefit} allows for more efficient data loading [2]
{benefit} facilitates the management of big tables [2]

maintenance tasks can be performed on individual partitions [2]
obsolete data partitions can be removed with no overhead, adding new partitions on a need basis [2]

applies to

backups
indexing
allows optimizing query performance for specific subsets of data
statistics

performance can be affected by

the choice of partition columns for a delta table [1]
the number and size of partitions of the partition column [1]
a column with high cardinality (mostly or entirely made of unique values) results in a large number of partitions [1]

⇐ negatively impacts performance of the metadata discovery scan for changes [1]
{recommendation} if the cardinality of a column is high, choose another column for partitioning [1]

the size of each partition can also affect performance

{recommendation} use a column that would result in a partition of at least (or close to) 1 GB [1]
{recommendation} follow the best practices for delta tables maintenance [1]
a large volume of small-sized parquet files increases the time it takes to sync the changes [1]

⇒ leads to large number of parquet files in a delta table due to [1]

over-partitioning

partitions with high number of unique values [1]
{recommendation} choose a partition column that

doesn't have a high cardinality [1]
results in individual partition size of at least 1 GB [1]

many small files

batch and streaming data ingestion rates might result in small files

depends on the frequency and size of changes being written to a lakehouse [1]

{recommendation} implement regular lakehouse table maintenance [1]

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Microsoft Fabric: SQL analytics endpoint performance considerations [link]

[2] Kenneth A Omorodion (2024) Partitioning Data in Microsoft Fabric to Improve Performance

written [link]

[3] Microsoft Learn (2024) Microsoft Fabric: Loading Fabric Lakehouse Tables with partitions [link]

[4]

Resources

[R1] Microsoft Learn (2024) Microsoft Fabric: Load data to Lakehouse using partition in a Data pipeline [link]

[R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

09 February 2025

🏭🗒️Microsoft Fabric: Sharding [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 9-Feb-2024

[Microsoft Fabric] Data Partitioning (aka Sharding)

{definition} "a process where small chunks of the database are isolated and can be updated independently of other shards" [2]
allows a logical database to be partitioned across multiple physical servers [1]

each partition is referred to as a shard
the largest tables are partitioned across multiple database servers [1]

when operating on a record, the application must determine which shard will contain the data and then send the SQL to the appropriate server [1]

partitioning is based on a Key Value

e.g. such as a user ID

proven technique for achieving data processing on a massive scale [1]

solution used at the largest websites

e.g. Facebook, Twitter
usually associated with rapid growth

⇒ the approach needs to be dynamic [1]

the only way to scale a relational database to massive web use [1]

together with caching and replication [1]

{drawback} involves significant operational complexities and compromises [1]

the application must contain logic that understands the location of any particular piece of data and the logic to route requests to the correct shard [1]
requests that can only be satisfied by accessing more than one shard thus need complex coding as well, whereas on a nonsharded database a single SQL statement might suffice.

{drawback} high operational costs [1]
{drawback} application complexity

it’s up to the application code to route SQL requests to the correct shard [1]

⇒ a dynamic routing layer must be implemented

⇐ most massive websites are adding shards as they grow [1]
layer required to maintain Memcached object copies and to differentiate between the master database and read-only replicas [1]

{drawback} crippled SQL

[sharded database] it is not possible to issue a SQL statement that operates across shards [1]

⇒ usually SQL statements are limited to row-level access [1]
⇒ only programmers can query the database as a whole [1]
joins across shards cannot be implemented, nor can aggregate GROUP BY operations [1]

{drawback} loss of transactional integrity

ACID transactions against multiple shards are not possible and/or not practical [1]

⇐ {exception} there are database systems that support 2PC

involves considerable troubleshooting as conflicts and bottlenecks can occur [1]

{drawback} operational complexity.

load balancing across shards becomes extremely problematic

adding new shards requires a complex rebalancing of data [1]
changing the database schema requires a rolling operation across all the shards [1]

⇒ can lead to transitory inconsistencies in the schema [1]

a sharded database entails a huge amount of operational effort and administrator skill [1]

{concept} CAP (Consistency, Availability, and Partition) theorem

in a distributed database system, one can have at most only two of CAP tolerance [1]
consistency

every user of the database has an identical view of the data at any given instant [1]

availability

in the event of a failure, the database remains operational [1]

partition tolerance

the database can maintain operations in the event of the network’s failing between two segments of the distributed system [1]

{concept} partitioning

{def} core pattern of building scalable services by dividing state (data) and compute into smaller accessible units to improve scalability and performance [5]

⇐ determines that a particular service partition is responsible for a portion of the complete state of the service.

a partition is a set of replicas)

{type} [stateless services] a logical unit that contains one or more instances of a service [5]

partitioning a stateless service is a very rare scenario
scalability and availability are normally achieved by adding more instances
{subtype} externally persisted state

persists its state externally [5]

e.g. databases in Azure SQL Database

{subtype} computation-only services

service that do not manage any persistent state e.g. calculator or image thumbnailing [5]

{type} scalable stateful services

partition state (data)
a partition of a stateful service as a scale unit that is highly reliable through replicas that are distributed and balanced across the nodes in a cluster
the state must be accessed and stored

⇒ bound by

network bandwidth limits
system memory limits
disk storage limits

{scenario} run into resource constraints in a running cluster

{recommendation} scale out the cluster to accommodate the new requirements [4]

{concept}distributed systems platform used to build hyper-scalable, reliable and easily managed applications for the cloud [6]

⇐ addresses the significant challenges in developing and managing cloud applications
places the partitions on different nodes [5]

allows partitions to grow to a node's resource limit

⇐ partitions are rebalances across nodes [5]

{benefit} ensures the continued efficient use of hardware resources [5]

{default} makes sure that there is about the same number of primary and secondary replicas on each node

⇒ nodes that hold replicas can serve more traffic and others that serve less traffic [5]
hot and cold spots may appear in a cluster

⇐ it should be preferably avoided

{recommendation} partition the state so is evenly distributed across all partitions [5]
{recommendation} report load from each of the replicas for the service [5]

provides the capability to report load consumed by services [5]

e.g. amount of memory, number of records
detects which partitions server higher loads than others [5]

⇐ based on the metrics reported

rebalances the cluster by moving replicas to more suitable nodes, so that overall no node is overloaded [5]
⇐ it's not always possible to know how much data will be in a given partition

{recommendation} adopt a partitioning strategy that spreads the data evenly across the partitions [5]

{benefit} prevents situations described in the voting example [5]

{recommendation} report load

{benefit} helps smooth out temporary differences in access or load over time [5]

{recommendation} choose an optimal number of partitions to begin with

⇐ there's nothing that prevents from starting out with a higher number of partitions than anticipated [5]

⇐ assuming the maximum number of partitions is a valid approach [5]

⇒ one may end up needing more partitions than initially considered [5]

⇐ {constraint} the partition count can't be changed after the fact [5]

⇒ apply more advanced partition approaches

e.g. creating a new service instance of the same service type
e.g. implement client-side logic that routes the requests to the correct service instance

Previous Post <<||>> Next Post

References:

[1] Guy Harrison (2015) Next Generation Databases: NoSQL, NewSQL, and Big Data

[2] DAMA International (2017) "The DAMA Guide to the Data Management Body of Knowledge" 2nd Ed

[3] Microsoft Fabric (2024) External data sharing in Microsoft Fabric [link]
[4] Microsoft Fabric (2024) Data sharding policy [link]
[5] Microsoft Fabric (2024) Partition Service Fabric reliable services [link]
[6] MSDN (2015) Microsoft Azure - Azure Service Fabric and the Microservices Architecture [link]

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:

ACID - atomicity, consistency, isolation, durability

2PC - Two Phase Commit

CAP - Consistency, Availability, Partition

25 December 2024

💎🏭SQL Reloaded: Number of Records IV (via sys.partitions DMV)

To get the exact number of records in a table one can use the COUNT (see post) or the more recent COUNT_BIG function, though for big tables this can be an inefficient operation for the database engine:

-- number of records via COUNT
SELECT count(*) row_count
FROM SalesLT.Product

Moreover, sometimes the operation needs to be repeated for a number of tables, e.g. dropdown tables in Dynamics 365 for Finance and Operations (D365 F&O). Writing the query as a UNION allows to export the data as a single table and do comparisons (e.g. in Excel). The same approach can be used also when multiple columns are used for grouping, though one must account for the additional columns in the other subqueries. However, the more tables are involved, the more difficult it becomes to maintain the query over time.

-- number of records via COUNT for multiple tables
SELECT 'SalesLT.Product' table_name
, count(*) row_count
FROM SalesLT.Product
UNION ALL
SELECT 'SalesLT.ProductDescription' table_name
, count(*) row_count
FROM SalesLT.ProductDescription
UNION ALL
SELECT 'SalesLT.ProductModel' table_name
, count(*) row_count
FROM SalesLT.ProductModel

There are many scenarios in which it's needed to get an approximate of the number of records available in a table and doing a record count might prove to be too expensive. For a quick and dirty solution one can use the sys.partitions DMV instead:

-- number of records via DMV for single object
SELECT object_id
, OBJECT_NAME(object_id) object_name
, OBJECT_SCHEMA_NAME(object_id) schema_name
, SUM(Rows) AS row_count
, data_compression_desc AS compression_type
, COUNT(*) partitions_count
FROM sys.partitions 
WHERE index_id < 2 --ignore the partitions from the non-clustered index if any
  AND OBJECT_ID('SalesLT.Product') = object_id
GROUP BY object_id
, data_compression_desc
ORDER BY row_count DESC;

The query is based on sys.partitions table [1] which contains a row for each partition of all the tables and most types of indexes in the database. The documentation mentions that "rows" indicates the approximate number of rows in the considered partition.

Alternatively, one can bring more tables into the query to extend its range of applicability.

-- number of records via DMVs
SELECT S.name + '.' + T.name SearchName
, S.Name SchemaName
, T.name TableName
, P.row_count
, P.compression_type
, P.partitions_count
FROM sys.tables T
     LEFT JOIN (
        SELECT object_id
        , SUM(Rows) AS row_count
        , data_compression_desc AS compression_type
        , COUNT(*) partitions_count
        FROM sys.partitions 
        WHERE index_id < 2 --ignore the partitions from the non-clustered index if any
        --AND OBJECT_ID('SalesLT.Product') = object_id
        GROUP BY object_id
        , data_compression_desc
     ) P
    ON T.object_id = P.object_id
     JOIN sys.schemas as S
	   on S.schema_id = T.schema_id
WHERE S.Name = 'SalesLT'
  AND T.Name LIKE 'Product%'
ORDER BY row_count DESC;

The data can be exported regularly to give an idea how tables' cardinality changes over time. One can find this useful as part of the loading process in data warehouses or other solutions (e.g. data migrations).

By using a FULL JOIN instead of a LEFT JOIN one can retrieve only the tables that have records.

One should consider only the tables in scope, and eventually remove the records associated with the system objects (e.g. sys or information_schema upon case).

 -- constraints to be added in the WHERE clause to remove the records related to system objects
 AND OBJECT_NAME(object_id) NOT LIKE 'sys%'
 AND OBJECT_NAME(object_id) NOT LIKE 'queue_%' 
 AND OBJECT_NAME(object_id) NOT LIKE 'filestream_tombstone%'

There are also scenarios in which the count is needed only for a subset of the data. It's the case of D365 F&O (in which the number of records is needed by DataAreaId (aka company) or another field. A solution can be built using the sp_MSForEachTable stored procedure (see the last query from this post) and a cursor.

Notes:
1) Unfortunately, in Microsoft Fabric warehouses the sys.partitions.rows is 0 for all user tables and currently also the sp_MSForEachTable stored procedure can't be used to retrieve the number of records for all tables. However, one can create an old-fashioned cursor for iterating though the collection of tables in scope.
2) The code used in this post is available also in the GitHub repository.

Happy coding and Merry Christmas!

Previous Post <<||>> Next Post

Resources:
[1] Microsoft Learn (2024) sys.partitions (Transact-SQL) [link]
[2] Microsoft Learn (2024) COUNT_BIG (Transact-SQL) [link]

20 April 2024

⚡️🗒️Power BI: Visual Calculations [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 20-Apr-2024

[feature] Visual Calculations (aka Visual Calcs)

{definition} a type of DAX calculation that's defined and executed directly on a visual [1]
{benefit} make it easier to create calculations that were previously hard to create

leads to simpler DAX, easier maintenance, and better performance [1]
reuse the results from its components [2]

still in public preview

⇒ must be enabled manually

are stored on the visual

⇒ can refer to any data in the visual

incl. columns, measures, or other visual calculations [1]

⇒ anything in the model must be added to the visual before the visual calculation can refer to it [1]
⇒ they can refer to the visual structure

⇒ leads to more flexibility

combine the simplicity of context from calculated columns with the on-demand calculation flexibility from measures [1]

⇐ the context is "visible"

operate on aggregated data instead of the detail level [1]

⇒ leads to performance benefits

introduce a set of functions specific to visual calculations [1]

{category} medium-level functions

{function} COLLAPSE

the calculation is evaluated at a higher level of the axis [1]

{function} COLLAPSEALL

the calculation is evaluated at the total level of the axis [1]

{function} EXPAND

the calculation is evaluated at a lower level of the axis [1]

{function}EXPANDALL

the calculation is evaluated at the leaf level of the axis [1]

{function} FIRST

refers to the first row of an axis [1]

{function} ISATLEVEL

reports whether a specified column is present at the current level.

{function} LAST

refers to the last row of an axis [1]

{function} NEXT

refers to a next row of an axis [1]

{function} PREVIOUS

refers to a previous row of an axis [1]

{function} RANGE

refers to a slice of rows of an axis [1]

{category} high-level functions

{function} MOVINGAVERAGE

adds a moving average on an axis [1]

{function} RUNNINGSUM

adds a running sum on an axis [1]

{category} low-level functions

⇐ {exception} are available in standard DAX
{function} INDEX
{function} OFFSET
{function} RANK
{function} ROWNUMBER
{function}WINDOW
{function} ORDERBY
{function} PARTITIONBY
{function} MATCHBY

{default} most of them are evaluated row-by-row [1]

⇐ like a calculated column
there's no need to add an aggregation function [1]

it's better not to add such aggregates when they're not necessary [1]

{operation} create calculation

adds the visual calculation to the visual

{operation} hide calculation

calculations that aren't needed in the visual can be hidden [2]

{operation} copy calculation

copies the calculation between visuals and if intermediary steps are not there, they will be copied as well [planned] [2]

{feature} templates

ready available calculation constructs
{benefit} make it easier to write common calculations [1]

{feature|planned} support for Scanner API [2]
{feature} explore

new experience that allows to explore data in a focused way [4]
allows adding visual calculations to visuals [4]

{feature} parameter pickers

allows to create visual calculations faster by picking parameters [3]
{limitation} only available for required parameters on functions that are exclusive to visual calculations (and select other functions) that have a defined list of options [3]

required parameters that can take any text, or numerical value will not get a parameter picker, and neither will many DAX functions [3]

{parameter} axis

influences how the visual calculation traverses the visual matrix [1]
{default} set to the first axis in the visual
{value} ROWS

the visual calculation is evaluated row-by-row in the visual matrix, from top to bottom. [1]

{value} COLUMNS

the visual calculation is evaluated row-by-row in the visual matrix, from left to right [1]

{value} ROWS COLUMNS

calculates vertically across rows from top to bottom, continuing column by column from left to right [1]

{value} COLUMNS ROWS

calculates horizontally across columns from left to right, continuing row by row from top to bottom [1]

{warning} not all visuals provide all axes, and some visuals provide no axes [1]

references to a non-existent or invalid axis is permissible and will be ignored [3]

{parameter} reset

influences if and when the function resets its value to 0 or switches to a different scope while traversing the visual matrix [1]
expects there to be multiple levels on the axis [1]

⇐ use PARTITIONBY if there's only one level on the axis [1]

{value|default} NONE

means the visual calculation is never restarted [1]

{value} HIGHESTPARENT

resets the calculation when the value of the highest parent on the axis changes [1]

{value} LOWESTPARENT

resets the calculations when the value of the lowest parent on the axis changes [1]

{value} numerical value

refers to the fields on the axis, with the highest field being one [1]

{limitation} functions that rely on model relationships aren't available

e.g. USERELATIONSHIP, RELATED or RELATEDTABLE

{limitation} not all visual types are supported [1]

⇐ for the full list of limitations see [1]

{limitation} one can't filter on visual calculations [1]
{limitation} underlying data can't be exported [1]
{limitation} don't support conditional formatting

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Power BI: Using visual calculations [preview] [link]
[2] SSBI Central (2024) Visual Calculations - Making DAX easier, with Jeroen ter Heerdt [link]
[3] Microsoft Power BI Updates (2025) Power BI June 2025 Feature Summary [link]
[4] Microsoft Learn (2025) Power BI: Use Explore (preview) in the Power BI service [link]

17 March 2024

🧭Business Intelligence: Data Products (Part II: The Complexity Challenge)

Business Intelligence Series

Creating data products within a data mesh resumes in "partitioning" a given set of inputs, outputs and transformations to create something that looks like a Lego structure, in which each Lego piece represents a data product. The word partition is improperly used as there can be overlapping in terms of inputs, outputs and transformations, though in an ideal solution the outcome should be close to a partition.

If the complexity of inputs and outputs can be neglected, even if their number could amount to a big number, not the same can be said about the transformations that must be performed in the process. Moreover, the transformations involve reengineering the logic built in the source systems, which is not a trivial task and must involve adequate testing. The transformations are a must and there's no way to avoid them.

When designing a data warehouse or data mart one of the goals is to keep the redundancy of the transformations and of the intermediary results to a minimum to minimize the unnecessary duplication of code and data. Code duplication becomes usually an issue when the logic needs to be changed, and in business contexts that can happen often enough to create other challenges. Data duplication becomes an issue when they are not in synch, fact derived from code not synchronized or with different refresh rates.

Building the transformations as SQL-based database objects has its advantages. There were many attempts for providing non-SQL operators for the same (in SSIS, Power Query) though the solutions built based on them are difficult to troubleshoot and maintain, the overall complexity increasing with the volume of transformations that must be performed. In data mashes, the complexity increases also with the number of data products involved, especially when there are multiple stakeholders and different goals involved (see the challenges for developing data marts supposed to be domain-specific).

To growing complexity organizations answer with complexity. On one side the teams of developers, business users and other members of the governance teams who together with the solution create an ecosystem. On the other side, the inherent coordination and organization meetings, managing proposals, the negotiation of scope for data products, their design, testing, etc. The more complex the whole ecosystem becomes, the higher the chances for systemic errors to occur and multiply, respectively to create unwanted behavior of the parties involved. Ecosystems are challenging to monitor and manage.

The more complex the architecture, the higher the chances for failure. Even if some organizations might succeed, it doesn't mean that such an endeavor is for everybody - a certain maturity in building data architectures, data-based artefacts and managing projects must exist in the organization. Many organizations fail in addressing basic analytical requirements, why would one think that they are capable of handling an increased complexity? Even if one breaks the complexity of a data warehouse to more manageable units, the complexity is just moved at other levels that are more difficult to manage in ensemble.

Being able to audit and test each data product individually has its advantages, though when a data product becomes part of an aggregate it can be easily get lost in the bigger picture. Thus, is needed a global observability framework that allows to monitor the performance and health of each data product in aggregate. Besides that, there are needed event brokers and other mechanisms to handle failure, availability, security, etc.

Data products make sense in certain scenarios, especially when the complexity of architectures is manageable, though attempting to redesign everything from their perspective is like having a hammer in one's hand and treating everything like a nail.

Previous Post <<||>> Next Post

🧭Business Intelligence: Data Products (Part I: A Lego Exercise)

Business Intelligence Series

One can define a data product as the smallest unit of data-driven architecture that can be independently deployed and managed (aka product quantum) [1]. In other terms one can think of a data product like a box (or Lego piece) which takes data as inputs, performs several transformations on the data from which result several output data (or even data visualizations or a hybrid between data, visualizations and other content).

At high-level each Data Analytics solution can be regarded as a set of inputs, a set of outputs and the transformations that must be performed on the inputs to generate the outputs. The inputs are the data from the operational systems, while the outputs are analytics data that can be anything from data to KPIs and other metrics. A data mart, data warehouse, lakehouse and data mesh can be abstracted in this way, though different scales apply.

For creating data products within a data mesh, given a set of inputs, outputs and transformations, the challenge is to find horizontal and vertical partitions within these areas to create something that looks like a Lego structure, in which each piece of Lego represents a data product, while its color represents the membership to a business domain. Each such piece is self-contained and contains a set of transformations, respectively intermediary inputs and outputs. Multiple such pieces can be combined in a linear or hierarchical fashion to transform the initial inputs into the final outputs.

Data Products with a Data Mesh

Finding such a partition is possible though it involves a considerable effort, especially in designing the whole thing - identifying each Lego piece uniquely. When each department is on its own and develops its own Lego pieces, there's no guarantee that the pieces from the various domains will fit together to built something cohesive, performant, secure or well-structured. Is like building a house from modules, the pieces must fit together. That would be the role of governance (federated computational governance) - to align and coordinate the effort.

Conversely, there are transformations that need to be replicated for obtaining autonomous data products, and the volume of such overlapping can be considerable high. Consider for example the logic available in reports and how often it needs to be replicated. Alternatively, one can create intermediary data products, when that's feasible.

It's challenging to define the inputs and outputs for a Lego piece. Now imagine in doing the same for a whole set of such pieces depending on each other! This might work for small pieces of data and entities quite stable in their lifetime (e.g. playlists, artists, songs), but with complex information systems the effort can increase by a few factors. Moreover, the complexity of the structure increases as soon the Lego pieces expand beyond their initial design. It's like the real Lego pieces would grow within the available space but still keep the initial structure - strange constructs may result, which even if they work, change the gravity center of the edifice in other directions. There will be thus limits to grow that can easily lead to duplication of functionality to overcome such challenges.

Each new output or change in the initial input for this magic boxes involves a change of all the intermediary Lego pieces from input to output. Just recollect the last experience of defining the inputs and the outputs for an important complex report, how many iterations and how much effort was involved. This might have been an extreme case, though how realistic is the assumption that with data products everything will go smoother? No matter of the effort involved in design, there will be always changes and further iterations involved.

Previous Post <<||>> Next Post

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)

31 January 2024

🏭🗒️Microsoft Fabric: Parquet Format [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 31-Jan-2024

[Microsoft Fabric] Parquet format

{definition} open source, column-oriented data file format designed for efficient data storage and retrieval [1]

provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk [1]
designed to be a common interchange format for both batch and interactive workloads [1]

{characteristic} open source file format

similar to other columnar-storage file formats available in Hadoop [1]

e.g. RCFile, ORC

became an industry standard

{benefit} provides interoperability across multiple tools

{characteristic} language agnostic [1]

different programming languages can be used to manipulate the data

{characteristic} column-based format [1]

files are organized by column

⇐ rather than by row
⇒ saves storage space and speeds up analytics queries [1]

reads only the needed columns

⇐ non-relevant data are skipped
⇒ greatly minimizes the IO [1]

aggregation queries are less time-consuming compared to row-oriented databases [1]

{benefit} increased data throughput and performance [1]

⇒ recommended for analytical workloads

{characteristic} highly efficient data compression/decompression [1]

supports flexible compression options and efficient encoding schemes [1]

data can be compressed by using one of the several codecs available [1]

⇒ different data files can be compressed differently [1]

reduced storage requirements [1]

by at least one-third on large datasets
⇒ {benefit} saves on cloud storage space

greatly improves scan and deserialization time [1]

⇒ {benefit} reduces the processing costs

{downside} can be slower to write than row-based file formats

primarily because they contain metadata about the file contents
though have fast read times

{characteristic} supports complex data types and advanced nested data structures [1]

implemented using the record-shredding and assembly algorithm

accommodates complex data structures that can be used to store the data [1]
optimized to work with complex data in bulk and features different ways for efficient data compression and encoding types [1]

the approach is best especially for those queries that need to read certain columns from a large table [1]

{characteristic} cloud-ready

works best with interactive and serverless technologies [1]

{characteristic} immutable

a file can't be update to modify the column name, reorder or drop columns [2]

⇐ requires rewriting the whole file [2]

{characteristic} binary-based file

⇒ not easily readable (by humans)

{characteristic} self-describing

contains metadata about schema and structure
{concept} row groups (aka segments)

contains data from the same columns

{constraint} column names are case sensitive

{concept} file footer

stores metadata statistics for each row group [2]

min/max statistics
the number of rows
can be leveraged by data processing engines to run queries more efficiently [2]

⇐ depending on the query, entire row group can be skipped [2]

{concept} file header

large datasets can be split across multiple parquet files

⇐ the structure can be flat or hierarchical
managing multiple files has several challenges
the files can be used to define a table (aka parquet table)

⇐ {constraint} the files must have the same definition

⇐ schema enforcement must be coded manually [2]

{limitation} [Data Lake] no support for ACID transactions [2]

⇒ easy to corrupt [2]

partially written files will break any subsequent read operations

the compute engine will try to read in the corrupt files and error out [2]
corrupted files must be manually identified and deleted manually to fix the issues [2]

{limitation} it's not easy to delete rows from it [2]

requires reading all the data, filtering out the data not needed, and then rewriting the entire table [2]

{limitation} doesn't support DML transactions [2]
{limitation} there is no change data feed [2]
{limitation} slow file listing [2]

small files require excessive I/O overhead

ideally the files should be between 64 MB and 1 GB
ideally the files should be compacted into larger files (aka small file compaction, bin-packing)

{limitation} expensive footer reads to gather statistics for file skipping [2]

fetching all the footers and building the file-level metadata for the entire table is slow [2]

⇐ it requires a file-listing operation [2]

the effectiveness of data skipping depends on how many files can be can skipped when performing a query [2]

{limitation} doesn't support schema enforcement [2]
{limitation} doesn't support check constraints [2]
{limitation} doesn't support data versioning [2]

{concept} table partitioning

{definition} common optimization approach used to store the data of the same table in different directories, with partitioning column values encoded in the path of each partition directory [6]
{recommendation} avoid partitioning by columns with very high cardinality

{concept} bin-packing (aka compaction, bin-compaction)

aims to produce evenly-balanced data files with respect to their size on disk,

⇐ but not necessarily in respect to the number of tuples per file [7]

requires an algorithm that efficiently organizes the files into equal size containers [6]
{characteristic} idempotent

if it is run twice on the same dataset, the second run has no effect [7]

{feature} [Microsoft Fabric] V-order

{definition} write time optimization to the parquet file format that enables lightning-fast reads under the MF compute engines [3]
applies special sorting, row group distribution, dictionary encoding and compression on parquet files [3]

requires less compute engines resources in to read it [3]

provides further cost efficiency and performance

has a 15% impact on average write times but provides up to 50% more compression [3]

{characteristic} open-source parquet format compliant

all parquet engines can read it as a regular parquet file [3]
⇐ table properties and optimization commands can be used on control V-Order on its partitions [3]
compatible with other features [3]

applied at parquet file level [3]
enabled by default

{command} OPTIMIZE

merges all changes into bigger, consolidated parquet files (aka bin-compaction) [3]
[Spark] dynamically optimizes partitions while generating files with a default 128 MB size [5]

the target file size may be changed per workload requirements using configurations [5]

properly designing the table physical structure based on the ingestion frequency and expected read patterns is likely more important than running the optimization command [3]
running the compaction operation brings the data lake in an unusable state for readers [7]
{warning} manually compacting the files is inefficient and error prone [7]

no way to differentiate files that contain new data from files that contain existing data that was just compacted into new files [7]

[Delta Lake] when ZORDER and VORDER are used together, Apache Spark performs bin-compaction, ZORDER, VORDER sequentially [3]

Previous Post <<||>> Next Post

Resources:
[1] Databricks (2023) What is Parquet? (link)
[2] Data Lake (2023) Delta Lake vs. Parquet Comparison (link)
[3] Data Mozart (2023) Parquet file format – everything you need to know! (link)
[4] Microsoft Learn (2023) Query Parquet files using serverless SQL pool in Azure Synapse Analytics (link)
[5] Microsoft Learn (2023) Lakehouse tutorial: Prepare and transform data in the lakehouse (link)
[6] Apache Spark (2023) Spark SQL Guide (link)
[7] Delta Lake (2023) Delta Lake Small File Compaction with OPTIMIZE (link)
[8] Delta Lake (2023) Optimizations (link)

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
ACID - atomicity, consistency, isolation, durability
IO - Input/Output
MF - Microsoft Fabric

ORC - Optimized Row Columnar
RCFile - Record Columnar File

01 February 2021

📦Data Migrations (DM): Quality Assurance (Part III: Quality Acceptance Criteria III)

Data Migrations Series

Repeatability

Repeatability is the degree with which a DM can be repeated and obtain consistent results between repetitions. Even if a DM is supposed to be a one-time activity for a project, to guarantee a certain level of quality it’s important to consider several iterations in which the data requirements are refined and made sure that the data can be imported as needed into the target system(s). Considered as a process, as long the data and the rules haven’t changed, the results should be the same or have the expected level of deviation from expectations.

This requirement is important especially for the data migrated during UAT and Go-Live, time during which the input data and rules need to remain frozen (even if small changes in the data can still occur). In fact, that’s the role of UAT – to assure that the data have the expected quality and when compared to the previous dry-run, that it attains the expected level of consistency.

Reusability

Reusability is the degree to which the whole solution, parts of the logic or data can be reused for multiple purposes. Master data and the logic associated with them have high reusability potential as they tend to be referenced by multiple entities.

Modularity

Modularity is the degree to which a solution is composed of discrete components such that a change to one component has minimal impact on other components. It applies to the solution itself but also to the degree to which the logic for the various entities is partitioned so to assure a minimal impact.

Partitionability

Partitionability is the degree to which data or logic can be partitioned to address the various requirements. Despite the assurance that the data will be migrated only once, in practice this assumption can be easily invalidated. It’s enough to increase the system freeze by a few days and/or to have transaction data that suddenly requires master data not considered. Even if the deltas can be migrated in system manually, it’s probably recommended to migrate them using the same logic. Moreover, the performing of incremental loads can be a project requirement.

Data might need to be partitioned into batches to improve processing’s performance. Partitioning the logic based on certain parameters (e.g. business unit, categorical values) allows more flexibility in handling other requirements (e.g. reversibility, performance, testability, reusability).

Performance

Performance refers to the degree a piece of software can process data into an amount of time considered as acceptable for the business. It can vary with the architecture and methods used, respectively data volume, veracity, variance, variability, or quality.

Performance is a critical requirement for a DM, especially when considering the amount of time spent on executing the logic during development, tests and troubleshooting, as well for other activities. Performance is important during dry-runs but more important during Go-Live, as it equates with a period during which the system(s) are not available for the users. Upon case, a few hours of delays can have an important impact on the business. In extremis, the delays can sum up to days.

Predictability

Predictability is the degree to which the results and behavior of a solution, respectively the processes involve are predictable based on the design, implementation or other factors considered (e.g. best practices, methodology used, experience, procedures and processes). Highly predictable solutions are desirable, though reaching the required level of performance and quality can be challenging.

The results from the dry-runs can offer an indication on whether the data migrated during UAT and Go-Live provide a certain level of assurance that the DM will be a success. Otherwise, an additional dry-run should be planned during UAT, if the schedule allows it.

Previous Post <> Nest Post

27 December 2020

🧊Data Warehousing: ETL (Part III: The Extract Subprocess)

Data Warehousing Series

As part of the ETL process with applicability to Data Warehousing, Data Migrations, Data Integrations or similar scenarios the extraction subprocess is responsible for preparing and implementing the logic required to extract the data from the various source systems at the required level of detail. The extraction is done typically based on SQL queries as long one deals with relational databases or any OLEDB or ODBC-based data repositories including flat or MS Office files.

One can consider the preparation of the extraction logic as separate design subprocess of the targeted solution. Even if high-level design decisions are considered at the respective level, the low-level design needs to be considered at ETL package level. As part of the process are identified the source of the data in terms of system, tables and attributes to be imported, as well the joins, business and transformation rules that need to be applied on the data. This can involve reengineering the logic from the source system(s) as well data profiling, discovery or exploration activities.

A common practice is to copy the source tables 1:1 into the solution, eventually by considering only the needed attributes to minimize the necessary space, loading time and content’s complexity, even if this would add more effort into the design phase to identify only the needed attributes. If further attributes are identified at a later stage, the packages need to be modified accordingly. If the data volume or the number of unnecessary attributes is neglectable, copying the table 1:1 could prove to be the best strategy.

A second approach is to model within the extraction the (business) entity as designed within the source system. For example, the entity could be split over multiple tables from design or other purposes. Thus, the extraction query will attempt modeling the entity. This approach reduces to some degree the number of tables from the targeted solution, as well the number of ETL packages involved, while providing a clear depiction of the entities involved.

A third approach is to extract the data as needed by the target system, eventually as a mix between master and transaction data, fact which could easily lead to data redundancy with different timeliness and all the consequences resulting from this. This approach is usually met in solutions which require fast data availability in the detriment of design.

Unfortunately, there can be design constraints or choice considerations that could lead to a mix between these approaches. If the impact caused by the mix between the first two approaches is minimal, the third approach can cause more challenges, though it might be a small price to pay as long the considered data are disconnected from other data.

To reduce the redundancy of data, it’s recommended to consider as goal creating a unique source of facts, which can be obtained by minimizing as much as possible the overlaps between tables, respectively entities. Ideally there should be no overlaps. On the other sides the overlaps can be acceptable when the same data are available in more systems and the solution requires all the data to be available.

If the above approaches consider the vertical partitioning of the data, there can be also horizontal partitioning needs especially when a subset of the data is needed or when is needed to partition the data based on a set of values. In addition, one might be forced to include also transformation rules directly into the extraction logic, for example to handle conversion issues or minimize certain design overhead early in the process. In practice it makes sense to link such choices to business rules and document them accordingly.

Previous Post <<||>> Next Post

31 October 2020

🧊Data Warehousing: Architecture (Part III: Data Lakes & other Puddles)

One can consider a data lake as a repository of all of an organization’s data found in raw form, however this constraint might be too harsh as the data found at different levels of processing can be imported as well, for example the results of data mining or other Data Science techniques/methods can be considered as raw data for further processing.

In the initial definition provided by James Dixon, the difference between a data lake and a data mart/warehouse was expressed metaphorically as the transition from bottled water to lakes streamed (artificially) from various sources. It’s contrasted thus the objective-oriented, limited and single-purposed role of the data mart/warehouse in respect to the flow of data in nature that could be tapped and harnessed as desired. These are though metaphors intended to sensitize the buyer. Personally, I like to think of the data lake as an extension of the data infrastructure, in which the data mart or warehouse is integrant part. Imposing further constrains seem to have no benefit.

Probably the most important characteristic of a data lake is that it makes the data of an organization discoverable and consumable, though from there to insight and other benefits is a long road and requires specific knowledge about the techniques used, as well about organization’s processes and data. Without this data lake-based solutions can lead to erroneous results, same as mixing several ingredients without having knowledge about their usage can lead to cooking experiments aloof from the art of cooking.

A characteristic of data is that they go through continuous change and have different timeliness, respectively degrees of quality in respect to the data quality dimensions implied and sources considered. Data need to reflect the reality at the appropriate level of detail and quality required by the processing application(s), this applying to data warehouses/marts as well data lake-based solutions.

Data found in raw form don’t necessarily represent the true/truth and don’t necessarily acquire a good quality no matter how much they are processed. Solutions need to be resilient in respect to the data they handle through their layers, independently of the data quality and transmission problems. Whether one talks about ETL, data migration or other types of data processing, keeping the data integrity at various levels and layers can be maybe the most important demand upon solutions.

Snapshots as moment-in-time recordings of tables, entities, sets of entities, datasets or whole databases, prove to be often the best mechanisms in keeping data integrity when this aspect is essential to their processing (e.g. data migrations, high-accuracy measurements). Unfortunately, the more systems are involved in the process and the broader span of the solutions over the sources, the more difficult it become to take such snapshots.

A SQL query’s output represents a snapshot of the data, therefore SQL-based solutions are usually appropriate for most of the business scenarios in which the characteristics of data (typically volume, velocity and/or variety) make their processing manageable. However, when the data are extracted by other means integrity is harder to obtain, especially when there’s no timestamp to allow data partitioning on a time scale, the handling of data integrity becoming thus in extremis a programmer’s task. In addition, getting snapshots of the data as they are changed can be a costly and futile task.

Further on, maintaining data integrity can prove to be a matter of design in respect not only to the processing of data, but also in respect to the source applications and the business processes they implement. The mastery of the underlying principles, techniques, patterns and methodologies, helps in the process of designing the right solutions.

Note:
Written as answer to a Medium post on data lakes and batch processing in data warehouses.

11 May 2018

🔬Data Science: K-Means Algorithm (Definitions)

"A top-down grouping method where the number of clusters is defined prior to grouping." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"An algorithm used to assign K centers to represent the clustering of N points (K< N). The points are iteratively adjusted so that each of the N points is assigned to one of the K clusters, and each of the K clusters is the mean of its assigned points." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"The k-means algorithm is an algorithm to cluster n objects based on attributes into k partitions, k = n. The algorithm minimizes the total intra-cluster variance or the squared error function." (Dimitrios G Tsalikakis et al, "Segmentation of Cardiac Magnetic Resonance Images", 2009)

"The k-means algorithm assigns any number of data objects to one of k clusters." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"The clustering algorithm that divides a dataset into k groups such that the members in each group are as similar as possible, that is, closest to one another." (David Natingga, "Data Science Algorithms in a Week" 2nd Ed., 2018)

"K-Means is a technique for clustering. It works by randomly placing K points, called centroids, and iteratively moving them to minimize the squared distance of elements of a cluster to their centroid." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"It is an iterative algorithm that partition the hole data set into K non overlaping subsets (Clusters). Each data point belongs to only one subset." (Aman Tyagi, "Healthcare-Internet of Things and Its Components: Technologies, Benefits, Algorithms, Security, and Challenges", 2021)

[Non-scalable K-means:] "A Microsoft Clustering algorithm method that uses a distance measure to assign a data point to its closest cluster." (Microsoft Technet)

"An algorithm that places each value in the cluster with the nearest mean, and in which clusters are formed by minimizing the within-cluster deviation from the mean." (Microsoft, "SSAS Glossary")

03 December 2011

💠SQL Server: Window Functions 🆕

Introduction

In the past, in the absence or in parallel with other techniques, aggregate functions proved to be quite useful in order to solve several types of problems that involve the retrieval of first/last record or the display of details together with averages and other aggregates. Typically their use involves two or more joins between a dataset and an aggregation based on the same dataset or a subset of it. An aggregation can involve one or more columns that make the object of analysis. Sometimes it might be needed multiple such aggregations based on different sets of columns. Each such aggregation involves at least a join. Such queries can become quite complex, though they were a price to pay in order to solve such problems.

Partitions

The introduction of analytic functions in Oracle and of window functions, a similar concept, in SQL Server, allowed the approach of such problems from a different simplified perspective. Central to this feature it’s the partition (of a dataset), its meaning being same as of mathematical partition of a set, defined as a division of a set into non-overlapping and non-empty parts that cover the whole initial set. The introduction of partitions it’s not necessarily something new, as the columns used in a GROUP BY clause determines (implicitly) a partition in a dataset. The difference in analytic/window functions is that the partition is defined explicitly inline together with a ranking or average function evaluated within a partition. If the concept of partition is difficult to grasp, let’s look at the result-set based on two Products (the examples are based on AdventureWorks database):

-- Price Details for 2 Products 
SELECT A.ProductID  
, A.StartDate 
, A.EndDate 
, A.StandardCost  
FROM [Production].[ProductCostHistory] A 
WHERE A.ProductID IN (707, 708) 
ORDER BY A.ProductID 
, A.StartDate

In this case a partition is “created” based on the first Product (ProductId = 707), while a second partition is based on the second Product (ProductId = 708). As a parenthesis, another partitioning could be created based on ProductId and StartDate; considering that the two attributes are a key in the table, this will partition the dataset in partitions of 1 record (each partition will have exactly one record).

Details and Averages

In order to exemplify the use of simple versus window aggregate functions, let’s consider a problem in which is needed to display Standard Price details together with the Average Standard Price for each ProductId. When a GROUP BY clause is applied in order to retrieve the Average Standard Cost, the query is written under the form:

-- Average Price for 2 Products 
SELECT A.ProductID  
, AVG(A.StandardCost) AverageStandardCost 
FROM [Production].[ProductCostHistory] A 
WHERE A.ProductID IN (707, 708) 
GROUPBY A.ProductID  
ORDERBY A.ProductID

In order to retrieve the details, the query can be written with the help of a FULL JOIN as follows:

-- Price Details with Average Price for 2 Products - using JOINs 
SELECT A.ProductID  
, A.StartDate 
, A.EndDate 
, A.StandardCost 
, B.AverageStandardCost 
, A.StandardCost - B.AverageStandardCost DiffStandardCost 
FROM [Production].[ProductCostHistory] A    
  JOIN ( -- average price        
    SELECT A.ProductID         
    , AVG(A.StandardCost) AverageStandardCost         
    FROM [Production].[ProductCostHistory] A        
    WHERE A.ProductID IN (707, 708)        
    GROUP BY A.ProductID      
) B  
    ON A.ProductID = B.ProductID 
WHERE A.ProductID IN (707, 708) 
ORDERBY A.ProductID 
, A.StartDate

As pointed above the partition is defined by ProductId. The same query written with window functions becomes:

-- Price Details with Average Price for 2 Products - using AVG window function 
SELECT A.ProductID  
, A.StartDate 
, A.EndDate 
, A.StandardCost 
, AVG(A.StandardCost) OVER(PARTITION BY A.ProductID) AverageStandardCost 
, A.StandardCost - AVG(A.StandardCost) OVER(PARTITION BY A.ProductID) DiffStandardCost 
FROM [Production].[ProductCostHistory] A 
WHERE A.ProductID IN (707, 708) 
ORDER BY A.ProductID 
, A.StartDate

As can be seen, in the second example, the AVG function is defined using the OVER clause with PartitionId as partition. Even more, the function is used in a formula to calculate the Difference Standard Cost. More complex formulas can be written making use of multiple window functions.

The Last Record

Let’s consider the problem of retrieving the nth record. Because with aggregate functions is easier to retrieve the first or last record, let’s consider that is needed to retrieve the last Standard Price for each ProductId. The aggregate function helps to retrieve the greatest Start Date, which farther helps to retrieve the record containing the Last Standard Price.

-- Last Price Details for 2 Products - using JOINs 
SELECT A.ProductID  
, A.StartDate 
, A.EndDate 
, A.StandardCost 
FROM [Production].[ProductCostHistory] A  
    JOIN ( -- average price          
    SELECT A.ProductID          
    , Max(A.StartDate) LastStartDate          
    FROM [Production].[ProductCostHistory] A          
    WHERE A.ProductID IN (707, 708)          
    GROUP BY A.ProductID      
) B      
   ON A.ProductID = B.ProductID  
  AND A.StartDate = B.LastStartDate 
WHERE A.ProductID IN (707, 708) 
ORDERBY A.ProductID 
,A.StartDate

With window functions the query can be rewritten as follows:

-- Last Price Details for 2 Products - using AVG window function 
SELECT * 
FROM (-- ordered prices      
    SELECT A.ProductID      
    , A.StartDate      
    , A.EndDate      
    , A.StandardCost      
    , RANK() OVER(PARTITION BY A.ProductID ORDER BY A.StartDate DESC) Ranking      
    FROM [Production].[ProductCostHistory] A     
    WHERE A.ProductID IN (707, 708) 
  ) A 
WHERE Ranking = 1 
ORDER BY A.ProductID 
, A.StartDate

   As can be seen, in order to retrieve the Last Standard Price, was considered the RANK function, the results being ordered descending by StartDate. Thus, the Last Standard Price will be always positioned on the first record. Because window functions can’t be used in WHERE clauses, it’s needed to encapsulate the initial logic in a subquery. Similarly could be retrieved the First Standard Price, this time ordering ascending the StartDate. The last query can be easily modified to retrieve the nth records (this can prove to be more difficult with simple average functions), the first/last nth records.

Conclusion

    Without going too deep into details, I shown above two representative scenarios in which solutions based on average functions could be simplified by using window functions. In theory the window functions provide greater flexibility but they have their own trade offs too. In the next posts I will attempt to further detail their use, especially in the context of Statistics.

SQL Troubles

Pages

14 February 2025

🏭🧊🗒️Microsoft Fabric: Partitions in Lakehouses [Notes]

09 February 2025

🏭🗒️Microsoft Fabric: Sharding [Notes]

25 December 2024

💎🏭SQL Reloaded: Number of Records IV (via sys.partitions DMV)

20 April 2024

⚡️🗒️Power BI: Visual Calculations [Notes]

17 March 2024

🧭Business Intelligence: Data Products (Part II: The Complexity Challenge)

🧭Business Intelligence: Data Products (Part I: A Lego Exercise)

31 January 2024

🏭🗒️Microsoft Fabric: Parquet Format [Notes]

01 February 2021

📦Data Migrations (DM): Quality Assurance (Part III: Quality Acceptance Criteria III)

27 December 2020

🧊Data Warehousing: ETL (Part III: The Extract Subprocess)

31 October 2020

🧊Data Warehousing: Architecture (Part III: Data Lakes & other Puddles)

11 May 2018

🔬Data Science: K-Means Algorithm (Definitions)

03 December 2011

💠SQL Server: Window Functions 🆕

About Me