SQL Troubles: data warehousing


20 February 2025

💠🛠️🗒️SQL Server: Nulls [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources. It considers only on-premises SQL Server; for other platforms, please refer to the documentation.

Last updated: 20-Feb-2024

[SQL Server] Null

{def} keyword that indicates that the value is unknown [1]

different from an empty or zero value [1]
no two null values are equal [1]

comparisons between two null values, or between a null value and any other value, return unknown because the value of each NULL is unknown [1]

indicates that the value is

unknown
not applicable
to be added later
⇒ can't be used as information that is required to distinguish one row in a table from another row in a table [1]

can be assigned to a column by

explicitly stating NULL in an INSERT or UPDATE statement [1]
leaving a column out of an INSERT statement [1]

{recommendation} test for null values in queries

via IS NULL or IS NOT NULL in the WHERE clause [1]
when present in data, logical and comparison operators can potentially return a third result of UNKNOWN instead of just TRUE or FALSE [1]

⇐ three-valued logic can be the source for many application errors [1]

⇐ parameters and variables not explicitly initialized can cause problems in code

{recommendation} handle null values in logic

via the ISNULL or COALESCE functions
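The three-valued logic described above can be tried out with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in engine, which is an assumption made only to keep the example self-contained; it is not SQL Server. The comparisons shown follow the same ANSI rules that T-SQL applies with ANSI_NULLS ON, and COALESCE is used because ISNULL is T-SQL specific. The Orders table and its columns are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in engine (assumption: not SQL Server)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (Id INTEGER, Discount INTEGER)")
conn.executemany("INSERT INTO Orders VALUES (?, ?)", [(1, 10), (2, None), (3, None)])


def scalar(sql):
    """Run a query and return the first column of the first row."""
    return conn.execute(sql).fetchone()[0]


# NULL = NULL evaluates to UNKNOWN, which the driver surfaces as None
null_equals_null = scalar("SELECT NULL = NULL")

# A predicate that evaluates to UNKNOWN filters the row out
rows_eq_null = scalar("SELECT COUNT(*) FROM Orders WHERE Discount = NULL")
rows_is_null = scalar("SELECT COUNT(*) FROM Orders WHERE Discount IS NULL")

# Neither the predicate nor its negation is TRUE for rows holding NULL,
# so these two counts do not add up to the number of rows in the table
rows_eq_10 = scalar("SELECT COUNT(*) FROM Orders WHERE Discount = 10")
rows_ne_10 = scalar("SELECT COUNT(*) FROM Orders WHERE Discount <> 10")

# Handling NULL explicitly in the logic restores the expected result
rows_ne_10_handled = scalar(
    "SELECT COUNT(*) FROM Orders WHERE COALESCE(Discount, 0) <> 10"
)

print(null_equals_null, rows_eq_null, rows_is_null)
print(rows_eq_10, rows_ne_10, rows_ne_10_handled)
```

The pair `rows_eq_10` and `rows_ne_10` is the typical source of application errors: the rows holding NULL silently drop out of both filters.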

{constraint} [primary keys] if any of the columns considered in a primary key contain NULL values, the PRIMARY KEY constraint can’t be created [3]
{constraint} [UNIQUE constraint] allows the columns that make up the constraint to allow NULLs, but it doesn’t allow all key columns to be NULL for more than one row [3]
[data warehouse] nullability of columns

{best practice} define columns as NOT NULL when appropriate

{benefit} helps the Query Optimizer
{benefit} reduces in some cases the storage space required for the data
{benefit} allows SQL Server to avoid unnecessary encoding in columnstore indexes and during batch mode execution [2]

{example} [SQL Server 2000+] bigint column

when the column is defined as NOT NULL, the value fits into a single CPU register

⇒ operations on the value can be performed more quickly

a nullable bigint column requires another, 65th bit to indicate NULL values

SQL Server avoids cross-register data storage by storing some of the row values (usually the highest or lowest values) in main memory using special markers to indicate it in the data that resides in the CPU cache [2]

⇒ adds extra load during execution

{recommendation} avoid nullable columns in data warehouse environments [2]

⇐ the recommendation can apply also to OLTP databases

there are database designs that enforce not null values for all attributes

e.g. Dynamics AX 2009/365 F&O
{benefit} eliminates the need to test for null values in legacy code

{recommendation} use CHECK and UNIQUE constraints or indexes when overhead introduced by constraints or unique indexes is acceptable [2]
{recommendation} consider using filtered indexes instead of normal indexes for columns with many null values

minimizes the waste of storage space
⇐ understand the characteristics of the columns used in the queries [3]

See also: Null-ifying the World, Dynamic Queries, Handling missing dates


References:

[1] Microsoft Learn (2024) SQL Server 2022: NULL and UNKNOWN (T-SQL)
[2] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.

[3] Kalen Delaney, Bob Beauchemin, Conor Cunningham, Jonathan Kehayias, Benjamin Nevarez & Paul S. Randal (2013) Microsoft SQL Server 2012 Internals, Microsoft Press, ISBN: 978-0-7356-5856-1

16 February 2025

💠🛠️🗒️SQL Server: Columnstore Indexes [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources. It considers only on-premises SQL Server; for other platforms, please refer to the documentation.

Last updated: 15-Feb-2024

[SQL Server] columnstore indexes (CI)

{def} a technology for storing, retrieving and managing data by using a columnar data format (aka columnstore)

store compressed data on a per-column rather than a per-row basis [5]

{benefit} designed for analytics and data warehousing workloads

data warehousing

{scenario} store fact tables and large dimension tables

⇐ tend to require full table scans rather than table seeks

analytics workloads

{scenario} [SQL Server 2016 SP1] can be used for real-time analytics on operational databases

⇐ an updatable nonclustered columnstore index can be created on a rowstore table

{benefit} performance increase

can achieve up to 100x better performance [4]
offers an order of magnitude better performance than a rowstore index

{feature} uses batch mode execution

improves query performance typically by two to four times

have high performance gains for analytic queries that scan large amounts of data, especially on large tables (>1 million rows)

{benefit} reduces significantly the data warehouse storage costs

{feature} data compression

⇒ provides high compression rates, typically up to 10 times

⇒ reduces total I/O from the physical media

⇐ queries often select only a few columns from a table
minimizes or eliminates system I/O bottlenecks

reduces significantly the memory footprint

⇒ query performance can improve

because SQL Server can perform more query and data operations in memory

{benefit} built in memory

⇒ sufficient memory must be available

{benefit} part of the database engine

no special hardware is needed

{concept} columnstore

{def} data structure logically organized as a table with rows and columns, and physically stored in a column-wise data format

stores values from the same domain which commonly have similar values

when a query references a column, then only that column is fetched from disk [3]

⇐ the columns not requested are skipped

⇒ they are not loaded into memory

when a query is executed, the rows must be reconstructed

⇒ row reconstruction takes some time and uses some CPU and memory resources [3]

[SQL Server 2016] columnstore index on rowstore tables

columnstore is updated when data changes in the rowstore table

both indexes work against the same data

{concept} rowstore

{def} data that's logically organized as a table with rows and columns, and physically stored in a row-wise data format

⇐ the traditional way to store relational table data
refers to a table where the underlying data storage format is either

a heap
a clustered index
a memory-optimized table

{concept} rowstore index

performs best on queries that seek into the data, when searching for a particular value, or for queries on a small range of values

⇒ appropriate for transactional workloads

because they tend to require mostly table seeks instead of table scans

{concept} rowgroup

{def} a group of rows that are compressed into columnstore format at the same time

{constraint} has a maximum number of rows per rowgroup, which is 1,048,576 (= 2^20) rows
contains one column segment for every column in the table
can have more than one delta rowgroup that form the deltastore

e.g. when multiple threads create columnstore indexes using parallel execution plans [5]

⇐ each thread will work with its own subset of data, creating separate rowgroups [5]

[partitions] each table partition has its own set of row groups [5]

⇐ too many partitions may prevent workloads from benefiting from a CCI [11]

⇐ data aren’t pushed into a compressed columnstore segment until the rowgroup limit is reached

{event} rowgroup is compressed

marked as read-only [16]
a compressed rowgroup is considered as fragmented when either

row number < rowgroup limit and the trim_reason is other than DICTIONARY_SIZE

⇐ if the dictionary size has reached its maximum, nothing can be done to increase the number of rows [15]

it has nonzero deleted rows that exceed a minimum threshold [15]

{event} all data from rowgroup deleted

transitions from COMPRESSED into TOMBSTONE state
later removed by the tuple-mover background process

{event} rows in the columnstore indexes can be moved to different locations

row-ids in the nonclustered indexes aren’t updated

⇐ the mappings between old and new row locations are stored in an internal structure (aka mapping index)

{event} rowgroup build

all column data are combined on a per-row group basis, encoded and compressed [5]

the rows within a row group can be rearranged if that helps to achieve a better compression rate [5]

{feature} data compression

the table is sliced into rowgroups, and each rowgroup is compressed in a column-wise manner

the number of rows in the rowgroup must be

large enough to improve compression rates
small enough to benefit from in-memory operations

having too many small rowgroups decreases columnstore index’s quality

uses its own compression mechanism

⇒ row or page compression cannot be used on it [3]
[SQL Server 2016] page compression has been removed

⇐ in some cases, page compression disallowed the creation of columnstore indexes with a very large number of columns [5]

{feature} compression delay

computed when a delta rowgroup is closed [7]
keeps the ‘active’ rows in the delta rowgroup and only transitions these rows to a compressed rowgroup after a specified delay [7]

⇐ reduces the overall maintenance overhead of NCCI [7]
⇒ leads to a larger number of delta rowgroups [7]

{best practice} if the workload is primarily inserting data and querying it, the default COMPRESSION_DELAY of 0 is the recommended option [7]
{best practice} [OLTP workload] if > 10% rows are marked deleted in recently compressed rowgroups, then consider a value that accommodates the behavior [7]

via: create nonclustered columnstore index with (compression_delay= 150)

{feature} data encoding

all values in the data are replaced with 64-bit integers using one of two encoding algorithms
{concept} dictionary encoding

stores distinct values from the data in a separate structure (aka dictionary)

every value in a dictionary has a unique ID assigned [5]

the ID is used for replacement

{concept} global dictionary

shared across all segments that belong to the same index partition [5]

{concept} local dictionary

created for individual segments using values that are not present in the global dictionary
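Dictionary encoding as outlined above can be illustrated with a minimal sketch. It only shows the basic idea of replacing values with IDs from a dictionary of distinct values; the ID assignment order is an assumption, and the split into global and local dictionaries as well as SQL Server's actual on-disk structures are not modeled. The function names are hypothetical.

```python
def dictionary_encode(values):
    """Replace each value with the ID of its entry in a dictionary of distinct values."""
    dictionary = {}  # value -> ID, assigned in order of first appearance (assumption)
    encoded = []
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        encoded.append(dictionary[value])
    return dictionary, encoded


def dictionary_decode(dictionary, encoded):
    """Rebuild the original values from the IDs."""
    id_to_value = {entry_id: value for value, entry_id in dictionary.items()}
    return [id_to_value[entry_id] for entry_id in encoded]


colors = ["red", "blue", "red", "red", "green", "blue"]
dictionary, encoded = dictionary_encode(colors)
print(dictionary)  # three distinct entries
print(encoded)     # six small integers instead of six strings
```

The fewer distinct values a column has, the smaller the dictionary and the better the encoding pays off, which is why columns with few duplicates are handled by value-based encoding instead.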

{concept} value-based encoding

mainly used for numeric and integer data types that do not have enough duplicated values [5]

dictionary encoding would be inefficient [5]

converts integer and numeric values to a smaller range of 64-bit integers in 2 steps

{step} [numeric data types] are converted to integers using the minimum positive exponent (aka magnitude) that allows this conversion [5]

{goal} convert all numeric values to integers [5]
[integer data types] the smallest negative exponent is chosen that can be applied to all values without losing their precision [5]

{goal} reduce the interval between the minimum and maximum values stored in the segment [5]

{step} the minimum value (aka base value) in the segment is identified and subtracted from all other values [5]

⇒ makes the minimum value in the segment number 0 [5]
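The two steps of value-based encoding can be sketched as follows. This is a simplified model based only on the description above, assuming a decimal (power of ten) exponent; the function names are hypothetical and the code does not reproduce SQL Server's internal implementation.

```python
from decimal import Decimal


def value_encode(values):
    """Return (exponent, base, encoded) for a non-empty list of ints or decimal strings."""
    decimals = [Decimal(str(v)) for v in values]

    # Step 1a: smallest positive exponent that turns every value into an integer
    exponent = 0
    while any(d.scaleb(exponent) % 1 != 0 for d in decimals):
        exponent += 1
    ints = [int(d.scaleb(exponent)) for d in decimals]

    # Step 1b: values that were already integers are shrunk with a negative
    # exponent, as long as no value loses precision
    if exponent == 0 and any(ints):
        while all(i % 10 == 0 for i in ints):
            ints = [i // 10 for i in ints]
            exponent -= 1

    # Step 2: subtract the base (minimum) value so that the smallest value becomes 0
    base = min(ints)
    return exponent, base, [i - base for i in ints]


def value_decode(exponent, base, encoded):
    """Reverse both steps to recover the original values."""
    return [Decimal(e + base).scaleb(-exponent) for e in encoded]


numeric_result = value_encode(["0.8", "1.24", "1.1"])  # exponent 2
integer_result = value_encode([900, 9000, 90000])      # exponent -2
print(numeric_result)
print(integer_result)
```

In both cases the stored integers span a much smaller interval than the original values, which is the stated goal of the encoding.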

after encoding the data are compressed and stored as a LOB allocation unit

{concept} column segment

{def} a column of data from within the rowgroup
is compressed together and stored on physical media
SQL Server loads an entire segment to memory when it needs to access its data

{concept} segment metadata

store metadata about each segment

e.g. minimum and maximum values
⇐ segments that do not have the required data are skipped [5]
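How the minimum and maximum values in the segment metadata allow segments to be skipped can be shown with a small sketch. The Segment class, the integer date keys, and the function name are hypothetical; the code only models the overlap check, not how SQL Server stores or reads the metadata.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    segment_id: int
    min_value: int
    max_value: int


def segments_to_scan(segments, low, high):
    """Return the IDs of segments whose [min, max] range overlaps the filter [low, high]."""
    return [
        s.segment_id
        for s in segments
        if s.max_value >= low and s.min_value <= high
    ]


# Metadata of three segments of a hypothetical date key column
segments = [
    Segment(0, 20240101, 20240331),
    Segment(1, 20240401, 20240630),
    Segment(2, 20240701, 20240930),
]

# A filter on 15 May to 20 July only needs segments 1 and 2
needed = segments_to_scan(segments, 20240515, 20240720)
print(needed)
```

The check is only effective when the value ranges of the segments overlap little, for example when the data were loaded in date order.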

{concept} deltastore

{def} all of the delta rowgroups of a columnstore index
its operations are handled behind the scenes

can be in either of two states

{state} open (aka open delta store)

accepts new rows and allows modifications and deletions of data

{state} closed (aka closed delta store)

a delta store is closed when it reaches its rowgroup limit

{concept} delta rowgroup

{def} a clustered B-tree index that's used only with columnstore indexes
improves columnstore compression and performance by storing rows until their number reaches the rowgroup limit, at which point they are moved into the columnstore
{event} reaches the maximum number of rows

it transitions from an ‘open’ to ‘closed’ state
a closed rowgroup is compressed by the tuple-mover and stored into the columnstore as COMPRESSED rowgroup

{event} compressed

the existing delta rowgroup transitions into TOMBSTONE state to be removed later by the tuple-mover when there is no reference to it

{concept} tuple-mover

background process that checks for closed rowgroups

if it finds a closed rowgroup, it compresses the delta rowgroup and stores it into the columnstore as a COMPRESSED rowgroup

{concept} clustered columnstore index (CCI)

is the primary storage for the entire table
{characteristic} updatable

has two structures that support data modifications

⇐ both use the B-Tree format to store data [5]
⇐ created on demand [5]
delete bitmap

indicates which rows were deleted from a table
upon deletion the row continues to be stored into the rowgroup
during query execution SQL Server checks the delete bitmap and excludes deleted rows from the processing [5]

delta store

includes newly inserted rows
updating a row triggers the deletion of the existing row and insertion of a new version of a row to a delta store

⇒ the update does not change the row data
⇒ the old version of the row is marked as deleted in the delete bitmap

[partitions] each partition can have a single delete bitmap and multiple delta stores

⇐ this makes each partition self-contained and independent from other partitions

⇒ allows performing a partition switch on tables that have clustered columnstore indexes defined [5]

{feature} supports minimal logging for batch sizes >= rowgroup’s limit [12]
[SQL Server 2017] supports non-persisted computed columns in clustered columnstore indexes [2]
stores some data temporarily in a clustered index (aka deltastore) and keeps a btree list of IDs for deleted rows

⇐ {benefit} reduces fragmentation of the column segments and improves performance
combines query results from both the columnstore and the deltastore to return the correct query results

[partitions] too many partitions can hurt the performance of a clustered columnstore index [11]
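The interplay between compressed rowgroups, the delete bitmap and the delta store described above can be modeled with a short sketch. It is a toy model under simplifying assumptions (a single partition, integer row IDs, no compression); the class and method names are hypothetical and do not correspond to SQL Server internals.

```python
class ColumnstoreModel:
    """Toy model: compressed rows are never changed; modifications go to
    a delete bitmap and a delta store."""

    def __init__(self, compressed_rows):
        self.compressed = dict(compressed_rows)  # row_id -> value, read-only
        self.delete_bitmap = set()               # row_ids flagged as deleted
        self.delta_store = {}                    # row_id -> value of new rows
        self._next_id = max(self.compressed, default=0) + 1

    def insert(self, value):
        """New rows always land in the delta store."""
        row_id = self._next_id
        self._next_id += 1
        self.delta_store[row_id] = value
        return row_id

    def delete(self, row_id):
        if row_id in self.delta_store:
            del self.delta_store[row_id]    # delta store rows are removed directly
        elif row_id in self.compressed:
            self.delete_bitmap.add(row_id)  # compressed rows are only flagged
        else:
            raise KeyError(row_id)

    def update(self, row_id, value):
        """An update is a delete of the old version plus an insert of the new one."""
        self.delete(row_id)
        return self.insert(value)

    def scan(self):
        """Combine both structures and exclude the rows flagged as deleted."""
        live = [v for r, v in self.compressed.items() if r not in self.delete_bitmap]
        return live + list(self.delta_store.values())


table = ColumnstoreModel({1: "a", 2: "b", 3: "c"})
new_id = table.update(2, "B")
print(table.scan())
```

After the update the old version of row 2 is still physically present in the compressed data, which is why a high share of deleted rows is treated as fragmentation.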

{concept} nonclustered columnstore index (NCCI)

{def} a secondary index that's created on a rowstore table

is defined as one or more columns of the table and has an optional condition that filters the rows
designed to be used for workloads involving a mix of transactional and analytics workloads
functions the same as a clustered columnstore index

⇐ has same performance optimizations (incl. batchmode operators)
{exception} doesn’t support persisted computed columns

can’t be created on a columnstore index that has a computed column [2]

however, it behaves differently across the various versions of SQL Server

[SQL Server 2012|2014] {restriction} read-only

contains a copy of part or all of the rows and columns in the underlying table

includes a row-id, which is either the address of

a row in a heap table
a clustered index key value

includes all columns from the clustered index even when not explicitly defined in the CREATE statement

the not specified columns will not be available in the sys.index_columns view

[SQL Server 2016] multiple nonclustered rowstore indexes can be created on a columnstore index and perform efficient table seeks on the underlying columnstore

⇒ once created, makes it possible to drop one or more btree nonclustered indexes

enables real-time operational analytics where the OLTP workload uses the underlying clustered index while analytics run concurrently on the columnstore index

{concept} batch mode execution (aka vector-based execution, vectorized execution)

{def} query processing method used to process multiple rows together in groups of rows, or batches, rather than one row at a time

SQL Server can push a predicate to the columnstore index scan operator, preventing unnecessary rows from being loaded into the batch [5]
queries can process up to 900 rows together

enables efficient query execution (by a 3-4x factor) [4]
⇐ the size of the batches varies to fit into the CPU cache
⇒ reduces the number of times that the CPU needs to request external data from memory or other components [5]

improves the performance of aggregations, which can be calculated on a per-batch rather than a per-row basis [5]
tries to minimize the copy of data between operators by creating and maintaining a special bitmap that indicates if a row is still valid in the batch [5]

⇐ subsequent operators will ignore the non-valid rows
every operator has a queue of work items (batches) to process [5]
worker threads from a shared pool pick items from queues and process them while migrating from operator to operator [5]
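The idea of processing rows in batches with a validity bitmap, instead of copying the qualifying rows from operator to operator, can be sketched as below. The fixed batch size of 900 is taken from the upper bound in the notes (the real size varies), and the operator functions are hypothetical simplifications that ignore work queues and worker threads.

```python
BATCH_SIZE = 900  # upper bound mentioned in the notes; the real size varies


def batches(rows, size=BATCH_SIZE):
    """Split the input into batches of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def filter_batch(batch, predicate):
    """Return a validity bitmap instead of copying the qualifying rows."""
    return [predicate(row) for row in batch]


def sum_batch(batch, valid):
    """Aggregate per batch, ignoring the rows marked as non-valid."""
    return sum(row for row, ok in zip(batch, valid) if ok)


rows = list(range(1, 2001))
total = 0
batch_count = 0
for batch in batches(rows):
    valid = filter_batch(batch, lambda value: value % 2 == 0)
    total += sum_batch(batch, valid)
    batch_count += 1

print(batch_count, total)
```

The aggregate is accumulated once per batch rather than once per row, which is the per-batch calculation the notes refer to.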

is closely integrated with, and optimized around, the columnstore storage format.

columnstore indexes use batch mode execution

⇐ improves query performance typically by two to four times

{concept} tuple mover

single-threaded process that works in the background, preserving system resources

runs every five minutes

converts closed delta stores to row groups that store data in a column-based storage format [5]

can be disabled via trace flag 634
⇐ the conversion of closed delta stores to row groups can be forced by reorganizing an index [5]

runs in parallel using multiple threads

decreases significantly conversion time at a cost of extra CPU load and memory usage [5]

via: ALTER INDEX REORGANIZE command

it doesn’t prevent other sessions from inserting new data into a table [5]
deletions and data modifications would be blocked for the duration of the operation [5]

{recommendation} consider forcing index reorganization manually to reduce execution, and therefore locking, time [5]

considered fragmented if it has

multiple delta rowgroups
deleted rows

require maintenance like that of regular B-Tree indexes [5]

{issue} partially populated row groups
{issue} overhead of delta store and delete bitmap scans during query execution
rebuilding the columnstore index addresses the issues
the strategy depends on the volatility of the data and the ETL processes implemented in the system [5]

{recommendation} rebuild indexes when a table has a considerable volume of deleted rows and/or a large number of partially populated rowgroups [5]
{recommendation} rebuild partition(s) that still have a large number of rows in open delta stores after the ETL process has completed, especially if the ETL process does not use a bulk insert API [5]

creating/dropping/disabling/rebuilding functions like any other index

columnstore statistics

a statistics object is created at the time of columnstore index creation; however, it is neither populated nor updated afterward [5]

⇐ SQL Server relies on segment information, B-Tree indexes (when available), and column-level statistics when deciding if a columnstore index needs to be used [5]
it is beneficial to create missing column-level statistics on the columns that participate in a columnstore index and are used in query predicates and as join keys [5]

⇐ statistics rarely update automatically on very large tables [5]

⇒ statistics must be updated ‘manually’

[SQL Server 2019] included into the schema-only clone of a database functionality [8]

enables performance troubleshooting without the need to manually capture the statistics information
via DBCC CLONEDATABASE

[SQL Server 2019] sp_estimate_data_compression_savings supports COLUMNSTORE and COLUMNSTORE_ARCHIVE, which allows estimating the space savings if either of them is used on a table [8]

[in-memory tables]

{limitation} a columnstore index must include all the columns and can’t have a filter condition [2]
{limitation} queries on columnstore indexes run only in InterOP mode, and not in the in-memory native mode [2]

{operation} designing columnstore indexes

{best practice} understand as much as possible data’s characteristics
{best practice} identify workload’s characteristics

{operation} create a clustered columnstore index

via CREATE CLUSTERED COLUMNSTORE INDEX command
no columns need to be specified in the statement

⇐ the index will include all table columns

{operation} index rebuilding

forces SQL Server to remove deleted rows physically from the index and to merge the delta stores’ and row groups’ data [5]

all column segments are recreated with row groups fully populated [5]

[<SQL Server 2019] offline operation
[SQL Server 2019 Enterprise] online operation

⇒ higher availability
⇐ pausing and resuming create and rebuild operations are not supported [11]

very resource intensive process
holds a schema modification (Sch-M) lock on the table

⇒ prevents other sessions from accessing it [5]
⇐ the overhead can be mitigated by using table/index partitioning

⇒ indexes will be rebuilt on a per-partition basis for those partitions with volatile data [5]

{operation} index reorganization

[<SQL Server 2019] a reorganize operation is required to merge smaller COMPRESSED rowgroups, following an internal threshold policy that determines how to remove deleted rows and combine the compressed rowgroups
[SQL Server 2019] a background merge task also works to merge COMPRESSED rowgroups from where a large number of rows has been deleted

⇐ after merging smaller rowgroups, the index quality should be improved.
the tuple-mover is helped by a background merge task that automatically compresses smaller OPEN delta rowgroups that have existed for some time as determined by an internal threshold, or merges COMPRESSED rowgroups from where a large number of rows has been deleted
via: ALTER INDEX REORGANIZE command

[SQL Server 2016] performs additional defragmentation

removes deleted rows from row groups that have 10 or more percent of the rows logically deleted [5]
merges closed row groups together, keeping the total number of rows less than or equal to the rowgroup limit [5]
⇐ both processes can be done together [5]

[SQL Server 2014] the only action performed is compressing and moving the data from closed delta stores to rowgroups [5]

⇐ delete bitmap and open delta stores stay intact [5]

via: ALTER INDEX REORGANIZE

uses all available system resources while it is running [5]

⇒ speeds up the execution process
reduces the time during which other sessions cannot modify or delete data in a table [5]

close and compress all open row groups

via: ALTER INDEX REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON)
row groups aren’t merged during this operation [5]

{operation} estimate compression savings

[SQL Server 2019] COLUMNSTORE and COLUMNSTORE_ARCHIVE added

allows estimating the space savings if either of these indexes are used on a table [8]
{limitation} not available in all editions

via: sp_estimate_data_compression_savings

{operation} [bulk loads] when the number of rows is less than deltastore’s limit, all the rows go directly to the deltastore

[large bulk load] most of the rows go directly to the columnstore without passing through the deltastore

some rows at the end of the bulk load might be too few in number to meet the minimum size of a rowgroup

⇒ the final rows go to the deltastore instead of the columnstore

bulk insert operations provide the number of rows in the batch as part of the API call [5]

best results are achieved by choosing a batch size that is divisible by rowgroup’s limit [5]

⇐ guarantees that every batch produces one or several fully populated row groups [5]

⇒ reduce the total number of row groups in a table [5]
⇒ improves query performance

⇐ the batch size shouldn’t exceed rowgroup’s limit [5]

row groups can still be created on the fly, in a manner similar to a bulk insert, when the size of the insert batch is close to or exceeds the rowgroup limit [5]
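The routing of a bulk load between the columnstore and the deltastore can be approximated with a small sketch. The rowgroup limit comes from the notes; the minimum of 102,400 rows for direct compression is an assumption based on Microsoft's data loading guidance and is not stated in the notes. The function is a hypothetical simplification that ignores parallel loads, memory pressure and dictionary-related trimming.

```python
ROWGROUP_LIMIT = 1_048_576     # maximum rows per rowgroup (2^20), from the notes
MIN_COMPRESSED_ROWS = 102_400  # assumption: minimum rows for direct compression


def route_bulk_load(batch_rows, limit=ROWGROUP_LIMIT, minimum=MIN_COMPRESSED_ROWS):
    """Return (sizes of compressed rowgroups, rows left for the deltastore) for one batch."""
    full_groups, remainder = divmod(batch_rows, limit)
    compressed = [limit] * full_groups
    if remainder >= minimum:
        # large enough to be compressed directly, but only partially populated
        compressed.append(remainder)
        remainder = 0
    return compressed, remainder


print(route_bulk_load(50_000))              # small batch: everything goes to the deltastore
print(route_bulk_load(1_048_577))           # one full rowgroup, one trailing row
print(route_bulk_load(2 * ROWGROUP_LIMIT))  # multiple of the limit: only full rowgroups
print(route_bulk_load(1_500_000))           # one full and one partially populated rowgroup
```

The third call shows why a batch size that is a multiple of the rowgroup limit is attractive: it produces only fully populated rowgroups and leaves nothing behind in the deltastore.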

{operation} [non-bulk operations] trickle inserts go directly to a delta store
{feature} parallel inserts

[SQL Server 2016] requires the following conditions for parallel insert on CCI [6]

must specify TABLOCK
no NCI on the clustered columnstore index
no identity column
database compatibility level is set to 130

{recommendation} minimize the use of string columns in fact tables [5]

string data use more space
their encoding involves additional overhead during batch mode execution [5]
queries with predicates on string columns may have less efficient execution plans that also require significantly larger memory grants as compared to their non-string counterparts [5]

{limitation} [SQL Server 2012|2014] string predicates are not pushed down toward the lowest operators in execution plans
{recommendation} add another dimension table and replace the string value in the fact table with a synthetic, integer-based ID key that references the new table [5]
{operation} upgrading to SQL Server 2016

make sure that queries against the tables with columnstore indexes can utilize parallelism when the database compatibility level is less than 130 [5]

{feature} [SQL Server 2019] automated columnstore index maintenance [8]
{improvement} [SQL Server 2019] better columnstore metadata memory management
{improvement} [SQL Server 2019] low-memory load path for columnstore tables
{improvement} [SQL Server 2019] improved performance for bulk loading to columnstore indexes
{improvement} [SQL Server 2019] server startup process has been made faster for databases that use in-memory columnstore tables for HTAP
{feature} DMVs

sys.column_store_segments

returns one row for each column per segment

sys.column_store_dictionaries

provides information about the dictionaries used by a columnstore index

sys.dm_db_column_store_row_group_physical_stats

rowgroup statuses

sys.column_store_row_groups

provides clustered columnstore index information on a per-segment basis


References:
[1] SQL Docs (2020) Columnstore indexes: Overview
[2] Microsoft Learn (2024) SQL: What's new in columnstore indexes
[3] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)
[4] SQL Docs (2019) Columnstore indexes - Query performance
[5] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.
[6] Microsoft Learn (2016) Columnstore Index: Parallel load into clustered columnstore index from staging table
[7] Microsoft Learn (2016) Columnstore Index Defragmentation using REORGANIZE Command
[8] Microsoft (2018) Microsoft SQL Server 2019: Technical white paper

Acronyms:
CCI - clustered columnstore index
CI - columnstore index
DBCC - Database Console Commands
DMV - Dynamic Management View
ETL - Extract, Transform, Load
HTAP - Hybrid Transactional/Analytical Processing
LOB - Large Object
NCCI - nonclustered columnstore index
OLTP - On-Line Transaction Processing
SP - Service Pack

13 February 2025

🧊💠🗒️Data Warehousing: Table Partitioning in SQL Server [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 13-Feb-2025

[Data Warehousing] Table Partitioning

{def} the spreading of data across multiple tables based on a set of rules to balance large amounts of data across disks or nodes

data is distributed based on a function that defines a range of values for each partition [2]

the table is partitioned by applying the partition scheme to the values in a specified column [2]

{operation} partition creation

[large partitioned table]

two auxiliary nonindexed empty tables should be created with the same structure as the fact table, including constraints and data compression options [4]

first table (with check constraint): create a check constraint that guarantees that all data from the table fits exactly into one empty partition of the fact table

the constraint must be created on the partitioning column [4]
a columnstore index can be created on the fact table, as long as it is aligned with the table [4]

second table (without check constraint): for minimally logged deletions of large portions of data, a partition from the fact table can be switched to this empty table [4]

then the table can be truncated
after truncation, the table is prepared to accept the next partition from the fact table for the next minimally logged deletion [4]

for minimally logged inserts, new data should be bulk inserted into the auxiliary table that has the check constraint [4]

the INSERT operation can be minimally logged because the table is empty [4]
create a columnstore index on this auxiliary table, using the same structure as the columnstore index on the fact table [4]
switch data from this auxiliary table to a partition of the fact table [4]
drop the columnstore index on the auxiliary table, and change the check constraint to guarantee that all of the data for the next load can be switched to the next empty partition of the fact table [4]
the auxiliary table with the check constraint is then prepared for new bulk loads again [4]

{operation} [Query Optimizer] partition elimination

process in which SQL Server accesses only those partitions needed to satisfy query filters [4]
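Partition elimination can be illustrated with a minimal sketch that works out which partitions a range filter touches. It assumes that every boundary value starts a new partition (the RANGE RIGHT behavior of a T-SQL partition function) and uses hypothetical integer date keys; the function names are made up for the example and the Query Optimizer's actual logic is not modeled.

```python
from bisect import bisect_right


def partition_number(boundaries, value):
    """1-based partition for a value, assuming each boundary starts a new partition."""
    return bisect_right(boundaries, value) + 1


def partitions_for_range(boundaries, low, high):
    """Partitions that can contain rows with low <= key <= high."""
    first = partition_number(boundaries, low)
    last = partition_number(boundaries, high)
    return list(range(first, last + 1))


# Four quarterly boundaries define five partitions
boundaries = [20240101, 20240401, 20240701, 20241001]

touched = partitions_for_range(boundaries, 20240515, 20240720)
print(touched)  # the remaining partitions do not need to be read
```

The narrower the filter on the partitioning key, the more partitions are left out.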

{operation} partition switching

{definition} process that switches a block of data from one table or partition to another table or partition [4]
types of switches

reassign all data from a nonpartitioned table to an empty existing partition of a partitioned table [4]
switch a partition of one partitioned table to a partition of another partitioned table [4]
reassign all data from a partition of a partitioned table to an existing empty nonpartitioned table [4]

{benefit} improves query performance [1]

by partitioning a table across filegroups [1]

specific ranges of data can be placed on different disk spindles [1]

can improve I/O performance [1]

⇐ if the disk storage is already configured as a RAID 10 or RAID 5 array, this usually has little benefit [1]

using a mix of fast solid state storage for recent, frequently accessed data, and mechanical disks for older, less queried rows [1]

use partitioning to balance disk performance against storage costs [1]

biggest performance gain from partitioning in a data warehouse is realized when queries return a range of rows that are filtered on the partitioning key [1]

the query optimizer can eliminate partitions that are not within the filter range [1]

dramatically reduce the number of rows that need to be read [1]

reduces contention [3]

can reduce the number of rows included in a table scan [3]

{benefit} more granular manageability [1]

some maintenance operations can be performed at partition level instead of on the whole table [1]

e.g. indexes can be created and rebuilt on a per-partition basis [1]
e.g. compression can be applied to individual partitions [1]
e.g. by mapping partitions to filegroups, partitions can be backed up and restored independently [1]

enables backing up older data once and then configuring the backed up partitions as read-only [1]
future backups can be limited to the partitions that contain new or updated data [1]

{benefit} improved data load performance

enables loading many rows very quickly by switching a staging table with a partition

can dramatically reduce the time taken by ETL data loads [1]

with the right planning, it can be achieved with minimal requirements to drop or rebuild indexes [1]

{best practice} partition large fact tables

tables of around 50 GB or more
⇐ in general, fact tables benefit from partitioning more than dimension tables [1]

{best practice} partition on an incrementing date key [1]

assures that the most recent data are in the last partition and the earliest data are in the first partition [1]

{best practice} design the partition scheme for ETL and manageability [1]

the query performance gains realized by partitioning are small compared to the manageability and data load performance benefits [1]

ideally partitions should reflect the ETL load frequency

because this simplifies the load process [1]
merge partitions periodically to reduce the overall number of partitions (for example, at the start of each year) [1]

could merge the monthly partitions for the previous year into a single partition for the whole year [1]
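The yearly merge can be pictured as removing the boundaries that lie inside the previous year. In SQL Server this is done with ALTER PARTITION FUNCTION ... MERGE RANGE; the sketch below only models the effect on boundaries and rows. The yyyymm keys, the sample rows and the `merge_range` helper are assumptions made for illustration.

```python
def merge_range(boundaries, partitions, boundary):
    """Remove one boundary and combine the two partitions it separated."""
    idx = boundaries.index(boundary)
    merged = partitions[idx] + partitions[idx + 1]
    new_boundaries = boundaries[:idx] + boundaries[idx + 1:]
    new_partitions = partitions[:idx] + [merged] + partitions[idx + 2:]
    return new_boundaries, new_partitions


# Hypothetical yyyymm boundaries: one per month of 2023 plus the start of 2024.
boundaries = [202301 + m for m in range(12)] + [202401]
# Empty first partition, two sample rows per month of 2023, one row for 2024.
partitions = [[]] + [[k, k] for k in boundaries[:12]] + [[202401]]

# Merge every boundary strictly inside 2023, keeping the year's outer bounds.
for boundary in [b for b in boundaries if 202301 < b < 202401]:
    boundaries, partitions = merge_range(boundaries, partitions, boundary)

print(boundaries)
print([len(p) for p in partitions])
```

Twelve monthly partitions collapse into one yearly partition while the rows themselves are untouched.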

{best practice} maintain an empty partition at the start and end of the table [1]

simplifies the loading of new rows [1]
when new records need to be added, split the empty partition

⇐ to create two empty partitions

then switch the staged data with the first empty partition [1]

⇐ loads the data into the table and leaves the second empty partition you created at the end of the table, ready for the next load [1]

a similar technique can be used to archive or delete obsolete data at the beginning of the table [1]
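The split-then-switch load described above can be modeled as follows. In SQL Server the steps correspond to ALTER PARTITION FUNCTION ... SPLIT RANGE and ALTER TABLE ... SWITCH; the `PartitionedTable` class, its method names and the yyyymmdd sample keys are hypothetical and exist only for this sketch. The range check in `switch_in` stands in for the requirement that staged rows fit the target partition.

```python
from bisect import bisect_right


class PartitionedTable:
    """Toy model of a RANGE RIGHT partitioned table keyed on a yyyymmdd integer."""

    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)
        # n boundaries always define n + 1 partitions
        self.partitions = [[] for _ in range(len(self.boundaries) + 1)]

    def key_range(self, number):
        """(low, high) bounds of a 1-based partition; None means unbounded."""
        low = self.boundaries[number - 2] if number > 1 else None
        high = self.boundaries[number - 1] if number <= len(self.boundaries) else None
        return low, high

    def split(self, boundary):
        """Add a boundary; the partition containing it becomes two partitions."""
        if boundary in self.boundaries:
            raise ValueError("boundary already exists")
        idx = bisect_right(self.boundaries, boundary)
        rows = self.partitions[idx]
        self.boundaries.insert(idx, boundary)
        self.partitions[idx:idx + 1] = [
            [k for k in rows if k < boundary],
            [k for k in rows if k >= boundary],
        ]

    def switch_in(self, staged_rows, number):
        """Swap staged rows into an empty partition after a range check."""
        if self.partitions[number - 1]:
            raise ValueError("target partition must be empty")
        low, high = self.key_range(number)
        for key in staged_rows:
            if (low is not None and key < low) or (high is not None and key >= high):
                raise ValueError(f"key {key} does not belong to partition {number}")
        self.partitions[number - 1] = list(staged_rows)


# Empty partitions at both ends, January and February already loaded.
fact = PartitionedTable([20240101, 20240201, 20240301])
fact.switch_in([20240105, 20240120], 2)
fact.switch_in([20240203, 20240228], 3)

# Load March: split the empty last partition, then switch the staged rows
# into the first of the two resulting empty partitions.
staged_march = [20240302, 20240315, 20240331]
fact.split(20240401)
fact.switch_in(staged_march, len(fact.partitions) - 1)

print([len(p) for p in fact.partitions])
```

Because the split happens on an empty partition, no rows move, and the table ends with a fresh empty partition ready for the next load.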

{best practice} choose the proper granularity

it should be aligned to the business requirements [2]

{best practice} create at least one filegroup in addition to the primary one

set it as the default filegroup

data tables are thus separated from system tables [2]

create dedicated filegroups for extremely large fact tables [2]

place the fact tables on their own logical disks [2]

use a file and a filegroup separate from the fact and dimension tables [2]

{exception} staging tables that will be switched with partitions to perform fast loads [2]

staging tables must be created on the same filegroup as the partition with which they will be switched [2]

{def} partition scheme

a scheme that maps partitions to filegroups

{def} partition function
- object that maps rows to partitions by using values from specific columns (aka partitioning columns)
- performs logical mapping
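The two definitions above can be illustrated with a small sketch: the partition function does the logical mapping (value to partition number) and the partition scheme does the physical one (partition number to filegroup). The boundary values, the filegroup names and the helper functions are assumptions made for illustration; RANGE LEFT and RANGE RIGHT differ only in which side a boundary value falls on.

```python
from bisect import bisect_left, bisect_right


def make_partition_function(boundaries, range_right=True):
    """Return a function mapping a key value to a 1-based partition number.

    range_right=True: a boundary value belongs to the partition on its right.
    range_right=False: a boundary value belongs to the partition on its left.
    """
    ordered = sorted(boundaries)
    locate = bisect_right if range_right else bisect_left
    return lambda value: locate(ordered, value) + 1


# Hypothetical integer date keys (yyyymmdd).
pf_right = make_partition_function([20240101, 20250101], range_right=True)
pf_left = make_partition_function([20240101, 20250101], range_right=False)

# The scheme needs one filegroup per partition (boundaries + 1 entries);
# the filegroup names are made up.
partition_scheme = {1: "FG_ARCHIVE", 2: "FG_2024", 3: "FG_CURRENT"}


def filegroup_for(value, partition_function=pf_right, scheme=partition_scheme):
    """Resolve a key value to its filegroup via function, then scheme."""
    return scheme[partition_function(value)]


print(pf_right(20240101), pf_left(20240101))  # same boundary value, two outcomes
print(filegroup_for(20240615))
```

With date keys, RANGE RIGHT is usually the more intuitive choice, because the first day of a period then lands in that period's partition.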
{def} aligned index

index built on the same partition scheme as its base table [4]

if all indexes are aligned with their base table, switching a partition is a metadata operation only [4]

⇒ it’s very fast [4]


References:
[1] 20467A - Designing Business Intelligence Solutions with Microsoft SQL Server 2012

[2] 20463C - Implementing a Data Warehouse with Microsoft SQL Server

[3] 10777A - Implementing a Data Warehouse with Microsoft SQL Server 2012

[4] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)

[5] Microsoft Learn (2009) How to Implement an Automatic Sliding Window in a Partitioned Table on SQL Server 2005 [link]

02 February 2025

🏭 💠Data Warehousing: Microsoft Fabric (Part VIII: More on SQL Databases)

Business Intelligence Series

Last week Microsoft had a great session [1] on the present and future of SQL databases, a “light” version of Azure SQL databases designed for Microsoft Fabric environments, respectively workloads. SQL databases are currently available for testing, and after the first tests the product looks promising. Even if there are several feature gaps, it’s expected that Microsoft will bridge the gaps over time. Conversely, there might be features that don’t make sense in Fabric, respectively new features that need to be considered for facilitating the work in OneLake and its ecosystem.

During the session several Microsoft professionals answered the audience’s questions, and they did a great job. Even if the answers and questions barely scratched the surface, they offered some insight into what Microsoft wants to do. The expectation is probably that SQL databases won’t need any administration, with indexes maintained automatically and infrastructure scaling as needed. However, this sounds too good to be true if one considers the general experience with RDBMS: the devil usually hides in the details.

Even if the solutions built follow the best practices in the field, which frankly seldom happens, transferring the existing knowledge to Fabric may encounter some important challenges revolving around performance, flexibility, accessibility and probably costs. Even if SQL databases are expected to fill some minor gaps, considering the lessons of the past, such solutions can easily grow. Even if a lot of processing power is thrown at the SQL queries and the various functionality, customers still need to write quality code and refactor it, otherwise the costs will explode sooner or later.

As practice has proven many times, when troubleshooting performance issues one sometimes needs to use the whole arsenal available (DBCC, DMVs and sometimes even undocumented features) to get a better understanding of what’s happening. Some voices state that developers don’t need to know how the SQL engine works. However, blindly applying solutions from a recipe improves the code only by accident, and most likely doesn’t exploit the full potential available. Unfortunately, this is a subjective topic without hard numbers to support it, and the stories told by developers and third parties usually don’t tell the whole story.

It’s also true that diving deep into a database’s internal workings requires time that’s quite often not available, and such an effort doesn’t necessarily pay off. Beyond this, there’s a software engineer’s aim of understanding how things work. Otherwise, one should drop the engineering word and just call it coding. Conversely, the data citizen just needs a high-level knowledge of how things work, though the past 20-30 years proved that that’s often not enough. The more people lack the required knowledge, the higher the chances that code needs refactoring. Just remember the past issues organizations had with MS Access and Excel when people started to create their own solutions, the whole infrastructure being invaded by poorly designed solutions that continue to haunt some organizations even today.

Even if a lot of technical knowledge can be transferred to Microsoft Fabric, the new environments still require adequate tools for monitoring and troubleshooting. Microsoft seems to work in this direction, though from the information available the tools don’t and can’t offer the whole perspective. It will be interesting to see how much the current and future dashboards and reports can help, and which important gaps will surface. Until the gaps are addressed, the SQL professional will probably have to rely on SQL scripts and the DMVs available. All this can be summarized in a few words: it will not be boring!


References:
[1] Microsoft Reactor (2025) Ask The Expert - Fabric Edition - Fabric Databases [link]

07 May 2024

🏭🗒️Microsoft Fabric: The Metrics Layer [Notes] 🆕

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 07-May-2024

The Metrics Layer in Microsoft Fabric (adapted diagram)

[new feature] Metrics Layer (Metrics Store)

{definition} an abstraction layer available between the data store(s) and end users which allows organizations to create standardized business metrics, that are rooted in measures and are discoverable and intended for reuse

⇐ {important} feature still in private preview

{goal} extend existing infrastructure

{benefit} leverages and extends existing features

{goal} provide consistent definitions and descriptions [1]

consistent definitions that include besides business logic additional dimensions and filters [1]
⇒ {benefit} allows standardizing the metrics across the organization
⇒ {benefit} allows enforcing a SSoT

{goal} easy management

via management views
[feature] lineage
[feature] source control
[feature] duplicate identification
[feature] push updates to downstream uses of the metrics

{goal} searchable and discoverable metrics

{feature} integration

based on Sempy fabric package

⇐ a dataframe for storage and propagation of Power BI metadata which is part of the Python-based Semantic Link in Fabric

{goal} trust

[feature] trust indicators
{benefit} facilitates the adoption of reports

{feature} metric set

{definition} a Fabric item that groups together a set of metrics into a mini-model
{benefit} reduces the overall complexity of semantic models, while being easy to evolve and consume
associated with a single domain

⇒ supports the data mesh architecture

shareable

can be shared with other users

{action} create metric set

creates the actual artifact, to which metrics can be added

{feature} metric

{definition} a way to elevate the measures from the various semantic models existing in the organization
tied to the original semantic model

⇒ {benefit} allows tracking how a metric is used across the solutions

reusable

can be reused in other fabric artifacts

new reports on the Power BI service
notebooks

by copying the code

can be reused in Power BI

via OneLake data hub menu element

can be chained

changes are propagated downstream

materializable

its output can be persisted to OneLake by saving it as a delta table in a lakehouse
{misuse} data is persisted unnecessarily

{action} elevate metric

copies measure's definition and description
⇒ implies restructuring, refactoring, moving, and testing a lot of code in the process
{misuse} data professionals build everything as metrics

{action} update metric
{action} add filters to metric
{action} add dimensions to metric
{action} materialize metric


References:
[1] Power BI Tips (2024) Explicit Measures Ep. 236: Metrics Hub, Hot New Feature with Carly Newsome (link)
[2] Power BI Tips (2024) Introducing Fabric Metrics Layer / Power Metrics Hub [with Carly Newsome] (link)

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
SSoT - single source of truth

06 May 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part III: The Metrics Layer) 🆕

Introduction

One of the announcements at this year's first Microsoft Fabric Community Conference was the introduction of a metrics layer in Fabric, which "allows organizations to create standardized business metrics, that are rooted in measures and are discoverable and intended for reuse" [1]. It seems that the information provided at the conference was kept to a minimum given that the feature is still in private preview, though several webcasts are starting to catch up on the topic (see [2], [4]). Moreover, as part of their show, the Explicit Measures (@PowerBITips) hosts had Carly Newsome, the manager of the project, as a guest. She unveiled more details about the project and the feature, which became the main source for the information below.

The idea of a metrics layer or metrics store is not new; data professionals occasionally refer to their structure(s) of metrics as such. The terms gained weight in their modern conception relatively recently, in 2021-2022 (see [5], [6], [7], [8], [10]). Within the modern data stack, a metrics layer or metrics store is an abstraction layer available between the data store(s) and end users. It allows business metrics to be defined, stored, and managed centrally. Thus, it allows us to standardize the metrics and enforce a single source of truth (SSoT), and it solves several issues existing in the data stacks. As Benn Stancil remarked earlier, the metrics layer is one of the missing pieces of the modern data stack (see [10]).

Microsoft's Solution

Microsoft's business case for the metrics layer's implementation is based on three main ideas: (1) duplicate measures contribute to poor data quality, (2) complex data models hinder self-service, and (3) data silos in Power BI need to be reduced. In Microsoft's conception, the metrics layer provides several benefits: consistent definitions and descriptions, easy management via management views, searchable and discoverable metrics, and trust assured through indicators.

For this feature's implementation Microsoft introduces a new Fabric item called a metric set, which groups several (business) metrics together as part of a mini-model that can be tailored to the needs of a subset of end-users and accessed by them via the standard tools already available. Such mini-models break down and reduce the overall complexity of semantic models, while being easy to evolve and consume. The challenge then becomes how to break down existing and future semantic models into nonoverlapping mini-models, creating in extremis a partition (see the Lego metaphor for data products). The idea of mini-models is not new: [12] advocates a Master Model, a technique for creating derivative tabular models based on a single tabular solution.

A (business) metric is a way to elevate the measures from the various semantic models existing in the organization within the mini-model defined by the metric set. A metric can be reused in other Fabric artifacts, currently in new reports on the Power BI service and in notebooks by copying the code. Reusing metrics in other measures means that one can chain metrics, and the changes made are propagated downstream.

The Metrics Layer in Microsoft Fabric (adapted diagram)

Every metric is tied to the original semantic model, which makes it possible to track how a metric is used across the solutions and, looking forward to Purview, to identify the data's lineage. A measure is related to a "table", the source from which the measure came.

Users' Perspective

The Metrics Layer feature is available in the Microsoft Fabric service for Power BI within the Metrics menu element next to Scorecards. One starts by creating a metric set in an existing workspace, an operation which creates the actual artifact, to which the individual metrics are added. To create a metric, a user with build permissions can navigate through the semantic models across the different workspaces they have access to, pick a measure from one of them and elevate it to a metric, copying the measure's definition and description in the process. In this way the metric will always point back to the measure from the semantic model, while the metrics thus created are considered a related collection and can be shared accordingly.

Once a metric is added to the metric set, one can add dimensions to it in edit mode (e.g. Date, Category, Product Id, etc.). One can then further explore a metric's output and add filters (e.g. concentrate on only one product or category), from which point one can slice and dice the data as needed.

There is a panel where one can see where the metric has been used (e.g. in reports, scorecards, and other integrations), when it was last refreshed, and how many times it was used. Thus, one has the most important information in one place, which is great for developers as well as for users. Probably, other metadata will be added, such as whether an increase in the metric would be favorable or unfavorable (like in Tableau Pulse, see [13]), levels of criticality, a unit of measure, or its type - simple metric, performance indicator (PI), result indicator (RI), KPI, KRI etc.

Metrics can be persisted to OneLake by saving their output to a delta table in the lakehouse. As demonstrated in the presentation(s), with just a copy-paste and a small piece of code one can materialize the data into a lakehouse delta table, from where the data can be reused as needed. Hopefully, the process will be further automated.

One can also consume metrics and metric sets in Power BI Desktop, where a new menu element called Metric sets was added under the OneLake data hub. It can be used to connect to a metric set from a semantic model and select the metrics needed for the project.

Tapping into the available Power BI solutions is done via an integration feature based on the Sempy fabric package, a dataframe for storage and propagation of Power BI metadata which is part of the Python-based Semantic Link in Fabric [11].

Further Thoughts

When dealing with a new feature, a natural question comes to mind: what challenges does the feature involve, and how can it be misused? Given that the metrics layer can be built within a workspace and that it can tap into the existing measures, one can build on the existing infrastructure. However, this can imply restructuring, refactoring, moving, and testing a lot of code in the process, hopefully with minimal implications for the solutions already available. Whether the process is as simple as imagined is another story. As for misuse, in extremis data professionals might start building everything as metrics, though the real danger comes when the data is persisted unnecessarily.

From a data mesh's perspective, a metric set is associated with a domain, though there will be metrics and data common to multiple domains. Moreover, a mini-model has the potential of becoming a data product. Distributing the logic across multiple workspaces and domains can add further challenges, especially concerning the synchronization and implementation of requirements in a way that doesn't lead to bottlenecks. But this is a general challenge for the development team(s).

The feature will probably undergo further changes until it is released in public preview (probably by September or the end of the year). I share other data professionals' opinion that the feature was long needed and that it can have an important impact on the solutions built.


References:
[1] Microsoft Fabric Blog (2024) Announcements from the Microsoft Fabric Community Conference (link)
[2] Power BI Tips (2024) Explicit Measures Ep. 236: Metrics Hub, Hot New Feature with Carly Newsome (link)
[3] Power BI Tips (2024) Introducing Fabric Metrics Layer / Power Metrics Hub [with Carly Newsome] (link)
[4] KratosBI (2024) Fabric Fridays: Metrics Layer Conspiracy Theories #40 (link)
[5] Chris Webb's BI Blog (2022) Is Power BI A Semantic Layer? (link)
[6] The Data Stack Show (2022) TDSS 95: How the Metrics Layer Bridges the Gap Between Data & Business with Nick Handel of Transform (link)
[7] Sundeep Teki (2022) The Metric Layer & how it fits into the Modern Data Stack (link)
[8] Nick Handel (2021) A brief history of the metrics store (link)
[9] Aurimas (2022) The Jungle of Metrics Layers and its Invisible Elephant (link)
[10] Benn Stancil (2021) The missing piece of the modern data stack (link)
[11] Microsoft Learn (2024) Sempy fabric Package (link)
[12] Michael Kovalsky (2019) Master Model: Creating Derivative Tabular Models (link)
[13] Christina Obry (2023) The Power of a Metrics Layer - and How Your Organization Can Benefit From It (link)
[14] KratosBI (2024) Introducing the Metrics Layer in #MicrosoftFabric with Carly Newsome [link]

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

18 April 2024

🏭Data Warehousing: Microsoft Fabric (Part II: Data(base) Mirroring) [New feature]

Data Warehousing Series

Microsoft recently announced [4] the preview of a new Fabric feature called Mirroring, a low-cost, low-latency, fully managed service that replicates data from various systems into OneLake [1]. Currently only Azure SQL Database, Azure Cosmos DB, and Snowflake are supported, though probably more database vendors will be targeted soon.

For Microsoft Fabric's data engineers, data scientists and data warehouse professionals this feature is hugely important because they no longer need to care about making the data available in Microsoft Fabric, a task that involves a considerable amount of work.

Usually, at least for flexibility, transparency, performance and standardization, data professionals prefer to extract the data 1:1 from the source systems into a landing zone in the data warehouse or data/delta lake, from where the data are further processed as needed. One data pipeline is thus built for every table in scope, which is sometimes a 10-15 minute effort per table when the process is standardized, though in some cases the effort is much higher if troubleshooting (e.g. data type incompatibility or support) or further logic changes are involved. Maintaining such data pipelines can prove to be costly over time, especially when periodic changes are needed.

Microsoft lists other downsides of the ETL approach: restricted access to data changes, friction between people, processes, and technology, the effort needed to create the pipelines, and the time needed for importing the data [1]. There's some truth in each of these points, though everything is relative. For big tables, however, refreshing all the data overnight can prove to be time-consuming and costly, especially when the data don't lie within the same region or data center. Unless the data can be refreshed incrementally, the night runs can extend into the day, with all the implications that derive from this, such as not having current data, which decreases the trust in reports. There are tricks to speed up the process, though there are limits to what can be done.

With mirroring, the replication of data between data sources and the analytics platform is handled in the background. After an initial replication, the changes in the source systems are reflected with near real-time latency in OneLake, which is amazing! This allows building near real-time reporting solutions which can help the business in many ways - reviewing (and correcting in the data source) records en masse, faster overview of what's happening in the organization, faster basis for decision-making, etc. Moreover, the mechanism is fully managed by Microsoft, which is thus responsible for making sure that the data are correctly synchronized. From this perspective alone, the effort of building an analytics solution is probably reduced by 10-20%.

Mirroring in Microsoft Fabric (adapted after [2])

According to the documentation, one can replicate a whole database or choose individual regular tables (currently views aren't supported [3]), and stop, restart, or remove a table from a mirroring. Moreover, through sharing, users can grant other users or groups of users access to a mirrored database without giving access to the workspace and the rest of its items [1].

Data professionals and citizens can then write cross-database queries against the mirrored databases, warehouses, and the SQL analytics endpoints of lakehouses, combining data from all these sources into a single T-SQL query. This opens a lot of opportunities, especially concerning the creation of an enterprise semantic model, which should be differentiated from the semantic model created by default by the mirroring together with the SQL analytics endpoint.

Considering that the data is replicated into delta tables, one can take advantage of all the capabilities available with such tables - data versioning, time travel, interoperability and/or performance, respectively direct consumption in Power BI.


References:
[1] Microsoft Learn - Microsoft Fabric (2024) What is Mirroring in Fabric? (link)
[2] Microsoft Learn - Microsoft Fabric (2024) Mirroring Azure SQL Database [Preview] (link)
[3] Microsoft Learn - Microsoft Fabric (2024) Frequently asked questions for Mirroring Azure SQL Database in Microsoft Fabric [Preview] (link)
[4] Microsoft Fabric Updates Blog (2024) Announcing the Public Preview of Mirroring in Microsoft Fabric, by Charles Webb (link)

22 March 2024

🧭Business Intelligence: Monolithic vs. Distributed Architecture (Part III: Architectural Applications)

Business Intelligence Series

Now considering the 500 houses and the skyscraper model introduced in the previous post, which do you think will be built first? A skyscraper takes 2-10 years to build, depending on the city in which it is built and the architecture characteristics. A house may take 6-12 months depending on similar factors. But one needs to build 500 houses. For sure the process can be optimized when the houses look the same, though there are many constraints one needs to consider - the number of workers, tools, and the construction material available at a given time, the volume of planning, etc.

As a rough estimate, it can take 2-5 years for each architecture to be built, considering that on average the advantages and disadvantages from the various areas can balance each other out. Historical data are in general needed for estimating the actual development time. One can start with a rough estimate and revise the estimates up and down as more information is gathered. This usually happens in Software Engineering as well.

Monolith vs. Distributed Architecture - 500 families

There are multiple ways in which the work can be assigned to the contractors. When the houses are split between domains, each domain can have its own contractor(s) or the contractors can be specialized by knowledge areas, or a combination of the two. Contractors’ performance should be the same, though in practice no two contractors are the same. Conversely, the chances are higher for some contractors to deliver at the expected quality. It would be useful to have worked before with the contractors and have a partnership that spans years back. There are risks on both sides, even if the risks might favor one architecture over the other, and this depends also on the quality of the contractors, designs, and planning.

The planning must be good if not perfect to assure smooth development and each day can cost money when contractors are involved. The first planning must be done for the whole project and then split individually for each contractor and/or group of buildings. A back-and-forth check between the various plans is needed. Managing by exception can work, though it can also go terribly wrong.

A lot of communication must occur between domains to make sure that everything fits together. Especially at the beginning, all the parties must plan together and must make sure that the rules of the game (best practices, policies, procedures, processes, methodologies) are agreed upon. Oversight (governance) needs to happen at a small scale as well as in aggregate to make sure that the rules of the game are followed.

Now, which of the architectures do you think will fit a data warehouse (DWH)? Probably multiple voices will opt for the skyscraper, at least this is how a DWH looks from the outside. However, when one evaluates the architecture behind it, it can resemble a residential complex in which parts are bound together, but there are parts that can be distributed if needed. For example, in a DWH the HR department has its own area that's isolated from the other areas as it has higher security demands. There can be 2-3 other areas that don't share objects, and they can be distributed as well. The reasons why all infrastructure is on one machine are the costs associated with the licenses and the fact that the reporting tools point to only one address.

In data mart-based DWHs, there are multiple buildings within the architecture, and thus the data marts can be distributed across a wider infrastructure, with each domain responsible for its own data mart(s). The data marts are by definition domain-dependent, and this is one of the downsides attributed to this architecture.


🧭Business Intelligence: Monolithic vs. Distributed Architecture (Part II: Architectural Choices)

Business Intelligence Series

One metaphor that can be used to understand the difference between monolith and distributed architectures, respectively between data warehouses and data mesh-based architectures as per Dehghani’s definition [1], is the following: suppose that you need to accommodate 500 families (the data products to be built). There are several options: (1) build a skyscraper (developing vertically); (2) build a complex of high buildings, developing horizontally and vertically while finding a balance between the two; (3) split (aka distribute) the second option and create several buildings; (4) build a house for each family, creating a village or a neighborhood.

Monolith vs. Distributed Architecture - 500 families

(1) and (2) fit the definition of monoliths, while (3) and (4) are distributed architectures, though also in (3) one of the buildings can resemble a monolith if one chooses different architectures and heights for the buildings. For houses one can use a single architecture, agree on a set of predefined architectures, or have an architecture for each house, so that houses would look alike only by chance. One can also opt to have the same architecture for the buildings belonging to the same neighborhood (domain or subdomain). Moreover, the development could be split between multiple contractors that adhere to the same standards.

If the land is expensive, for example in big, overpopulated cities, and the infrastructure and the terrain allow it, one can build entirely vertically, i.e. a skyscraper. If the land is cheap one can build a house for each family. The other architectures can be considered for everything in between.

A skyscraper is easier for externals to find (mailmen, couriers, milkmen, and other service providers) though will need a doorman to interact with them and probably a few other resources. Everybody will have the same address except the apartment number. There must be many elevators and the infrastructure must allow the flux of utilities up and down the floors, which can be challenging to achieve.

Within a village every person who needs to deliver or pick up something needs to traverse parts of the village. There are many services that need to be provided for both scenarios, though the difference will be the time that's needed to move between addresses. In the virtual world this shouldn't matter unless one needs to inspect each house to check and/or retrieve something. The network of streets and the flux of utilities must scale with the population of the area.

A skyscraper will need materials of high quality that resist the various forces that act on the building even in the most extreme situations. The same can't be said about a house, which in theory needs more materials, though a less solid foundation, and whose construction specifications are more relaxed. Moreover, a house needs smaller tools and is easier to build, unless each house has its own design.

A skyscraper can host the families only when the construction is finished and the needed certificates are approved. The same can be said about houses, but the effort and time are considerably smaller, though the utilities must also be available, and they can have their own timeline.

The model is far from perfect, though it allows us to reason about how changing the architecture affects various aspects. It doesn't reflect reality because there's a big difference between the physical and the virtual world. E.g., parts of the monolith can be used productively much earlier (though the core functionality might become available later), one doesn't need construction material but needs tools, the infrastructure must be available first, etc. Conversely, functional prototypes must be available beforehand, the needed skillset and a set of assumptions and other requirements must be met, etc.


References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
