Showing posts sorted by date for query Data warehousing. Sort by relevance Show all posts
Showing posts sorted by date for query Data warehousing. Sort by relevance Show all posts

16 February 2025

💠🛠️🗒️SQL Server: Columnstore Indexes [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources. It considers only on-premise SQL Server, for other platforms please refer to the documentation.

Last updated: 15-Feb-2024

[SQL Server] columnstore indexes (CI)

  • {def} a technology for storing, retrieving and managing data by using a columnar data format (aka columnstore
    • store compressed data on a per-column rather than a per-row basis [5]
  • {benefit} designed for analytics and data warehousing workloads
    • data warehousing
      • {scenario} store fact tables and large dimension tables
        • ⇐ tend to require full table scans rather than table seeks
    • analytics workloads
      • {scenario} [SQL Server 2016 SP1] can be used for real-time analytics on operational databases
        • ⇐ an updatable nonclustered columnstore index can be created on a rowstore table
  • {benefit} performance increase 
    • can achieve up to 100x better performance [4]
    • offers an order of magnitude better performance than a rowstore index
      • {feature} uses batch mode execution
        • improves query performance typically by two to four times
      • have high performance gains for analytic queries that scan large amounts of data, especially on large tables (>1 million rows) 
  • {benefit} reduces significantly the data warehouse storage costs
    • {feature} data compression
      • ⇒ provides high compression rates, typically by 10 times
        • ⇒ reduces total I/O from the physical media
          • ⇐ queries often select only a few columns from a table
          • minimizes or eliminates system I/O bottlenecks
        • reduces significantly the memory footprint
          • ⇒ query performance can improve 
            • because SQL Server can perform more query and data operations in memory
  • {benefit} built in memory
    • ⇒ sufficient memory must be available 
  • {benefit} part of the database engine
    • no special hardware is needed
  • {concept} columnstore 
    • {def} data structure logically organized as a table with rows and columns, and physically stored in a column-wise data format
      • stores values from the same domain which commonly have similar values
        • when a query references a column, then only that column is fetched from disk [3]
          • ⇐ the columns not requested are skipped 
            • ⇒ they are not loaded into memory 
        • when a query is executed, the rows must be reconstructed
          • ⇒ row reconstruction takes some time and uses some CPU and memory resources [3]
        • [SQL Server 2016] columnstore index on rowstore tables
          • columnstore is updated when data changes in the rowstore table
        • both indexes work against the same data
  • {concept}rowstore
    • {def} data that's logically organized as a table with rows and columns, and physically stored in a row-wise data format
      • ⇐ the traditional way to store relational table data
      • refers to a table where the underlying data storage format is either
        • a heap
        • a clustered index
        • a memory-optimized table
  • {concept} rowstore index
    • performs best on queries that seek into the data, when searching for a particular value, or for queries on a small range of values
      • ⇒ appropriate for transactional workloads 
        • because they tend to require mostly table seeks instead of table scans
  • {concept} rowgroup
    • {def} a group of rows that are compressed into columnstore format at the same time
      • {constraint} has a maximum number of rows per rowgroup, which is 1,048,576 =2^20 rows
      • contains one column segment for every column in the table
      • can have more than one delta rowgroup that form the deltastore
        • e.g. when multiple threads create columnstore indexes using parallel execution plans [5]
          • ⇐ each thread will work with its own subset of data, creating separate rowgroups [5]
        • [partitions] each table partition has its own set of row groups [5]
          • ⇐  too many partitions may prevent workloads from benefiting from a CCI [11]
            • ⇐ data aren’t pushed into a compressed columnstore segment until the rowgroup limit is reached
    • {event} rowgroup is compressed
      • marked as read-only [16]
      • a compressed rowgroup is considered as fragmented when either 
        • row number < rowgroup limit but dictionary size reached the maximum
          • nothing can be done to increase the number of rows [15]
          • the trim_reason is other than DICTIONARY_SIZE
        • it has nonzero deleted rows that exceeds a minimum threshold [15]
    • {event} all data from rowgroup deleted 
      • transitions from COMPRESSED into TOMBSTONE state
      • later removed by the tuple-mover background process
    • {event} rows in the columnstore indexes can be moved to different locations
      • row-id in the nonclustered indexes aren’t updated 
        • ⇐ the mappings between old and new row locations are stored in an internal structure (aka mapping index) 
    • {event} rowgroup build
      • all column data are combined on a per-row group basis, encoded and compressed [5]
        • the rows within a row group can be rearranged if that helps to achieve a better compression rate [5]
  • {feature} data compression
    • the table is sliced into rowgroups, and each rowgroup is compresses in a column-wise manner
      • the number of rows in the rowgroup must be 
        • large enough to improve compression rates
        • small enough to benefit from in-memory operations
          • having too many small rowgroups decreases columnstore index’s quality
    • uses its own compression mechanism 
      • ⇒ row or page compression cannot be used on it [3]
      • [SQL Server 2016] page compression has been removed
        • ⇐ in some cases, page compression disallowed the creation of columnstore indexes with a very large number of columns [5]
  • {feature} compression delay
    • computed when a delta rowgroup is closed [7]
    • keeps the ‘active’ rows in delta rowgroup and only transition these rows to compressed rowgroup after a specified delay [7]
      • ⇐ reduces the overall maintenance overhead of NCCI [7]
      • ⇒ leads to a larger number of delta rowgroups [7]
    • {best practice} if the workload is primarily inserting data and querying it, the default COMPRESSION_DELAY of 0 is the recommended option [7]
    • {best practice} [OLTP workload] if > 10% rows are marked deleted in recently compressed rowgroups, then consider a value that accommodates the behavior [7]
      • via: create nonclustered columnstore index with (compression_delay= 150)
  • {feature} data encoding
    • all values in the data are replaced with 64-bit integers using one of two encoding algorithms
    • {concept} dictionary encoding
      • stores distinct values from the data in a separate structure (aka dictionary} 
        • every value in a dictionary has a unique ID assigned [5]
          • the ID is used for replacement
    • {concept} global dictionary
      • shared across all segments that belong to the same index partition [5]
    • {concept} local dictionary
      • created for individual segments using values that are not present in the global dictionary
    • {concept} value-based encoding
      • mainly used for numeric and integer data types that do not have enough duplicated values [5]
        • dictionary encoding would be inefficient [5]
      • converts integer and numeric values to a smaller range of 64-bit integers in 2 steps
        • {step} [numeric data types] are converted to integers using the minimum positive exponent (aka magnitude that allows this conversion) [5]
          • {goal} convert all numeric values to integers [5]
          • [integer data types] the smallest negative exponent is chosen that can be applied to all values without losing their precision [5]
            • {goal} reduce the interval between the minimum and maximum values stored in the segment [5]
        • {step} the minimum value (aka base value) in the segment is identified and subtracted it from all other values [5]
          • ⇒ makes the minimum value in the segment number 0 [5]
      • after encoding the data are compressed and stored as a LOB allocation unit
  • {concept} column segment 
    • {def} a column of data from within the rowgroup
    • is compressed together and stored on physical media
    • SQL Server loads an entire segment to memory when it needs to access its data
  • {concept} segment metadata 
    • store metadata about each segment 
      • e.g. minimum and maximum values
      • ⇐ segments that do not have the required data are skipped [5]
  • {concept} deltastore
    • {def} all of the delta rowgroups of a columnstore index
    • its operations are handled behind the scenes
      • can be in either states
        • {state} open (aka open delta store) 
          • accepts new rows and allow modifications and deletions of data
        • {state} closed (aka closed data store)
          • a delta store is closed when it reaches its rowgroup limit
  • {concept} delta rowgroup 
    • {def} a clustered B-tree index that's used only with columnstore indexes
    • improves columnstore compression and performance by storing rows until the number of rows reaches the rowgroup limit and are then moved into the columnstore
    • {event} reaches the maximum number of rows
      • it transitions from an ‘open’ to ‘closed’ state
      • a closed rowgroup is compressed by the tuple-mover and stored into the columnstore as COMPRESSED rowgroup
    • {event} compressed
      • the existing delta rowgroup transitions into TOMBSTONE state to be removed later by the tuple-mover when there is no reference to it
  • {concept} tuple-mover 
    • background process that checks for closed row group
      • if it finds a closed rowgroup, it compresses the delta rowgroup and stores it into the columnstore as a COMPRESSED rowgroup
  • {concept} clustered columnstore index (CCI) 
    • is the primary storage for the entire table
    • {characteristic) updatable
      • has two structures that support data modifications
        • ⇐ both use the B-Tree format to store data [5]
        • ⇐ created on demand [5]
        • delete bitmap
          • indicates which rows were deleted from a table
          • upon deletion the row continues to be stored into the rowgroup
          • during query execution SQL Server checks the delete bitmap and excludes deleted rows from the processing [5]
        • delta store
          • includes newly inserted rows
          • updating a row triggers the deletion of the existing row and insertion of a new version of a row to a delta store
            • ⇒ the update does not change the row data
            • ⇒ the updated row is inserted to a delete bitmap
        • [partitions] each partition can have a single delete bitmap and multiple delta stores
          • ⇐ this makes each partition self-contained and independent from other partitions
            • ⇒ allows performing a partition switch on tables that have clustered columnstore indexes defined [5]
    • {feature} supports minimal logging for batch sizes >= rowgroup’s limit [12]
    • [SQL Server 2017] supports non-persisted computed columns in clustered columnstore indexes [2]
    • store some data temporarily into a clustered index (aka deltastore) and a btree list of IDs for deleted rows
      • ⇐ {benefit} reduces fragmentation of the column segments and improves performance
      • combines query results from both the columnstore and the deltastore to return the correct query results
    • [partitions] too many partitions can hurt the performance of a clustered columnstore index [11]
  • {concept} nonclustered columnstore index (NCCI)
    • {def} a secondary index that's created on a rowstore table
      • is defined as one or more columns of the table and has an optional condition that filters the rows
      • designed to be used for workloads involving a mix of transactional and analytics workload*
      • functions the same as a clustered columnstore index
        • ⇐ has same performance optimizations (incl. batchmode operators)
        • {exception} doesn’t supports persisted computed columns
          • can’t be created on a columnstore index that has a computed column [2]
        • however behave differently between the various versions of SQL Server
          • [SQL Server 2012|2014] {restriction} readonly
      • contains a copy of part or all of the rows and columns in the underlying table
        • include a row-id , which is either the address of
          • a row in a heap table 
          • a clustered index key value
            • includes all columns from the clustered index even when not explicitly defined in the CREATE statement
              • the not specified columns will not be available in the sys.index_columns view
      • [SQL Server 2016] multiple nonclustered rowstore indexes can be created on a columnstore index and perform efficient table seeks on the underlying columnstore
        • ⇒ once created, makes it possible to drop one or more btree nonclustered indexes
      • enables real-time operational analytics where the OLTP workload uses the underlying clustered index while analytics run concurrently on the columnstore index
  • {concept} batch mode execution (aka vector-based execution, vectorized execution
    • {def} query processing method used to process multiple rows together in groups of rows, or batches, rather than one row at a time
      • SQL Server can push a predicate to the columnstore index scan operator, preventing unnecessary rows from being loaded into the batch [5]
      • queries can process up to 900 rows together
        • enables efficient query execution (by a 3-4x factor) [4]
        • ⇐ the size of the batches varies to fit into the CPU cache
        • ⇒ reduces the number of times that the CPU needs to request external data from memory or other components [5]
      • improves the performance of aggregations, which can be calculated on a per-batch rather than a per-row basis [5]
      • tries to minimize the copy of data between operators by creating and maintaining a special bitmap that indicates if a row is still valid in the batch [5]
        • ⇐ subsequent operators will ignore the non-valid rows
        • every operator has a queue of work items (batches) to process [5]
        • worker threads from a shared pool pick items from queues and process them while migrating from operator to operator [5]
    • is closely integrated with, and optimized around, the columnstore storage format.
      • columnstore indexes use batch mode execution
        • ⇐ improves query performance typically by two to four times
  • {concept} tuple mover
    • single-threaded process that works in the background, preserving system resources
      • runs every five minutes
    • converts closed delta stores to row groups that store data in a column-based storage format [5]
      • can be disabled via trace flag T-634 
      • ⇐ the conversion of closed delta stores to row groups can be forced by reorganizing an index [5]
        • runs in parallel using multiple threads
          • decreases significantly conversion time at a cost of extra CPU load and memory usage [5]
        • via: ALTER INDEX REORGANIZE command
    • it doesn’t prevent other sessions from inserting new data into a table [5]
    • deletions and data modifications would be blocked for the duration of the operation [5]
      • {recommendation} consider forcing index reorganization manually to reduce execution, and therefore locking, time [5]
    • considered fragmented if it has
      • multiple delta rowgroups
      • deleted rows
    • require maintenance like that of regular B-Tree indexes [5]
      • {issue] partially populated row groups
      • {issue} overhead of delta store and delete bitmap scans during query execution
      • rebuilding the columnstore index addresses the issues
      • the strategy depends on the volatility of the data and the ETL processes implemented in the system [5]
        • {recommendation} rebuild indexes when a table has a considerable volme of deleted rows and/or a large number of partially populated rowgroups [5]
        • {recommendation} rebuild partition(s) that still have a large number of rows in open delta stores after the ETL process has completed, especially if the ETL process does not use a bulk insert API [5]
      • creating/dropping/disabling/rebuilding functions like any other index
    • columnstore statistics 
      • a statistics object is created at the time of columnstore index creation; however, it is neither populated nor updated afterward [5]
        • ⇐ SQL Server relies on segment information, B-Tree indexes (when available), and column-level statistics when deciding if a columnstore index needs to be used [5]
        • it is beneficial to create missing column-level statistics on the columns that participate in a columnstore index and are used in query predicates and as join keys [5]
          • ⇐ statistics rarely update automatically on very large tables [5]
            • ⇒ statistics must be updated ‘manually’
      • [SQL Server 2019] included into the schema-only clone of a database functionality [8]
        • enable performance troubleshooting without the need to manual capture the statistics information
      • columnstore indexes has been added to sp_estimate_data_compression_savings. In SQL Server 2019 both 
      • COLUMNSTORE and COLUMNSTORE_ARCHIVE have been added to allow you to estimate the space savings if 
      • either of these indexes are used on a table.
        • via DBCC CLONEDATABASE
    • [in-memory tables] 
      • {limitation} a columnstore index must include all the columns and can’t have a filtered condition [2]
      • {limitation} queries on columnstore indexes run only in InterOP mode, and not in the in-memory native mode [2]
    • {operation} designing columnstore indexes
      • {best practice} understand as much as possible data’s characteristics
      • {best practice} identify workload’s characteristics
    • {operation} create a clustered columnstore index
      • via CREATE CLUSTERED COLUMNSTORE INDEX command
      • not needed to specify any columns in the statement
        • ⇐ the index will include all table columns
    • {operation} index rebuilding 
      • forces SQL Server to remove deleted rows physically from the index and to merge the delta stores’ and row groups’ data [5]
        • all column segments are recreated with row groups fully populated [5]
      • [<SQL Server 2019] offline operation
      • [SQL Server 2019 Enterprise] online operation
        • ⇒ higher availability 
        • ⇐ pausing and resuming create and rebuild operations are not supported [11]
      • very resource intensive process
      • holds a schema modification (Sch-M) lock on the table
        • ⇒ prevents other sessions from accessing it [5]
        • ⇐ the overhead can be mitigated by using table/index partitioning
          • ⇒ indexes will be rebuild on a partition basis for those partition with volatile data [5]
    • {operation} index reorganization 
      • [<SQL Server 2019] a reorganize operation is required to merge smaller COMPRESSED rowgroups, following an internal threshold policy that determines how to remove deleted rows and combine the compressed rowgroups
      • [SQL Server 2019] a background merge task also works to merge COMPRESSED rowgroups from where a large number of rows has been deleted
        • ⇐ after merging smaller rowgroups, the index quality should be improved.
        • the tuple-mover is helped by a background merge task that automatically compresses smaller OPEN delta rowgroups that have existed for some time as determined by an internal threshold, or merges COMPRESSED rowgroups from where a large number of rows has been deleted
        • via: ALTER INDEX REORGANIZE command
      • [SQL Server 2016] performs additional defragmentation
        • removes deleted rows from row groups that have 10 or more percent of the rows logically deleted [5]
        • merges closed row groups together, keeping the total number of rows less than or equal than rowgroup’s limit [5]
        • ⇐ both processes can be done together [5]
      • [SQL Server 2014] the only action performed is compressing and moving the data from closed delta stores to rowgroups [5] 
        • ⇐ delete bitmap and open delta stores stay intact [5]
      • via: ALTER INDEX REORGANIZE
        • uses all available system resources while it is running [5]
          • ⇒ speeds up the execution process 
          • reduce the time during which other sessions cannot modify or delete data in a table [5]
        • close and compress all open row groups
          • via: ALTER INDEX REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON)
          • row groups aren’t merged during this operation [5]
    • {operation} estimate compression savings
      • [SQL Server 2019] COLUMNSTORE and COLUMNSTORE_ARCHIVE added
        • allows estimating the space savings if either of these indexes are used on a table [8]
        • {limitation} not available in all editions 
      • via: sp_estimate_data_compression_savings 
    • {operation} [bulk loads] when the number of rows is less than deltastore’s limit, all the rows go directly to the deltastore
      • [large bulk load] most of the rows go directly to the columnstore without passing through the deltastore
        • some rows at the end of the bulk load might be too few in number to meet the minimum size of a rowgroup
          • ⇒ the final rows go to the deltastore instead of the columnstore
      • bulk insert operations provide the number of rows in the batch as part of the API call [5]
        • best results are achieved by choosing a batch size that is divisible by rowgroup’s limit [5]
          • ⇐ guarantees that every batch produces one or several fully populated row groups [5]
            • ⇒ reduce the total number of row groups in a table [5]
            • ⇒ improves query performance
          • ⇐ the batch size shouldn’t exceed rowgroup’s limit [5]
            • row groups can be still created on the fly in a manner to similar a bulk insert when the size of the insert batch is close to or exceeds [5]
  • {operation} [non-bulk operations] trickle inserts go directly to a delta store
  • {feature} parallel inserts
    • [SQL Server 2016] requires following conditions for parallel insert on CCI [6]
      • must specify TABLOCK
      • no NCI on the clustered columnstore index
      • no identity column
      • database compatibility is set to 130
  • {recommendation} minimize the use of string columns in facts tables [5]
    • string data use more space
    • their encoding involves additional overhead during batch mode execution [5]
    • queries with predicates on string columns may have less efficient execution plans that also require significantly larger memory grants as compared to their non-string counterparts [5]
  • {recommendation} [SQL Server 2012|2014] do not push string predicates down toward the lowest operators in execution plans.
  • {recommendation} add another dimension table and replace the string value in the facts table with a synthetic, integer-based ID key that references a new table [5]
  • {operation} upgrading to SQL Server 2016
    • make sure that queries against the tables with columnstore indexes can utilize parallelism in case if database compatibility level less than 130 [5]
  • {feature} [SQL Server 2019] automated columnstore index maintenance [8]
  • {improvement} [SQL Server 2019] better columnstore metadata memory management
  • {improvement} [SQL Server 2019] low-memory load path for columnstore tables
  • {improvement} [SQL Server 2019] improved performance for bulk loading to columnstore indexes
  • {improvement} [SQL Server 2019] server startup process has been made faster for databases that use in-memory columnstore tables for HTAP
  • {feature} DMVs
Previous Post <<||>> Next Post

References:
[1] SQL Docs (2020) Columnstore indexes: Overview [link]
[2] Microsoft Learn (2024) SQL: What's new in columnstore indexes  [link]
[3] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)
[4] SQL Docs (2019) Columnstore indexes - Query performance [link]
[5] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.
[6] Microsoft Learn (2016) Columnstore Index: Parallel load into clustered columnstore index from staging table [link]
[7] Microsoft Learn (2016) Columnstore Index Defragmentation using REORGANIZE Command [link]
[8] Microsoft (2018) Microsoft SQL Server 2019: Technical white paper [link]

Acronyms:
CCI - clustered columnstore index
CI - columnstore index
DBCC - Database Console Commands
DMV - Dynamic Management View
ETL - Extract, Transform, Load
HTAP - Hybrid Transactional/Analytical Processing 
LOB - Line of Business
NCCI - nonclustered columnstore index
OLTP - On-Line Transaction Processing
SP - Service Pack

13 February 2025

🧊💠🗒️Data Warehousing: Table Partitioning in SQL Server [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 13-Feb-2025

[Data Warehousing] Table Partitioning

  • {defthe spreading of data across multiple tables based on a set of rules to balance large amounts of data across disks or nodes
    • data is distributed based on a function that defines a range of values for each partition [2] 
      • the table is partitioned by applying the partition scheme to the values in a specified column [2]
  • {operation} partition creation
    • [large partitioned table]
      • should be created two auxiliary nonindexed empty tables with the same structure, including constraints and data compression options [4]
        • first table:  create a check constraint that guarantees that all data from the table fits exactly with one empty partition of the fact table
          • the constraint must be created on the partitioning column [4]
          • a columnstore index can be created on the fact table, as long as it is aligned with the table [4]
          • after truncation of <table 2> the <table 1> is prepared to accept the next partition from your fact table for the next minimally logged deletion [4]
        • second table: for minimally logged deletions of large portions of data, a partition from the fact table can be switched to the empty table version without the check constraint [4]
          • then the table can be truncated
      • for minimally logged inserts, new data to the second auxiliary table should be bulk inserted in the auxiliary that has the check constraint [4]
        • INSERT operation can be minimally logged because the table is empty [4]
        • create a columnstore index on this auxiliary table, using the same structure as the columnstore index on your fact table [4]
        • switch data from this auxiliary table to a partition of your fact table [4]
        • drop the columnstore index on the auxiliary table, and change the check constraint to guarantee that all of the data for the next load can be switched to the next empty partition of the fact table [4]
        • the second auxiliary table is prepared for new bulk loads again [4]
  • {operation} [Query Optimizer] partition elimination 
    • process in which SQL Server accesses only those partitions needed to satisfy query filters [4]
  • {operation} partition switching 
    • {definition} process that switches a block of data from one table or partition to another table or partition [4]
    • types of switches
      • reassign all data from a nonpartitioned table to an empty existing partition of a partitioned table [4]
      • switch a partition of one partitioned table to a partition of another partitioned table [4]
      • reassign all data from a partition of a partitioned table to an existing empty nonpartitioned table [4]
  • {benefit} improves query performance [1]
    • by partitioning a table across filegroups [1]
      • specific ranges of data can be placed on different disk spindles [1]
        • can improve I/O performance [1]
          • ⇐ the disk storage is already configured as a RAID 10 or RAID 5 array [1]
            • ⇒ this usually has little benefit [1]
      • using a mix of fast solid state storage for recent, frequently accessed data, and mechanical disks for older, less queried rows [1]
        • use partitioning to balance disk performance against storage costs [1]
      • biggest performance gain from partitioning in a data warehouse is realized when queries return a range of rows that are filtered on the partitioning key [1]
        • the query optimizer can eliminate partitions that are not within the filter range [1]
          • dramatically reduce the number of rows that need to be read [1]
    • reduces contention [3]
      • can reduce the number of rows included in a table scan [3]
  • {benefit} more granular manageability [1]
    • some maintenance operations can be performed at partition level instead of on the whole table [1]
      • e.g. indexes can be created and rebuilt on a per-partition basis [1]
      • e.g. compression can be applied to individual partitions [1]
      • e.g. by mapping partitions to filegroups, partitions can be backed up and restored independently [1]
        • enables to back up older data once and then configure the backed up partitions as read-only [1]
        • future backups can be limited to the partitions that contain new or updated data [1]
  • {benefit} improved data load performance
    • enables loading many rows very quickly by switching a staging table with a partition
      • can dramatically reduce the time taken by ETL data loads [1]
        • with the right planning, it can be achieved with minimal requirements to drop or rebuild indexes [1]
  • {best practice} partition large fact tables
    • tables of around 50 GB or more
    • ⇐ in general, fact tables benefit from partitioning more than dimension tables [1]
  • {best practice} partition on an incrementing date key [1]
    • assures that the most recent data are in the last partition and the earliest data are in the first partition [1]
  • {best practice} design the partition scheme for ETL and manageability [1]
    • the query performance gains realized by partitioning are small compared to the manageability and data load performance benefits [1]
      • ideally partitions should reflect the ETL load frequency
        • because this simplifies the load process [1]
        • merge partitions periodically to reduce the overall number of partitions (for example, at the start of each year [1]
      • could merge the monthly partitions for the previous year into a single partition for the whole year [1]
  • {best practice} maintain an empty partition at the start and end of the table [1]
    • simplifies the loading of new rows [1]
    • when new records need to be added, split the empty partition 
      • ⇐ to create two empty partitions)
    • then switch the staged data with the first empty partition [1]
      • ⇐ loads the data into the table and leaves the second empty partition you created at the end of the table, ready for the next load [1]
    • a similar technique can be used to archive or delete obsolete data at the beginning of the table [1]
  • {best practice} chose the proper granularity
    • it should be aligned to the business requirements [2]
  • {best practice} create at least one filegroup in addition to the primary one
    • set it as the default filegroup
      • data tables are thus separated from system tables [2]
    • creating dedicated filegroups for extremely large fact tables [2]
      • place the fact tables on their own logical disks [2]
    • use a file and a filegroup separate from the fact and dimension tables [2]
      • {exception} staging tables that will be switched with partitions to perform fast loads [2]
        • staging tables must be created on the same filegroup as the partition with which they will be switched [2]
  • {def} partition scheme 
    • a scheme that maps partitions to filegroups
  • {def} partition function 
    • object that maps rows to partitions by using values from specific columns (aka  partitioning columns)
    • performs logical mapping
  • {def} aligned index 
    • index built on the same partition scheme as its base table [4]
      • if all indexes are aligned with their base table, switching a partition is a metadata operation only [4]
        • ⇒ it’s very fast [4]
Previous Post <<||>> Next Post

References:
[1] 20467A - Designing Business Intelligence Solutions with Microsoft SQL Server 2012
[2] 20463C - Implementing a Data Warehouse with Microsoft SQL Server
[3] 10777A - Implementing a Data Warehouse with Microsoft SQL Server 2012
[4] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)
[5] Microsoft Learn (2009) How to Implement an Automatic Sliding Window in a Partitioned Table on SQL Server 2005 [link]

08 February 2025

🌌🏭KQL Reloaded: First Steps (Part IX: Translating SQL to KQL - More Joins)

The last post exemplified the use of "explain" to translate queries from SQL to KQL. The current post attempts to test the feature based on the various join constructs available in SQL by using the NewSales and Products tables. The post presumes that the reader as a basic understanding of the join types from SQL-based environments. 

Inner Join

// full join in SQL
--
explain
SELECT NewSales.CustomerKey
, NewSales.ProductKey
, Products.ProductName
FROM NewSales 
     JOIN Products 
      ON NewSales.ProductKey = Products.ProductKey 

// full join in KQL
NewSales
| join kind=inner (
    Products
    | project ProductKey, ProductName 
    )
    on ($left.ProductKey == $right.ProductKey)
| project CustomerKey, ProductKey, ProductName
| limit 10

A full join is probably the most used type of join given that fact tables presume the existence of dimensions, even if poor data warehousing design can lead also to exception. The join retrieves all the data matching from both tables, including the eventual duplicates from both sides of the join. 

Left Join

// left join in SQL
--
explain
SELECT NewSales.CustomerKey
, NewSales.ProductKey
, Products.ProductName
FROM NewSales 
     LEFT JOIN Products 
      ON NewSales.ProductKey = Products.ProductKey 

// left join in KQL
NewSales
| join kind=leftouter (
    Products
    | project ProductKey
        , Product = ProductName 
    )
    on ($left.ProductKey == $right.ProductKey)
| where isnull(Product) 
| project CustomerKey
    , ProductKey
    , ProductName
| limit 10

A left join retrieves all the records from the left table, typically the fact table, independently whether records were found in the dimension table. One can check whether mismatches exist by retrieving the records where no match was found.

Right Join

// right join in SQL
--
explain
SELECT NewSales.CustomerKey
, Products.ProductKey
, Products.ProductName
FROM NewSales 
     RIGHT JOIN Products 
      ON NewSales.ProductKey = Products.ProductKey 

// right join in KQL
NewSales
| join kind=rightouter (
    Products
    | project DimProductKey = ProductKey
    , DimProductName = ProductName 
    )
    on ($left.ProductKey == $right.DimProductKey)
| where isnull(ProductKey) 
| project CustomerKey
    , DimProductKey
    , DimProductName
| limit 10

A right join retrieves the records from the dimension together with the matches from the fact table, independently whether a match was found in the fact table. 

Full Outer Join

// full outer join in SQL
--
explain
SELECT NewSales.CustomerKey
, Coalesce(NewSales.ProductKey, Products.ProductKey) ProductKey
, Coalesce(NewSales.ProductName, Products.ProductName) ProductName
FROM NewSales 
     FULL OUTER JOIN Products 
      ON NewSales.ProductKey = Products.ProductKey 


// full outer join in KQL
NewSales
| join kind=fullouter (
    Products
    | project DimProductKey = ProductKey
    , DimProductName = ProductName 
    )
    on ($left.ProductKey == $right.DimProductKey)
//| where isnull(ProductKey) 
| project CustomerKey
    , ProductKey = coalesce(ProductKey, DimProductKey)
    , ProductName = coalesce(ProductName, DimProductName)
| limit 10

A full outer join retrieves all the data from both sides of the join independently on whether a match is found. In RDBMS this type of join performs poorly especially when further joins are considered, respectively when many records are involved on both sides of the join. Therefore it should be avoided when possible, though in many cases it might be the only feasible solution. There are also alternatives that involve a UNION between a LEFT JOIN and a RIGHT JOIN, the letter retrieving only the records which is not found in the fact table (see last query from a previous post). This can be a feasible solution when data sharding is involved. 

Notes:
1) If one ignores the unnecessary logic introduced by the translation via explain, the tool is excellent for learning KQL. It would be interesting to understand why the tool used a certain complex translation over another, especially when there's a performance benefit in the use of a certain piece of code.
2) Also in SQL-based queries it's recommended to start with the fact table, respectively with the table having the highest cardinality and/or the lowest level of detail, though the database engine might find an optimal plan independently of which table was written first.

Happy coding!

Previous Post <<||>> Next Post

02 February 2025

🏭 💠Data Warehousing: Microsoft Fabric (Part VIII: More on SQL Databases)

Business Intelligence Series
Business Intelligence Series

Last week Microsoft had a great session [1] on the present and future of SQL databases, a “light” version of Azure SQL databases designed for Microsoft Fabric environments, respectively workloads. SQL databases are currently available for testing, and after the first tests the product looks promising. Even if there are several feature gaps, it’s expected that Microsoft will bridge the gaps over time. Conversely, there might be features that don’t make sense in Fabric, respectively new features that need to be considered for facilitating the work in OneLake and its ecosystem.

During the session several Microsoft professionals answered the audience’s questions, and they did a great job. Even if the answers and questions barely scratched the surface, they offered some insight into what Microsoft wants to do. Probably the expectation is that SQL databases won’t need any administration - indexes being maintained automatically, infrastructure scaling as needed, however everything sounds too nice to be true if one considers in general the experience with RDBMS – the devil hides usually in details.

Even if the solutions built follow the best practices in the field, which frankly seldom happens, transferring the existing knowledge to Fabric may encounter some important challenges revolving around performance, flexibility, accessibility and probably costs. Even if SQL databases are expected to fill some minor gaps, considering the lessons of the past, such solutions can easily grow. Even if a lot of processing power is thrown at the SQL queries and the various functionality, customers still need to write quality code and refactor otherwise the costs will explode sooner or later. 

As the practice has proven so many times while troubleshooting performance issues, sometimes one needs to use all the arsenal available – DBCC, DMVs and sometimes even undocumented features - to get a better understanding of what’s happening. Even if there are some voices stating that developers don’t need to know how the SQL engine works, just applying solutions blindly after a recipe can accidentally increase the value of code, though most likely it doesn’t exploit the full potential available. Unfortunately, this is a subjective topic without hard numbers to support it, and the stories told by developers and third-parties usually don’t tell the whole story. 

It’s also true that diving deep into a database’s internal working requires time, that’s quite often not available, and the value for such an effort doesn’t necessarily pay off. Above this, there’s a software engineer’s aim of understanding of how things work. Otherwise, one should drop the engineering word and just call it coding. Conversely, the data citizen just needs a high-level knowledge of how things work, though the past 20-30 years proved that that’s often not enough. The more people don’t have the required knowledge, the higher the chances that code needs refactoring. Just remember the past issues organizations had with MS Access and Excel when people started to create their own solutions, the whole infrastructure being invaded by poorly designed solutions that continue to haunt some organizations even today.

Even if lot of technical knowledge can be transported to Microsoft Fabric, the new environments may still require also adequate tools that can be used for monitoring and troubleshooting. Microsoft seems to work in this direction, though from the information available the tools don’t and can’t offer the whole perspective. It will be interesting to see how much the current, respectively the future dashboards and various reports can help; respectively what important gaps will surface. Until the gaps are addressed, probably the SQL professional must rely on SQL scripts and the DMVs available. All this can be summarized in a few words: it will not be boring!

Previous Post <<||>> Next Post

References:
[1] Microsoft Reactor (2025) Ask The Expert - Fabric Edition - Fabric Databases [link]

21 January 2025

🧊🗒️Data Warehousing: Extract, Transform, Load (ETL) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 21-Jan-2025

[Data Warehousing] Extract, Transform, Load (ETL)  

  • {def} automated process which takes raw data, extracts the data required for further processing, transforms it into a format that addresses business' needs, and loads it into the destination repository (e.g. data warehouse)
    • includes 
      • the transportation of data
      • overlaps between stages
      • changes in flow 
        • due to 
          • new technologies
          • changing requirements
      • changes in scope
      • troubleshooting
        • due to data mismatches
  • {step} extraction
    • data is extracted directly from the source systems or intermediate repositories
      • data may be made available in intermediate repositories, when the direct access to the source system is not possible
        •  this approach can add a complexity layer
    •  {substep} data validation
      • an automated process that validates whether data pulled from sources has the expected values
      • relies on a validation engine
        • rejects data if it falls outside the validation rules
        • analyzes rejected records on an ongoing basis to
          • identifies what went wrong
          • corrects the source data
          • modifies extraction to resolve the problem in the next batches
  • {step} transform
    • transforms the data, removing extraneous or erroneous data
    • applies business rules 
    • checks data integrity
      • ensures that the data is not corrupted in the source or corrupted by ETL
      • may ensure no data was dropped in previous stages
    • aggregates the data if necessary
  • {step} load
    • {substep} store the data into a staging layer
      • transformed data are not loaded directly into the target but staged into an intermediate layer (e.g. database)
      • {advantage} makes it easier to roll back, if something went wrong
      • {advantage} allows to develop the logic iteratively and publish the data only when needed 
      • {advantage} can be used to generate audit reports for 
        • regulatory compliance 
        • diagnose and repair of data problems
      • modern ETL process perform transformations in place, instead of in staging areas
    • {substep} publish the data to the target
      • loads the data into the target table(s)
      • {scenario} the existing data are overridden every time the ETL pipeline loads a new batch
        • this might happen daily, weekly, or monthly
      • {scenario} add new data without overriding
        • the timestamp can indicate the data is new
      • {recommendation} prevent the loading process to error out due to disk space and performance limitations
  • {approach} building an ETL infrastructure
    • involves integrating data from one or more data sources and testing the overall processes to ensure the data is processed correctly
      • recurring process
        • e.g. data used for reporting
      • one-time process
        • e.g. data migration
    • may involve 
      • multiple source or destination systems 
      • different types of data
        • e.g. reference, master and transactional data
        • ⇐ may have complex dependencies
      • different level of structuredness
        • e.g. structured, semistructured, nonstructured 
      • different data formats
      • data of different quality
      • different ownership 
  • {recommendation} consider ETL best practices
    • {best practice} define requirements upfront in a consolidated and consistent manner
      • allows to set clear expectations, consolidate the requirements, estimate the effort and costs, respectively get the sign-off
      • the requirements may involve all the aspects of the process
        • e.g. data extraction, data transformation, standard formatting, etc.
    • {best practice} define a high level strategy
      • allows to define the road ahead, risks and other aspects 
      • allows to provide transparency
      • this may be part of a broader strategy that can be referenced 
    • {best practice} align the requirements and various aspects to the existing strategies existing in the organization
      • allows to consolidate the various efforts and make sure that the objectives, goals and requirements are aligned
      • e.g. IT, business, Information Security, Data Management strategies
    • {best practice} define the scope upfront
      • allows to better estimate the effort and validate the outcomes
      • even if the scope may change in time, this allows to provide transparence and used as basis for the time and costs estimations
    • {best practice} manage the effort as a project and use a suitable Project Management methodology
      • allows to apply structured well-established PM practices 
      • it might be suited to adapt the methodology to project's particularities 
    • {best practice} convert data to standard formats to standardize data processing
      • allows to reduce the volume of issues resulted from data type mismatches
      • applies mainly to dates, numeric or other values for which can be defined standard formats
    • {best practice} clean the data in the source systems, when  cost-effective
      • allows to reduces the overall effort, especially when this is done in advance
      • this should be based ideally on the scope
    • {best practice} define and enforce data ownership
      • allows to enforce clear responsibilities across the various processes
      • allows to reduce the overall effort
    • {best practice} document data dependencies
      • document the dependencies existing in the data at the various levels
    • {best practice} protocol data movement from source(s) to destination(s) in term of data volume
      • allows to provide transparence into the data movement process
      • allows to identify gaps in the data or broader issues
      • can be used for troubleshooting and understanding the overall data growth
  • {recommendation} consider proven systems, architectures and methodologies
    • allows to minimize the overall effort and costs associated with the process

07 December 2024

🏭 💠Data Warehousing: Microsoft Fabric (Part VI: SQL Databases for OLTP scenarios) [new feature]

Data Warehousing Series
Data Warehousing Series

One interesting announcements at Ignite is the availability in public preview of SQL databases in Microsoft Fabric, "a versatile and developer-friendly transactional database built on the foundation of Azure SQL database". With this Fabric can address besides OLAP also OLTP scenarios, evolving thus from analytics to a data platform [1]. According to the announcement, besides the AI-optimized architectural aspects, the feature makes the SQL Azure simple, autonomous and secure by design [1], and these latest aspects are considered in this post. 

Simplicity revolves around the deployment and configuration of databases, the creation of a new database requiring giving a name and the database is created in seconds [1]. It’s a considerable improvement compared with the relatively complex setup needed for on-premise configurations, though sometimes more flexibility in configuration is needed upfront or over database’s lifetime. To get a database ready for testing one can import a sample database or get specific data via data flows and/or pipelines [1]. As development tools one can use Visual Studio Code or SSMS [1], and probably more tools will be available in time.

The integration with both GitHub and Azure DevOps allows to configure each database under source control, which is needed for many scenarios especially when multiple resources make changes to the database objects [1]. Frankly, that’s mainly important during the development phase, respectively in scenarios in which multiple people make in parallel changes to the logic. It will be interesting to see how much overhead or challenges the feature adds to development and how smoothly everything works together!

The most important aspect for many solutions is the replication of data in near-real time to the (open-source) delta parquet format in OneLake and thus making the data available for analytics almost immediately [1]. Probably, from this aspect many cloud-based applications can benefit, even if the performance might not be as good as in other well-established architectures. However, there are many other scenarios in which one needs to maintain and use data for OLTP/OLAP purposes. This invites adequate testing and a good weighting of the advantages and disadvantages involved. 

A SQL database is a native item in Fabric, and therefore it utilizes Fabric capacity units like other Fabric workloads [1]. One can use the Fabric SKU estimator (still in private preview) to estimate the costs [2], though it will be interesting to see how cost-effective the solutions are. Probably, especially when the infrastructure is already available outside of Fabric, it will be easier and cost-effective to use the mirroring functionality. One should test and have a better estimator before moving blindly from the existing infrastructure to Fabric. 

SQL databases in Fabric are autonomous by design, while allowing to get the best performance and availability by default [1]. High availability is reached through zone redundancy, while performance is achieved by scaling automatically the storage and compute to accommodate the workloads [1]. The auto-optimization capability is achieved with the help of the latest Intelligent Query Processing (IQP) enhancements, respectively the creation of missing indexes to improve query performance [1]. It will be interesting to see how the whole process works, given that the maintenance of indexes usually involves some challenges (e.g. identifying covering indexes, indexes needed only for temporary workloads, duplicated indexes).

SQL databases in Fabric are automatically configured for high availability with zone redundancy, while storage and compute scale automatically to accommodate the user workload [1]. The database is auto-optimized through the latest IQP enhancements while the system creates any missing indexes to improve query performance. All data is replicated to OneLake by default [1]. Finally, the database always receives the latest security updates with auto-patching, while automatic backups help in disaster recovery scenarios  [1], which can be of real help for database administrators. 

References:
[1] Microsoft Fabric Updates Blog (2024) Announcing SQL database in Microsoft Fabric Public Preview [link]
[2] Microsoft Fabric Updates Blog (2024) Announcing New Recruitment for the Private Preview of Microsoft Fabric SKU Estimator [link]

10 November 2024

🏭🗒️Microsoft Fabric: Data Warehouse [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 11-Mar-2024

Warehouse vs SQL analytics endpoint in Microsoft Fabric
Warehouse vs SQL analytics endpoint in Microsoft Fabric [3]

[Microsoft Fabric] Data Warehouse

  • highly available relational data warehouse that can be used to store and query data in the Lakehouse
    • supports the full transactional T-SQL capabilities 
    • modernized version of the traditional data warehouse
  • unifies capabilities from Synapse Dedicated and Serverless SQL Pools
  • modernized with key improvements
  • resources are managed elastically to provide the best possible performance
    • ⇒ no need to think about indexing or distribution
    • a new parser gives enhanced CSV file ingestion time
    • metadata is now cached in addition to data
    • improved assignment of compute resources to milliseconds
    • multi-TB result sets are streamed to the client
  • leverages a distributed query processing engine
    • provides with workloads that have a natural isolation boundary [3]
      • true isolation is achieved by separating workloads with different characteristics, ensuring that ETL jobs never interfere with their ad hoc analytics and reporting workloads [3]
  • {operation} data ingestion
    • involves moving data from source systems into the data warehouse [2]
      • the data becomes available for analysis [1]
    • via Pipelines, Dataflows, cross-database querying, COPY INTO command
    • no need to copy data from the lakehouse to the data warehouse [1]
      • one can query data in the lakehouse directly from the data warehouse using cross-database querying [1]
  • {operation} data storage
    • involves storing the data in a format that is optimized for analytics [2]
  • {operation} data processing
    • involves transforming the data into a format that is ready for consumption by analytical tools [1]
  • {operation} data analysis and delivery
    • involves analyzing the data to gain insights and delivering those insights to the business [1]
  • {operation} designing a warehouse (aka warehouse design)
    • standard warehouse design can be used
  • {operation} sharing a warehouse (aka warehouse sharing)
    • a way to provide users read access to the warehouse for downstream consumption
      • via SQL, Spark, or Power BI
    • the level of permissions can be customized to provide the appropriate level of access
  • {feature} mirroring 
    • provides a modern way of accessing and ingesting data continuously and seamlessly from any database or data warehouse into the Data Warehousing experience in Fabric
      • any database can be accessed and managed centrally from within Fabric without having to switch database clients
      • data is replicated in a reliable way in real-time and lands as Delta tables for consumption in any Fabric workload
  • {concept}SQL analytics endpoint 
    • a warehouse that is automatically generated from a Lakehouse in Microsoft Fabric [3]
  • {concept}virtual warehouse
    • can containing data from virtually any source by using shortcuts [3]
  • {concept} cross database querying 
    • enables to quickly and seamlessly leverage multiple data sources for fast insights and with zero data duplication [3]
References:
[1] Microsoft Learn: Fabric (2023) Get started with data warehouses in Microsoft Fabric (link
[2] Microsoft Learn: Fabric (2023) Microsoft Fabric decision guide: choose a data store (link)
[3] Microsoft Learn: Fabric (2024) What is data warehousing in Microsoft Fabric? (link)
[4] Microsoft Learn: Fabric (2023) Better together: the lakehouse and warehouse (link)

Resources:
[1] Microsoft Learn: Fabric (2023) Data warehousing documentation in Microsoft Fabric (link)


16 September 2024

🧭Business Intelligence: Mea Culpa (Part IV: Generalist or Specialist in an AI Era?)

Business Intelligence Series
Business Intelligence Series

Except the early professional years when I did mainly programming for web or desktop applications in the context of n-tier architectures, over the past 20 years my professional life was a mix between BI, Data Analytics, Data Warehousing, Data Migrations and other topics (ERP implementations and support, Project Management, IT Service Management, IT, Data and Applications Management), though the BI topics covered probably on average at least 60% of my time, either as internal or external consultant. 

I can consider myself thus a generalist who had the chance to cover most of the important aspects of a business from an IT perspective, and it was thus a great experience, at least until now! It’s a great opportunity to have the chance to look at problems, solutions, processes and the various challenges and opportunities from different perspectives. Technical people should have this opportunity directly in their jobs through the communication occurring in projects or IT services, though that’s more of a wish! Unfortunately, the dialogue between IT and business occurs almost only over the tickets and documents, which might be transparent but isn’t necessarily effective or efficient! 

Does working only part time in an area make one person less experienced or knowledgeable than other people? In theory, a full-time employee should get more exposure in depth and/or breadth, but that’s relative! It depends on the challenges one faces, the variation of the tasks, the implemented solutions, their depth and other technical and nontechnical factors like training, one’s experience in working with the various tools, the variety of the tasks and problem faced, professionalism, etc. A richer exposure can but not necessarily involve more technical and nontechnical knowledge, and this shouldn’t be taken as given! There’s no right or wrong answer even if people tend to take sides and argue over details.

Independently of job's effective time, one is forced to use his/her time to keep current with technologies or extend one’s horizon. In IT, a professional seldom can rely on what is learned on the job. Fortunately, nowadays one has more and more ways of learning, while the challenge shifts toward what to ignore, respectively better management of one’s time while learning. The topics increase in complexity and with this blogging becomes even more difficult, especially when one competes with AI content!

Talking about IT, it will be interesting to see how much AI can help or replace some of the professions or professionals. Anyway, some jobs will become obsolete or shift the focus to prompt engineering and technical reviews. AI still needs explicit descriptions of how to address tasks, at least until it learns to create and use better recipes for problem definition and solving. The bottom line, AI and its use can’t be ignored, and it can and should be used also in learning new things. It’s amazing what one can do nowadays with prompt engineering! 

Another aspect on which AI can help is to tailor the content to one’s needs. A high percentage in the learning process is spent on fishing in a sea of information for content that is worth knowing, respectively for a solution to one’s needs. AI must be able to address also some of the context without prompters being forced to give information explicitly!

AI opens many doors but can close many others. How much of one’s experience will remain relevant over the next years? Will AI have more success in addressing some of the challenges existing in people’s understanding or people will just trust AI blindly? Anyway, somebody must be smarter than AI, and here people’s collective intelligence probably can prove to be a real match. 

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.