
16 February 2025

💠🛠️🗒️SQL Server: Columnstore Indexes [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources. It considers only on-premises SQL Server; for other platforms please refer to the documentation.

Last updated: 15-Feb-2025

[SQL Server] columnstore indexes (CI)

  • {def} a technology for storing, retrieving and managing data by using a columnar data format (aka columnstore)
    • store compressed data on a per-column rather than a per-row basis [5]
  • {benefit} designed for analytics and data warehousing workloads
    • data warehousing
      • {scenario} store fact tables and large dimension tables
        • ⇐ tend to require full table scans rather than table seeks
    • analytics workloads
      • {scenario} [SQL Server 2016 SP1] can be used for real-time analytics on operational databases
        • ⇐ an updatable nonclustered columnstore index can be created on a rowstore table
  • {benefit} performance increase 
    • can achieve up to 100x better performance [4]
    • offers an order of magnitude better performance than a rowstore index
      • {feature} uses batch mode execution
        • improves query performance typically by two to four times
      • have high performance gains for analytic queries that scan large amounts of data, especially on large tables (>1 million rows) 
  • {benefit} reduces significantly the data warehouse storage costs
    • {feature} data compression
      • ⇒ provides high compression rates, typically by 10 times
        • ⇒ reduces total I/O from the physical media
          • ⇐ queries often select only a few columns from a table
          • minimizes or eliminates system I/O bottlenecks
        • reduces significantly the memory footprint
          • ⇒ query performance can improve 
            • because SQL Server can perform more query and data operations in memory
  • {benefit} built in memory
    • ⇒ sufficient memory must be available 
  • {benefit} part of the database engine
    • no special hardware is needed
  • {concept} columnstore 
    • {def} data structure logically organized as a table with rows and columns, and physically stored in a column-wise data format
      • stores values from the same domain which commonly have similar values
        • when a query references a column, then only that column is fetched from disk [3]
          • ⇐ the columns not requested are skipped 
            • ⇒ they are not loaded into memory 
        • when a query is executed, the rows must be reconstructed
          • ⇒ row reconstruction takes some time and uses some CPU and memory resources [3]
        • [SQL Server 2016] columnstore index on rowstore tables
          • columnstore is updated when data changes in the rowstore table
        • both indexes work against the same data
  • {concept} rowstore
    • {def} data that's logically organized as a table with rows and columns, and physically stored in a row-wise data format
      • ⇐ the traditional way to store relational table data
      • refers to a table where the underlying data storage format is either
        • a heap
        • a clustered index
        • a memory-optimized table
  • {concept} rowstore index
    • performs best on queries that seek into the data, when searching for a particular value, or for queries on a small range of values
      • ⇒ appropriate for transactional workloads 
        • because they tend to require mostly table seeks instead of table scans
  • {concept} rowgroup
    • {def} a group of rows that are compressed into columnstore format at the same time
      • {constraint} has a maximum of 1,048,576 (= 2^20) rows per rowgroup
      • contains one column segment for every column in the table
      • can have more than one delta rowgroup, which together form the deltastore
        • e.g. when multiple threads create columnstore indexes using parallel execution plans [5]
          • ⇐ each thread will work with its own subset of data, creating separate rowgroups [5]
        • [partitions] each table partition has its own set of row groups [5]
          • ⇐  too many partitions may prevent workloads from benefiting from a CCI [11]
            • ⇐ data aren’t pushed into a compressed columnstore segment until the rowgroup limit is reached
    • {event} rowgroup is compressed
      • marked as read-only [16]
      • a compressed rowgroup is considered fragmented when either 
        • the number of rows is below the rowgroup limit and the trim_reason is other than DICTIONARY_SIZE [15]
          • ⇐ when the maximum dictionary size was reached, nothing can be done to increase the number of rows [15]
        • it has a nonzero number of deleted rows that exceeds a minimum threshold [15]
    • {event} all data from rowgroup deleted 
      • transitions from COMPRESSED into TOMBSTONE state
      • later removed by the tuple-mover background process
    • {event} rows in the columnstore indexes can be moved to different locations
      • row-id in the nonclustered indexes aren’t updated 
        • ⇐ the mappings between old and new row locations are stored in an internal structure (aka mapping index) 
    • {event} rowgroup build
      • all column data are combined on a per-row group basis, encoded and compressed [5]
        • the rows within a row group can be rearranged if that helps to achieve a better compression rate [5]
  • {feature} data compression
    • the table is sliced into rowgroups, and each rowgroup is compressed in a column-wise manner
      • the number of rows in the rowgroup must be 
        • large enough to improve compression rates
        • small enough to benefit from in-memory operations
          • having too many small rowgroups decreases columnstore index’s quality
    • uses its own compression mechanism 
      • ⇒ row or page compression cannot be used on it [3]
      • [SQL Server 2016] the limitation related to page compression has been removed
        • ⇐ in some cases, page compression disallowed the creation of columnstore indexes with a very large number of columns [5]
  • {feature} compression delay
    • computed when a delta rowgroup is closed [7]
    • keeps the ‘active’ rows in delta rowgroup and only transition these rows to compressed rowgroup after a specified delay [7]
      • ⇐ reduces the overall maintenance overhead of NCCI [7]
      • ⇒ leads to a larger number of delta rowgroups [7]
    • {best practice} if the workload is primarily inserting data and querying it, the default COMPRESSION_DELAY of 0 is the recommended option [7]
    • {best practice} [OLTP workload] if > 10% rows are marked deleted in recently compressed rowgroups, then consider a value that accommodates the behavior [7]
      • via: CREATE NONCLUSTERED COLUMNSTORE INDEX … WITH (COMPRESSION_DELAY = 150) (see the T-SQL sketch after these notes)
  • {feature} data encoding
    • all values in the data are replaced with 64-bit integers using one of two encoding algorithms
    • {concept} dictionary encoding
      • stores distinct values from the data in a separate structure (aka dictionary) 
        • every value in a dictionary has a unique ID assigned [5]
          • the ID is used for replacement
    • {concept} global dictionary
      • shared across all segments that belong to the same index partition [5]
    • {concept} local dictionary
      • created for individual segments using values that are not present in the global dictionary
    • {concept} value-based encoding
      • mainly used for numeric and integer data types that do not have enough duplicated values [5]
        • dictionary encoding would be inefficient [5]
      • converts integer and numeric values to a smaller range of 64-bit integers in 2 steps
        • {step} [numeric data types] are converted to integers using the minimum positive exponent (aka magnitude that allows this conversion) [5]
          • {goal} convert all numeric values to integers [5]
          • [integer data types] the smallest negative exponent is chosen that can be applied to all values without losing their precision [5]
            • {goal} reduce the interval between the minimum and maximum values stored in the segment [5]
        • {step} the minimum value (aka base value) in the segment is identified and subtracted from all other values [5]
          • ⇒ makes the minimum value in the segment number 0 [5]
      • after encoding the data are compressed and stored as a LOB allocation unit
  • {concept} column segment 
    • {def} a column of data from within the rowgroup
    • is compressed together and stored on physical media
    • SQL Server loads an entire segment to memory when it needs to access its data
  • {concept} segment metadata 
    • store metadata about each segment 
      • e.g. minimum and maximum values
      • ⇐ segments that do not have the required data are skipped [5]
  • {concept} deltastore
    • {def} all of the delta rowgroups of a columnstore index
    • its operations are handled behind the scenes
      • can be in either of two states
        • {state} open (aka open delta store) 
          • accepts new rows and allow modifications and deletions of data
        • {state} closed (aka closed delta store)
          • a delta store is closed when it reaches its rowgroup limit
  • {concept} delta rowgroup 
    • {def} a clustered B-tree index that's used only with columnstore indexes
    • improves columnstore compression and performance by storing rows until their number reaches the rowgroup limit, after which they are moved into the columnstore
    • {event} reaches the maximum number of rows
      • it transitions from an ‘open’ to ‘closed’ state
      • a closed rowgroup is compressed by the tuple-mover and stored into the columnstore as COMPRESSED rowgroup
    • {event} compressed
      • the existing delta rowgroup transitions into TOMBSTONE state to be removed later by the tuple-mover when there is no reference to it
  • {concept} tuple-mover 
    • background process that checks for closed row group
      • if it finds a closed rowgroup, it compresses the delta rowgroup and stores it into the columnstore as a COMPRESSED rowgroup
  • {concept} clustered columnstore index (CCI) 
    • is the primary storage for the entire table
    • {characteristic} updatable
      • has two structures that support data modifications
        • ⇐ both use the B-Tree format to store data [5]
        • ⇐ created on demand [5]
        • delete bitmap
          • indicates which rows were deleted from a table
          • upon deletion the row continues to be stored in the rowgroup
          • during query execution SQL Server checks the delete bitmap and excludes deleted rows from the processing [5]
        • delta store
          • includes newly inserted rows
          • updating a row triggers the deletion of the existing row and insertion of a new version of a row to a delta store
            • ⇒ the update does not change the row data
            • ⇒ the old version of the row is marked in the delete bitmap
        • [partitions] each partition can have a single delete bitmap and multiple delta stores
          • ⇐ this makes each partition self-contained and independent from other partitions
            • ⇒ allows performing a partition switch on tables that have clustered columnstore indexes defined [5]
    • {feature} supports minimal logging for batch sizes >= rowgroup’s limit [12]
    • [SQL Server 2017] supports non-persisted computed columns in clustered columnstore indexes [2]
    • store some data temporarily into a clustered index (aka deltastore) and a btree list of IDs for deleted rows
      • ⇐ {benefit} reduces fragmentation of the column segments and improves performance
      • combines query results from both the columnstore and the deltastore to return the correct query results
    • [partitions] too many partitions can hurt the performance of a clustered columnstore index [11]
  • {concept} nonclustered columnstore index (NCCI)
    • {def} a secondary index that's created on a rowstore table
      • is defined as one or more columns of the table and has an optional condition that filters the rows
      • designed to be used for workloads involving a mix of transactional and analytical workloads
      • functions the same as a clustered columnstore index
        • ⇐ has same performance optimizations (incl. batchmode operators)
        • {exception} doesn’t support persisted computed columns
          • can’t be created on a columnstore index that has a computed column [2]
        • however, it behaves differently across the various versions of SQL Server
          • [SQL Server 2012|2014] {restriction} readonly
      • contains a copy of part or all of the rows and columns in the underlying table
        • includes a row-id, which is either the address of
          • a row in a heap table 
          • a clustered index key value
            • includes all columns from the clustered index even when not explicitly defined in the CREATE statement
              • the columns not explicitly specified will not be available in the sys.index_columns view
      • [SQL Server 2016] multiple nonclustered rowstore indexes can be created on a columnstore index and perform efficient table seeks on the underlying columnstore
        • ⇒ once created, makes it possible to drop one or more btree nonclustered indexes
      • enables real-time operational analytics where the OLTP workload uses the underlying clustered index while analytics run concurrently on the columnstore index
  • {concept} batch mode execution (aka vector-based execution, vectorized execution)
    • {def} query processing method used to process multiple rows together in groups of rows, or batches, rather than one row at a time
      • SQL Server can push a predicate to the columnstore index scan operator, preventing unnecessary rows from being loaded into the batch [5]
      • queries can process up to 900 rows together
        • enables efficient query execution (by a 3-4x factor) [4]
        • ⇐ the size of the batches varies to fit into the CPU cache
        • ⇒ reduces the number of times that the CPU needs to request external data from memory or other components [5]
      • improves the performance of aggregations, which can be calculated on a per-batch rather than a per-row basis [5]
      • tries to minimize the copy of data between operators by creating and maintaining a special bitmap that indicates if a row is still valid in the batch [5]
        • ⇐ subsequent operators will ignore the non-valid rows
        • every operator has a queue of work items (batches) to process [5]
        • worker threads from a shared pool pick items from queues and process them while migrating from operator to operator [5]
    • is closely integrated with, and optimized around, the columnstore storage format.
      • columnstore indexes use batch mode execution
        • ⇐ improves query performance typically by two to four times
  • {concept} tuple mover
    • single-threaded process that works in the background, preserving system resources
      • runs every five minutes
    • converts closed delta stores to row groups that store data in a column-based storage format [5]
      • can be disabled via trace flag T-634 
      • ⇐ the conversion of closed delta stores to row groups can be forced by reorganizing an index [5]
        • runs in parallel using multiple threads
          • decreases significantly conversion time at a cost of extra CPU load and memory usage [5]
        • via: ALTER INDEX REORGANIZE command
    • it doesn’t prevent other sessions from inserting new data into a table [5]
    • deletions and data modifications would be blocked for the duration of the operation [5]
      • {recommendation} consider forcing index reorganization manually to reduce execution, and therefore locking, time [5]
    • a columnstore index is considered fragmented if it has
      • multiple delta rowgroups
      • deleted rows
    • columnstore indexes require maintenance similar to that of regular B-Tree indexes [5]
      • {issue} partially populated row groups
      • {issue} overhead of delta store and delete bitmap scans during query execution
      • rebuilding the columnstore index addresses the issues
      • the strategy depends on the volatility of the data and the ETL processes implemented in the system [5]
        • {recommendation} rebuild indexes when a table has a considerable volume of deleted rows and/or a large number of partially populated rowgroups [5]
        • {recommendation} rebuild partition(s) that still have a large number of rows in open delta stores after the ETL process has completed, especially if the ETL process does not use a bulk insert API [5]
      • creating/dropping/disabling/rebuilding functions like any other index
    • columnstore statistics 
      • a statistics object is created at the time of columnstore index creation; however, it is neither populated nor updated afterward [5]
        • ⇐ SQL Server relies on segment information, B-Tree indexes (when available), and column-level statistics when deciding if a columnstore index needs to be used [5]
        • it is beneficial to create missing column-level statistics on the columns that participate in a columnstore index and are used in query predicates and as join keys [5]
          • ⇐ statistics rarely update automatically on very large tables [5]
            • ⇒ statistics must be updated ‘manually’
      • [SQL Server 2019] included into the schema-only clone of a database functionality [8]
        • enables performance troubleshooting without the need to manually capture the statistics information
        • via: DBCC CLONEDATABASE
      • [SQL Server 2019] columnstore indexes have been added to sp_estimate_data_compression_savings [8]
        • both COLUMNSTORE and COLUMNSTORE_ARCHIVE are supported, allowing to estimate the space savings if either of these compression types is used on a table
    • [in-memory tables] 
      • {limitation} a columnstore index must include all the columns and can’t have a filtered condition [2]
      • {limitation} queries on columnstore indexes run only in InterOP mode, and not in the in-memory native mode [2]
    • {operation} designing columnstore indexes
      • {best practice} understand the data’s characteristics as much as possible
      • {best practice} identify workload’s characteristics
    • {operation} create a clustered columnstore index
      • via CREATE CLUSTERED COLUMNSTORE INDEX command
      • no columns need to be specified in the statement
        • ⇐ the index will include all table columns
    • {operation} index rebuilding 
      • forces SQL Server to remove deleted rows physically from the index and to merge the delta stores’ and row groups’ data [5]
        • all column segments are recreated with row groups fully populated [5]
      • [<SQL Server 2019] offline operation
      • [SQL Server 2019 Enterprise] online operation
        • ⇒ higher availability 
        • ⇐ pausing and resuming create and rebuild operations are not supported [11]
      • very resource intensive process
      • holds a schema modification (Sch-M) lock on the table
        • ⇒ prevents other sessions from accessing it [5]
        • ⇐ the overhead can be mitigated by using table/index partitioning
          • ⇒ indexes will be rebuilt on a partition basis for those partitions with volatile data [5]
    • {operation} index reorganization 
      • [<SQL Server 2019] a reorganize operation is required to merge smaller COMPRESSED rowgroups, following an internal threshold policy that determines how to remove deleted rows and combine the compressed rowgroups
      • [SQL Server 2019] a background merge task also works to merge COMPRESSED rowgroups from where a large number of rows has been deleted
        • ⇐ after merging smaller rowgroups, the index quality should be improved.
        • the tuple-mover is helped by a background merge task that automatically compresses smaller OPEN delta rowgroups that have existed for some time as determined by an internal threshold, or merges COMPRESSED rowgroups from where a large number of rows has been deleted
        • via: ALTER INDEX REORGANIZE command
      • [SQL Server 2016] performs additional defragmentation
        • removes deleted rows from row groups that have 10 or more percent of the rows logically deleted [5]
        • merges closed row groups together, keeping the total number of rows less than or equal to the rowgroup limit [5]
        • ⇐ both processes can be done together [5]
      • [SQL Server 2014] the only action performed is compressing and moving the data from closed delta stores to rowgroups [5] 
        • ⇐ delete bitmap and open delta stores stay intact [5]
      • via: ALTER INDEX REORGANIZE
        • uses all available system resources while it is running [5]
          • ⇒ speeds up the execution process 
          • reduces the time during which other sessions cannot modify or delete data in a table [5]
        • close and compress all open row groups
          • via: ALTER INDEX REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON)
          • row groups aren’t merged during this operation [5]
    • {operation} estimate compression savings
      • [SQL Server 2019] COLUMNSTORE and COLUMNSTORE_ARCHIVE added
        • allows estimating the space savings if either of these indexes are used on a table [8]
        • {limitation} not available in all editions 
      • via: sp_estimate_data_compression_savings 
    • {operation} [bulk loads] when the number of rows in the batch is below the bulk load threshold (102,400 rows), all the rows go directly to the deltastore
      • [large bulk load] most of the rows go directly to the columnstore without passing through the deltastore
        • some rows at the end of the bulk load might be too few in number to meet the minimum size of a rowgroup
          • ⇒ the final rows go to the deltastore instead of the columnstore
      • bulk insert operations provide the number of rows in the batch as part of the API call [5]
        • best results are achieved by choosing a batch size that is divisible by rowgroup’s limit [5]
          • ⇐ guarantees that every batch produces one or several fully populated row groups [5]
            • ⇒ reduce the total number of row groups in a table [5]
            • ⇒ improves query performance
          • ⇐ the batch size shouldn’t exceed rowgroup’s limit [5]
            • row groups can still be created on the fly, in a manner similar to a bulk insert, when the size of the insert batch is close to or exceeds the rowgroup limit [5]
  • {operation} [non-bulk operations] trickle inserts go directly to a delta store
  • {feature} parallel inserts
    • [SQL Server 2016] requires following conditions for parallel insert on CCI [6]
      • must specify TABLOCK
      • no NCI on the clustered columnstore index
      • no identity column
      • database compatibility level is set to 130
  • {recommendation} minimize the use of string columns in fact tables [5]
    • string data use more space
    • their encoding involves additional overhead during batch mode execution [5]
    • queries with predicates on string columns may have less efficient execution plans that also require significantly larger memory grants as compared to their non-string counterparts [5]
  • {limitation} [SQL Server 2012|2014] string predicates are not pushed down toward the lowest operators in execution plans [5]
  • {recommendation} add another dimension table and replace the string value in the fact table with a synthetic, integer-based ID key that references the new table [5]
  • {operation} upgrading to SQL Server 2016
    • make sure that queries against tables with columnstore indexes can utilize parallelism when the database compatibility level is less than 130 [5]
  • {feature} [SQL Server 2019] automated columnstore index maintenance [8]
  • {improvement} [SQL Server 2019] better columnstore metadata memory management
  • {improvement} [SQL Server 2019] low-memory load path for columnstore tables
  • {improvement} [SQL Server 2019] improved performance for bulk loading to columnstore indexes
  • {improvement} [SQL Server 2019] server startup process has been made faster for databases that use in-memory columnstore tables for HTAP
  • {feature} DMVs (dynamic management views) expose rowgroup- and segment-level metadata (e.g. sys.dm_db_column_store_row_group_physical_stats; see the T-SQL sketch below)
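
The following T-SQL sketch pulls together several of the operations referenced in the notes above: creating a clustered and a filtered nonclustered columnstore index with a compression delay, reorganizing and rebuilding the index, estimating the compression savings, and inspecting the rowgroups via DMVs. It is a minimal sketch; the dbo.FactSales and dbo.Orders tables, their columns and the index names are hypothetical placeholders to be adapted to the actual schema.

-- create a clustered columnstore index (no column list needed; it includes all table columns)
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
ON dbo.FactSales;

-- create a filtered nonclustered columnstore index with a compression delay of 150 minutes
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Orders
ON dbo.Orders (OrderDate, CustomerID, Amount)
WHERE OrderStatus = 5
WITH (COMPRESSION_DELAY = 150);

-- close and compress all open rowgroups, respectively merge smaller compressed rowgroups
ALTER INDEX CCI_FactSales ON dbo.FactSales
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

-- rebuild the index (optionally on a partition basis) to physically remove deleted rows
ALTER INDEX CCI_FactSales ON dbo.FactSales REBUILD;
--ALTER INDEX CCI_FactSales ON dbo.FactSales REBUILD PARTITION = 3;

-- [SQL Server 2019] estimate the savings for columnstore compression
EXEC sp_estimate_data_compression_savings
     @schema_name = 'dbo'
   , @object_name = 'FactSales'
   , @index_id = NULL
   , @partition_number = NULL
   , @data_compression = 'COLUMNSTORE';

-- inspect rowgroup state, size and trim reason
SELECT object_name(object_id) table_name
, row_group_id
, state_desc
, total_rows
, deleted_rows
, trim_reason_desc
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('dbo.FactSales');
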
Previous Post <<||>> Next Post

References:
[1] SQL Docs (2020) Columnstore indexes: Overview [link]
[2] Microsoft Learn (2024) SQL: What's new in columnstore indexes  [link]
[3] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)
[4] SQL Docs (2019) Columnstore indexes - Query performance [link]
[5] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.
[6] Microsoft Learn (2016) Columnstore Index: Parallel load into clustered columnstore index from staging table [link]
[7] Microsoft Learn (2016) Columnstore Index Defragmentation using REORGANIZE Command [link]
[8] Microsoft (2018) Microsoft SQL Server 2019: Technical white paper [link]

Acronyms:
CCI - clustered columnstore index
CI - columnstore index
DBCC - Database Console Commands
DMV - Dynamic Management View
ETL - Extract, Transform, Load
HTAP - Hybrid Transactional/Analytical Processing 
LOB - Large Object
NCCI - nonclustered columnstore index
OLTP - On-Line Transaction Processing
SP - Service Pack

14 February 2025

🏭🧊🗒️Microsoft Fabric: Partitions in Lakehouses [Notes]

 Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 14-Feb-2025

[Microsoft Fabric] Partitions

  • {def} a data organization technique used to split a large dataset into smaller, more manageable nonoverlapping subsets (aka partitions, shards)
    • a partition is defined based on one or more fields
    • each partition contains a subset of the data
    • each partition can be stored and processed independently
  • {goal} improve performance, scalability, and manageability of large data tables
  • {benefit} allows splitting large tables into smaller, manageable partitions based on specific criteria [2] (see the Spark SQL sketch after these notes)
    • e.g., date ranges, regions, categories, entities
  • {benefit} allows improving query performance, as queries can target specific partitions [2]
    • reduces the amount of data scanned [2]
    • improves queries' performance [2]
  • {benefit} allows for more efficient data loading [2]
  • {benefit} facilitates the management of big tables [2]
    • maintenance tasks can be performed on individual partitions  [2]
    • obsolete data partitions can be removed with no overhead, and new partitions can be added on an as-needed basis [2]
  • applies to 
    • backups
    • indexing
    • allows optimizing query performance for specific subsets of data
    • statistics
  • performance can be affected by
    • the choice of partition columns for a delta table [1]
    • the number and size of partitions of the partition column [1]
    • a column with high cardinality (mostly or entirely made of unique values) results in a large number of partitions [1]
      • ⇐ negatively impacts performance of the metadata discovery scan for changes [1]
      • {recommendation} if the cardinality of a column is high, choose another column for partitioning [1]
    • the size of each partition can also affect performance
      • {recommendation} use a column that would result in a partition of at least (or close to) 1 GB [1]
      • {recommendation} follow the best practices for delta tables maintenance [1]
      • a large volume of small-sized parquet files increases the time it takes to sync the changes [1]
        • ⇐ a large number of parquet files in a delta table can result from [1]
          • over-partitioning
            • partitions with high number of unique values [1]
            • {recommendation} choose a partition column that 
              • doesn't have a high cardinality [1]
              • results in individual partition size of at least 1 GB [1]
          • many small files
            • batch and streaming data ingestion rates might result in small files
              • depends on the frequency and size of changes being written to a lakehouse [1]
            • {recommendation} implement regular lakehouse table maintenance [1] 
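
As a brief illustration of the recommendations above, a lakehouse table can be partitioned at creation time by a low-cardinality column and maintained regularly to avoid the accumulation of small parquet files. The sketch below uses Spark SQL in a Fabric notebook; the sales_orders table, its columns and the partition column are hypothetical placeholders.

-- create a Delta table partitioned by a low-cardinality, date-derived column
CREATE TABLE sales_orders (
    OrderID BIGINT,
    OrderDate DATE,
    Region STRING,
    Amount DECIMAL(18,2),
    OrderYear INT
)
USING DELTA
PARTITIONED BY (OrderYear);

-- regular table maintenance: compact small parquet files, then clean up the obsolete ones
OPTIMIZE sales_orders;
VACUUM sales_orders;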

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Microsoft Fabric: SQL analytics endpoint performance considerations [link]
[2] Kenneth A Omorodion (2024) Partitioning Data in Microsoft Fabric to Improve Performance [link]
[3] Microsoft Learn (2024) Microsoft Fabric: Loading Fabric Lakehouse Tables with partitions [link]

Resources
[R1] Microsoft Learn (2024) Microsoft Fabric: Load data to Lakehouse using partition in a Data pipeline [link]

09 February 2025

🏭🗒️Microsoft Fabric: Kusto Query Language (KQL) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 9-Feb-2025

[Microsoft Fabric] Kusto Query Language (KQL)

  • {def} a read-only request to process data and return results [1]
    • designed for data exploration and summarization [1]
      • very similar to SQL
        • the explain command can be used to transform SQL into KQL code
          • ⇐ not all the SQL syntax can be translated
    • statements are executed in the order in which they are arranged (see the example after these notes)
      • funnel-like processing where data is piped from one operator to the next
        • data is filtered, rearranged or summarized at each step and then fed into the following step
        • the steps are chained by a pipe (|)
    • returns data in a tabular or graph format
    • designed and developed to take advantage of cloud computing through clustering and scaling compute [2]
      • ideal engine to power fast, real-time dashboards
    • case-sensitive in general 
    • named after the undersea pioneer Jacques Cousteau [2]
    • operation sequence
      • filter data
      • aggregate data
      • order data
      • modify column output
  • supports standard data types 
    • string
      • a sequence of zero or more Unicode characters
      • characters are encoded in UTF-8.
    • int 
      • 32-bit whole-number integer
    • long
      • signed 64-bit whole-number integer
    • real (aka double)
      • a 64-bit double-precision floating-point number
      • provides high precision with decimal points
    • decimal
      • a 128-bit decimal number
      • provides the highest precision of decimal points
      • {recommendation} if precision is not needed, use the real type instead [2]
    • bool 
      • a boolean value that can be true (1), false (0), or null
    • datetime 
      • represents a date in the UTC zone
    • timespan 
      • represents a time interval
        •  days, hours, minutes, seconds, milliseconds, microseconds, tick
      • if no time frame is specified, it will default to day
    • dynamic
      • a special data type that can take 
        • any value from the other data types
        • arrays
        • a {name = value} property bag
    • guid
      • a 128-bit globally unique value
  • statement types
    • tabular expression statement
    • let statement
      • used to 
        • set variable names equal to an expression
        • create views
      • ⇐ used mostly to 
        • help break complex expressions into multiple parts, each represented by a variable
        • set constants outside the query to aid readability
    • set statement
      • used to set the query duration
  • {tool} Microsoft Sentinel
    • {def} a cloud native SIEM and SOAR that provides cyberthreat detection, investigation, response, and proactive hunting, with a bird's-eye view across your enterprise [3]
  • {tool} Kusto Explorer
    • {def} user-friendly interface to query and analyze data with KQL [4]
  • {tool} Azure Data Studio 
    • {def} lightweight, cross-platform data management and development tool for data professionals [5]
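
A minimal example ties the concepts above together: a let statement defines a constant, and a tabular expression statement pipes the data through the typical filter, aggregate, order and column-shaping steps. The table and column names below are illustrative only and may not match an actual database.

// constant defined outside the query via a let statement
let minDate = datetime(2024-01-01);
SalesTable
| where OrderDate >= minDate                     // filter data
| summarize TotalAmount = sum(Amount) by Region  // aggregate data
| order by TotalAmount desc                      // order data
| project Region, TotalAmount                    // modify column output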

References:
[1] Microsoft (2024) Real-time Analytics: End-to-End Workshop
[2] Mark Morowczynski et al (2024) The Definitive Guide to KQL: Using Kusto Query Language for Operations, Defending, and Threat Hunting
[3] Microsoft Learn (2024) Azure: What is Microsoft Sentinel [link]
[4] Microsoft Learn (2024) Kusto: Kusto.Explorer installation and user interface [link]
[5] Microsoft Learn (2024) SQL: What is Azure Data Studio? [link]

Acronyms:
KQL - Kusto Query Language
SIEM - security information and event management
SOAR - security orchestration, automation, and response
SQL - Structured Query Language
UTC - Universal Time Coordinated

🏭🗒️Microsoft Fabric: Sharding [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 9-Feb-2025

 [Microsoft Fabric] Data Partitioning (aka Sharding)

  • {definition} "a process where small chunks of the database are isolated and can be updated independently of other shards" [2]
  • allows a logical database to be partitioned across multiple physical servers [1]
    • each partition is referred to as a shard
    • the largest tables are partitioned across multiple database servers [1]
      • when operating on a record, the application must determine which shard will contain the data and then send the SQL to the appropriate server [1]
    • partitioning is based on a Key Value
      • e.g. such as a user ID
  • proven technique for achieving data processing on a massive scale [1]
    • solution used at the largest websites 
      • e.g. Facebook, Twitter
      • usually associated with rapid growth
        •  ⇒ the approach needs to be dynamic [1]
    • the only way to scale a relational database to massive web use [1]
      • together with caching and replication [1]
  • {drawback} involves significant operational complexities and compromises [1]
    • the application must contain logic that understands the location of any particular piece of data and the logic to route requests to the correct shard [1]
    • requests that can only be satisfied by accessing more than one shard thus need complex coding as well, whereas on a nonsharded database a single SQL statement might suffice.
  • {drawback} high operational costs [1]
  • {drawback} application complexity
    • it’s up to the application code to route SQL requests to the correct shard [1]
      • ⇒ a dynamic routing layer must be implemented
        • ⇐ most massive websites are adding shards as they grow [1]
        • layer required to maintain Memcached object copies and to differentiate between the master database and read-only replicas [1]
  • {drawback} crippled SQL
    • [sharded database] it is not possible to issue a SQL statement that operates across shards [1]
      • ⇒ usually SQL statements are limited to row-level access [1]
      • ⇒ only programmers can query the database as a whole [1]
      • joins across shards cannot be implemented, nor can aggregate GROUP BY operations [1]
  • {drawback} loss of transactional integrity
    • ACID transactions against multiple shards are not possible and/or not practical [1]
      • ⇐ {exception} there are database systems that support 2PC
        • involves considerable troubleshooting as conflicts and bottlenecks can occur [1]
  • {drawback} operational complexity. 
    • load balancing across shards becomes extremely problematic
      • adding new shards requires a complex rebalancing of data [1]
      • changing the database schema requires a rolling operation across all the shards [1]
        • ⇒ can lead to transitory inconsistencies in the schema [1]
    • a sharded database entails a huge amount of operational effort and administrator skill [1]
  • {concept} CAP (Consistency, Availability, and Partition) theorem 
    • in a distributed database system, one can have at most two of the three CAP properties [1]
    • consistency
      • every user of the database has an identical view of the data at any given instant [1]
    • availability
      • in the event of a failure, the database remains operational [1]
    • partition tolerance
      • the database can maintain operations in the event of the network’s failing between two segments of the distributed system [1]
  • {concept} partitioning
    • {def} core pattern of building scalable services by dividing state (data) and compute into smaller accessible units to improve scalability and performance [5]
      • ⇐ determines that a particular service partition is responsible for a portion of the complete state of the service.
        • a partition is a set of replicas
    • {type} [stateless services] a logical unit that contains one or more instances of a service [5]
      • partitioning a stateless service is a very rare scenario
      • scalability and availability are normally achieved by adding more instances
      • {subtype} externally persisted state
        • persists its state externally [5]
          • e.g. databases in Azure SQL Database
      • {subtype} computation-only services
        • services that do not manage any persistent state, e.g. a calculator or image thumbnailing [5]
    • {type} scalable stateful services
      • partition state (data)
      • a partition of a stateful service is a scale unit that is kept highly reliable through replicas that are distributed and balanced across the nodes in a cluster
      • the state must be accessed and stored
        • ⇒ bound by
          • network bandwidth limits
          • system memory limits
          • disk storage limits
        • {scenario} run into resource constraints in a running cluster
          • {recommendation} scale out the cluster to accommodate the new requirements [4]
  • {concept} [Service Fabric] distributed systems platform used to build hyper-scalable, reliable and easily managed applications for the cloud [6]
    • ⇐ addresses the significant challenges in developing and managing cloud applications
    • places the partitions on different nodes [5]
      • allows partitions to grow to a node's resource limit
        • ⇐ partitions are rebalances across nodes [5]
      • {benefit} ensures the continued efficient use of hardware resources [5]
    • {default} makes sure that there is about the same number of primary and secondary replicas on each node
      • ⇒ some nodes may hold replicas that serve more traffic while others serve less traffic [5]
      • hot and cold spots may appear in a cluster
          • ⇐ which should preferably be avoided
        • {recommendation} partition the state so is evenly distributed across all partitions [5]
        • {recommendation} report load from each of the replicas for the service [5]
    • provides the capability to report load consumed by services [5]
      • e.g.  amount of memory, number of records
      • detects which partitions serve higher loads than others [5]
        • ⇐ based on the metrics reported
      • rebalances the cluster by moving replicas to more suitable nodes, so that overall no node is overloaded  [5]
      • ⇐ it's not always possible to know how much data will be in a given partition
        • {recommendation} adopt a partitioning strategy that spreads the data evenly across the partitions [5]
          • {benefit} prevents situations described in the voting example [5]
        • {recommendation} report load
          • {benefit} helps smooth out temporary differences in access or load over time [5]
    • {recommendation} choose an optimal number of partitions to begin with
      • ⇐ there's nothing that prevents from starting out with a higher number of partitions than anticipated [5]
        • ⇐ assuming the maximum number of partitions is a valid approach [5]
      • ⇒ one may end up needing more partitions than initially considered [5]
        • ⇐ {constraint} the partition count can't be changed after the fact [5]
          •  ⇒ apply more advanced partition approaches
            • e.g. creating a new service instance of the same service type
            • e.g. implement client-side logic that routes the requests to the correct service instance

References:
[1] Guy Harrison (2015) Next Generation Databases: NoSQL, NewSQL, and Big Data
[2] DAMA International (2017) "The DAMA Guide to the Data Management Body of Knowledge" 2nd Ed
[3] Microsoft Fabric (2024) External data sharing in Microsoft Fabric [link]
[4] Microsoft Fabric (2024) Data sharding policy [link]
[5] Microsoft Fabric (2024) Partition Service Fabric reliable services [link]
[6] MSDN (2015) Microsoft Azure - Azure Service Fabric and the Microservices Architecture [link]

Acronyms:
ACID - atomicity, consistency, isolation, durability
2PC - Two Phase Commit
CAP - Consistency, Availability, Partition

06 February 2025

🌌🏭KQL Reloaded: First Steps (Part V: Database Metadata)

When working with a new data repository, one of the first things to do is to look at the database's metadata, when available, and try to get a bird's-eye view of what's available: how big the database is in terms of size, tables and user-defined objects, how the schema was defined, how the data are stored, eventually how often backups are taken, what users have access and to what, etc. 

So, after creating some queries in KQL and figuring out how things work, I tried to check what metadata are available, how they can be accessed, etc. The target is not to provide a full list of the available metadata, but to understand what information is available, in what format, and how easy it is to extract the important metadata. 

So, the first set of metadata is related to database:

// get database metadata
.show databases (ContosoSales)

// get database metadata (multiple databases)
.show databases (ContosoSales, Samples)

// get database schema metadata
.show databases (ContosoSales) schema

// get database schema metadata (multiple databases) 
.show databases (ContosoSales, Samples) schema

// get database schema violations metadata
.show database ContosoSales schema violations

// get database entities metadata
.show databases entities with (showObfuscatedStrings=true)
| where DatabaseName == "ContosoSales"

// get database tables metadata 
.show databases entities with (resolveFunctionsSchema=true)
| where DatabaseName == "ContosoSales" and EntityType == "Table"
//| summarize count () //get the number of tables

// get a function's details
.show databases entities with (resolveFunctionsSchema=true)
| where DatabaseName == "ContosoSales" 
    and EntityType == "Function" 
    and EntityName == "SalesWithParams"

// get external tables metadata
.show external tables

// get materialized views metadata
.show materialized-views

// get query results metadata
.show stored_query_results

// get entities groups metadata
.show entity_groups

Then, it's useful to look at the database objects. 

// get all tables 
.show tables 
//| count

// get tables metadata
.show tables (Customers, NewSales)

// get tables schema
.show table Customers cslschema

// get schema as json
.show table Customers schema as json

// get table size: Customers
Customers
| extend sizeEstimateOfColumn = estimate_data_size(*)
| summarize totalSize_MB=round(sum(sizeEstimateOfColumn)/1024.00/1024.00,2)

Unfortunately, the public environment has restrictions concerning the creation of objects, while for several of the available features one needs to create objects first in order to query the corresponding metadata.

Furthermore, it would be interesting to understand who has access to the various repositories, what policies were defined, and so on. 

// get principal roles
.show database ContosoSales principal roles

// get principal roles for table
.show table Customers principal roles

// get principal roles for function:
.show function SalesWithParams principal roles

// get retention policies
.show table Customers policy retention

// get sharding policies
.show table Customers policy sharding

There are many more objects one can explore. It makes sense to document the features, respectively the objects used for the various purposes.

In addition, one should also check the best practices available for the data repository (see [2]).

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Management commands overview [link]
[2] Microsoft Learn (2024) Kusto: Best practices for schema management [link]

27 January 2025

🗄️🗒️Data Management: Data Quality Dimensions [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. 

Last updated: 27-Jan-2025

[Data Management] Data quality dimensions

  • {def} features of data that can be measured or assessed against defined standards to determine the quality of data
    • captures a specific aspect of general data quality
      • can refer to data values or to their schema
  • {type} hard dimensions
    • dimensions that can be measured
  • {type} soft dimensions
    • dimensions that can be measured only indirectly
      • ⇐ through interviews with data users or through any other kind of communication with users
    • dimensions whose measurement depends on the perception of the users of the data
  • {dimension} uniqueness [post]
    • the degree to which a value or set of values is unique within a dataset
      • can be determined based on a set of values supposed to be unique across the whole dataset
        • some systems have an artificial, respectively a natural, unique identifier
      • measured in terms of either
        • the percentage of unique values available in a dataset 
        • the percentage of duplicate values available in a dataset (see the query sketch after these notes)
    • the impossibility of identifying whether a value is unique increases the chances for it to be duplicated
    • it can have broader implications
      • aggregated information is not shown correctly
        • ⇐ split across different entities
      • can lead to further duplicates in other areas
    • {recommendation} enforce uniqueness by design, if possible
    • {recommendation} check the data regularly for duplicates and disable or delete the duplicated records
      • ⇐ one should make sure that the records can't be further reused in business processes or analytics workloads
  • {dimension} completeness [post]
    •  the extent to which there are missing data in a dataset
      • ⇐ reflected in the number of the missing values
        • measured as percentage of the missing values compared to the total
    • determined by the presence of NULL values  
      • {type} attribute completeness 
        • the number of NULLs in a specific attribute
      • {type} tuple completeness 
        • the number of unknown values of the attributes in a tuple
      • {type} relation completeness 
        • the number of tuples with unknown attribute values in the relation
      • {type} value completeness
        • makes sense for complex, semi-structured columns such as XML data type columns
          • e.g. a complete element or attribute can be missing
    • considered in relation to 
      • mandatory attributes
        • attributes that need a not-Null value for each record
      • optional attributes
        • attributes that not necessarily need to be provided
      • inapplicable attributes
        • attributes not applicable (relevant) for certain scenarios by design
  • {dimension} conformity (aka format compliance) [post]
    • {def} the extent data are in the expected format
      • dependent on the data type and its definition
    • can be associated with a set of metadata 
      • data type
        • e.g. text, numeric, alphanumeric, positive, date
      • length
      • precision
      • scale
      • formatting patterns 
        • e.g. phone number, decimal and digit grouping symbols
        • different formatting might apply based on various business rules 
        • can use delimiters
    • {recommendation} define the data type and further constraints to enforce the various characteristics of the element
    • {recommendation} make sure that the delimiters don't overlap with other uses
  • {dimension} accuracy [post]
    • {def} the extent to which data are correct, respectively match reality with an acceptable level of approximation
    • stricter than just conforming to business rules
    • can be measured at column and table level
      • [discrete data values] 
        • use frequency distribution of values
          • a value with very low frequency is probably incorrect
      • [alphanumeric values]
        • use string length distribution
          • a  string with a very atypical length is potentially incorrect
        • try to find patterns and then create pattern distribution.
          • patterns with low frequency probably denote wrong values
      • [continuous attributes]
        • use descriptive statistics
          • just by looking at minimal and maximal values, you can easily spot potentially problematic data
  • {dimension} consistency [post]
    • {def} the degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component
      • {type} notational consistency
        • the extent (data) values are consistent in notation
      • {type} semantic consistency
        • the degree to which data has unique meaning
        • is more restrictive than the notational consistency
    • measures the equivalence of information stored in various repositories
    • involves comparing values with a predefined set of possible values
      •  from the same or from different systems
    • can be measured at column and table level
    • can have different scopes
      • cross-system consistencies
        • among systems or data repositories
      • cross-record consistency
        • within the same repository
      • temporal consistency
        •  within the same record at different points in time
  • {dimension} timeliness [post]
    • tells the degree to which data is current and available when needed
      • there is always some delay between change in the real world and the moment when this change is entered into a system
    • stale data/obsolete data
  • {dimension} structuredness [post]
    • the degree to which a data structure or model possesses a definite pattern of organization of its interdependent parts
    • allows the categorization of data as
      • structured data [def]
        • refers to structures that can be easily perceived or known, raising no doubt about the structure’s delimitations
      • unstructured data [def]
        • refers to textual data and media content (video, sound, images), in which structural patterns, even if they exist, are hard to discover or are not predefined
      • semi-structured data [def]
        • refers to islands of structured data stored with unstructured data, or vice versa
      • ⇐ the more structured the data, the easier it is to be processed
  • {dimension} referential integrity [post]
    • {def} the degree to which the values of a key in one table (aka reference value) match the values of a key in a related table (aka the referenced value)
    • it's an architectural concept of the database 
    • {recommendation} keep the referential integrity of a system by design
      • some systems build logic for assuring the referential integrity in the applications and not in the database
  • {dimension} currency (aka actuality)
    • the extent to which data is current (actual)
    • can be considered as a special type of accuracy 
      • ⇐ when the data is not current, it doesn’t reflect reality
  • {dimension} ease of use
    • the extent to which data can be used for a given purpose
      • usually it refers to whether the data can be processed as needed
      • depends on the application or on the user interface
  • {dimension} fitness of use
    • the degree to which the data is fit for use
      • the data may have good quality for a given purposes but 
        • not usable for other purposes
        • can be used as substitute for other data
          • e.g. use phone area codes instead of ZIP codes to locate customers approximately
  • {dimension} trustfulness [post]
    • the degree to which the data can be trusted
      • is a matter of perception 
        • ask users whether they trust the data and what the reasons are
          • if the users don’t trust the data 
            • they will create their own solutions
            • they will not use applications
  • {dimension} entropy
    • {def} the average amount of information conveyed
      • ⇐ quantification of information in a system
      • ⇐ the more dispersed the values and the more the frequency distribution of a discrete column is equally spread among the values, the more information is available [1]
      • ⇐ can tell whether your data is suitable for analysis or not
    • can be measured at column and table level
  • {dimension} presentation quality 
    • applicable to applications that presents data
      • format and appearance should support the appropriate use of data
      • depends on the UI used
  • {recommendation} have a dedicated system for maintaining the master data and broadcast the data to the subscribers as needed
    • the data should be exclusively managed through the management system
    • {anti-pattern} data is modified in the subscribers and the changes aren't always reflected back to the source system
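
To make the hard dimensions above tangible, the query below sketches how uniqueness and completeness could be measured with plain SQL. It is a minimal sketch; the dbo.Customers table and its Email and Phone columns are hypothetical placeholders.

-- uniqueness of Email, respectively completeness of Phone
SELECT COUNT(*) TotalRows
, COUNT(DISTINCT Email) * 100.0 / NULLIF(COUNT(Email), 0) EmailUniquenessPct
, COUNT(Email) - COUNT(DISTINCT Email) DuplicateEmailValues
, COUNT(Phone) * 100.0 / NULLIF(COUNT(*), 0) PhoneCompletenessPct
FROM dbo.Customers;
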
Previous Post <<||>>  Next Post

References:
[1] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)

24 January 2025

🧭Business Intelligence: Perspectives (Part XXIV: Building Castles in the Air)

Business Intelligence Series

Business users have mainly three means of visualizing data – reports, dashboards and, more recently, notebooks, the latter being a mix between reports and dashboards. Given that all three types of display can be a mix of tabular representations and visuals/visualizations, the difference between them is often negligible, to the degree that the terms are used interchangeably. 

For example, in Power BI a report is a "multi-perspective view into a single semantic model, with visualizations that represent different findings and insights from that semantic model" [1], while a dashboard is "a single page, often called a canvas, that uses visualizations to tell a story" [1], a dashboard's visuals coming from one or more reports [2]. Despite this clear delimitation, the two concepts continue to be mixed and misused in conversations even by data-related professionals. This happens also because in other tools the vendors designate as a dashboard what is called a report in Power BI. 

Given the limited terminology, it's easy to generalize that dashboards are useless, poorly designed, bad for business users, and so on. As Stephen Few recognized almost two decades ago, "most dashboards fail to communicate efficiently and effectively, not because of inadequate technology (at least not primarily), but because of poorly designed implementations" [3]. Therefore, when people say that "dashboards are bad", they refer to the results of poor implementations, some of which they were part of, which frankly is a different topic! Unfortunately, BI implementations reflect, probably more than any other area, how easy it is to fail!

Frankly, here it is not necessarily the poor implementation of a project management methodology that is at fault, which quite often happens, but the way requirements are defined, understood, documented and implemented. Even if these last aspects are part of the methodologies, they are merely a reflection of how people understand the business. The outcomes of BI implementations are rooted in other areas, and it starts with how the strategic goals and objectives are defined, and how the elements that need oversight are considered in the broader perspectives. The dashboards thus become the end result of a chain of failures to build the business-related foundation on which the reporting infrastructure should rest. It's so easy to shift the blame to what's perceptible rather than to what's missing!

Many dashboards are built because people need a sense of what's happening in the business. It starts with some ideas based on the problems identified in organizations, one or more dashboards are built, and sometimes a lot of time is invested in the process. Then, some important progress is made, and everything comes to a standstill if the numbers don't reveal something new, important, or whatever the users' perception is. Some might regard this as failure, though as long as the initial objectives were met, something was learned in the process and a difference was made, one can't equate this with failure!

It's more important to recognize the temporary character of dashboards, respectively of the requirements that lead to them and are built around them. Of course, this occasionally requires a different approach to the whole topic. It starts with how KPIs and other business metrics are defined and organized, respectively how data repositories are built, and it ends with how data are visualized and reported.

As the practice often revealed, it's possible to build castles in the air, without a solid foundation, though the expectation for such edifices to sustain the weight of businesses is unrealistic. Such edifices break with the first strong storm, and unfortunately it's easier to blame a set of tools, some people or a whole department instead of looking critically at the whole organization!


References:
[1] Microsoft Learn (2024) Power BI: Glossary [link]
[2] Microsoft Learn (2024) Power BI: Dashboards for business users of the Power BI service [link]
[3] Stephen Few (2006) Information Dashboard Design
