
16 February 2025

💠🛠️🗒️SQL Server: Columnstore Indexes [Notes]

Disclaimer: This is a work in progress intended to consolidate information from various sources. It considers only on-premises SQL Server; for other platforms please refer to the documentation.

Last updated: 15-Feb-2024

[SQL Server] columnstore indexes (CI)

  • {def} a technology for storing, retrieving and managing data by using a columnar data format (aka columnstore)
    • store compressed data on a per-column rather than a per-row basis [5]
  • {benefit} designed for analytics and data warehousing workloads
    • data warehousing
      • {scenario} store fact tables and large dimension tables
        • ⇐ tend to require full table scans rather than table seeks
    • analytics workloads
      • {scenario} [SQL Server 2016 SP1] can be used for real-time analytics on operational databases
        • ⇐ an updatable nonclustered columnstore index can be created on a rowstore table
  • {benefit} performance increase 
    • can achieve up to 100x better performance [4]
    • offers an order of magnitude better performance than a rowstore index
      • {feature} uses batch mode execution
        • improves query performance typically by two to four times
      • have high performance gains for analytic queries that scan large amounts of data, especially on large tables (>1 million rows) 
  • {benefit} significantly reduces data warehouse storage costs
    • {feature} data compression
      • ⇒ provides high compression rates, typically by 10 times
        • ⇒ reduces total I/O from the physical media
          • ⇐ queries often select only a few columns from a table
          • minimizes or eliminates system I/O bottlenecks
        • reduces significantly the memory footprint
          • ⇒ query performance can improve 
            • because SQL Server can perform more query and data operations in memory
  • {benefit} built in memory
    • ⇒ sufficient memory must be available 
  • {benefit} part of the database engine
    • no special hardware is needed
  • {concept} columnstore 
    • {def} data structure logically organized as a table with rows and columns, and physically stored in a column-wise data format
      • stores values from the same domain which commonly have similar values
        • when a query references a column, then only that column is fetched from disk [3]
          • ⇐ the columns not requested are skipped 
            • ⇒ they are not loaded into memory 
        • when a query is executed, the rows must be reconstructed
          • ⇒ row reconstruction takes some time and uses some CPU and memory resources [3]
        • [SQL Server 2016] columnstore index on rowstore tables
          • columnstore is updated when data changes in the rowstore table
        • both indexes work against the same data
  • {concept} rowstore
    • {def} data that's logically organized as a table with rows and columns, and physically stored in a row-wise data format
      • ⇐ the traditional way to store relational table data
      • refers to a table where the underlying data storage format is either
        • a heap
        • a clustered index
        • a memory-optimized table
  • {concept} rowstore index
    • performs best on queries that seek into the data, when searching for a particular value, or for queries on a small range of values
      • ⇒ appropriate for transactional workloads 
        • because they tend to require mostly table seeks instead of table scans
  • {concept} rowgroup
    • {def} a group of rows that are compressed into columnstore format at the same time
      • {constraint} has a maximum number of rows per rowgroup, which is 1,048,576 (2^20) rows
      • contains one column segment for every column in the table
      • can have more than one delta rowgroup that form the deltastore
        • e.g. when multiple threads create columnstore indexes using parallel execution plans [5]
          • ⇐ each thread will work with its own subset of data, creating separate rowgroups [5]
        • [partitions] each table partition has its own set of row groups [5]
          • ⇐  too many partitions may prevent workloads from benefiting from a CCI [11]
            • ⇐ data aren’t pushed into a compressed columnstore segment until the rowgroup limit is reached
    • {event} rowgroup is compressed
      • marked as read-only [16]
      • a compressed rowgroup is considered fragmented when either
        • the row count is less than the rowgroup limit and the trim_reason is anything other than DICTIONARY_SIZE [15]
          • ⇐ when the rowgroup was trimmed because the dictionary size reached its maximum, nothing can be done to increase the number of rows [15]
        • it has nonzero deleted rows that exceed a minimum threshold [15]
    • {event} all data from rowgroup deleted 
      • transitions from COMPRESSED into TOMBSTONE state
      • later removed by the tuple-mover background process
    • {event} rows in the columnstore indexes can be moved to different locations
      • row-id in the nonclustered indexes aren’t updated 
        • ⇐ the mappings between old and new row locations are stored in an internal structure (aka mapping index) 
    • {event} rowgroup build
      • all column data are combined on a per-row group basis, encoded and compressed [5]
        • the rows within a row group can be rearranged if that helps to achieve a better compression rate [5]
  • {feature} data compression
    • the table is sliced into rowgroups, and each rowgroup is compressed in a column-wise manner
      • the number of rows in the rowgroup must be 
        • large enough to improve compression rates
        • small enough to benefit from in-memory operations
          • having too many small rowgroups decreases columnstore index’s quality
    • uses its own compression mechanism 
      • ⇒ row or page compression cannot be used on it [3]
      • [SQL Server 2016] page compression has been removed
        • ⇐ in some cases, page compression disallowed the creation of columnstore indexes with a very large number of columns [5]
  • {feature} compression delay
    • computed when a delta rowgroup is closed [7]
    • keeps the ‘active’ rows in delta rowgroup and only transition these rows to compressed rowgroup after a specified delay [7]
      • ⇐ reduces the overall maintenance overhead of NCCI [7]
      • ⇒ leads to a larger number of delta rowgroups [7]
    • {best practice} if the workload is primarily inserting data and querying it, the default COMPRESSION_DELAY of 0 is the recommended option [7]
    • {best practice} [OLTP workload] if > 10% rows are marked deleted in recently compressed rowgroups, then consider a value that accommodates the behavior [7]
      • via: CREATE NONCLUSTERED COLUMNSTORE INDEX … WITH (COMPRESSION_DELAY = 150)
  • {feature} data encoding
    • all values in the data are replaced with 64-bit integers using one of two encoding algorithms
    • {concept} dictionary encoding
      • stores distinct values from the data in a separate structure (aka dictionary) 
        • every value in a dictionary has a unique ID assigned [5]
          • the ID is used for replacement
    • {concept} global dictionary
      • shared across all segments that belong to the same index partition [5]
    • {concept} local dictionary
      • created for individual segments using values that are not present in the global dictionary
    • {concept} value-based encoding
      • mainly used for numeric and integer data types that do not have enough duplicated values [5]
        • dictionary encoding would be inefficient [5]
      • converts integer and numeric values to a smaller range of 64-bit integers in 2 steps
        • {step} [numeric data types] are converted to integers using the minimum positive exponent (aka magnitude) that allows this conversion [5]
          • {goal} convert all numeric values to integers [5]
          • [integer data types] the smallest negative exponent is chosen that can be applied to all values without losing their precision [5]
            • {goal} reduce the interval between the minimum and maximum values stored in the segment [5]
        • {step} the minimum value (aka base value) in the segment is identified and subtracted from all other values [5]
          • ⇒ makes the minimum value in the segment number 0 [5]
      • after encoding the data are compressed and stored as a LOB allocation unit
  • {concept} column segment 
    • {def} a column of data from within the rowgroup
    • is compressed together and stored on physical media
    • SQL Server loads an entire segment to memory when it needs to access its data
  • {concept} segment metadata 
    • store metadata about each segment 
      • e.g. minimum and maximum values
      • ⇐ segments that do not have the required data are skipped [5]
  • {concept} deltastore
    • {def} all of the delta rowgroups of a columnstore index
    • its operations are handled behind the scenes
      • can be in either of two states
        • {state} open (aka open delta store) 
          • accepts new rows and allow modifications and deletions of data
        • {state} closed (aka closed delta store)
          • a delta store is closed when it reaches its rowgroup limit
  • {concept} delta rowgroup 
    • {def} a clustered B-tree index that's used only with columnstore indexes
    • improves columnstore compression and performance by storing rows until the number of rows reaches the rowgroup limit, after which they are moved into the columnstore
    • {event} reaches the maximum number of rows
      • it transitions from an ‘open’ to ‘closed’ state
      • a closed rowgroup is compressed by the tuple-mover and stored into the columnstore as COMPRESSED rowgroup
    • {event} compressed
      • the existing delta rowgroup transitions into TOMBSTONE state to be removed later by the tuple-mover when there is no reference to it
  • {concept} tuple-mover 
    • background process that checks for closed row group
      • if it finds a closed rowgroup, it compresses the delta rowgroup and stores it into the columnstore as a COMPRESSED rowgroup
  • {concept} clustered columnstore index (CCI) 
    • is the primary storage for the entire table
    • {characteristic} updatable
      • has two structures that support data modifications
        • ⇐ both use the B-Tree format to store data [5]
        • ⇐ created on demand [5]
        • delete bitmap
          • indicates which rows were deleted from a table
          • upon deletion the row continues to be stored into the rowgroup
          • during query execution SQL Server checks the delete bitmap and excludes deleted rows from the processing [5]
        • delta store
          • includes newly inserted rows
          • updating a row triggers the deletion of the existing row and insertion of a new version of a row to a delta store
            • ⇒ the existing row data is not modified in place
            • ⇒ the old version of the row is marked as deleted in the delete bitmap
        • [partitions] each partition can have a single delete bitmap and multiple delta stores
          • ⇐ this makes each partition self-contained and independent from other partitions
            • ⇒ allows performing a partition switch on tables that have clustered columnstore indexes defined [5]
    • {feature} supports minimal logging for batch sizes >= rowgroup’s limit [12]
    • [SQL Server 2017] supports non-persisted computed columns in clustered columnstore indexes [2]
    • store some data temporarily into a clustered index (aka deltastore) and a btree list of IDs for deleted rows
      • ⇐ {benefit} reduces fragmentation of the column segments and improves performance
      • combines query results from both the columnstore and the deltastore to return the correct query results
    • [partitions] too many partitions can hurt the performance of a clustered columnstore index [11]
  • {concept} nonclustered columnstore index (NCCI)
    • {def} a secondary index that's created on a rowstore table
      • is defined as one or more columns of the table and has an optional condition that filters the rows
      • designed to be used for workloads involving a mix of transactional and analytics workloads
      • functions the same as a clustered columnstore index
        • ⇐ has same performance optimizations (incl. batchmode operators)
        • {exception} doesn’t support persisted computed columns
          • can’t be created on a columnstore index that has a computed column [2]
        • however it behaves differently across SQL Server versions
          • [SQL Server 2012|2014] {restriction} readonly
      • contains a copy of part or all of the rows and columns in the underlying table
        • include a row-id , which is either the address of
          • a row in a heap table 
          • a clustered index key value
            • includes all columns from the clustered index even when not explicitly defined in the CREATE statement
              • the not specified columns will not be available in the sys.index_columns view
      • [SQL Server 2016] multiple nonclustered rowstore indexes can be created on a columnstore index and perform efficient table seeks on the underlying columnstore
        • ⇒ once created, makes it possible to drop one or more btree nonclustered indexes
      • enables real-time operational analytics where the OLTP workload uses the underlying clustered index while analytics run concurrently on the columnstore index
  • {concept} batch mode execution (aka vector-based execution, vectorized execution)
    • {def} query processing method used to process multiple rows together in groups of rows, or batches, rather than one row at a time
      • SQL Server can push a predicate to the columnstore index scan operator, preventing unnecessary rows from being loaded into the batch [5]
      • queries can process up to 900 rows together
        • enables efficient query execution (by a 3-4x factor) [4]
        • ⇐ the size of the batches varies to fit into the CPU cache
        • ⇒ reduces the number of times that the CPU needs to request external data from memory or other components [5]
      • improves the performance of aggregations, which can be calculated on a per-batch rather than a per-row basis [5]
      • tries to minimize the copy of data between operators by creating and maintaining a special bitmap that indicates if a row is still valid in the batch [5]
        • ⇐ subsequent operators will ignore the non-valid rows
        • every operator has a queue of work items (batches) to process [5]
        • worker threads from a shared pool pick items from queues and process them while migrating from operator to operator [5]
    • is closely integrated with, and optimized around, the columnstore storage format.
      • columnstore indexes use batch mode execution
        • ⇐ improves query performance typically by two to four times
  • {concept} tuple mover
    • single-threaded process that works in the background, preserving system resources
      • runs every five minutes
    • converts closed delta stores to row groups that store data in a column-based storage format [5]
      • can be disabled via trace flag T-634 
      • ⇐ the conversion of closed delta stores to row groups can be forced by reorganizing an index [5]
        • runs in parallel using multiple threads
          • decreases significantly conversion time at a cost of extra CPU load and memory usage [5]
        • via: ALTER INDEX REORGANIZE command
    • it doesn’t prevent other sessions from inserting new data into a table [5]
    • deletions and data modifications would be blocked for the duration of the operation [5]
      • {recommendation} consider forcing index reorganization manually to reduce execution, and therefore locking, time [5]
    • [columnstore index] considered fragmented if it has
      • multiple delta rowgroups
      • deleted rows
    • [columnstore index] requires maintenance like that of regular B-Tree indexes [5]
      • {issue} partially populated row groups
      • {issue} overhead of delta store and delete bitmap scans during query execution
      • rebuilding the columnstore index addresses the issues
      • the strategy depends on the volatility of the data and the ETL processes implemented in the system [5]
        • {recommendation} rebuild indexes when a table has a considerable volume of deleted rows and/or a large number of partially populated rowgroups [5]
        • {recommendation} rebuild partition(s) that still have a large number of rows in open delta stores after the ETL process has completed, especially if the ETL process does not use a bulk insert API [5]
      • creating/dropping/disabling/rebuilding functions like any other index
    • columnstore statistics 
      • a statistics object is created at the time of columnstore index creation; however, it is neither populated nor updated afterward [5]
        • ⇐ SQL Server relies on segment information, B-Tree indexes (when available), and column-level statistics when deciding if a columnstore index needs to be used [5]
        • it is beneficial to create missing column-level statistics on the columns that participate in a columnstore index and are used in query predicates and as join keys [5]
          • ⇐ statistics rarely update automatically on very large tables [5]
            • ⇒ statistics must be updated ‘manually’
      • [SQL Server 2019] included into the schema-only clone of a database functionality [8]
        • enables performance troubleshooting without the need to manually capture the statistics information
        • via: DBCC CLONEDATABASE
      • [SQL Server 2019] columnstore indexes have been added to sp_estimate_data_compression_savings; both COLUMNSTORE and COLUMNSTORE_ARCHIVE are supported, allowing to estimate the space savings if either of these compression types is used on a table [8]
    • [in-memory tables] 
      • {limitation} a columnstore index must include all the columns and can’t have a filtered condition [2]
      • {limitation} queries on columnstore indexes run only in InterOP mode, and not in the in-memory native mode [2]
    • {operation} designing columnstore indexes
      • {best practice} understand as much as possible data’s characteristics
      • {best practice} identify workload’s characteristics
    • {operation} create a clustered columnstore index
      • via CREATE CLUSTERED COLUMNSTORE INDEX command
      • not needed to specify any columns in the statement
        • ⇐ the index will include all table columns
    • {operation} index rebuilding 
      • forces SQL Server to remove deleted rows physically from the index and to merge the delta stores’ and row groups’ data [5]
        • all column segments are recreated with row groups fully populated [5]
      • [<SQL Server 2019] offline operation
      • [SQL Server 2019 Enterprise] online operation
        • ⇒ higher availability 
        • ⇐ pausing and resuming create and rebuild operations are not supported [11]
      • very resource intensive process
      • holds a schema modification (Sch-M) lock on the table
        • ⇒ prevents other sessions from accessing it [5]
        • ⇐ the overhead can be mitigated by using table/index partitioning
          • ⇒ indexes can be rebuilt on a partition basis for those partitions with volatile data [5]
    • {operation} index reorganization 
      • [<SQL Server 2019] a reorganize operation is required to merge smaller COMPRESSED rowgroups, following an internal threshold policy that determines how to remove deleted rows and combine the compressed rowgroups
      • [SQL Server 2019] a background merge task also works to merge COMPRESSED rowgroups from where a large number of rows has been deleted
        • ⇐ after merging smaller rowgroups, the index quality should be improved.
        • the tuple-mover is helped by a background merge task that automatically compresses smaller OPEN delta rowgroups that have existed for some time as determined by an internal threshold, or merges COMPRESSED rowgroups from where a large number of rows has been deleted
        • via: ALTER INDEX REORGANIZE command
      • [SQL Server 2016] performs additional defragmentation
        • removes deleted rows from row groups that have 10 or more percent of the rows logically deleted [5]
        • merges closed row groups together, keeping the total number of rows less than or equal to the rowgroup limit [5]
        • ⇐ both processes can be done together [5]
      • [SQL Server 2014] the only action performed is compressing and moving the data from closed delta stores to rowgroups [5] 
        • ⇐ delete bitmap and open delta stores stay intact [5]
      • via: ALTER INDEX REORGANIZE
        • uses all available system resources while it is running [5]
          • ⇒ speeds up the execution process 
          • reduces the time during which other sessions cannot modify or delete data in a table [5]
        • close and compress all open row groups
          • via: ALTER INDEX REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON)
          • row groups aren’t merged during this operation [5]
    • {operation} estimate compression savings
      • [SQL Server 2019] COLUMNSTORE and COLUMNSTORE_ARCHIVE added
        • allows estimating the space savings if either of these indexes are used on a table [8]
        • {limitation} not available in all editions 
      • via: sp_estimate_data_compression_savings 
    • {operation} [bulk loads] when the number of rows is less than deltastore’s limit, all the rows go directly to the deltastore
      • [large bulk load] most of the rows go directly to the columnstore without passing through the deltastore
        • some rows at the end of the bulk load might be too few in number to meet the minimum size of a rowgroup
          • ⇒ the final rows go to the deltastore instead of the columnstore
      • bulk insert operations provide the number of rows in the batch as part of the API call [5]
        • best results are achieved by choosing a batch size that is divisible by rowgroup’s limit [5]
          • ⇐ guarantees that every batch produces one or several fully populated row groups [5]
            • ⇒ reduce the total number of row groups in a table [5]
            • ⇒ improves query performance
          • ⇐ the batch size shouldn’t exceed rowgroup’s limit [5]
            • row groups can still be created on the fly, in a manner similar to a bulk insert, when the size of the insert batch is close to or exceeds the rowgroup limit [5]
  • {operation} [non-bulk operations] trickle inserts go directly to a delta store
  • {feature} parallel inserts
    • [SQL Server 2016] requires following conditions for parallel insert on CCI [6]
      • must specify TABLOCK
      • no nonclustered indexes on the clustered columnstore index
      • no identity column
      • database compatibility level is set to 130
  • {recommendation} minimize the use of string columns in fact tables [5]
    • string data use more space
    • their encoding involves additional overhead during batch mode execution [5]
    • queries with predicates on string columns may have less efficient execution plans that also require significantly larger memory grants as compared to their non-string counterparts [5]
  • {limitation} [SQL Server 2012|2014] string predicates are not pushed down toward the lowest operators in execution plans
  • {recommendation} add another dimension table and replace the string value in the fact table with a synthetic, integer-based ID key that references the new table [5]
  • {operation} upgrading to SQL Server 2016
    • make sure that queries against the tables with columnstore indexes can utilize parallelism if the database compatibility level is less than 130 [5]
  • {feature} [SQL Server 2019] automated columnstore index maintenance [8]
  • {improvement} [SQL Server 2019] better columnstore metadata memory management
  • {improvement} [SQL Server 2019] low-memory load path for columnstore tables
  • {improvement} [SQL Server 2019] improved performance for bulk loading to columnstore indexes
  • {improvement} [SQL Server 2019] server startup process has been made faster for databases that use in-memory columnstore tables for HTAP
  • {feature} DMVs
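
A minimal sketch illustrating some of the operations and DMVs mentioned above (table, column and index names are illustrative):

-- creating a clustered columnstore index (includes all table columns by definition)
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
ON dbo.FactSales;

-- creating an updatable nonclustered columnstore index on a rowstore table
-- with a compression delay (the value is illustrative)
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SalesOrder
ON dbo.SalesOrder (OrderDate, ProductID, Quantity, Amount)
WITH (COMPRESSION_DELAY = 150 MINUTES);

-- forcing closed (and open) delta rowgroups into compressed rowgroups
ALTER INDEX CCI_FactSales ON dbo.FactSales
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

-- rebuilding the index (removes deleted rows, merges delta stores and rowgroups)
ALTER INDEX CCI_FactSales ON dbo.FactSales
REBUILD;

-- inspecting rowgroup health via DMVs
SELECT object_name(object_id) table_name
, row_group_id
, state_desc
, total_rows
, deleted_rows
, trim_reason_desc
FROM sys.dm_db_column_store_row_group_physical_stats
ORDER BY table_name, row_group_id;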

References:
[1] SQL Docs (2020) Columnstore indexes: Overview [link]
[2] Microsoft Learn (2024) SQL: What's new in columnstore indexes  [link]
[3] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)
[4] SQL Docs (2019) Columnstore indexes - Query performance [link]
[5] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.
[6] Microsoft Learn (2016) Columnstore Index: Parallel load into clustered columnstore index from staging table [link]
[7] Microsoft Learn (2016) Columnstore Index Defragmentation using REORGANIZE Command [link]
[8] Microsoft (2018) Microsoft SQL Server 2019: Technical white paper [link]

Acronyms:
CCI - clustered columnstore index
CI - columnstore index
DBCC - Database Console Commands
DMV - Dynamic Management View
ETL - Extract, Transform, Load
HTAP - Hybrid Transactional/Analytical Processing 
LOB - Large Object
NCCI - nonclustered columnstore index
OLTP - On-Line Transaction Processing
SP - Service Pack

20 October 2023

💎🏭SQL Reloaded: Extended LTrim/RTrim in SQL Server 2022 (Before and After)

In SQL Server 2022, the behavior of the LTrim (left trimming) and RTrim (right trimming) functions was extended with one more string parameter. When provided, the engine checks whether the first parameter starts (for LTrim), respectively ends (for RTrim), with the given value and removes it, the same way the space character char(32) was removed previously:

-- prior behavior of LTrim/RTrim
DECLARE @text as nvarchar(50) = '  123  '
SELECT '(' + LTrim(@text) + ')' LeftTrimming
, '(' + RTrim(@text) + ')' RightTrimming
, '(' + Ltrim(RTrim(@text)) + ')' Trimming1 -- prior to SQL Server 2017
, '(' + Trim(@text) + ')' Trimming2 -- starting with SQL Server 2017
LeftTrimming RightTrimming Trimming1 Trimming2
(123  )      (  123)       (123)     (123)

Here's the new behavior:

-- extended behavior of LTrim/RTrim (SQL Server 2022+)
DECLARE @text as nvarchar(50) = '123abc123abc'
SELECT LTrim(@text , '123') LeftTrimming
, RTrim(@text , 'abc') RightTrimming;
LeftTrimming RightTrimming
abc123abc 123abc123
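
SQL Server 2022 also extended the TRIM function itself with the LEADING/TRAILING/BOTH keywords (like the new LTrim/RTrim parameter, this seems to require database compatibility level 160), which can produce the same results:

-- equivalent trimming via TRIM with LEADING/TRAILING (SQL Server 2022+, compatibility level 160)
DECLARE @text as nvarchar(50) = '123abc123abc'
SELECT TRIM(LEADING '123' FROM @text) LeftTrimming
, TRIM(TRAILING 'abc' FROM @text) RightTrimming;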

 Previously, to obtain the same result one could write something like:

-- prior solution via Left/Right for the same (SQL Server 2000+)
DECLARE @text as nvarchar(50) = '123abc123abc'
SELECT CASE WHEN Left(@text, 3) = '123' THEN Right(@text,Len(@text)-3) ELSE @text END LeftTrimming
, CASE WHEN Right(@text, 3) = 'abc' THEN Left(@text,Len(@text)-3) ELSE @text END  RightTrimming

-- prior solution via "LIKE" for the same (SQL Server 2000+)
DECLARE @text as nvarchar(50) = '123abc123abc'
SELECT CASE WHEN @text LIKE '123%' THEN Right(@text,Len(@text)-3) ELSE @text END LeftTrimming
, CASE WHEN @text LIKE '%abc' THEN Left(@text,Len(@text)-3) ELSE @text END  RightTrimming

As can be seen, the syntax is considerably simplified. However, there are only a few situations in which this is needed. In the past I had to write code to remove parentheses, quotes or similar characters:

-- removing parentheses
DECLARE @text as nvarchar(50) = '(testing)'
SELECT LTrim(@text , '(') LeftTrimming
, RTrim(@text , ')') RightTrimming
, RTrim(LTrim(Trim(@text), '('), ')') Trimming 

-- removing double quotes
DECLARE @text as nvarchar(50) = '"testing"'
SELECT LTrim(@text , '"') LeftTrimming
, RTrim(@text , '"') RightTrimming
, RTrim(LTrim(Trim(@text), '"'), '"') Trimming 

The Trim for the 3rd value in both queries was used to remove any spaces surrounding the character to be removed:

-- removing parentheses with leading/trailing spaces
SELECT RTrim(LTrim(Trim('   (testing)   '), '('), ')');

Then I thought I could use the same approach to remove the tags from an XML element. I tried the following code and unfortunately it doesn't work as hoped:

-- attempting to remove the start/end tags from xml elements
DECLARE @text as nvarchar(50) = '<string>testing</string>'
SELECT LTrim(@text , '<string>') LeftTrimming
, RTrim(@text , '</string>') RightTrimming
, RTrim(LTrim(Trim(@text), '<string>'), '</string>') Trimming
LeftTrimming    RightTrimming Trimming
esting</string> <string>te    e

That's quite an unpleasant surprise! The reason is that the second argument is treated as a set of characters to remove, not as a whole string, so all leading, respectively trailing, characters that appear anywhere in '<string>', respectively '</string>', get trimmed away. Instead, the value can be declared as XML and the following code used to obtain the needed result:
-- extracting the value from a tag element
DECLARE @text XML = '<string>testing</string>'
SELECT @text.query('data(/string)') as value
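
For completeness, the character-set behavior can be verified by shuffling the characters of the second argument: the trimmed result stays the same.

-- the second argument acts as a set of characters, not as a substring
DECLARE @text as nvarchar(50) = '<string>testing</string>'
SELECT LTrim(@text , '<string>') LeftTrimming1
, LTrim(@text , 'g>nirts<') LeftTrimming2 -- same characters, different order, same result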

Notes:
The queries also work in SQL databases in Microsoft Fabric.

Happy coding!

19 November 2022

💎🏭SQL Reloaded: Tricks with Strings via STRING_SPLIT, PATINDEX and TRANSLATE

Searching for a list of words within a column can be easily achieved by using the LIKE operator:

-- searching for several words via LIKE (SQL Server 2000+)
SELECT * 
FROM Production.Product 
WHERE Name LIKE '%chain%'
   OR Name LIKE '%lock%'
   OR Name LIKE '%rim%'
   OR Name LIKE '%spindle%'

The search is quite efficient if an index is defined on the column, though because of the leading wildcard a clustered index scan is more likely to be chosen than a seek.

If the list of strings to search for becomes bigger, the query becomes at least more difficult to maintain. Using regular expressions could be a solution. Unfortunately, SQL Server has its limitations in working with patterns. For example, it doesn't have a REGEXP_LIKE function, which in Oracle is used something like this (not tested):

-- Oracle 
SELECT * 
FROM Production.Product 
WHERE REGEXP_LIKE(lower(Name), 'chain|lock|rim|spindle')

However, there's a PATINDEX function which returns the position of a pattern within a string, and which uses the same wildcards that can be used with the LIKE operator:

-- searching for a value via PATINDEX (SQL Server 2000+)
SELECT * 
FROM [Production].[Product] 
WHERE PATINDEX('%rim%', Name)>0

Even if only one value can be provided together with the Name at a time, retrieving the values from a table or a table-valued function (TVF) does the trick. If the values need to be reused in several places, they can be stored in a table or view. If needed only once, a common table expression is more appropriate:

-- filtering for several words via PATINDEX (SQL Server 2008+)
WITH CTE 
AS (
  -- table from list of values (SQL Server 2008+)
  SELECT * FROM (VALUES ('chain') , ('lock') , ('rim') , ('spindle')) DAT(words) 
) 
SELECT * 
FROM Production.Product PRD 
WHERE EXISTS ( 
	SELECT * 
	FROM CTE 
	WHERE PATINDEX('%'+ CTE.words +'%', PRD.Name)>0 
	)

The query should return the same records as the first query above.

Besides one's own UDFs (see SplitListWithIndex or SplitList), starting with SQL Server 2016 the STRING_SPLIT function can be used as a TVF that returns the same values:

-- filtering for several words via PATINDEX & STRING_SPLIT (SQL Server 2017+)
SELECT * 
FROM Production.Product PRD
WHERE EXISTS (
	SELECT *
	FROM STRING_SPLIT('chain|lock|rim|spindle', '|') SPL
	WHERE PATINDEX('%'+ SPL.value +'%', PRD.Name)>0
	)
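
If the same list is needed in several queries, it can also be provided via a variable, a small variation on the query above:

-- filtering for several words via PATINDEX & STRING_SPLIT with the list in a variable (SQL Server 2017+)
DECLARE @words nvarchar(250) = 'chain|lock|rim|spindle'
SELECT * 
FROM Production.Product PRD
WHERE EXISTS (
	SELECT *
	FROM STRING_SPLIT(@words, '|') SPL
	WHERE PATINDEX('%'+ SPL.value +'%', PRD.Name)>0
	)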

A dynamic list of values can be built as well. For example, the list of words appearing in a column can be obtained from the table itself via the STRING_SPLIT function:

-- listing the words appearing in a column (SQL Server 2017+)
SELECT DISTINCT SPL.value
FROM Production.Product PRD
     CROSS APPLY STRING_SPLIT(Name, ' ') SPL
ORDER BY SPL.value

One can remove the special characters, the numeric values, as well as the one- and two-letter words:

-- listing the words appearing in a column (SQL Server 2017+)
SELECT DISTINCT SPL.value
FROM Production.Product PRD
     CROSS APPLY STRING_SPLIT(Replace(Replace(Replace(Replace(Name, '-', ' '), ',', ' '), '/', ' '), '''', ' '), ' ') SPL
WHERE IsNumeric(SPL.value) = 0 -- removing numbers
  AND Len(SPL.value)>2 -- removing single/double letters
ORDER BY SPL.value

The output looks better, though the more complex the text, the more replacements need to be made. An alternative to a UDF (see ReplaceSpecialChars) is the TRANSLATE function, which replaces each character from a list with the corresponding character from a second list. One needs to be careful to keep a 1:1 mapping between the two lists, the REPLICATE function doing the trick:

-- replacing special characters via TRANSLATE (SQL Server 2017+)
SELECT TRANSLATE(Name, '-,/''', Replicate(' ', 4))
FROM Production.Product PRD

Now the query becomes:

-- listing the words appearing in a column using TRANSLATE (SQL Server 2017+)
SELECT DISTINCT SPL.value
FROM Production.Product PRD
     CROSS APPLY STRING_SPLIT(TRANSLATE(Name, '-,/''', Replicate(' ', 4)), ' ') SPL
WHERE IsNumeric(SPL.value) = 0 -- removing numbers
  AND Len(SPL.value)>2 -- removing single/double letters
ORDER BY SPL.value

Notes:
1) The SQL Server-based queries also work in SQL databases in Microsoft Fabric. Just replace the Production schema with the SalesLT schema (see the post, respectively the GitHub repository with the changed code).

Happy coding!

30 October 2018

💠🛠️SQL Server: Administration (Troubleshooting Login Failed for User)

    Since the installation of SQL Server 2017 on a virtual machine (VM) in the Microsoft Cloud, records with the following message started to appear in the error log:

Login failed for user '<domain>\<computer>$'. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>]
Error: 18456, Severity: 14, State: 5.


   From the text it seemed like a permission problem, which was confirmed by the documentation (see [1]): the Error Number and State correspond to a "User Id is not valid" situation. In a first step I attempted to give permissions to the local account (dollar sign included). The account wasn't found in the Active Directory (AD), though by typing the account directly in the "Login name" field I managed to temporarily give the account sysadmin permissions. The error continued to appear in the error log. I then looked at the accounts under which the SQL Server services run - nothing suspect in there.

   Apart from the error message, which was appearing with an alarming frequency (a few seconds apart), everything seemed to be working on the server. The volume of records (a few hundred thousand over a few days) bloating the error log, as well as the fact that I didn't know what was going on, made me take the time to investigate the issue further.
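
To gauge the extent of the problem, the error log can also be searched directly. The xp_readerrorlog procedure is undocumented but commonly used for this; the parameters below are the log number (0 = current), the log type (1 = SQL Server error log) and two search strings:

-- searching the current SQL Server error log for the failed logins
EXEC xp_readerrorlog 0, 1, N'Login failed', N'18456';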

  Looking today at the Windows Application log, I observed that the error is caused by an account used for the Microsoft SQL Server IaaS Agent and IaaS Query Service. Once I gave permissions to the account, the error disappeared.
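
For reference, a minimal sketch of how such a login can be created and granted server-level permissions (the account name is a placeholder; in practice grant only the permissions actually required):

-- creating a login for the machine/service account and granting it a server role (adapt account and role)
CREATE LOGIN [<domain>\<computer>$] FROM WINDOWS;
ALTER SERVER ROLE [sysadmin] ADD MEMBER [<domain>\<computer>$];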

   The search for a best practice on what permissions to give to the IaaS Agent and IaaS Query Service led me to [2]. To quote, the “Agent Service needs Local System rights to be able to install and configure SQL Server, attach disks and enable storage pool and manage automated security patching of Windows and SQL server”, while the “IaaS Query Service is started with an NT Service account which is a Sys Admin on the SQL Server”. In fact, this was the only resource I found that made a reference to the IaaS Query Service.

   This was just one of the many scenarios in which the above error appears. For more information see for example  [3], [4] or [5].

References:
[1] Microsoft (2017) MSSQLSERVER_18456 [Online] Available from: https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-18456-database-engine-error?view=sql-server-2017
[2] SQL Database Engine Blog (2018) SQL Server IaaS Extension Query Service for SQL Server on Azure VM, by Mine Tokus Altug [Online] Available from:  https://blogs.msdn.microsoft.com/sqlserverstorageengine/2018/10/25/sql-server-iaas-extension-query-service-for-sql-server-on-azure-vm/
[3] Microsoft Support (2018) "Login failed for user" error message when you log on to SQL Server [Online] Available from: https://support.microsoft.com/en-sg/help/555332/login-failed-for-user-error-message-when-you-log-on-to-sql-server
[4] Microsoft Technet (2018) How to Troubleshoot Connecting to the SQL Server Database Engine [Online] Available from: https://social.technet.microsoft.com/wiki/contents/articles/2102.how-to-troubleshoot-connecting-to-the-sql-server-database-engine.aspx 
[5] Microsoft Blogs (2011)Troubleshoot Connectivity/Login failures (18456 State x) with SQL Server, by Sakthivel Chidambaram [Online] Available from: https://blogs.msdn.microsoft.com/sqlsakthi/2011/02/06/troubleshoot-connectivitylogin-failures-18456-state-x-with-sql-server/

26 October 2018

💎🏭SQL Reloaded: Trimming Strings (Before and After)

One of the annoying things when writing queries is the repetitive lengthy expressions that obfuscate the queries, making them more difficult to read, understand and troubleshoot; sometimes such expressions come with a performance penalty as well.

Loading data from Excel, text files and other sources involving poorly formatted data often requires trimming (all) the text values. In the early versions of SQL Server, the equivalent of a Trim function was obtained by combining the LTrim and RTrim functions. This resulted in writing code like this (based on the AdventureWorks 2014 database):

-- trimming via LTrim, RTrim 
SELECT LTrim(RTrim(AddressLine1)) AddressLine1
, LTrim(RTrim(AddressLine2)) AddressLine2
, LTrim(RTrim(City)) City
, LTrim(RTrim(PostalCode)) PostalCode
FROM Person.Address

This might not look like much, though imagine having to deal with 30-50 text attributes, code that is not written in a readable format (e.g. the way it's stored in the database), and attributes that require further processing (e.g. removal of special characters, splitting, concatenating).
 
 Often developers preferred encapsulating the call to the two functions within a user-defined function:

-- Trim user-defined function
CREATE FUNCTION dbo.Trim(
@string nvarchar(max))
RETURNS nvarchar(max)
BEGIN
    RETURN LTrim(RTrim(@string))
END

With it the code is somewhat simplified, though not by much, and it incurs the cost of calling a user-defined function:

-- trimming via dbo.Trim
SELECT dbo.Trim(AddressLine1) AddressLine1
, dbo.Trim(AddressLine2) AddressLine2
, dbo.Trim(City) City
, dbo.Trim(PostalCode) PostalCode
FROM Person.Address

SQL Server 2017 introduced the Trim function, which not only replaces the combined use of the LTrim and RTrim functions, but also allows removing other specified characters (including CR, LF, Tab) from the start and end of a string.

By default the function removes the space from both sides of a string:

-- trimming via Trim
SELECT Trim(AddressLine1) AddressLine1
, Trim(AddressLine2) AddressLine2
, Trim(City) City
, Trim(PostalCode) PostalCode
FROM Person.Address

When a set of characters is provided the function removes the specified characters:

SELECT Trim ('#' FROM '# 843984') Example1
, Trim ('[]' FROM '[843984]') Example2
, Trim ('+' FROM '+49127298000') Example3
, Trim ('+-' FROM '+ 49-12729-8000 ') + ';' Example4
, Trim ('+ ' FROM '+ 49-12729-8000 ') + ';' Example5
, ASCII(Left(Trim (char(13) FROM char(13) + '49127298000'), 1)) Example6

Output:
Example1  Example2  Example3     Example4         Example5        Example6
--------  --------  -----------  ---------------  --------------  --------
 843984   843984    49127298000   49-12729-8000 ;  49-12729-8000;  52

As can be seen, when other characters need to be removed together with the space, the space must be included in the list of characters as well.

Notes:
1) The dbo.Trim function can be created in SQL Server 2017 environments as well.
2) The collation of the database affects the behavior of the Trim function, therefore the results might look different when a case-sensitive collation is used (see the example below).
3) The queries also work in SQL databases in Microsoft Fabric.
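
To illustrate note 2), a small sketch that forces the collation explicitly on the input string (the collation names are just examples; the exact output depends on the collations used):

-- Trim behavior with a case-sensitive vs a case-insensitive collation
SELECT Trim('a' FROM 'abcBA' COLLATE Latin1_General_CS_AS) CaseSensitive -- the trailing 'A' is kept
, Trim('a' FROM 'abcBA' COLLATE Latin1_General_CI_AS) CaseInsensitive    -- the trailing 'A' is removed as well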

Happy coding!

18 June 2017

💠🛠️SQL Server: Administration (Database Recovery on SQL Server 2017)

I installed SQL Server 2017 CTP 2.1 today on my Lab PC without any apparent problems. It was time to recreate some of the databases I use for testing. As I previously had an evaluation version of SQL Server 2016, it expired without my having a backup for one of the databases. I could have recreated the database from scripts and reloaded the data from various text files. This would have been a relatively laborious task (estimated time > 1 hour), though the chances were pretty high that everything would go smoothly. As the database is relatively small (about 2 GB) and the possible data loss was negligible, I thought it should be possible to recover the data from the database with minimal loss in less than half an hour. I knew this was possible, as I had been forced a few times in the past to recover data from damaged databases in SQL Server 2005, 2008 and 2012 environments, though being in a new environment I wasn't sure how smoothly it would go and how long it would take.  

Plan A - Create the database with the ATTACH_REBUILD_LOG option:

As the option seems to be available in SQL Server 2017, I attempted to create the database via the following script:
 
CREATE DATABASE <database_name> ON 
(FILENAME='I:\Data\<database_name>.mdf') 
FOR ATTACH_REBUILD_LOG 

And as expected I ran into the first error:
Msg 5120, Level 16, State 101, Line 1 Unable to open the physical file "I:\Data\.mdf". Operating system error 5: "5(Access is denied.)".
Msg 1802, Level 16, State 7, Line 1 CREATE DATABASE failed. Some file names listed could not be created. Check related errors.

It looked like a permissions problem, though I wasn't entirely sure which account was causing it. In the past I had problems with the Administrator account, so it was the first thing to try. Once I removed the permissions of the Administrator account on the folder containing the database and granted it full control again, I tried to create the database anew using the above script, running into the next error:

File activation failure. The physical file name "D:\Logs\_log.ldf" may be incorrect. The log cannot be rebuilt because there were open transactions/users when the database was shutdown, no checkpoint occurred to the database, or the database was read-only. This error could occur if the transaction log file was manually deleted or lost due to a hardware or environment failure.
Msg 1813, Level 16, State 2, Line 1 Could not open new database ''. CREATE DATABASE is aborted.
This approach seemed to lead nowhere, so it was time for Plan B.

Plan B - Recover the database into an empty database with the same name:

Step 1: Create a new database with the same name, stop SQL Server, copy the old data file over the new file, and delete the new log file manually. Then restart the server. After the restart the database will appear in Management Studio in the SUSPECT state.

Step 2:
Set the database in EMERGENCY mode:

ALTER DATABASE <database_name> SET EMERGENCY, SINGLE_USER

Step 3:
Rebuild the log file:

ALTER DATABASE <database_name> 
REBUILD LOG ON (Name='<database_name>_Log', 
FileName='D:\Logs\<database_name>_log.ldf')

The rebuild worked without problems.

Step 4: Set the database in MULTI_USER mode:

ALTER DATABASE <database_name> SET MULTI_USER 

Step 5:
Perform a consistency check:

DBCC CHECKDB (<database_name>) WITH ALL_ERRORMSGS, NO_INFOMSGS 

After 15 minutes of work the database was back online.

Warnings:
Always attempt to recover the data for production databases from the backup files! Use the above steps only if there is no other alternative!
The consistency check might return errors. In this case one might need to run CHECKDB with REPAIR_ALLOW_DATA_LOSS several times [2], until the database is repaired (see the sketch below).
After recovery there can be problems with user access. It might be necessary to delete the users from the recovered database and reassign their permissions!  
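
A minimal sketch for the repair scenario mentioned above (a last-resort option; the database name is a placeholder and, as the option's name implies, data loss is possible):

-- last-resort repair (can lose data); run the consistency check again afterwards
ALTER DATABASE <database_name> SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB (<database_name>, REPAIR_ALLOW_DATA_LOSS) WITH ALL_ERRORMSGS;
ALTER DATABASE <database_name> SET MULTI_USER;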

Resources:
[1] In Recovery (2008) Creating, detaching, re-attaching, and fixing a SUSPECT database, by Paul S Randal [Online] Available from: https://www.sqlskills.com/blogs/paul/creating-detaching-re-attaching-and-fixing-a-suspect-database/ 
[2] In Recovery (2009) Misconceptions around database repair, by Paul S Randal [Online] Available from: https://www.sqlskills.com/blogs/paul/misconceptions-around-database-repair/
[3] Microsoft Blogs (2013) Recovering from Log File Corruption, by Glen Small [Online] Available from: https://blogs.msdn.microsoft.com/glsmall/2013/11/14/recovering-from-log-file-corruption/
