
21 January 2024

SQL Server: Clustered & Non-clustered Indexes (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources. It considers only on-premises SQL Server; for other platforms please refer to the documentation.
Columnstore indexes will be considered separately.

Last updated: 22-Jan-2024

Indexes

  • {definition} a database object associated with a table or view that provides an efficient access path to data in the base object, based on one or more columns
    • contains keys built from one or more columns stored physically as B-trees
      • ⇒ provides the path that is necessary to get to the data in the quickest and most efficient manner possible [7]
      • ⇒ can dramatically speed up the process of finding data as well as the process of joining tables together [2]
    • on-disk structure
      • index placement 
        • placed either
          • on a particular filegroup
          • or partitioned according to a predefined partition scheme
        • {default} placed on the same filegroup as the base object [5]
    • must be maintained and updated as data are added or changed in the base object
      • each index brings with it a certain amount of overhead
        • ⇐ the updates, inserts, and deletes are slower
  • {type} clustered index (CI)
    • based on a clustering key
      • tells SQL Server how to order the table's data pages [11]
      • static columns (aka unchanging columns, non-volatile columns)
        • the clustering key is never updated [11]
      • nonstatic columns 
        • when the clustering key value is updated, the data may need to be moved elsewhere (aka record relocation) in the clustered index so that the clustering order is maintained [11]
          • triggers a nonclustered index update as well 
            • so that the value of the clustering key is correct for the relevant nonclustered index rows [5]
              • wastes time and space
              • causes page splits and fragmentation 
              • adds unnecessary overhead to every modification of the column(s) that make up the clustering key [5]
          • updating the clustering key is clearly more expensive than updating a non-key column [11]
      • can be either
        • a single column key
        • a composite key
          • [SQL Server 2005] up to 16 columns
          • [SQL Server 2016] up to 32 columns can be combined into a single composite index key
          • all the columns in a composite index key must be in the same table or view [8]
          • the clustering key is the most redundant data within the entire table, since it is also stored in every nonclustered index
    • {restriction} a table can have only one clustered index [5]
    • {restriction} maximum allowable size of the combined index values is 900 bytes 
    • {restriction} LOB data types can’t be specified as columns
    • {subtype} unique clustered index (UCI)
      • created with the UNIQUE keyword
    • {subtype} non-unique clustered indexes
      • created without the UNIQUE keyword
      • SQL Server does not require a clustered index to be unique
        • ⇒ must have some means of uniquely identifying every row
        • forces the index to be unique by appending a 4-byte value (aka uniqueifier) to key values as necessary to differentiate identical key values from one another  [2]
          • stored in every level of the B-tree [11]
            • in both clustered and non-clustered indexes
          • stored as a variable-length column [11]
            • if a table does not already contain any other variable-length columns, each duplicate value actually consumes 8 bytes of overhead:
              • 4 bytes for the uniqueifier value 
              • 4 bytes to manage variable-length columns on the row [11]
              • the overhead for rows with a NULL uniqueifier value is zero bytes [11]
          • added everywhere the clustering key is stored [11]
          • if there are many rows using the same clustering key value, the operation can become quite expensive [11]
          • is not readily visible and cannot be queried [11]
          • it is impossible to estimate how much additional storage overhead will result from the addition of a uniqueifier, without first having a thorough understanding of the data being stored [11]
          • meaningless operation in the context of the data [11]
          • internally, all clustered index keys must be unique so that nonclustered index entries can point to exactly one specific row [5]
  • {type} nonclustered index (NCI)
    • the index is a completely separate structure from the data itself [5]
      • ⇒ its presence or absence doesn’t affect how the data pages are organized [5]
      • heap or a table with a clustered index affects the structure of nonclustered indexes [5]
      • best at singleton selects
        • ⇐ queries that return a single row
        • once the nonclustered B-tree is navigated, the actual data can be accessed with just one page I/O, that is, the read of the page from the underlying table
    • each table can have up to 999 nonclustered indexes [8]
    • {subtype} nonclustered indexes with included columns
      • uses the INCLUDE keyword along with a list of one or more columns to indicate the columns to be saved in the leaf level of an index [1]
        • the data for these columns are duplicated in index leaf-level pages
          • ⇒ increases the disk space requirement [1]
          • ⇒ fewer index rows can fit on index pages [1]
            • ⇒ might increase the I/O and decrease the database cache efficiency [1]
          • ⇒ data is stored and updated twice   [1]
            • ⇒ index maintenance may increase in terms of the time that it takes  [1]
      • up to 1,023 included columns can be specified [1]
      • the included columns are not part of the index key
        • ⇒ keeps the key size small and efficient [1]
      • having the columns available in the index leaf pages means the base table doesn’t need to be queried if the included columns can satisfy the query [1]
        • ⇒ increases the chances that the query will find all it needs in the index pages itself, without any further lookups [1]
      • provides the benefits of a covering index  [1]
      • results in smaller, efficient, and narrow keys [1]
      • removes the 900 bytes/16 columns key size restriction [1]
      • {restriction} columns of data type text, ntext, and image are not allowed as non-key included columns [1]
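The clustered and nonclustered index notes above can be illustrated with a minimal T-SQL sketch. The table and index names (dbo.DemoOrders and so on) are hypothetical and chosen only for illustration; the statements are one possible shape, not a recommended design.

```sql
-- Hypothetical demo table; all object names are illustrative assumptions.
CREATE TABLE dbo.DemoOrders (
    OrderId     int           NOT NULL,
    CustomerId  int           NOT NULL,
    OrderDate   date          NOT NULL,
    TotalAmount decimal(12,2) NOT NULL
);

-- Unique clustered index: the table's rows become the leaf level of this index.
CREATE UNIQUE CLUSTERED INDEX CIX_DemoOrders_OrderId
    ON dbo.DemoOrders (OrderId);

-- Nonclustered index with included columns: CustomerId is the only key column;
-- OrderDate and TotalAmount are stored at the leaf level only.
CREATE NONCLUSTERED INDEX IX_DemoOrders_CustomerId
    ON dbo.DemoOrders (CustomerId)
    INCLUDE (OrderDate, TotalAmount);

-- Every column referenced here is either a key or an included column,
-- so the nonclustered index alone can satisfy the query.
SELECT CustomerId, OrderDate, TotalAmount
FROM dbo.DemoOrders
WHERE CustomerId = 42;
```

Keeping OrderDate and TotalAmount out of the key keeps the key narrow while still allowing the query to avoid a lookup into the base table.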
  • {option} FILLFACTOR
    • its value determines the percentage of space on each leaf-level page to be filled with data, reserving the remainder on each page as free space for future growth [16]
    • it's applied when the index is first created but is not enforced afterward
      • ⇒ isn't maintained over time
    • affects only the leaf-level pages in the index 
      • normally reserves enough empty space on intermediate index pages to store at least one row of the index's maximum size [2]
    • the empty space is reserved between the index rows rather than at the end of the index [16]
    • its value is a percentage from 1 to 100
      • {default} 0, which is equivalent to 100 and means that the leaf-level pages are filled to capacity [16]
    • provided for fine-tuning index data storage and performance [16]
    • affects performance 
      • creating an index with a relatively low fillfactor 
        • helps avoid page splits during inserts
          • with pages only partially full, the potential for needing to split one of them in order to insert new rows is lower than it would be with completely full pages
          • can be good for performance if the new data is evenly distributed throughout the table [16]
            • if all the data is added to the end of the table, the empty space in the index pages will not be filled [16]
        • the index will require more storage space and can decrease read performance [16]
          • ⇒ can decrease database read performance by an amount inversely proportional to the fill-factor setting [16]
      • a high fillfactor can help compact pages so that less I/O is required to service a query
        • common technique with data warehouses
        • retrieving pages that are only partially full wastes I/O bandwidth
  • {option} PAD_INDEX
    • instructs SQL Server to apply the FILLFACTOR to the intermediate-level pages of the index [2]
    • if the FILLFACTOR setting is so high that there isn't room on the intermediate pages for even a single row (e.g., a FILLFACTOR of 100%), the percentage is overridden so that at least one row fits [2]
    • if the FILLFACTOR setting is so low that the intermediate pages cannot store at least two rows, SQL Server will override the fillfactor percentage on the intermediate pages so that at least two rows fit on each page [2]
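As a rough sketch of how FILLFACTOR and PAD_INDEX are specified together, the following uses a hypothetical table (dbo.DemoCustomers) and an arbitrary fill factor of 80; the value is an assumption for illustration, not a tuning recommendation.

```sql
-- Hypothetical demo table; all object names are illustrative assumptions.
CREATE TABLE dbo.DemoCustomers (
    CustomerId int          NOT NULL,
    LastName   nvarchar(50) NOT NULL,
    City       nvarchar(50) NOT NULL
);

-- Leave roughly 20% free space on each leaf-level page when the index is built,
-- and apply the same fill factor to the intermediate-level pages (PAD_INDEX).
CREATE NONCLUSTERED INDEX IX_DemoCustomers_LastName
    ON dbo.DemoCustomers (LastName)
    WITH (FILLFACTOR = 80, PAD_INDEX = ON);

-- The settings are kept in the index metadata
-- (fill_factor = 0 denotes the default, i.e. full pages).
SELECT name, fill_factor, is_padded
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.DemoCustomers');
```

Because the fill factor is applied only when the index is created or rebuilt, the free space is consumed over time by later inserts and updates.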
  • {option} IGNORE_DUP_KEY
    • {enabled} when the option is specified for a UNIQUE index, a duplicate key error on a multiple-row INSERT won’t cause the entire statement to be rolled back [5]
      • ⇒ the nonunique row is discarded, and all other rows are inserted [5]
        • ⇒ it makes a violation in a multiple-row data modification nonfatal to all the nonviolating rows [5]
    • {disabled} during changes even if one row is found that would cause duplicates of keys defined as unique, the entire statement is aborted and no rows are affected [5]
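The IGNORE_DUP_KEY behaviour can be tried out with a small sketch like the one below. The table dbo.DemoCodes and its values are hypothetical; the multi-row VALUES syntax assumes SQL Server 2008 or later.

```sql
-- Hypothetical demo table; all object names are illustrative assumptions.
CREATE TABLE dbo.DemoCodes (
    Code char(3) NOT NULL
);

-- Unique index that discards duplicate rows instead of failing the statement.
CREATE UNIQUE NONCLUSTERED INDEX UX_DemoCodes_Code
    ON dbo.DemoCodes (Code)
    WITH (IGNORE_DUP_KEY = ON);

-- Multi-row insert containing a duplicate key: the duplicate row is discarded
-- with a warning, and the non-violating rows are still inserted.
INSERT INTO dbo.DemoCodes (Code)
VALUES ('AAA'), ('BBB'), ('AAA');

-- Inspect which rows were kept.
SELECT Code
FROM dbo.DemoCodes;
```

With the option left at its default (OFF), the same INSERT would fail as a whole and no rows would be added.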
  • {option} STATISTICS_NORECOMPUTE 
    • determines whether the statistics on the index should be updated automatically [5]
    • automatic updating is typically left enabled, i.e. the option defaults to OFF
    • used to set a specific statistic or index to not update automatically [5]
    • overrides an ON value for the AUTO_UPDATE_STATISTICS database option [5]
    • if the database option is set to OFF, that behavior for a particular index can’t be overridden 
      • all statistics in the database must be updated manually
        • via UPDATE STATISTICS or sp_updatestats [5]
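A minimal sketch of the option in use, assuming a hypothetical table dbo.DemoProducts. Note that the manual refresh below repeats NORECOMPUTE, because running UPDATE STATISTICS without it re-enables automatic updates for that statistic.

```sql
-- Hypothetical demo table; all object names are illustrative assumptions.
CREATE TABLE dbo.DemoProducts (
    ProductId int          NOT NULL,
    Category  nvarchar(30) NOT NULL
);

-- Opt the statistics of this index out of automatic updates.
CREATE NONCLUSTERED INDEX IX_DemoProducts_Category
    ON dbo.DemoProducts (Category)
    WITH (STATISTICS_NORECOMPUTE = ON);

-- Refresh the statistics manually; NORECOMPUTE keeps automatic updates
-- disabled for this statistic after the refresh.
UPDATE STATISTICS dbo.DemoProducts IX_DemoProducts_Category
    WITH FULLSCAN, NORECOMPUTE;

-- Check the per-statistic flag ...
SELECT name, no_recompute
FROM sys.stats
WHERE object_id = OBJECT_ID('dbo.DemoProducts');

-- ... and the database-level AUTO_UPDATE_STATISTICS setting.
SELECT name, is_auto_update_stats_on
FROM sys.databases
WHERE name = DB_NAME();
```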
  • {option} MAXDOP
    • controls the maximum number of processors that can be used for index creation [5]
    • can override the [max degree of parallelism] server configuration option for index building [5]
    • allowing multiple processors to be used for index creation can greatly enhance the performance of index build operations [5]
      • each processor builds an equal-sized chunk of the index in parallel
        • each parallel thread builds a separate tree
          • the math used to determine the theoretical minimum number of required pages varies from the actual number [5]
          • the tree might not be perfectly balanced
          • after each thread is completed, the trees are essentially concatenated [5]
            • extra page space reserved during this parallel process can be used for later modifications [5]
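The MAXDOP index option might be used as in the sketch below. The table dbo.DemoSales and the value 4 are assumptions for illustration; whether the build actually runs in parallel depends on the edition and on the available processors.

```sql
-- Hypothetical demo table; all object names are illustrative assumptions.
CREATE TABLE dbo.DemoSales (
    SaleId   int  NOT NULL,
    SaleDate date NOT NULL
);

-- Cap this index build at 4 processors, independently of the
-- server-wide 'max degree of parallelism' setting.
CREATE NONCLUSTERED INDEX IX_DemoSales_SaleDate
    ON dbo.DemoSales (SaleDate)
    WITH (MAXDOP = 4);

-- Show the server-wide setting for comparison.
SELECT name, value_in_use
FROM sys.configurations
WHERE name = 'max degree of parallelism';
```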
  • {operation} index tuning 
    • iterative process 
    • {operation} index consolidation
      • typically performed when indexes overlap 
        • includes duplicated indexes
      • {caution} consider the various aspects when modifying indexes
      • the number and design of indexes impact performance
        • ⇐ the goal is to obtain the best performance with the smallest number of indexes
  • {operation} creation (aka index creation)
    • [clustered index] 
      • the data becomes the leaf level of the clustered index [5]
        • data in the table is moved to new pages and ordered by the clustering key
        • maintained logically rather than physically
          • the order is maintained through a doubly linked list (aka page chain)
            • based on the definition of the clustered index
              • same as the order of rows on the data pages
            • deciding on which column(s) to cluster is an important performance consideration [5] 
          • can be ordered in only one way
    • although heap pages do have pointers to next and previous pages, the ordering is arbitrary and doesn’t represent any ordering in the table itself [5]
  • {operation} [SQL Server 2005] rebuilding (aka rebuilding indexes)
    • {goal} generate new statistics [18]
    • {goal} change the number of pages used by the index (fill factor) [18]
    • {goal} reorganize the data on database pages in relative proximity to each other [18]
    • introduced mainly for SQL Server's own internal purposes when applying upgrades and service packs [6]
    • drops and re-creates the index
      • always rebuilds the entire index [3]
        • regardless of the extent of fragmentation [3]
        • an index rebuild for a lightly fragmented index is really overkill [3]
      • there must be enough free space in the database to accommodate the new index [3]
        • otherwise the database will grow to provide the required free space [3]
          • this can be problematic for large indexes [3]
      • reclaims disk space by compacting the pages based on the specified or existing fill factor setting [15]
      • reorders the index rows in contiguous pages [15]
        • removes fragmentation
      • automatically rebuilds all index statistics [18]
        • column statistics aren’t touched
        • usually faster than reorganizing indexes if the logical fragmentation is high [18]
      • when ALL is specified, all indexes on the table are dropped and rebuilt in a single transaction
      • indexes with more than 128 extents are rebuilt in two separate phases [15]
        • logical phase
          • the existing allocation units used by the index are marked for deallocation [15]
          • the data rows are copied and sorted [15]
          • then moved to new allocation units created to store the rebuilt index [15]
        • physical phase
          • the allocation units previously marked for deallocation are physically dropped in short transactions that happen in the background, and do not require many locks [15]
      • serial index rebuild
        • forces an allocation order scan, ignoring the key placement and scanning the object from first IAM to last IAM order [13]
          • via WITH(NOLOCK, INDEX=0)
        • takes longer than a [parallel index rebuild]
      • parallel index rebuild 
        • during rebuild a portion of the data (range) is assigned to each of the parallel workers [13]
          • each parallel worker is assigned its own CBulkAllocator when saving the final index pages [13]
        • a leapfrog behavior of values across extents occurs as the workers copy the final keys into place [13]
          • each worker spreads 1/nth of the data across the entire file instead of packing the key values together in a specific segment of the file [13]
      • can use minimal-logging to reduce transaction log growth [3]
      • may require long-term locks on the table [3]
        • ⇒ can limit concurrent operations [3]
      • replaces
        • DBCC DBREINDEX command 
        • DROP_EXISTING option of CREATE INDEX command
      • can be executed 
        • online (aka online operation)
          • table and indexes are available while the operation is performed 
        • offline (aka offline operation)
      • permissions requirements
        • ALTER permission on the table or view
          • users must be members of either:
            • sysadmin
            • db_ddladmin 
            • db_owner 
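The rebuild variants described above can be sketched with ALTER INDEX as follows. The objects (dbo.DemoRebuild and its indexes) and the fill factor value are hypothetical; the online form assumes an edition that supports online index operations.

```sql
-- Hypothetical demo objects; all names are illustrative assumptions.
CREATE TABLE dbo.DemoRebuild (
    Id       int         NOT NULL,
    Category varchar(20) NOT NULL
);
CREATE UNIQUE CLUSTERED INDEX CIX_DemoRebuild_Id
    ON dbo.DemoRebuild (Id);
CREATE NONCLUSTERED INDEX IX_DemoRebuild_Category
    ON dbo.DemoRebuild (Category);

-- Rebuild a single index and set a new fill factor at the same time.
ALTER INDEX IX_DemoRebuild_Category ON dbo.DemoRebuild
    REBUILD WITH (FILLFACTOR = 90);

-- Rebuild every index on the table with one command.
ALTER INDEX ALL ON dbo.DemoRebuild
    REBUILD;

-- Online rebuild: the table and its indexes stay available during the operation.
-- Assumption: the edition in use supports online index operations;
-- otherwise this statement raises an error and the offline form applies.
ALTER INDEX IX_DemoRebuild_Category ON dbo.DemoRebuild
    REBUILD WITH (ONLINE = ON);
```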
  • {operation} reorganize (aka reorganize indexes, index reorganization)
    • doesn’t update statistics at all [3]
    • always single-threaded [3]
    • always fully logged [3]
      • ⇐ doesn’t prevent transaction log clearing [3]
    • always executed online
      • doesn’t hold blocking locks [3]
    • uses minimal system resources
      • requires 8KB of additional space in the database [3]
    • takes care only of the existing fragmentation [3]
      • ⇒ makes it a better choice for removing fragmentation from a lightly fragmented index [3]
      • ⇒ makes it a poor choice for removing fragmentation from a heavily fragmented index [3]
    • {limitation} can’t use index options
    • replaces the DBCC INDEXDEFRAG command [6]
    • removes some of the fragmentation from an index
      • it’s not guaranteed to remove all the fragmentation
        • just like DBCC INDEXDEFRAG
    • doesn’t have a corresponding option in the CREATE INDEX command [6]
      • because when creating an index there is nothing to reorganize [6]
    • defragments the leaf level of clustered and nonclustered indexes on tables and views by physically reordering the leaf-level pages to match the logical, left-to-right order of the leaf nodes [15]
    • compacts the index pages
      • based on the existing fill factor value
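A short sketch of a reorganize operation, using hypothetical objects (dbo.DemoReorg). Since reorganizing leaves statistics untouched, a separate statistics update is shown as an optional follow-up step.

```sql
-- Hypothetical demo objects; all names are illustrative assumptions.
CREATE TABLE dbo.DemoReorg (
    Id       int         NOT NULL,
    Category varchar(20) NOT NULL
);
CREATE UNIQUE CLUSTERED INDEX CIX_DemoReorg_Id
    ON dbo.DemoReorg (Id);
CREATE NONCLUSTERED INDEX IX_DemoReorg_Category
    ON dbo.DemoReorg (Category);

-- Reorganize a single index; the operation works on the leaf level only.
ALTER INDEX IX_DemoReorg_Category ON dbo.DemoReorg
    REORGANIZE;

-- Reorganize all indexes on the table.
ALTER INDEX ALL ON dbo.DemoReorg
    REORGANIZE;

-- Reorganizing does not refresh statistics, so they are updated separately if needed.
UPDATE STATISTICS dbo.DemoReorg IX_DemoReorg_Category;
```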
  • {operation} disabling (aka index disabling)
    • makes an index completely unavailable
      • ⇒ it can't be used for finding rows for any operations
      • ⇒ the index won't be maintained as changes to the data are made
    • disables one or ALL indexes with a single command
    • indexes must be completely rebuilt to make them useful again
      • ⇐ there is no ENABLE option
        • ⇐ because no maintenance is done while an index is disabled
    • {warning} disabling a [clustered index] on a table makes the data unavailable [6]
      • ⇐ because the data is stored in the clustered index leaf level [6]
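Disabling and re-enabling an index could look like the following sketch. The objects (dbo.DemoDisable) are hypothetical, and only a nonclustered index is disabled so that the table data remains accessible.

```sql
-- Hypothetical demo objects; all names are illustrative assumptions.
CREATE TABLE dbo.DemoDisable (
    Id       int         NOT NULL,
    Category varchar(20) NOT NULL
);
CREATE NONCLUSTERED INDEX IX_DemoDisable_Category
    ON dbo.DemoDisable (Category);

-- Disable the nonclustered index: it is no longer used or maintained.
ALTER INDEX IX_DemoDisable_Category ON dbo.DemoDisable
    DISABLE;

-- The state is visible in the index metadata.
SELECT name, is_disabled
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.DemoDisable');

-- There is no ENABLE option; the index becomes usable again only after a rebuild.
ALTER INDEX IX_DemoDisable_Category ON dbo.DemoDisable
    REBUILD;
```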
  • {operation} dropping (aka index dropping)
  • {operation} changing index options
    • via ALTER INDEX command
  • {warning} regular rebuilding of all the indexes in a database is a poor way to manage indexes
    • wastes resources by rebuilding indexes that are not fragmented
    • it can affect the availability of a system when the indexes can’t be rebuilt online
  • {concept} index extent
    • a group of eight index pages
  • {concept} narrow index
    • narrow in terms of the number of bytes it stores [11]
      • usually has few columns in the index key
      • can accommodate more rows into an 8KB index page than a wide index [1]
        • ⇒ {benefit} reduces I/O
        • ⇒ {benefit} reduces storage requirements
        • ⇒ {benefit} improves database caching
      • {benefit} reduces maintenance overhead
    • small indexes might not have an intermediate level at all
    • the goal when defining indexes isn’t to have only very narrow indexes [5]
      • extremely narrow indexes usually have fewer uses than slightly wider indexes [5]
  • {concept} wide index
    • typically an index with many columns in the index key
      • wider keys cause more I/O and permit fewer key rows to fit on each B-tree page
        • the index requires a larger number of pages than it otherwise would and takes up more disk space, so a lot of space (and potentially buffer pool memory) is wasted
      • large indexes often have more than one intermediate level
    • {benefit} covers more queries
    • {downside} increases maintenance overhead
    • {downside} can adversely affect disk space
  • {concept} ever increasing index
    • e.g. a numeric value can be continuously incremented by the value defined at creation
      • e.g. identity integers
    • {benefit} avoids fragmentation
      • results in sequential IO and maximizes the amount of data stored per page
        • the most efficient use of system resources
    • {benefit} improves write performance
      • SQL Server can much more efficiently write data if it knows the row will always be added to the most recently allocated, or last, page [11]
    • a non-sequential key column can result in a much higher overhead during insertion [11]
      • SQL Server has to find the correct page to write to and pull it into memory
        • if the page is full, SQL Server will need to perform a page split to create more space [11]
        • during a page split, a new page is allocated, and half the records are moved from the old page to the newly-allocated page [11]
        • each page has a pointer to the previous and next page in the index, so those pages will also need to be updated [11]
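To contrast the two insertion patterns described above, the sketch below defines one table clustered on an ever-increasing identity key and one clustered on a random GUID. Both tables and all names are hypothetical and exist only for illustration.

```sql
-- Ever-increasing clustering key: new rows are always appended to the last page.
-- All object names are illustrative assumptions.
CREATE TABLE dbo.DemoEvents (
    EventId   int IDENTITY(1,1) NOT NULL,
    EventTime datetime          NOT NULL,
    Payload   nvarchar(200)     NULL,
    CONSTRAINT PK_DemoEvents PRIMARY KEY CLUSTERED (EventId)
);

-- Non-sequential clustering key: NEWID() values are random, so new rows land
-- on arbitrary pages, which can lead to page splits and fragmentation.
CREATE TABLE dbo.DemoEventsGuid (
    EventId   uniqueidentifier NOT NULL
        CONSTRAINT DF_DemoEventsGuid_EventId DEFAULT NEWID(),
    EventTime datetime         NOT NULL,
    Payload   nvarchar(200)    NULL,
    CONSTRAINT PK_DemoEventsGuid PRIMARY KEY CLUSTERED (EventId)
);

-- Both tables accept the same insert; only the physical placement of rows differs.
INSERT INTO dbo.DemoEvents (EventTime, Payload)
VALUES (GETDATE(), N'first'), (GETDATE(), N'second');

INSERT INTO dbo.DemoEventsGuid (EventTime, Payload)
VALUES (GETDATE(), N'first'), (GETDATE(), N'second');
```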
  • {concept} fragmentation
    • results from data modifications
    • can take the form of gaps in data pages, which waste space, and of a logical ordering of the data that no longer matches the physical ordering [11]
  • {concept} covering index
    • nonclustered index that contains all the columns requested by a query, so the query can be answered without going to the base table
      • allows it to skip the bookmark lookup step and simply return the data the query seeks from its own B-tree [2]
      • when a clustered index is present, a query can be covered using a combination of nonclustered and clustered key columns since the clustered key is the nonclustered index's bookmark [2]
      • the next best thing to having multiple clustered indexes on the same table [2]
    • included columns
      • columns added in the INCLUDE clause of the CREATE INDEX command
      • allow exceeding the 900-byte or 16-key column limits in the leaf level of a nonclustered index [5]
      • appear only in the leaf level
      • don’t affect the sort order of the index rows in any way [5]
      • SQL Server can silently add an included column to indexes [5]
        • e.g. when an index is created on a partitioned table and no ON filegroup or ON partition_scheme_name is specified [5]
          • when partitioning a non-unique, nonclustered index, the Database Engine adds the partitioning column as a non-key (included) column of the index, if it is not already specified [8]
      • column names cannot be repeated in the list [8]
      • column names cannot be used simultaneously as both key and non-key columns [8]
      • {restriction} all data types are allowed except text, ntext, and image [8]
      • {restriction} index must be created or rebuilt offline if any one of the specified non-key columns are varchar(max), nvarchar(max), or varbinary(max) data types [8]
      • [computed columns] that are deterministic and either precise or imprecise can be included columns [8]
      • [computed columns] derived from image, ntext, text, varchar(max), nvarchar(max), varbinary(max), and xml data types can be included in non-key columns as long as the computed column data type is allowable as an included column [8]
    • [SQL Server 2005] can be stored in the same filegroup as the underlying table, or on a different filegroup [12]
  • {concept} [SQL Server 2008] filtered indexes
    • indexes with simple predicates that restrict the set of rows included in the index [4]
    • leaf level contains an index row only for keys that match the filter definition [5]
    • created using a new WHERE clause on a CREATE INDEX statement
    • alternative to indexed views
      • which are more expensive to use and maintain
      • their matching capability is not supported in all editions
      • tend to be more useful for the more classical relational query precomputation scenarios
    • {benefit} have fewer rows and are also narrower than the base table
      • require fewer pages as well
      • for queries with specific constraints on large tables where space is an issue, this kind of index can be quite useful
    • usage scenarios
      • fields are used only occasionally
        • resulting in many NULL entries for that column
          • a traditional index stores a lot of NULLs and wastes a lot of storage space
          • updates to the table have to maintain this index for every row
      • querying a table with a small number of distinct values using a multicolumn predicate where some of the elements are fixed
        • this might be useful for a regular report with special purpose
          • it speeds up a small set of queries while not slowing down updates as much for everyone else
      • when there is a known query condition on an expensive query on a large table
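The sparse-column scenario above might be handled with a filtered index along these lines. The table dbo.DemoShipments and its columns are hypothetical; the sketch assumes SQL Server 2008 or later.

```sql
-- Hypothetical demo table; all object names are illustrative assumptions.
CREATE TABLE dbo.DemoShipments (
    ShipmentId   int      NOT NULL,
    CustomerId   int      NOT NULL,
    ReturnedDate datetime NULL  -- populated only for the few shipments that were returned
);

-- Filtered index: only rows with a non-NULL ReturnedDate get an index row,
-- so the index stays small and most modifications don't have to maintain it.
CREATE NONCLUSTERED INDEX IX_DemoShipments_ReturnedDate
    ON dbo.DemoShipments (ReturnedDate)
    INCLUDE (CustomerId)
    WHERE ReturnedDate IS NOT NULL;

-- The filter is stored in the index metadata.
SELECT name, has_filter, filter_definition
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.DemoShipments');

-- A query whose predicate implies the filter is a candidate for using the index.
SELECT CustomerId, ReturnedDate
FROM dbo.DemoShipments
WHERE ReturnedDate >= '20240101';
```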
  • {concept} indexed views
    • similar to materialized views in Oracle
    • without indexes, views are purely logical
      • ⇒ the data involved has no physical storage [5]
    • the first index built must be a unique clustered index
      • data is stored in the leaf level of the index [5]
      • further nonclustered indexes can be built
    • {constraint} view’s columns must be deterministic
    • {constraint} session-level restrictions
      • options must be set to specific values
        • if these options aren’t set as specified, an error message is raised when attempting to create an indexed view [5]
    • {constraint} only deterministic functions are allowed
      • check via SELECT OBJECTPROPERTY (object_id('<function_name>'), 'IsDeterministic')
      • imprecise values (float or real values) can be used only if the computed value is persisted [5]
    • {constraint} the definition of any underlying object’s schema can’t change [5]
      • must be enforced through WITH SCHEMABINDING option
    • {constraint} if a view's definition contains GROUP BY
      • the SELECT list must include the aggregate COUNT_BIG (*)
        • COUNT_BIG returns a BIGINT, an 8-byte integer
      • can’t contain HAVING, CUBE, ROLLUP, or GROUP BY ALL
      • all GROUP BY columns must appear in the SELECT list
      • constraints apply only to view definition, not to the queries that might use the indexed views [5]
    • {benefit} materialize summary aggregates of large tables [5]
      • eliminates the need of scanning the underlying, large tables [5]
    • {activity} check whether a view is indexable
      • via SELECT OBJECTPROPERTY (OBJECT_ID ('<view_name>'), 'IsIndexable');
    • {activity} check whether a view is indexed
      • via SELECT OBJECTPROPERTY (OBJECT_ID ('<view_name>'), 'IsIndexed');
    • alternative solutions
      • temporary tables
      • persistent tables
    • [Query Optimizer] doesn’t always choose an indexed view for a query’s execution plan [5]
      • {restriction} considered only in [Enterprise], [Developer] and [Evaluation] editions
      • the base tables might be accessed directly [5]
      • use NOEXPAND hint in FROM clause to make sure that the indexed view isn’t expanded into its underlying SELECT statement [5]
        • {recommendation} NOEXPAND hint should be used only when it improves the query’s performance [5]
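Putting the indexed view constraints together, a minimal sketch could look as follows. The table and view names are hypothetical; GO is the batch separator used by client tools such as SSMS and sqlcmd and is needed because CREATE VIEW must be the first statement in its batch.

```sql
-- Session options required for creating indexed views.
SET NUMERIC_ROUNDABORT OFF;
SET ANSI_PADDING, ANSI_WARNINGS, CONCAT_NULL_YIELDS_NULL,
    ARITHABORT, QUOTED_IDENTIFIER, ANSI_NULLS ON;
GO

-- Hypothetical base table; all object names are illustrative assumptions.
CREATE TABLE dbo.DemoOrderLines (
    OrderLineId int NOT NULL,
    ProductId   int NOT NULL,
    Quantity    int NOT NULL
);
GO

-- Schema-bound view with GROUP BY; COUNT_BIG(*) is mandatory in this case.
CREATE VIEW dbo.vDemoProductTotals
WITH SCHEMABINDING
AS
SELECT ProductId,
       SUM(Quantity) AS TotalQuantity,
       COUNT_BIG(*)  AS LineCount
FROM dbo.DemoOrderLines
GROUP BY ProductId;
GO

-- The first index on the view must be a unique clustered index; it materializes the view.
CREATE UNIQUE CLUSTERED INDEX CIX_vDemoProductTotals
    ON dbo.vDemoProductTotals (ProductId);
GO

-- Confirm that the view is now indexed.
SELECT OBJECTPROPERTY(OBJECT_ID('dbo.vDemoProductTotals'), 'IsIndexed') AS IsIndexed;

-- NOEXPAND makes the query read the materialized view instead of the base table.
SELECT ProductId, TotalQuantity
FROM dbo.vDemoProductTotals WITH (NOEXPAND);
```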
  • indexes on computed columns
    • without indexes, computed columns are purely logical
      • the data involved has no physical storage [5]
        • recomputed every time a row is accessed
          • {exception} the computed column is marked as PERSISTED [5]
      • {restriction} see the requirements for indexed views
        • {exception} table-level restrictions don’t apply [5]
      • {restriction} allowed on deterministic, precise (and persisted imprecise) computed columns where the resulting data type is otherwise indexable [5]
    • {activity} check whether a column is deterministic
      • via SELECT COLUMNPROPERTY (OBJECT_ID('<table_name>'), '<column_name>', 'IsDeterministic');
    • {activity} check whether a column is precise
      • via SELECT COLUMNPROPERTY (OBJECT_ID('<table_name>'), '<column_name>', 'IsPrecise');
    • {activity} check if a column is persisted
      • via SELECT is_persisted
        FROM sys.computed_columns
        WHERE object_id = OBJECT_ID('<table_name>')
        AND name = '<column_name>';
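The checks above can be combined into a small end-to-end sketch. The table dbo.DemoInvoices and its columns are hypothetical; creating the index assumes the same session SET options as for indexed views, which are the defaults in most client tools.

```sql
-- Hypothetical demo table; all object names are illustrative assumptions.
CREATE TABLE dbo.DemoInvoices (
    InvoiceId   int           NOT NULL,
    NetAmount   decimal(12,2) NOT NULL,
    TaxRate     decimal(5,4)  NOT NULL,
    -- Persisted computed column: the value is physically stored with the row.
    GrossAmount AS (NetAmount * (1 + TaxRate)) PERSISTED
);

-- Check that the computed column qualifies for indexing.
SELECT COLUMNPROPERTY(OBJECT_ID('dbo.DemoInvoices'), 'GrossAmount', 'IsDeterministic') AS IsDeterministic,
       COLUMNPROPERTY(OBJECT_ID('dbo.DemoInvoices'), 'GrossAmount', 'IsPrecise')       AS IsPrecise;

-- Check whether the column is persisted.
SELECT name, is_persisted
FROM sys.computed_columns
WHERE object_id = OBJECT_ID('dbo.DemoInvoices');

-- Index the computed column like any other column.
CREATE NONCLUSTERED INDEX IX_DemoInvoices_GrossAmount
    ON dbo.DemoInvoices (GrossAmount);
```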
  • {concept} missing index
    • index that would have led to an optimal query plan
    • [query optimizer] when generating a query plan, it analyzes what are the best indexes for a particular filter condition [17]
      • if the best indexes do not exist it generates a suboptimal query plan [17]
      • stores information about the missing indexes
      • {activity} review missing indexes
        • via sys.dm_db_missing_index_group_stats
          • when table metadata changes, all missing index information about the table is deleted [17]
            • e.g. column added/dropped, index added
        • via MissingIndexes element in XML Showplans
          • correlate indexes that the query optimizer considers missing with the queries for which they are missing [17]
          • turn on via SET STATISTICS XML ON option
      • the missing indexes feature is not intended to fine tune an indexing configuration [17]
        • it cannot gather statistics for more than 500 missing index groups [17]
          • after this threshold is reached, no more missing index group data is gathered [17]
            • the threshold is not a tunable parameter and cannot be changed [17]
        • it does not specify an order for columns to be used in an index [17]
        • for queries involving only inequality predicates, it returns less accurate cost information [17]
        • it reports only include columns for some queries
          • index key columns must be manually selected [17]
        • it returns only raw information about columns on which indexes might be missing [17]
          • information returned might require additional processing before being used to create an index [17]
        • it does not suggest [filtered indexes] [17]
        • it can return different costs for the same missing index group that appears multiple times in XML Showplans [17]
          • it can occur when different parts of a single query benefit differently from the same missing index group [17]
        • it does not consider [trivial query plans] [17]
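One way to review the collected missing index information is to join the related dynamic management views, as in the sketch below. The ordering expression is merely a commonly used heuristic and an assumption of this example, not an official ranking; querying the views requires the VIEW SERVER STATE permission.

```sql
-- List the missing index suggestions recorded since the last restart,
-- ordered by a rough estimate of their potential benefit (heuristic, see above).
SELECT TOP (20)
       d.statement          AS table_name,
       d.equality_columns,
       d.inequality_columns,
       d.included_columns,
       s.user_seeks,
       s.user_scans,
       s.avg_total_user_cost,
       s.avg_user_impact
FROM sys.dm_db_missing_index_group_stats AS s
JOIN sys.dm_db_missing_index_groups      AS g
     ON g.index_group_handle = s.group_handle
JOIN sys.dm_db_missing_index_details     AS d
     ON d.index_handle = g.index_handle
ORDER BY s.avg_total_user_cost * s.avg_user_impact * (s.user_seeks + s.user_scans) DESC;
```

As the limitations above point out, the output is raw information: the column order and the choice between key and included columns still have to be decided manually.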
          • {concept} fragmented index: 
            • index that is not in the same logic order as it is stored in memory or on storage [19]
            • caused by 
              • modifications to data can cause the information in the index to become scattered in the database (fragmented) [15]
              • autoshrink option
            • heavily fragmented indexes can degrade query performance and cause your application to respond slowly [15]
            • index fragmentation 
              • exists when indexes have pages in which the logical ordering, based on the key value, does not match the physical ordering inside the data file [15]
              • {type} extent fragmentation (aka external index fragmentation)
                • the pages get out of physical order, as a result of data modifications [11]
                  • can result in random I/O
                    • does not perform as well as sequential I/O [11]
                • is truly bad only when SQL Server is doing an ordered scan of all or part of a table or an index [6]
                  • only when the pages need to be fetched in logical order does SQL Server need to follow the page chain
                    • if the pages are heavily fragmented, this operation is more expensive than if there were no fragmentation [6]
                  • if seeking individual rows through an index, it doesn't matter where those rows are physically located [6]
              • {type} page fragmentation (aka internal index fragmentation)
                • occurs when there are gaps in the data pages
                  • reduces the amount of data that can be stored on each page
                    • increases the overall amount of space needed to store the data [11]
                  • [scanning] the entire table or index involves more read operations than if no free space were available on your pages [6]
                • some fragmentation is desirable in order to minimize the [page splits] [6]
            • fragmentation on small indexes is often not controllable
              • the pages of small indexes are stored on mixed extents [15]
                • mixed extents are shared by up to eight objects [15]
                  • the fragmentation in a small index might not be reduced after reorganizing or rebuilding the index [15]
            • {affect} primarily impacts range-scan queries
              • singleton queries would not notice much impact [11]
              • can benefit from routine defragmentation efforts [11]
            • {affect} disk performance and the effectiveness of the SQL Server read-ahead manager [10]
            • disk latency
              • [small-scale environment] only the highest fragmentation levels had a significant negative impact on disk latency [10]
              • [large-scale environment] the impact on disk latency was significantly lower and never became an issue 
                • due to increased I/O performance provided by the SAN [10]
            • {action} detect fragmentation
              • analyze the index to determine the degree of fragmentation
                • via sys.dm_db_index_physical_stats 
              • controlled through
                • fillfactor setting 
                • regular defrag operations
            • {remedy} reorganize indexes
            • {remedy} rebuild indexes
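
The definition of index fragmentation above (logical order not matching physical order) can be illustrated with a small toy model. This is only a sketch, not how SQL Server computes the figure: the function name and the page numbers are hypothetical, and a page is counted as out of order when the page that follows it logically is not the next physical page allocated to the index.

```python
def logical_fragmentation_pct(page_chain):
    """Percentage of out-of-order pages in a leaf-level page chain.

    page_chain lists physical page numbers in logical (key) order.
    Toy rule (an assumption): a page is out of order when its logical
    successor is not the next physical page allocated to the index.
    """
    if len(page_chain) < 2:
        return 0.0
    allocated = sorted(page_chain)
    next_physical = dict(zip(allocated, allocated[1:]))
    out_of_order = sum(
        1
        for current, following in zip(page_chain, page_chain[1:])
        if next_physical.get(current) != following
    )
    return 100.0 * out_of_order / len(page_chain)


print(logical_fragmentation_pct([10, 11, 12, 13]))  # pages stored in key order
print(logical_fragmentation_pct([10, 13, 11, 12]))  # pages scattered by modifications
```

In practice the degree of fragmentation would be read from sys.dm_db_index_physical_stats rather than computed by hand.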
          • {concept} page splits
            • when a new row is added to a full index page, half the rows are moved to a new page to make room for the new row
              • the new page must be linked into the index's page chain, and usually the new page is not contiguous to the page being split [6]
            • resource intensive operation
            • occurs frequently
            • can lead to external fragmentation [6]
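
A minimal simulation of the page-split behavior described above, under simplifying assumptions: pages hold a fixed number of rows, a full page moves its upper half to a new page, and new pages are always allocated at the end of the file. The class `LeafLevel` and its attributes are hypothetical names used only for illustration; the real storage engine is considerably more involved.

```python
import bisect


class LeafLevel:
    """Toy leaf level of an index: sorted pages linked in logical order."""

    def __init__(self, rows_per_page):
        self.rows_per_page = rows_per_page  # must be at least 2
        self.pages = [[]]        # keys per page, pages kept in logical order
        self.physical_ids = [0]  # physical page number of each logical page
        self.next_free_page = 1  # assumption: new pages go to the end of the file
        self.splits = 0

    def insert(self, key):
        # Locate the page whose key range contains the new key.
        first_keys = [page[0] for page in self.pages[1:]]
        position = bisect.bisect_right(first_keys, key)
        page = self.pages[position]
        if len(page) >= self.rows_per_page:
            # Page split: move the upper half of the rows to a new page
            # and link it right after the full page in the logical chain.
            half = len(page) // 2
            new_page = page[half:]
            del page[half:]
            self.pages.insert(position + 1, new_page)
            self.physical_ids.insert(position + 1, self.next_free_page)
            self.next_free_page += 1
            self.splits += 1
            if key >= new_page[0]:
                page = new_page
        bisect.insort(page, key)


leaf = LeafLevel(rows_per_page=4)
for key in range(0, 160, 10):  # initial ascending load
    leaf.insert(key)
for key in (1, 2, 3):          # later inserts that land in the first page
    leaf.insert(key)
print(leaf.splits, leaf.physical_ids)
```

After the late inserts, the page created by the last split sits second in the logical chain but last in the file, which is the non-contiguity that leads to external fragmentation.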
          • {concept} interleaving 
            • occurs when index extents are not completely contiguous within the data file, leaving extents from one or more indexes intermingled in the file [18]
            • can occur even when there is no logical fragmentation
              • because all index pages are not necessarily contiguous, even when logical ordering matches physical ordering [18]
          • {concept} leap frog behavior 
            • behavior occurs during [parallel index rebuild]
            • when the key range is leap frogged the fragmentation limits [SQL Server]’s I/O size to 160K instead of 508K and drives the number of I/O requests much higher [13]
            • {workaround} [partitioned table] partition the table on separate files matching the DOP used to build the index
              • allows better alignment of parallel workers to specific partitions, avoiding the [leap frog behavior]
            • {workaround} [non-partitioned table] aligning the number of files with the DOP may be helpful [13]
              • with reasonably even distribution of free space in each file the allocation behavior is such that alike keys will be placed near each other [13]
            • {workaround} for single partition rebuild operations consider serial index building behaviors to minimize fragmentation behaviors [13]
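
To get a feeling for why the smaller I/O size "drives the number of I/O requests much higher", the arithmetic can be sketched as below. The 160K and 508K sizes come from the note above; the 1 GB amount of data and the function name are assumptions made only for the example, and the model ignores everything except the request size.

```python
def io_requests(total_kb, io_size_kb):
    """Number of I/O requests needed to read total_kb at a given request size."""
    return -(-total_kb // io_size_kb)  # ceiling division on integers


one_gb_in_kb = 1024 * 1024  # hypothetical amount of index data to read

print(io_requests(one_gb_in_kb, 508))  # requests at the larger I/O size
print(io_requests(one_gb_in_kb, 160))  # requests when limited to 160K
```

For the same amount of data, the 160K limit needs roughly three times as many requests as 508K reads.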
          • {concept} defragmentation
            • does not yield performance gains in every case [10]
            • steps
              • {activity} ensure there are no resource issues
                • common resource issues are related to
                  • I/O subsystem performance
                  • memory usage
                  • CPU utilization
                • physical disk fragmentation [10]
                  • {recommendation}  [small-scale environments] correct disk fragmentation before running index defragmentation [10]
                    • not necessary on SAN environments [10]
                • inappropriate schemas [10]
                • out-of-date statistics
                  • {recommendation} resolve any out-of-date statistics before defragmenting [10]
              • {activity} determine the amount of index fragmentation 
                • focus on the larger indexes 
                  • their pages are less likely to be cached [10]
                  • {best practice} regularly monitor the fragmentation levels on indexes [10]
              • {activity} determine workload type 
                • not all workload types benefit from defragmenting [10]
                  • read-intensive workload types that do significant disk I/O benefit the most [10]
                  • DSS workload types benefit much more than OLTP workload types [10]
              • {activity} determine queries that perform poorly [10]
                • determine the amount of I/O performed by a query [10]
                • queries that scan large ranges of index pages are affected most by fragmentation and gain the most from defragmenting [10]
              • {activity} understand the effect of fragmentation on disk throughput and the read-ahead manager [10]
                • for queries that scan one or more indexes, the read-ahead manager is responsible for scanning ahead through the index pages and bringing additional data pages into the SQL Server data cache [10] 
                  • the read-ahead manager dynamically adjusts the size of reads it performs based on the physical ordering of the underlying pages [10]
                  • when there is low fragmentation, the read-ahead manager can read larger blocks of data at a time, more efficiently using the I/O subsystem [10] 
                  • as the data becomes fragmented, the read-ahead manager must read smaller blocks of data [10]
                  • the number of read-ahead operations that can be issued is independent of the physical ordering of the data; however, smaller read requests take more CPU resources per block, resulting in less overall disk throughput [10]
                    • when fragmentation exists and the read-ahead manager is unable to read the larger block sizes, it can lead to a decrease in overall disk throughput [10]
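
The relationship between fragmentation and read size described above can be sketched with a toy model. It assumes that one read request can only cover pages that are physically contiguous and that a request is capped at `max_pages_per_read` pages; both the function and the parameter are hypothetical and do not reflect the actual read-ahead algorithm.

```python
def read_requests(page_chain, max_pages_per_read):
    """Group a logical page chain into reads of physically contiguous pages."""
    requests = []
    current = []
    for page in page_chain:
        contiguous = current and page == current[-1] + 1
        if contiguous and len(current) < max_pages_per_read:
            current.append(page)
        else:
            if current:
                requests.append(current)
            current = [page]
    if current:
        requests.append(current)
    return requests


ordered = list(range(16))               # no fragmentation
fragmented = [0, 7, 1, 2, 3, 4, 5, 6]   # one page out of place

print(len(read_requests(ordered, max_pages_per_read=8)))
print(read_requests(fragmented, max_pages_per_read=8))
```

With ordered pages the scan is served by a few large reads, while a single out-of-place page already breaks the chain into smaller requests.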
          • {constraint} UNIQUE 
            • allows the columns that make up the constraint to be nullable, but it doesn’t allow all key columns to be NULL for more than one row [5]
            • unique indexes created using the CREATE INDEX command are no different from indexes created to support constraints [5]
              • [Query Optimizer] makes decisions based on the presence of the unique index rather than on whether the column was declared as a constraint [5]
                • ⇐ it is irrelevant how the index was created [5]
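
The NULL handling of the UNIQUE constraint can be mimicked with a small check. This is a sketch under the assumption that NULL (represented here by Python's `None`) is treated like any other value when keys are compared for uniqueness, which is consistent with the rule above that only one row may have all key columns NULL. The function name and sample rows are hypothetical.

```python
def first_unique_violation(rows):
    """Return the first key tuple that repeats an earlier one, or None."""
    seen = set()
    for row in rows:
        if row in seen:
            return row
        seen.add(row)
    return None


# One row with all key columns NULL is accepted ...
print(first_unique_violation([(None, None), (1, None), (None, 2)]))
# ... a second one is rejected.
print(first_unique_violation([(None, None), (1, 2), (None, None)]))
```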
          • {constraint} PRIMARY KEY 
            • all the columns involved in the PRIMARY KEY don’t allow NULL values
              • if any of the columns allow NULL values, the PRIMARY KEY constraint can’t be created [5]
            • its value is unique within the table
              • uniqueness defined via the UNIQUE constraint
              • ⇒ a unique index is created on the columns that make up the constraint [5]
            • when a PRIMARY KEY or UNIQUE constraint is created, the underlying index structure that’s created is the same as though you had used the CREATE INDEX command directly [5]
              • ⇐ however usage and features have some differences
                • a constraint-based index can’t have other features added (e.g. included columns, filters), while a UNIQUE index can have these features while still enforcing uniqueness over the key definition of the index [5]
                • when a UNIQUE index - which doesn’t support a constraint - is referenced through a FOREIGN KEY constraint, indexes with filters can’t be referenced [5]
                  • ⇐ an index that doesn’t use filters or an index that uses included columns can be referenced [5]
            • ⇐ the names of the indexes built to support these constraints are the same as the constraint names [5]
          • {concept} index design
            • designing and using proper indexes is one of the keys to maximizing query performance [1]
              • over-indexing 
                • can be worse than under-indexing
                • extreme case: index every column
              • under-indexing
                • extreme case: having no index
            • very complex, time-consuming, and error-prone process
              • even for moderately complex databases and workloads [12]
            • requires knowing
              • the data
              • the workload
              • how SQL Server works [5]
                • index internals
                • statistics
                • query optimization
                • maintenance
            • {activity} understand the characteristics of the database itself [12]
              • e.g. characteristics of OLAP vs OLTP
            • {activity} understand the characteristics of the most frequently used queries [12]
            • {activity} understand the characteristics of the columns used in the queries [12]
              • data type
              • column uniqueness
                • unique indexes are more useful for the query optimizer than nonunique indexes [12]
              • data distribution in the column
              • NOT NULL values percentage 
              • columns with categories of values
              • columns with distinct ranges of values
              • sparse columns 
              • computed columns
            • {activity} determine which index options might enhance performance when the index is created or maintained [12]
              • e.g. creating a clustered index on an existing large table would benefit from the ONLINE index option 
            • {activity} determine the optimal storage location for the index [12]
              • storage location of indexes can improve query performance by increasing disk I/O performance [12]
                • ⇐ storing a nonclustered index on a filegroup that is on a different disk than the table filegroup can improve performance because multiple disks can be read at the same time [12] 
              • [partitioning] clustered and nonclustered indexes can use a partition scheme across multiple filegroups
                • ⇐ [partitioning] makes large tables or indexes more manageable by letting you access or manage subsets of data quickly and efficiently, while maintaining the integrity of the overall collection
                • {default} indexes created on a partitioned table will also use the same partitioning scheme and partitioning column [9]
                  • ⇐ the index is aligned with the table [9]
                  • allows for easier management and administration, particularly with the sliding-window scenario [9]
            • {guideline} avoid over-indexing heavily updated tables 
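              • large numbers of indexes on a table affect the performance of INSERT, UPDATE, DELETE, and MERGE statements, because all indexes must be adjusted as the data in the table changes [12]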
            • {guideline} use many indexes to improve query performance on tables with low update requirements, but large volumes of data [12]
            • {guideline} indexing small tables may not be optimal 
              • because it can take the query optimizer longer to traverse the index searching for data than to perform a simple table scan [12]
            • {guideline} indexes on views can provide significant performance gains when the view contains
              • aggregations
              • table joins
              • combination of aggregations and joins
              • the view does not have to be explicitly referenced in the query for the query optimizer to use it [12]
            • {guideline} keep your indexes as compact and narrow as possible while still meeting the business needs your system was designed to address [2]
              • ⇐ the index needs to be efficient
            • {guideline} use the Database Engine Tuning Advisor to analyze your database and make index recommendations [12]
            • {guideline} create nonclustered indexes on the columns that are frequently used in predicates and join conditions in queries [12]
            • {guideline} write queries that insert or modify as many rows as possible in a single statement
              • instead of using multiple queries to update the same rows
              • exploits optimized index maintenance
            • {guideline} consider using filtered indexes on columns that have well-defined subsets [12]
              • e.g. sparse columns
              • e.g. columns with mostly NULL values
              • e.g. columns with categories of values
              • e.g. columns with distinct ranges of values
            • {guideline} consider indexing computed columns
            • {recommendation} use a couple of relatively narrow columns that, together, can form a unique key [11]
              • saves the cost of the uniqueifier in the data pages of the leaf level of your clustered index [11]
              • still results in an increase of the row size for your clustering key in the index pages of both your clustered and nonclustered indexes
            • {recommendation} carefully design and determine when and what columns should be included [1]
              • determine whether the gains in query performance outweigh the effect on performance during data modification and the additional disk space requirements [1]
            • {recommendation} keep the index size small
              • tends to be efficient [1]
              • features such as INCLUDE and filtered indexes can profoundly affect the index in both size and usefulness [5]
            • {recommendation} have all the data required by the query obtained from the covering index without touching the clustered index or table data page 
              • results in less I/O and better query throughput
              • poorly designed indexes and a lack of indexes are primary sources of database application bottlenecks [12]
              • the right index, created for the right query, can reduce query execution time from hours down to seconds [5]
                • without a useful index the entire table is scanned to retrieve the data
                • in joins some tables may be scanned multiple times to find all the data needed to satisfy the query [2]
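
Whether an index covers a query, as in the covering-index recommendation above, comes down to a set comparison: every column the query references must be available in the index. The sketch below is a simplification with hypothetical names; it treats key columns, included columns, and (optionally) the clustering key carried by a nonclustered index as the columns available without touching the base table.

```python
def is_covering(key_columns, included_columns, referenced_columns,
                clustering_key_columns=()):
    """True if all columns referenced by the query are available in the index."""
    available = (set(key_columns)
                 | set(included_columns)
                 | set(clustering_key_columns))
    return set(referenced_columns) <= available


# Hypothetical index on (CustomerId) INCLUDE (OrderDate, Total)
print(is_covering(["CustomerId"], ["OrderDate", "Total"],
                  ["CustomerId", "Total"]))
# A query that also needs ShipCity has to go back to the base table.
print(is_covering(["CustomerId"], ["OrderDate", "Total"],
                  ["CustomerId", "ShipCity"]))
```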
            • {recommendation} create a clustered index on all user tables [14]
              • the performance benefits of having a clustered index on a table outweigh the negative ones [14]
                • tables with clustered indexes have higher 'page splits/sec' values [14]
                • heaps can have higher throughput (rows/sec) for individual or batch CRUD operations [14]
                • concurrent insert operations are often susceptible to encountering hot spots [14]
                  • the contention at the particular insert location in the table adversely affects performance [14]
              • disk space required is smaller than for a heap [14]
              • deleting rows frees up more disk space when compared to a heap table
                • [clustered indexes] empty extents are automatically deallocated from the object [14]
                • [heaps] extents are not deallocated automatically [14]
                  • extents are held on for later reuse [14]
                    • results in table size bloating [14]
                • the residual disk space for heap tables could be reclaimed by using maintenance tasks [14]
                  • ⇐ requires additional work and processing [14]
            • {recommendation} choose the optimum sort order by examining the individual workload queries [17]
            • {recommendation} limit changes to no more than 1-2 indexes
          • metadata
            • {recommendation} for data that is not persisted, make periodic copies so that it can be used for further analysis
            • [SQL Server 2000] sys.sysindexes
              • system table that stores system-level information about indexes
              • a pointer to each index's root node is stored in sysindexes' root column
              • {limitation} doesn't support partitioned tables
                • use sys.indexes instead
              • {warning} may be removed in future versions 
            • [SQL Server 2005] sys.indexes
              • contains a row per index or heap of a tabular object, such as a table, view, or table-valued function
              • metadata's visibility is limited to the securables that a user either owns or on which the user has been granted permissions
            • [SQL Server 2005] sys.index_columns
              • contains one row per column that is part of an index or unordered table (heap).
            • [SQL Server 2005] sys.dm_db_index_physical_stats
              • returns the size and fragmentation information for the data and indexes of the specified table or view
              • useful for determining the size and health of indexes [5]
              • replaces DBCC SHOWCONTIG 
              • when the function is called, SQL Server traverses the page chains for the allocated pages for the specified partitions of the table or index [5]
              • has similar parameters as sys.dm_db_index_physical_stats
            • [SQL Server 2008] sys.dm_db_missing_index_group_stats
              • returns summary information about missing index groups [17]
              • updated by every query execution
                • ⇒ not by every query compilation or recompilation
              • not persisted
            • [SQL Server 2008] sys.dm_db_missing_index_groups 
              • returns information about a specific group of missing indexes [17]
                • ⇐ the group identifier and the identifiers of all missing indexes that are contained in that group [17]
              • not persisted
              • {required permissions} VIEW SERVER STATE
              • {required permissions} [SQL Server 2022] VIEW SERVER PERFORMANCE STATE
            • [SQL Server 2008] sys.dm_db_missing_index_details 
              • DMV that returns detailed information about missing indexes [17]
              • updated when a query is optimized by the query optimizer
              • not persisted
              • {limitation} the result set is limited to 600 rows
                • ⇒  address the missing indexes issues to see further values
              • {required permissions} VIEW SERVER STATE
              • {required permissions} [SQL Server 2022] VIEW SERVER PERFORMANCE STATE on the server
            • [SQL Server 2008] sys.dm_db_missing_index_columns 
              • returns information about the database table columns that are missing an index [17]
              • not persisted
              • {required permissions} VIEW SERVER STATE
              • {required permissions} [SQL Server 2022] VIEW SERVER PERFORMANCE STATE
            • {undocumented} [SQL Server 2012] sys.dm_db_database_page_allocations 
              • returns a row for every page used or allocated based on the given parameters [5]
              • replaces the DBCC IND undocumented command [5]
            • [SQL Server 2019] sys.dm_db_page_info
              • function that returns information about a page in a database
                • returns one row that contains the header information from the page, including the object_id, index_id, and partition_id
              • replaces  DBCC PAGE in most cases
          • {command} DBCC SHOWCONTIG 
            • determines how full the pages in a table and/or index really are
            • shows three types of fragmentation: 
              • extent scan fragmentation
              • logical scan fragmentation
                • the percentage of out-of-order pages in an index [10]
                • not relevant on heaps and text indexes [10]
                • high values for logical scan fragmentation can lead to degraded performance of index scans [10]
              • scan density
                • not valid when indexes span multiple files [10]
            • {concept} average page density
              • measure of fullness for leaf pages of an index [10]
              • low values for average page density can result in more pages that must be read to satisfy a query [10]
                • reorganizing the pages so they have a higher page density can result in less I/O to satisfy the same query [10]
                • generally, tables have a high page density after initial loading of data [10]
                  • page density may decrease over time as data is inserted, resulting in splits of leaf pages [10]
                • value dependent on the fillfactor specified when the table was created [10]
            • key indicators
              • Logical Scan Fragmentation 
              • Avg. Page Density
            • {option} WITH FAST 
              • allows DBCC SHOWCONTIG to avoid scanning the leaf pages of an index [10]
                •  it cannot report page density numbers [10]
              • consider using it on a busy server [10]
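
The effect of average page density on the number of pages read can be sketched as follows. This is a toy model that assumes fixed-size rows, so density is expressed as rows per page instead of bytes; the function names and sample figures are hypothetical.

```python
import math


def average_page_density_pct(rows_on_page, rows_per_full_page):
    """Average fullness of the leaf pages, as a percentage."""
    if not rows_on_page:
        raise ValueError("at least one page is required")
    return 100.0 * sum(rows_on_page) / (len(rows_on_page) * rows_per_full_page)


def pages_to_scan(row_count, rows_per_full_page, density_pct):
    """Leaf pages that must be read to scan row_count rows at a given density."""
    return math.ceil(row_count * 100 / (rows_per_full_page * density_pct))


print(average_page_density_pct([50, 100, 50, 100], rows_per_full_page=100))
print(pages_to_scan(10_000, rows_per_full_page=100, density_pct=100))
print(pages_to_scan(10_000, rows_per_full_page=100, density_pct=50))
```

Halving the page density doubles the number of pages a full scan has to read, which is why low Avg. Page Density values translate into more I/O for the same query.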
          • {command} DBCC INDEXDEFRAG
            • defragments index pages
            • defragments only individual indexes [10]
              • cannot rebuild all indexes with a single statement [10]
            • performs an in-place update of the pages to defragment them [20]
              • ⇐ does not require locking the table for the entirety of the process [20]
                • allows users to access the table [20]
              • only defragments the pages within the extents [20]
                • ⇐ it does not defragment the extents [20]
                  • ⇐ doesn’t affect extent scan fragmentation
                • reduces logical scan fragmentation
            • online operation 
            • skips locked pages as it encounters them
              • may not be able to completely eliminate fragmentation [18] [6]
            • each unit of work performed occurs as a separate transaction [18]
              • can be stopped and restarted without losing any work [18]
            • {drawback} does not help to untangle indexes that have become interleaved within a data file [10]
              • even if indexes can be rebuilt so that there is minimal interleaving, this does not have a significant effect on performance [10]
            • {drawback} does not correct extent fragmentation on indexes [10]
            • {drawback} can produce large transaction log files [20]
              • {workaround} decrease the backup interval for transaction backups while DBCC INDEXDEFRAG is running [20]
              • ⇐ keeps the transaction log files from growing too large [20]
            • {drawback} can make extent switching worse [20]
              • because a page order scan, which might be in order after DBCC INDEXDEFRAG, might switch between more extents [20]
            • {phase: 1} compacts the pages and attempts to adjust the page density to the fillfactor that was specified when the index was created
              • attempts to raise the page-density level of pages to the original fillfactor [10]
            • {phase: 2} defragments the index by shuffling the pages so that the physical ordering matches the logical ordering of the leaf nodes of the index [10]
              • ⇐ performed as a series of small discrete transactions
                • has a small impact on overall system performance [10]
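
The two phases of DBCC INDEXDEFRAG described above can be mimicked with a toy function. It is only a sketch under strong assumptions: rows have a fixed size, compaction never spreads rows over more pages than before, and the reordering reuses the lowest page numbers already owned by the index. The function name and the sample pages are hypothetical.

```python
def toy_indexdefrag(pages, rows_per_full_page, fillfactor_pct):
    """pages: list of (physical_page, row_count) in logical order.

    Returns the pages after a toy compaction (phase 1) and a toy
    reordering of the remaining pages into physical order (phase 2).
    """
    # Phase 1: compact rows so that pages are filled up to the fillfactor.
    target = max(1, rows_per_full_page * fillfactor_pct // 100)
    total_rows = sum(rows for _, rows in pages)
    full_pages, remainder = divmod(total_rows, target)
    row_counts = [target] * full_pages + ([remainder] if remainder else [])
    if len(row_counts) > len(pages):
        # Toy rule: compaction only raises density, it never adds pages.
        row_counts = [rows for _, rows in pages]

    # Phase 2: make the physical order match the logical order by
    # reusing the lowest page numbers already allocated to the index.
    physical = sorted(page for page, _ in pages)[: len(row_counts)]
    return list(zip(physical, row_counts))


before = [(5, 2), (1, 3), (9, 1), (3, 2)]
print(toy_indexdefrag(before, rows_per_full_page=4, fillfactor_pct=100))
```

Because the pages are only shuffled among those the index already owns, the sketch also hints at why extent fragmentation is not corrected by this command.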
          • {command} DBCC DBREINDEX
            • rebuilds an index or all indexes on a table 
              • ⇐ builds indexes the same way as CREATE INDEX [10]
                • restores the page density levels to the original fillfactor or to the provided value [10]
                • rebuilds statistics automatically during the rebuild 
                  • can lead to dramatic improvements in workload performance [10]
                • can take advantage of multiple-processor computers [10]
                  • can be significantly faster when rebuilding large or heavily fragmented indexes [10]
              • rearranges the extents and the pages within a table so that they are in index order [19]
                • ⇐ pages scan faster when they have been written in heap memory in the same order as their next page's pointers (index order) [19]
                • extents can be paged from disk faster during a scan operation when they're written to storage in the same order as their next extent pointers (index order) [19]
            • offline operation
            • its effectiveness can be influenced by the available free space [10]
              • ⇐ without large enough contiguous blocks of free space, DBREINDEX may be forced to reuse other areas of free space within the data files, resulting in indexes being rebuilt with a small amount of logical fragmentation [10]
              • the amount of free space needed varies and is dependent on the number of indexes being created in the transaction [10]
                • {guideline} [clustered indexes] required free space = 1.2 * (average row size) * (number of rows) [10]
                • [nonunique clustered index] when it is rebuilt, free space is also needed for both the clustered and any nonclustered indexes
                  • the nonclustered indexes are implicitly rebuilt because SQL Server must generate new unique identifiers for the rows [10]
                  • {guideline} [nonclustered indexes] required free space = (average row size) * (number of rows)
                    • average row size of each row in the nonclustered index considers the length of the nonclustered key plus the length of clustering key or row ID [10]
            • runs as one atomic transaction [18]
              • when stopped the changes are rolled back [18]
            • [clustered indexes][locking] puts an exclusive table lock on the table [19]
              • prevents any users from accessing the table [19]
            • [nonclustered indexes][locking] puts a shared table lock
              • prevents all but SELECT operations from being performed on it [19]
            • {good practice} specify individual indexes for defragmentation [10]
              • ⇐ gives more control over the operations being performed [10]
              • ⇐ can help to avoid unnecessary work [10]
            • {limitation} does not reduce page density levels on pages that currently have a higher page density than the original fillfactor [10]
            • {advantage} as the fragmentation level and size of the indexes increase, DBCC DBREINDEX can rebuild the indexes much faster than DBCC INDEXDEFRAG [10]
            • via DBCC DBREINDEX ('<table_name>')
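
The free-space guidelines for DBCC DBREINDEX given above can be turned into a rough estimate. The sketch assumes sizes are expressed in bytes; the function names and the sample figures are hypothetical, and the result is only as good as the average row size fed into it.

```python
def clustered_rebuild_free_space(avg_row_size, row_count):
    """Guideline for clustered indexes: 1.2 * average row size * number of rows."""
    return 1.2 * avg_row_size * row_count


def nonclustered_rebuild_free_space(key_length, row_locator_length, row_count):
    """Guideline for nonclustered indexes: average row size * number of rows.

    The average row size is taken as the nonclustered key length plus the
    length of the clustering key or row ID.
    """
    return (key_length + row_locator_length) * row_count


rows = 1_000_000
print(clustered_rebuild_free_space(avg_row_size=200, row_count=rows))
print(nonclustered_rebuild_free_space(key_length=12, row_locator_length=8,
                                      row_count=rows))
```

When a nonunique clustered index is rebuilt, the estimates for the clustered index and for each nonclustered index would be added together, since the nonclustered indexes are rebuilt as well.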

          Acronyms:
          CI - clustered index
          CRUD - Create, Read, Update, Delete
          DBCC - Database Console Command
          DMV - Dynamic Management View
          IAM - Index Allocation Map
          NCI - nonclustered index
          UCI - unique clustered index

          Resources:
          [1] Scalability Experts (2005) "Microsoft® SQL Server 2005: Changing the Paradigm"
          [2] Ken Henderson (2003) "Guru's Guide to SQL Server Architecture and Internals"
          [3] Paul S Randal (2011) SQL Q&A: The Lore of Logs (link)
          [4] Kalen Delaney et al (2009) "Microsoft® SQL Server® 2008 Internals"
          [5] Kalen Delaney (2013) "Microsoft SQL Server 2012 Internals"
          [6] Kalen Delaney (2006) "Inside Microsoft® SQL Server™ 2005: The Storage Engine"
          [7] Jason Strate & Grant Fritchey (2015) "Expert Performance Indexing in SQL Server"
          [8] Microsoft Learn (2023) CREATE INDEX (link)
          [9] Kimberly L Tripp (2005) Partitioned Tables and Indexes in SQL Server 2005 (link)
          [10] Mike Ruthruff (2009) Microsoft SQL Server 2000 Index Defragmentation Best Practices [white paper]
          [11] Michelle Ufford (2011) Effective Clustered Indexes (link)
          [12] Technet (2016) SQL Server Index Design Guide (link)
          [13] Bob Dorr (2015) How It Works: MAX DOP Level and Parallel Index Builds (old link)
          [14] Stuart Ozer et al (2007) SQL Server Best Practices Article (link)  
          [15] Microsoft Learn (2023) Optimize index maintenance to improve query performance and reduce resource consumption (link)
          [16] MSDN (2015) Specify Fill Factor for an Index (link)
          [17] Technet (2015) Finding Missing Indexes (link)
          [18] MSDN Blogs (2010) Notes - SQL Server Index Fragmentation, Types and Solutions, by Pankaj Mittal (old link)
          [19] Wayne W Berry (2010) The DBCC DBREINDEX Command
          [20] Wayne W Berry (2010) The DBCC INDEXDEFRAG Command
          [21] Microsoft Learn (2023) SQL Server and Azure SQL index architecture and design guide (link)


          01 May 2019

          Database Management: SQL Server Feature Bloat

          Database Management
          Database Management Series

          In an old SSWUG editorial, “SQL Server Feature Bloat”, Ben Taylor raises the question of whether SQL Server features like the support for Unicode, the increase of the data page size to 8 KB, or the storage of additional metadata and statistics amount to feature bloat. He further asks whether customers may consider other database solutions, and whether this aspect is important to them.

          A software or feature bloat is the “process whereby successive versions of a computer program become perceptibly slower, use more memory, disk space or processing power, or have higher hardware requirements than the previous version - whilst making only dubious user-perceptible improvements or suffering from feature creep” (Wikipedia).

          Taylor’s question seems justified, especially considering the number of features added in the various releases of SQL Server. Regardless of whether they attempt to improve performance, extend existing functionality or provide new functionality, many of these features target special usage and are hardly used by average applications that use SQL Server only for data storage. Often, after upgrading to a new release, customers see no performance improvement in the features they use, or the performance even degrades, while the new release needs more resources to perform the same tasks. This can make customers wonder whether all these new features bring them any benefit.

          It’s easy to overlook the fact that SQL Server is used just as the storage layer in an architecture, and that some of the problems more likely reside in the business or presentation layers. In addition, a solution is not always designed to take advantage of a database’s (latest) features. Besides, the compatibility level may be set to a lower value, so the latest functionality won’t be used at all.

          Customers probably hope that the magic will happen immediately after the upgrade. Some features, like the ones regarding the engine’s optimization, are enabled by default and a performance gain is expected; however, to take advantage of other new features the existing applications need to be redesigned. With each new edition it’s important to look at the opportunities provided by the upgrade and analyze the performance benefit, as there’s often a trade-off between benefit and effort on one side, and between technical advantages and disadvantages on the other.

          The examples used by Taylor aren’t necessarily representative because they refer to changes made prior to SQL Server 2005, and there are good arguments for their implementation. The storage of additional metadata and statistics is negligible in comparison with the size of the databases and the benefits, given that the database engine needs statistics so it can operate optimally. SQL Server moved from 2 KB pages to 8 KB pages between versions 6.5 and 7.0, probably because the larger size offers good performance with efficient use of space. The use of the Unicode character set became a standard given that databases needed to support multiple languages.

          Feature bloat is a problem that concerns not only SQL Server but also other database products like Oracle, DB2 or MySQL, and other types of software. Customers’ choice of one vendor’s products over another’s is often a strategic decision in which the database is just a piece of a bigger architecture. In the TPC-H benchmarks of the last years SQL Server 2014 and 2016 scored higher than the competitors. It’s more likely that customers will move to SQL Server than vice versa, when possible. Customers expect performance, stability and security and are willing to pay for them, as long as the gain is visible.

          09 June 2018

          Data Migrations (DM): Guiding Principles

          Data Migration
          Data Migrations Series

          Introduction

          “An army of principles can penetrate where an army of soldiers cannot.”
          Thomas Paine

          In life as well as in IT, principles serve as patterns of advice in the form of general or fundamental ideas, truths or values stated in a context-independent manner. They can be used as guidelines in understanding and modeling reality, the world we live in. With the invasion of technologies in our lives, principles serve as solid ground on which we can build castles, i.e. solutions to our problems. Each technology comes with its own set of principles that defines its usage in general terms. That's why most IT books attempt to capture these sets of principles. Unfortunately, few technical writers manage to define meaningful principles and showcase their usage.

          Many of the ideas considered as principles in papers on Data Migration (DM) are at best just practices, and some can be considered as best/good practices. Just because something worked well in a previous migration doesn’t automatically mean that the idea behind the respective decision turns into a principle. Some of the advice offered is just lessons learned in disguise. Principles, through their generality, apply to a broad range of cases, while practices are more activity-specific.

          A DM by its nature finds its characteristics at the intersection of several areas - database-based architecture design, ETL workflows, data management, project management (PM) and services. From these areas one can pull a set of principles that can be used in building DM architectures.

          Architecture Principles

          “Architecture starts when you carefully put two bricks together.”
          Ludwig Mies van der Rohe

          There are several general principles that apply to the architecture of applications, independently of the technologies used or the industry, e.g. research first, keep it simple/small, start with the end in mind, model first, design to handle failure, secure by design (aka safety first), prototype, progress iteratively, focus on value, reuse (aka don't reinvent the wheel), test early, early feedback, refactor, govern, validate, document, right tool – right people, make it to last, make it sustainable, partition around limits, scale out, defensive coding, minimal intervention, use common sense, process orientation, follow the data, abstract, anticipate obsolescence, benchmark, single-responsibility, single dispatch, separation of concerns, right perspective.

          To them add a range of application design characteristics that can be considered as principles as well: extensibility, modularity, adaptability, reusability, repeatability, performance, revocability, auditability, subject-orientation, traceability, robustness, locality, heterogeneity, consistency, atomicity, increased cohesion, reduced coupling, monitoring, usability, etc. There are several principles that can be carried over from problem solving into design - divide and conquer, prioritize, systems approach, take inventory, and so on.

          A DM’s architecture has much in common with that of a data warehouse, as it relies heavily on ETL tasks and data need to be stored for various purposes. Besides the principles of good database design, a few other principles apply: model (the domain) first, denormalize, design for performance, maintainability and security, validate continuously. From the ETL area the following principles can be considered: single point of processing, each step must have a purpose, minimize touch points, rest data for checkpoints, leverage existing knowledge, automate the steps, batch processing.

          In addition, considering its data-specific character, a DM can be regarded as one or several data products, though in contrast with typical data products a DM usually has a limited purpose. From this area the following principles could be considered: build trust with transparency, blend in, visualize the complex.

          Data Management Principles

          Considering that a DM’s focus is an organization's data, some principles need to focus on the management and governance of data. Data Governance together with Data Quality, Data Architecture, Metadata Management and Master Data Management are functions of Data Management. The focus is on data, metadata and their lifecycle, on processes, ownership and roles and their responsibilities. With this in mind, several principles can be defined to facilitate the functions of Data Management: manage data as asset, manage data lifecycle, the business owns the data, integration across the organization, make data/metadata accessible, transparent and auditable processes, one source of truth.

          A DM typically involves customer, employee and vendor information, which falls under the General Data Protection Regulation (GDPR), EU 2016/679. The regulation defines the legal framework for data protection and privacy for all individuals within the European Union (EU) and the European Economic Area (EEA), as well as for the export of personal data outside the EU and EEA. It defines a set of principles that form its backbone: fairness, lawfulness and transparency, purpose limitation, data minimization, accuracy, storage limitation, integrity and confidentiality, accountability [6].

          Overseas, the US Federal Trade Commission (FTC) issued in 2012 a report recommending that organizations design and implement their own privacy programs based on a set of best practices. The report reaffirms the FTC’s focus on the Fair Information Practice Principles, which include notice/awareness, choice/consent, access/participation, integrity/security, enforcement/redress [6].


          Project Management (PM) Principles

          "Management is doing things right […]"
          Peter Drucker

          A DM through its characteristics is a project and, to increase the chances of success, it needs to be managed as a project. Managing a DM as a project is one of the most important principles to consider. The usage of a PM framework will further increase the chances of success, as long as the framework is adequate for the purpose and the team is able to use it. PMI, Prince2 and Agile/Scrum/Kanban are probably the most used PM methodologies, and they come with their own sets of principles.

          In general, all or some of the PM principles apply, regardless of whether a methodology is used alone or in combination with other PM methodologies: a single project manager, an informed and supportive management, a dedicated team of qualified people to do the work of the project, clearly defined goals addressing stakeholders’ priorities, an integrated plan and schedule, as well as a budget of costs and/or resources required [1].

          On the other hand, an agile approach could prove to be a better match for a DM, given that requirements change a lot, frequent and continuous deliveries are needed, collaboration is necessary, and agile processes as well as self-organizing teams can facilitate the migration. These are just a few of the catchwords that form the backbone of the Agile Manifesto (see [3]).

          An agile form of Prince2 could be something to consider as well, especially when Prince2 is already used as the methodology for other projects. For Prince2 the following principles are to be considered: continued business justification, learn from experience, defined roles and responsibilities, manage by stages, management by exception, focus on products, tailor to suit the project environment [2].

          All these PM principles reveal important aspects to ponder upon, and maybe with a few exceptions, all can be incorporated in the way the DM project is managed.


          Service Principles

          Considering the dependencies between the DM and Data Quality, as well as the broader project, a DM can have the characteristics of a service. It’s not an IT service per se, as IT supports the project only technically and, possibly, from a PM perspective. Even if a DM is not an ITSM service, some of the ITIL principles can still apply: focus on value, design for experience, start where you are, work holistically, progress iteratively, observe directly, be transparent, collaborate and keep it simple [4].


          Conclusion

          “Obey the principles without being bound by them.”
          Bruce Lee

          Within a DM all the above principles can be considered, though the network of implications they create can easily shift the focus from the solution to the philosophical aspects, and that’s a marshy road to follow. Even if all principles are noble, not all can be applied; it would be utopian to consider each possible principle. The trick is to identify the most “important” principles (those that make sense) and prioritize them according to existing requirements. In theory, this is a one-time process that involves establishing a “framework” of best/good practices for the DM, so that subsequent migrations need only consider the new facts and aspects.


          References:
          [1] J. A. Bing (1994) Principles of project management, PM Network
          [2] Axelos (2018) What is PRINCE2?
          [3] Agile Manifesto (2001) Principles behind the Agile Manifesto
          [4] Axelos (2018) ITIL® Practitioner 9 Guiding Principles
          [5] The Data Governance Institute (2018) Goals and Principles for Data Governance
          [6] Laura Sebastian-Coleman (2018) Navigating the Labyrinth: An Executive Guide to Data Management, DAMA International, Technics Publications

          24 February 2018

          SQL Reloaded: Misusing Views and Pseudo-Constants

             Views, as virtual tables, can be misused to replace tables in certain circumstances by storing values within one or multiple rows, as in the examples below:

          -- parameters for a BI solution
          CREATE VIEW dbo.vLoV_Parameters
          AS
          SELECT Cast('ABC' as nvarchar(20)) AS DataAreaId
           , Cast(GetDate() as Date) AS CurrentDate 
           , Cast(100 as int) AS BatchCount 
          
          GO
          
          SELECT *
          FROM dbo.vLoV_Parameters
          
          GO
          
          -- values for a dropdown 
           CREATE VIEW dbo.vLoV_DataAreas
           AS
           SELECT Cast('ABC' as nvarchar(20)) AS DataAreaId
           , Cast('Company ABC' as nvarchar(50)) AS Description 
           UNION ALL
           SELECT 'XYZ' DataAreaId 
           , 'Company XYZ'
          
          GO
          
          SELECT *
          FROM dbo.vLoV_DataAreas
          
          GO
          

              These solutions aren’t elegant and are typically not recommended because they go against one of the principles of good database design, namely “data belong in tables”, though they do the trick when needed. Personally, I have used them only in a handful of cases, e.g. when creating tables wasn’t allowed, when something needed to be tested for a short period of time, or when creating a table for 2-3 values meant too much overhead. Because of their scarce use, I hadn’t given them much thought until I discovered Jared Ko’s blog post on pseudo-constants. He considers the values from the first view as pseudo-constants, and advocates their use especially for easier dependency tracking, easier code refactoring, avoiding implicit data conversion and easier maintenance of values.


             All these are good reasons to consider them, so I tried to take the idea further to see if it survives a reality check. For this I took Dynamics AX as the testing environment, as it makes extensive use of enumerations (aka base enums) to store lists of values needed all over the application. Behind each table there are one or more enumerations, and the tables storing master data abound in them. As an example, let’s consider InventTrans, the table that stores the inventory transactions; the logic for the receipt and issue transactions is governed by three enumerations: StatusIssue, StatusReceipt and Direction.

          -- Status Issue Enumeration 
           CREATE VIEW dbo.vLoV_StatusIssue
           AS
           SELECT cast(0 as int) AS None
           , cast(1 as int) AS Sold
           , cast(2 as int) AS Deducted
           , cast(3 as int) AS Picked
           , cast(4 as int) AS ReservPhysical
           , cast(5 as int) AS ReservOrdered
           , cast(6 as int) AS OnOrder
           , cast(7 as int) AS QuotationIssue
          
          GO
          
          -- Status Receipt Enumeration 
           CREATE VIEW dbo.vLoV_StatusReceipt
           AS
          SELECT cast(0 as int) AS None
           , cast(1 as int) AS Purchased
           , cast(2 as int) AS Received
           , cast(3 as int) AS Registered
           , cast(4 as int) AS Arrived
           , cast(5 as int) AS Ordered
           , cast(6 as int) AS QuotationReceipt
          
          GO
          
          -- Inventory Direction Enumeration 
           CREATE VIEW dbo.vLoV_InventDirection
           AS
           SELECT cast(0 as int) AS None
           , cast(1 as int) AS Receipt
           , cast(2 as int) AS Issue
          

             To see these views at work let’s construct the InventTrans table on the fly:

          -- creating an ad-hoc table  
           SELECT *
           INTO  dbo.InventTrans
           FROM (VALUES (1, 1, 0, 2, -1, 'A0001')
           , (2, 1, 0, 2, -10, 'A0002')
           , (3, 2, 0, 2, -6, 'A0001')
           , (4, 2, 0, 2, -3, 'A0002')
           , (5, 3, 0, 2, -2, 'A0001')
           , (6, 1, 0, 1, 1, 'A0001')
           , (7, 0, 1, 1, 50, 'A0001')
           , (8, 0, 2, 1, 100, 'A0002')
           , (9, 0, 3, 1, 30, 'A0003')
           , (10, 0, 3, 1, 20, 'A0004')
           , (11, 0, 1, 2, 10, 'A0001')
           ) A(TransId, StatusIssue, StatusReceipt, Direction, Qty, ItemId)
          


              Here are two sets of examples using literals vs. pseudo-constants:

          --example issued with literals 
          SELECT top 100 ITR.*
           FROM dbo.InventTrans ITR
           WHERE ITR.StatusIssue = 1 
             AND ITR.Direction = 2
          
          GO
           --example issued with pseudo-constants
           SELECT top 100 ITR.*
           FROM dbo.InventTrans ITR
                JOIN dbo.vLoV_StatusIssue SI
                  ON ITR.StatusIssue = SI.Sold
                JOIN dbo.vLoV_InventDirection ID
                  ON ITR.Direction = ID.Issue
          
          GO
          
          --example receipt with literals 
           SELECT top 100 ITR.*
           FROM dbo.InventTrans ITR
           WHERE ITR.StatusReceipt= 1
             AND ITR.Direction = 1
          
          GO
          
          --example receipt with pseudo-constants
           SELECT top 100 ITR.*
           FROM dbo.InventTrans ITR
                JOIN dbo.vLoV_StatusReceipt SR
                  ON ITR.StatusReceipt= SR.Purchased
                JOIN dbo.vLoV_InventDirection ID
                  ON ITR.Direction = ID.Receipt
          

           
            As can be seen, the queries using pseudo-constants make the code somewhat more readable, though the gain is only relative, as each enumeration implies an additional join. In addition, when further business tables are added to the logic (e.g. items, purchases or sales orders), the extra joins complicate it, making it more difficult to separate the essential from the nonessential. Imagine a translation of the following query:

          -- complex query 
            SELECT top 100 ITR.*
            FROM dbo.InventTrans ITR
                        <several tables here>
            WHERE ((ITR.StatusReceipt<=3 AND ITR.Direction = 1)
              OR (ITR.StatusIssue<=3 AND ITR.Direction = 2))
              AND (<more constraints here>)
          


             The more complex the constraints in the WHERE clause, the less likely it is that the literals can be translated into pseudo-constants. Considering that an average query contains 5-10 tables, each of them with 1-3 enumerations, the queries would become impractical when using pseudo-constants, and their execution plans quite difficult to troubleshoot.
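
             For illustration only, the two status predicates of the query above could be translated roughly as follows. This is a sketch that assumes the views and the ad-hoc InventTrans table defined earlier; it uses CROSS JOIN because each view returns exactly one row, so the joins add columns without multiplying rows. The elided tables and constraints are left out.

          ```sql
          -- sketch: status predicates of the complex query rewritten with pseudo-constants
          SELECT top 100 ITR.*
          FROM dbo.InventTrans ITR
               CROSS JOIN dbo.vLoV_StatusReceipt SR
               CROSS JOIN dbo.vLoV_StatusIssue SI
               CROSS JOIN dbo.vLoV_InventDirection ID
          WHERE (ITR.StatusReceipt <= SR.Registered AND ITR.Direction = ID.Receipt) -- Registered = 3
             OR (ITR.StatusIssue <= SI.Picked AND ITR.Direction = ID.Issue) -- Picked = 3
          ```

             Even this small fragment needs three additional joins, and the reader still has to look up the views to learn that Registered and Picked both stand for 3.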

              The more I think about it, an enumeration data type available as a global variable in SQL Server (like the ones available in VB) would be more than welcome, especially because the same values are used over and over again throughout the queries. Imagine, for example, the possibility of writing code as follows:

          -- hypothetical query
          SELECT top 100 ITR.*
          FROM dbo.InventTrans ITR
          WHERE ITR.StatusReceipt = @@StatusReceipt.Purchased
            AND ITR.Direction = @@InventDirection.Receipt
          

             From my point of view this would make the code more readable and easier to maintain. Instead, in order to make the code more readable, one’s usually forced to add some comments in the code. This works as well, though the code can become full of comments.

          -- query with commented literals
          SELECT top 100 ITR.*
          FROM dbo.InventTrans ITR
          WHERE ITR.StatusReceipt <= 3 -- Purchased, Received, Registered
             AND ITR.Direction = 1 -- Receipt
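
             Within a single batch or stored procedure, a possible middle ground is to read the pseudo-constants once into local variables and use those instead of commented literals. The sketch below assumes the views defined earlier; the variable names are illustrative, and because local variables are scoped to the batch they must be declared again after each GO.

          ```sql
          -- sketch: load the pseudo-constants once into local variables
          DECLARE @Registered int = (SELECT Registered FROM dbo.vLoV_StatusReceipt)
          DECLARE @Receipt int = (SELECT Receipt FROM dbo.vLoV_InventDirection)

          SELECT top 100 ITR.*
          FROM dbo.InventTrans ITR
          WHERE ITR.StatusReceipt <= @Registered
            AND ITR.Direction = @Receipt
          ```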
          

             In conclusion, the usefulness of pseudo-constants is limited, and their usage goes against developers’ common sense; however, a data type in SQL Server with similar functionality would make code more readable and easier to maintain.


          PS: It is possible to simulate an enumeration data type in tables’ definition by using a CHECK constraint.
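
             A minimal sketch of this idea, assuming a hypothetical table dbo.InventTransDemo whose columns mirror the ad-hoc InventTrans table above; the table and constraint names are illustrative, and the allowed ranges are taken from the three enumeration views:

          ```sql
          -- hypothetical table: CHECK constraints restrict each column to its enumeration's values
          CREATE TABLE dbo.InventTransDemo (
             TransId int NOT NULL PRIMARY KEY
           , StatusIssue int NOT NULL
               CONSTRAINT CK_InventTransDemo_StatusIssue CHECK (StatusIssue BETWEEN 0 AND 7)
           , StatusReceipt int NOT NULL
               CONSTRAINT CK_InventTransDemo_StatusReceipt CHECK (StatusReceipt BETWEEN 0 AND 6)
           , Direction int NOT NULL
               CONSTRAINT CK_InventTransDemo_Direction CHECK (Direction IN (0, 1, 2))
          )

          GO

          -- accepted: all values belong to their enumerations
          INSERT INTO dbo.InventTransDemo (TransId, StatusIssue, StatusReceipt, Direction)
          VALUES (1, 1, 0, 2)

          GO

          -- rejected with a CHECK constraint violation: 9 is not a valid Direction
          INSERT INTO dbo.InventTransDemo (TransId, StatusIssue, StatusReceipt, Direction)
          VALUES (2, 1, 0, 9)
          ```

             The constraints guard the stored values, but they don’t give the values names, so the queries still have to use literals.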


          About Me

          IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.