Showing posts with label table partitions. Show all posts
Showing posts with label table partitions. Show all posts

28 February 2017

🧊Data Warehousing: Data Load Optimization (Part I: A Success Story)

Data Warehousing
Data Warehousing Series

Introduction

This topic has been waiting in the queue for almost two years already - since I finished optimizing an already existing relational data warehouse within a SQL Server 2012 Enterprise Edition environment. Through various simple techniques I managed then to reduce the running time for the load process by more than 65%, from 9 to 3 hours. It’s a considerable performance gain, considering that I didn’t have to refactor any business logic implemented in queries.

The ETL (Extract, Transform, Load) solution was making use of SSIS (SQL Server Integration Services) packages to load data sequentially from several sources into staging tables, and from stating further into base tables. Each package was responsible for deleting the data from the staging tables via TRUNCATE, extracting the data 1:1 from the source into the staging tables, then loading the data 1:1 from the staging table to base tables. It’s the simplest and a relatively effective ETL design I also used with small alterations for data warehouse solutions. For months the data load worked smoothly, until data growth and eventually other problems increased the loading time from 5 to 9 hours.

Using TABLOCK Hint

Using SSIS to bulk load data into SQL Server provides an optimum of performance and flexibility. Within a Data Flow, when “Table Lock” property on the destination is checked, it implies that the insert records are minimally logged, speeding up the load by a factor of two. The TABLOCK hint can be used also for other insert operations performed outside of SSIS packages. At least in this case the movement of data from staging into base tables was performed in plain T-SQL, outside of SSIS packages. Also further data processing had benefitted from this change. Only this optimization step alone provided 30-40% performance gain.

Drop/Recreating the Indexes on Big Tables

As the base tables were having several indexes each, it proved beneficial to drop the indexes for the big tables (e.g. with more than 1000000 records) before loading the data into the base tables, and recreate the indexes afterwards. This was done within SSIS, and provided an additional 20-30% performance gain from the previous step.

Consolidating the Indexes

Adding missing indexes, removing or consolidating (overlapping) indexes are typical index maintenance tasks, apparently occasionally ignored. It doesn’t always bring much performance as compared with the previous methods, though dropping and consolidating some indexes proved to be beneficial as fewer data were maintained. Data processing logic benefited from the creation of new indexes as well.

Running Packages in Parallel

As the packages were run sequentially (one package at a time), the data load was hardly taking advantage of the processing power available on the server. Even if queries could use parallelism, the benefit was minimal. Enabling packages run in parallel added additional performance gain, however this minimized the availability of processing resources for other tasks. When the data load is performed overnight, this causes minimal overhead, however it should be avoided when the data are loading to business hours.

Using Nonclustered Indexes

In my analysis I found out that many tables, especially the ones storing prepared data, were lacking a clustered index, even if further indexes were built on them. I remember that years back there was a (false) myth that fact and/or dimension tables don’t need clustered indexes in SQL Server. Of course clustered indexes have downsides (e.g. fragmentation, excessive key-lookups) though their benefits exceed by far the downsides. Besides missing clustered index, there were cases in which the tables would have benefited from having a narrow clustered index, instead of a multicolumn wide clustered index. Upon case also such cases were addressed.

Removing the Staging Tables

Given the fact that the source and target systems are in the same virtual environment, and the data are loaded 1:1 between the various layers, without further transformations and conversions, one could load the data directly into the base tables. After some tests I came to the conclusion that the load from source tables into the staging table, and the load from staging table into base table (with TABLOCK hint) were taking almost the same amount of time. This means that the base tables will be for the same amount of the time unavailable, if the data were loaded from the sources directly into the base tables. Therefore one could in theory remove the staging tables from the architecture. Frankly, one should think twice when doing such a change, as there can be further implications in time. Even if today the data are imported 1:1, in the future this could change.

Reducing the Data Volume

Reducing the data volume was identified as a possible further technique to reduce the amount of time needed for data loading. A data warehouse is built based on a set of requirements and presumptions that change over time. It can happen for example that even if the reports need only 1-2 years’ worth of data, the data load considers a much bigger timeframe. Some systems can have up to 5-10 years’ worth of data. Loading all data without a specific requirement leads to waste of resources and bigger load times. Limiting the transactional data to a given timeframe can make a considerable difference. Additionally, there are historical data that have the potential to be archived.

There are also tables for which a weekly or monthly refresh would suffice. Some tables or even data sources can become obsolete, however they continue to be loaded in the data warehouse. Such cases occur seldom, though they occur. Also some unused or redundant column could have been removed from the packages.

Further Thoughts

There are further techniques to optimize the data load within a data warehouse like partitioning large tables, using columnstore indexes or optimizing the storage, however my target was to provide maximum sufficient performance gain with minimum of effort and design changes. Therefore I stopped when I considered that the amount of effort is considerable higher than the performance gain.

Further Reading:
[1] TechNet (2009) The Data Loading Performance Guide, by Thomas Kejser, Peter Carlin & Stuart Ozer (link)
[2] MSDN (2010) Best Practices for Data Warehousing with SQL Server 2008 R2, by Mark Whitehorn, Keith Burns & Eric N Hanson (link)
[3] MSDN (2012) Whitepaper: Fast Track Data Warehouse Reference Guide for SQL Server 2012, by Eric Kraemer, Mike Bassett, Eric Lemoine & Dave Withers (link)
[4] MSDN (2008) Best Practices for Data Warehousing with SQL Server 2008, by Mark Whitehorn & Keith Burns (link)
[5] TechNet (2005) Strategies for Partitioning Relational Data Warehouses in Microsoft SQL Server, by Gandhi Swaminathan (link)
[6] SQL Server Customer Advisory Team (2013) Top 10 Best Practices for Building a Large Scale Relational Data Warehouse (link)

08 February 2009

🛢DBMS: Partitioning (Definitions)

"To divide a table into logical subsets based on characteristics of the data. Partitioning is used to improve application performance or reduce the potential for conflicts in multisite update replication." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Physically separating data into areas that are more easily maintained or accessed. Data partitioning increases performance and aids in maintenance processes." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"The process of splitting a database object (usually a tablespace, table, or index) into two or more physical locations, or partitions, that is, a splitting of a logical group of pages (for example, the pages of a table) into chains or files which are physically removed from each other, perhaps on separate disks. Informix calls this fragmentation." (Peter Gulutzan & Trudy Pelzer, "SQL Performance Tuning", 2002)

"The process of replacing a table with multiple smaller tables. Each smaller table has the same format as the original table, but with a subset of the data. Each partitioned table has rows allocated to it based on some characteristic of the data, such as specific key ranges. The rules that define into which table the rows go must be unambiguous. For example, a table is partitioned into two tables. All rows with primary key values lower than a specified value are allocated to one table, and all keys equal to or greater than the value are allocated to the other. Partitioning can improve application processing speeds and reduce the potential for conflicts in multisite update replication. You can improve the usability of partitioned tables by creating a view. The view, created by a union of select operations on all the partitioned tables, presents the data as if it all resided in a single table." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"Physical splitting of tables into separate sections (partitions), including parallel processing on multiple partitions and individual operations on individual partitions. One particularly efficient aspect is the capability when querying a table to read fewer than all the partitions making up a table, perhaps even a single partition. This is also known as partition pruning." (Gavin Powell, "Beginning Database Design", 2006)

[data partitioning:] "Process of moving data from a single server to one or more different data repositories. This can be vertical, in which data from a single table is split into multiple tables, or horizontal, in which the number of rows in a table are restricted and partitioned by one or more columns." (Sara Morganand & Tobias Thernstrom , "MCITP Self-Paced Training Kit : Designing and Optimizing Data Access by Using Microsoft SQL Server 2005 - Exam 70-442", 2007)

"The process of replacing a table with multiple smaller units. Tables can be partitioned horizontally or vertically." (Victor Isakov et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance (70-444) Study Guide", 2007)

"The method for dividing a database into manageable parts for the purpose of easier management and better performance." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"The process of splitting a table into subsets of rows or columns." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"The process of replacing a table with multiple smaller tables." (Microsoft, "SQL Server 2012 Glossary,", 2012)

"With respect to distributed databases, partitioning refers to splitting documents, tables, or graphs and distributing them to different servers." (Dan Sullivan, "NoSQL for Mere Mortals®", 2015)

[index-controlled partitioning:] "A type of partitioning in which partition boundaries for a partitioned table are controlled by values that are specified on the CREATE INDEX statement." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"Partitioning is the spreading of data across multiple files across a cluster to balance large amounts of data across disks or nodes. Read-only partitions make a read-only table space that prevents updates on all tables in the table space. Other patterns can be applied on this table space to improve performance." (Piethein Strengholt, "Data Management at Scale", 2020)

[composite partitioning:] "In partitioning strategy in which a table is partitioned by one data distribution method and then each partition is further divided into subpartitions using a second data distribution method." (Oracle, "Oracle Database Concepts")

"The ability to decompose very large tables and indexes into smaller and more manageable pieces called partitions." (Oracle, "Oracle Database Concepts")
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.