SQL Troubles: data loading


13 February 2025

🧊💠🗒️Data Warehousing: Table Partitioning in SQL Server [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 13-Feb-2025

[Data Warehousing] Table Partitioning

{def} the spreading of a table's data across multiple partitions based on a set of rules to balance large amounts of data across disks or nodes

data is distributed based on a function that defines a range of values for each partition [2]

the table is partitioned by applying the partition scheme to the values in a specified column [2]
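
A minimal sketch of how the function, the scheme and the table fit together. All object names and boundary values are hypothetical, and every partition is mapped to the PRIMARY filegroup only to keep the example short:

```sql
-- Hypothetical names; the boundary values are yyyymmdd integer date keys.
-- RANGE RIGHT: each boundary value is the lowest value of the partition to its right,
-- so three boundaries define four partitions.
CREATE PARTITION FUNCTION pf_OrderDate (int)
AS RANGE RIGHT FOR VALUES (20240101, 20240201, 20240301);

-- The scheme maps the partitions to filegroups (all of them to PRIMARY here).
CREATE PARTITION SCHEME ps_OrderDate
AS PARTITION pf_OrderDate ALL TO ([PRIMARY]);

-- The table is partitioned by applying the scheme to the OrderDateKey column.
CREATE TABLE dbo.FactOrders (
    OrderDateKey int NOT NULL,
    CustomerKey  int NOT NULL,
    Amount       decimal(18, 2) NOT NULL
) ON ps_OrderDate (OrderDateKey);
```

In a real warehouse the scheme would typically list one filegroup per partition instead of ALL TO ([PRIMARY]).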

{operation} partition creation

[large partitioned table]

two auxiliary, nonindexed, empty tables should be created with the same structure as the fact table, including constraints and data compression options [4]

first table (with check constraint): create a check constraint that guarantees that all data in the table fits exactly into one empty partition of the fact table

the constraint must be created on the partitioning column [4]
a columnstore index can be created on the fact table, as long as it is aligned with the table [4]

second table (without check constraint): for minimally logged deletions of large portions of data, a partition of the fact table can be switched to this empty table [4]

then the table can be truncated
after truncation, the table is ready to accept the next partition from the fact table for the next minimally logged deletion [4]

for minimally logged inserts, new data should be bulk inserted into the auxiliary table that has the check constraint [4]

the INSERT operation can be minimally logged because the table is empty [4]
create a columnstore index on this auxiliary table, using the same structure as the columnstore index on the fact table [4]
switch the data from this auxiliary table to a partition of the fact table [4]
drop the columnstore index on the auxiliary table, and change the check constraint to guarantee that all of the data for the next load can be switched to the next empty partition of the fact table [4]
the auxiliary table with the check constraint is then ready for new bulk loads again [4]
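
The insert path described above could look roughly as follows. This is a sketch under stated assumptions, not the procedure from [4] verbatim: the names are hypothetical, the columnstore index steps are left out for brevity, and minimal logging additionally depends on the recovery model (simple or bulk-logged):

```sql
-- Assumptions: dbo.FactOrders is partitioned on OrderDateKey (int, yyyymmdd) by a
-- RANGE RIGHT function with boundaries 20240101, 20240201, 20240301, so partition 3
-- holds February 2024; that partition is empty and stored on the PRIMARY filegroup.

-- Auxiliary table with the same structure as the fact table.
CREATE TABLE dbo.FactOrders_Load (
    OrderDateKey int NOT NULL,
    CustomerKey  int NOT NULL,
    Amount       decimal(18, 2) NOT NULL,
    -- Check constraint on the partitioning column: all rows fit exactly one partition.
    CONSTRAINT CK_FactOrders_Load_Range
        CHECK (OrderDateKey >= 20240201 AND OrderDateKey < 20240301)
) ON [PRIMARY];

-- Bulk load into the empty heap; the TABLOCK hint is one of the prerequisites
-- for a minimally logged INSERT ... SELECT.
INSERT INTO dbo.FactOrders_Load WITH (TABLOCK) (OrderDateKey, CustomerKey, Amount)
SELECT OrderDateKey, CustomerKey, Amount
FROM dbo.StagedOrders -- hypothetical source table
WHERE OrderDateKey >= 20240201
  AND OrderDateKey <  20240301;

-- Switch the loaded data into the empty February partition (metadata operation).
ALTER TABLE dbo.FactOrders_Load
SWITCH TO dbo.FactOrders PARTITION 3;
```

For the next load, the check constraint would be dropped and recreated with the range of the next empty partition.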

{operation} [Query Optimizer] partition elimination

process in which SQL Server accesses only those partitions needed to satisfy query filters [4]
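
As an illustration, assuming a hypothetical table dbo.FactOrders partitioned on OrderDateKey by a hypothetical partition function pf_OrderDate:

```sql
-- The filter is on the partitioning column, so only the partitions that can
-- contain February 2024 keys need to be accessed.
SELECT SUM(Amount) AS TotalAmount
FROM dbo.FactOrders
WHERE OrderDateKey >= 20240201
  AND OrderDateKey <  20240301;

-- $PARTITION returns the partition number a value maps to; this query shows
-- how the rows are distributed across the partitions.
SELECT $PARTITION.pf_OrderDate(OrderDateKey) AS PartitionNumber
     , COUNT(*) AS NumberOfRows
FROM dbo.FactOrders
GROUP BY $PARTITION.pf_OrderDate(OrderDateKey)
ORDER BY PartitionNumber;
```

Whether partitions were actually eliminated can be verified in the actual execution plan of the first query.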

{operation} partition switching

{definition} process that switches a block of data from one table or partition to another table or partition [4]
types of switches

reassign all data from a nonpartitioned table to an empty existing partition of a partitioned table [4]
switch a partition of one partitioned table to a partition of another partitioned table [4]
reassign all data from a partition of a partitioned table to an existing empty nonpartitioned table [4]
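
A sketch of the third type of switch, used for a minimally logged deletion. The names are hypothetical, and dbo.FactOrders is assumed to be partitioned on OrderDateKey with partition 2 stored on the PRIMARY filegroup:

```sql
-- Empty nonpartitioned table with the same structure as the fact table,
-- created on the same filegroup as the partition that is switched out.
CREATE TABLE dbo.FactOrders_Archive (
    OrderDateKey int NOT NULL,
    CustomerKey  int NOT NULL,
    Amount       decimal(18, 2) NOT NULL
) ON [PRIMARY];

-- Reassign all rows of partition 2 to the empty table.
ALTER TABLE dbo.FactOrders
SWITCH PARTITION 2 TO dbo.FactOrders_Archive;

-- Remove the switched-out rows; the table is empty again for the next switch.
TRUNCATE TABLE dbo.FactOrders_Archive;
```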

{benefit} improves query performance [1]

by partitioning a table across filegroups [1]

specific ranges of data can be placed on different disk spindles [1]

can improve I/O performance [1]

⇐ the disk storage is already configured as a RAID 10 or RAID 5 array [1]

⇒ this usually has little benefit [1]

using a mix of fast solid state storage for recent, frequently accessed data, and mechanical disks for older, less queried rows [1]

use partitioning to balance disk performance against storage costs [1]

biggest performance gain from partitioning in a data warehouse is realized when queries return a range of rows that are filtered on the partitioning key [1]

the query optimizer can eliminate partitions that are not within the filter range [1]

dramatically reduces the number of rows that need to be read [1]

reduces contention [3]

can reduce the number of rows included in a table scan [3]

{benefit} more granular manageability [1]

some maintenance operations can be performed at partition level instead of on the whole table [1]

e.g. indexes can be created and rebuilt on a per-partition basis [1]
e.g. compression can be applied to individual partitions [1]
e.g. by mapping partitions to filegroups, partitions can be backed up and restored independently [1]

enables backing up older data once and then configuring the backed-up partitions as read-only [1]
future backups can be limited to the partitions that contain new or updated data [1]
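
The partition-level index rebuild and compression mentioned above can be sketched as follows; the table, the index and the partition number are hypothetical:

```sql
-- Assumption: dbo.FactOrders is a partitioned table with an aligned index
-- named IX_FactOrders_CustomerKey.

-- Rebuild the index for a single partition only.
ALTER INDEX IX_FactOrders_CustomerKey ON dbo.FactOrders
REBUILD PARTITION = 2;

-- Apply page compression to a single partition of the table.
ALTER TABLE dbo.FactOrders
REBUILD PARTITION = 2 WITH (DATA_COMPRESSION = PAGE);
```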

{benefit} improved data load performance

enables loading many rows very quickly by switching a staging table with a partition

can dramatically reduce the time taken by ETL data loads [1]

with the right planning, it can be achieved with minimal requirements to drop or rebuild indexes [1]

{best practice} partition large fact tables

tables of around 50 GB or more
⇐ in general, fact tables benefit from partitioning more than dimension tables [1]

{best practice} partition on an incrementing date key [1]

assures that the most recent data are in the last partition and the earliest data are in the first partition [1]

{best practice} design the partition scheme for ETL and manageability [1]

the query performance gains realized by partitioning are small compared to the manageability and data load performance benefits [1]

ideally partitions should reflect the ETL load frequency

because this simplifies the load process [1]
merge partitions periodically to reduce the overall number of partitions (for example, at the start of each year) [1]

could merge the monthly partitions for the previous year into a single partition for the whole year [1]
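
Merging is done by removing boundary values from the partition function. A sketch, assuming a hypothetical RANGE RIGHT function pf_OrderDate with monthly boundaries for 2023:

```sql
-- Each MERGE RANGE removes one boundary and combines the two partitions
-- that were adjacent to it.
ALTER PARTITION FUNCTION pf_OrderDate() MERGE RANGE (20230201);
ALTER PARTITION FUNCTION pf_OrderDate() MERGE RANGE (20230301);
-- ... repeat for the remaining monthly boundaries up to 20231201,
-- leaving a single partition that starts at 20230101.
```

Merging partitions that contain data moves rows between them, so it is usually scheduled for a quiet period.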

{best practice} maintain an empty partition at the start and end of the table [1]

simplifies the loading of new rows [1]
when new records need to be added, split the empty partition

⇐ to create two empty partitions

then switch the staged data with the first empty partition [1]

⇐ loads the data into the table and leaves the second empty partition you created at the end of the table, ready for the next load [1]

a similar technique can be used to archive or delete obsolete data at the beginning of the table [1]
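
The split-and-switch load can be sketched as below. The names are hypothetical; the function is assumed to be RANGE RIGHT with boundaries 20240101, 20240201, 20240301 and an empty last partition, and the staging table is assumed to hold only March 2024 rows, enforced by a check constraint on OrderDateKey:

```sql
-- Tell the scheme which filegroup the partition created by the split will use.
ALTER PARTITION SCHEME ps_OrderDate NEXT USED [PRIMARY];

-- Split the empty last partition into two empty partitions:
-- partition 4 = [20240301, 20240401) and partition 5 = 20240401 and later.
ALTER PARTITION FUNCTION pf_OrderDate() SPLIT RANGE (20240401);

-- Switch the staged data into the first of the two empty partitions;
-- partition 5 stays empty at the end of the table for the next load.
ALTER TABLE dbo.FactOrders_Load
SWITCH TO dbo.FactOrders PARTITION 4;
```

Splitting an empty partition avoids data movement, which is the reason for keeping the partitions at both ends of the table empty.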

{best practice} choose the proper granularity

it should be aligned to the business requirements [2]

{best practice} create at least one filegroup in addition to the primary one

set it as the default filegroup

data tables are thus separated from system tables [2]

create dedicated filegroups for extremely large fact tables [2]

place the fact tables on their own logical disks [2]

for staging tables, use a file and a filegroup separate from the fact and dimension tables [2]

{exception} staging tables that will be switched with partitions to perform fast loads [2]

staging tables must be created on the same filegroup as the partition with which they will be switched [2]
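
A sketch of adding a filegroup and making it the default; the database name, filegroup name, file path and sizes are hypothetical:

```sql
-- Add a filegroup besides PRIMARY.
ALTER DATABASE SalesDW ADD FILEGROUP FG_Data;

-- A filegroup needs at least one file before it can become the default.
ALTER DATABASE SalesDW
ADD FILE (
    NAME = N'SalesDW_Data1',
    FILENAME = N'D:\Data\SalesDW_Data1.ndf',
    SIZE = 1GB,
    FILEGROWTH = 256MB
) TO FILEGROUP FG_Data;

-- New tables created without an ON clause now land in FG_Data,
-- separated from the system tables in PRIMARY.
ALTER DATABASE SalesDW MODIFY FILEGROUP FG_Data DEFAULT;
```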

{def} partition scheme

a scheme that maps partitions to filegroups

{def} partition function

object that maps rows to partitions by using values from specific columns (aka partitioning columns)

performs logical mapping

{def} aligned index

index built on the same partition scheme as its base table [4]

if all indexes are aligned with their base table, switching a partition is a metadata operation only [4]

⇒ it’s very fast [4]
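
A sketch of creating an aligned index and checking where the indexes of a table are stored. The table dbo.FactOrders and the scheme ps_OrderDate are hypothetical:

```sql
-- Creating the index on the table's partition scheme and partitioning column
-- makes it aligned with the table.
CREATE NONCLUSTERED INDEX IX_FactOrders_CustomerKey
ON dbo.FactOrders (CustomerKey)
ON ps_OrderDate (OrderDateKey);

-- List the data space of the heap/clustered index and of every other index;
-- aligned indexes show the same partition scheme as the table itself.
SELECT i.name AS IndexName
     , i.type_desc AS IndexType
     , ds.name AS DataSpaceName
     , ds.type_desc AS DataSpaceType
FROM sys.indexes i
     JOIN sys.data_spaces ds
       ON i.data_space_id = ds.data_space_id
WHERE i.object_id = OBJECT_ID(N'dbo.FactOrders');
```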


References:
[1] 20467A - Designing Business Intelligence Solutions with Microsoft SQL Server 2012

[2] 20463C - Implementing a Data Warehouse with Microsoft SQL Server

[3] 10777A - Implementing a Data Warehouse with Microsoft SQL Server 2012

[4] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)

[5] Microsoft Learn (2009) How to Implement an Automatic Sliding Window in a Partitioned Table on SQL Server 2005

16 May 2020

💎🏭SQL Reloaded: Query Patterns in SQL Server (Part I: Inserts)

Regardless of whether one is a database developer, an administrator, or any other data professional, it is important to be able to manipulate data on the fly. With the availability of DTS (Data Transformation Services), replaced in SQL Server 2005 by SSIS (SQL Server Integration Services), packages became one of the simplest and fastest ways of importing data into a database. They still are; however, when dealing with small volumes of data, it is easier to import the data with a few SQL scripts, as in the examples below.
The old-fashioned way of loading data is to first define the target table based on the source's definition:

-- Course data 
--DROP TABLE IF EXISTS dbo.T_Courses 

CREATE TABLE [dbo].[T_Courses](
     [CourseId] [int] NOT NULL,
     [CourseName] [varchar](50) NOT NULL
 ) ON [PRIMARY]

-- old-fashioned individual inserts (SQL Server 2000)
INSERT INTO dbo.T_Courses
VALUES (1, 'Course 1')

INSERT INTO dbo.T_Courses(CourseId, CourseName)
VALUES (2, 'Course 2')

-- inserting multiple rows into a table at once (SQL Server 2008 and later)
INSERT INTO dbo.T_Courses
--OUTPUT inserted.CourseId, inserted.CourseName -- uncomment to return the inserted rows
VALUES (3, 'Course 3')
 , (4, 'Course 4')
 , (5, 'Course 5')
 , (6, 'Course 6')
 , (7, 'Course 7')

 SELECT *
 FROM dbo.T_Courses
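
The OUTPUT clause that is commented out in the last insert returns the rows a statement has just written. A minimal sketch, assuming the dbo.T_Courses table from above exists (the two courses are made up for illustration):

```sql
-- insert two more rows and return them to the caller in the same statement
INSERT INTO dbo.T_Courses (CourseId, CourseName)
OUTPUT inserted.CourseId, inserted.CourseName
VALUES (8, 'Course 8')
 , (9, 'Course 9')
```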

Sometimes the definition of the table is not important. Starting with SQL Server 2008, the table value constructor (VALUES) can be used as a derived table, so the target table can be created on the fly via SELECT ... INTO, and the inserts can be simplified as follows:

-- Course data 
--DROP TABLE IF EXISTS dbo.T_Courses 

SELECT *
INTO dbo.T_Courses
FROM (
  VALUES (1, 'Course 1')
 , (2, 'Course 2')
 , (3, 'Course 3')
 , (4, 'Course 4')
 , (5, 'Course 5')
 , (6, 'Course 6')
 , (7, 'Course 7')
 ) C (CourseId, CourseName)


 SELECT *
 FROM dbo.T_Courses

To demonstrate further techniques, it is useful to define, besides Courses, a Students table and an Allocations table, the latter storing the relation between a Course and a Student. The Students table also holds a reference to Courses, showing the course in which the student is currently enrolled. The model is for demonstration purposes only:

-- Student data with current courses
-- DROP TABLE IF EXISTS dbo.T_Students

SELECT StudentId
 , StudentName
 , CourseId
 , Cast(StartDate as Date) StartDate
 , Cast(EndDate as Date) EndDate
 INTO dbo.T_Students
 FROM (
  VALUES (1, 'Student 1', 1, '20170201', '20170228')
 , (2, 'Student 2', 1, '20170201', '20170228')
 , (3, 'Student 3', 2, '20170201', '20170228')
 , (4, 'Student 4', 3, '20170201', '20170228')
 , (5, 'Student 5', NULL, NULL, NULL)
 , (6, 'Student 6', NULL, NULL, NULL)
 , (7, 'Student 7', NULL, NULL, NULL)
 ) S (StudentId, StudentName, CourseId, StartDate, EndDate)


 SELECT *
 FROM dbo.T_Students

-- Course allocations 
-- DROP TABLE IF EXISTS dbo.T_Allocations

SELECT CourseId
 , StudentId
 , Cast(StartDate as Date) StartDate
 , Cast(EndDate as Date) EndDate
 , Hours
 , Mark 
 INTO dbo.T_Allocations
 FROM (
  VALUES  (1, 1, '20170101', '20170131', 4, 1)
 , (1, 2, '20170101', '20170131', 5, 2)
 , (1, 3, '20170101', '20170131', 12, 3)
 , (2, 1, '20170201', '20170228', 23, 1)
 , (2, 3, '20170201', '20170228', 12, 3)
 , (2, 4, '20170201', '20170228', 21, 1)
 , (1, 5, '20170201', '20170228', 12, 1)
 , (4, 1, '20170101', '20170131', 5, 2)
 , (4, 1, '20170101', '20170131', 4, 2)
 ) A(CourseId, StudentId, StartDate, EndDate, Hours, Mark)

ALTER TABLE dbo.T_Allocations 
ADD AllocationId int IDENTITY(1,1) 

SELECT *
FROM dbo.T_Allocations

With this, the data model is ready for use. By initial design, the courses in which the students are currently enrolled were not stored in the Allocations table. The following query repairs this:

-- insert data from query (S = students, A = existing allocations)
INSERT INTO dbo.T_Allocations (CourseId, StudentId, StartDate, EndDate, Hours)
SELECT S.CourseId
 , S.StudentId
 , S.StartDate
 , S.EndDate
 , 0 Hours
 FROM dbo.T_Students S
      LEFT JOIN dbo.T_Allocations A
        ON S.CourseId = A.CourseId
       AND S.StudentId = A.StudentId 
 WHERE A.CourseId IS NULL -- no allocation exists yet for the course/student pair
   AND S.CourseId IS NOT NULL -- the student is currently enrolled in a course

If more data needs to be appended, the query above can be modified as follows:

-- insert data from values with check against existing data 
INSERT INTO dbo.T_Allocations (CourseId, StudentId, StartDate, EndDate, Hours, Mark)
SELECT DAT.CourseId
 , DAT.StudentId
 , Cast(DAT.StartDate as Date) StartDate
 , Cast(DAT.EndDate as Date) EndDate
 , DAT.Hours
 , DAT.Mark 
 FROM (
  VALUES  (5, 5, '20170101', '20170131', 4, 2)
 , (5, 5, '20170101', '20170131', 5, 3)
 , (5, 5, '20170101', '20170131', 12, 4)
 , (4, 1, '20170101', '20170131', 4, 2) --already exists 
 ) DAT(CourseId, StudentId, StartDate, EndDate, Hours, Mark)
   LEFT JOIN dbo.T_Allocations A
     ON DAT.CourseId = A.CourseId
    AND DAT.StudentId = A.Studentid 
 WHERE A.CourseId IS NULL
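
The same existence check can also be written with NOT EXISTS instead of the LEFT JOIN with an IS NULL filter. A sketch against the dbo.T_Allocations table from above; the inserted values are made up for illustration:

```sql
-- insert data from values, skipping course/student pairs that already exist
INSERT INTO dbo.T_Allocations (CourseId, StudentId, StartDate, EndDate, Hours, Mark)
SELECT DAT.CourseId
 , DAT.StudentId
 , Cast(DAT.StartDate as Date) StartDate
 , Cast(DAT.EndDate as Date) EndDate
 , DAT.Hours
 , DAT.Mark
 FROM (
  VALUES (6, 6, '20170301', '20170331', 4, 2)
 , (4, 1, '20170101', '20170131', 4, 2) --already exists
 ) DAT(CourseId, StudentId, StartDate, EndDate, Hours, Mark)
 WHERE NOT EXISTS (
     SELECT 1
     FROM dbo.T_Allocations A
     WHERE A.CourseId = DAT.CourseId
       AND A.StudentId = DAT.StudentId
 )
```

Both forms express an anti-join; which one reads better is largely a matter of preference.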

SQL Server 2008 introduced the MERGE statement, which allows combining INSERT, UPDATE and DELETE behavior within the same statement. Rewritten as a MERGE, the above insert becomes:

-- insert data via a merge
 MERGE INTO dbo.T_Allocations A USING
 (
     SELECT DAT.CourseId
     , DAT.StudentId
     , Cast(DAT.StartDate as Date) StartDate
     , Cast(DAT.EndDate as Date) EndDate
     , DAT.Hours
     , DAT.Mark 
     FROM (
      VALUES  (5, 5, '20170101', '20170131', 4, 2)
     , (5, 5, '20170101', '20170131', 5, 3)
     , (5, 5, '20170101', '20170131', 12, 4)
     , (4, 1, '20170101', '20170131', 4, 2) --already exists 
     ) DAT(CourseId, StudentId, StartDate, EndDate, Hours, Mark)
 ) DAT
   ON DAT.CourseId = A.CourseId
  AND DAT.StudentId = A.Studentid 
 WHEN NOT MATCHED THEN 
 INSERT (CourseId, StudentId, StartDate, EndDate, Hours, Mark)
 VALUES (DAT.CourseId, DAT.StudentId, DAT.StartDate, DAT.EndDate, DAT.Hours, DAT.Mark);
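
To illustrate the further behavior a MERGE allows, the sketch below also updates the rows that already exist and returns what was done via the OUTPUT clause. It assumes the dbo.T_Allocations table from above, in which the pair (1, 1) exists exactly once; the values themselves are made up for illustration:

```sql
-- insert new pairs, update existing ones, and report the action taken per row
 MERGE INTO dbo.T_Allocations A USING
 (
     SELECT DAT.CourseId
     , DAT.StudentId
     , Cast(DAT.StartDate as Date) StartDate
     , Cast(DAT.EndDate as Date) EndDate
     , DAT.Hours
     , DAT.Mark
     FROM (
      VALUES (1, 1, '20170101', '20170131', 6, 2) --already exists, will be updated
     , (6, 2, '20170301', '20170331', 8, 1) --new, will be inserted
     ) DAT(CourseId, StudentId, StartDate, EndDate, Hours, Mark)
 ) DAT
   ON DAT.CourseId = A.CourseId
  AND DAT.StudentId = A.StudentId
 WHEN MATCHED THEN
 UPDATE SET Hours = DAT.Hours
  , Mark = DAT.Mark
 WHEN NOT MATCHED THEN
 INSERT (CourseId, StudentId, StartDate, EndDate, Hours, Mark)
 VALUES (DAT.CourseId, DAT.StudentId, DAT.StartDate, DAT.EndDate, DAT.Hours, DAT.Mark)
 OUTPUT $action AS MergeAction, inserted.CourseId, inserted.StudentId, inserted.Hours, inserted.Mark;
```

A MERGE with a WHEN MATCHED branch fails if several source rows match the same target row, so the source should contain each course/student pair only once.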

These are the types of scripts I typically use to insert data into a database, independently of the tables' size. For more, check Microsoft's documentation on INSERT. The data to be imported can easily be prepared in Excel with a few macros.

Notes:

The queries also work in SQL databases in Microsoft Fabric.

Happy coding!

