SQL Troubles: denormalization

28 December 2020

🧊Data Warehousing: ETL (Part IV: The Load Subprocess)

As part of the ETL process, the Load subprocess is responsible for loading the data into the destination table(s). In theory it covers the final steps of the data pipeline, and in most cases the structure of the destination matches the definition of the query used for data extraction, though this also depends on the transformations used in the solution.

A commonly used approach is to dump the data first into an intermediary table in the staging area, a table with no constraints that only matches the data types of the source. Once the data are loaded, they are copied into the production table. This approach minimizes the unavailability of the production table, as a load from an external data source normally takes longer than copying the data within the same database or instance. That might not be the case when the data are available in the same data center; however, loading the data first into a staging table also facilitates troubleshooting and testing. In addition, the approach allows dropping the indexes on the production table before loading the data and recreating them afterwards. In practice, this often proves to be an efficient way of speeding up data loads.
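The staging approach described above can be sketched in T-SQL as follows. This is only a minimal sketch: the tables dbo.Sales_Staging and dbo.Sales, their columns and the index name are hypothetical and serve just to illustrate the order of the steps.

```sql
-- hypothetical staging table: no constraints, only the data types of the source
CREATE TABLE dbo.Sales_Staging(
  OrderNumber nvarchar(50) NULL,
  CustomerNumber nvarchar(50) NULL,
  OrderDate date NULL,
  Amount decimal(12, 2) NULL
);

-- hypothetical production table with a nonclustered index
CREATE TABLE dbo.Sales(
  OrderNumber nvarchar(50) NOT NULL,
  CustomerNumber nvarchar(50) NULL,
  OrderDate date NULL,
  Amount decimal(12, 2) NULL
);

CREATE NONCLUSTERED INDEX IX_Sales_CustomerNumber
ON dbo.Sales (CustomerNumber);

-- (here the ETL tool loads the external data into dbo.Sales_Staging)

-- 1. drop the index before the load
DROP INDEX IX_Sales_CustomerNumber ON dbo.Sales;

-- 2. copy the staged data into the production table
INSERT INTO dbo.Sales (OrderNumber, CustomerNumber, OrderDate, Amount)
SELECT S.OrderNumber
, S.CustomerNumber
, S.OrderDate
, S.Amount
FROM dbo.Sales_Staging S
WHERE S.OrderNumber IS NOT NULL; -- rows without a key are left for troubleshooting

-- 3. recreate the index once the data are in place
CREATE NONCLUSTERED INDEX IX_Sales_CustomerNumber
ON dbo.Sales (CustomerNumber);
```

Whether dropping and recreating the index pays off depends on the data volume and the number of indexes, so it is worth measuring both variants.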

In general, it’s recommended to import the data 1:1 compared with the source query, though the transformations used can increase or decrease the number of attributes considered. The recommendation applies as well to the cases in which data come from different sources, primarily to separate the pipelines, as systems can have different refreshing requirements and other constraints.

One can consider adding a timestamp reflecting the refresh date and, where needed, additional metadata (e.g. an identifier for the source system, a unique identifier for the record). The timestamp is especially important when the data are imported incrementally, i.e. only the data created since the last load are loaded. Except for the unique identifier, these metadata can also be saved in a separate table, either with the same granularity as the load table (1:1) or with one record for each load per table and system, a reference to the respective record being stored in the load table. There are seldom good arguments for the former approach, while the latter works well when the metadata are used only for auditing purposes. If the metadata are needed in further data processing and performance is important, then the metadata can be stored directly in the load table(s).
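An incremental load with the metadata kept directly in the load table could look like the sketch below. All object names (dbo.Orders_Source, dbo.Orders_Load), the columns and the 'ERP1' system identifier are hypothetical, and the sketch assumes that the source provides a reliable creation timestamp that can be used as a watermark.

```sql
-- hypothetical source table (in practice the result of the extraction query)
CREATE TABLE dbo.Orders_Source(
  OrderNumber nvarchar(50) NOT NULL,
  Amount decimal(12, 2) NULL,
  CreatedOn datetime2(0) NOT NULL
);

-- hypothetical load table carrying the load metadata
CREATE TABLE dbo.Orders_Load(
  OrderNumber nvarchar(50) NOT NULL,
  Amount decimal(12, 2) NULL,
  CreatedOn datetime2(0) NOT NULL,
  SourceSystem nvarchar(20) NOT NULL, -- identifier of the source system
  LoadDate datetime2(0) NOT NULL      -- timestamp of the refresh
);

-- watermark: the most recent record already loaded for the system
DECLARE @LastCreatedOn datetime2(0);

SELECT @LastCreatedOn = COALESCE(MAX(L.CreatedOn), CAST('1900-01-01' AS datetime2(0)))
FROM dbo.Orders_Load L
WHERE L.SourceSystem = 'ERP1';

-- load only the records created since the last load
INSERT INTO dbo.Orders_Load (OrderNumber, Amount, CreatedOn, SourceSystem, LoadDate)
SELECT S.OrderNumber
, S.Amount
, S.CreatedOn
, 'ERP1'
, SYSDATETIME()
FROM dbo.Orders_Source S
WHERE S.CreatedOn > @LastCreatedOn;
```

A watermark based on the creation date misses records that arrive late with an older timestamp, so this is a starting point rather than a complete solution.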

A special approach is taken by the Data Vault methodology for Data Warehousing, which seems to be gaining acceptance, especially for addressing the various compliance requirements for tracking the changes in records at the most granular level. To achieve this, the fact and dimension tables are split into several tables: the hub tables store the business keys together with load metadata, the link tables store the relationships between business keys, the satellite tables store the descriptions of the business keys (the attributes other than the business key), and the reference tables store the dropdown values. Besides the tables’ decomposition, there are several other constraints that apply. Spreading the data over multiple tables can increase the overall complexity and comes with performance penalties, as more tables need to be joined; however, it might be the price to pay if traceability and auditability are a must.

There are scenarios in which the requirements for the ETL packages are driven by the target (load) tables, whose format is already given. One then needs to fit the data into the existing tables or extend the respective tables to accommodate more attributes. This is the case for load tables storing data from multiple systems with a similar purpose (e.g. financial data from different ERP systems needed for consolidations).

27 December 2020

🧊☯Data Warehousing: Data Vault 2.0 (The Good, the Bad and the Ugly)

Data Warehousing Series

One of the interesting concepts that seems to be gaining adherents in Data Warehousing is the Data Vault – a methodology, architecture and implementation for Data Warehouses (DWH) developed by Dan Linstedt between 1990 and 2000, which evolved into an open standard with the 2.0 version.

According to its creator, the Data Vault is a detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more business functional areas [2]. To hold data at the lowest grain of detail from the source system(s) and track the changes that occurred in the data, it splits the fact and dimension tables into hubs (business keys), links (the relationships between business keys), satellites (descriptions of the business keys), and reference (dropdown values) tables [3], while adopting a hybrid approach between 3rd normal form and star schemas. In addition, it provides a two- or three-layered data integration architecture and a series of standards, methods and best practices meant to facilitate its use.
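To make the split more tangible, here is a simplified sketch of how a hub, a link and a satellite could look for customers and orders. The table and column names are hypothetical, and the use of hash keys together with a load date and a record source follows the usual Data Vault 2.0 conventions in a reduced form; the specification [3] remains the authoritative source.

```sql
-- hubs: business keys together with load metadata
CREATE TABLE dbo.HubCustomer(
  CustomerHashKey char(32) NOT NULL PRIMARY KEY, -- hash of the business key
  CustomerNumber nvarchar(50) NOT NULL,          -- business key
  LoadDate datetime2(0) NOT NULL,
  RecordSource nvarchar(50) NOT NULL
);

CREATE TABLE dbo.HubOrder(
  OrderHashKey char(32) NOT NULL PRIMARY KEY,
  OrderNumber nvarchar(50) NOT NULL,             -- business key
  LoadDate datetime2(0) NOT NULL,
  RecordSource nvarchar(50) NOT NULL
);

-- link: relationship between the business keys
CREATE TABLE dbo.LinkCustomerOrder(
  CustomerOrderHashKey char(32) NOT NULL PRIMARY KEY,
  CustomerHashKey char(32) NOT NULL REFERENCES dbo.HubCustomer (CustomerHashKey),
  OrderHashKey char(32) NOT NULL REFERENCES dbo.HubOrder (OrderHashKey),
  LoadDate datetime2(0) NOT NULL,
  RecordSource nvarchar(50) NOT NULL
);

-- satellite: descriptive attributes, one record per change
CREATE TABLE dbo.SatCustomer(
  CustomerHashKey char(32) NOT NULL REFERENCES dbo.HubCustomer (CustomerHashKey),
  LoadDate datetime2(0) NOT NULL,
  CustomerName nvarchar(100) NULL,
  City nvarchar(100) NULL,
  RecordSource nvarchar(50) NOT NULL,
  CONSTRAINT PK_SatCustomer PRIMARY KEY (CustomerHashKey, LoadDate)
);
```

Even this small example needs two joins to return a customer's current description next to its orders, which hints at the performance considerations discussed under The Bad.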

It integrates several other methodologies that allow bridging the gap between the technical, logistic and execution parts of the DWH life-cycle – the PMI methodology is used for the various levels of planning and execution, while the Scrum methodology is used for coordinating the day-to-day project tasks. Six Sigma is used together with Total Quality Management for the design and continuous improvement of DWH and data-related processes. In addition, it follows the CMMI maturity model for providing a clear baseline for benchmarking an organization’s DWH capabilities in development, acquisition and service areas.

The Good: The decomposition of the source data models into hub, link and satellite tables provides traceability and auditability at raw data level, thus allowing the compliance requirements of Sarbanes-Oxley, HIPAA and Basel II to be addressed by design.

The considered standards, methods, principles and best practices are leveraged from Software Engineering [1], establishing common ground and a standardized approach to DWH design, implementation and testing. It also narrows down the learning and implementation paths, while allowing an incremental approach to the various phases.

Data Vault 2.0 offers support for real-time, near-real-time and unstructured data, while new technologies like MapReduce and NoSQL can be integrated within its architecture, though the same can be said about other approaches as long as there’s compatibility between the considered technologies. In fact, except for the business entities’ decomposition, many of the notions used are common to DWH design.

The Bad: Further decomposing the fact and dimension tables can impact the performance of the queries run against the tables, as more joins are required to gather the data from the various tables. The further decomposition of tables can lead to higher data storage needs, though this can be negligible compared with the volume of additional objects that need to be created in the DWH. For an ERP system with a few hundred meaningful tables the complexity can become overwhelming.

Unless one uses a COTS tool which automates some part of the design and creation process, building everything from scratch can be time-consuming, increasing thus the time-to-market for solutions. However, the COTS tools can introduce restrictions of their own, which can negatively impact the overall experience with the methodology.

The incorporation of non-technical methodologies can have a positive impact, though unless one has experience with the respective methodologies, the disadvantages can easily overshadow the (theoretical) advantages.

The Ugly: The dangers of using Data Vault are associated, as usual, with a poor understanding of the methodology, an insufficient skillset or the attempt to implement the methodology without allowing some flexibility when required. Unless one knows what one is doing, bringing more complexity into a field which is already complex can easily have a negative impact on projects’ outcomes.

References:
[1] Dan Linstedt & Michael Olschimke (2015) Building a Scalable Data Warehouse with Data Vault 2.0
[2] Dan Linstedt (?) Data Vault Basics [source]
[3] Dan Linstedt (2018) Data Vault: Data Modeling Specification v 2.0.2 [source]

31 July 2010

💎SQL Reloaded: Self-Joins and Denormalized Data Loading in Normalized Models

One of the scenarios in which I often make use of self-joins is when I need to load denormalized data into a normalized data model. A characteristic of not-normalized or denormalized data is that they contain repeating data, typically the so-called header data, which need to be handled specifically.

Note:
“Header data” is an imprecise term because an entity could contain more than 2 levels of data. For example, the Purchase Orders (POs) in an ERP system could be split into PO Headers, Lines, Shipments and Distributions, thus a (denormalized) extract based on the respective data at Distribution level will contain repeating data from the higher levels.

So for this post I needed to find an entity that contains a parent-child or header-lines structure. Actually it’s not difficult to find one, as the world of ERP systems is full of such examples: POs, Invoices, Customer Orders, Receipts and Payments, to mention some of the important transactional data, or Customers, Vendors and even Items, when considering the master data. The difficulty is to simplify the example to a level that can be easily understood also by people who had less contact with ERP systems or database models. For this I will consider the Receipts received when paying for the goods in a (super-)market. On such a bill we could typically see the Receipt Number, the name of the Vendor from which the purchase was made, the Receipt Date (the date when the purchase was made), and the Items purchased together with their Quantity and Price. Such information could easily be entered in Excel and later loaded into a denormalized table, or entered directly in the respective denormalized table:

-- Receipts denormalized table 
CREATE TABLE [dbo].[Receipts]( 
[ReceiptNumber] [nvarchar](50) NULL, 
[Vendor] [nvarchar](100) NULL, 
[ReceiptDate] [smalldatetime] NULL, 
[CurrencyCode] [nvarchar](3) NULL, 
[ItemNumber] [nvarchar] (50) NULL,  
[Quantity] [decimal](12, 3) NULL, 
[Price] [numeric](12, 2) NULL 
) 
ON [PRIMARY] 

-- inserting test data 
INSERT INTO dbo.Receipts 
VALUES ('012034', 'TOOM', '2010-06-23', 'EUR', 'KABANOS PARIKA', 1, 2.19) 
, ('012034', 'TOOM', '2010-06-23', 'EUR', 'ZITRONE', 1, 0.79) 
, ('012034', 'TOOM', '2010-06-23', 'EUR', 'BREAKFAST BAKON', 1, 1.59) 
, ('7899998', 'KAUFHOFF', '2010-06-22', 'EUR', 'PILLOW', 1, 23.99) 
, ('7899998', 'KAUFHOFF', '2010-06-22', 'EUR', 'BED SHEET', 2, 11.99) 
 
-- checking the data 
SELECT * 
FROM dbo.Receipts

Suppose we have the above data and would like to load them into a normalized structure formed from the Header information (Receipt Number, Vendor Name, Receipt Date and Currency Code) and the Line information (Item Number, Quantity and Price):

-- Receipt Headers (normalized) 
CREATE TABLE [dbo].[ReceiptHeaders]( 
[ReceiptHeaderID] [int]IDENTITY(1,1) NOT NULL, 
[ReceiptNumber] [nvarchar](50) NULL, 
[Vendor] [nvarchar](100) NULL, 
[ReceiptDate] [smalldatetime]NULL, 
[CurrencyCode] [nvarchar](3) NULL 
) ON [PRIMARY]  

-- Receipt Lines (normalized)

CREATE TABLE [dbo].[ReceiptLines]( 
[ReceiptLineID] [int]IDENTITY(1,1) NOT NULL, 
[ReceiptHeaderID] int NOT NULL, 
[ItemNumber] [nvarchar] (50) NULL, 
[Quantity] [decimal] (12, 3) NULL, 
[Price] [numeric](12, 2) NULL 
) ON [PRIMARY]

In order to load the denormalized data into the normalized structure we could write two queries. The first populates the ReceiptHeaders table, for which we need to select the distinct combinations of attributes that make up the header, while the second populates the ReceiptLines table:

-- inserting the Receipt Header data 
INSERT INTO dbo.ReceiptHeaders 
SELECT DISTINCT R.ReceiptNumber  
, R.Vendor  
, R.ReceiptDate  
, R.CurrencyCode 
FROM dbo.Receipts R 

-- inserting the Receipt Lines data

INSERT INTO dbo.ReceiptLines 
SELECT RH.ReceiptHeaderID 
, R.ItemNumber  
, SUM(R.Quantity) Quantity 
, R.Price  
FROM dbo.Receipts R 
     JOIN dbo.ReceiptHeaders RH 
      ON R.ReceiptNumber = RH.ReceiptNumber 
     AND R.Vendor = RH.Vendor 
GROUP BY RH.ReceiptHeaderID 
, R.ItemNumber  
, R.Price

As can be seen from the second query, the Receipts table was joined with ReceiptHeaders in order to retrieve the corresponding Header information for each line record. For this to be possible we need an attribute or combination of attributes that is unique across the header data, in this case the ReceiptNumber in combination with the Vendor name. If no such unique combination exists, the match between header and line data is not possible without producing duplicates; in such a scenario it’s recommended to clean the data before loading them, for example by introducing an attribute that makes the combination unique. The same uniqueness problem applies to the lines too: it must be possible to identify each line uniquely in the source dataset. This could be done, for example, by introducing a Line Number in the source dataset or, as in this case, by aggregating the quantities when there are multiple lines with the same information. The Price was moved into the GROUP BY clause because in theory the products with the same Item Number bought at the same time have the same price (that doesn’t necessarily happen in reality, though it will do for our example).
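Whether the ReceiptNumber and Vendor combination is really unique across the header data can be checked before the load with a query like the one below. It is a small sketch based on the dbo.Receipts table defined earlier; the NoVariants alias is just an illustrative name.

```sql
-- header keys that come with more than one variant of header attributes
SELECT R.ReceiptNumber
, R.Vendor
, COUNT(*) NoVariants
FROM ( -- distinct header combinations found in the source
    SELECT DISTINCT ReceiptNumber
    , Vendor
    , ReceiptDate
    , CurrencyCode
    FROM dbo.Receipts
  ) R
GROUP BY R.ReceiptNumber
, R.Vendor
HAVING COUNT(*) > 1;
```

Any record returned points to a receipt whose header attributes conflict (e.g. two different dates for the same receipt) and which should be cleaned before the inserts are run.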

After inserting the data into the normalized model it makes sense to check the logic by comparing the inserted data against the initial dataset. It’s always a good idea to do that, as this way errors in logic can be caught. In Excel for SQL Developers – Part IV: Differences Between Two Datasets I showed how the query for comparing two datasets can be created in a semiautomatic manner, together with the resulting query. A similar query could be written in this case too; for this purpose it is useful to create a view which denormalizes our structure:

-- Receipts View 
CREATE VIEW dbo.vReceipts 
AS 
SELECT RL.ReceiptLineID  
, RL.ReceiptHeaderID 
, RH.ReceiptNumber  
, RH.Vendor  
, RH.ReceiptDate  
, RH.CurrencyCode  
, RL.ItemNumber  
, RL.Quantity  
, RL.Price  
FROM dbo.ReceiptLines RL 
    JOIN dbo.ReceiptHeaders RH 
      ON RL.ReceiptHeaderID = RH.ReceiptHeaderID  
GO

-- testing the view & updates 
SELECT * 
FROM dbo.vReceipts
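With the view in place, the comparison against the initial dataset can be sketched with the EXCEPT operator. Note that this is a different technique from the semiautomatic one described in the linked post, and that the source is aggregated in the same way as in the insert so that the two sides are comparable.

```sql
-- source records (aggregated as in the load) that are missing from the normalized model
SELECT R.ReceiptNumber
, R.Vendor
, R.ReceiptDate
, R.CurrencyCode
, R.ItemNumber
, SUM(R.Quantity) Quantity
, R.Price
FROM dbo.Receipts R
GROUP BY R.ReceiptNumber
, R.Vendor
, R.ReceiptDate
, R.CurrencyCode
, R.ItemNumber
, R.Price
EXCEPT
SELECT VR.ReceiptNumber
, VR.Vendor
, VR.ReceiptDate
, VR.CurrencyCode
, VR.ItemNumber
, VR.Quantity
, VR.Price
FROM dbo.vReceipts VR;
```

An empty result means that every source record was found in the normalized structure; swapping the two halves of the query checks the opposite direction.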

So far we managed to load the data from a denormalized structure, but there is no trace of a self-join! In many cases writing queries similar to the ones above is enough, though there are many cases in which the data need to be loaded successively, either incrementally or as complete datasets. In both situations we could be dealing with data already loaded, so we have to avoid entering duplicates. This is where the self-join comes into play, because in both insert queries we have to exclude the records already loaded. Even if we deal with incremental data that form partitions (each record is provided only once), it’s safer and recommended to check for records possibly provided again. So, we’ll have to modify the above two inserts to ignore the records already loaded:

-- inserting the Receipt Header data (with checking for loaded data) 
INSERT INTO dbo.ReceiptHeaders 
SELECT DISTINCT R.ReceiptNumber  
, R.Vendor  
, R.ReceiptDate  
, R.CurrencyCode 
FROM dbo.Receipts R 
     LEFT JOIN dbo.ReceiptHeaders RH 
       ON R.ReceiptNumber = RH.ReceiptNumber 
     AND R.Vendor = RH.Vendor 
WHERE RH.ReceiptNumber IS NULL 

-- inserting the Receipt Lines data (with checking for loaded data) 
INSERT INTO dbo.ReceiptLines 
SELECT RH.ReceiptHeaderID 
, R.ItemNumber  
, SUM(R.Quantity) Quantity 
, R.Price  
FROM dbo.Receipts R 
     JOIN dbo.ReceiptHeaders RH 
      ON R.ReceiptNumber = RH.ReceiptNumber 
     AND R.Vendor = RH.Vendor 
     LEFT JOIN dbo.vReceipts VR 
       ON R.ReceiptNumber = VR.ReceiptNumber 
      AND R.Vendor = VR.Vendor 
      AND R.ItemNumber = VR.ItemNumber 
WHERE VR.ReceiptNumber IS NULL 
GROUP BY RH.ReceiptHeaderID 
, R.ItemNumber  
, R.Price

I hope the queries are easy to understand; if not, it might be a good idea to check the posts on left joins. In order to test the queries let’s insert new data into the Receipts table:

-- inserting new test data 
INSERT INTO dbo.Receipts 
VALUES ('012455', 'TOOM', '2010-07-25', 'EUR', 'KABANOS PARIKA', 1, 2.20) 
, ('012455', 'TOOM', '2010-07-25', 'EUR', 'ZITRONE', 1, 0.79)

After running the three inserts, the same data should be available in the denormalized input and in the normalized structures:

-- testing the updates 
SELECT * 
FROM dbo.vReceipts

Notes:
This post doesn’t consider the case in which the data are modified between loads. Besides the above inserts, two update statements must then be written to reflect the changes that occurred in the already loaded data. I will try to approach this subject in another post.
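Until then, the two update statements could look roughly like the sketch below. It is only one possible way of doing it, under the assumptions that the header is identified by ReceiptNumber and Vendor, that a line is identified by the header together with ItemNumber and Price (as in the inserts above), and that the compared attributes contain no NULL values.

```sql
-- updating the Receipt Header data changed between loads
UPDATE RH
SET ReceiptDate = R.ReceiptDate
, CurrencyCode = R.CurrencyCode
FROM dbo.ReceiptHeaders RH
     JOIN ( -- distinct header data from the source
       SELECT DISTINCT ReceiptNumber
       , Vendor
       , ReceiptDate
       , CurrencyCode
       FROM dbo.Receipts
     ) R
       ON RH.ReceiptNumber = R.ReceiptNumber
      AND RH.Vendor = R.Vendor
WHERE RH.ReceiptDate <> R.ReceiptDate
   OR RH.CurrencyCode <> R.CurrencyCode;

-- updating the Receipt Lines data changed between loads
UPDATE RL
SET Quantity = R.Quantity
FROM dbo.ReceiptLines RL
     JOIN dbo.ReceiptHeaders RH
       ON RL.ReceiptHeaderID = RH.ReceiptHeaderID
     JOIN ( -- line data aggregated as in the insert
       SELECT ReceiptNumber
       , Vendor
       , ItemNumber
       , Price
       , SUM(Quantity) Quantity
       FROM dbo.Receipts
       GROUP BY ReceiptNumber
       , Vendor
       , ItemNumber
       , Price
     ) R
       ON RH.ReceiptNumber = R.ReceiptNumber
      AND RH.Vendor = R.Vendor
      AND RL.ItemNumber = R.ItemNumber
      AND RL.Price = R.Price
WHERE RL.Quantity <> R.Quantity;
```

Because the Price is part of the line key, a changed price would appear as a new line rather than as an update, and a header key with conflicting attributes in the source makes the first update nondeterministic, so the uniqueness of the keys should be checked first.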

05 February 2010

🕋Data Warehousing: Star Schema (Definitions)

"A relational database structure in which data is maintained in a single fact table at the center of the schema with additional dimension data stored in dimension tables. Each dimension table is directly related to the fact table by a key column." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A star schema is a dimensional data model implemented on a relational database." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"A star schema is a set of tables comprised of a single, central fact table surrounded by dimension tables. Each dimension is represented by a single dimension table. Star schemas implement dimensional data structures with denormalized dimensions. Snowflake schemas are an alternative to a star schema design." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"A single fact table surrounded by a single hierarchical layer of dimensional tables, in a data warehouse database." (Gavin Powell, "Beginning Database Design", 2006)

"A single fact table which joins to many dimension tables, each of which is a single denormalized dimension table." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"The instantiation of a dimensional model in a relational database. A star schema consists of a fact table and the dimension tables that it references. The fact table contains facts and foreign keys; the dimension tables contain dimensional attributes by which the facts will be filtered, rolled up, or grouped." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"The implementation of a dimensional model in a relational database. The tables are organized around a single central fact table possessing a multi-part key, and each surrounding dimension table has its own primary key." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"The basic form of data organization for a data warehouse, consisting of a single large fact table and many smaller dimension tables." (Toby J Teorey, "Database Modeling and Design" 4th Ed., 2010)

"The arrangement of the collection of fact and dimension tables in the dimensional data model, resembling a star formation, with the fact table placed in the middle surrounded by the dimension tables. Each dimension table is in a one-to-many relationship with the fact table." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A common form of a dimensional data model, where a fact table is directly linked by foreign keys to several dimension tables." (Craig S Mullins, "Database Administration", 2012)

"A relational database structure in which data is maintained in a single fact table at the center of the schema with additional dimension data stored in dimension tables. Each dimension table is directly related to and usually joined to the fact table by a key column." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A relational schema whose design represents a dimensional data model. The star schema consists of one or more fact tables and one or more dimension tables that are related through foreign keys." (Oracle, "Database SQL Tuning Guide Glossary", 2013)

"A type of relational database schema that is composed of a set of tables comprising a single, central fact table surrounded by dimension tables. See also dimension table, star join." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)
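As an illustration of the definitions above, a minimal star schema could be sketched as follows. The tables dbo.DimDate, dbo.DimProduct and dbo.FactSales and their columns are hypothetical and chosen only to show a central fact table referencing denormalized dimension tables through foreign keys.

```sql
-- dimension tables: one (denormalized) table per dimension
CREATE TABLE dbo.DimDate(
  DateKey int NOT NULL PRIMARY KEY, -- e.g. 20100623
  CalendarDate date NOT NULL,
  CalendarYear smallint NOT NULL
);

CREATE TABLE dbo.DimProduct(
  ProductKey int IDENTITY(1,1) NOT NULL PRIMARY KEY,
  ProductName nvarchar(100) NOT NULL,
  Category nvarchar(50) NULL -- kept in the dimension instead of a separate table
);

-- fact table: facts and foreign keys to the dimensions
CREATE TABLE dbo.FactSales(
  DateKey int NOT NULL REFERENCES dbo.DimDate (DateKey),
  ProductKey int NOT NULL REFERENCES dbo.DimProduct (ProductKey),
  Quantity decimal(12, 3) NOT NULL,
  Amount decimal(12, 2) NOT NULL
);

-- typical query: facts filtered, rolled up or grouped by dimensional attributes
SELECT D.CalendarYear
, P.Category
, SUM(F.Amount) Amount
FROM dbo.FactSales F
     JOIN dbo.DimDate D
       ON F.DateKey = D.DateKey
     JOIN dbo.DimProduct P
       ON F.ProductKey = P.ProductKey
GROUP BY D.CalendarYear
, P.Category;
```

Moving Category into its own table referenced from dbo.DimProduct would turn the sketch into the snowflake variant mentioned in one of the definitions.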

02 July 2009

🛢DBMS: Denormalization (Definitions)

"The technique of placing data often accessed/used together in a physical location that optimizes the performance of the system." (Margaret Y Chu, "Blissful Data", 2004)

"An intentional violation of the rules of normalization done to increase performance of a database. It typically occurs in varying degrees during all phases of physically implementing a database. Database designs are often denormalized to accomplish a specific performance-related goal. Denormalization can’t be done without a thorough understanding of the data and the needs of the customer." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"The process of adding planned redundancy to an already fully normalized data model." (Thomas Moore, "EXAM CRAM™ 2: Designing and Implementing Databases with SQL Server 2000 Enterprise Edition", 2005)

"The technique of placing normalized data in a physical location that optimizes the performance of the system." (William H Inmon, "Building the Data Warehouse", 2005)

"Most often the opposite of normalization, more commonly used in data warehouse or reporting environments. Denormalization decreases granularity by reversing normalization, and otherwise." (Gavin Powell, "Beginning Database Design", 2006)

"Organization of data by minimizing joins between tables and storing redundant values in a single table to reduce query time." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"The process of adding planned redundancy to an already fully normalized data model." (Thomas Moore, "MCTS 70-431: Implementing and Maintaining Microsoft SQL Server 2005", 2006)

"Denormalization is the process of combining tables so that they are easier to query. Denormalization is opposite to normalization. Denormalization is done to improve query performance." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"The formal process of introducing redundancy back into the database design to improve performance." (Victor Isakov et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance (70-444) Study Guide", 2007)

"Denormalization is the process of extracting data from normalized tables in the relational model of a data warehouse." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"The consolidation of database tables to increase performance in data retrieval (query), despite the potential loss of data integrity. Decisions on when to denormalize tables are based on cost/benefit analysis by the DBA." (Toby J Teorey, "Database Modeling and Design" 4th Ed., 2010)

"A process by which a table is changed from a higher level normal form to a lower level normal form. Usually done to increase processing speed. Potentially yields data anomalies." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed, 2011)

"Undoing the effect of normalization; the process of putting one fact in numerous places in the database." (Craig S Mullins, "Database Administration", 2012)

"The intentional duplication of columns in multiple tables to increase data redundancy. Denormalization is sometimes used to improve performance." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)
