About Me

IT Professional with more than 16 years experience in IT especially in the area of full life-cycle of Web/Desktop Applications Development, Database Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP support, etc.

Saturday, July 31, 2010

Self-Joins and Denormalized Data Loading in Normalized Models

    One of the scenarios in which I often make use of self-joins is when needing to load denormalized data into a normalized data model. A characteristic of the not-normalized or denormalized data is that there are repeating data, typically the so called header data, which need to be handled specifically.

Note:
   Header data it’s improper said because an entity could contain more than 2 levels of data, for example the Purchase Orders (POs) in an ERP system could be split in PO Headers, Lines, Shipments and Distributions, thus a (denormalized) extract based on the respective data at Distribution level will contain repeating data from the higher levels.

  So for this post I needed to find an entity that contains a parent-child or header-lines structure. Actually it’s not difficult to find such an example, the world of ERP systems is full of such examples – POs, Invoices, Customer Orders, Receipts, Payments, to mention some of the important transactional data, or Customer, Vendors and even the Items, when considering the master data. The difficulty is to simplify the example to a level that could be easier understood also by people who had less tangency with ERP systems or database models. For this I will consider the Receipts received when paying the goods in a (super-)market, if we take such a bill we could typically see the Receipt Number, the name of the Vendor from which we purchased was made, the Receipt Date, the Date when the purchase was made, the Items purchased together with the Quantity and Price. Such information could be easily entered in Excel and later loaded in  a denormalized table, or enter them directly in the respective denormalized table:
-- Receipts denormalized table
CREATE TABLE [dbo].[Receipts](
[ReceiptNumber] [nvarchar](50) NULL,
[Vendor] [nvarchar](100) NULL,
[ReceiptDate] [smalldatetime] NULL,

[CurrencyCode] [nvarchar](3) NULL,

[ItemNumber] [nvarchar] (50) NULL,

[Quantity] [decimal](12, 3) NULL,
[Price] [numeric](12, 2) NULL
)
ON [PRIMARY]


-- inserting test data
INSERT INTO dbo.Receipts
VALUES
('012034', 'TOOM', '2010-06-23', 'EUR', 'KABANOS PARIKA', 1, 2.19)
,
('012034', 'TOOM', '2010-06-23', 'EUR', 'ZITRONE', 1, 0.79)
,
('012034', 'TOOM', '2010-06-23', 'EUR', 'BREAKFAST BAKON', 1, 1.59)
,
('7899998', 'KAUFHOFF', '2010-06-22', 'EUR', 'PILLOW', 1, 23.99)
,
('7899998', 'KAUFHOFF', '2010-06-22', 'EUR', 'BED SHEET', 2, 11.99)


-- checking the data
SELECT *
FROM
dbo.Receipts


   Supposing we have the above data and that we would like to load them in a normalized structure formed from the Header information – Receipt Number, Vendor Name and Receipt Date, and Line information – Item Number, Quantity and Price:
-- Receipt Headers (normalized)
CREATE TABLE [dbo].[ReceiptHeaders](
[ReceiptHeaderID] [int]IDENTITY(1,1) NOT NULL,

[ReceiptNumber] [nvarchar](50) NULL,

[Vendor] [nvarchar](100) NULL,

[ReceiptDate] [smalldatetime]NULL,

[CurrencyCode] [nvarchar](3) NULL
)
ON [PRIMARY]


-- Receipt Lines (normalized)
CREATE TABLE [dbo].[ReceiptLines](
[ReceiptLineID] [int]IDENTITY(1,1) NOT NULL,

[ReceiptHeaderID] int NOT NULL,

[ItemNumber] [nvarchar] (50) NULL,

[Quantity] [decimal] (12, 3) NULL,

[Price] [numeric](12, 2) NULL
)
ON [PRIMARY]


   In order to load the denormalized data in a normalized structure we could write two queries, the first populates the ReceiptHeaders table, for this needing to select the distinct header attributes that make the header, while the second populates the ReceiptLines table:
-- inserting the Receipt Header data
INSERT INTO dbo.ReceiptHeaders

SELECT
DISTINCT R.ReceiptNumber

,
R.Vendor

,
R.ReceiptDate

,
R.CurrencyCode

FROM
dbo.Receipts R


-- inserting the Receipt Lines data
INSERT INTO dbo.ReceiptLines
SELECT
RH.ReceiptHeaderID

,
R.ItemNumber

,
SUM(R.Quantity) Quantity

,
R.Price

FROM
dbo.Receipts R
     JOIN dbo.ReceiptHeaders RH
      ON R.ReceiptNumber = RH.ReceiptNumber
     AND R.Vendor = RH.Vendor

GROUP
BY RH.ReceiptHeaderID

,
R.ItemNumber

,
R.Price


    As can be seen from the second query, the Receipts table was joined with the ReceiptHeaders in order to retrieve the corresponding Header information for each line record. For this action to be possible we need an attribute or combination of attributes unique across the header data, in this case the ReceiptNumber in combination with the Vendor Name. If no such unique combination exists then the match between header and line data is not possible without resulting duplicated data, in such scenario it’s recommended to clean the data before loading them, for example by introducing an attribute that makes the combination unique. The same problem of uniqueness could be applied to the lines too, needing to be possible to identify uniquely a line in the source dataset. This could be done for example by introducing a Line Number in the source dataset or, as in this case, in case there are multiple lines with the same information then we could aggregate the quantities for the respective lines, the Price being moved in the GROUP BY clause as in theory the products with the same Item Number bought at the same time have the same price (that doesn’t necessarily happen in reality though for our exemplification will do).

    After inserting the data in the normalized model it makes sense to check the logic by comparing the data inserted against the initial dataset. It’s always a good idea to do that, in this way could be trapped for example the errors in logic. In Excel for SQL Developers – Part IV: Differences Between Two Datasets I shown how the query for comparing two datasets could be created in a semiautomatic manner and shown also the resulting query. A similar query could be written also in this case, for this purpose being useful to create a view which denormalizes our structure:
-- Receipts View
CREATE VIEW dbo.vReceipts
AS
SELECT
RL.ReceiptLineID

,
RL.ReceiptHeaderID

,
RH.ReceiptNumber

,
RH.Vendor

,
RH.ReceiptDate

,
RH.CurrencyCode

,
RL.ItemNumber

,
RL.Quantity

,
RL.Price

FROM
dbo.ReceiptLines RL
    JOIN dbo.ReceiptHeaders RH
      ON RL.ReceiptHeaderID = RH.ReceiptHeaderID


-- testing the view & updates
SELECT *
FROM
dbo.vReceipts

self-join normalization
    Until now we made it to load the data from a denormalized structure but no trace of a self-join! In many cases writing queries similar with the above ones is enough, though there are many cases when is needed to load the data successively, either incrementally or complete datasets. In both situations we could deal with data already loaded, so we have to avoid entering duplicates. Now it comes the self join into play, because in both insert queries we have to remove the records already loaded. Even if we deal with incremental data that form partitions (any record is provided only once) it’s safer and recommended to check for possible records provided again. So, we’ll have to modify the above two inserts to ignore the records already loaded:
-- inserting the Receipt Header data (with checking for loaded data)
INSERT INTO dbo.ReceiptHeaders
SELECT
DISTINCT R.ReceiptNumber

,
R.Vendor

,
R.ReceiptDate

,
R.CurrencyCode

FROM
dbo.Receipts R
     LEFT JOIN dbo.ReceiptHeaders RH
       ON R.ReceiptNumber = RH.ReceiptNumber
     AND R.Vendor = RH.Vendor

WHERE
RH.ReceiptNumber IS NULL


-- inserting the Receipt Lines data (with checking for loaded data)
INSERT INTO dbo.ReceiptLines
SELECT
RH.ReceiptHeaderID

,
R.ItemNumber

,
SUM(R.Quantity) Quantity

,
R.Price

FROM
dbo.Receipts R
     JOIN dbo.ReceiptHeaders RH
      ON R.ReceiptNumber = RH.ReceiptNumber
     AND R.Vendor = RH.Vendor
     LEFT JOIN dbo.vReceipts VR
       ON R.ReceiptNumber = VR.ReceiptNumber
     AND R.Vendor = RH.Vendor
    AND R.ItemNumber = VR.ItemNumber

WHERE
VR.ReceiptNumber IS NULL

GROUP
BY RH.ReceiptHeaderID

,
R.ItemNumber

,
R.Price


    I hope the queries are easy to be understood, if not then might be a good idea to check the posts on left joins. In order to test the queries let’s insert new data in the Receipts table:
-- inserting new test data

INSERT INTO dbo.Receipts
VALUES
('012455', 'TOOM', '2010-07-25', 'EUR', 'KABANOS PARIKA', 1, 2.20)
,
('012455', 'TOOM', '2010-07-25', 'EUR', 'ZITRONE', 1, 0.79)


    After running the three inserts the same data should be available in input denormalized and normalized structures:
-- testing the updates
SELECT *
FROM
dbo.vReceipts

self-join normalization incremental 
Notes:
    In this post wasn’t considered the case in which the data are modified between loads, thus excepting the above inserts must be written also two update statements that should reflect the changes occurred in loaded data. I will try to approach this subject in another post.

No comments: