Showing posts with label self-join. Show all posts
Showing posts with label self-join. Show all posts

31 July 2010

SQL Reloaded: Self-Joins and Denormalized Data Loading in Normalized Models

    One of the scenarios in which I often make use of self-joins is when needing to load denormalized data into a normalized data model. A characteristic of the not-normalized or denormalized data is that there are repeating data, typically the so called header data, which need to be handled specifically.

Note:
   Header data it’s improper said because an entity could contain more than 2 levels of data, for example the Purchase Orders (POs) in an ERP system could be split in PO Headers, Lines, Shipments and Distributions, thus a (denormalized) extract based on the respective data at Distribution level will contain repeating data from the higher levels.

  So for this post I needed to find an entity that contains a parent-child or header-lines structure. Actually it’s not difficult to find such an example, the world of ERP systems is full of such examples – POs, Invoices, Customer Orders, Receipts, Payments, to mention some of the important transactional data, or Customer, Vendors and even the Items, when considering the master data. The difficulty is to simplify the example to a level that could be easier understood also by people who had less tangency with ERP systems or database models. For this I will consider the Receipts received when paying the goods in a (super-)market, if we take such a bill we could typically see the Receipt Number, the name of the Vendor from which we purchased was made, the Receipt Date, the Date when the purchase was made, the Items purchased together with the Quantity and Price. Such information could be easily entered in Excel and later loaded in  a denormalized table, or enter them directly in the respective denormalized table:

-- Receipts denormalized table 
CREATE TABLE [dbo].[Receipts]( 
[ReceiptNumber] [nvarchar](50) NULL, 
[Vendor] [nvarchar](100) NULL, 
[ReceiptDate] [smalldatetime] NULL, 
[CurrencyCode] [nvarchar](3) NULL, 
[ItemNumber] [nvarchar] (50) NULL,  
[Quantity] [decimal](12, 3) NULL, 
[Price] [numeric](12, 2) NULL 
) 
ON [PRIMARY] 

-- inserting test data 
INSERT INTO dbo.Receipts 
VALUES ('012034', 'TOOM', '2010-06-23', 'EUR', 'KABANOS PARIKA', 1, 2.19) 
, ('012034', 'TOOM', '2010-06-23', 'EUR', 'ZITRONE', 1, 0.79) 
, ('012034', 'TOOM', '2010-06-23', 'EUR', 'BREAKFAST BAKON', 1, 1.59) 
, ('7899998', 'KAUFHOFF', '2010-06-22', 'EUR', 'PILLOW', 1, 23.99) 
, ('7899998', 'KAUFHOFF', '2010-06-22', 'EUR', 'BED SHEET', 2, 11.99) 
 
-- checking the data 
SELECT * 
FROM dbo.Receipts  

   Supposing we have the above data and that we would like to load them in a normalized structure formed from the Header information – Receipt Number, Vendor Name and Receipt Date, and Line information – Item Number, Quantity and Price:

-- Receipt Headers (normalized) 
CREATE TABLE [dbo].[ReceiptHeaders]( 
[ReceiptHeaderID] [int]IDENTITY(1,1) NOT NULL, 
[ReceiptNumber] [nvarchar](50) NULL, 
[Vendor] [nvarchar](100) NULL, 
[ReceiptDate] [smalldatetime]NULL, 
[CurrencyCode] [nvarchar](3) NULL 
) ON [PRIMARY]  

-- Receipt Lines (normalized) 
CREATE TABLE [dbo].[ReceiptLines]( 
[ReceiptLineID] [int]IDENTITY(1,1) NOT NULL, 
[ReceiptHeaderID] int NOT NULL, 
[ItemNumber] [nvarchar] (50) NULL, 
[Quantity] [decimal] (12, 3) NULL, 
[Price] [numeric](12, 2) NULL 
) ON [PRIMARY]

   In order to load the denormalized data in a normalized structure we could write two queries, the first populates the ReceiptHeaders table, for this needing to select the distinct header attributes that make the header, while the second populates the ReceiptLines table:

-- inserting the Receipt Header data 
INSERT INTO dbo.ReceiptHeaders 
SELECT DISTINCT R.ReceiptNumber  
, R.Vendor  
, R.ReceiptDate  
, R.CurrencyCode 
FROM dbo.Receipts R 

-- inserting the Receipt Lines data 
INSERT INTO dbo.ReceiptLines 
SELECT RH.ReceiptHeaderID 
, R.ItemNumber  
, SUM(R.Quantity) Quantity 
, R.Price  
FROM dbo.Receipts R 
     JOIN dbo.ReceiptHeaders RH 
      ON R.ReceiptNumber = RH.ReceiptNumber 
     AND R.Vendor = RH.Vendor 
GROUP BY RH.ReceiptHeaderID 
, R.ItemNumber  
, R.Price  

    As can be seen from the second query, the Receipts table was joined with the ReceiptHeaders in order to retrieve the corresponding Header information for each line record. For this action to be possible we need an attribute or combination of attributes unique across the header data, in this case the ReceiptNumber in combination with the Vendor Name. If no such unique combination exists then the match between header and line data is not possible without resulting duplicated data, in such scenario it’s recommended to clean the data before loading them, for example by introducing an attribute that makes the combination unique. The same problem of uniqueness could be applied to the lines too, needing to be possible to identify uniquely a line in the source dataset. This could be done for example by introducing a Line Number in the source dataset or, as in this case, in case there are multiple lines with the same information then we could aggregate the quantities for the respective lines, the Price being moved in the GROUP BY clause as in theory the products with the same Item Number bought at the same time have the same price (that doesn’t necessarily happen in reality though for our exemplification will do).

    After inserting the data in the normalized model it makes sense to check the logic by comparing the data inserted against the initial dataset. It’s always a good idea to do that, in this way could be trapped for example the errors in logic. In Excel for SQL Developers – Part IV: Differences Between Two Datasets I shown how the query for comparing two datasets could be created in a semiautomatic manner and shown also the resulting query. A similar query could be written also in this case, for this purpose being useful to create a view which denormalizes our structure:  

-- Receipts View 
CREATE VIEW dbo.vReceipts 
AS 
SELECT RL.ReceiptLineID  
, RL.ReceiptHeaderID 
, RH.ReceiptNumber  
, RH.Vendor  
, RH.ReceiptDate  
, RH.CurrencyCode  
, RL.ItemNumber  
, RL.Quantity  
, RL.Price  
FROM dbo.ReceiptLines RL 
    JOIN dbo.ReceiptHeaders RH 
      ON RL.ReceiptHeaderID = RH.ReceiptHeaderID  

-- testing the view & updates 
SELECT * 
FROM dbo.vReceipts 
 
self-join normalization
    Until now we made it to load the data from a denormalized structure but no trace of a self-join! In many cases writing queries similar with the above ones is enough, though there are many cases when is needed to load the data successively, either incrementally or complete datasets. In both situations we could deal with data already loaded, so we have to avoid entering duplicates. Now it comes the self join into play, because in both insert queries we have to remove the records already loaded. Even if we deal with incremental data that form partitions (any record is provided only once) it’s safer and recommended to check for possible records provided again. So, we’ll have to modify the above two inserts to ignore the records already loaded:

-- inserting the Receipt Header data (with checking for loaded data) 
INSERT INTO dbo.ReceiptHeaders 
SELECT DISTINCT R.ReceiptNumber  
, R.Vendor  
, R.ReceiptDate  
, R.CurrencyCode 
FROM dbo.Receipts R 
     LEFT JOIN dbo.ReceiptHeaders RH 
       ON R.ReceiptNumber = RH.ReceiptNumber 
     AND R.Vendor = RH.Vendor 
WHERE RH.ReceiptNumber IS NULL 

-- inserting the Receipt Lines data (with checking for loaded data) 
INSERT INTO dbo.ReceiptLines 
SELECT RH.ReceiptHeaderID 
, R.ItemNumber  
, SUM(R.Quantity) Quantity 
, R.Price  
FROM dbo.Receipts R 
     JOIN dbo.ReceiptHeaders RH 
      ON R.ReceiptNumber = RH.ReceiptNumber 
     AND R.Vendor = RH.Vendor 
     LEFT JOIN dbo.vReceipts VR 
       ON R.ReceiptNumber = VR.ReceiptNumber 
     AND R.Vendor = RH.Vendor 
    AND R.ItemNumber = VR.ItemNumber 
WHERE VR.ReceiptNumber IS NULL 
GROUP BY RH.ReceiptHeaderID 
, R.ItemNumber  
, R.Price 

    I hope the queries are easy to be understood, if not then might be a good idea to check the posts on left joins. In order to test the queries let’s insert new data in the Receipts table:

-- inserting new test data 
INSERT INTO dbo.Receipts 
VALUES ('012455', 'TOOM', '2010-07-25', 'EUR', 'KABANOS PARIKA', 1, 2.20) 
, ('012455', 'TOOM', '2010-07-25', 'EUR', 'ZITRONE', 1, 0.79) 

    After running the three inserts the same data should be available in input denormalized and normalized structures:

-- testing the updates 
SELECT * 
FROM dbo.vReceipts 
 
self-join normalization incremental 
Notes:
    In this post wasn’t considered the case in which the data are modified between loads, thus excepting the above inserts must be written also two update statements that should reflect the changes occurred in loaded data. I will try to approach this subject in another post.

30 July 2010

SQL Reloaded: Self-Join in Update Query II

    In yesterday’s post on the same topic I tried to exemplify how a self-join update query could be written in order to avoid an error raised by the database engine. Even if this type of query is normally written for hierarchical structures consisting of one-level parent-child relations stored in the same table, my example was using a simple join based directly on the primary key. During the day I was thinking that might be straightforward to create a simple hierarchical structure based on a relatively realistic example. For this let’s consider a department in which the performance of the the manager is a function of the performance of its employees, while the number of hours of training allocated to each employee is proportional to the number of hours allocated to its manager. For this let’s consider the following table:  

--Employee table's script 
CREATE TABLE [dbo].[Employees]( 
[EmployeeID] [int]  NOT NULL 
, [ManagerID] [int]  NULL 
, [Name] [nvarchar] (50) NULL 
, [Performance] [numeric]  (5, 3) NULL 
, [TrainingHours] [numeric]  (5, 2) NULL ) 
ON [PRIMARY] 

-- inserting the test data 
INSERT INTO dbo.Employees 
VALUES (1, NULL, 'Joe', NULL, 10) 
, (2, 1, 'Jack', .65, NULL) 
, (3, 1, 'Mary', .45, NULL) 
, (4, 1, 'Ross', .54, NULL) 
, (5, 1, 'Scott', .48, NULL) 
, (6, NULL, 'Jane', NULL, 15) 
, (7, 6, 'Sam', .50, NULL) 
, (8, 6, 'Ron', .45, NULL) 

   In order to calculate Manager’s performance we could use a query similar with the following, of course the formula could be more complicated and not a simple average:  

-- updating Manager's Performance 
UPDATE dbo.Employees 
SET Performance = EMP.Performance 
FROM ( -- inline view 
     SELECT ManagerID 
     , AVG(Performance) Performance 
     FROM dbo.Employees 
     WHERE ManagerID IS NOT NULL 
     GROUP BY ManagerID 
) EMP 
WHERE dbo.Employees.ManagerID IS NULL 
AND dbo.Employees.EmployeeID = EMP.ManagerID 

    Let’s check the updates and their correctitude: 


-- Checking updated data 
SELECT * 
FROM dbo.Employees 
WHERE ManagerID IS NULL -- Verifying output SELECT ManagerID 
, AVG(Performance) Performance 
FROM dbo.Employees 
WHERE ManagerID IS NOT NULL 
GROUP BY ManagerID       

self-join output 1      

Note:     
    The average needs to be calculated in the inline view. In case are retrieved more records for each record from the updated table, the query will still work though the result is “unpredictable”:
 


-- updating Manager's Performance w/o aggregates 
UPDATE dbo.Employees 
SET Performance = EMP.Performance 
FROM ( -- inline view 
     SELECT ManagerID 
     , Performance 
     FROM dbo.Employees 
     WHERE ManagerID IS NOT NULL 
) EMP WHERE dbo.Employees.ManagerID IS NULL 
AND dbo.Employees.EmployeeID = EMP.ManagerID       

      
    The reverse update based on the number of hours of training could be written as follows: 
 

-- updating Employees' Training Hours 
UPDATE dbo.Employees 
SET TrainingHours = 0.75 * EMP.TrainingHours 
FROM ( -- inline view 
    SELECT EmployeeID 
    , TrainingHours 
    FROM dbo.Employees 
   WHERE ManagerID IS NULL 
) EMP 
WHERE dbo.Employees.ManagerID IS NOT NULL 
AND dbo.Employees.ManagerID = EMP.EmployeeID       
      
-- Checking updated data 
SELECT * 
FROM dbo.Employees 
WHERE ManagerID IS NOT NULL -- Verifying output SELECT EmployeeID 
, 0.75 * TrainingHours TrainingHours 
FROM dbo.Employees 
WHERE ManagerID IS NULL   

  
self-join output 2     



28 July 2010

SQL Reloaded: Self-Join in Update Query I

    While reading R. Scheldon’s article on “UPDATE Basics in SQL Server” I remembered about a problem I had long time ago while attempting to do a self-join in an Update query. Such a query is quite useful in hierarchical structures consisting of one-level parent-child relations stored in the same table, when needed to update the parent based on child information, or vice-versa. The problem I had could be also exemplified by using a simple table (no hierarchical structure), here’s the query:  

-- self-join update - problematic query 
UPDATE Production.Product 
SET StandardCost = ITM.StandardCost* (1+.012) 
FROM Production.Product ITM 
WHERE Production.Product.ProductID = ITM.ProductID 

 The above statement returns the following error:
Msg 4104, Level 16, State 1, Line 5
The multi-part identifier "Production.Product.ProductID" could not be bound.

    In order to avoid this error message, the table from the FROM clause could be included in an inline view, something like:

-- self-join update - solution 
UPDATE Production.Product 
SET StandardCost = ITM.StandardCost*(1+.012) 
FROM ( -- inline view 
    SELECT * 
    FROM Production.Product ITM 
    ) ITM  
WHERE Production.Product.ProductID = ITM.ProductID 

    The inline view could be replaced with a standard view, table-valued UDF or the update could be done through a view. For exemplification I will use only the first and third case, here’s the view based on Production.Product table and the queries corresponding to the two cases:

-- view based on Production.Product 
CREATE VIEW Production.vProductTest 
AS 
SELECT *  
FROM Production.Product   

-- self-join update through view update 
UPDATE Production.vProductTest 
SET StandardCost = ITM.StandardCost* (1+.012) 
FROM Production.Product ITM 
WHERE Production.vProductTest.ProductID = ITM.ProductID     

-- self-join update from view 
UPDATE Production.Product 
SET StandardCost = ITM.StandardCost* (1+.012) 
FROM Production.vProductTest ITM 
WHERE Production.Product.ProductID = ITM.ProductID 

Note:
1.    The first update query was written in this way only to exemplify the self-update within an hierarchical structure, if it’s needed to modify the StandardCost and nothing more, then the update could be written simply in any of the following forms:

-- Query 1: update statement 
UPDATE Production.Product 
SET StandardCost = StandardCost* (1+.012) 

--Query 2: update statement 
UPDATE ITM 
SET StandardCost = StandardCost * (1+.012) 
FROM Production.Product ITM 

2.    Because the join constraint is based on a hierarchical structure that considers in the join different columns, the query can’t be written as follows:

 
--simple update statement 

UPDATE Production.Product 
SET StandardCost = ITM.StandardCost* (1+.012) 
FROM Production.Product ITM 
 
   This query is nothing more than an alternative to the queries shown in the first note.

15 September 2007

SQL Reloaded: Updating Data With Values From The Same Table

I have to deal with cases in which I have to update data from a table with values from the same table across records, a simple example I can sketch is the one of Standard Prices, which are normally valid during a year. For example if I would have to reset the Standard Prices from a year to the ones from the previous year, how this can be done? Actually that's pretty sample, for exemplification purposes I will create a table having only 3 columns, ProductID, Year and StandardPrice: 

-- creating the table 
CREATE TABLE dbo.StandardPrices( 
ProductID int 
, [Year] int 
, StandardPrice decimal(13,2) )      

--creating test data 
DECLARE @index int 
SET @index = 0 
  WHILE @index < 50 
  BEGIN 
      INSERT dbo.StandardPrices(ProductID, [Year], StandardPrice) 
      VALUES (@index % 25+1, 2000 + (@index/25), 100 * RAND()) 
      SET @index = @index + 1 
END     

Let's see the content, the RAND() function allows me to assign random prices to each product. As can be seen there are 25 products with Standard Prices for the Years 2000 and 2001.

 --testing StandardPrices table's content 
SELECT * 
FROM dbo.StandardPrices 
ORDER BY ProductID, [Year]     

Let's try to use a normal update: 

--first update attempt 
UPDATE dbo.StandardPrices 
SET StandardPrice = A.StandardPrice 
FROM StandardPrices A 
WHERE StandardPrices.ProductID = A.ProductID 
AND StandardPrices.[Year] = 2001 AND A.[Year] = 2000     

Ups, I got an error (SQL Server 2005):
Msg 4104, Level 16, State 1, Line 1 The multi-part identifier "StandardPrices.ProductID" could not be bound. Msg 4104, Level 16, State 1, Line 1 The multi-part identifier "StandardPrices.Year" could not be bound. 
SQL Server 2000:
Server: Msg 107, Level 16, State 3, Line 1 The column prefix 'StandardPrices' does not match with a table name oralias name used in the query. Server: Msg 107, Level 16, State 1, Line 1 The column prefix 'StandardPrices' does not match with a table name oralias name used in the query.     

It seems we can't use the same table in updates of this type, but we can use a subquery or view which refers to the same table, so we can write: 

  --the update 
UPDATE dbo.StandardPrices 
SET StandardPrice = A.StandardPrice 
FROM ( 
     SELECT ProductID , [Year] , StandardPrice 
     FROM dbo.StandardPrices 
     WHERE [Year] = 2000 
) A 
WHERE StandardPrices.ProductID = A.ProductID 
     AND StandardPrices.[Year] = 2001     

Now there are no errors, 25 records updated, let's see the output: 

--testing table's content after update 
SELECT * 
FROM dbo.StandardPrices 
ORDER BY ProductID
, [Year]
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.