
01 April 2021

💎SQL Reloaded: Processing JSON Files with Flat Matrix Structure in SQL Server 2016+

Besides the CSV format, many of the data files made available under open data initiatives are stored in JSON format. JSON offers a richer structure that goes beyond the tabular structure of CSV files, though this also makes the data more difficult to process. Fortunately, starting with SQL Server 2016, JSON is supported natively, which makes processing JSON files relatively easy, the actual effort depending on how the data are structured.

Let’s consider as example a JSON file with the world population per country and year that can be downloaded from DataHub (source). The structure resembles a tabular model (see the table on the source website), having a flat structure. Just export the data to a file with the JSON extension (e.g. ‘population-figures-by-country.json’) into a local folder (e.g. ‘D:/Data’). The next step is to understand the file’s structure. Some repositories provide good documentation in this respect, though there are also many exceptions. Having a JSON editor like Visual Studio that reveals the structure makes the process easier. 

As in the case of CSV files, one needs to infer the data types. There are two alphanumeric fields (Country & Country Code), while the remaining fields are numeric. The only challenge raised by the data seems to be the difference in format between the years 2002 to 2015 and the other years, as the values of the former contain a decimal part, even though all the numeric values should be whole values. 

It’s recommended to start small and build the logic iteratively. Therefore, for the first step just look at the file’s content via the OPENROWSET function:

-- looking at the JSON file 
SELECT *
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 

In a second step one can add the OPENJSON function by looking only at the first record: 

-- querying a json file (one record)
SELECT *
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$[0]')

In a third step one can add a few columns (e.g. Country & Country Code) to make sure that the select statement works correctly. 

-- querying a json file (all records, a few fields)
SELECT Country 
, CountryCode 
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$')
 WITH ( 
  Country nvarchar(max) '$.Country'
, CountryCode nvarchar(3) '$.Country_Code'
) AS DAT; 

In the next step one can add all the columns and import the data into a table (e.g. dbo.CountryPopulation) on the fly: 

-- importing a json file (all records) on the fly
SELECT DAT.Country
, DAT.CountryCode
, DAT.Y1960
, DAT.Y1961
, DAT.Y1962
, DAT.Y1963
, DAT.Y1964
, DAT.Y1965
, DAT.Y1966
, DAT.Y1967
, DAT.Y1968
, DAT.Y1969
, DAT.Y1970
, DAT.Y1971
, DAT.Y1972
, DAT.Y1973
, DAT.Y1974
, DAT.Y1975
, DAT.Y1976
, DAT.Y1977
, DAT.Y1978
, DAT.Y1979
, DAT.Y1980
, DAT.Y1981
, DAT.Y1982
, DAT.Y1983
, DAT.Y1984
, DAT.Y1985
, DAT.Y1986
, DAT.Y1987
, DAT.Y1988
, DAT.Y1989
, DAT.Y1990
, DAT.Y1991
, DAT.Y1992
, DAT.Y1993
, DAT.Y1994
, DAT.Y1995
, DAT.Y1996
, DAT.Y1997
, DAT.Y1998
, DAT.Y1999
, DAT.Y2000
, DAT.Y2001
, Cast(DAT.Y2002 as bigint) Y2002
, Cast(DAT.Y2003 as bigint) Y2003
, Cast(DAT.Y2004 as bigint) Y2004
, Cast(DAT.Y2005 as bigint) Y2005
, Cast(DAT.Y2006 as bigint) Y2006
, Cast(DAT.Y2007 as bigint) Y2007
, Cast(DAT.Y2008 as bigint) Y2008
, Cast(DAT.Y2009 as bigint) Y2009
, Cast(DAT.Y2010 as bigint) Y2010
, Cast(DAT.Y2011 as bigint) Y2011
, Cast(DAT.Y2012 as bigint) Y2012
, Cast(DAT.Y2013 as bigint) Y2013
, Cast(DAT.Y2014 as bigint) Y2014
, Cast(DAT.Y2015 as bigint) Y2015
, DAT.Y2016
INTO dbo.CountryPopulation
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$')
 WITH ( 
  Country nvarchar(max) '$.Country'
, CountryCode nvarchar(3) '$.Country_Code'
, Y1960 bigint '$.Year_1960'
, Y1961 bigint '$.Year_1961'
, Y1962 bigint '$.Year_1962'
, Y1963 bigint '$.Year_1963'
, Y1964 bigint '$.Year_1964'
, Y1965 bigint '$.Year_1965'
, Y1966 bigint '$.Year_1966'
, Y1967 bigint '$.Year_1967'
, Y1968 bigint '$.Year_1968'
, Y1969 bigint '$.Year_1969'
, Y1970 bigint '$.Year_1970'
, Y1971 bigint '$.Year_1971'
, Y1972 bigint '$.Year_1972'
, Y1973 bigint '$.Year_1973'
, Y1974 bigint '$.Year_1974'
, Y1975 bigint '$.Year_1975'
, Y1976 bigint '$.Year_1976'
, Y1977 bigint '$.Year_1977'
, Y1978 bigint '$.Year_1978'
, Y1979 bigint '$.Year_1979'
, Y1980 bigint '$.Year_1980'
, Y1981 bigint '$.Year_1981'
, Y1982 bigint '$.Year_1982'
, Y1983 bigint '$.Year_1983'
, Y1984 bigint '$.Year_1984'
, Y1985 bigint '$.Year_1985'
, Y1986 bigint '$.Year_1986'
, Y1987 bigint '$.Year_1987'
, Y1988 bigint '$.Year_1988'
, Y1989 bigint '$.Year_1989'
, Y1990 bigint '$.Year_1990'
, Y1991 bigint '$.Year_1991'
, Y1992 bigint '$.Year_1992'
, Y1993 bigint '$.Year_1993'
, Y1994 bigint '$.Year_1994'
, Y1995 bigint '$.Year_1995'
, Y1996 bigint '$.Year_1996'
, Y1997 bigint '$.Year_1997'
, Y1998 bigint '$.Year_1998'
, Y1999 bigint '$.Year_1999'
, Y2000 bigint '$.Year_2000'
, Y2001 bigint '$.Year_2001'
, Y2002 decimal(19,1) '$.Year_2002'
, Y2003 decimal(19,1) '$.Year_2003'
, Y2004 decimal(19,1) '$.Year_2004'
, Y2005 decimal(19,1) '$.Year_2005'
, Y2006 decimal(19,1) '$.Year_2006'
, Y2007 decimal(19,1) '$.Year_2007'
, Y2008 decimal(19,1) '$.Year_2008'
, Y2009 decimal(19,1) '$.Year_2009'
, Y2010 decimal(19,1) '$.Year_2010'
, Y2011 decimal(19,1) '$.Year_2011'
, Y2012 decimal(19,1) '$.Year_2012'
, Y2013 decimal(19,1) '$.Year_2013'
, Y2014 decimal(19,1) '$.Year_2014'
, Y2015 decimal(19,1) '$.Year_2015'
, Y2016 bigint '$.Year_2016'
) AS DAT; 

As can be seen, the decimal values were converted to bigint so that all year columns share the same definition. Moreover, this eases data processing later, as no additional (implicit) conversions are necessary. 
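Before relying on such a conversion, it can be worth checking that no information is lost in the process. A minimal sketch of such a check, considering only two of the affected years:

-- checking for fractional parts in the source values (sketch, only two years considered)
SELECT count(*) NoRecords
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$')
 WITH ( 
  Y2002 decimal(19,1) '$.Year_2002'
, Y2015 decimal(19,1) '$.Year_2015'
) AS DAT
WHERE DAT.Y2002 <> Floor(DAT.Y2002)
   OR DAT.Y2015 <> Floor(DAT.Y2015)

A result of 0 records means that the cast to bigint loses nothing.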

Also, the columns were renamed, either for simplification/convenience or simply as a matter of taste. 

Writing such a monster query can be time-consuming, though preparing the metadata in Excel can considerably decrease the effort. With copy-paste and a few tricks (e.g. replacing values, splitting columns based on a delimiter) one can easily prepare a structure like the following:

Source field | Target field | DataType | Value | Import Clause | Select Clause
Country | Country | nvarchar(max) | Yemen, Rep. | , Country nvarchar(max) '$.Country' | , DAT.Country
Country_Code | CountryCode | nvarchar(3) | YEM | , CountryCode nvarchar(3) '$.Country_Code' | , DAT.CountryCode
Year_1960 | Y1960 | bigint | 5172135 | , Y1960 bigint '$.Year_1960' | , DAT.Y1960
Year_1961 | Y1961 | bigint | 5260501 | , Y1961 bigint '$.Year_1961' | , DAT.Y1961
Year_1962 | Y1962 | bigint | 5351799 | , Y1962 bigint '$.Year_1962' | , DAT.Y1962
Year_1963 | Y1963 | bigint | 5446063 | , Y1963 bigint '$.Year_1963' | , DAT.Y1963
Year_1964 | Y1964 | bigint | 5543339 | , Y1964 bigint '$.Year_1964' | , DAT.Y1964
Year_1965 | Y1965 | bigint | 5643643 | , Y1965 bigint '$.Year_1965' | , DAT.Y1965
Year_1966 | Y1966 | bigint | 5748588 | , Y1966 bigint '$.Year_1966' | , DAT.Y1966
Year_1967 | Y1967 | bigint | 5858638 | , Y1967 bigint '$.Year_1967' | , DAT.Y1967
Year_1968 | Y1968 | bigint | 5971407 | , Y1968 bigint '$.Year_1968' | , DAT.Y1968
Year_1969 | Y1969 | bigint | 6083619 | , Y1969 bigint '$.Year_1969' | , DAT.Y1969
Year_1970 | Y1970 | bigint | 6193810 | , Y1970 bigint '$.Year_1970' | , DAT.Y1970
Year_1971 | Y1971 | bigint | 6300554 | , Y1971 bigint '$.Year_1971' | , DAT.Y1971
Year_1972 | Y1972 | bigint | 6407295 | , Y1972 bigint '$.Year_1972' | , DAT.Y1972
Year_1973 | Y1973 | bigint | 6523452 | , Y1973 bigint '$.Year_1973' | , DAT.Y1973
Year_1974 | Y1974 | bigint | 6661566 | , Y1974 bigint '$.Year_1974' | , DAT.Y1974
Year_1975 | Y1975 | bigint | 6830692 | , Y1975 bigint '$.Year_1975' | , DAT.Y1975
Year_1976 | Y1976 | bigint | 7034868 | , Y1976 bigint '$.Year_1976' | , DAT.Y1976
Year_1977 | Y1977 | bigint | 7271872 | , Y1977 bigint '$.Year_1977' | , DAT.Y1977
Year_1978 | Y1978 | bigint | 7536764 | , Y1978 bigint '$.Year_1978' | , DAT.Y1978
Year_1979 | Y1979 | bigint | 7821552 | , Y1979 bigint '$.Year_1979' | , DAT.Y1979
Year_1980 | Y1980 | bigint | 8120497 | , Y1980 bigint '$.Year_1980' | , DAT.Y1980
Year_1981 | Y1981 | bigint | 8434017 | , Y1981 bigint '$.Year_1981' | , DAT.Y1981
Year_1982 | Y1982 | bigint | 8764621 | , Y1982 bigint '$.Year_1982' | , DAT.Y1982
Year_1983 | Y1983 | bigint | 9111097 | , Y1983 bigint '$.Year_1983' | , DAT.Y1983
Year_1984 | Y1984 | bigint | 9472170 | , Y1984 bigint '$.Year_1984' | , DAT.Y1984
Year_1985 | Y1985 | bigint | 9847899 | , Y1985 bigint '$.Year_1985' | , DAT.Y1985
Year_1986 | Y1986 | bigint | 10232733 | , Y1986 bigint '$.Year_1986' | , DAT.Y1986
Year_1987 | Y1987 | bigint | 10628585 | , Y1987 bigint '$.Year_1987' | , DAT.Y1987
Year_1988 | Y1988 | bigint | 11051504 | , Y1988 bigint '$.Year_1988' | , DAT.Y1988
Year_1989 | Y1989 | bigint | 11523267 | , Y1989 bigint '$.Year_1989' | , DAT.Y1989
Year_1990 | Y1990 | bigint | 12057039 | , Y1990 bigint '$.Year_1990' | , DAT.Y1990
Year_1991 | Y1991 | bigint | 12661614 | , Y1991 bigint '$.Year_1991' | , DAT.Y1991
Year_1992 | Y1992 | bigint | 13325583 | , Y1992 bigint '$.Year_1992' | , DAT.Y1992
Year_1993 | Y1993 | bigint | 14017239 | , Y1993 bigint '$.Year_1993' | , DAT.Y1993
Year_1994 | Y1994 | bigint | 14692686 | , Y1994 bigint '$.Year_1994' | , DAT.Y1994
Year_1995 | Y1995 | bigint | 15320653 | , Y1995 bigint '$.Year_1995' | , DAT.Y1995
Year_1996 | Y1996 | bigint | 15889449 | , Y1996 bigint '$.Year_1996' | , DAT.Y1996
Year_1997 | Y1997 | bigint | 16408954 | , Y1997 bigint '$.Year_1997' | , DAT.Y1997
Year_1998 | Y1998 | bigint | 16896210 | , Y1998 bigint '$.Year_1998' | , DAT.Y1998
Year_1999 | Y1999 | bigint | 17378098 | , Y1999 bigint '$.Year_1999' | , DAT.Y1999
Year_2000 | Y2000 | bigint | 17874725 | , Y2000 bigint '$.Year_2000' | , DAT.Y2000
Year_2001 | Y2001 | bigint | 18390135 | , Y2001 bigint '$.Year_2001' | , DAT.Y2001
Year_2002 | Y2002 | decimal(19,1) | 18919179.0 | , Y2002 decimal(19,1) '$.Year_2002' | , Cast(DAT.Y2002 as bigint) Y2002
Year_2003 | Y2003 | decimal(19,1) | 19462086.0 | , Y2003 decimal(19,1) '$.Year_2003' | , Cast(DAT.Y2003 as bigint) Y2003
Year_2004 | Y2004 | decimal(19,1) | 20017068.0 | , Y2004 decimal(19,1) '$.Year_2004' | , Cast(DAT.Y2004 as bigint) Y2004
Year_2005 | Y2005 | decimal(19,1) | 20582927.0 | , Y2005 decimal(19,1) '$.Year_2005' | , Cast(DAT.Y2005 as bigint) Y2005
Year_2006 | Y2006 | decimal(19,1) | 21160534.0 | , Y2006 decimal(19,1) '$.Year_2006' | , Cast(DAT.Y2006 as bigint) Y2006
Year_2007 | Y2007 | decimal(19,1) | 21751605.0 | , Y2007 decimal(19,1) '$.Year_2007' | , Cast(DAT.Y2007 as bigint) Y2007
Year_2008 | Y2008 | decimal(19,1) | 22356391.0 | , Y2008 decimal(19,1) '$.Year_2008' | , Cast(DAT.Y2008 as bigint) Y2008
Year_2009 | Y2009 | decimal(19,1) | 22974929.0 | , Y2009 decimal(19,1) '$.Year_2009' | , Cast(DAT.Y2009 as bigint) Y2009
Year_2010 | Y2010 | decimal(19,1) | 23606779.0 | , Y2010 decimal(19,1) '$.Year_2010' | , Cast(DAT.Y2010 as bigint) Y2010
Year_2011 | Y2011 | decimal(19,1) | 24252206.0 | , Y2011 decimal(19,1) '$.Year_2011' | , Cast(DAT.Y2011 as bigint) Y2011
Year_2012 | Y2012 | decimal(19,1) | 24909969.0 | , Y2012 decimal(19,1) '$.Year_2012' | , Cast(DAT.Y2012 as bigint) Y2012
Year_2013 | Y2013 | decimal(19,1) | 25576322.0 | , Y2013 decimal(19,1) '$.Year_2013' | , Cast(DAT.Y2013 as bigint) Y2013
Year_2014 | Y2014 | decimal(19,1) | 26246327.0 | , Y2014 decimal(19,1) '$.Year_2014' | , Cast(DAT.Y2014 as bigint) Y2014
Year_2015 | Y2015 | decimal(19,1) | 26916207.0 | , Y2015 decimal(19,1) '$.Year_2015' | , Cast(DAT.Y2015 as bigint) Y2015
Year_2016 | Y2016 | bigint | 27584213 | , Y2016 bigint '$.Year_2016' | , DAT.Y2016

Based on this structure, one can add two further formulas in Excel to prepare the statements as above and then copy the resulting text (the last two columns were generated with the formulas below): 

=", " & TRIM(B2) & " " & C2 & " '$." & TRIM(A2) & "'" 
=", DAT." & TRIM(B2)

Consuming data stored in a matrix structure can involve further challenges, even if this type of storage tends to save space. For example, adding the values for a new year requires extending the table with one more column, while performing calculations between years requires referencing each column explicitly. Therefore, transforming the data from a matrix to a normalized structure can be beneficial. This can be achieved with the UNPIVOT operator:

-- unpivoting the data 
SELECT RES.Country
, RES.CountryCode
, Cast(Replace(RES.[Year], 'Y', '') as int) [Year]
, RES.Population
--INTO dbo.CountryPopulationPerYear
FROM 
( -- basis data
	SELECT Country
	, CountryCode
	, Y1960, Y1961, Y1962, Y1963, Y1964, Y1965, Y1966, Y1967, Y1968, Y1969
	, Y1970, Y1971, Y1972, Y1973, Y1974, Y1975, Y1976, Y1977, Y1978, Y1979
	, Y1980, Y1981, Y1982, Y1983, Y1984, Y1985, Y1986, Y1987, Y1988, Y1989
	, Y1990, Y1991, Y1992, Y1993, Y1994, Y1995, Y1996, Y1997, Y1998, Y1999
	, Y2000, Y2001, Y2002, Y2003, Y2004, Y2005, Y2006, Y2007, Y2008, Y2009
	, Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016
	FROM dbo.CountryPopulation
) DAT
UNPIVOT  -- unpivot logic
   (Population FOR [Year] IN  (Y1960, Y1961, Y1962, Y1963, Y1964, Y1965, Y1966, Y1967, Y1968, Y1969
, Y1970, Y1971, Y1972, Y1973, Y1974, Y1975, Y1976, Y1977, Y1978, Y1979
, Y1980, Y1981, Y1982, Y1983, Y1984, Y1985, Y1986, Y1987, Y1988, Y1989
, Y1990, Y1991, Y1992, Y1993, Y1994, Y1995, Y1996, Y1997, Y1998, Y1999
, Y2000, Y2001, Y2002, Y2003, Y2004, Y2005, Y2006, Y2007, Y2008, Y2009
, Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016)
) RES

This can also be performed in two steps: first preparing the query, and in a final step inserting the data into a table (e.g. dbo.CountryPopulationPerYear) on the fly (re-execute the previous query after uncommenting the INTO clause to generate the table). 

--reviewing the data 
SELECT Country
, CountryCode
, AVG(Population) AveragePopulation
, Max(Population) - Min(Population) RangePopulation
FROM dbo.CountryPopulationPerYear
WHERE [Year] BETWEEN 2010 AND 2019
GROUP BY Country
, CountryCode
ORDER BY Country

On the other hand, making comparisons between consecutive years is easier when using the matrix structure: 

-- comparing consecutive years 
SELECT Country
, CountryCode
, Y2016
, Y2010
, Y2016-Y2010 [2016-2010]
, Y2011-Y2010 [2011-2010]
, Y2012-Y2011 [2012-2011]
, Y2013-Y2012 [2013-2012]
, Y2014-Y2013 [2014-2013]
, Y2015-Y2014 [2015-2014]
, Y2016-Y2015 [2016-2015]
FROM dbo.CountryPopulation
ORDER BY Country

Unless storage space is a problem, one can keep the data in both formats, as some requests can benefit from one structure, some from the other. 
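For instance, if needed, the matrix structure can be rebuilt from the normalized table via the PIVOT operator. A minimal sketch, considering only the last few years:

-- pivoting the normalized data back into a matrix structure (sketch, a few years only)
SELECT PVT.Country
, PVT.CountryCode
, PVT.[2014] Y2014
, PVT.[2015] Y2015
, PVT.[2016] Y2016
FROM 
( -- basis data
	SELECT Country
	, CountryCode
	, [Year]
	, Population
	FROM dbo.CountryPopulationPerYear
) DAT
PIVOT -- pivot logic
   (Sum(Population) FOR [Year] IN ([2014], [2015], [2016])
) PVT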

Happy coding!

09 February 2018

🔬Data Science: Normalization (Definitions)

"Mathematical transformations to generate a new set of values that map onto a different range." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

[Min–max normalization:] "Normalizing a variable value to a predetermine range." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

[function point normalization:] "Dividing a metric by the project’s function points to allow you to compare projects of different sizes and complexities." (Rod Stephens, "Beginning Software Engineering", 2015)

"For metrics, performing some calculation on a metric to account for possible differences in project size or complexity. Two general approaches are size normalization and function point normalization." (Rod Stephens, "Beginning Software Engineering", 2015)

[size normalization:] "For metrics, dividing a metric by an indicator of size such as lines of code or days of work. For example, bugs/KLOC tells you how buggy the code is normalized for the size of the project." (Rod Stephens, "Beginning Software Engineering", 2015)
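As an illustration of the min-max flavor, a value x is rescaled to (x - min)/(max - min). A minimal sketch in T-SQL, assuming a hypothetical dbo.Measurements table with a numeric Value column:

-- min-max normalization over a hypothetical dbo.Measurements table (sketch)
SELECT M.Value
, (M.Value - AGG.MinValue) * 1.0 / NullIf(AGG.MaxValue - AGG.MinValue, 0) NormalizedValue
FROM dbo.Measurements M
     CROSS JOIN ( -- range of the values
	SELECT Min(Value) MinValue
	, Max(Value) MaxValue
	FROM dbo.Measurements
     ) AGG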


31 July 2010

💎SQL Reloaded: Self-Joins and Denormalized Data Loading in Normalized Models

    One of the scenarios in which I often make use of self-joins is when loading denormalized data into a normalized data model. A characteristic of not-normalized or denormalized data is that it contains repeating data, typically the so-called header data, which need to be handled specifically.

Note:
   'Header data' is not said properly, because an entity could contain more than 2 levels of data. For example, the Purchase Orders (POs) in an ERP system could be split into PO Headers, Lines, Shipments and Distributions, thus a (denormalized) extract based on the data at Distribution level will contain repeating data from all the higher levels.

  So, for this post I needed to find an entity with a parent-child or header-lines structure. Actually, it’s not difficult to find such an example; the world of ERP systems is full of them – POs, Invoices, Customer Orders, Receipts, Payments, to mention some of the important transactional data, or Customers, Vendors and even Items, when considering the master data. The difficulty is to simplify the example to a level that can be easily understood also by people who have had less contact with ERP systems or database models. For this I will consider the receipts we get when paying for goods in a (super-)market: on such a bill we can typically see the Receipt Number, the name of the Vendor from which the purchase was made, the Receipt Date, and the Items purchased, together with the Quantity and Price. Such information could easily be entered in Excel and later loaded into a denormalized table, or entered directly in the respective denormalized table:

-- Receipts denormalized table 
CREATE TABLE [dbo].[Receipts]( 
[ReceiptNumber] [nvarchar](50) NULL, 
[Vendor] [nvarchar](100) NULL, 
[ReceiptDate] [smalldatetime] NULL, 
[CurrencyCode] [nvarchar](3) NULL, 
[ItemNumber] [nvarchar] (50) NULL,  
[Quantity] [decimal](12, 3) NULL, 
[Price] [numeric](12, 2) NULL 
) 
ON [PRIMARY] 

-- inserting test data 
INSERT INTO dbo.Receipts 
VALUES ('012034', 'TOOM', '2010-06-23', 'EUR', 'KABANOS PARIKA', 1, 2.19) 
, ('012034', 'TOOM', '2010-06-23', 'EUR', 'ZITRONE', 1, 0.79) 
, ('012034', 'TOOM', '2010-06-23', 'EUR', 'BREAKFAST BAKON', 1, 1.59) 
, ('7899998', 'KAUFHOFF', '2010-06-22', 'EUR', 'PILLOW', 1, 23.99) 
, ('7899998', 'KAUFHOFF', '2010-06-22', 'EUR', 'BED SHEET', 2, 11.99) 
 
-- checking the data 
SELECT * 
FROM dbo.Receipts  

   Supposing we have the above data and would like to load them into a normalized structure formed from the Header information – Receipt Number, Vendor, Receipt Date and Currency Code – and the Line information – Item Number, Quantity and Price:

-- Receipt Headers (normalized) 
CREATE TABLE [dbo].[ReceiptHeaders]( 
[ReceiptHeaderID] [int]IDENTITY(1,1) NOT NULL, 
[ReceiptNumber] [nvarchar](50) NULL, 
[Vendor] [nvarchar](100) NULL, 
[ReceiptDate] [smalldatetime]NULL, 
[CurrencyCode] [nvarchar](3) NULL 
) ON [PRIMARY]  

-- Receipt Lines (normalized) 
CREATE TABLE [dbo].[ReceiptLines]( 
[ReceiptLineID] [int]IDENTITY(1,1) NOT NULL, 
[ReceiptHeaderID] int NOT NULL, 
[ItemNumber] [nvarchar] (50) NULL, 
[Quantity] [decimal] (12, 3) NULL, 
[Price] [numeric](12, 2) NULL 
) ON [PRIMARY]

   In order to load the denormalized data into the normalized structure we can write two queries: the first populates the ReceiptHeaders table by selecting the distinct attributes that make up the header, while the second populates the ReceiptLines table:

-- inserting the Receipt Header data 
INSERT INTO dbo.ReceiptHeaders 
SELECT DISTINCT R.ReceiptNumber  
, R.Vendor  
, R.ReceiptDate  
, R.CurrencyCode 
FROM dbo.Receipts R 

-- inserting the Receipt Lines data 
INSERT INTO dbo.ReceiptLines 
SELECT RH.ReceiptHeaderID 
, R.ItemNumber  
, SUM(R.Quantity) Quantity 
, R.Price  
FROM dbo.Receipts R 
     JOIN dbo.ReceiptHeaders RH 
      ON R.ReceiptNumber = RH.ReceiptNumber 
     AND R.Vendor = RH.Vendor 
GROUP BY RH.ReceiptHeaderID 
, R.ItemNumber  
, R.Price  

    As can be seen from the second query, the Receipts table was joined with ReceiptHeaders in order to retrieve the corresponding header record for each line. For this to be possible we need an attribute or combination of attributes that is unique across the header data, in this case the Receipt Number in combination with the Vendor. If no such unique combination exists, then the match between header and line data will produce duplicates; in such a scenario it’s recommended to clean the data before loading them, for example by introducing an attribute that makes the combination unique.

    The same uniqueness requirement applies to the lines too, as it must be possible to identify each line uniquely in the source dataset. This could be done for example by introducing a Line Number in the source dataset or, as in this case, by aggregating the quantities for the lines that share the same information, the Price being moved into the GROUP BY clause, as in theory the products with the same Item Number bought at the same time have the same price (that doesn’t necessarily happen in reality, though for our exemplification it will do).
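A minimal sketch of how such a Line Number could be derived directly from the source data (assuming the order of the lines within a receipt doesn't matter):

-- deriving a Line Number for the source data (sketch) 
SELECT R.ReceiptNumber 
, R.Vendor 
, R.ItemNumber 
, R.Quantity 
, R.Price 
, ROW_NUMBER() OVER (PARTITION BY R.ReceiptNumber, R.Vendor ORDER BY R.ItemNumber) LineNumber 
FROM dbo.Receipts R 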

    After inserting the data into the normalized model it makes sense to check the logic by comparing the inserted data against the initial dataset. It’s always a good idea to do that, as it is a way to catch errors in the logic. In Excel for SQL Developers – Part IV: Differences Between Two Datasets I showed how the query for comparing two datasets can be created in a semiautomatic manner, together with the resulting query. A similar query can be written in this case too, for which purpose it is useful to create a view that denormalizes our structure:  

-- Receipts View 
CREATE VIEW dbo.vReceipts 
AS 
SELECT RL.ReceiptLineID  
, RL.ReceiptHeaderID 
, RH.ReceiptNumber  
, RH.Vendor  
, RH.ReceiptDate  
, RH.CurrencyCode  
, RL.ItemNumber  
, RL.Quantity  
, RL.Price  
FROM dbo.ReceiptLines RL 
    JOIN dbo.ReceiptHeaders RH 
      ON RL.ReceiptHeaderID = RH.ReceiptHeaderID  

-- testing the view & updates 
SELECT * 
FROM dbo.vReceipts 
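Based on the view, a minimal sketch of such a comparison (the quantities are aggregated on both sides to match the loaded lines; any record returned indicates a difference, and swapping the two halves checks the other direction):

-- comparing the denormalized source against the normalized structure (sketch) 
SELECT ReceiptNumber 
, Vendor 
, CurrencyCode 
, ItemNumber 
, Price 
, SUM(Quantity) Quantity 
FROM dbo.Receipts 
GROUP BY ReceiptNumber 
, Vendor 
, CurrencyCode 
, ItemNumber 
, Price 
EXCEPT 
SELECT ReceiptNumber 
, Vendor 
, CurrencyCode 
, ItemNumber 
, Price 
, SUM(Quantity) Quantity 
FROM dbo.vReceipts 
GROUP BY ReceiptNumber 
, Vendor 
, CurrencyCode 
, ItemNumber 
, Price 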
 
[image: self-join normalization]
    Until now we managed to load the data from a denormalized structure, but there is no trace of a self-join yet! In many cases writing queries similar to the above ones is enough, though there are many cases in which the data need to be loaded successively, either incrementally or as complete datasets. In both situations we could be dealing with data already loaded, so we have to avoid inserting duplicates. This is where the self-join comes into play, because in both insert queries we have to exclude the records already loaded. Even if we deal with incremental data that form partitions (each record is provided only once), it’s safer and recommended to check for records provided again. So, we’ll have to modify the above two inserts to ignore the records already loaded:

-- inserting the Receipt Header data (with checking for loaded data) 
INSERT INTO dbo.ReceiptHeaders 
SELECT DISTINCT R.ReceiptNumber  
, R.Vendor  
, R.ReceiptDate  
, R.CurrencyCode 
FROM dbo.Receipts R 
     LEFT JOIN dbo.ReceiptHeaders RH 
       ON R.ReceiptNumber = RH.ReceiptNumber 
     AND R.Vendor = RH.Vendor 
WHERE RH.ReceiptNumber IS NULL 

-- inserting the Receipt Lines data (with checking for loaded data) 
INSERT INTO dbo.ReceiptLines 
SELECT RH.ReceiptHeaderID 
, R.ItemNumber  
, SUM(R.Quantity) Quantity 
, R.Price  
FROM dbo.Receipts R 
     JOIN dbo.ReceiptHeaders RH 
      ON R.ReceiptNumber = RH.ReceiptNumber 
     AND R.Vendor = RH.Vendor 
     LEFT JOIN dbo.vReceipts VR 
       ON R.ReceiptNumber = VR.ReceiptNumber 
     AND R.Vendor = VR.Vendor 
    AND R.ItemNumber = VR.ItemNumber 
WHERE VR.ReceiptNumber IS NULL 
GROUP BY RH.ReceiptHeaderID 
, R.ItemNumber  
, R.Price 

    I hope the queries are easy to understand; if not, it might be a good idea to check the posts on left joins. In order to test the queries, let’s insert new data into the Receipts table:

-- inserting new test data 
INSERT INTO dbo.Receipts 
VALUES ('012455', 'TOOM', '2010-07-25', 'EUR', 'KABANOS PARIKA', 1, 2.20) 
, ('012455', 'TOOM', '2010-07-25', 'EUR', 'ZITRONE', 1, 0.79) 

    After running the three inserts, the same data should be available in the input (denormalized) and in the normalized structures:

-- testing the updates 
SELECT * 
FROM dbo.vReceipts 
 
[image: self-join normalization incremental]
Notes:
    This post didn’t consider the case in which the data are modified between loads; besides the above inserts, two update statements would also have to be written in order to reflect the changes occurred in the already loaded data. I will try to approach this subject in another post.
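Until then, just a minimal sketch of what such an update could look like for the header data (assuming that Receipt Number and Vendor identify the record):

-- updating already loaded Receipt Header data (sketch) 
UPDATE RH 
SET RH.ReceiptDate = R.ReceiptDate 
, RH.CurrencyCode = R.CurrencyCode 
FROM dbo.ReceiptHeaders RH 
     JOIN dbo.Receipts R 
       ON RH.ReceiptNumber = R.ReceiptNumber 
      AND RH.Vendor = R.Vendor 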

09 July 2009

🛢DBMS: Fifth Normal Form (Definitions)

"A table is in 5NF if it is in 4NF and contains no related multi-valued dependencies." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"A table is in fifth normal form (5NF) if and only if there are no possible lossless decompositions into any subset of tables; in other words, if there is no possible lossless decomposition, then the table is in 5NF." (Toby J Teorey, ", Database Modeling and Design" 4th Ed., 2010)

"Specifies that every join dependency for the entity must be a consequence of its candidate keys." (Craig S Mullins, "Database Administration", 2012)

08 July 2009

🛢DBMS: Fourth Normal Form (Definitions)

"A relation is in fourth normal form if it is in BCNF and contains no multivalued dependencies." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"A table is in 4NF if it is in BCNF and contains no unrelated multi-valued dependencies." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"A table is in fourth normal form (4NF) if and only if it is at least in BCNF and if whenever there exists a nontrivial multivalued dependency of the form X->>Y, then X must be a superkey in the table." (Toby J Teorey, ", Database Modeling and Design 4th Ed", 2010)

"In relational theory, the fourth of Dr. Codd’s constraints on a relational design: No column within a primary key may be completely dependent on another column within the same primary key." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"A table is in 4NF when it is in 3NF and contains no multiple independent sets of multivalued dependencies." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"An entity is in fourth normal form (4NF) if and only if it is in 3NF and has no multiple sets of multivalued dependencies." (Craig S Mullins, "Database Administration", 2012)

07 July 2009

🛢DBMS: Third Normal Form (Definitions)

"Table data that complies with both the first and second normal forms and directly relates to each rows primary key. See also first normal form; second normal form." (Robert D Schneider & Darril Gibson, "Microsoft SQL Server 2008 All-in-One Desk Reference For Dummies", 2008)

"In relational theory, the third of Dr. Codd’s constraints on a relational design: Each attribute must depend only on the primary key." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Database design approach that eliminates redundancy and therefore facilitates insertion of new rows into tables in an OLTP application without introducing excessive data locking problems.  Sometimes referred to as normalized." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"A level of normalization in which all attributes in a table are fully dependent on its entire key. Third normal form is widely accepted as the optimal design for a transaction system. A schema in third normal form is often referred to as fully normalized, although there are actually additional degrees of normalization possible." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A table is in 3NF if it is in 2NF and it contains no transitive dependencies." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"A table is in third normal form (3NF) if and only if for every functional dependency X->A, where X and A are either simple or composite attributes (data items), either X must be a superkey or A must be a member attribute of a candidate key in that table." (Toby J Teorey, ", Database Modeling and Design 4th Ed", 2010)

"A table is in 3NF when it is in 2NF and no non-key attribute is functionally dependent on another non-key attribute; that is, it cannot include transitive dependencies." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management 9th Ed", 2011)

"An entity is in third normal form if and only if it is in second normal form and every non-key attribute is non-transitively dependent on the primary key." (Craig S Mullins, "Database Administration", 2012)
