Besides CSV, many of the data files made available under open data initiatives are stored in JSON format. JSON offers a richer structure that goes beyond the tabular layout of CSV files, though this also makes the data more difficult to process. Fortunately, starting with SQL Server 2016, JSON is supported natively, which makes the processing of JSON files relatively easy, with the actual effort depending on how the data are structured.
Let’s consider as example a JSON file with the world population per country and year, which can be downloaded from DataHub (source). Its structure is flat and resembles a tabular model (see the table on the source website). Export the data to a local file with the JSON extension (e.g. ‘population-figures-by-country.json’ under ‘D:/Data’). The next step is to understand the file’s structure. Some repositories provide good documentation in this respect, though there are also many exceptions. An editor that reveals the JSON structure, such as Visual Studio, makes the process easier.
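For orientation, a record in the file presumably looks like the one below (abbreviated to a few fields; the values are taken from the metadata table further down), and such a fragment can be parsed directly with the native OPENJSON function, as in this minimal sketch:

-- a minimal sketch: parsing one abbreviated record
DECLARE @json nvarchar(max) = N'[{"Country":"Yemen Rep.","Country_Code":"YEM","Year_1960":5172135,"Year_2016":27584213}]';

SELECT DAT.Country
, DAT.CountryCode
, DAT.Y1960
, DAT.Y2016
FROM OPENJSON(@json)
WITH (
  Country nvarchar(max) '$.Country'
, CountryCode nvarchar(3) '$.Country_Code'
, Y1960 bigint '$.Year_1960'
, Y2016 bigint '$.Year_2016'
) AS DAT;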
As with CSV files, one needs to infer the data types. There are two alphanumeric fields (Country & Country Code), while the remaining fields are numeric. The only challenge raised by the data seems to be the difference in format between the years 2002 to 2015 and the other years: the former are stored with one decimal place, even though all the population figures should be whole numbers.
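To see why the affected values can’t be read directly as bigint, consider the following sketch (TRY_CAST is available since SQL Server 2012); the double cast via decimal is the workaround used in the import further below:

-- the decimal suffix prevents a direct conversion to bigint
SELECT TRY_CAST('18919179.0' as bigint) DirectCast -- returns NULL
, Cast(Cast('18919179.0' as decimal(19,1)) as bigint) DoubleCast -- returns 18919179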
It’s recommended to start small and build the logic iteratively. As a first step, just look at the file’s content via the OPENROWSET function:
-- looking at the JSON file
SELECT *
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB) as jsonfile
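Optionally, one can validate upfront that the file contains well-formed JSON, e.g. via the ISJSON function:

-- validating the file's content (1 = well-formed JSON)
SELECT ISJSON(BulkColumn) IsValidJson
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB) as jsonfile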
In a second step one can add the OPENJSON function, looking only at the first record:
-- querying a json file (one record)
SELECT *
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB) as jsonfile
CROSS APPLY OPENJSON(BulkColumn,'$[0]')
In a third step one can add a few columns (e.g. Country & Country Code) to make sure that the select statement works correctly.
-- querying a json file (all records, a few fields)
SELECT Country
, CountryCode
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB) as jsonfile
CROSS APPLY OPENJSON(BulkColumn,'$')
WITH (
  Country nvarchar(max) '$.Country'
, CountryCode nvarchar(3) '$.Country_Code'
) AS DAT;
In the next step one can add all the columns and import the data into a table (e.g. dbo.CountryPopulation) on the fly:
-- importing a json file (all records) on the fly
SELECT DAT.Country
, DAT.CountryCode
, DAT.Y1960
, DAT.Y1961
, DAT.Y1962
, DAT.Y1963
, DAT.Y1964
, DAT.Y1965
, DAT.Y1966
, DAT.Y1967
, DAT.Y1968
, DAT.Y1969
, DAT.Y1970
, DAT.Y1971
, DAT.Y1972
, DAT.Y1973
, DAT.Y1974
, DAT.Y1975
, DAT.Y1976
, DAT.Y1977
, DAT.Y1978
, DAT.Y1979
, DAT.Y1980
, DAT.Y1981
, DAT.Y1982
, DAT.Y1983
, DAT.Y1984
, DAT.Y1985
, DAT.Y1986
, DAT.Y1987
, DAT.Y1988
, DAT.Y1989
, DAT.Y1990
, DAT.Y1991
, DAT.Y1992
, DAT.Y1993
, DAT.Y1994
, DAT.Y1995
, DAT.Y1996
, DAT.Y1997
, DAT.Y1998
, DAT.Y1999
, DAT.Y2000
, DAT.Y2001
, Cast(DAT.Y2002 as bigint) Y2002
, Cast(DAT.Y2003 as bigint) Y2003
, Cast(DAT.Y2004 as bigint) Y2004
, Cast(DAT.Y2005 as bigint) Y2005
, Cast(DAT.Y2006 as bigint) Y2006
, Cast(DAT.Y2007 as bigint) Y2007
, Cast(DAT.Y2008 as bigint) Y2008
, Cast(DAT.Y2009 as bigint) Y2009
, Cast(DAT.Y2010 as bigint) Y2010
, Cast(DAT.Y2011 as bigint) Y2011
, Cast(DAT.Y2012 as bigint) Y2012
, Cast(DAT.Y2013 as bigint) Y2013
, Cast(DAT.Y2014 as bigint) Y2014
, Cast(DAT.Y2015 as bigint) Y2015
, DAT.Y2016
INTO dbo.CountryPopulation
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB) as jsonfile
CROSS APPLY OPENJSON(BulkColumn,'$')
WITH (
  Country nvarchar(max) '$.Country'
, CountryCode nvarchar(3) '$.Country_Code'
, Y1960 bigint '$.Year_1960'
, Y1961 bigint '$.Year_1961'
, Y1962 bigint '$.Year_1962'
, Y1963 bigint '$.Year_1963'
, Y1964 bigint '$.Year_1964'
, Y1965 bigint '$.Year_1965'
, Y1966 bigint '$.Year_1966'
, Y1967 bigint '$.Year_1967'
, Y1968 bigint '$.Year_1968'
, Y1969 bigint '$.Year_1969'
, Y1970 bigint '$.Year_1970'
, Y1971 bigint '$.Year_1971'
, Y1972 bigint '$.Year_1972'
, Y1973 bigint '$.Year_1973'
, Y1974 bigint '$.Year_1974'
, Y1975 bigint '$.Year_1975'
, Y1976 bigint '$.Year_1976'
, Y1977 bigint '$.Year_1977'
, Y1978 bigint '$.Year_1978'
, Y1979 bigint '$.Year_1979'
, Y1980 bigint '$.Year_1980'
, Y1981 bigint '$.Year_1981'
, Y1982 bigint '$.Year_1982'
, Y1983 bigint '$.Year_1983'
, Y1984 bigint '$.Year_1984'
, Y1985 bigint '$.Year_1985'
, Y1986 bigint '$.Year_1986'
, Y1987 bigint '$.Year_1987'
, Y1988 bigint '$.Year_1988'
, Y1989 bigint '$.Year_1989'
, Y1990 bigint '$.Year_1990'
, Y1991 bigint '$.Year_1991'
, Y1992 bigint '$.Year_1992'
, Y1993 bigint '$.Year_1993'
, Y1994 bigint '$.Year_1994'
, Y1995 bigint '$.Year_1995'
, Y1996 bigint '$.Year_1996'
, Y1997 bigint '$.Year_1997'
, Y1998 bigint '$.Year_1998'
, Y1999 bigint '$.Year_1999'
, Y2000 bigint '$.Year_2000'
, Y2001 bigint '$.Year_2001'
, Y2002 decimal(19,1) '$.Year_2002'
, Y2003 decimal(19,1) '$.Year_2003'
, Y2004 decimal(19,1) '$.Year_2004'
, Y2005 decimal(19,1) '$.Year_2005'
, Y2006 decimal(19,1) '$.Year_2006'
, Y2007 decimal(19,1) '$.Year_2007'
, Y2008 decimal(19,1) '$.Year_2008'
, Y2009 decimal(19,1) '$.Year_2009'
, Y2010 decimal(19,1) '$.Year_2010'
, Y2011 decimal(19,1) '$.Year_2011'
, Y2012 decimal(19,1) '$.Year_2012'
, Y2013 decimal(19,1) '$.Year_2013'
, Y2014 decimal(19,1) '$.Year_2014'
, Y2015 decimal(19,1) '$.Year_2015'
, Y2016 bigint '$.Year_2016'
) AS DAT;
As can be seen, the decimal values were converted to bigint so that all the year columns share the same definition. This also eases later processing, as no additional (implicit) conversions are necessary.
Also, the columns were renamed, either for simplification and convenience or simply for taste.
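A quick sanity check after the import; the count should match the number of records in the source file:

-- checking the number of imported records
SELECT count(*) NoRecords
FROM dbo.CountryPopulation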
Writing such a monster query can be time-consuming, though preparing the metadata in Excel can considerably decrease the effort. With copy-paste and a few tricks (e.g. replacing values, splitting columns on a delimiter) one can easily prepare such a structure:
Source field | Target field | DataType | Value | Import Clause | Select Clause |
Country | Country | nvarchar(max) | Yemen Rep. | , Country nvarchar(max) '$.Country' | , DAT.Country |
Country_Code | CountryCode | nvarchar(3) | YEM | , CountryCode nvarchar(3) '$.Country_Code' | , DAT.CountryCode |
Year_1960 | Y1960 | bigint | 5172135 | , Y1960 bigint '$.Year_1960' | , DAT.Y1960 |
Year_1961 | Y1961 | bigint | 5260501 | , Y1961 bigint '$.Year_1961' | , DAT.Y1961 |
Year_1962 | Y1962 | bigint | 5351799 | , Y1962 bigint '$.Year_1962' | , DAT.Y1962 |
Year_1963 | Y1963 | bigint | 5446063 | , Y1963 bigint '$.Year_1963' | , DAT.Y1963 |
Year_1964 | Y1964 | bigint | 5543339 | , Y1964 bigint '$.Year_1964' | , DAT.Y1964 |
Year_1965 | Y1965 | bigint | 5643643 | , Y1965 bigint '$.Year_1965' | , DAT.Y1965 |
Year_1966 | Y1966 | bigint | 5748588 | , Y1966 bigint '$.Year_1966' | , DAT.Y1966 |
Year_1967 | Y1967 | bigint | 5858638 | , Y1967 bigint '$.Year_1967' | , DAT.Y1967 |
Year_1968 | Y1968 | bigint | 5971407 | , Y1968 bigint '$.Year_1968' | , DAT.Y1968 |
Year_1969 | Y1969 | bigint | 6083619 | , Y1969 bigint '$.Year_1969' | , DAT.Y1969 |
Year_1970 | Y1970 | bigint | 6193810 | , Y1970 bigint '$.Year_1970' | , DAT.Y1970 |
Year_1971 | Y1971 | bigint | 6300554 | , Y1971 bigint '$.Year_1971' | , DAT.Y1971 |
Year_1972 | Y1972 | bigint | 6407295 | , Y1972 bigint '$.Year_1972' | , DAT.Y1972 |
Year_1973 | Y1973 | bigint | 6523452 | , Y1973 bigint '$.Year_1973' | , DAT.Y1973 |
Year_1974 | Y1974 | bigint | 6661566 | , Y1974 bigint '$.Year_1974' | , DAT.Y1974 |
Year_1975 | Y1975 | bigint | 6830692 | , Y1975 bigint '$.Year_1975' | , DAT.Y1975 |
Year_1976 | Y1976 | bigint | 7034868 | , Y1976 bigint '$.Year_1976' | , DAT.Y1976 |
Year_1977 | Y1977 | bigint | 7271872 | , Y1977 bigint '$.Year_1977' | , DAT.Y1977 |
Year_1978 | Y1978 | bigint | 7536764 | , Y1978 bigint '$.Year_1978' | , DAT.Y1978 |
Year_1979 | Y1979 | bigint | 7821552 | , Y1979 bigint '$.Year_1979' | , DAT.Y1979 |
Year_1980 | Y1980 | bigint | 8120497 | , Y1980 bigint '$.Year_1980' | , DAT.Y1980 |
Year_1981 | Y1981 | bigint | 8434017 | , Y1981 bigint '$.Year_1981' | , DAT.Y1981 |
Year_1982 | Y1982 | bigint | 8764621 | , Y1982 bigint '$.Year_1982' | , DAT.Y1982 |
Year_1983 | Y1983 | bigint | 9111097 | , Y1983 bigint '$.Year_1983' | , DAT.Y1983 |
Year_1984 | Y1984 | bigint | 9472170 | , Y1984 bigint '$.Year_1984' | , DAT.Y1984 |
Year_1985 | Y1985 | bigint | 9847899 | , Y1985 bigint '$.Year_1985' | , DAT.Y1985 |
Year_1986 | Y1986 | bigint | 10232733 | , Y1986 bigint '$.Year_1986' | , DAT.Y1986 |
Year_1987 | Y1987 | bigint | 10628585 | , Y1987 bigint '$.Year_1987' | , DAT.Y1987 |
Year_1988 | Y1988 | bigint | 11051504 | , Y1988 bigint '$.Year_1988' | , DAT.Y1988 |
Year_1989 | Y1989 | bigint | 11523267 | , Y1989 bigint '$.Year_1989' | , DAT.Y1989 |
Year_1990 | Y1990 | bigint | 12057039 | , Y1990 bigint '$.Year_1990' | , DAT.Y1990 |
Year_1991 | Y1991 | bigint | 12661614 | , Y1991 bigint '$.Year_1991' | , DAT.Y1991 |
Year_1992 | Y1992 | bigint | 13325583 | , Y1992 bigint '$.Year_1992' | , DAT.Y1992 |
Year_1993 | Y1993 | bigint | 14017239 | , Y1993 bigint '$.Year_1993' | , DAT.Y1993 |
Year_1994 | Y1994 | bigint | 14692686 | , Y1994 bigint '$.Year_1994' | , DAT.Y1994 |
Year_1995 | Y1995 | bigint | 15320653 | , Y1995 bigint '$.Year_1995' | , DAT.Y1995 |
Year_1996 | Y1996 | bigint | 15889449 | , Y1996 bigint '$.Year_1996' | , DAT.Y1996 |
Year_1997 | Y1997 | bigint | 16408954 | , Y1997 bigint '$.Year_1997' | , DAT.Y1997 |
Year_1998 | Y1998 | bigint | 16896210 | , Y1998 bigint '$.Year_1998' | , DAT.Y1998 |
Year_1999 | Y1999 | bigint | 17378098 | , Y1999 bigint '$.Year_1999' | , DAT.Y1999 |
Year_2000 | Y2000 | bigint | 17874725 | , Y2000 bigint '$.Year_2000' | , DAT.Y2000 |
Year_2001 | Y2001 | bigint | 18390135 | , Y2001 bigint '$.Year_2001' | , DAT.Y2001 |
Year_2002 | Y2002 | decimal(19,1) | 18919179.0 | , Y2002 decimal(19,1) '$.Year_2002' | , Cast(DAT.Y2002 as bigint) Y2002 |
Year_2003 | Y2003 | decimal(19,1) | 19462086.0 | , Y2003 decimal(19,1) '$.Year_2003' | , Cast(DAT.Y2003 as bigint) Y2003 |
Year_2004 | Y2004 | decimal(19,1) | 20017068.0 | , Y2004 decimal(19,1) '$.Year_2004' | , Cast(DAT.Y2004 as bigint) Y2004 |
Year_2005 | Y2005 | decimal(19,1) | 20582927.0 | , Y2005 decimal(19,1) '$.Year_2005' | , Cast(DAT.Y2005 as bigint) Y2005 |
Year_2006 | Y2006 | decimal(19,1) | 21160534.0 | , Y2006 decimal(19,1) '$.Year_2006' | , Cast(DAT.Y2006 as bigint) Y2006 |
Year_2007 | Y2007 | decimal(19,1) | 21751605.0 | , Y2007 decimal(19,1) '$.Year_2007' | , Cast(DAT.Y2007 as bigint) Y2007 |
Year_2008 | Y2008 | decimal(19,1) | 22356391.0 | , Y2008 decimal(19,1) '$.Year_2008' | , Cast(DAT.Y2008 as bigint) Y2008 |
Year_2009 | Y2009 | decimal(19,1) | 22974929.0 | , Y2009 decimal(19,1) '$.Year_2009' | , Cast(DAT.Y2009 as bigint) Y2009 |
Year_2010 | Y2010 | decimal(19,1) | 23606779.0 | , Y2010 decimal(19,1) '$.Year_2010' | , Cast(DAT.Y2010 as bigint) Y2010 |
Year_2011 | Y2011 | decimal(19,1) | 24252206.0 | , Y2011 decimal(19,1) '$.Year_2011' | , Cast(DAT.Y2011 as bigint) Y2011 |
Year_2012 | Y2012 | decimal(19,1) | 24909969.0 | , Y2012 decimal(19,1) '$.Year_2012' | , Cast(DAT.Y2012 as bigint) Y2012 |
Year_2013 | Y2013 | decimal(19,1) | 25576322.0 | , Y2013 decimal(19,1) '$.Year_2013' | , Cast(DAT.Y2013 as bigint) Y2013 |
Year_2014 | Y2014 | decimal(19,1) | 26246327.0 | , Y2014 decimal(19,1) '$.Year_2014' | , Cast(DAT.Y2014 as bigint) Y2014 |
Year_2015 | Y2015 | decimal(19,1) | 26916207.0 | , Y2015 decimal(19,1) '$.Year_2015' | , Cast(DAT.Y2015 as bigint) Y2015 |
Year_2016 | Y2016 | bigint | 27584213 | , Y2016 bigint '$.Year_2016' | , DAT.Y2016 |
Based on this structure, one can add two further formulas in Excel to prepare the clauses shown above (the last two columns were generated using the formulas below):
=", " & TRIM(B2) & " " & C2 & " '$." & TRIM(A2) & "'" =", DAT." & TRIM(B2)
Consuming data stored in a matrix structure can involve further challenges, even if this type of storage tends to save space. For example, adding the values for a new year requires extending the table with one more column, while performing calculations between years requires referencing each column explicitly. Therefore, transforming the data from a matrix to a normalized structure can be beneficial. This can be achieved with the UNPIVOT operator:
-- unpivoting the data
SELECT RES.Country
, RES.CountryCode
, Cast(Replace(RES.[Year], 'Y', '') as int) [Year]
, RES.Population
--INTO dbo.CountryPopulationPerYear
FROM (-- basis data
	SELECT Country
	, CountryCode
	, Y1960, Y1961, Y1962, Y1963, Y1964, Y1965, Y1966, Y1967, Y1968, Y1969
	, Y1970, Y1971, Y1972, Y1973, Y1974, Y1975, Y1976, Y1977, Y1978, Y1979
	, Y1980, Y1981, Y1982, Y1983, Y1984, Y1985, Y1986, Y1987, Y1988, Y1989
	, Y1990, Y1991, Y1992, Y1993, Y1994, Y1995, Y1996, Y1997, Y1998, Y1999
	, Y2000, Y2001, Y2002, Y2003, Y2004, Y2005, Y2006, Y2007, Y2008, Y2009
	, Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016
	FROM dbo.CountryPopulation
) DAT
UNPIVOT -- unpivot logic
(Population FOR [Year] IN (Y1960, Y1961, Y1962, Y1963, Y1964, Y1965, Y1966, Y1967, Y1968, Y1969
, Y1970, Y1971, Y1972, Y1973, Y1974, Y1975, Y1976, Y1977, Y1978, Y1979
, Y1980, Y1981, Y1982, Y1983, Y1984, Y1985, Y1986, Y1987, Y1988, Y1989
, Y1990, Y1991, Y1992, Y1993, Y1994, Y1995, Y1996, Y1997, Y1998, Y1999
, Y2000, Y2001, Y2002, Y2003, Y2004, Y2005, Y2006, Y2007, Y2008, Y2009
, Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016)
) RES
This can also be performed in two steps: first prepare the query, then insert the data into a table (e.g. dbo.CountryPopulationPerYear) on the fly (re-execute the previous query after uncommenting the INTO clause to generate the table). Note that UNPIVOT ignores NULL values, so countries without a population figure for a given year will simply have no row in the normalized table. The normalized structure makes aggregations across years straightforward:
--reviewing the data
SELECT Country
, CountryCode
, AVG(Population) AveragePopulation
, Max(Population) - Min(Population) RangePopulation
FROM dbo.CountryPopulationPerYear
WHERE [Year] BETWEEN 2010 AND 2019
GROUP BY Country
, CountryCode
ORDER BY Country
On the other hand, making comparisons between consecutive years is easier with the matrix structure:
--reviewing the data
SELECT Country
, CountryCode
, Y2016
, Y2010
, Y2016-Y2010 [2016-2010]
, Y2011-Y2010 [2011-2010]
, Y2012-Y2011 [2012-2011]
, Y2013-Y2012 [2013-2012]
, Y2014-Y2013 [2014-2013]
, Y2015-Y2014 [2015-2014]
, Y2016-Y2015 [2016-2015]
FROM dbo.CountryPopulation
ORDER BY Country
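For contrast, the same consecutive-year differences can also be derived from the normalized structure via a window function (a sketch using LAG, available since SQL Server 2012):

-- consecutive-year differences on the normalized structure
SELECT Country
, CountryCode
, [Year]
, Population
, Population - LAG(Population) OVER (PARTITION BY CountryCode ORDER BY [Year]) YearOverYearDelta
FROM dbo.CountryPopulationPerYear
WHERE [Year] BETWEEN 2010 AND 2016
ORDER BY Country
, [Year]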