
19 October 2023

Graphical Representation: Graphics We Live By II (Discount Rates in MS Excel)

Graphical Representation
Graphical Representation Series

It's difficult, if not impossible, to give general rules on how data visualizations should be built. However, the data professional can use a set of principles, which are less strict than rules, and validate one's work against them. Even then one might need to make concessions and go against the principles or common sense, though such cases should be few, at least in theory. One such important principle is reflected in Tufte's statement that "if the statistics are boring, then you've got the wrong numbers" [1].

So, the numbers we show should not be boring, yet that's the case with most of the numbers we need to show or consume in news and other media sources. Unfortunately, in many cases we need to go with the numbers we have and find a way to represent them, ideally by facilitating the reader to make sense of the respective data. This should be our first goal when visualizing data. Secondly, because everybody talks about insights nowadays, one should identify the opportunities for providing views into the data that go beyond the simple visualization, even if this puts more burden on the data professional's shoulders. Frankly, it's a long leap from making sense of a set of data to facilitating an 'Aha' moment. Thirdly, one should find or use the proper tools and channels for communicating the findings. 

A basic requirement for the data professional to be able to address these goals is to have clear definitions of the numbers, a good understanding of how the numbers reflect reality, respectively of how the numbers can be put into the broader context. Unfortunately, all of these seem to be a luxury. On the other side, the type of data we work with allows us to address at least the first goal. Upon case, our previous experience can help, though there will also be cases in which we can only try to do our best. 

Let's consider a simple set of data retrieved recently from another post - Discount rates (in percentage) per State, in which the values for 5 neighboring States are considered (see the first two columns from diagram A). Without knowing the meaning of the data, one could easily create a chart in Excel or any other visualization tool. Because the State has categorical values, some visualization tools will probably suggest using bar and not column charts. Either by own choice or by using the default settings, the data professional might end up with a column chart (see diagram B), which is OK for some visualizations. 


One can start with a few related questions:
(1) Does it make sense to use a chart to represent 5 values which have small variability (the difference between the first and last value is only 6%)? 
(2) Does it make sense to use a chart only for the sake of visualizing the data?
(3) Where is the benefit of using a chart as long as there's no information conveyed? 

One can see similar examples in the media where non-aggregated values are shown in a chart just for the sake of visualizing the data. Sometimes the authors compensate for the lack of meaning with junk elements, fancy titles or other tricks. Usually, sense-making in a chart takes longer than looking at the values in a table, as there are more dimensions or elements to consider. For a table there are the title, the headers and the values, nothing more! For a chart one has in addition the axes and some visualization elements that can facilitate or complicate the visualization's decoding. Add to this that there are also many tricks for distorting the data. 

Tables tend to maximize the amount of digital ink used to represent the data and minimize the amount used to represent everything else not important to understanding. This is what Tufte calls the data-ink ratio (see [1]), a second important principle. Applied to charts, it can be translated into (a) removing the border of the chart area, (b) minimizing the number of gridlines shown, (c) minimizing the number of ticks on the axes without leading to information loss, (d) removing redundant information, (e) respectively removing information that doesn't help the reader. 

However, the more data is available in the table, the more difficult it becomes to navigate it. But again, if the chart shows the individual values without any information gained, a table might still be more effective. One shouldn't be afraid to show a table where that's the case!

(4) I have a data visualization, what's next?

Ideally, the data professional should try to obtain the maximum of effect with a minimum of elements. If this principle aims for the efficiency of design, a fourth related principle aims for the efficiency of effort - one should achieve a good enough visualization with a minimum of effort. Therefore, it may be enough to settle for either of the two results above. 

On the other side, maybe by investing a bit more effort certain aspects can be improved. In this area beginners start playing with the colors, formatting the different elements of the chart. Unfortunately, even if color plays a major role in the encoding and decoding of meaning, it is often misused or overused. 

(5) Is there any meaning in the colors used?

In the next examples taken from the web (diagrams C and D), the author changed the color of the column with the minimal value to red to contrast it with the other values. Red is usually associated with danger, error, warning, or other similar characteristics with negative impact. The chances are high that the reader will associate the value with a negative connotation, even if red is also used for conveying important information (usually in text). Moreover, the reader will try to interpret the meaning of the other colors. In practice, the color grey has a neutral tone (and a calming effect on the mind). Therefore, it's safe to use grey in visualizations (see diagram D in contrast with diagram C). Some even advise setting grey as the default for the visualization and changing the colors as needed later.

In these charts, the author signaled in the titles that red denotes the lowest value, though this only reduces the confusion. One can meet titles in which several colors are used, reminiscent of a Christmas tree. Frankly, this type of encoding is not aesthetically pleasing, and it can annoy the reader. 

(6) What's in a name?

The titles and, upon case, the subtitles are important elements in communicating what the data reflects. The title should in general be short and succinct in the information it conveys, having the role of introducing, respectively identifying the chart, especially when multiple charts are used. Some charts can also use a subtitle, which can be longer than the title and have more of a storytelling character by highlighting the message and/or the finding in the data. In diagrams C and D the subtitles were used as titles, which is not necessarily wrong. 

In the media and in presentations with an influencing character, subtitles help the user understand the message or the main findings, though hardcoding the same in dynamic dashboards is not appropriate. Even if a logic was identified to handle the various scenarios, this shifts users' attention, and the chance is high that they'll stop investigating the visualization further. A data professional should present the facts with minimal interference in how the audience and/or users perceive the data. 

As a recommendation, one should aim for clear general titles and avoid transmitting one's own message in charts. As a principle this can be summarized as "aim for clarity and equidistance".

(7) What about meaning?

Until now we barely considered the meaning of data. Unfortunately, there's no information about what the Discount rate means. It could be "the minimum interest rate set by the US Federal Reserve (and some other national banks) for lending to other banks" or "a rate used for discounting bills of exchange", to use the definitions given by the Oxford dictionary. Searching on the web, the results lead to discount rates for royalty savings, resident tuitions, or retail for discount transactions. Most probably the Discount rates from the data set refer to the latter.

We need a definition of the Discount rate to understand what the values represent when they are ordered. For example, when Texas has a value of 25% (see diagram B), does this value have a negative or a positive impact when compared with the other values? It depends on how it's used in the associated formula. The last two charts consider that the minimum value has a negative impact, though without more information the encoding might be wrong! 

Important formulas and definitions should be considered as side information in the visualization, accompanying text or documentation! If further resources are required for understanding the data, then links to the required resources should be provided as well. At least this assures that the reader can acquire the right information without major overhead. 

(8) What do readers look for? 

Frankly, this should have been the first question! Readers have different expectations from data visualizations. First of all, there's curiosity - how the data looks in raw and/or aggregated form, or, in a more advanced form, how it is shaped (e.g. statistical characteristics like dispersion, variance, outliers). Secondly, readers look in the first phase to understand mainly whether the "results" are good or bad, even if there are many shades of grey in between. Further on, a distinction must be made between readers who want to learn more about the data, models and processes behind, respectively readers who just want a confirmation of their expectations, opinions and beliefs (aka bias). And, in the end, there are also people who are not interested in the data and what it tells, for whom the title and/or subtitle provide enough information. 

Besides this there are further categories of readers segmented by their role in the decision making, the planning and execution of operational, tactical, or strategic activities. Each of these categories has different needs. However, this exceeds the scope of our analysis. 

Returning to our example, one can expect that the average reader will try to identify the smallest and highest Discount rates from the data set, respectively try to compare the values between the different States. Sorting the data and having the values close to each other facilitates the comparison and ranking, otherwise the reader needs to do this by himself/herself. This latter aspect, and the fact that bar charts handle the display of categorical data well (the values being encoded as lengths), make bar charts the tool of choice (see diagram E). So, whenever you see categorical data, consider using a bar chart!

Even with the data sorted, the reader might still need to subtract the various values to identify and compare the differences. The higher the differences between the values, the more complex these operations become. Diagram F is supposed to help in this area, the comparison to the minimal value being shown in orange. Unfortunately, small variances make the display of the numbers more challenging, especially when the visualization tools don't offer display alternatives.

To show the data from diagram F, the third and fourth columns were added to the table (see diagram A). There's a fifth column which designates the percentage from a percentage (i.e. the increase in percentages between the current and the minimal value). Even if that's mathematically possible, the gain from using such data is negligible and can create confusion. This opens the door for another principle that applies in other areas as well: "just because you can, it doesn't mean you should!". One should weigh design decisions against common sense or one's intuition on how something can be (mis)used and/or (mis)understood!

The downside of diagram F is that the comparisons are made only in relation to the minimum value. Here the variations are small and still allow further comparisons. The higher the differences, the more challenging it becomes to make further comparisons. A matrix display (see diagram G) which compares any two values will help if the number of points is manageable. The numbers situated on and above the main diagonal were grayed out (and can be removed) because they are either not meaningful, or the negatives of the numbers found below the diagonal. Such diagrams are seldom used, though upon case they prove to be useful.
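
Even if the post builds diagram G in Excel, the same pairwise differences can be computed with a simple self-join. Below is a minimal T-SQL sketch in the spirit of the other posts on this blog; the @DiscountRates table variable and all the values except Texas' 25% are hypothetical stand-ins for the data from diagram A:

-- hypothetical stand-in for the data set from diagram A (only Texas' value is given in the post)
DECLARE @DiscountRates TABLE (State nvarchar(50), DiscountRate decimal(5,2));
INSERT INTO @DiscountRates
VALUES ('Texas', 25), ('Louisiana', 27), ('Arkansas', 28), ('Oklahoma', 30), ('New Mexico', 31);

-- pairwise differences (the equivalent of the values below the main diagonal in diagram G)
SELECT A.State StateA
, B.State StateB
, A.DiscountRate - B.DiscountRate Difference
FROM @DiscountRates A
     JOIN @DiscountRates B
       ON A.DiscountRate > B.DiscountRate
ORDER BY A.State
, B.State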

Choropleth maps (diagram H) are met almost everywhere the data has a geographical dimension. Like all the other visuals they have their own advantages (e.g. relative location on the map) and disadvantages (e.g. encoding or displaying the data). The diagram shows only the regions with data (remember the data-ink ratio principle).


(9) How about the shape of data?

When dealing with numerical data series, it's useful to show aggregated summaries like the average, quartiles, or standard deviation to understand how the data is shaped. Such summaries don't really make sense for our data set given the nature of the numbers (five values with small variance). One can still calculate them and show them in a box plot, though the benefit is negligible. 

(10) Which chart should be used?

As mentioned above, each chart has its advantages and disadvantages. Given the simplicity and the number of data points, any of the above diagrams will do. A table is simple enough, despite not using any visualization effects. Also, the bar charts are simple enough to use, with a plus maybe for diagram F, which shows a further dimension of the data. The choropleth map adds the geographical dimension, which could be important for some readers. The matrix table is more appropriate for technical readers and involves more effort to understand, at least at first sight, though the learning curve is small. The column charts were considered only for exemplification purposes, though they might work as well. 

In the end one should go with one's own experience and consider the audience and the communication channels used. One can also choose two different diagrams, especially when they are complementary and offer an additional dimension (e.g. diagrams F and H), though the context may dictate whether their use is appropriate or not. The diagrams should be simple to read and understand, but this doesn't mean that one should stick to the standard visuals. The data professional should explore other means of representing the data, as a fresh view has a chance of catching the reader's attention.

As a closing remark, nowadays data visualization tools allow building such diagrams without much effort. Conversely, it takes more effort to go beyond the basic functionality and provide more value for oneself and the readers. One should be able to evaluate upfront how much time it makes sense to invest. Hopefully, the few methods, principles and recommendations presented here will help further!


Resources:
[1] Edward R Tufte (1983) "The Visual Display of Quantitative Information"

01 April 2021

SQL Reloaded: Processing JSON Files with Flat Matrix Structure in SQL Server 2016+

Besides the CSV format, many of the data files made available under the open data initiatives are stored in JSON format, which makes the data more difficult to process, even if JSON offers a richer structure that goes beyond the tabular structure of CSV files. Fortunately, starting with SQL Server 2016, JSON became a native format, which makes the processing of JSON files relatively easy, though the ease with which one can process the data depends on how it is structured.

Let's consider as example a JSON file with the world population per country and year that can be downloaded from DataHub (source). The structure behind resembles a tabular model (see the table on the source website), having a flat structure. Just export the data to a file with the JSON extension (e.g. 'population-figures-by-country.json') locally (e.g. 'D:/Data'). The next step is to understand the file's structure. Some repositories provide good documentation in this respect, though there are also many exceptions. Having a JSON editor like Visual Studio which reveals the structure makes the process easier. 
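
Optionally, before going further, one can verify that the exported file contains valid JSON. Below is a minimal sketch using the ISJSON function (available since SQL Server 2016) against the same file path used in the queries that follow:

-- checking whether the file content is valid JSON (1 = valid)
SELECT ISJSON(BulkColumn) IsValidJson
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 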

As in the case of CSV files, one needs to infer the data types. There are two alphanumeric fields (Country & Country Code), while the remaining fields are numeric. The only challenge raised by the data seems to be the difference in format between the years 2002 to 2015 and the other years, as the values of the former contain a decimal part. All the numeric values should have been whole values. 
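
Such differences can also be spotted directly in SQL, by looking at the raw key/value pairs of the first record via OPENJSON's default schema. The filter below is just an illustrative check for values containing a decimal separator:

-- checking which fields of the first record contain a decimal part
SELECT DAT.[key]
, DAT.[value]
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$[0]') DAT
WHERE DAT.[value] LIKE '%.%'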

It’s recommended to start small and build the logic iteratively. Therefore, for the first step just look at files content via the OPENROWSET function:

-- looking at the JSON file 
SELECT *
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 

In a second step one can add the OPENJSON function by looking only at the first record: 

-- querying a json file (one record)
SELECT *
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$[0]')

In a third step one can add a few columns (e.g. Country & Country Code) to make sure that the select statement works correctly. 

-- querying a json file (all records, a few fields)
SELECT Country 
, CountryCode 
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$')
 WITH ( 
  Country nvarchar(max) '$.Country'
, CountryCode nvarchar(3) '$.Country_Code'
) AS DAT; 

In the next step one can add all the columns and import the data into a table (e.g. dbo.CountryPopulation) on the fly: 

-- importing a json file (all records) on the fly
SELECT DAT.Country
, DAT.CountryCode
, DAT.Y1960
, DAT.Y1961
, DAT.Y1962
, DAT.Y1963
, DAT.Y1964
, DAT.Y1965
, DAT.Y1966
, DAT.Y1967
, DAT.Y1968
, DAT.Y1969
, DAT.Y1970
, DAT.Y1971
, DAT.Y1972
, DAT.Y1973
, DAT.Y1974
, DAT.Y1975
, DAT.Y1976
, DAT.Y1977
, DAT.Y1978
, DAT.Y1979
, DAT.Y1980
, DAT.Y1981
, DAT.Y1982
, DAT.Y1983
, DAT.Y1984
, DAT.Y1985
, DAT.Y1986
, DAT.Y1987
, DAT.Y1988
, DAT.Y1989
, DAT.Y1990
, DAT.Y1991
, DAT.Y1992
, DAT.Y1993
, DAT.Y1994
, DAT.Y1995
, DAT.Y1996
, DAT.Y1997
, DAT.Y1998
, DAT.Y1999
, DAT.Y2000
, DAT.Y2001
, Cast(DAT.Y2002 as bigint) Y2002
, Cast(DAT.Y2003 as bigint) Y2003
, Cast(DAT.Y2004 as bigint) Y2004
, Cast(DAT.Y2005 as bigint) Y2005
, Cast(DAT.Y2006 as bigint) Y2006
, Cast(DAT.Y2007 as bigint) Y2007
, Cast(DAT.Y2008 as bigint) Y2008
, Cast(DAT.Y2009 as bigint) Y2009
, Cast(DAT.Y2010 as bigint) Y2010
, Cast(DAT.Y2011 as bigint) Y2011
, Cast(DAT.Y2012 as bigint) Y2012
, Cast(DAT.Y2013 as bigint) Y2013
, Cast(DAT.Y2014 as bigint) Y2014
, Cast(DAT.Y2015 as bigint) Y2015
, DAT.Y2016
INTO dbo.CountryPopulation
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$')
 WITH ( 
  Country nvarchar(max) '$.Country'
, CountryCode nvarchar(3) '$.Country_Code'
, Y1960 bigint '$.Year_1960'
, Y1961 bigint '$.Year_1961'
, Y1962 bigint '$.Year_1962'
, Y1963 bigint '$.Year_1963'
, Y1964 bigint '$.Year_1964'
, Y1965 bigint '$.Year_1965'
, Y1966 bigint '$.Year_1966'
, Y1967 bigint '$.Year_1967'
, Y1968 bigint '$.Year_1968'
, Y1969 bigint '$.Year_1969'
, Y1970 bigint '$.Year_1970'
, Y1971 bigint '$.Year_1971'
, Y1972 bigint '$.Year_1972'
, Y1973 bigint '$.Year_1973'
, Y1974 bigint '$.Year_1974'
, Y1975 bigint '$.Year_1975'
, Y1976 bigint '$.Year_1976'
, Y1977 bigint '$.Year_1977'
, Y1978 bigint '$.Year_1978'
, Y1979 bigint '$.Year_1979'
, Y1980 bigint '$.Year_1980'
, Y1981 bigint '$.Year_1981'
, Y1982 bigint '$.Year_1982'
, Y1983 bigint '$.Year_1983'
, Y1984 bigint '$.Year_1984'
, Y1985 bigint '$.Year_1985'
, Y1986 bigint '$.Year_1986'
, Y1987 bigint '$.Year_1987'
, Y1988 bigint '$.Year_1988'
, Y1989 bigint '$.Year_1989'
, Y1990 bigint '$.Year_1990'
, Y1991 bigint '$.Year_1991'
, Y1992 bigint '$.Year_1992'
, Y1993 bigint '$.Year_1993'
, Y1994 bigint '$.Year_1994'
, Y1995 bigint '$.Year_1995'
, Y1996 bigint '$.Year_1996'
, Y1997 bigint '$.Year_1997'
, Y1998 bigint '$.Year_1998'
, Y1999 bigint '$.Year_1999'
, Y2000 bigint '$.Year_2000'
, Y2001 bigint '$.Year_2001'
, Y2002 decimal(19,1) '$.Year_2002'
, Y2003 decimal(19,1) '$.Year_2003'
, Y2004 decimal(19,1) '$.Year_2004'
, Y2005 decimal(19,1) '$.Year_2005'
, Y2006 decimal(19,1) '$.Year_2006'
, Y2007 decimal(19,1) '$.Year_2007'
, Y2008 decimal(19,1) '$.Year_2008'
, Y2009 decimal(19,1) '$.Year_2009'
, Y2010 decimal(19,1) '$.Year_2010'
, Y2011 decimal(19,1) '$.Year_2011'
, Y2012 decimal(19,1) '$.Year_2012'
, Y2013 decimal(19,1) '$.Year_2013'
, Y2014 decimal(19,1) '$.Year_2014'
, Y2015 decimal(19,1) '$.Year_2015'
, Y2016 bigint '$.Year_2016'
) AS DAT; 

As can be seen, the decimal values were converted to bigint to preserve the same definition across all columns. Moreover, this eases the data processing later, as no additional (implicit) conversions are necessary. 

Also, the columns' names were changed, either for simplification/convenience or simply out of taste. 
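
Once the data is imported into dbo.CountryPopulation, a quick sanity check doesn't hurt. The following is a minimal sketch (the record count plus a spot check on one of the converted years):

-- quick sanity check after the import
SELECT count(*) NoRecords
, min(Y2015) MinY2015
, max(Y2015) MaxY2015
, sum(CASE WHEN Y2015 IS NULL THEN 1 ELSE 0 END) NoMissingY2015
FROM dbo.CountryPopulation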

Writing such a monster query can be time-consuming, though preparing the metadata in Excel can considerably decrease the effort. With copy-paste and a few tricks (e.g. replacing values, splitting columns based on a delimiter) one can easily prepare such a structure:

Source field | Target field | DataType | Value | Import Clause | Select Clause
Country | Country | nvarchar(max) | Yemen Rep. | , Country nvarchar(max) '$.Country' | , DAT.Country
Country_Code | CountryCode | nvarchar(3) | YEM | , CountryCode nvarchar(3) '$.Country_Code' | , DAT.CountryCode
Year_1960 | Y1960 | bigint | 5172135 | , Y1960 bigint '$.Year_1960' | , DAT.Y1960
Year_1961 | Y1961 | bigint | 5260501 | , Y1961 bigint '$.Year_1961' | , DAT.Y1961
Year_1962 | Y1962 | bigint | 5351799 | , Y1962 bigint '$.Year_1962' | , DAT.Y1962
Year_1963 | Y1963 | bigint | 5446063 | , Y1963 bigint '$.Year_1963' | , DAT.Y1963
Year_1964 | Y1964 | bigint | 5543339 | , Y1964 bigint '$.Year_1964' | , DAT.Y1964
Year_1965 | Y1965 | bigint | 5643643 | , Y1965 bigint '$.Year_1965' | , DAT.Y1965
Year_1966 | Y1966 | bigint | 5748588 | , Y1966 bigint '$.Year_1966' | , DAT.Y1966
Year_1967 | Y1967 | bigint | 5858638 | , Y1967 bigint '$.Year_1967' | , DAT.Y1967
Year_1968 | Y1968 | bigint | 5971407 | , Y1968 bigint '$.Year_1968' | , DAT.Y1968
Year_1969 | Y1969 | bigint | 6083619 | , Y1969 bigint '$.Year_1969' | , DAT.Y1969
Year_1970 | Y1970 | bigint | 6193810 | , Y1970 bigint '$.Year_1970' | , DAT.Y1970
Year_1971 | Y1971 | bigint | 6300554 | , Y1971 bigint '$.Year_1971' | , DAT.Y1971
Year_1972 | Y1972 | bigint | 6407295 | , Y1972 bigint '$.Year_1972' | , DAT.Y1972
Year_1973 | Y1973 | bigint | 6523452 | , Y1973 bigint '$.Year_1973' | , DAT.Y1973
Year_1974 | Y1974 | bigint | 6661566 | , Y1974 bigint '$.Year_1974' | , DAT.Y1974
Year_1975 | Y1975 | bigint | 6830692 | , Y1975 bigint '$.Year_1975' | , DAT.Y1975
Year_1976 | Y1976 | bigint | 7034868 | , Y1976 bigint '$.Year_1976' | , DAT.Y1976
Year_1977 | Y1977 | bigint | 7271872 | , Y1977 bigint '$.Year_1977' | , DAT.Y1977
Year_1978 | Y1978 | bigint | 7536764 | , Y1978 bigint '$.Year_1978' | , DAT.Y1978
Year_1979 | Y1979 | bigint | 7821552 | , Y1979 bigint '$.Year_1979' | , DAT.Y1979
Year_1980 | Y1980 | bigint | 8120497 | , Y1980 bigint '$.Year_1980' | , DAT.Y1980
Year_1981 | Y1981 | bigint | 8434017 | , Y1981 bigint '$.Year_1981' | , DAT.Y1981
Year_1982 | Y1982 | bigint | 8764621 | , Y1982 bigint '$.Year_1982' | , DAT.Y1982
Year_1983 | Y1983 | bigint | 9111097 | , Y1983 bigint '$.Year_1983' | , DAT.Y1983
Year_1984 | Y1984 | bigint | 9472170 | , Y1984 bigint '$.Year_1984' | , DAT.Y1984
Year_1985 | Y1985 | bigint | 9847899 | , Y1985 bigint '$.Year_1985' | , DAT.Y1985
Year_1986 | Y1986 | bigint | 10232733 | , Y1986 bigint '$.Year_1986' | , DAT.Y1986
Year_1987 | Y1987 | bigint | 10628585 | , Y1987 bigint '$.Year_1987' | , DAT.Y1987
Year_1988 | Y1988 | bigint | 11051504 | , Y1988 bigint '$.Year_1988' | , DAT.Y1988
Year_1989 | Y1989 | bigint | 11523267 | , Y1989 bigint '$.Year_1989' | , DAT.Y1989
Year_1990 | Y1990 | bigint | 12057039 | , Y1990 bigint '$.Year_1990' | , DAT.Y1990
Year_1991 | Y1991 | bigint | 12661614 | , Y1991 bigint '$.Year_1991' | , DAT.Y1991
Year_1992 | Y1992 | bigint | 13325583 | , Y1992 bigint '$.Year_1992' | , DAT.Y1992
Year_1993 | Y1993 | bigint | 14017239 | , Y1993 bigint '$.Year_1993' | , DAT.Y1993
Year_1994 | Y1994 | bigint | 14692686 | , Y1994 bigint '$.Year_1994' | , DAT.Y1994
Year_1995 | Y1995 | bigint | 15320653 | , Y1995 bigint '$.Year_1995' | , DAT.Y1995
Year_1996 | Y1996 | bigint | 15889449 | , Y1996 bigint '$.Year_1996' | , DAT.Y1996
Year_1997 | Y1997 | bigint | 16408954 | , Y1997 bigint '$.Year_1997' | , DAT.Y1997
Year_1998 | Y1998 | bigint | 16896210 | , Y1998 bigint '$.Year_1998' | , DAT.Y1998
Year_1999 | Y1999 | bigint | 17378098 | , Y1999 bigint '$.Year_1999' | , DAT.Y1999
Year_2000 | Y2000 | bigint | 17874725 | , Y2000 bigint '$.Year_2000' | , DAT.Y2000
Year_2001 | Y2001 | bigint | 18390135 | , Y2001 bigint '$.Year_2001' | , DAT.Y2001
Year_2002 | Y2002 | decimal(19,1) | 18919179.0 | , Y2002 decimal(19,1) '$.Year_2002' | , Cast(DAT.Y2002 as bigint) Y2002
Year_2003 | Y2003 | decimal(19,1) | 19462086.0 | , Y2003 decimal(19,1) '$.Year_2003' | , Cast(DAT.Y2003 as bigint) Y2003
Year_2004 | Y2004 | decimal(19,1) | 20017068.0 | , Y2004 decimal(19,1) '$.Year_2004' | , Cast(DAT.Y2004 as bigint) Y2004
Year_2005 | Y2005 | decimal(19,1) | 20582927.0 | , Y2005 decimal(19,1) '$.Year_2005' | , Cast(DAT.Y2005 as bigint) Y2005
Year_2006 | Y2006 | decimal(19,1) | 21160534.0 | , Y2006 decimal(19,1) '$.Year_2006' | , Cast(DAT.Y2006 as bigint) Y2006
Year_2007 | Y2007 | decimal(19,1) | 21751605.0 | , Y2007 decimal(19,1) '$.Year_2007' | , Cast(DAT.Y2007 as bigint) Y2007
Year_2008 | Y2008 | decimal(19,1) | 22356391.0 | , Y2008 decimal(19,1) '$.Year_2008' | , Cast(DAT.Y2008 as bigint) Y2008
Year_2009 | Y2009 | decimal(19,1) | 22974929.0 | , Y2009 decimal(19,1) '$.Year_2009' | , Cast(DAT.Y2009 as bigint) Y2009
Year_2010 | Y2010 | decimal(19,1) | 23606779.0 | , Y2010 decimal(19,1) '$.Year_2010' | , Cast(DAT.Y2010 as bigint) Y2010
Year_2011 | Y2011 | decimal(19,1) | 24252206.0 | , Y2011 decimal(19,1) '$.Year_2011' | , Cast(DAT.Y2011 as bigint) Y2011
Year_2012 | Y2012 | decimal(19,1) | 24909969.0 | , Y2012 decimal(19,1) '$.Year_2012' | , Cast(DAT.Y2012 as bigint) Y2012
Year_2013 | Y2013 | decimal(19,1) | 25576322.0 | , Y2013 decimal(19,1) '$.Year_2013' | , Cast(DAT.Y2013 as bigint) Y2013
Year_2014 | Y2014 | decimal(19,1) | 26246327.0 | , Y2014 decimal(19,1) '$.Year_2014' | , Cast(DAT.Y2014 as bigint) Y2014
Year_2015 | Y2015 | decimal(19,1) | 26916207.0 | , Y2015 decimal(19,1) '$.Year_2015' | , Cast(DAT.Y2015 as bigint) Y2015
Year_2016 | Y2016 | bigint | 27584213 | , Y2016 bigint '$.Year_2016' | , DAT.Y2016

Based on this structure, one can add two further formulas in Excel to prepare the statements as above and then copy the fields (the last two columns were generated using the formulas below): 

=", " & TRIM(B2) & " " & C2 & " '$." & TRIM(A2) & "'" 
=", DAT." & TRIM(B2)

Consuming data in which the values are stored in a matrix structure can sometimes involve further challenges, even if this type of storage tends to save space. For example, adding the values for a new year would involve extending the table with one more column, while performing calculations between years would involve referencing each column in formulas. Therefore, transforming the data from a matrix to a normalized structure can have some benefits. This can be achieved by writing a query via the UNPIVOT operator:

-- unpivoting the data 
SELECT RES.Country
, RES.CountryCode
, Cast(Replace(RES.[Year], 'Y', '') as int) [Year]
, RES.Population
--INTO dbo.CountryPopulationPerYear
FROM 
( -- basis data
	SELECT Country
	, CountryCode
	, Y1960, Y1961, Y1962, Y1963, Y1964, Y1965, Y1966, Y1967, Y1968, Y1969
	, Y1970, Y1971, Y1972, Y1973, Y1974, Y1975, Y1976, Y1977, Y1978, Y1979
	, Y1980, Y1981, Y1982, Y1983, Y1984, Y1985, Y1986, Y1987, Y1988, Y1989
	, Y1990, Y1991, Y1992, Y1993, Y1994, Y1995, Y1996, Y1997, Y1998, Y1999
	, Y2000, Y2001, Y2002, Y2003, Y2004, Y2005, Y2006, Y2007, Y2008, Y2009
	, Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016
	FROM dbo.CountryPopulation
) DAT
UNPIVOT  -- unpivot logic
   (Population FOR [Year] IN  (Y1960, Y1961, Y1962, Y1963, Y1964, Y1965, Y1966, Y1967, Y1968, Y1969
, Y1970, Y1971, Y1972, Y1973, Y1974, Y1975, Y1976, Y1977, Y1978, Y1979
, Y1980, Y1981, Y1982, Y1983, Y1984, Y1985, Y1986, Y1987, Y1988, Y1989
, Y1990, Y1991, Y1992, Y1993, Y1994, Y1995, Y1996, Y1997, Y1998, Y1999
, Y2000, Y2001, Y2002, Y2003, Y2004, Y2005, Y2006, Y2007, Y2008, Y2009
, Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016)
) RES

This can also be performed in two steps, first preparing the query, and in a final step inserting the data into a table (e.g. dbo.CountryPopulationPerYear) on the fly (re-execute the previous query after uncommenting the INTO clause to generate the table). 

-- aggregating the data per country for a range of years
SELECT Country
, CountryCode
, AVG(Population) AveragePopulation
, Max(Population) - Min(Population) RangePopulation
FROM dbo.CountryPopulationPerYear
WHERE [Year] BETWEEN 2010 AND 2019
GROUP BY Country
, CountryCode
ORDER BY Country

On the other side making comparisons between consecutive years is easier when using a matrix structure: 

-- comparing consecutive years (matrix structure)
SELECT Country
, CountryCode
, Y2016
, Y2010
, Y2016-Y2010 [2016-2010]
, Y2011-Y2010 [2011-2010]
, Y2012-Y2011 [2012-2011]
, Y2013-Y2012 [2013-2012]
, Y2014-Y2013 [2014-2013]
, Y2015-Y2014 [2015-2014]
, Y2016-Y2015 [2016-2015]
FROM dbo.CountryPopulation
ORDER BY Country

Unless the storage space is a problem, in theory one can store the data in both formats as there can be requests which can benefit from one structure or the other. 
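
For the normalized structure, the comparison between consecutive years can be expressed with a window function instead. Below is a minimal sketch based on the dbo.CountryPopulationPerYear table created above (LAG is available since SQL Server 2012):

-- comparing consecutive years (normalized structure)
SELECT Country
, CountryCode
, [Year]
, Population
, Population - LAG(Population) OVER (PARTITION BY Country, CountryCode ORDER BY [Year]) YearOverYearDelta
FROM dbo.CountryPopulationPerYear
WHERE [Year] BETWEEN 2010 AND 2016
ORDER BY Country
, [Year]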

Happy coding!

06 July 2020

SSRS (& Paginated Reports): Ranking Rows in Reports

Introduction

In almost all the reports I built, unless it was explicitly requested not to, I prefer adding a running number (aka ranking) for each record contained in the report, while providing different background colors for consecutive rows. The ranking allows easily identifying a record when discussing it within the report or extracts, while the different background colors allow differentiating between two records when following the values while scrolling horizontally. The logic for the background color can be based on two (or more) colors using the ranking as basis.

Tabular Reports

In a tabular report the RowNumber() function is the straightforward way of providing a ranking. One just needs to add a column into the report before the other columns, give it a meaningful name (e.g. RankingNo) and provide the following formula within its Expression:
= RowNumber(Nothing)

When 'Nothing' is provided as parameter, the ranking is performed across the whole report. If the ranking needs to be restricted to a grouping (e.g. Category), then the group's name needs to be provided as parameter:
= RowNumber("Category")

Matrix Reports

Unfortunately, in a matrix report based on the aggregation of raw data the RowNumber() function stops working, the values shown being incorrect. The solution I use to solve this is based on a custom GetRank() VB function:

Dim Rank as Integer = 0
Dim LastValue as String = ""

Function GetRank(group as string) as integer
if group <> LastValue then
       Rank = Rank + 1
       LastValue = group
end if

return Rank
end function

The function compares the value provided in the call against a global scope LastValue text value. If the values are different, then a global scope Rank value is incremented by 1, while LastValue is set to the new value; otherwise the values remain unchanged. The logic is basic enough even for a non-programmer.

The above code needs to be added into the Code section of the Report's Properties for the function to become available:
Adding the code in Report Properties
Once the function is added, a new column should be added similarly as for a tabular report, providing the following code within its Expression instead:
=Code.GetRank(Fields!ProductNumber.Value)

Note:
As it seems, in the version of the Reporting Services extension I use, the function has only a page scope, the value being reset after each page. However, when exporting the data to Excel the ranking is applied to the whole dataset.

Providing Alternate Colors

Independently of the report type, one can provide an alternate color for the table's rows by selecting the row with the data and adding the following expression into the BackgroundColor property:
=Iif(ReportItems!RankingNo.Value Mod 2, "White", "LightSteelBlue")

Notes:
1) For a tabular report the cost of calling the RowNumber function again instead of referring to the RankingNo cell is relatively small. One can write the expression also like this:
=IIf(RowNumber(Nothing) Mod 2 = 0, "White", "LightSteelBlue")

Power BI Paginated Reports

The pieces of code considered above can also be used in Power BI Paginated Reports. Even if there's no functionality for adding custom code in the standard UI, one can make changes to the rdl file in Visual Studio or even in Notepad. For example, one can add the code within the "Code" tag at the end of the file, before the closing tag of the report:

<Code>Dim Rank as Integer = 0
Dim LastValue as String = ""
Dim Concatenation = ""

Function GetRank(group as string) as integer
if group <> LastValue then
       Rank = Rank + 1
       LastValue = group
end if

Concatenation = Concatenation & vbCrLf & Rank & "/" & group & "/" & LastValue
return Rank
end function</Code>
</Report>

Note:
One can consider using a pipe "|" instead of a forward slash.

SSRS (& Paginated Reports): Matrix Report Display (Fifth Magic Class)

Introduction

SQL Server Reporting Services (SSRS) allows grouping data into a matrix format based on one or more groups. By using the Report Wizard one can considerably reduce the volume of work.

The considered example is based on the AdventureWorks2014 database and uses the Product Inventory as the basis for building the report. 

Preparing the Data

Usually it's useful to incorporate the logic for a report in one or more views, thus allowing the views to be reused in multiple reports. For the current report two views need to be created: one for the Products (Production.vProducts), respectively Production.vProductInventory for the inventory. 

-- dropping the vProducts view (cleaning after)
--DROP VIEW IF EXISTS [Production].[vProducts]

-- creating the vProducts view
CREATE VIEW [Production].[vProducts]
AS 
SELECT p.[ProductID] 
, p.ProductNumber
, p.[Name] 
, IsNull(p.Size, '') + IsNull(' ' + p.SizeUnitMeasureCode, '') Size
, p.Color
, P.Style
, p.ProductModelID
, pm.[Name] AS [ProductModel] 
, p.StandardCost 
, P.ListPrice
, P.SafetyStockLevel
, P.ReorderPoint
, p.SellStartDate 
, p.SellEndDate
, p.ProductSubcategoryID
, PPS.Name Subcategory
, PPS.ProductCategoryID
, PPC.Name Category
, P.MakeFlag
, P.FinishedGoodsFlag
FROM [Production].[Product] p 
     LEFT JOIN [Production].[ProductModel] pm 
       ON p.[ProductModelID] = pm.[ProductModelID] 
	 LEFT JOIN Production.ProductSubcategory PPS
	   ON P.ProductSubcategoryID = PPS.ProductSubcategoryID 
	      LEFT JOIN Production.ProductCategory PPC
		    ON PPS.ProductCategoryID = PPC.ProductCategoryID 
GO

-- reviewing the data 
SELECT *
FROM [Production].[vProducts]

-- checking the view for internal data consistency
SELECT count(*)
FROM [Production].[Product] p 
     LEFT JOIN [Production].[ProductModel] pm 
       ON p.[ProductModelID] = pm.[ProductModelID] 
	 LEFT JOIN Production.ProductSubcategory PPS
	   ON P.ProductSubcategoryID = PPS.ProductSubcategoryID 
	      LEFT JOIN Production.ProductCategory PPC
		    ON PPS.ProductCategoryID = PPC.ProductCategoryID 
GO


-- dropping the vProductInventory view (cleaning after)
--DROP VIEW IF EXISTS Production.vProductInventory

-- creating the view 
CREATE VIEW Production.vProductInventory
AS
SELECT PPI.ProductId 
, PPD.ProductNumber
, PPD.Name ProductName 
, PPD.ProductModel
, PPD.Size
, PPD.Category
, PPD.Subcategory
, PPD.Style
, PPD.StandardCost
, PPD.ListPrice 
, PPD.StandardCost * PPI.Quantity InventoryValue 
, PPD.ListPrice * PPI.Quantity SalesValue
, PPD.MakeFlag
, PPI.Locationid 
, PPL.Name Location 
, PPI.Shelf 
, PPI.Bin 
, PPI.Quantity 
FROM [Production].[ProductInventory] PPI
     JOIN [Production].[vProducts] PPD
	   ON PPI.ProductID = PPD.ProductID
	 JOIN [Production].[Location] PPL
	   ON PPI.LocationID = PPL.LocationID

-- reviewing the data
SELECT *
FROM Production.vProductInventory

-- checking the view for internal data consistency
SELECT count(*)
FROM [Production].[ProductInventory] PPI
     JOIN [Production].[vProducts] PPD
	   ON PPI.ProductID = PPD.ProductID
	 JOIN [Production].[Location] PPL
	   ON PPI.LocationID = PPL.LocationID

Note:
It's important to check the internal consistency of the views or queries used, i.e. whether the logic removes or duplicates data. For this one can run the query for the uppermost table, and repeatedly add one more join for each run to see whether the number of records remains the same. One can shortcut the validation by checking only the number of records from the base table and for the whole query, and only if there are differences use the previously mentioned approach. (This is how I observed that the Production.vProductDetails view is not usable, because it considers only the Products having a valid Model.)
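
As a sketch of the shortcut validation mentioned above, the two record counts can be put side by side (they should match if the joins neither remove nor duplicate records):

-- shortcut validation: base table vs. view record counts
SELECT (SELECT count(*) FROM [Production].[ProductInventory]) NoBaseRecords
, (SELECT count(*) FROM [Production].[vProductInventory]) NoViewRecords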

Creating the Report 

We can now use  the Production.vProductInventory view to create the Product Inventory by Location report based on the following query:

-- Product Inventory by Location
SELECT PPI.Category
, PPI.Subcategory
, PPI.ProductNumber
, PPI.ProductName 
, PPI.ProductModel
, PPI.Size
, PPI.Style
, PPI.StandardCost
, PPI.ListPrice
, PPI.Location 
, PPI.Quantity 
, PPI.InventoryValue 
, PPI.SalesValue 
FROM Production.vProductInventory PPI
ORDER BY PPI.Category
, PPI.Subcategory
, PPI.ProductNumber

Note:
The attributes can be provided in the order in which they should be displayed in the report, thus reducing the overhead in the further steps. 

Using the Report Wizard via 'Add New Report', select in the first step the data source, while in the next step provide the above query:

Design the Query

In the next step select the "Matrix" Report Type:

Select the Report Type

Within the "Design the Matrix" section assign the fields as follows (all the fields except the ones considered as Columns and Details will be considered as Rows):

Design the Matrix

Into the last step give the report a meaningful name (e.g. Product Inventory by Location):

Completing the Wizard

In theory the report is ready to run, however before doing that, change the formatting by aligning the headers to the center, possibly change their font size from 11 to 10, respectively rename the dataset (e.g. Inventory). To obtain the same information about the grouping as below, switch to 'Advanced Mode'.

Design View

And here's the output (I had to scroll through the pages to find meaningful values, therefore part of the Details header is not shown):

Preview

Restructuring the Grouping

As can be seen in Design Mode, the wizard created a grouping for each attribute provided into the Details (see matrix1_Category, matrix1_Subcategory, etc.). Therefore, the values will not be repeated for each row, which is typically desirable for visualizations but not when exporting the data to Excel for further processing. I prefer the latter version, therefore to obtain this behavior one can go on and delete all the groupings via "Delete group only" except matrix1_Category:

Deleting the groups

This action unfortunately deleted all the Detail headers except Category. To bring them back into the grouping, double click on the remaining group and add each field into the Group expressions as below:

Group Properties

As a final change before running the report, one can add header names for the Detail attributes. After these changes the report's design looks as follows:

Report Design with one grouping

And here's the final report with the values repeating for each row:

Preview Report without formatting

Note:
To avoid removing the groupings, I prefer to add only one Detail field into the query, typically the field which will make the row unique in the output (e.g. Product Number), and add the further fields after the Wizard created the report (actually replacing the query below with the one above). One still needs to add the columns manually into the report. In the end the effort is similar. 

-- Product Inventory by Location
SELECT PPI.ProductNumber
, PPI.Location 
, PPI.Quantity 
, PPI.InventoryValue 
, PPI.SalesValue 
FROM Production.vProductInventory PPI
ORDER BY PPI.Category
, PPI.Subcategory
, PPI.ProductNumber

Changing the Design

The report's design can be slightly improved by adding various formatting to the cells or values. One can use similar formatting as the ones considered in the previous post. The only thing difficult to implement will be a ranking function (see Ranking Rows in Reports). After the design changes, here's the report:

Final Report

Note:
Of course, together with parameters one can also add totals after each Category or Subcategory to the report if needed, though the latter is more appropriate for design purposes and not for further data consumption. 

Happy coding!

