22 October 2022

🛢DBMS: Data Warehouse/Mart (Just the Quotes)

"Unfortunately, just collecting the data in one place and making it easily available isn’t enough. When operational data from transactions is loaded into the data warehouse, it often contains missing or inaccurate data. How good or bad the data is a function of the amount of input checking done in the application that generates the transaction. Unfortunately, many deployed applications are less than stellar when it comes to validating the inputs. To overcome this problem, the operational data must go through a 'cleansing' process, which takes care of missing or out-of-range values. If this cleansing step is not done before the data is loaded into the data warehouse, it will have to be performed repeatedly whenever that data is used in a data mining operation." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"There are four levels of data in the architected environment - the operational level, the atomic (or the data warehouse) level, the departmental (or the data mart) level, and the individual level. These different levels of data are the basis of a larger architecture called the corporate information factory (CIF). The operational level of data holds application-oriented primitive data only and primarily serves the high-performance transaction-processing community. The data-warehouse level of data holds integrated, historical primitive data that cannot be updated. In addition, some derived data is found there. The departmental or data mart level of data contains derived data almost exclusively. The departmental or data mart level of data is shaped by end-user requirements into a form specifically suited to the needs of the department. And the individual level of data is where much heuristic analysis is done." (William H Inmon, "Building the Data Warehouse" 4th Ed., 2005)

"Having a purposeless or poorly performing dashboard is more common than not. This happens when the underlying architecture is not designed properly to support the needs of dashboard interaction. There is an obvious disconnect between the design of the data warehouse and the design of the dashboards. The people who design the data warehouse do not know what the dashboard will do; and the people who design the dashboards do not know how the data warehouse was designed, resulting in a lack of cohesion between the two. A similar disconnect can also exist between the dashboard designer and the business analyst, resulting in a dashboard that may look beautiful and dazzling but brings very little business value." (Nils H Rasmussen et al, "Business Dashboards: A visual catalog for design and deployment", 2009)

"Having multiple data lakes replicates the same problems that were created with multiple data warehouses - disparate data siloes and data fiefdoms that don't facilitate sharing of the corporate data assets across the organization. Organizations need to have a single data lake from which they can source the data for their BI/data warehousing and analytic needs. The data lake may never become the 'single version of the truth' for the organization, but then again, neither will the data warehouse. Instead, the data lake becomes the 'single or central repository for all the organization's data' from which all the organization's reporting and analytic needs are sourced." (Billl Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"There are, however, many problems with independent data marts. Independent data marts: (1) Do not have data that can be reconciled with other data marts (2) Require their own independent integration of raw data (3) Do not provide a foundation that can be built on whenever there are future analytical needs." (William H Inmon & Daniel Linstedt, "Data Architecture: A Primer for the Data Scientist: Big Data, Data Warehouse and Data Vault", 2015)

"Unfortunately, some organizations are replicating the bad data warehouse practice by creating special-purpose data lakes - data lakes to address a specific business need. Resist that urge! Instead, source the data that is needed for that specific business need into an 'analytic sandbox' where the data scientists and the business users can collaborate to find those data variables and analytic models that are better predictors of the business performance. Within the 'analytic sandbox', the organization can bring together (ingest and integrate) the data that it wants to test, build the analytic models, test the model's goodness of fit, acquire new data, refine the analytic models, and retest the goodness of fit." (Billl Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"Data quality in warehousing and BI is typically defined in terms of the 4 C’s - is the data clean, correct, consistent, and complete? When it comes to big data, there are two schools of thought that have different views and expectations of data quality. The first school believes that the gold standard of the 4 C’s must apply to all data (big and little) used for clinical care and performance metrics. The second school believes that in big data environments, a stringent data quality standard is impossible, too costly, or not required. While diametrically opposite opinions may play well in panel discussions, they do little to reconcile the realities of healthcare data quality." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017) 

"Data warehousing has always been difficult, because leaders within an organization want to approach warehousing and analytics as just another technology or application buy. Viewed in this light, they fail to understand the complexity and interdependent nature of building an enterprise reporting environment." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"A data lake is a storage repository that holds a very large amount of data, often from diverse sources, in native format until needed. In some respects, a data lake can be compared to a staging area of a data warehouse, but there are key differences. Just like a staging area, a data lake is a conglomeration point for raw data from diverse sources. However, a staging area only stores new data needed for addition to the data warehouse and is a transient data store. In contrast, a data lake typically stores all possible data that might be needed for an undefined amount of analysis and reporting, allowing analysts to explore new data relationships. In addition, a data lake is usually built on commodity hardware and software such as Hadoop, whereas traditional staging areas typically reside in structured databases that require specialized servers." (Mike Fleckenstein & Lorraine Fellows, "Modern Data Strategy", 2018)

"A data warehouse follows a pre-built static structure to model source data. Any changes at the structural and configuration level must go through a stringent business review process and impact analysis. Data lakes are very agile. Consumption or analytical layer can be modified to fit in the model requirements. Consumers of a data lake are not constant; therefore, schema and modeling lies at the liberty of analysts and scientists." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data warehousing, as we are aware, is the traditional approach of consolidating data from multiple source systems and combining into one store that would serve as the source for analytical and business intelligence reporting. The concept of data warehousing resolved the problems of data heterogeneity and low-level integration. In terms of objectives, a data lake is no different from a data warehouse. Both are primary advocates of terms like 'single source of truth' and 'central data repository'." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"A defining characteristic of the data lakehouse architecture is allowing direct access to data as files while retaining the valuable properties of a data warehouse. Just do both!" (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"The data lakehouse architecture presents an opportunity comparable to the one seen during the early years of the data warehouse market. The unique ability of the lakehouse to manage data in an open environment, blend all varieties of data from all parts of the enterprise, and combine the data science focus of the data lake with the end user analytics of the data warehouse will unlock incredible value for organizations. [...] "The lakehouse architecture equally makes it natural to manage and apply models where the data lives." (Bill Inmon et al, "Building the Data Lakehouse", 2021)

19 October 2022

🌡Performance Management: Mastery (Part II: First Time Right - The Aim toward Operational Excellence)

 

Performance Management Series

Rooted in the Six Sigma methodology as a step toward operational excellence, First Time Right (FTR) implies that any procedure is performed in the right manner the first time and every time. It equates to minimizing waste in its various forms (inventory, motion, overprocessing, overproduction, waiting, transportation, defects). Like many quality concepts from the manufacturing industry, the concept was carried over into the software development process as a principle, process, goal and/or metric. Thus, it became part of Software Engineering, Project Management, Data Science, and other similar endeavors whose outcome is software products.

Besides the quality aspect, FTR is also rooted in the economic imperative – the need to achieve something in the minimum amount of time with the minimum of effort. It’s about being efficient in delivering a product or achieving a given target. It can be associated with continuous improvement, learning and mastery, the aim being to make FTR part of the organization’s culture.

Even if not explicitly declared, FTR lurks in each task planned. It seems to have become common practice to plan with FTR in mind, however, between this theoretical aim and practice there’s, as usual, an important gap. Unfortunately, planners, managers and even the people performing the tasks often forget that mistakes are made and that several iterations are needed to get the job done. It starts with the communication between people in clarifying the requirements and ends with the formal sign-off. All the deviations from FTR add up in the deviations between expected and actual effort, though probably more important are the deviations from the plan and all the consequences deriving from them. Especially in complex projects this adds up into a spiral of issues that can easily reinforce themselves.

Many of the jobs that imply creativity, innovation, research or exploration require at least several iterations to get the job done, and this is independent of the participants’ professionalism and experience. Moreover, the more quality one needs, the higher the effort, the 80/20 rule being sometimes a good approximation of the effort needed. In extremis, aiming for perfection instead of excellence can make certain tasks a never-ending story.

Achieving FTR requires practice - the more novelty and the higher the complexity, the communication or the synchronization needs, the more practice is needed. It starts with the individual mastering the individual tasks and ends with the team, where communication, synchronization and other aspects need to be considered. The practice is usually gained through hands-on work as part of the daily duties, project work, and so on. Unfortunately, it’s based primarily on individual experience and is seldom groomed in advance as preparation for future tasks. That’s why, when efficiency is needed in performing critical complex tasks, one also needs to consider the learning curve involved in achieving the required quality.

Of course, many organizations demand experience from job applicants and, when possible, hire people with experience; however, the diversity, complexity and changing nature of tasks require further practice. This aspect is somewhat recognized in organizations’ implementation of the various forms of DevOps, though how many organizations adopt them and enforce them on a regular basis? Moreover, a major requirement of today’s businesses is to be agile, and beyond the mere application of methodologies, being agile also means having an FTR mindset.

FTR starts with the wish for mastery at the individual and team level. With the right management attention - allocating time for learning and self-development in the important areas, providing relevant feedback, and building an infrastructure for knowledge sharing and harnessing - FTR can become part of the organization’s culture. It’s up to each of us to do it!

18 October 2022

💎SQL Reloaded: Successive Price Increases/Discounts via Windowing Functions and CTEs

I was trying today to solve a problem that apparently requires recursive common table expressions, though these are not (yet) available in the Azure Synapse serverless SQL pool. The problem can be summarized in the below table definition: given a set of Products with an initial Sales price, one needs to apply the Price Increases successively for each Cycle. The cumulated increase is simulated in the last column for each line.

Unfortunately, there is no SQL Server windowing function that allows multiplying the values of a column incrementally (similar to how a running total works). However, there’s a mathematical trick that can be used to transform a product into a sum of elements by applying the Exp (exponential) and Log (logarithm) functions (see Solution 1), and which frankly is more elegant than applying CTEs (see Solution 2).
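
The trick relies on the identity a*b = Exp(Log(a) + Log(b)), Log being the natural logarithm: summing the logarithms of the increases and exponentiating the result gives back their product. A quick sanity check:

-- checking the identity a*b = EXP(LOG(a) + LOG(b))
SELECT EXP(LOG(1.02) + LOG(1.03)) ViaLogs -- ~1.0506
, 1.02 * 1.03 Direct -- 1.0506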

-- create table with test data
SELECT *
INTO dbo.ItemPrices
FROM (VALUES ('ID001', 1000, 1, 1.02, '1.02')
, ('ID001', 1000, 2, 1.03, '1.02*1.03')
, ('ID001', 1000, 3, 1.03, '1.02*1.03*1.03')
, ('ID001', 1000, 4, 1.04, '1.02*1.03*1.03*1.04')
, ('ID002', 100, 1, 1.02, '1.02')
, ('ID002', 100, 2, 1.03, '1.02*1.03')
, ('ID002', 100, 3, 1.04, '1.02*1.03*1.04')
, ('ID002', 100, 4, 1.05, '1.02*1.03*1.04*1.05')
) DAT (ItemId, SalesPrice, Cycle, PriceIncrease, CumulatedIncrease)

-- reviewing the data
SELECT *
FROM dbo.ItemPrices

-- Solution 1: new sales prices with log & exp
SELECT ItemId
, SalesPrice
, Cycle
, PriceIncrease
, EXP(SUM(Log(PriceIncrease)) OVER(PARTITION BY Itemid ORDER BY Cycle)) CumulatedIncrease
, SalesPrice * EXP(SUM(Log(PriceIncrease)) OVER(PARTITION BY Itemid ORDER BY Cycle)) NewSalesPrice
FROM dbo.ItemPrices

-- Solution 2: new sales prices with recursive CTE
;WITH CTE 
AS (
-- initial record
SELECT ITP.ItemId
, ITP.SalesPrice
, ITP.Cycle
, ITP.PriceIncrease
, cast(ITP.PriceIncrease as decimal(38,6)) CumulatedIncrease
FROM dbo.ItemPrices ITP
WHERE ITP.Cycle = 1
UNION ALL
-- recursive part
SELECT ITP.ItemId
, ITP.SalesPrice
, ITP.Cycle
, ITP.PriceIncrease
, Cast(ITP.PriceIncrease * ITO.CumulatedIncrease as decimal(38,6))  CumulatedIncrease
FROM dbo.ItemPrices ITP
    JOIN CTE ITO
	  ON ITP.ItemId = ITO.ItemId
	 AND ITP.Cycle-1 = ITO.Cycle
)
-- final result
SELECT ItemId
, SalesPrice
, Cycle
, PriceIncrease
, CumulatedIncrease
, SalesPrice * CumulatedIncrease NewSalesPrice
FROM CTE
ORDER BY ItemId
, Cycle


-- validating the cumulated price increases (only last ones)
SELECT 1.02*1.03*1.03*1.04 
, 1.02*1.03*1.04*1.05

-- cleaning up
DROP TABLE IF EXISTS dbo.ItemPrices

Notes:
1. The logarithm in SQL Server’s implementation works only with positive numbers!
2. For simplification I transformed the percentages (e.g. 1%) into values that are easier to multiply with (e.g. 1.01). The solution can be easily modified to consider discounts (e.g. a 5% discount translates into a multiplication by 0.95).
3. When CTEs are not available, one is forced to return to the methods used in SQL Server 2000 (I've been there) and thus use temporary tables or table variables with loops (see the sketch after these notes). Moreover, the logic can be encapsulated in multi-statement table-valued functions (see example), unfortunately another feature not (yet) supported by serverless SQL pools.
4. Unfortunately, STRING_AGG, which concatenates values across rows, can be used only as an aggregate (i.e. together with a GROUP BY clause) and not as a window function. Anyway, its result would be useless without the availability of an Eval function in SQL (see example), however the Expr function available in data flows could be used as a workaround.
5. Even if I thought about the use of logarithms for transforming the product into a sum, I initially ignored the idea, thinking that the solution would be too complex to implement. So, the credit goes to another blogpost. Kudos!
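
As mentioned in note 3, here's a minimal sketch of the loop-based approach with a table variable (it assumes the dbo.ItemPrices test table created above):

-- Solution 3 (sketch): cumulating the increases via a table variable and a loop
DECLARE @Cycle int = 1;
DECLARE @Rows int;
DECLARE @Results TABLE (
  ItemId varchar(10)
, SalesPrice int
, [Cycle] int
, PriceIncrease decimal(10,6)
, CumulatedIncrease decimal(38,6)
);

-- seeding the first cycle
INSERT INTO @Results
SELECT ItemId, SalesPrice, [Cycle], PriceIncrease, PriceIncrease
FROM dbo.ItemPrices
WHERE [Cycle] = 1;
SET @Rows = @@ROWCOUNT;

-- multiplying in one further cycle per iteration
WHILE @Rows > 0
BEGIN
	SET @Cycle += 1;

	INSERT INTO @Results
	SELECT ITP.ItemId
	, ITP.SalesPrice
	, ITP.[Cycle]
	, ITP.PriceIncrease
	, Cast(ITP.PriceIncrease * RES.CumulatedIncrease as decimal(38,6))
	FROM dbo.ItemPrices ITP
	     JOIN @Results RES
	       ON ITP.ItemId = RES.ItemId
	      AND RES.[Cycle] = @Cycle - 1
	WHERE ITP.[Cycle] = @Cycle;

	SET @Rows = @@ROWCOUNT;
END

-- final result
SELECT *
, SalesPrice * CumulatedIncrease NewSalesPrice
FROM @Results
ORDER BY ItemId
, [Cycle];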

Happy coding!

21 August 2022

🧮ERP: Planning (Part V: It’s all about Partnership II)

When starting an ERP implementation project, an organization needs to fill the existing knowledge gaps in respect to whatever it takes to achieve the goals associated with the project. Therefore, it makes sense to work with an implementer that can help cover the gaps directly or indirectly. Moreover, it makes sense to establish a long-term relationship that would allow harnessing the ERP system’s capabilities after the project’s end, increasing the ROI and, why not, finding other areas of cooperation. That’s, in theory, what a partner does and what a strategic technology partnership is about – providing any kind of technological expertise the customer doesn't have in-house.

Unfortunately, the road from being a ‘service provider’ to becoming a ‘partner’ is a challenging one for many organizations, especially when this type of relationship is not understood and managed accordingly. Managing a partnership comes down to defining common goals, principles, values and processes, establishing a communication strategy and a common understanding of the challenges and the steps ahead, and providing visibility into the cost estimates, billing, and resources’ availability and utilization. Addressing these aspects offers a framework on which the partnership can flourish. Without considering these topics, the implementer remains just a 'service provider', no matter the names used to characterize the relationship.

Now, the use of the word ‘partner’ would make one think that only one partner is considered, typically a big to middle-sized organization that has this kind of resources. The main reason behind this is that the number of functional areas and the volume of skills required for an implementation are high compared with other projects, with the resources needing to be available on demand without affecting the other constraints: costs, quality, time. This can be challenging, therefore one can meet scenarios in which two or more external organizations are involved in the partnership, ideally organizations that complement each other.

It is also common in ERP implementations to engage individual consultants for specific areas or for the whole project. The principles and values of a partnership, as well as the framework behind it, can be applied to individual consultants too. Independently of where the resources come from, more important is the partnership ‘mindset’ - being together in the same boat, working together on a shared and understood strategy, with clear goals and objectives.

Moreover, the people participating in the project must have a ‘partner's mindset’ as well. Without this, the project will likely get different impulses in the wrong direction(s), as a group’s interests will take priority over those of the organization. Ideally, this mindset should extend to the whole organization, as topics like Data Quality and Process Improvement must be an organization-wide effort, deeply imprinted in the organization’s culture.

More than ever, it’s important for the business to see and treat the IT department as a ‘partner’ and not as a ‘service provider’ - by providing the needed level of transparency in requirements, issues, practices and processes, by treating the IT department as an equal party in decision-making, and by addressing its current and future strategic requirements. Ideally, this partnership should start long before the implementation, given that it takes time for mentalities and practices to change, and for knowledge to be acquired and used appropriately.

Building a partnership takes time, effort and strategic thinking, all on top of the actual implementation, thus increasing the overall complexity, at least at the beginning. Does it pay off? Like in a marriage, it’s useful to have somebody you can trust, who knows you, whom you can rely upon, and with whom you can talk to find solutions. However, only time will tell whether such expectations are met and kept till the end.


09 August 2022

🧭🪄Business Intelligence: Power BI (Part I: Power BI's Learning Curve I)

A learning curve attempts to depict the (average) time it takes a person to learn how to use a method, tool, or technique, tracing the path from newbie to mastery. A common definition of the learning curve is based on the correlation between a learner’s performance on a task or activity and the number of attempts or amount of time required to complete it.

There are several diagrams in circulation which depict the correlation between the difficulty of Power BI concepts and, presumably, their implementation as functionality. Even if they reflect to some degree the rate of learning, their simplicity and fuzziness can easily make one question how accurately they reflect reality.

Researchers tend to categorize the curves associated with the learning process into simple idealized patterns like the S-curve (aka sigmoid), exponential growth, exponential rise and fall to a limit, or power law. However, the learning process in IT-based endeavors is seldom characterized by a linear or exponential curve, given that the tasks seldom allow a steady path. The jumps of knowledge between tasks can be wide enough to appear insurmountable, and they can prove to be quite a challenge without some help.

Like a baby’s first steps, we, as learners, must first learn to crawl before making some unsteady steps, and it can take a long time until visible progress is made. It’s a slow progress until we suddenly hit a (tipping) point from which everything seems easy, a fact that increases our self-confidence. On the other side, when we find that we make no visible progress for a long period, it’s easy to arrive at the opposite, a critical zone, which in extremis could make one lose interest.

As beginners, after the first tipping point on the learning journey, it’s easy to arrive at a plateau on which there seems to be no need to learn new things, the current knowledge allowing one to handle a range of tasks of small to average complexity. This can last for a long time, and then a big thing comes our way – a hard problem to solve or a concept hard to understand. It’s the point where we stagnate, and the deeper we go, and the more such challenges are thrown our way, the more difficult the learning seems to be. However, with new understanding, small steps are made, one step after the other, and the pace increases until we again reach a critical point, from which the process continues smoothly until we seem to stagnate once more. We meet another hard limit to growth, which seems to be more solid than the previous one.

Power BI's learning curve

Both limits to growth can appear to be hard; however, considering that the knowledge in the field keeps expanding, more opportunities for growth appear, thus the limits are only apparent. Even if knowledge tends to increase ‘indefinitely’, the limits are there in terms of complexity, time available, knowledge quality (incl. availability) or any other dimension of the learning process. Moreover, these successions of tipping points, growth limits, plateaus, and critical, steady and fast progress zones can occur in several iterations along the learning path. Thus, the path seems to resemble a snakelike curve with many ups and downs.

For the learner it is important to be aware of this last aspect: there are always ups and downs, and it takes effort, patience and maybe an expert’s help to bridge the gaps in between. The chances are high that the gap between what we think we know and what we actually know is considerable, therefore a reality check is useful from time to time. A new problem to tackle will provide that occasion!


04 April 2021

💼Project Management: Lean Management (Part I: Between Value and Waste I - An Introduction)


Independently of whether Lean Management is considered in the context of Manufacturing, Software Development (SD), Project Management (PM) or any other business-related area, there are three fundamental business concepts on which the whole scaffolding of the Lean philosophies is built, namely value, value stream and waste.

From an economic standpoint, value refers to the monetary worth of a product, asset or service (further referred to as product) to an organization, while from a qualitative perspective, it refers to the perceived benefit associated with its usage. The value is thus reflected in the costs associated with a product’s delivery (producer’s perspective), respectively in the price paid for acquiring it and the degree to which the product can fulfill a demand (customer’s perspective).

Without diving too deep into the theory of product valuation, the challenges revolve around reducing the costs associated with a product’s delivery, respectively selling it at a price the customer is willing to pay, typically to address a given set of needs. Moreover, the customer is willing to pay only for the functions that satisfy the needs the product is thought to cover. From this friction of opposing driving forces, a product is designed and valued.

The value stream is the sequence of activities (also steps or processes) needed to deliver a product to customers. This formulation includes value-added and non-value-added activities, internal and external customers, and covers the full lifecycle of products and/or services in whatever form it occurs, whether or not it is perceived by the customers.

Waste is any activity that consumes resources but creates no value for the customers or, more generally, for the stakeholders, be they internal or external. Waste is typically associated with the non-value-added activities, activities that don’t produce value for stakeholders, and it can increase directly or indirectly the costs of products, especially when no attention is given to it and/or it is not recognized as such. Therefore, eliminating waste can have an important impact on products’ costs and has become one of the goals of Lean Management. Moreover, eliminating waste is an incremental process that, when put in the context of continuous improvement, can lead to process redesign and re-engineering.

Taiichi Ohno, the ‘father’ of the Toyota Production System (TPS), originally identified seven forms of waste (Japanese: muda): overproduction, waiting, transporting, inappropriate processing, unnecessary inventory, unnecessary/excess motion, and defects. Within the context of SD and PM, Tom and Mary Poppendieck [1] translated the types of waste into concepts closer to the language of software developers: partially done work, extra processes, extra features, task switching, waiting, motion and, of course, defects. To this list were later added further types of waste associated with resources, confusion and work conditions.

Defects in the form of errors and bugs, ineffective communication, rework and overwork, waiting, and repetitive activities like handoffs or even unnecessary meetings are usually the visible part of products and projects, and they are important from the perspective of stakeholders, who in extremis can become sensitive when their volume increases out of proportion.

Unfortunately, lurking in the deep waters of projects and wrecking everything that stands in their way are the other forms of waste, less perceivable from the stakeholders’ side: unclear requirements/goals, code not released or not tested, specifications not implemented, scrapped code, overutilized/underutilized resources, bureaucracy, suboptimal processes, unnecessary optimization, searching for information, mismanagement, task switching, improper work conditions, confusion - to mention just the important activities associated with waste.

Because of their elusive nature, and independently of whether they are visible to stakeholders or not, they all impact the costs of projects and products when proper attention is not given to them and they are not handled accordingly.

Lean Management - The Waste Iceberg

References:
[1] Mary Poppendieck & Tom Poppendieck (2003) Lean Software Development: An Agile Toolkit, Addison Wesley, ISBN: 0-321-15078-3

01 April 2021

💎SQL Reloaded: Processing JSON Files with Complex Structure in SQL Server 2016+

Unfortunately (or fortunately, for the challenge-searchers), not all JSON data files have a simple (matrix) structure, while the data might not even have a proper (readable) definition. It's the case of the unemployment data provided by the Cologne municipality (source). However, with a web page translator and some small effort one can identify the approximate data definition:

Source Field | Target Field | Data Type
AM_ALO_INSG_AA | ALO_Total | int
AM_ALO_SGB2_AA | ALO_SGB2 | int
AM_ALO_UNTER25_AA | ALO_Under25 | int
AM_ALO_INSG_AP | ALO_Total_Perc | float
AM_ALO_SGB2_AP | ALO_SGB2_Perc | float
AM_ALO_UNTER25_AP | ALO_Under25_Perc | float
AM_ALO_INSG_HA | ALO_Total_Hist | int
AM_ALO_SGB2_HA | ALO_SGB2_Hist | int
AM_ALO_UNTER25_HA | ALO_Under25_Hist | int
AM_ALO_INSG_HP | ALO_Total_HistPerc | float
AM_ALO_SGB2_HP | ALO_SGB2_HistPerc | float
AM_ALO_UNTER25_HP | ALO_Under25_HistPerc | float
AM_SVB_INSG_AA | SVB_Total | int
AM_SVB_MANN_AA | SVB_Men | int
AM_SVB_FRAU_AA | SVB_Women | int
AM_SVB_DEUTSCH_AA | SVB_German | int
AM_SVB_AUSLAND_AA | SVB_AUSLAND | int
AM_SVB_U25J_AA | SVB_Under25Yo | int
AM_SVB_UEBER55J_AA | SVB_Over55Yo | int
AM_SVB_INSG_AP | SVB_Total_Perc | float
AM_SVB_MANN_AP | SVB_Men_Perc | float
AM_SVB_FRAU_AP | SVB_Women_Perc | float
AM_SVB_DEUTSCH_AP | SVB_German_Perc | float
AM_SVB_AUSLAND_AP | SVB_AUSLAND_Perc | float
AM_SVB_U25J_AP | SVB_Under25Yo_Perc | float
AM_SVB_UEBER55J_AP | SVB_Over55Yo_Perc | float
AM_SVB_INSG_HA | SVB_Total_Hist | int
AM_SVB_MANN_HA | SVB_Men_Hist | int
AM_SVB_FRAU_HA | SVB_Women_Hist | int
AM_SVB_DEUTSCH_HA | SVB_German_Hist | int
AM_SVB_AUSLAND_HA | SVB_AUSLAND_Hist | int
AM_SVB_U25J_HA | SVB_Under25Yo_Hist | int
AM_SVB_UEBER55J_HA | SVB_Over55Yo_Hist | int
AM_SVB_INSG_HP | SVB_Total_HistPerc | float
AM_SVB_MANN_HP | SVB_Men_HistPerc | float
AM_SVB_FRAU_HP | SVB_Women_HistPerc | float
AM_SVB_DEUTSCH_HP | SVB_German_HistPerc | float
AM_SVB_AUSLAND_HP | SVB_AUSLAND_HistPerc | float
AM_SVB_U25J_HP | SVB_Under25Yo_HistPerc | float
AM_SVB_UEBER55J_HP | SVB_Over55Yo_HistPerc | float
SHAPE.AREA | SHAPE.AREA | int
SHAPE.LEN | SHAPE.LEN | int

As previously stated (see post), it makes sense to build the logic over several iterations, first making sure that the references to the file's columns are used correctly (observe the way the various elements are referenced in the queries):

SELECT DAT.ObjectId 
, DAT.Nummer
, DAT.Name
, DAT.ALO_Total
, DAT.ALO_SGB2
FROM OPENROWSET (BULK 'D:\data\Arbeitsmarkt Statistik Koeln Stadtteil.json',CODEPAGE='65001', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$.features')
 WITH( 
	  ObjectId int '$.properties.OBJECTID'
	, Nummer int '$.properties.NUMMER'
	, Name nvarchar(max) '$.properties.NAME'
	, ALO_Total int '$.properties.AM_ALO_INSG_AA'
	, ALO_SGB2 int '$.properties.AM_ALO_SGB2_AA'
) AS DAT; 

Output (first 13 records):
ObjectId | Nummer | Name | ALO_Total | ALO_SGB2
1 | 211 | Godorf | 121 | 92
2 | 308 | LÃ¶venich | 156 | 83
3 | 307 | Weiden | 552 | 383
4 | 306 | Junkersdorf | 305 | 164
5 | 309 | Widdersdorf | 200 | 93
6 | 404 | Vogelsang | 310 | 208
7 | 505 | Weidenpesch | 527 | 351
8 | 502 | Mauenheim | 190 | 127
12 | 207 | Hahnwald | 15 | 3
13 | 213 | Meschenich | 644 | 554

The logic seems to work, however the German umlauts aren't displayed as expected ('LÃ¶venich' when it should have been 'Lövenich'). This is caused by the differences in character sets. An easy way to address this is to use a function which does the conversion (see the dbo.ReplaceCodes2Umlauts UDF from an older post).
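
In case the UDF from the older post isn't at hand, a minimal sketch of such a function could look as follows (it covers only the usual UTF-8 sequences misread as Latin-1; the original UDF may handle further cases):

-- sketch: mapping mis-encoded UTF-8 sequences back to German umlauts
CREATE OR ALTER FUNCTION dbo.ReplaceCodes2Umlauts(
  @string nvarchar(max))
RETURNS nvarchar(max)
AS
BEGIN
	SET @string = Replace(@string, N'Ã¤', N'ä');
	SET @string = Replace(@string, N'Ã¶', N'ö');
	SET @string = Replace(@string, N'Ã¼', N'ü');
	SET @string = Replace(@string, N'ÃŸ', N'ß');
	RETURN @string;
END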

By applying the function on the Names, adding the further columns and an INTO clause, the query becomes:
 
-- importing the JSON file
SELECT DAT.ObjectId 
, DAT.Nummer
, dbo.ReplaceCodes2Umlauts(DAT.Name) Name
, DAT.ALO_Total
, DAT.ALO_SGB2
, DAT.ALO_Under25
, DAT.ALO_Total_Perc
, DAT.ALO_SGB2_Perc
, DAT.ALO_Under25_Perc
, DAT.ALO_Total_Hist
, DAT.ALO_SGB2_Hist
, DAT.ALO_Under25_Hist
, DAT.ALO_Total_HistPerc
, DAT.ALO_SGB2_HistPerc
, DAT.ALO_Under25_HistPerc
, DAT.SVB_Total
, DAT.SVB_Men
, DAT.SVB_Women
, DAT.SVB_German
, DAT.SVB_AUSLAND
, DAT.SVB_Under25Yo
, DAT.SVB_Over55Yo
, DAT.SVB_Total_Perc
, DAT.SVB_Men_Perc
, DAT.SVB_Women_Perc
, DAT.SVB_German_Perc
, DAT.SVB_AUSLAND_Perc
, DAT.SVB_Under25Yo_Perc
, DAT.SVB_Over55Yo_Perc
, DAT.SVB_Total_Hist
, DAT.SVB_Men_Hist
, DAT.SVB_Women_Hist
, DAT.SVB_German_Hist
, DAT.SVB_AUSLAND_Hist
, DAT.SVB_Under25Yo_Hist
, DAT.SVB_Over55Yo_Hist
, DAT.SVB_Total_HistPerc
, DAT.SVB_Men_HistPerc
, DAT.SVB_Women_HistPerc
, DAT.SVB_German_HistPerc
, DAT.SVB_AUSLAND_HistPerc
, DAT.SVB_Under25Yo_HistPerc
, DAT.SVB_Over55Yo_HistPerc
, DAT.Shape_Area 
, DAT.Shape_Len 
INTO dbo.Unemployment_Cologne
FROM OPENROWSET (BULK 'D:\data\Arbeitsmarkt Statistik Koeln Stadtteil.json',CODEPAGE='65001', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$.features')
 WITH( 
	  ObjectId int '$.properties.OBJECTID'
	, Nummer int '$.properties.NUMMER'
	, Name nvarchar(max) '$.properties.NAME'
	, ALO_Total int '$.properties.AM_ALO_INSG_AA'
	, ALO_SGB2 int '$.properties.AM_ALO_SGB2_AA'
	, ALO_Under25 int '$.properties.AM_ALO_UNTER25_AA'
	, ALO_Total_Perc float '$.properties.AM_ALO_INSG_AP'
	, ALO_SGB2_Perc float '$.properties.AM_ALO_SGB2_AP'
	, ALO_Under25_Perc float '$.properties.AM_ALO_UNTER25_AP'
	, ALO_Total_Hist int '$.properties.AM_ALO_INSG_HA'
	, ALO_SGB2_Hist int '$.properties.AM_ALO_SGB2_HA'
	, ALO_Under25_Hist int '$.properties.AM_ALO_UNTER25_HA'
	, ALO_Total_HistPerc float '$.properties.AM_ALO_INSG_HP'
	, ALO_SGB2_HistPerc float '$.properties.AM_ALO_SGB2_HP'
	, ALO_Under25_HistPerc float '$.properties.AM_ALO_UNTER25_HP'
	, SVB_Total int '$.properties.AM_SVB_INSG_AA'
	, SVB_Men int '$.properties.AM_SVB_MANN_AA'
	, SVB_Women int '$.properties.AM_SVB_FRAU_AA'
	, SVB_German int '$.properties.AM_SVB_DEUTSCH_AA'
	, SVB_AUSLAND int '$.properties.AM_SVB_AUSLAND_AA'
	, SVB_Under25Yo int '$.properties.AM_SVB_U25J_AA'
	, SVB_Over55Yo int '$.properties.AM_SVB_UEBER55J_AA'
	, SVB_Total_Perc float '$.properties.AM_SVB_INSG_AP'
	, SVB_Men_Perc float '$.properties.AM_SVB_MANN_AP'
	, SVB_Women_Perc float '$.properties.AM_SVB_FRAU_AP'
	, SVB_German_Perc float '$.properties.AM_SVB_DEUTSCH_AP'
	, SVB_AUSLAND_Perc float '$.properties.AM_SVB_AUSLAND_AP'
	, SVB_Under25Yo_Perc float '$.properties.AM_SVB_U25J_AP'
	, SVB_Over55Yo_Perc float '$.properties.AM_SVB_UEBER55J_AP'
	, SVB_Total_Hist int '$.properties.AM_SVB_INSG_HA'
	, SVB_Men_Hist int '$.properties.AM_SVB_MANN_HA'
	, SVB_Women_Hist int '$.properties.AM_SVB_FRAU_HA'
	, SVB_German_Hist int '$.properties.AM_SVB_DEUTSCH_HA'
	, SVB_AUSLAND_Hist int '$.properties.AM_SVB_AUSLAND_HA'
	, SVB_Under25Yo_Hist int '$.properties.AM_SVB_U25J_HA'
	, SVB_Over55Yo_Hist int '$.properties.AM_SVB_UEBER55J_HA'
	, SVB_Total_HistPerc float '$.properties.AM_SVB_INSG_HP'
	, SVB_Men_HistPerc float '$.properties.AM_SVB_MANN_HP'
	, SVB_Women_HistPerc float '$.properties.AM_SVB_FRAU_HP'
	, SVB_German_HistPerc float '$.properties.AM_SVB_DEUTSCH_HP'
	, SVB_AUSLAND_HistPerc float '$.properties.AM_SVB_AUSLAND_HP'
	, SVB_Under25Yo_HistPerc float '$.properties.AM_SVB_U25J_HP'
	, SVB_Over55Yo_HistPerc float '$.properties.AM_SVB_UEBER55J_HP'
	, Shape_Area float '$.properties."SHAPE.AREA"'
	, Shape_Len float '$.properties."SHAPE.LEN"'
) AS DAT; 

Once the data is made available, one can go on and explore the data and the relationships existing between the various columns.
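
For example, a first look at the districts with the highest unemployment percentages (a quick sketch over the columns imported above):

-- districts with the highest unemployment percentage
SELECT TOP 10 Name
, ALO_Total
, ALO_Total_Perc
FROM dbo.Unemployment_Cologne
ORDER BY ALO_Total_Perc DESC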

Happy coding!

💎SQL Reloaded: Processing JSON Files with Flat Matrix Structure in SQL Server 2016+

Besides the CSV format, many of the data files made available under the open data initiatives are stored in the JSON format, which makes the data more difficult to process, even if JSON offers a richer structure that goes beyond the tabular structure of CSV files. Fortunately, starting with SQL Server 2016, JSON became a native format, which makes the processing of JSON files relatively easy, the ease with which one can process the data depending on how it is structured.

Let’s consider as example a JSON file with the world population per country and year that can be downloaded from DataHub (source). The structure behind it resembles a tabular model (see the table on the source website), having a flat structure. Just export the data to a file with the JSON extension (e.g. ‘population-figures-by-country.json’) locally (e.g. to ‘D:/Data’). The next step is to understand the file’s structure. Some repositories provide good documentation in this respect, though there are also many exceptions. Having a JSON editor like Visual Studio which reveals the structure makes the process easier.

As in the case of CSV files, one needs to infer the data types. There are two alphanumeric fields (Country & Country Code), while the remaining fields are numeric. The only challenge raised by the data seems to be the difference in format between the years 2002 to 2015 and the other years, as the values of the former contain a decimal part, although all the numeric values should have been whole values.

It’s recommended to start small and build the logic iteratively. Therefore, as a first step just look at the file's content via the OPENROWSET function:

-- looking at the JSON file 
SELECT *
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 

In a second step one can add the OPENJSON function, looking only at the first record. Without a WITH clause, OPENJSON returns each property as a generic key/value/type triple, which helps in discovering the field names:

-- querying a json file (one record)
SELECT *
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$[0]')

In a third step one can add a few columns (e.g. Country & Country Code) to make sure that the select statement works correctly. 

-- querying a json file (all records, a few fields)
SELECT Country 
, CountryCode 
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$')
 WITH ( 
  Country nvarchar(max) '$.Country'
, CountryCode nvarchar(3) '$.Country_Code'
) AS DAT; 

In the next step one can add all the columns and import the data into a table (e.g. dbo.CountryPopulation) on the fly:

-- importing a json file (all records) on the fly
SELECT DAT.Country
, DAT.CountryCode
, DAT.Y1960
, DAT.Y1961
, DAT.Y1962
, DAT.Y1963
, DAT.Y1964
, DAT.Y1965
, DAT.Y1966
, DAT.Y1967
, DAT.Y1968
, DAT.Y1969
, DAT.Y1970
, DAT.Y1971
, DAT.Y1972
, DAT.Y1973
, DAT.Y1974
, DAT.Y1975
, DAT.Y1976
, DAT.Y1977
, DAT.Y1978
, DAT.Y1979
, DAT.Y1980
, DAT.Y1981
, DAT.Y1982
, DAT.Y1983
, DAT.Y1984
, DAT.Y1985
, DAT.Y1986
, DAT.Y1987
, DAT.Y1988
, DAT.Y1989
, DAT.Y1990
, DAT.Y1991
, DAT.Y1992
, DAT.Y1993
, DAT.Y1994
, DAT.Y1995
, DAT.Y1996
, DAT.Y1997
, DAT.Y1998
, DAT.Y1999
, DAT.Y2000
, DAT.Y2001
, Cast(DAT.Y2002 as bigint) Y2002
, Cast(DAT.Y2003 as bigint) Y2003
, Cast(DAT.Y2004 as bigint) Y2004
, Cast(DAT.Y2005 as bigint) Y2005
, Cast(DAT.Y2006 as bigint) Y2006
, Cast(DAT.Y2007 as bigint) Y2007
, Cast(DAT.Y2008 as bigint) Y2008
, Cast(DAT.Y2009 as bigint) Y2009
, Cast(DAT.Y2010 as bigint) Y2010
, Cast(DAT.Y2011 as bigint) Y2011
, Cast(DAT.Y2012 as bigint) Y2012
, Cast(DAT.Y2013 as bigint) Y2013
, Cast(DAT.Y2014 as bigint) Y2014
, Cast(DAT.Y2015 as bigint) Y2015
, DAT.Y2016
INTO dbo.CountryPopulation
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$')
 WITH ( 
  Country nvarchar(max) '$.Country'
, CountryCode nvarchar(3) '$.Country_Code'
, Y1960 bigint '$.Year_1960'
, Y1961 bigint '$.Year_1961'
, Y1962 bigint '$.Year_1962'
, Y1963 bigint '$.Year_1963'
, Y1964 bigint '$.Year_1964'
, Y1965 bigint '$.Year_1965'
, Y1966 bigint '$.Year_1966'
, Y1967 bigint '$.Year_1967'
, Y1968 bigint '$.Year_1968'
, Y1969 bigint '$.Year_1969'
, Y1970 bigint '$.Year_1970'
, Y1971 bigint '$.Year_1971'
, Y1972 bigint '$.Year_1972'
, Y1973 bigint '$.Year_1973'
, Y1974 bigint '$.Year_1974'
, Y1975 bigint '$.Year_1975'
, Y1976 bigint '$.Year_1976'
, Y1977 bigint '$.Year_1977'
, Y1978 bigint '$.Year_1978'
, Y1979 bigint '$.Year_1979'
, Y1980 bigint '$.Year_1980'
, Y1981 bigint '$.Year_1981'
, Y1982 bigint '$.Year_1982'
, Y1983 bigint '$.Year_1983'
, Y1984 bigint '$.Year_1984'
, Y1985 bigint '$.Year_1985'
, Y1986 bigint '$.Year_1986'
, Y1987 bigint '$.Year_1987'
, Y1988 bigint '$.Year_1988'
, Y1989 bigint '$.Year_1989'
, Y1990 bigint '$.Year_1990'
, Y1991 bigint '$.Year_1991'
, Y1992 bigint '$.Year_1992'
, Y1993 bigint '$.Year_1993'
, Y1994 bigint '$.Year_1994'
, Y1995 bigint '$.Year_1995'
, Y1996 bigint '$.Year_1996'
, Y1997 bigint '$.Year_1997'
, Y1998 bigint '$.Year_1998'
, Y1999 bigint '$.Year_1999'
, Y2000 bigint '$.Year_2000'
, Y2001 bigint '$.Year_2001'
, Y2002 decimal(19,1) '$.Year_2002'
, Y2003 decimal(19,1) '$.Year_2003'
, Y2004 decimal(19,1) '$.Year_2004'
, Y2005 decimal(19,1) '$.Year_2005'
, Y2006 decimal(19,1) '$.Year_2006'
, Y2007 decimal(19,1) '$.Year_2007'
, Y2008 decimal(19,1) '$.Year_2008'
, Y2009 decimal(19,1) '$.Year_2009'
, Y2010 decimal(19,1) '$.Year_2010'
, Y2011 decimal(19,1) '$.Year_2011'
, Y2012 decimal(19,1) '$.Year_2012'
, Y2013 decimal(19,1) '$.Year_2013'
, Y2014 decimal(19,1) '$.Year_2014'
, Y2015 decimal(19,1) '$.Year_2015'
, Y2016 bigint '$.Year_2016'
) AS DAT; 

As can be seen, the decimal values were converted to bigint to preserve the same definition. Moreover, this eases processing the data later, as no additional (implicit) conversions are necessary.

Also, the columns’ names were changed, either for simplification/convenience or simply for taste.

Writing such a monster query can be time-consuming, though preparing the metadata in Excel can considerably decrease the effort. With copy-paste and a few tricks (e.g. replacing values, splitting columns based on a delimiter) one can easily prepare such a structure:

Source field | Target field | DataType | Value | Import Clause | Select Clause
Country | Country | nvarchar(max) | Yemen, Rep. | , Country nvarchar(max) '$.Country' | , DAT.Country
Country_Code | CountryCode | nvarchar(3) | YEM | , CountryCode nvarchar(3) '$.Country_Code' | , DAT.CountryCode
Year_1960 | Y1960 | bigint | 5172135 | , Y1960 bigint '$.Year_1960' | , DAT.Y1960
Year_1961 | Y1961 | bigint | 5260501 | , Y1961 bigint '$.Year_1961' | , DAT.Y1961
Year_1962 | Y1962 | bigint | 5351799 | , Y1962 bigint '$.Year_1962' | , DAT.Y1962
Year_1963 | Y1963 | bigint | 5446063 | , Y1963 bigint '$.Year_1963' | , DAT.Y1963
Year_1964 | Y1964 | bigint | 5543339 | , Y1964 bigint '$.Year_1964' | , DAT.Y1964
Year_1965 | Y1965 | bigint | 5643643 | , Y1965 bigint '$.Year_1965' | , DAT.Y1965
Year_1966 | Y1966 | bigint | 5748588 | , Y1966 bigint '$.Year_1966' | , DAT.Y1966
Year_1967 | Y1967 | bigint | 5858638 | , Y1967 bigint '$.Year_1967' | , DAT.Y1967
Year_1968 | Y1968 | bigint | 5971407 | , Y1968 bigint '$.Year_1968' | , DAT.Y1968
Year_1969 | Y1969 | bigint | 6083619 | , Y1969 bigint '$.Year_1969' | , DAT.Y1969
Year_1970 | Y1970 | bigint | 6193810 | , Y1970 bigint '$.Year_1970' | , DAT.Y1970
Year_1971 | Y1971 | bigint | 6300554 | , Y1971 bigint '$.Year_1971' | , DAT.Y1971
Year_1972 | Y1972 | bigint | 6407295 | , Y1972 bigint '$.Year_1972' | , DAT.Y1972
Year_1973 | Y1973 | bigint | 6523452 | , Y1973 bigint '$.Year_1973' | , DAT.Y1973
Year_1974 | Y1974 | bigint | 6661566 | , Y1974 bigint '$.Year_1974' | , DAT.Y1974
Year_1975 | Y1975 | bigint | 6830692 | , Y1975 bigint '$.Year_1975' | , DAT.Y1975
Year_1976 | Y1976 | bigint | 7034868 | , Y1976 bigint '$.Year_1976' | , DAT.Y1976
Year_1977 | Y1977 | bigint | 7271872 | , Y1977 bigint '$.Year_1977' | , DAT.Y1977
Year_1978 | Y1978 | bigint | 7536764 | , Y1978 bigint '$.Year_1978' | , DAT.Y1978
Year_1979 | Y1979 | bigint | 7821552 | , Y1979 bigint '$.Year_1979' | , DAT.Y1979
Year_1980 | Y1980 | bigint | 8120497 | , Y1980 bigint '$.Year_1980' | , DAT.Y1980
Year_1981 | Y1981 | bigint | 8434017 | , Y1981 bigint '$.Year_1981' | , DAT.Y1981
Year_1982 | Y1982 | bigint | 8764621 | , Y1982 bigint '$.Year_1982' | , DAT.Y1982
Year_1983 | Y1983 | bigint | 9111097 | , Y1983 bigint '$.Year_1983' | , DAT.Y1983
Year_1984 | Y1984 | bigint | 9472170 | , Y1984 bigint '$.Year_1984' | , DAT.Y1984
Year_1985 | Y1985 | bigint | 9847899 | , Y1985 bigint '$.Year_1985' | , DAT.Y1985
Year_1986 | Y1986 | bigint | 10232733 | , Y1986 bigint '$.Year_1986' | , DAT.Y1986
Year_1987 | Y1987 | bigint | 10628585 | , Y1987 bigint '$.Year_1987' | , DAT.Y1987
Year_1988 | Y1988 | bigint | 11051504 | , Y1988 bigint '$.Year_1988' | , DAT.Y1988
Year_1989 | Y1989 | bigint | 11523267 | , Y1989 bigint '$.Year_1989' | , DAT.Y1989
Year_1990 | Y1990 | bigint | 12057039 | , Y1990 bigint '$.Year_1990' | , DAT.Y1990
Year_1991 | Y1991 | bigint | 12661614 | , Y1991 bigint '$.Year_1991' | , DAT.Y1991
Year_1992 | Y1992 | bigint | 13325583 | , Y1992 bigint '$.Year_1992' | , DAT.Y1992
Year_1993 | Y1993 | bigint | 14017239 | , Y1993 bigint '$.Year_1993' | , DAT.Y1993
Year_1994 | Y1994 | bigint | 14692686 | , Y1994 bigint '$.Year_1994' | , DAT.Y1994
Year_1995 | Y1995 | bigint | 15320653 | , Y1995 bigint '$.Year_1995' | , DAT.Y1995
Year_1996 | Y1996 | bigint | 15889449 | , Y1996 bigint '$.Year_1996' | , DAT.Y1996
Year_1997 | Y1997 | bigint | 16408954 | , Y1997 bigint '$.Year_1997' | , DAT.Y1997
Year_1998 | Y1998 | bigint | 16896210 | , Y1998 bigint '$.Year_1998' | , DAT.Y1998
Year_1999 | Y1999 | bigint | 17378098 | , Y1999 bigint '$.Year_1999' | , DAT.Y1999
Year_2000 | Y2000 | bigint | 17874725 | , Y2000 bigint '$.Year_2000' | , DAT.Y2000
Year_2001 | Y2001 | bigint | 18390135 | , Y2001 bigint '$.Year_2001' | , DAT.Y2001
Year_2002 | Y2002 | decimal(19,1) | 18919179.0 | , Y2002 decimal(19,1) '$.Year_2002' | , Cast(DAT.Y2002 as bigint) Y2002
Year_2003 | Y2003 | decimal(19,1) | 19462086.0 | , Y2003 decimal(19,1) '$.Year_2003' | , Cast(DAT.Y2003 as bigint) Y2003
Year_2004 | Y2004 | decimal(19,1) | 20017068.0 | , Y2004 decimal(19,1) '$.Year_2004' | , Cast(DAT.Y2004 as bigint) Y2004
Year_2005 | Y2005 | decimal(19,1) | 20582927.0 | , Y2005 decimal(19,1) '$.Year_2005' | , Cast(DAT.Y2005 as bigint) Y2005
Year_2006 | Y2006 | decimal(19,1) | 21160534.0 | , Y2006 decimal(19,1) '$.Year_2006' | , Cast(DAT.Y2006 as bigint) Y2006
Year_2007 | Y2007 | decimal(19,1) | 21751605.0 | , Y2007 decimal(19,1) '$.Year_2007' | , Cast(DAT.Y2007 as bigint) Y2007
Year_2008 | Y2008 | decimal(19,1) | 22356391.0 | , Y2008 decimal(19,1) '$.Year_2008' | , Cast(DAT.Y2008 as bigint) Y2008
Year_2009 | Y2009 | decimal(19,1) | 22974929.0 | , Y2009 decimal(19,1) '$.Year_2009' | , Cast(DAT.Y2009 as bigint) Y2009
Year_2010 | Y2010 | decimal(19,1) | 23606779.0 | , Y2010 decimal(19,1) '$.Year_2010' | , Cast(DAT.Y2010 as bigint) Y2010
Year_2011 | Y2011 | decimal(19,1) | 24252206.0 | , Y2011 decimal(19,1) '$.Year_2011' | , Cast(DAT.Y2011 as bigint) Y2011
Year_2012 | Y2012 | decimal(19,1) | 24909969.0 | , Y2012 decimal(19,1) '$.Year_2012' | , Cast(DAT.Y2012 as bigint) Y2012
Year_2013 | Y2013 | decimal(19,1) | 25576322.0 | , Y2013 decimal(19,1) '$.Year_2013' | , Cast(DAT.Y2013 as bigint) Y2013
Year_2014 | Y2014 | decimal(19,1) | 26246327.0 | , Y2014 decimal(19,1) '$.Year_2014' | , Cast(DAT.Y2014 as bigint) Y2014
Year_2015 | Y2015 | decimal(19,1) | 26916207.0 | , Y2015 decimal(19,1) '$.Year_2015' | , Cast(DAT.Y2015 as bigint) Y2015
Year_2016 | Y2016 | bigint | 27584213 | , Y2016 bigint '$.Year_2016' | , DAT.Y2016

Based on this structure, one can add two further formulas in Excel to prepare the statements as above and then copy the fields (the last two columns were generated using the below formulas):

=", " & TRIM(B2) & " " & C2 & " '$." & TRIM(A2) & "'" 
=", DAT." & TRIM(B2)

Consuming data in which the values are stored in a matrix structure can sometimes involve further challenges, even if this type of storage tends to save space. For example, adding the values for a new year would involve extending the table with one more column, while performing calculations between years would involve referencing each column in formulas. Therefore, transforming the data from a matrix to a normalized structure can have its benefits. This can be achieved by writing a query via the UNPIVOT operator:

-- unpivoting the data 
SELECT RES.Country
, RES.CountryCode
, Cast(Replace(RES.[Year], 'Y', '') as int) [Year]
, RES.Population
--INTO dbo.CountryPopulationPerYear
FROM 
( -- basis data
	SELECT Country
	, CountryCode
	, Y1960, Y1961, Y1962, Y1963, Y1964, Y1965, Y1966, Y1967, Y1968, Y1969
	, Y1970, Y1971, Y1972, Y1973, Y1974, Y1975, Y1976, Y1977, Y1978, Y1979
	, Y1980, Y1981, Y1982, Y1983, Y1984, Y1985, Y1986, Y1987, Y1988, Y1989
	, Y1990, Y1991, Y1992, Y1993, Y1994, Y1995, Y1996, Y1997, Y1998, Y1999
	, Y2000, Y2001, Y2002, Y2003, Y2004, Y2005, Y2006, Y2007, Y2008, Y2009
	, Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016
	FROM dbo.CountryPopulation
) DAT
UNPIVOT  -- unpivot logic
   (Population FOR [Year] IN  (Y1960, Y1961, Y1962, Y1963, Y1964, Y1965, Y1966, Y1967, Y1968, Y1969
, Y1970, Y1971, Y1972, Y1973, Y1974, Y1975, Y1976, Y1977, Y1978, Y1979
, Y1980, Y1981, Y1982, Y1983, Y1984, Y1985, Y1986, Y1987, Y1988, Y1989
, Y1990, Y1991, Y1992, Y1993, Y1994, Y1995, Y1996, Y1997, Y1998, Y1999
, Y2000, Y2001, Y2002, Y2003, Y2004, Y2005, Y2006, Y2007, Y2008, Y2009
, Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016)
) RES

Also, this can be performed in two steps: first preparing the query, and in a final step inserting the data into a table (e.g. dbo.CountryPopulationPerYear) on the fly (re-execute the previous query after uncommenting the INTO clause to generate the table).

--reviewing the data 
SELECT Country
, CountryCode
, AVG(Population) AveragePopulation
, Max(Population) - Min(Population) RangePopulation
FROM dbo.CountryPopulationPerYear
WHERE [Year] BETWEEN 2010 AND 2019
GROUP BY Country
, CountryCode
ORDER BY Country

On the other side, making comparisons between consecutive years is easier when using a matrix structure:

--reviewing the data 
SELECT Country
, CountryCode
, Y2016
, Y2010
, Y2016-Y2010 [2016-2010]
, Y2011-Y2010 [2011-2010]
, Y2012-Y2011 [2012-2011]
, Y2013-Y2012 [2013-2012]
, Y2014-Y2013 [2014-2013]
, Y2015-Y2014 [2015-2014]
, Y2016-Y2015 [2016-2015]
FROM dbo.CountryPopulation
ORDER BY Country

Unless storage space is a problem, in theory one can store the data in both formats, as there can be requests which benefit from one structure or the other.
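
For the reverse direction, here's a sketch via the PIVOT operator (over a few years only, assuming dbo.CountryPopulationPerYear was generated as described above), which rebuilds the matrix structure:

-- pivoting the normalized data back into a matrix structure (a few years only)
SELECT Country
, CountryCode
, [2014], [2015], [2016]
FROM (
	SELECT Country
	, CountryCode
	, [Year]
	, Population
	FROM dbo.CountryPopulationPerYear
) DAT
PIVOT (
	Sum(Population) FOR [Year] IN ([2014], [2015], [2016])
) PVT
ORDER BY Country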

Happy coding!
