
17 March 2025

🏭🗒️Microsoft Fabric: Caching in Warehouse [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 17-Mar-2025

[Microsoft Fabric] Caching
  • {def} technique that improves the performance of data processing by storing frequently accessed data and metadata in a faster storage layer [1]
    • e.g. local memory, local SSD disk
    • ⇐ subsequent requests can be served faster, directly from the cache [1]
      • if a set of data has been previously accessed by a query, any subsequent queries will retrieve that data directly from the in-memory cache [1]
      • local memory operations are notably faster compared to fetching data from remote storage [1]
      • ⇐ significantly diminishes IO latency [1]
    • fully transparent to the user
    • consistently active and operates seamlessly in the background [1]
    • orchestrated and upheld by MF
      • it doesn't offer users the capability to manually clear the cache [1] 
    • provides transactional consistency
      • ensures that any modifications made to the data in storage after it was initially loaded into the in-memory cache still result in consistent data [1]
    • when the cache reaches its capacity threshold and fresh data is being read for the first time, objects that have remained unused for the longest duration will be removed from the cache [1]
      • process is enacted to create space for the influx of new data and maintain an optimal cache utilization strategy [1] 
  • {type} in-memory cache
    • data in cache is organized in a compressed columnar format (aka columnar storage) [1]
      • ⇐ optimized for analytical queries
      • each column of data is stored separately [1]
        • {benefit} allows for better compression [1]
          • since similar data values are stored together [1]
        • {benefit} reduces the memory footprint
    • when queries need to perform operations on a specific column, the engine can work more efficiently, speeding up query execution [1]
      • it doesn't have to process unnecessary data from other columns
      • can perform operations on multiple columns simultaneously [1]
        • taking advantage of modern multi-core processors [1]
    • when the engine retrieves data from storage, the data is transformed from its original file-based format into highly optimized structures in the in-memory cache [1]
    • {scenario} analytical workloads where queries involve scanning large amounts of data to perform aggregations, filtering, and other data manipulations [1]
  • {type} disk cache
    • complementary extension to the in-memory cache
    • any data loaded into the in-memory cache is also serialized to the SSD cache [1]
      • data removed from the in-memory cache remains within the SSD cache for an extended period
      • when a subsequent query requests the data, it is retrieved from the SSD cache into the in-memory cache more quickly [1]
  • {issue} cold run (aka cold cache) performance
    • the first 1-3 executions of a query perform noticeably slower than subsequent executions [2]
      • if the first run's performance is crucial, try manually creating statistics (aka pre-warming the cache; see the sketch after these notes) [2]
      • otherwise, one can rely on the automatic statistics generated in the first query run and leveraged in subsequent runs [2]
      • as long as the underlying data does not change significantly [2]
  • differentiated from
    • [Kusto] caching policy [link]
    • [Apache Spark] intelligent cache [link]
    • [Power BI] query caching [link]
    • [Azure] caching [link]
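
    A minimal T-SQL sketch of such a pre-warming step, creating statistics manually before the first critical run; the table and column names below are hypothetical:

    -- pre-warming the cold cache by creating statistics manually (hypothetical table/column)
    CREATE STATISTICS stat_FactSales_OrderDate
    ON dbo.FactSales (OrderDate);

    -- the first critical query should then benefit from the statistics and the warmed cache
    SELECT OrderDate
    , Count(*) NumberOrders
    FROM dbo.FactSales
    GROUP BY OrderDate;
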
    References:
    [1] Microsoft Learn (2024) Fabric: Caching in Fabric data warehousing [link]
    [2] Microsoft Learn (2024) Fabric Data Warehouse performance guidelines [link]
    [3] Sandeep Pawar (2023) Pre-Warming The Direct Lake Dataset For Warm Cache Import-Like Performance [link]

    Resources:
    [R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

    Acronyms:
    IO - Input/Output
    MF - Microsoft Fabric
    SSD - Solid State Drive

    23 February 2025

    💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part IX: From OLTP to OLAP Data Models)

    With SQL databases, Microsoft brought OLTP to Microsoft Fabric, which allows addressing a wider range of requirements, though it also involves some challenges that are usually addressed by the transition from OLTP to OLAP architectures. Typically, an abstraction layer is built on top of the OLTP data models to address the various OLAP requirements. As soon as OLTP and OLAP models are mixed together, this opens the door to design and data quality issues that impact the adoption of solutions by users. Probably, those who worked with MS Access or even MS Excel, directly or in combination with SQL Server, can still remember the issues they ran into.

    Ideally, there should be a separation layer between the OLTP and the OLAP data. This can be easily achieved in SQL databases by using two different schemas that mimic the interaction between the two types of architectures. So, supposing that the dbo schema from SalesLT holds the data as maintained by the OLTP layer, one can add an additional schema Test in which the OLAP logic is modelled. This scenario is not ideal, though it allows modelling the two aspects of the topic considered. The following steps are to be performed in the environment in which the SalesLT database was created. 

    Independently of the layer in which one works, it's ideal to create a set of views that abstracts the logic and ideally simplifies the processing of data. So, in a first step it's recommended to abstract the data from the source by creating a set of views like the one below:

    -- drop view (cleaning)
    -- DROP VIEW IF EXISTS SalesLT.vCustomerLocations 
    
    -- create view
    CREATE VIEW SalesLT.vCustomerLocations
    -- Customers with main office
    AS
    SELECT CST.CustomerId 
    , CSA.AddressID
    , CST.Title
    , CST.FirstName 
    , IsNull(CST.MiddleName, '') MiddleName
    , CST.LastName 
    , CST.CompanyName 
    , CST.SalesPerson 
    , IsNull(CSA.AddressType, '') AddressType
    , IsNull(ADR.City, '') City
    , IsNull(ADR.StateProvince, '') StateProvince
    , IsNull(ADR.CountryRegion, '') CountryRegion
    , IsNull(ADR.PostalCode, '') PostalCode
    FROM SalesLT.Customer CST
    	 LEFT JOIN SalesLT.CustomerAddress CSA
    	   ON CST.CustomerID = CSA.CustomerID
    	  AND CSA.AddressType = 'Main Office'
    	 	LEFT JOIN SalesLT.Address ADR
    		  ON CSA.AddressID = ADR.AddressID
    

    The view uses LEFT instead of FULL joins because this allows more flexibility, respectively identifying the gaps existing between entities (e.g. customers without addresses). In these abstractions, the number of transformations is kept to a minimum so that the data stays close to the source. One may choose to minimize the occurrence of NULL values, as this simplifies the logic for comparisons (see the use of IsNull).

    Once the abstraction from the OLTP layer was built, one can make the data available in the OLAP layer:

    -- create schema
    CREATE SCHEMA Test
    
    -- dropping the target table (for cleaning)
    -- DROP TABLE IF EXISTS Test.CustomerLocations
    
    -- Option 1
    -- create the table on the fly
    SELECT *
    INTO Test.CustomerLocations
    FROM SalesLT.vCustomerLocations
    
    -- Option 2
    -- create the table manually (alternative to the previous step)
    CREATE TABLE [Test].[CustomerLocations](
    	[CustomerId] [int] NOT NULL,
    	[AddressID] [int] NULL,
    	[Title] [nvarchar](8) NULL,
    	[FirstName] [nvarchar](50) NULL,
    	[MiddleName] [nvarchar](50) NULL,
    	[LastName] [nvarchar](50) NULL,
    	[CompanyName] [nvarchar](128) NULL,
    	[SalesPerson] [nvarchar](256) NULL,
    	[AddressType] [nvarchar](50) NULL,
    	[City] [nvarchar](30) NULL,
    	[StateProvince] [nvarchar](50) NULL,
    	[CountryRegion] [nvarchar](50) NULL,
    	[PostalCode] [nvarchar](15) NULL
    ) ON [PRIMARY]
    GO
    
    -- insert records
    INSERT INTO Test.CustomerLocations
    SELECT *
    FROM SalesLT.vCustomerLocations
    
    
    -- checking the output (both scenarios)
    SELECT top 100 *
    FROM Test.CustomerLocations
    
    
    -- drop the view (for cleaning)
    -- DROP VIEW IF EXISTS Test.vCustomerLocations
    
    -- create view
    CREATE VIEW Test.vCustomerLocations
    -- Customer locations
    AS
    SELECT CSL.CustomerId 
    , CSL.AddressID
    , CSL.Title
    , CSL.FirstName 
    , CSL.MiddleName 
    , CSL.LastName 
    , Concat(CSL.FirstName, ' ' + CSL.MiddleName, ' ', CSL.LastName) FullName
    , CSL.CompanyName 
    , CSL.SalesPerson 
    , CSL.AddressType
    , CSL.City
    , CSL.StateProvince
    , CSL.CountryRegion 
    , CSL.PostalCode
    FROM Test.CustomerLocations CSL
    
    -- test the view
    SELECT top 100 *
    FROM Test.vCustomerLocations
    

    Further on, one can create additional objects as required. Usually, a set of well-designed views is enough, offering the needed flexibility with a minimum of code duplication. In addition, one can build stored procedures and table-valued functions as needed:

    -- drop the function (for cleaning)
    -- DROP FUNCTION IF EXISTS Test.tvfGetCustomerAddresses
    
    -- generated template - function
    CREATE FUNCTION Test.tvfGetCustomerAddresses (
        @CountryRegion nvarchar(50) NULL,
        @StateProvince nvarchar(50) NULL
    )
    RETURNS TABLE
    -- Customers by Country & State province
    AS
    RETURN (
    SELECT CSL.CustomerId 
    , CSL.AddressID
    , CSL.Title
    , CSL.FirstName 
    , CSL.MiddleName 
    , CSL.LastName 
    , CSL.FullName
    , CSL.CompanyName 
    , CSL.SalesPerson 
    , CSL.AddressType 
    , CSL.City
    , CSL.StateProvince 
    , CSL.CountryRegion 
    , CSL.PostalCode
    FROM Test.vCustomerLocations CSL
    WHERE CSL.CountryRegion = IsNull(@CountryRegion, CSL.CountryRegion)
      AND CSL.StateProvince = IsNull(@StateProvince, CSL.StateProvince)
    );
    
    -- retrieving all records
    SELECT *
    FROM Test.tvfGetCustomerAddresses(NULL, NULL)
    
    -- providing parameters
    SELECT *
    FROM Test.tvfGetCustomerAddresses('United States', 'Utah')
    
    -- filtering on non-parametrized columns
    SELECT *
    FROM Test.tvfGetCustomerAddresses('United States', 'Utah')
    WHERE City = 'Salt Lake City'
    
    
    
    -- drop the procedure (for cleaning)
    -- DROP PROCEDURE IF EXISTS Test.spGetCustomerAddresses 
    
    -- generated template - stored procedure
    CREATE PROCEDURE Test.spGetCustomerAddresses (
        @CountryRegion nvarchar(50) NULL,
        @StateProvince nvarchar(50) NULL
    )
    -- Customers by Country & State province
    AS
    BEGIN
    	SELECT CSL.CustomerId 
    	, CSL.AddressID
    	, CSL.Title
    	, CSL.FirstName 
    	, CSL.MiddleName 
    	, CSL.LastName 
    	, CSL.FullName
    	, CSL.CompanyName 
    	, CSL.SalesPerson 
    	, CSL.AddressType 
    	, CSL.City
    	, CSL.StateProvince 
    	, CSL.CountryRegion 
    	, CSL.PostalCode
    	FROM Test.vCustomerLocations CSL
    	WHERE CSL.CountryRegion = IsNull(@CountryRegion, CSL.CountryRegion)
    	AND CSL.StateProvince = IsNull(@StateProvince, CSL.StateProvince)
    END 
    
    -- retrieving all records
    EXEC Test.spGetCustomerAddresses NULL, NULL
    
    -- providing parameters
     EXEC Test.spGetCustomerAddresses 'United States', 'Utah'
    

    These steps can be repeated for each entity in scope.

    This separation between OLTP and OLAP is usually necessary given that business processes need a certain amount of time until they are correctly reflected as per reporting needs. Otherwise, the gaps can negatively impact the quality of the data used for reporting. For some reports these deviations might be acceptable, though there will probably also be (many) exceptions. Independently of the solution used, one still needs to make sure that the data are appropriate for the processes and reporting. 

    If no physical separation is needed between the two types of layers, one can remove the persisted tables from the logic and keep the objects as they are.

    Independently of which architecture is chosen, one shouldn't forget to validate one's assumptions concerning the data model (e.g. customers without addresses, address types, etc.), as in the example below.
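
    A quick check based on the vCustomerLocations view created above identifies the customers without a main office address:

    -- validating the data model: customers without a main office address
    SELECT CSL.CustomerId
    , CSL.CompanyName
    , CSL.SalesPerson
    FROM SalesLT.vCustomerLocations CSL
    WHERE CSL.AddressID IS NULL
    ORDER BY CSL.CompanyName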


    21 January 2025

    🧊🗒️Data Warehousing: Extract, Transform, Load (ETL) [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

    Last updated: 21-Jan-2025

    [Data Warehousing] Extract, Transform, Load (ETL)  

    • {def} automated process which takes raw data, extracts the data required for further processing, transforms it into a format that addresses business' needs, and loads it into the destination repository (e.g. data warehouse)
      • includes 
        • the transportation of data
        • overlaps between stages
        • changes in flow 
          • due to 
            • new technologies
            • changing requirements
        • changes in scope
        • troubleshooting
          • due to data mismatches
    • {step} extraction
      • data is extracted directly from the source systems or intermediate repositories
        • data may be made available in intermediate repositories when direct access to the source system is not possible
          • this approach can add an additional layer of complexity
      • {substep} data validation
        • an automated process that validates whether data pulled from sources has the expected values
        • relies on a validation engine
          • rejects data if it falls outside the validation rules
          • analyzes rejected records on an ongoing basis to
            • identify what went wrong
            • correct the source data
            • modify the extraction to resolve the problem in the next batches
    • {step} transform
      • transforms the data, removing extraneous or erroneous data
      • applies business rules 
      • checks data integrity
        • ensures that the data is not corrupted in the source or corrupted by ETL
        • may ensure no data was dropped in previous stages
      • aggregates the data if necessary
    • {step} load
      • {substep} store the data into a staging layer
        • transformed data are not loaded directly into the target but staged into an intermediate layer (e.g. database)
        • {advantage} makes it easier to roll back, if something went wrong
        • {advantage} allows to develop the logic iteratively and publish the data only when needed 
        • {advantage} can be used to generate audit reports for 
          • regulatory compliance 
          • diagnose and repair of data problems
        • modern ETL processes perform transformations in place, instead of in staging areas
      • {substep} publish the data to the target
        • loads the data into the target table(s)
        • {scenario} the existing data are overwritten every time the ETL pipeline loads a new batch (see the sketch after these notes)
          • this might happen daily, weekly, or monthly
        • {scenario} add new data without overwriting the existing data
          • the timestamp can indicate the data is new
        • {recommendation} prevent the loading process from erroring out due to disk space and performance limitations
    • {approach} building an ETL infrastructure
      • involves integrating data from one or more data sources and testing the overall processes to ensure the data is processed correctly
        • recurring process
          • e.g. data used for reporting
        • one-time process
          • e.g. data migration
      • may involve 
        • multiple source or destination systems 
        • different types of data
          • e.g. reference, master and transactional data
          • ⇐ may have complex dependencies
        • different level of structuredness
          • e.g. structured, semistructured, nonstructured 
        • different data formats
        • data of different quality
        • different ownership 
    • {recommendation} consider ETL best practices
      • {best practice} define requirements upfront in a consolidated and consistent manner
        • allows to set clear expectations, consolidate the requirements, estimate the effort and costs, respectively get the sign-off
        • the requirements may involve all the aspects of the process
          • e.g. data extraction, data transformation, standard formatting, etc.
      • {best practice} define a high level strategy
        • allows to define the road ahead, risks and other aspects 
        • allows to provide transparency
        • this may be part of a broader strategy that can be referenced 
      • {best practice} align the requirements and various aspects to the strategies existing in the organization
        • allows to consolidate the various efforts and make sure that the objectives, goals and requirements are aligned
        • e.g. IT, business, Information Security, Data Management strategies
      • {best practice} define the scope upfront
        • allows to better estimate the effort and validate the outcomes
        • even if the scope may change in time, this provides transparency and can be used as a basis for the time and cost estimations
      • {best practice} manage the effort as a project and use a suitable Project Management methodology
        • allows to apply structured well-established PM practices 
        • it might be suitable to adapt the methodology to the project's particularities 
      • {best practice} convert data to standard formats to standardize data processing
        • allows to reduce the volume of issues resulting from data type mismatches
        • applies mainly to dates, numeric or other values for which standard formats can be defined
      • {best practice} clean the data in the source systems, when cost-effective
        • allows to reduce the overall effort, especially when this is done in advance
        • this should be based ideally on the scope
      • {best practice} define and enforce data ownership
        • allows to enforce clear responsibilities across the various processes
        • allows to reduce the overall effort
      • {best practice} document data dependencies
        • document the dependencies existing in the data at the various levels
      • {best practice} log data movement from source(s) to destination(s) in terms of data volume
        • allows to provide transparency into the data movement process
        • allows to identify gaps in the data or broader issues
        • can be used for troubleshooting and understanding the overall data growth
    • {recommendation} consider proven systems, architectures and methodologies
      • allows to minimize the overall effort and costs associated with the process
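
    To make the publish scenarios above more tangible, below is a minimal T-SQL sketch of a full overwrite and of an append of only the new data; the Staging.Sales and dbo.Sales tables and their columns are hypothetical:

    -- {scenario} full reload: overwrite the existing data with the new batch (hypothetical tables)
    TRUNCATE TABLE dbo.Sales;

    INSERT INTO dbo.Sales (SalesOrderID, OrderDate, Amount, LoadTimestamp)
    SELECT STG.SalesOrderID
    , STG.OrderDate
    , STG.Amount
    , GetDate() LoadTimestamp
    FROM Staging.Sales STG;

    -- {scenario} incremental load: append only the records newer than the last published batch
    INSERT INTO dbo.Sales (SalesOrderID, OrderDate, Amount, LoadTimestamp)
    SELECT STG.SalesOrderID
    , STG.OrderDate
    , STG.Amount
    , GetDate() LoadTimestamp
    FROM Staging.Sales STG
    WHERE STG.OrderDate > IsNull((SELECT Max(TRG.OrderDate) FROM dbo.Sales TRG), '19000101');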

    10 November 2024

    🏭🗒️Microsoft Fabric: Warehouse [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 11-Mar-2024

    Warehouse vs SQL analytics endpoint in Microsoft Fabric [3]

    [Microsoft Fabric] Warehouse

    • {def} highly available relational data warehouse that can be used to store and query data in the Lakehouse
      • supports the full transactional T-SQL capabilities 
      • modernized version of the traditional data warehouse
      • unifies capabilities from Synapse Dedicated and Serverless SQL Pools
    • resources are managed elastically to provide the best possible performance
      • ⇒ no need to think about indexing or distribution
      • a new parser provides improved CSV file ingestion times
      • metadata is cached in addition to data
      • compute resources are assigned within milliseconds
      • multi-TB result sets are streamed to the client
    • leverages a distributed query processing engine
      • provides workloads with a natural isolation boundary [3]
        • true isolation is achieved by separating workloads with different characteristics, ensuring that ETL jobs never interfere with ad hoc analytics and reporting workloads [3]
    • {operation} data ingestion
      • involves moving data from source systems into the data warehouse [2]
        • the data becomes available for analysis [1]
      • via Pipelines, Dataflows, cross-database querying, or the COPY INTO command (see the sketches below)
      • no need to copy data from the lakehouse to the data warehouse [1]
        • one can query data in the lakehouse directly from the data warehouse using cross-database querying [1]
    • {operation} data storage
      • involves storing the data in a format that is optimized for analytics [2]
    • {operation} data processing
      • involves transforming the data into a format that is ready for consumption by analytical tools [1]
    • {operation} data analysis and delivery
      • involves analyzing the data to gain insights and delivering those insights to the business [1]
    • {operation} designing a warehouse (aka warehouse design)
      • standard warehouse design can be used
    • {operation} sharing a warehouse (aka warehouse sharing)
      • a way to provide users read access to the warehouse for downstream consumption
        • via SQL, Spark, or Power BI
      • the level of permissions can be customized to provide the appropriate level of access
    • {feature} mirroring 
      • provides a modern way of accessing and ingesting data continuously and seamlessly from any database or data warehouse into the Data Warehousing experience in Fabric
        • any database can be accessed and managed centrally from within Fabric without having to switch database clients
        • data is replicated in a reliable way in real-time and lands as Delta tables for consumption in any Fabric workload
    • {feature} v-order
      • write time optimization to the parquet file format that enables lightning-fast reads under the MF compute engine [5]
    • {feature} caching
      • stores frequently accessed data and metadata in a faster storage layer [6]
    • {concept} SQL analytics endpoint 
      • a warehouse that is automatically generated from a Lakehouse in Microsoft Fabric [3]
    • {concept} virtual warehouse
      • can contain data from virtually any source by using shortcuts [3]
    • {concept} cross database querying 
      • enables one to quickly and seamlessly leverage multiple data sources for fast insights and with zero data duplication [3] (see the sketches below)
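
      As an illustration, a cross-database query issued from the warehouse against a lakehouse table, respectively a COPY INTO ingestion, could look as follows (minimal sketches; the lakehouse, table and file names are hypothetical):

      -- cross-database query against a lakehouse table (hypothetical names)
      SELECT SLS.SalesOrderID
      , SLS.OrderDate
      , SLS.TotalDue
      FROM SalesLakehouse.dbo.SalesOrders SLS
      WHERE SLS.OrderDate >= '2024-01-01';

      -- ingesting a publicly accessible CSV file via COPY INTO (hypothetical URL and table)
      COPY INTO dbo.SalesOrders
      FROM 'https://storageaccount.blob.core.windows.net/data/sales_orders.csv'
      WITH (
          FILE_TYPE = 'CSV',
          FIRSTROW = 2
      );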

      References:
      [1] Microsoft Learn (2023) Fabric: Get started with data warehouses in Microsoft Fabric (link)
      [2] Microsoft Learn (2023) Fabric: Microsoft Fabric decision guide: choose a data store (link)
      [3] Microsoft Learn (2024) Fabric: What is data warehousing in Microsoft Fabric? (link)
      [4] Microsoft Learn (2023) Fabric: Better together: the lakehouse and warehouse (link)
      [5] Microsoft Learn (2024) Fabric: Understand V-Order for Microsoft Fabric Warehouse [link]
      [6] Microsoft Learn (2024) Fabric: Caching in Fabric data warehousing [link]

      Resources:
      [R1] Microsoft Learn (2023) Fabric: Data warehousing documentation in Microsoft Fabric (link)
      [R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
      [R3] Microsoft Learn (2025) Fabric: Share your data and manage permissions [link]

      Acronyms:
      ETL - Extract, Transform, Load
      MF - Microsoft Fabric

      13 February 2024

      🧭Business Intelligence: A One-Man Show (Part V: Focus on the Foundation)

      Business Intelligence Suite

      I tend to agree that one person can't do "everything in the data space" anymore, as Christopher Laubenthal put it in his article on the topic [1]. He seems to catch the essence of some of the core data roles found in organizations. Summarizing these roles, data architecture is about designing and building a data infrastructure, data engineering is about moving data, database administration is mainly about managing databases, data analysis is about assisting the business with data and reports, information design is about telling stories, while data science can be about studying the impact of various components on the data. 

      However, I find his analogy between a college's functional structure and the core data roles as poorly chosen from multiple perspectives, even if both are about building an infrastructure of some type. 

      Firstly, the two constructions have different foundations. Data exists in an organization even without data architects, data engineers or database administrators (DBAs)! It's enough to buy one or more information systems functioning as islands and reporting needs will arise. The need for a data architect might come when the systems need to be integrated or maybe when a data warehouse needs to be built, though many organizations are still in business without such constructs. For the others, the more complex the integrations, the bigger the need for a Data Architect. Conversely, some systems can be integrated by design and such capabilities might drive their selection.

      Data engineering is needed mainly in the context of the cloud, respectively of data lake-based architectures, where data needs to be moved, processed and prepared for consumption. Conversely, architectures like Microsoft Fabric minimize data movement, the focus being on data processing, the successive transformations the data needs to undergo in moving from the bronze to the gold layer, respectively in creating an organizational semantic data model. The complexity of the data processing depends on the data's structuredness, quality and other characteristics. 

      As I mentioned before, modern databases, including the ones in the cloud, reduce the need for DBAs to a considerable degree. Unless the volume of work is big enough to justify a DBA role as an in-house resource, organizations will more likely consider involving a service provider and contingent workers to cover the needs. 

      Having in-house one or more people acting under the Data Analyst role, people who know and understand the business, respectively the data tools used in the process, can go a long way. Moreover, it's helpful to have an evangelist-like resource in house, a person who is able to raise awareness and know-how, help diffuse knowledge about tools, techniques, data, results and best practices, respectively act as a mentor for the citizen Data Analysts. From my point of view, these are the people who form the data-related backbone (foundation) of an organization, and this is the minimum of what an organization should have!

      Once this is established, one can build data warehouses, data integrations and other support architectures, respectively think about BI and Data strategy, Data Governance, etc. Of course, having a Chief Data Officer and a Data Strategy in place can bring more structure to handling the topics at the various levels - strategic, tactical, respectively operational. In construction one starts with a blueprint, and a data strategy can have the same effect, if one knows how to write and implement it accordingly. However, the strategy is just a tool, while the data-knowledgeable workers are the foundation upon which organizations should build!

      "Build it and they will come" philosophy can work as well, though without knowledgeable and inquisitive people the philosophy has high chances to fail.


      Resources:
      [1] Christopher Laubenthal (2024) "Why One Person Can’t Do Everything In Data" (link)

      01 April 2021

      💎SQL Reloaded: Processing JSON Files with Flat Matrix Structure in SQL Server 2016+

      Besides the CSV format, many of the data files made available under open data initiatives are stored in JSON format, which makes the data more difficult to process, even if JSON offers a richer structure that goes beyond the tabular structure of CSV files. Fortunately, starting with SQL Server 2016, JSON became a native format, which makes the processing of JSON files relatively easy, the ease with which one can process the data depending on how it is structured.

      Let’s consider as example a JSON file with the world population per country and year that can be downloaded from DataHub (source). The structure behind it resembles a tabular model (see the table on the source website), having a flat structure. Just export the data to a file with the JSON extension (e.g. ‘population-figures-by-country.json’) locally (e.g. ‘D:/Data’). The next step is to understand the file’s structure. Some repositories provide good documentation in this respect, though there are also many exceptions. Having a JSON editor like Visual Studio, which reveals the structure, makes the process easier. 

      As in the case of CSV files, one needs to infer the data types. There are two alphanumeric fields (Country & Country Code), while the remaining fields are numeric. The only challenge raised by the data seems to be the difference in format for the years 2002 through 2015 in respect to the other years, as their values contain a decimal part. All the numeric values should have been whole values. 

      It’s recommended to start small and build the logic iteratively. Therefore, for the first step just look at the file’s content via the OPENROWSET function:

      -- looking at the JSON file 
      SELECT *
      FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
      

      In a second step one can add the OPENJSON function by looking only at the first record: 

      -- querying a json file (one record)
      SELECT *
      FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
           CROSS APPLY OPENJSON(BulkColumn,'$[0]')
      

      In a third step one can add a few columns (e.g. Country & Country Code) to make sure that the select statement works correctly. 

      -- querying a json file (all records, a few fields)
      SELECT Country 
      , CountryCode 
      FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
           CROSS APPLY OPENJSON(BulkColumn,'$')
       WITH ( 
        Country nvarchar(max) '$.Country'
      , CountryCode nvarchar(3) '$.Country_Code'
      ) AS DAT; 
      

      In a next step one can add all the columns and import the data into a table (e.g. dbo.CountryPopulation) on the fly: 

      -- importing a json file (all records) on the fly
      SELECT DAT.Country
      , DAT.CountryCode
      , DAT.Y1960
      , DAT.Y1961
      , DAT.Y1962
      , DAT.Y1963
      , DAT.Y1964
      , DAT.Y1965
      , DAT.Y1966
      , DAT.Y1967
      , DAT.Y1968
      , DAT.Y1969
      , DAT.Y1970
      , DAT.Y1971
      , DAT.Y1972
      , DAT.Y1973
      , DAT.Y1974
      , DAT.Y1975
      , DAT.Y1976
      , DAT.Y1977
      , DAT.Y1978
      , DAT.Y1979
      , DAT.Y1980
      , DAT.Y1981
      , DAT.Y1982
      , DAT.Y1983
      , DAT.Y1984
      , DAT.Y1985
      , DAT.Y1986
      , DAT.Y1987
      , DAT.Y1988
      , DAT.Y1989
      , DAT.Y1990
      , DAT.Y1991
      , DAT.Y1992
      , DAT.Y1993
      , DAT.Y1994
      , DAT.Y1995
      , DAT.Y1996
      , DAT.Y1997
      , DAT.Y1998
      , DAT.Y1999
      , DAT.Y2000
      , DAT.Y2001
      , Cast(DAT.Y2002 as bigint) Y2002
      , Cast(DAT.Y2003 as bigint) Y2003
      , Cast(DAT.Y2004 as bigint) Y2004
      , Cast(DAT.Y2005 as bigint) Y2005
      , Cast(DAT.Y2006 as bigint) Y2006
      , Cast(DAT.Y2007 as bigint) Y2007
      , Cast(DAT.Y2008 as bigint) Y2008
      , Cast(DAT.Y2009 as bigint) Y2009
      , Cast(DAT.Y2010 as bigint) Y2010
      , Cast(DAT.Y2011 as bigint) Y2011
      , Cast(DAT.Y2012 as bigint) Y2012
      , Cast(DAT.Y2013 as bigint) Y2013
      , Cast(DAT.Y2014 as bigint) Y2014
      , Cast(DAT.Y2015 as bigint) Y2015
      , DAT.Y2016
      INTO dbo.CountryPopulation
      FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
           CROSS APPLY OPENJSON(BulkColumn,'$')
       WITH ( 
        Country nvarchar(max) '$.Country'
      , CountryCode nvarchar(3) '$.Country_Code'
      , Y1960 bigint '$.Year_1960'
      , Y1961 bigint '$.Year_1961'
      , Y1962 bigint '$.Year_1962'
      , Y1963 bigint '$.Year_1963'
      , Y1964 bigint '$.Year_1964'
      , Y1965 bigint '$.Year_1965'
      , Y1966 bigint '$.Year_1966'
      , Y1967 bigint '$.Year_1967'
      , Y1968 bigint '$.Year_1968'
      , Y1969 bigint '$.Year_1969'
      , Y1970 bigint '$.Year_1970'
      , Y1971 bigint '$.Year_1971'
      , Y1972 bigint '$.Year_1972'
      , Y1973 bigint '$.Year_1973'
      , Y1974 bigint '$.Year_1974'
      , Y1975 bigint '$.Year_1975'
      , Y1976 bigint '$.Year_1976'
      , Y1977 bigint '$.Year_1977'
      , Y1978 bigint '$.Year_1978'
      , Y1979 bigint '$.Year_1979'
      , Y1980 bigint '$.Year_1980'
      , Y1981 bigint '$.Year_1981'
      , Y1982 bigint '$.Year_1982'
      , Y1983 bigint '$.Year_1983'
      , Y1984 bigint '$.Year_1984'
      , Y1985 bigint '$.Year_1985'
      , Y1986 bigint '$.Year_1986'
      , Y1987 bigint '$.Year_1987'
      , Y1988 bigint '$.Year_1988'
      , Y1989 bigint '$.Year_1989'
      , Y1990 bigint '$.Year_1990'
      , Y1991 bigint '$.Year_1991'
      , Y1992 bigint '$.Year_1992'
      , Y1993 bigint '$.Year_1993'
      , Y1994 bigint '$.Year_1994'
      , Y1995 bigint '$.Year_1995'
      , Y1996 bigint '$.Year_1996'
      , Y1997 bigint '$.Year_1997'
      , Y1998 bigint '$.Year_1998'
      , Y1999 bigint '$.Year_1999'
      , Y2000 bigint '$.Year_2000'
      , Y2001 bigint '$.Year_2001'
      , Y2002 decimal(19,1) '$.Year_2002'
      , Y2003 decimal(19,1) '$.Year_2003'
      , Y2004 decimal(19,1) '$.Year_2004'
      , Y2005 decimal(19,1) '$.Year_2005'
      , Y2006 decimal(19,1) '$.Year_2006'
      , Y2007 decimal(19,1) '$.Year_2007'
      , Y2008 decimal(19,1) '$.Year_2008'
      , Y2009 decimal(19,1) '$.Year_2009'
      , Y2010 decimal(19,1) '$.Year_2010'
      , Y2011 decimal(19,1) '$.Year_2011'
      , Y2012 decimal(19,1) '$.Year_2012'
      , Y2013 decimal(19,1) '$.Year_2013'
      , Y2014 decimal(19,1) '$.Year_2014'
      , Y2015 decimal(19,1) '$.Year_2015'
      , Y2016 bigint '$.Year_2016'
      ) AS DAT; 
      

      As can be seen, the decimal values were converted to bigint to keep the same definition across the columns. Moreover, this eases data processing later, as no additional (implicit) conversions are necessary. 

      Also, the columns’ names were changed either for simplification/convenience or simply for taste. 

      Writing such a monster query can be time-consuming, though preparing the metadata in Excel can decrease the effort considerably. With copy-paste and a few tricks (e.g. replacing values, splitting columns based on a delimiter) one can easily prepare such a structure:

      Source field | Target field | DataType | Value | Import Clause | Select Clause
      Country | Country | nvarchar(max) | Yemen Rep. | , Country nvarchar(max) '$.Country' | , DAT.Country
      Country_Code | CountryCode | nvarchar(3) | YEM | , CountryCode nvarchar(3) '$.Country_Code' | , DAT.CountryCode
      Year_1960 | Y1960 | bigint | 5172135 | , Y1960 bigint '$.Year_1960' | , DAT.Y1960
      Year_1961 | Y1961 | bigint | 5260501 | , Y1961 bigint '$.Year_1961' | , DAT.Y1961
      Year_1962 | Y1962 | bigint | 5351799 | , Y1962 bigint '$.Year_1962' | , DAT.Y1962
      Year_1963 | Y1963 | bigint | 5446063 | , Y1963 bigint '$.Year_1963' | , DAT.Y1963
      Year_1964 | Y1964 | bigint | 5543339 | , Y1964 bigint '$.Year_1964' | , DAT.Y1964
      Year_1965 | Y1965 | bigint | 5643643 | , Y1965 bigint '$.Year_1965' | , DAT.Y1965
      Year_1966 | Y1966 | bigint | 5748588 | , Y1966 bigint '$.Year_1966' | , DAT.Y1966
      Year_1967 | Y1967 | bigint | 5858638 | , Y1967 bigint '$.Year_1967' | , DAT.Y1967
      Year_1968 | Y1968 | bigint | 5971407 | , Y1968 bigint '$.Year_1968' | , DAT.Y1968
      Year_1969 | Y1969 | bigint | 6083619 | , Y1969 bigint '$.Year_1969' | , DAT.Y1969
      Year_1970 | Y1970 | bigint | 6193810 | , Y1970 bigint '$.Year_1970' | , DAT.Y1970
      Year_1971 | Y1971 | bigint | 6300554 | , Y1971 bigint '$.Year_1971' | , DAT.Y1971
      Year_1972 | Y1972 | bigint | 6407295 | , Y1972 bigint '$.Year_1972' | , DAT.Y1972
      Year_1973 | Y1973 | bigint | 6523452 | , Y1973 bigint '$.Year_1973' | , DAT.Y1973
      Year_1974 | Y1974 | bigint | 6661566 | , Y1974 bigint '$.Year_1974' | , DAT.Y1974
      Year_1975 | Y1975 | bigint | 6830692 | , Y1975 bigint '$.Year_1975' | , DAT.Y1975
      Year_1976 | Y1976 | bigint | 7034868 | , Y1976 bigint '$.Year_1976' | , DAT.Y1976
      Year_1977 | Y1977 | bigint | 7271872 | , Y1977 bigint '$.Year_1977' | , DAT.Y1977
      Year_1978 | Y1978 | bigint | 7536764 | , Y1978 bigint '$.Year_1978' | , DAT.Y1978
      Year_1979 | Y1979 | bigint | 7821552 | , Y1979 bigint '$.Year_1979' | , DAT.Y1979
      Year_1980 | Y1980 | bigint | 8120497 | , Y1980 bigint '$.Year_1980' | , DAT.Y1980
      Year_1981 | Y1981 | bigint | 8434017 | , Y1981 bigint '$.Year_1981' | , DAT.Y1981
      Year_1982 | Y1982 | bigint | 8764621 | , Y1982 bigint '$.Year_1982' | , DAT.Y1982
      Year_1983 | Y1983 | bigint | 9111097 | , Y1983 bigint '$.Year_1983' | , DAT.Y1983
      Year_1984 | Y1984 | bigint | 9472170 | , Y1984 bigint '$.Year_1984' | , DAT.Y1984
      Year_1985 | Y1985 | bigint | 9847899 | , Y1985 bigint '$.Year_1985' | , DAT.Y1985
      Year_1986 | Y1986 | bigint | 10232733 | , Y1986 bigint '$.Year_1986' | , DAT.Y1986
      Year_1987 | Y1987 | bigint | 10628585 | , Y1987 bigint '$.Year_1987' | , DAT.Y1987
      Year_1988 | Y1988 | bigint | 11051504 | , Y1988 bigint '$.Year_1988' | , DAT.Y1988
      Year_1989 | Y1989 | bigint | 11523267 | , Y1989 bigint '$.Year_1989' | , DAT.Y1989
      Year_1990 | Y1990 | bigint | 12057039 | , Y1990 bigint '$.Year_1990' | , DAT.Y1990
      Year_1991 | Y1991 | bigint | 12661614 | , Y1991 bigint '$.Year_1991' | , DAT.Y1991
      Year_1992 | Y1992 | bigint | 13325583 | , Y1992 bigint '$.Year_1992' | , DAT.Y1992
      Year_1993 | Y1993 | bigint | 14017239 | , Y1993 bigint '$.Year_1993' | , DAT.Y1993
      Year_1994 | Y1994 | bigint | 14692686 | , Y1994 bigint '$.Year_1994' | , DAT.Y1994
      Year_1995 | Y1995 | bigint | 15320653 | , Y1995 bigint '$.Year_1995' | , DAT.Y1995
      Year_1996 | Y1996 | bigint | 15889449 | , Y1996 bigint '$.Year_1996' | , DAT.Y1996
      Year_1997 | Y1997 | bigint | 16408954 | , Y1997 bigint '$.Year_1997' | , DAT.Y1997
      Year_1998 | Y1998 | bigint | 16896210 | , Y1998 bigint '$.Year_1998' | , DAT.Y1998
      Year_1999 | Y1999 | bigint | 17378098 | , Y1999 bigint '$.Year_1999' | , DAT.Y1999
      Year_2000 | Y2000 | bigint | 17874725 | , Y2000 bigint '$.Year_2000' | , DAT.Y2000
      Year_2001 | Y2001 | bigint | 18390135 | , Y2001 bigint '$.Year_2001' | , DAT.Y2001
      Year_2002 | Y2002 | decimal(19,1) | 18919179.0 | , Y2002 decimal(19,1) '$.Year_2002' | , Cast(DAT.Y2002 as bigint) Y2002
      Year_2003 | Y2003 | decimal(19,1) | 19462086.0 | , Y2003 decimal(19,1) '$.Year_2003' | , Cast(DAT.Y2003 as bigint) Y2003
      Year_2004 | Y2004 | decimal(19,1) | 20017068.0 | , Y2004 decimal(19,1) '$.Year_2004' | , Cast(DAT.Y2004 as bigint) Y2004
      Year_2005 | Y2005 | decimal(19,1) | 20582927.0 | , Y2005 decimal(19,1) '$.Year_2005' | , Cast(DAT.Y2005 as bigint) Y2005
      Year_2006 | Y2006 | decimal(19,1) | 21160534.0 | , Y2006 decimal(19,1) '$.Year_2006' | , Cast(DAT.Y2006 as bigint) Y2006
      Year_2007 | Y2007 | decimal(19,1) | 21751605.0 | , Y2007 decimal(19,1) '$.Year_2007' | , Cast(DAT.Y2007 as bigint) Y2007
      Year_2008 | Y2008 | decimal(19,1) | 22356391.0 | , Y2008 decimal(19,1) '$.Year_2008' | , Cast(DAT.Y2008 as bigint) Y2008
      Year_2009 | Y2009 | decimal(19,1) | 22974929.0 | , Y2009 decimal(19,1) '$.Year_2009' | , Cast(DAT.Y2009 as bigint) Y2009
      Year_2010 | Y2010 | decimal(19,1) | 23606779.0 | , Y2010 decimal(19,1) '$.Year_2010' | , Cast(DAT.Y2010 as bigint) Y2010
      Year_2011 | Y2011 | decimal(19,1) | 24252206.0 | , Y2011 decimal(19,1) '$.Year_2011' | , Cast(DAT.Y2011 as bigint) Y2011
      Year_2012 | Y2012 | decimal(19,1) | 24909969.0 | , Y2012 decimal(19,1) '$.Year_2012' | , Cast(DAT.Y2012 as bigint) Y2012
      Year_2013 | Y2013 | decimal(19,1) | 25576322.0 | , Y2013 decimal(19,1) '$.Year_2013' | , Cast(DAT.Y2013 as bigint) Y2013
      Year_2014 | Y2014 | decimal(19,1) | 26246327.0 | , Y2014 decimal(19,1) '$.Year_2014' | , Cast(DAT.Y2014 as bigint) Y2014
      Year_2015 | Y2015 | decimal(19,1) | 26916207.0 | , Y2015 decimal(19,1) '$.Year_2015' | , Cast(DAT.Y2015 as bigint) Y2015
      Year_2016 | Y2016 | bigint | 27584213 | , Y2016 bigint '$.Year_2016' | , DAT.Y2016

      Based on this structure, one can add two further formulas in Excel to prepare the statements as above and then copy the fields (last two columns were generated using the below formulas): 

      =", " & TRIM(B2) & " " & C2 & " '$." & TRIM(A2) & "'" 
      =", DAT." & TRIM(B2)
      

      Consuming data in which the values are stored in a matrix structure can involve further challenges sometimes, even if this type of storage tends to save space. For example, adding the values for a new year would involve extending the table with one more column, while performing calculations between years would involve referencing each column in formulas. Therefore, transforming the data from a matrix to a normalized structure can have some benefit. This can be achieved by writing a query via the UNPIVOT operator:

      -- unpivoting the data 
      SELECT RES.Country
      , RES.CountryCode
      , Cast(Replace(RES.[Year], 'Y', '') as int) [Year]
      , RES.Population
      --INTO dbo.CountryPopulationPerYear
      FROM 
      ( -- basis data
      	SELECT Country
      	, CountryCode
      	, Y1960, Y1961, Y1962, Y1963, Y1964, Y1965, Y1966, Y1967, Y1968, Y1969
      	, Y1970, Y1971, Y1972, Y1973, Y1974, Y1975, Y1976, Y1977, Y1978, Y1979
      	, Y1980, Y1981, Y1982, Y1983, Y1984, Y1985, Y1986, Y1987, Y1988, Y1989
      	, Y1990, Y1991, Y1992, Y1993, Y1994, Y1995, Y1996, Y1997, Y1998, Y1999
      	, Y2000, Y2001, Y2002, Y2003, Y2004, Y2005, Y2006, Y2007, Y2008, Y2009
      	, Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016
      	FROM dbo.CountryPopulation
      ) DAT
      UNPIVOT  -- unpivot logic
         (Population FOR [Year] IN  (Y1960, Y1961, Y1962, Y1963, Y1964, Y1965, Y1966, Y1967, Y1968, Y1969
      , Y1970, Y1971, Y1972, Y1973, Y1974, Y1975, Y1976, Y1977, Y1978, Y1979
      , Y1980, Y1981, Y1982, Y1983, Y1984, Y1985, Y1986, Y1987, Y1988, Y1989
      , Y1990, Y1991, Y1992, Y1993, Y1994, Y1995, Y1996, Y1997, Y1998, Y1999
      , Y2000, Y2001, Y2002, Y2003, Y2004, Y2005, Y2006, Y2007, Y2008, Y2009
      , Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016)
      ) RES
      

      Also, this can be performed in two steps: first preparing the query, and in a final step inserting the data into a table (e.g. dbo.CountryPopulationPerYear) on the fly (re-execute the previous query after uncommenting the INTO clause to generate the table). 

      --reviewing the data 
      SELECT Country
      , CountryCode
      , AVG(Population) AveragePopulation
      , Max(Population) - Min(Population) RangePopulation
      FROM dbo.CountryPopulationPerYear
      WHERE [Year] BETWEEN 2010 AND 2019
      GROUP BY Country
      , CountryCode
      ORDER BY Country
      

      On the other hand, making comparisons between consecutive years is easier when using a matrix structure: 

      --reviewing the data 
      SELECT Country
      , CountryCode
      , Y2016
      , Y2010
      , Y2016-Y2010 [2016-2010]
      , Y2011-Y2010 [2011-2010]
      , Y2012-Y2011 [2012-2011]
      , Y2013-Y2012 [2013-2012]
      , Y2014-Y2013 [2014-2013]
      , Y2015-Y2014 [2015-2014]
      , Y2016-Y2015 [2016-2015]
      FROM dbo.CountryPopulation
      ORDER BY Country
      

      Unless the storage space is a problem, in theory one can store the data in both formats as there can be requests which can benefit from one structure or the other. 

      Happy coding!