SQL Troubles

01 April 2021

💎SQL Reloaded: Processing JSON Files with Flat Matrix Structure in SQL Server 2016+

Besides the CSV format, many of the data files made available under open data initiatives are stored in JSON format, which makes the data more difficult to process, even if JSON offers a richer structure that goes beyond the tabular structure of CSV files. Fortunately, starting with SQL Server 2016, JSON is supported through built-in functions, which makes the processing of JSON files relatively easy, though the ease with which one can process the data depends on how they are structured.

Let’s consider as an example a JSON file with the world population per country and year that can be downloaded from DataHub (source). The underlying structure is flat and resembles a tabular model (see the table on the source website). Just export the data to a file with the JSON extension (e.g. ‘population-figures-by-country.json’) locally (e.g. ‘D:\data’). The next step is to understand the file’s structure. Some repositories provide good documentation in this respect, though there are also many exceptions. Having a JSON editor like Visual Studio, which reveals the structure, makes the process easier.

As in the case of CSV files, one needs to infer the data types. There are two alphanumeric fields (Country & Country Code), while the remaining fields are numeric. The only challenge raised by the data seems to be the difference in format for the years 2002 to 2015 compared with the other years, as the values for these years contain one decimal place (e.g. 18919179.0), although all the numeric values should have been whole numbers.

It’s recommended to start small and build the logic iteratively. Therefore, as a first step, just look at the file’s content via the OPENROWSET function:

-- looking at the JSON file 
SELECT *
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile

In a second step one can add the OPENJSON function by looking only at the first record:

-- querying a json file (one record)
SELECT *
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$[0]')
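
To experiment with OPENJSON without having the file at hand, the same call can be tried against an inline JSON value. The following is only a minimal sketch: the @json variable and the shortened two-record sample are assumptions made for illustration (the second record is invented), and the database needs compatibility level 130 or higher for OPENJSON to be available.

```sql
-- hypothetical, shortened sample mimicking the structure of the file
DECLARE @json nvarchar(max) = N'[
  {"Country": "Yemen Rep.", "Country_Code": "YEM", "Year_1960": 5172135, "Year_2002": 18919179.0},
  {"Country": "Sample Country", "Country_Code": "XXX", "Year_1960": 100, "Year_2002": 200.0}
]';

-- without a WITH clause OPENJSON returns one row per property
-- with the columns [key], [value] and [type] (1 = string, 2 = number)
SELECT [key]
, [value]
, [type]
FROM OPENJSON(@json, '$[0]');
```

Looking at the [type] column is a quick way to check which fields are strings and which are numbers before writing the WITH clause.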

In a third step one can add a few columns (e.g. Country & Country Code) to make sure that the select statement works correctly.

-- querying a json file (all records, a few fields)
SELECT Country 
, CountryCode 
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$')
 WITH ( 
  Country nvarchar(max) '$.Country'
, CountryCode nvarchar(3) '$.Country_Code'
) AS DAT;

In the next step one can add all the columns and import the data into a table (e.g. dbo.CountryPopulation) on the fly:

-- importing a json file (all records) on the fly
SELECT DAT.Country
, DAT.CountryCode
, DAT.Y1960
, DAT.Y1961
, DAT.Y1962
, DAT.Y1963
, DAT.Y1964
, DAT.Y1965
, DAT.Y1966
, DAT.Y1967
, DAT.Y1968
, DAT.Y1969
, DAT.Y1970
, DAT.Y1971
, DAT.Y1972
, DAT.Y1973
, DAT.Y1974
, DAT.Y1975
, DAT.Y1976
, DAT.Y1977
, DAT.Y1978
, DAT.Y1979
, DAT.Y1980
, DAT.Y1981
, DAT.Y1982
, DAT.Y1983
, DAT.Y1984
, DAT.Y1985
, DAT.Y1986
, DAT.Y1987
, DAT.Y1988
, DAT.Y1989
, DAT.Y1990
, DAT.Y1991
, DAT.Y1992
, DAT.Y1993
, DAT.Y1994
, DAT.Y1995
, DAT.Y1996
, DAT.Y1997
, DAT.Y1998
, DAT.Y1999
, DAT.Y2000
, DAT.Y2001
, Cast(DAT.Y2002 as bigint) Y2002
, Cast(DAT.Y2003 as bigint) Y2003
, Cast(DAT.Y2004 as bigint) Y2004
, Cast(DAT.Y2005 as bigint) Y2005
, Cast(DAT.Y2006 as bigint) Y2006
, Cast(DAT.Y2007 as bigint) Y2007
, Cast(DAT.Y2008 as bigint) Y2008
, Cast(DAT.Y2009 as bigint) Y2009
, Cast(DAT.Y2010 as bigint) Y2010
, Cast(DAT.Y2011 as bigint) Y2011
, Cast(DAT.Y2012 as bigint) Y2012
, Cast(DAT.Y2013 as bigint) Y2013
, Cast(DAT.Y2014 as bigint) Y2014
, Cast(DAT.Y2015 as bigint) Y2015
, DAT.Y2016
INTO dbo.CountryPopulation
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$')
 WITH ( 
  Country nvarchar(max) '$.Country'
, CountryCode nvarchar(3) '$.Country_Code'
, Y1960 bigint '$.Year_1960'
, Y1961 bigint '$.Year_1961'
, Y1962 bigint '$.Year_1962'
, Y1963 bigint '$.Year_1963'
, Y1964 bigint '$.Year_1964'
, Y1965 bigint '$.Year_1965'
, Y1966 bigint '$.Year_1966'
, Y1967 bigint '$.Year_1967'
, Y1968 bigint '$.Year_1968'
, Y1969 bigint '$.Year_1969'
, Y1970 bigint '$.Year_1970'
, Y1971 bigint '$.Year_1971'
, Y1972 bigint '$.Year_1972'
, Y1973 bigint '$.Year_1973'
, Y1974 bigint '$.Year_1974'
, Y1975 bigint '$.Year_1975'
, Y1976 bigint '$.Year_1976'
, Y1977 bigint '$.Year_1977'
, Y1978 bigint '$.Year_1978'
, Y1979 bigint '$.Year_1979'
, Y1980 bigint '$.Year_1980'
, Y1981 bigint '$.Year_1981'
, Y1982 bigint '$.Year_1982'
, Y1983 bigint '$.Year_1983'
, Y1984 bigint '$.Year_1984'
, Y1985 bigint '$.Year_1985'
, Y1986 bigint '$.Year_1986'
, Y1987 bigint '$.Year_1987'
, Y1988 bigint '$.Year_1988'
, Y1989 bigint '$.Year_1989'
, Y1990 bigint '$.Year_1990'
, Y1991 bigint '$.Year_1991'
, Y1992 bigint '$.Year_1992'
, Y1993 bigint '$.Year_1993'
, Y1994 bigint '$.Year_1994'
, Y1995 bigint '$.Year_1995'
, Y1996 bigint '$.Year_1996'
, Y1997 bigint '$.Year_1997'
, Y1998 bigint '$.Year_1998'
, Y1999 bigint '$.Year_1999'
, Y2000 bigint '$.Year_2000'
, Y2001 bigint '$.Year_2001'
, Y2002 decimal(19,1) '$.Year_2002'
, Y2003 decimal(19,1) '$.Year_2003'
, Y2004 decimal(19,1) '$.Year_2004'
, Y2005 decimal(19,1) '$.Year_2005'
, Y2006 decimal(19,1) '$.Year_2006'
, Y2007 decimal(19,1) '$.Year_2007'
, Y2008 decimal(19,1) '$.Year_2008'
, Y2009 decimal(19,1) '$.Year_2009'
, Y2010 decimal(19,1) '$.Year_2010'
, Y2011 decimal(19,1) '$.Year_2011'
, Y2012 decimal(19,1) '$.Year_2012'
, Y2013 decimal(19,1) '$.Year_2013'
, Y2014 decimal(19,1) '$.Year_2014'
, Y2015 decimal(19,1) '$.Year_2015'
, Y2016 bigint '$.Year_2016'
) AS DAT;

As can be seen, the decimal values were cast to bigint so that all the year columns share the same definition. This also simplifies later processing, as no additional (implicit) conversions are necessary.
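
Before relying on the cast, it might be worth checking that the decimal part is indeed always zero, since casting a decimal to bigint truncates the fraction silently. A minimal sketch of such a check, assuming an inline sample in which the second record (code XXX, value 200.5) is invented to show a problematic value:

```sql
-- hypothetical sample: the second record carries a non-zero decimal part
DECLARE @json nvarchar(max) = N'[
  {"Country_Code": "YEM", "Year_2002": 18919179.0},
  {"Country_Code": "XXX", "Year_2002": 200.5}
]';

-- list only the records for which the cast to bigint would lose information
SELECT DAT.CountryCode
, DAT.Y2002
, Cast(DAT.Y2002 as bigint) Y2002Whole -- the fraction is truncated
FROM OPENJSON(@json, '$')
 WITH ( 
  CountryCode nvarchar(3) '$.Country_Code'
, Y2002 decimal(19,1) '$.Year_2002'
) AS DAT
WHERE DAT.Y2002 <> Floor(DAT.Y2002);
```

Against the real file the same WHERE condition would have to be repeated for each of the columns from Y2002 to Y2015; an empty result would indicate that the cast is safe.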

Also, the columns’ names were changed either for simplification/convenience or simply taste.

Writing such a monster query can be time-consuming, though preparing the metadata in Excel can considerably decrease the effort. With copy-paste and a few tricks (e.g. replacing values, splitting columns based on a delimiter) one can easily prepare such a structure:

Source field	Target field	DataType	Value	Import Clause	Select Clause
Country	Country	nvarchar(max)	Yemen Rep.	, Country nvarchar(max) '$.Country'	, DAT.Country
Country_Code	CountryCode	nvarchar(3)	YEM	, CountryCode nvarchar(3) '$.Country_Code'	, DAT.CountryCode
Year_1960	Y1960	bigint	5172135	, Y1960 bigint '$.Year_1960'	, DAT.Y1960
Year_1961	Y1961	bigint	5260501	, Y1961 bigint '$.Year_1961'	, DAT.Y1961
Year_1962	Y1962	bigint	5351799	, Y1962 bigint '$.Year_1962'	, DAT.Y1962
Year_1963	Y1963	bigint	5446063	, Y1963 bigint '$.Year_1963'	, DAT.Y1963
Year_1964	Y1964	bigint	5543339	, Y1964 bigint '$.Year_1964'	, DAT.Y1964
Year_1965	Y1965	bigint	5643643	, Y1965 bigint '$.Year_1965'	, DAT.Y1965
Year_1966	Y1966	bigint	5748588	, Y1966 bigint '$.Year_1966'	, DAT.Y1966
Year_1967	Y1967	bigint	5858638	, Y1967 bigint '$.Year_1967'	, DAT.Y1967
Year_1968	Y1968	bigint	5971407	, Y1968 bigint '$.Year_1968'	, DAT.Y1968
Year_1969	Y1969	bigint	6083619	, Y1969 bigint '$.Year_1969'	, DAT.Y1969
Year_1970	Y1970	bigint	6193810	, Y1970 bigint '$.Year_1970'	, DAT.Y1970
Year_1971	Y1971	bigint	6300554	, Y1971 bigint '$.Year_1971'	, DAT.Y1971
Year_1972	Y1972	bigint	6407295	, Y1972 bigint '$.Year_1972'	, DAT.Y1972
Year_1973	Y1973	bigint	6523452	, Y1973 bigint '$.Year_1973'	, DAT.Y1973
Year_1974	Y1974	bigint	6661566	, Y1974 bigint '$.Year_1974'	, DAT.Y1974
Year_1975	Y1975	bigint	6830692	, Y1975 bigint '$.Year_1975'	, DAT.Y1975
Year_1976	Y1976	bigint	7034868	, Y1976 bigint '$.Year_1976'	, DAT.Y1976
Year_1977	Y1977	bigint	7271872	, Y1977 bigint '$.Year_1977'	, DAT.Y1977
Year_1978	Y1978	bigint	7536764	, Y1978 bigint '$.Year_1978'	, DAT.Y1978
Year_1979	Y1979	bigint	7821552	, Y1979 bigint '$.Year_1979'	, DAT.Y1979
Year_1980	Y1980	bigint	8120497	, Y1980 bigint '$.Year_1980'	, DAT.Y1980
Year_1981	Y1981	bigint	8434017	, Y1981 bigint '$.Year_1981'	, DAT.Y1981
Year_1982	Y1982	bigint	8764621	, Y1982 bigint '$.Year_1982'	, DAT.Y1982
Year_1983	Y1983	bigint	9111097	, Y1983 bigint '$.Year_1983'	, DAT.Y1983
Year_1984	Y1984	bigint	9472170	, Y1984 bigint '$.Year_1984'	, DAT.Y1984
Year_1985	Y1985	bigint	9847899	, Y1985 bigint '$.Year_1985'	, DAT.Y1985
Year_1986	Y1986	bigint	10232733	, Y1986 bigint '$.Year_1986'	, DAT.Y1986
Year_1987	Y1987	bigint	10628585	, Y1987 bigint '$.Year_1987'	, DAT.Y1987
Year_1988	Y1988	bigint	11051504	, Y1988 bigint '$.Year_1988'	, DAT.Y1988
Year_1989	Y1989	bigint	11523267	, Y1989 bigint '$.Year_1989'	, DAT.Y1989
Year_1990	Y1990	bigint	12057039	, Y1990 bigint '$.Year_1990'	, DAT.Y1990
Year_1991	Y1991	bigint	12661614	, Y1991 bigint '$.Year_1991'	, DAT.Y1991
Year_1992	Y1992	bigint	13325583	, Y1992 bigint '$.Year_1992'	, DAT.Y1992
Year_1993	Y1993	bigint	14017239	, Y1993 bigint '$.Year_1993'	, DAT.Y1993
Year_1994	Y1994	bigint	14692686	, Y1994 bigint '$.Year_1994'	, DAT.Y1994
Year_1995	Y1995	bigint	15320653	, Y1995 bigint '$.Year_1995'	, DAT.Y1995
Year_1996	Y1996	bigint	15889449	, Y1996 bigint '$.Year_1996'	, DAT.Y1996
Year_1997	Y1997	bigint	16408954	, Y1997 bigint '$.Year_1997'	, DAT.Y1997
Year_1998	Y1998	bigint	16896210	, Y1998 bigint '$.Year_1998'	, DAT.Y1998
Year_1999	Y1999	bigint	17378098	, Y1999 bigint '$.Year_1999'	, DAT.Y1999
Year_2000	Y2000	bigint	17874725	, Y2000 bigint '$.Year_2000'	, DAT.Y2000
Year_2001	Y2001	bigint	18390135	, Y2001 bigint '$.Year_2001'	, DAT.Y2001
Year_2002	Y2002	decimal(19,1)	18919179.0	, Y2002 decimal(19,1) '$.Year_2002'	, Cast(DAT.Y2002 as bigint) Y2002
Year_2003	Y2003	decimal(19,1)	19462086.0	, Y2003 decimal(19,1) '$.Year_2003'	, Cast(DAT.Y2003 as bigint) Y2003
Year_2004	Y2004	decimal(19,1)	20017068.0	, Y2004 decimal(19,1) '$.Year_2004'	, Cast(DAT.Y2004 as bigint) Y2004
Year_2005	Y2005	decimal(19,1)	20582927.0	, Y2005 decimal(19,1) '$.Year_2005'	, Cast(DAT.Y2005 as bigint) Y2005
Year_2006	Y2006	decimal(19,1)	21160534.0	, Y2006 decimal(19,1) '$.Year_2006'	, Cast(DAT.Y2006 as bigint) Y2006
Year_2007	Y2007	decimal(19,1)	21751605.0	, Y2007 decimal(19,1) '$.Year_2007'	, Cast(DAT.Y2007 as bigint) Y2007
Year_2008	Y2008	decimal(19,1)	22356391.0	, Y2008 decimal(19,1) '$.Year_2008'	, Cast(DAT.Y2008 as bigint) Y2008
Year_2009	Y2009	decimal(19,1)	22974929.0	, Y2009 decimal(19,1) '$.Year_2009'	, Cast(DAT.Y2009 as bigint) Y2009
Year_2010	Y2010	decimal(19,1)	23606779.0	, Y2010 decimal(19,1) '$.Year_2010'	, Cast(DAT.Y2010 as bigint) Y2010
Year_2011	Y2011	decimal(19,1)	24252206.0	, Y2011 decimal(19,1) '$.Year_2011'	, Cast(DAT.Y2011 as bigint) Y2011
Year_2012	Y2012	decimal(19,1)	24909969.0	, Y2012 decimal(19,1) '$.Year_2012'	, Cast(DAT.Y2012 as bigint) Y2012
Year_2013	Y2013	decimal(19,1)	25576322.0	, Y2013 decimal(19,1) '$.Year_2013'	, Cast(DAT.Y2013 as bigint) Y2013
Year_2014	Y2014	decimal(19,1)	26246327.0	, Y2014 decimal(19,1) '$.Year_2014'	, Cast(DAT.Y2014 as bigint) Y2014
Year_2015	Y2015	decimal(19,1)	26916207.0	, Y2015 decimal(19,1) '$.Year_2015'	, Cast(DAT.Y2015 as bigint) Y2015
Year_2016	Y2016	bigint	27584213	, Y2016 bigint '$.Year_2016'	, DAT.Y2016

Based on this structure, one can add two further formulas in Excel to prepare the statements as above and then copy the fields (last two columns were generated using the below formulas):

=", " & TRIM(B2) & " " & C2 & " '$." & TRIM(A2) & "'" 
=", DAT." & TRIM(B2)
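
As an alternative to Excel, similar clauses could be generated directly in T-SQL from the first record of the document. This is only a sketch under several assumptions: the inline @json sample is shortened, the data types are inferred from the first record alone (a string becomes nvarchar(max), a number containing a decimal point becomes decimal(19,1), anything else bigint), and the target names are derived by simple replacements. The output still needs a manual review, e.g. for shortening nvarchar(max) to nvarchar(3) or for wrapping the decimal columns into a Cast.

```sql
-- hypothetical, shortened sample mimicking the structure of the file
DECLARE @json nvarchar(max) = N'[
  {"Country": "Yemen Rep.", "Country_Code": "YEM", "Year_1960": 5172135, "Year_2002": 18919179.0}
]';

-- generating the import and select clauses from the first record
SELECT SRC.[key] SourceField
, TGT.TargetField
, TGT.DataType
, Concat(', ', TGT.TargetField, ' ', TGT.DataType, ' ''$.', SRC.[key], '''') ImportClause
, Concat(', DAT.', TGT.TargetField) SelectClause
FROM OPENJSON(@json, '$[0]') SRC
     CROSS APPLY ( -- derived metadata
       SELECT Replace(Replace(SRC.[key], 'Year_', 'Y'), '_', '') TargetField
       , CASE 
           WHEN SRC.[type] = 1 THEN 'nvarchar(max)' -- JSON string
           WHEN SRC.[value] LIKE '%.%' THEN 'decimal(19,1)' -- number with decimals
           ELSE 'bigint' 
         END DataType
     ) TGT;
```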

Consuming data in which the values are stored in a matrix structure can sometimes involve further challenges, even if this type of storage tends to save space. For example, adding the values for a new year would involve extending the table with one more column, while performing calculations between years would involve referencing each column in formulas. Therefore, transforming the data from a matrix to a normalized structure can be beneficial. This can be achieved by writing a query via the UNPIVOT operator:

-- unpivoting the data 
SELECT RES.Country
, RES.CountryCode
, Cast(Replace(RES.[Year], 'Y', '') as int) [Year]
, RES.Population
--INTO dbo.CountryPopulationPerYear
FROM 
( -- basis data
	SELECT Country
	, CountryCode
	, Y1960, Y1961, Y1962, Y1963, Y1964, Y1965, Y1966, Y1967, Y1968, Y1969
	, Y1970, Y1971, Y1972, Y1973, Y1974, Y1975, Y1976, Y1977, Y1978, Y1979
	, Y1980, Y1981, Y1982, Y1983, Y1984, Y1985, Y1986, Y1987, Y1988, Y1989
	, Y1990, Y1991, Y1992, Y1993, Y1994, Y1995, Y1996, Y1997, Y1998, Y1999
	, Y2000, Y2001, Y2002, Y2003, Y2004, Y2005, Y2006, Y2007, Y2008, Y2009
	, Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016
	FROM dbo.CountryPopulation
) DAT
UNPIVOT  -- unpivot logic
   (Population FOR [Year] IN  (Y1960, Y1961, Y1962, Y1963, Y1964, Y1965, Y1966, Y1967, Y1968, Y1969
, Y1970, Y1971, Y1972, Y1973, Y1974, Y1975, Y1976, Y1977, Y1978, Y1979
, Y1980, Y1981, Y1982, Y1983, Y1984, Y1985, Y1986, Y1987, Y1988, Y1989
, Y1990, Y1991, Y1992, Y1993, Y1994, Y1995, Y1996, Y1997, Y1998, Y1999
, Y2000, Y2001, Y2002, Y2003, Y2004, Y2005, Y2006, Y2007, Y2008, Y2009
, Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016)
) RES
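
When the normalized structure is the only target, the year columns don't necessarily have to be enumerated: a second OPENJSON call without a WITH clause returns each property of a record as a key/value pair. The following is a sketch of this alternative, not the approach used above; the inline @json sample is a shortened assumption (the second record is invented), and Try_Cast is used so that non-numeric properties can't raise conversion errors.

```sql
-- hypothetical, shortened sample mimicking the structure of the file
DECLARE @json nvarchar(max) = N'[
  {"Country": "Yemen Rep.", "Country_Code": "YEM", "Year_1960": 5172135, "Year_2002": 18919179.0},
  {"Country": "Sample Country", "Country_Code": "XXX", "Year_1960": 100, "Year_2002": 200.0}
]';

-- unpivoting the data directly from JSON
SELECT JSON_VALUE(REC.[value], '$.Country') Country
, JSON_VALUE(REC.[value], '$.Country_Code') CountryCode
, Try_Cast(Replace(FLD.[key], 'Year_', '') as int) [Year]
, Try_Cast(Try_Cast(FLD.[value] as decimal(19,1)) as bigint) Population
FROM OPENJSON(@json, '$') REC -- one row per record
     CROSS APPLY OPENJSON(REC.[value]) FLD -- one row per property
WHERE FLD.[key] LIKE 'Year[_]%' -- only the year properties
  AND FLD.[type] = 2; -- only numeric values (nulls are skipped)
```

The trade-off is that the data types and the column list are no longer explicit, which makes unexpected properties in the source harder to spot.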

This can also be performed in two steps: first prepare and review the query, then materialize the data into a table (e.g. dbo.CountryPopulationPerYear) on the fly by re-executing the previous query after uncommenting the INTO clause. Once the table exists, the data can be reviewed, for example by aggregating the values per country:

--reviewing the data 
SELECT Country
, CountryCode
, AVG(Population) AveragePopulation
, Max(Population) - Min(Population) RangePopulation
FROM dbo.CountryPopulationPerYear
WHERE [Year] BETWEEN 2010 AND 2019
GROUP BY Country
, CountryCode
ORDER BY Country

On the other hand, making comparisons between consecutive years is easier when using a matrix structure:

--reviewing the data 
SELECT Country
, CountryCode
, Y2016
, Y2010
, Y2016-Y2010 [2016-2010]
, Y2011-Y2010 [2011-2010]
, Y2012-Y2011 [2012-2011]
, Y2013-Y2012 [2013-2012]
, Y2014-Y2013 [2014-2013]
, Y2015-Y2014 [2015-2014]
, Y2016-Y2015 [2016-2015]
FROM dbo.CountryPopulation
ORDER BY Country
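
For completeness, the normalized structure can support such comparisons as well, via window functions like LAG (available since SQL Server 2012). A minimal sketch, in which an inline VALUES list with three of Yemen's values from the table above stands in for dbo.CountryPopulationPerYear:

```sql
-- differences between consecutive years on the normalized structure
SELECT DAT.CountryCode
, DAT.[Year]
, DAT.Population
, DAT.Population - LAG(DAT.Population) OVER (PARTITION BY DAT.CountryCode ORDER BY DAT.[Year]) DiffPreviousYear
FROM ( -- sample data standing in for dbo.CountryPopulationPerYear
  VALUES ('YEM', 2014, 26246327)
  , ('YEM', 2015, 26916207)
  , ('YEM', 2016, 27584213)
) DAT (CountryCode, [Year], Population)
ORDER BY DAT.CountryCode
, DAT.[Year];
```

The first year of each country has no predecessor, so its difference is NULL. Compared with the matrix version, the query doesn't change when further years are added, though each difference lands on a separate row.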

Unless the storage space is a problem, in theory one can store the data in both formats as there can be requests which can benefit from one structure or the other.

Happy coding!

29 March 2021

Notes: Team Data Science Process (TDSP)

Team Data Science Process (TDSP)

an agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently [1]
{goal} help customers fully realize the benefits of their analytics program [1]
{component} data science lifecycle definition
- {description} a framework to structure the development of data science projects [1]
- {goal} designed for data science projects that ship as part of intelligent applications that deploy ML & AI models for predictive analytics [1]
- {benefit} can be used in the context of other DM methodologies as they have common ground [1]
  - e.g. CRISP-DM, KDD
- {benefit} exploratory data science projects or improvised analytics projects can also benefit from using this process [1]
{component} standardized project structure
- {description} a directory structure that includes templates for project documents
  - ⇒makes it easy for team members to find information [1]
  - ⇐templates for the folder structure and required documents are provided in standard locations [1]
  - all code and documents are stored in an agile VCS tracking repository [1]
    - {recommendation} create a separate repository for each project on the VCS for versioning, information security, and collaboration [1]
- {benefit} organizes the code for the various activities [1]
- {benefit} allows tracking the progress [1]
- {benefit} provides a checklist with key questions for each project to guarantee process and deliverables’ quality [1]
- {benefit} enables team collaboration [1]
- {benefit} allows closer tracking of the code for individual features [1]
- {benefit} enables teams to obtain better cost estimates [1]
- {benefit} helps build institutional knowledge across the organization [1]
{component} recommended infrastructure
- {description} a set of recommendations for the infrastructure and resources needed for analytics and storage [1]
- {benefit} addresses cloud and/or on-premises requirements [1]
- {benefit} enables reproducible analysis [1]
- {benefit} avoids infrastructure duplication [1]
  - ⇒minimizes inconsistencies and unnecessary infrastructure costs [1]
- {tools} tools are provided to provision the shared resources, track them, and allow each team member to connect to those resources securely [1]
- {good practice} create a consistent compute environment [1]
  - ⇐allows team members to replicate and validate experiments [1]
{component} recommended tools and utilities
- {description} a set of recommendations for the tools and utilities needed for project’s execution [1]
- {benefit} help lower the barriers and increase the consistency of their adoption [1]
- {benefit} provides an initial set of tools and scripts to jump-start methodology’s adoption [1]
- {benefit} helps automate some of the common tasks in the data science lifecycle [1]
  - e.g. data exploration and baseline modeling [1]
- {benefit} well-defined structure provided for individuals to contribute shared tools and utilities into their team's shared code repository [1]
  - ⇐ resources can then be leveraged by other projects [1]
{phase} 1: business understanding
- {goal} define and document the business problem, its objectives, the needed attributes, and the metric(s) used to determine project’s success
- {goal} identify and document the relevant data sources
- {step} 1.1: define project’s objectives
  - elicit together with the stakeholders the requirements, define and document the problem and its objectives, respectively the metric(s) used to determine project’s success
    - requires a good understanding of the business processes, data and further characteristics
- {step} 1.2: identify data sources
  - identify the attributes and the data sources relevant to the problem under study
- {step} 1.3: define project plan and team*
  - develop a high-level milestone plan and identify the resources needed for executing it
- {tool} project charter
  - standard template that documents the business problem, the scope of the project, the business objectives and metric(s) used to determine project’s success
{phase} 2: data acquisition & understanding
- {goal} prepare the base dataset(s) as needed by the modeling phase into the target repository
- {goal} build the data ETL/ELT architecture and processes needed for provisioning the basis data
- {step} 2.1: ingest data
  - make the required data available for the team in the repository where the analytics operations take place
- {step} 2.2: explore data
  - understand data’s characteristics by leveraging specific tools (visualization, analysis)
  - prepare the data as needed for further processing
- {step} 2.3: set up pipelines
  - build the pipelines needed for data actualization and qualitative assessment [3]
  - set up a process to score new data or refresh the data regularly [3]
- {step} 2.4: feasibility analysis*
  - reevaluate the project to determine whether the value expected is sufficient to continue pursuing it
- {tool} data quality report
  - report that includes data summaries, data mappings, variable ranking, data qualitative assessment(s) and further information [3]
- {tool} solution architecture
  - diagram and/or textual-based description of the data pipeline(s), technical assumptions and further aspects
- {tool} data reports
  - document the structure and statistics of the raw data
- {tool} checkpoint decision
  - decision template document that
    - summarizes the findings of the feasibility analysis step
    - includes a set of choices and recommendations for the next steps
    - serves as basis for the decision on whether to continue or not the project, respectively what the next steps are
{phase} 3: modeling
- {goal} create a machine-learning model that addresses the prediction requirements and that's suitable for production
- {step} 3.1: feature engineering
  - the inclusion, aggregation, and transformation of raw variables to create the features used in the analysis [4]
    - ⇐requires a good understanding of how the features relate to each other and how the ML algorithms use those features [4]
- {step} 3.2: model selection*
  - choose one or more modeling algorithms that best address the problem’s characteristics
- {step} 3.3: model training
  - involves the following steps:
    - split the input data into training and test datasets
    - build the models by using the training dataset
    - evaluate the training and the test data set
    - determine the optimal setup and methods
- {step} 3.4: model evaluation
  - evaluate the performance of the model(s)
- {step} 3.5: feasibility analysis*
  - evaluate the readiness of the models for use in production, respectively whether they fulfill project’s objectives
- {tool} feature sets
  - describe the features developed for the modeling and how they were generated
  - contains pointers to the code used to generate the features
- {tool} model report
  - a standard, template-based report that provides details on each experiment’s outcomes
  - created for each model tried
- {tool} checkpoint decision
- {tool} model performance metrics
  - e.g. ROC curves or MSE
{phase} 4: deployment
- {goal} deploy the models and the data pipelines to the environment used for final user acceptance
- {step} 4.1: operationalize architecture
  - prepare the models and data pipelines for use in production
  - {best practice} expose the models over an open API interface
    - enables models’ consumption from various applications
  - {best practice} build telemetry and monitoring into the models and the data pipelines [5]
    - helps in monitoring and troubleshooting [5]
- {step} 4.2: deploy solution*
  - deploy the architecture into production
- {tool} status dashboard
  - displays data on system’s health and key metrics
- {tool} model report
  - the report in its final form with deployment information
- {tool} solution architecture
  - the document in its final form
{phase} 5: customer acceptance
- {goal} confirm that project’s objectives were fulfilled and get customer’s acceptance
- {step} 5.1: system validation
  - validate system’s performance and outcomes and confirm that it fulfills customer’s needs
- {step} 5.2: project signoff*
  - finalize and review documentation
  - hand over the solution and the related documentation to the customer
  - evaluate the project against the defined objectives and get customer’s signoff
- {tool} exit report
- {tool} technical report
  - contains all the details of the project that are useful for learning about how to operate the system [6]

Acronyms:

Artificial Intelligence (AI)

Cross-Industry Standard Process for Data Mining (CRISP-DM)

Data Mining (DM)

Knowledge Discovery in Databases (KDD)

Team Data Science Process (TDSP)

Version Control System (VCS)

Visual Studio Team Services (VSTS)

Resources:

[1] Microsoft Azure (2020) What is the Team Data Science Process? [source]

[2] Microsoft Azure (2020) The business understanding stage of the Team Data Science Process lifecycle [source]

[3] Microsoft Azure (2020) Data acquisition and understanding stage of the Team Data Science Process [source]

[4] Microsoft Azure (2020) Modeling stage of the Team Data Science Process lifecycle [source]

[5] Microsoft Azure (2020) Deployment stage of the Team Data Science Process lifecycle [source]

[6] Microsoft Azure (2020) Customer acceptance stage of the Team Data Science Process lifecycle [source]

21 March 2021

𖣯Strategic Management: The Impact of New Technologies (Part III: Checking the Vital Signs)

An organization that went through a major change, like the replacement of a strategic system (e.g. ERP/BI implementations), needs to go through a period of attentive supervision to address the inherent issues, which ideally should be handled as they arise to minimize their future effects. Some organizations might even go through a convalescence period, which risks prolonging itself if the appropriate remedies aren’t found. Therefore, one needs an entity that has the skills to recognize the symptoms, understand what’s happening and why, and identify the appropriate actions.

Given technologies’ multi-layered complexity and the volume of knowledge needed for understanding them, the role of the doctor can seldom be taken by one person. Moreover, the patient is an organization, each person in the organization usually having only local knowledge about the patient. The needed knowledge is dispersed through the organization, and one needs to tap into that knowledge, identify the people close to technologies and business areas, respectively allow such people to exchange information on a regular basis.

The people who should know the organization best are in theory the management, however they are usually too far away from technologies and often too busy with management topics. IT professionals are close to technologies, though sometimes too far away from the patient. The users have too narrow an overview, while for logistical and economic reasons the number of people involved should be kept to a minimum. A compromise is to designate one person from each business area who works with any of the strategic systems, and ensure that they have the technical and business knowledge required. It’s nothing but the key-user concept, though for it to work the key-users need not only knowledge but also the empowerment to act when the symptoms appear.

Big organizations also have a product owner for each application who supervises the application through its entire lifecycle, and who needs to coordinate with the IT, business and service providers. This is probably a good idea in order to ensure that the ROI is reached over time, respectively that the needs of the system are considered within the IT operation context. In small organizations, the role can be taken by a technical or a business resource with deeper skills than the average user, usually a key-user. However, unless joined with the key-user role, the product owner’s focus will be the product and seldom the business themes.

The issues that need to be overcome after major changes are usually cross-functional, being imperative for people to work together and find solutions. Unfortunately, it’s also in human nature to wait until the issues are big enough to get the proper attention. Unless the key-users have the time allocated already for such topics, the issues will be lost in the heap of operational and tactical activities. This time must be allocated for all key-users and the technical resources needed to support them.

Some organizations build temporary working parties (groups of experts working together to achieve specific goals) or similar groups. However, the status of such a group needs to be permanent if the organization wants to continuously keep its health in check and build the needed expertise and awareness about occurred or potential issues. Centers of excellence/expertise (CoE) or competency centers (CC) are such working groups with permanent status, having defined roles, responsibilities, and processes for supporting and promoting the effective use of technologies within the organization, respectively for monitoring and systematically addressing the risks and opportunities associated with them.

There’s also the null hypothesis, doing nothing, relying solely on employees’ professionalism, though without defined responsibility, accountability and empowerment, it can get messy.


𖣯Strategic Management: The Impact of New Technologies (Part II - The Technology-oriented Patient)

Looking at the way data, information and knowledge flow through an organization, with a little imagination one can see the resemblance between an organization and the human body, in which the networks created by the respective flows spread through organization as nervous, circulatory or lymphatic braids do, each with its own role in the good functioning of the organization. Each technology adopted by an organization taps into these flows creating a structure that can be compared with the nerve plexus, as the various flows intersect in such points creating an agglomeration of nerves and braids.

The size of each plexus can be considered as proportional to the importance of the technology in respect to the overall structure. Strategic technologies like ERP, BI or planning systems, given their importance (gravity), resemble the organs of the human body, with complex networks of braids in their vicinity. Maybe the metaphor is too far-off, though it allows stressing the importance of each technology in respect to its role and the good functioning of the organization. Moreover, each such structure functions as a pressure point that can in extremis block any of the flows considered, a long-term block having important effects.

The human organism is a marvelous piece of work reflecting the grand design, however in time, especially when neglected or driven by external agents, diseases can clutch around any of the parts of the human body, with all the consequences deriving from this. On the other side, an organization is a hand-made structure found in continuous expansion as new technologies or resources are added. Even if the technologies are at the periphery of the system, their good or bad functioning can have a ripple effect through the various networks.

Replacing any of the above-mentioned strategic systems can be compared with the replacement of an organ in the human body, having a high degree of failure compared with other operations, being complex in nature, the organism needing long periods to recover, while in extreme situations the convalescence prolongs till the end. Fortunately, organizations seem to be more resilient to such operations, though that’s not necessarily a rule. Sometimes all it takes is just a small mistake for making the operation fail.

The general feeling is that ERP and BI implementations are taken too lightly by management, employees and implementers. During the replacement operation one must make sure not only that the organ fits and functions as expected, but also that the vital networks regained their vitality and function as expected, and the latter is a process that spans over the years to come. One needs to check the important (health) signs regularly and take the appropriate countermeasures. There must be an entity having the role of the doctor, who/which has the skills to address adequately the issues.

Moreover, when the physical structure of an organization is affected, a series of micro-operations might be needed to address the deformities. Unfortunately, these areas are seldom seen in time and can require a sustained effort for fixing, while a total reconstruction might apply. One also works with an amorphous and ever-changing structure that requires many attempts until a remedy is found, if a remedy is possible at all.

Even if such operations are pretty well documented, often what organizations lack are the skilled resources needed during and post-implementation, resources that must know as well the patient, and ideally its historical and further health preconditions. Each patient is different and quite often needs its own treatment/medication. With such changes, the organization lands itself on a discovery journey in which the appropriate path can easily deviate from the well-trodden paths.


20 March 2021

🧭Business Intelligence: New Technologies, Old Challenges (Part II - ETL vs. ELT)

Data lakes and similar cloud-based repositories drove the requirement of loading the raw data before performing any transformations on the data. At least that’s the approach the new wave of ELT (Extract, Load, Transform) technologies use to handle analytical and data integration workloads, which is probably recommendable for the mentioned cloud-based contexts. However, ELT technologies are especially relevant when one needs to handle data with high velocity, variance, validity or different value of truth (aka big data). This is because they allow processing the workloads over architectures that can be scaled with workloads’ demands.

This is probably the most important aspect, even if there can be further advantages, like using built-in connectors to a wide range of sources or implementing complex data flow controls. The ETL (Extract, Transform, Load) tools have the same capabilities, maybe limited to certain data sources, though their newer versions seem to bridge the gap.

One of the most stressed advantages of ELT is the possibility of having all the (business) data in the repository, though this is not a technological advantage. The same can be obtained via ETL tools, even if in some cases this might involve a bigger effort, depending on the functionality available in each tool. It’s true that ETL solutions have a narrower scope by loading a subset of the available data, or that transformations are made before loading the data. However, this depends on the scope considered while building the data warehouse or data mart, respectively on the design of the ETL packages. Both are a matter of choice, and the choices can be traced back to business requirements or technical best practices.

Some of the advantages seen are context-dependent, that is, they depend on the context in which the technologies are used and in which the problems are solved. ETL solutions are often criticized because the available data are already prepared (aggregated, converted) and new requirements will drive additional effort. On the other side, in ELT-based solutions all the data are made available and eventually further transformed, but here too the level of transformation depends on specific requirements. Independently of the approach used, the data are still available if needed, though further processing involves a certain effort.

Building usable and reliable data models depends on good design, and the most important challenges reside in the design process. Some think that in ETL scenarios the design is done beforehand, though that’s not necessarily true. One can pull the raw data from the source and build the data models in the target repositories.

Data conversion and cleaning are needed under both approaches. In some scenarios it is ideal to do this upfront, minimizing the effect these processes have on the data’s usage, while in other scenarios it’s helpful to address them later in the process, with the risk that each project will address them differently. This can become an issue and should ideally be addressed by design (e.g. by building an intermediate layer) or at least organizationally (e.g. by enforcing best practices).
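
To make the contrast more tangible, here is a minimal Python sketch that applies the same cleaning rule in both ways against an in-memory SQLite database: once before loading (ETL) and once as a view over the raw table (ELT, with the view playing the role of the intermediate layer). It is only an illustration and is not tied to any particular ETL or ELT product; the table names, column names and sample rows are made up for the example.

```python
import sqlite3

# Hypothetical raw extract: codes with stray whitespace, amounts as text.
RAW_ROWS = [(" de ", "10.50"), ("FR", "7"), ("de", "2.25")]


def clean(country, amount):
    """Shared cleaning rule: normalize the code and convert the amount."""
    return country.strip().upper(), float(amount)


def run_etl(conn):
    """ETL: transform in the pipeline, load only the cleaned rows."""
    conn.execute("CREATE TABLE sales_clean (country TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales_clean VALUES (?, ?)",
        [clean(c, a) for c, a in RAW_ROWS],
    )
    return conn.execute(
        "SELECT country, SUM(amount) FROM sales_clean "
        "GROUP BY country ORDER BY country"
    ).fetchall()


def run_elt(conn):
    """ELT: load the raw rows as-is, transform inside the repository."""
    conn.execute("CREATE TABLE sales_raw (country TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", RAW_ROWS)
    # The view is the intermediate layer: one shared place for the cleaning.
    conn.execute(
        "CREATE VIEW sales_v AS "
        "SELECT UPPER(TRIM(country)) AS country, "
        "CAST(amount AS REAL) AS amount FROM sales_raw"
    )
    return conn.execute(
        "SELECT country, SUM(amount) FROM sales_v "
        "GROUP BY country ORDER BY country"
    ).fetchall()


etl_result = run_etl(sqlite3.connect(":memory:"))
elt_result = run_elt(sqlite3.connect(":memory:"))
print(etl_result)
print(elt_result)
```

Both paths return the same totals per country. What differs is where the cleaning logic lives and whether the raw rows remain in the repository for later, differently shaped requests.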

Claiming that ELT is better just because the data are true (being in raw form) can be taken only as a marketing slogan. The degree of truth the data have depends on the way they reflect the business’ processes and the way they are maintained, while their quality is judged entirely on their intended use. Even if raw data allow more flexibility in handling the various requests, the challenges involved in processing them can be neglected only by accepting the consequences that follow.

The cloud-based analytics and data integration technologies seem to allow both approaches, so building optimal solutions relies on professionals’ wisdom in making appropriate choices.


🧭Business Intelligence: New Technologies, Old Challenges (Part I: An Introduction)

Each important technology has the potential of creating divides between the specialists from a given field. This aspect is more evident in data-driven fields like BI/Analytics or Data Warehousing. The data professionals (engineers, scientists, analysts, developers) skilled only in the new wave of technologies tend to disregard the former technologies and the role they played in the data landscape. The argument for such behavior is rooted in the belief that a new technology is better and can solve any problem better than previous technologies did. It’s a kind of mirage professionals and customers can easily fall for.

Being bigger, faster or having new functionality doesn’t make a tool the best choice by default. The choice must be rooted in the problem to be solved and the set of requirements it comes with. Just because a vibratory rammer is a new technology, is faster and has more power in applying pressure doesn’t mean that it will replace a hammer. Where a certain type of power is needed the vibratory rammer might be the best tool, while for situations in which a minimum of power and probably more precision is needed, like driving in a nail, an adequately sized hammer will prove to be a better choice.

A technology is to be used in certain (business/technological) contexts, and even if contexts often overlap, the further details (aka requirements) should lead to the proper use of tools. It’s among a professional’s duties to be able to differentiate between contexts, requirements and the capabilities of the tools appropriate for each context. In this resides partially a professional’s mastery of their field of work and the ability to provide adequate solutions for customers’ needs. Especially in IT, it’s not enough to master the new tools; one also needs an understanding of the preceding tools, their usage contexts, capabilities and challenges.

From a historical perspective each tool appeared to fill a demand, and even if it maybe didn’t manage to fill it adequately, the experience obtained can prove to be valuable in one way or another. Otherwise, one risks reinventing the wheel, or more dangerously, repeating the failures of the past. Each new technology seems to provide a déjà vu from this perspective.

Moreover, a new technology provides new opportunities and may require changing our way of thinking with respect to how the technology is used and the processes or techniques associated with it. Knowledge of past technologies helps in identifying such opportunities more easily. How a tool is used is also a matter of skills, while its appropriate use and adoption imply an inherent learning curve. Having previous experience with similar tools tends to reduce the learning curve considerably, though hands-on learning is still necessary, and appropriate learning materials or tutoring are in some cases needed for a smoother transition.

As far as the implementation of mature technologies is concerned, the challenges were seldom the technologies themselves but mostly of a non-technical nature, ranging from poor understanding of or knowledge about the tools, their role and the implications they have for an organization, to an organization’s maturity in leading projects. Even the most advanced technology can fail in the hands of non-experts. Experience can’t be judged based only on the years spent in the field or the number of projects one worked on, but on the understanding acquired about the challenges of implementation and usage. These latter aspects seem to be widely ignored, even if they can make the difference between success and failure in a technology’s implementation.

Ultimately, each technology is appropriate in certain contexts and a new technology doesn’t necessarily make another obsolete, at least not until the old contexts become obsolete.


11 March 2021

💠🗒️Microsoft Azure: Azure Data Factory [Notes]

Microsoft Azure: Azure Data Factory (ADF)

{definition} pay-per-use serverless cloud-based data integration service that orchestrates and automates the movement and transformation of both cloud-based and on-premises data sources [1]
- ⇐ a hybrid and scalable data integration service for Big Data and advanced end-to-end analytics solutions [11]
- ⇐ Microsoft Azure PaaS offering for ETL/ELT workloads found at its second generation [11]
- allows creating data-driven flows to orchestrate movement of data between supported data stores and processing of data using compute services in other regions or in an on-premises environment
{benefit} easy-to-use
- {feature} allows creating code-free pipelines with drag-and-drop functionality [2]
- {feature} uses JSON to describe each of its entities
{benefit} cost-effective
- pay-as-you-go model against the Azure subscription with no up-front costs
- low price-to-performance ratio
  - ⇐ cost effective and performant at the same time
- fully managed serverless cloud service that scales on demand [2]
  - ⇒ requires zero hardware maintenance [1]
  - ⇒ can easily scale beyond what was originally anticipated [1]
- does not store any data [1]
- provides additional cost-saving functionality [11]
  - {feature} it takes care of the provisioning and teardown of the cluster once the job has executed [11]
{benefit} powerful
- allows ingesting on-premises and cloud-based data sources
- high-performance hybrid connectivity
  - over 90 built-in connectors make it easy to interact with all kinds of technologies [11]
- orchestrate at scale
  - on-demand compute
  - Big Data workloads are scaled over multiple nodes to chunk data in parallel [11]
- {feature} [ADFv2] monitoring
  - richer monitoring, natively integrated with Azure Monitor and OMS [11]
    - includes feature-rich monitoring and management tools to visualize the current state of data pipelines, data lineage and pipeline dependencies [1]
- {feature} [ADFv2] control flow functionality
  - allows defining complex workflows using programmatic or UI mechanisms
    - allows defining parameters at pipeline level [11]
    - includes custom state passing and looping containers [11]
    - pipelines can be authored via additional tools
      - e.g. PowerShell, .NET, Python, REST APIs
      - ⇒ helps ISVs build SaaS-based analytics solutions on top of ADF app models
{benefit} intelligent
- autonomous ETL allows unlocking operational efficiencies and enabling citizen integrators [2]
{benefit} enterprise-grade security
- provides same security standards as any other Microsoft service [11]
{benefit} monthly release cycle
- {feature} via auto-update
- improvements may include support for new connectors, bug fixes, security patches, and performance improvements [11]
{benefit} backwards compatibility
- {feature} [ADFv2] allows rehosting SSIS solutions [2]
  - ⇒ helpful for modernizing data warehouse solutions
{prerequisite} an Azure subscription with the contributor role assigned to at least one resource group
{limitation} availability
- the service isn’t available in all regions
  - an instance can be made available in another region to trigger the job on the customer’s compute environment [1]
    - ⇐ the time for executing the job on the compute environment doesn’t change [1]
{concept} activity
- the unit of orchestration in ADF [1]
- defines the actions to perform on data [1]
- takes zero or more datasets as inputs and produces one or more datasets as outputs [1]
- activity types
  - data movement activities
  - data transformation activities
  - control activities
    - control how the pipeline works and interacts with the data [10]
    - allow executing pipelines [10]
    - allow running a foreach statement or Lookup activities [10]
{concept} pipeline
- logical grouping of activities that together perform a task [1]
  - the sequence can have a complex schedule and dependencies that need to be orchestrated and automated [1]
  - two activities can be chained by setting the output dataset of one activity as the input dataset of the other activity
- allows building ETL/ELT workloads
- scheduled by scheduler triggers [10]
- data in a pipeline is referred to by different names
  - ⇐ based on the amount of modification that has been performed
  - raw data
    - data with no processing applied [10]
      - ⇒ does not yet have a schema applied
    - stored in the message encoding format used to send tracking events (e.g. JSON)
    - can be organized into meaningful data stores and data lakes [10]
      - ⇐ further used in decision-making
    - it's common to send all tracking events as raw events
      - ⇐ because all events can be sent to a single endpoint and schemas can be applied later in the pipeline [10]
  - processed data
    - raw data that has been decoded in the event-specific formats with the schema applied
      - e.g. JSON tracking events that have been translated into a session start event with a fixed schema [10]
    - usually stored in different event tables and destinations in a data pipeline [10]
  - cooked data
    - processed data that has been aggregated or summarized [10]
- {concept} pipeline parameters
  - similar to SSIS package parameters
    - ⇐ need to be set from outside packages
  - can be passed from the parent pipeline
{concept} dataset
- named references/pointers to the data used as an input or an output of an activity [1]
- identifies data structures within different (linked) data stores [1]
  - ⇐ before creating a dataset, a linked service must be created to link the data store to ADF [10]
  - once created, it can be used with activities in a pipeline [10]
    - e.g. a dataset can be an input or output dataset of a copy activity
{concept} linked service
- defines the information needed by ADF to connect to external resources at runtime
  - much like connection strings which define the connection information [10]
- used to represent
  - {concept} data store
    - holds the input/output data for ADF
    - e.g. tables, files, folders, and documents
  - {concept} compute resource
    - can host the execution of an activity [1]
{concept} scheduler triggers
- allow pipelines to be triggered on a wall-clock schedule [10]
  - pipelines and triggers have an n-m relationship
    - multiple triggers can kick off a single pipeline
    - the same trigger can kick off multiple pipelines
  - manual triggers trigger pipelines on demand [10]
- once defined, it must be started to begin triggering the pipeline [10]
- comes into effect only after publishing the solution to ADF [10]
  - ⇐ not when saving the trigger in the UI [10]
- to run a pipeline, a pipeline reference must be included in trigger definition [10]
- there is a cost associated with each pipeline run
  - {recommendation} when testing, make sure that the pipeline is triggered only a couple of times [10]
  - {recommendation} ensure that there is enough time for the pipeline to run between the published time and the end time [10]
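
The relationships described in the notes above (activities chained through datasets, and the n-m relationship between triggers and pipelines) can be sketched with plain Python dictionaries. This is only an illustration of the concepts: the structure is loosely modelled on the idea that ADF describes its entities in JSON, but the property names are simplified assumptions and not the actual ADF schema, and all entity names (CopyAndAggregate, Cleanup, Hourly, Nightly and the dataset names) are hypothetical.

```python
import json

# Simplified, hypothetical definitions; not the exact ADF JSON schema.
pipeline = {
    "name": "CopyAndAggregate",
    "activities": [
        {"name": "CopyRaw", "type": "Copy",
         "inputs": ["SourceFiles"], "outputs": ["RawTable"]},
        {"name": "Aggregate", "type": "Transform",
         "inputs": ["RawTable"], "outputs": ["CookedTable"]},
    ],
}

# Each trigger carries references to the pipelines it kicks off.
triggers = [
    {"name": "Hourly", "pipelines": ["CopyAndAggregate", "Cleanup"]},
    {"name": "Nightly", "pipelines": ["CopyAndAggregate"]},
]


def chained_pairs(pipeline_def):
    """Return (upstream, downstream) activity names linked by a dataset."""
    pairs = []
    for up in pipeline_def["activities"]:
        for down in pipeline_def["activities"]:
            if up is not down and set(up["outputs"]) & set(down["inputs"]):
                pairs.append((up["name"], down["name"]))
    return pairs


def triggers_for(pipeline_name, trigger_defs):
    """List the triggers whose definition references the given pipeline."""
    return [t["name"] for t in trigger_defs if pipeline_name in t["pipelines"]]


print(json.dumps(pipeline, indent=2))
print(chained_pairs(pipeline))
print(triggers_for("CopyAndAggregate", triggers))
```

Here CopyRaw and Aggregate are chained because the output dataset of the first is the input dataset of the second, while the pipeline is referenced by two triggers and one of the triggers references two pipelines.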


Acronyms:

Azure Data Factory (ADF)

Continuous Integration/Continuous Deployment (CI/CD)

Extract Load Transform (ELT)

Extract Transform Load (ETL)

Independent Software Vendors (ISVs)

Operations Management Suite (OMS)

pay-as-you-go (PAYG)

SQL Server Integration Services (SSIS)

Resources:

[1] Microsoft (2020) "Microsoft Business Intelligence and Information Management: Design Guidance", by Rod Colledge

[2] Microsoft (2021) Azure Data Factory [source]

[3] Microsoft (2018) Azure Data Factory: Data Integration in the Cloud [source]

[4] Microsoft (2021) Integrate data with Azure Data Factory or Azure Synapse Pipeline [source]

[10] Coursera (2021) Data Processing with Azure [source]

[11] Sudhir Rawat & Abhishek Narain (2019) "Understanding Azure Data Factory: Operationalizing Big Data and Advanced Analytics Solutions"
