Showing posts with label data processing. Show all posts
Showing posts with label data processing. Show all posts

10 November 2024

🏭🗒️Microsoft Fabric: Data Warehouse [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 11-Mar-2024

Warehouse vs SQL analytics endpoint in Microsoft Fabric
Warehouse vs SQL analytics endpoint in Microsoft Fabric [3]

[Microsoft Fabric] Data Warehouse

  • highly available relational data warehouse that can be used to store and query data in the Lakehouse
    • supports the full transactional T-SQL capabilities 
    • modernized version of the traditional data warehouse
  • unifies capabilities from Synapse Dedicated and Serverless SQL Pools
  • modernized with key improvements
  • resources are managed elastically to provide the best possible performance
    • ⇒ no need to think about indexing or distribution
    • a new parser gives enhanced CSV file ingestion time
    • metadata is now cached in addition to data
    • improved assignment of compute resources to milliseconds
    • multi-TB result sets are streamed to the client
  • leverages a distributed query processing engine
    • provides with workloads that have a natural isolation boundary [3]
      • true isolation is achieved by separating workloads with different characteristics, ensuring that ETL jobs never interfere with their ad hoc analytics and reporting workloads [3]
  • {operation} data ingestion
    • involves moving data from source systems into the data warehouse [2]
      • the data becomes available for analysis [1]
    • via Pipelines, Dataflows, cross-database querying, COPY INTO command
    • no need to copy data from the lakehouse to the data warehouse [1]
      • one can query data in the lakehouse directly from the data warehouse using cross-database querying [1]
  • {operation} data storage
    • involves storing the data in a format that is optimized for analytics [2]
  • {operation} data processing
    • involves transforming the data into a format that is ready for consumption by analytical tools [1]
  • {operation} data analysis and delivery
    • involves analyzing the data to gain insights and delivering those insights to the business [1]
  • {operation} designing a warehouse (aka warehouse design)
    • standard warehouse design can be used
  • {operation} sharing a warehouse (aka warehouse sharing)
    • a way to provide users read access to the warehouse for downstream consumption
      • via SQL, Spark, or Power BI
    • the level of permissions can be customized to provide the appropriate level of access
  • {feature} mirroring 
    • provides a modern way of accessing and ingesting data continuously and seamlessly from any database or data warehouse into the Data Warehousing experience in Fabric
      • any database can be accessed and managed centrally from within Fabric without having to switch database clients
      • data is replicated in a reliable way in real-time and lands as Delta tables for consumption in any Fabric workload
  • {concept}SQL analytics endpoint 
    • a warehouse that is automatically generated from a Lakehouse in Microsoft Fabric [3]
  • {concept}virtual warehouse
    • can containing data from virtually any source by using shortcuts [3]
  • {concept} cross database querying 
    • enables to quickly and seamlessly leverage multiple data sources for fast insights and with zero data duplication [3]
References:
[1] Microsoft Learn: Fabric (2023) Get started with data warehouses in Microsoft Fabric (link
[2] Microsoft Learn: Fabric (2023) Microsoft Fabric decision guide: choose a data store (link)
[3] Microsoft Learn: Fabric (2024) What is data warehousing in Microsoft Fabric? (link)
[4] Microsoft Learn: Fabric (2023) Better together: the lakehouse and warehouse (link)

Resources:
[1] Microsoft Learn: Fabric (2023) Data warehousing documentation in Microsoft Fabric (link)


13 February 2024

🧭Business Intelligence: A One-Man Show (Part V: Focus on the Foundation)

Business Intelligence Suite
Business Intelligence Suite

I tend to agree that one person can't do anymore "everything in the data space", as Christopher Laubenthal put it his article on the topic [1]. He seems to catch the essence of some of the core data roles found in organizations. Summarizing these roles, data architecture is about designing and building a data infrastructure, data engineering is about moving data, database administration is mainly about managing databases, data analysis is about assisting the business with data and reports, information design is about telling stories, while data science can be about studying the impact of various components on the data. 

However, I find his analogy between a college's functional structure and the core data roles as poorly chosen from multiple perspectives, even if both are about building an infrastructure of some type. 

Firstly, the two constructions have different foundations. Data exists in a an organization also without data architects, data engineers or data administrators (DBAs)! It's enough to buy one or more information systems functioning as islands and reporting needs will arise. The need for a data architect might come when the systems need to be integrated or maybe when a data warehouse needs to be build, though many organizations are still in business without such constructs. While for the others, the more complex the integrations, the bigger the need for a Data Architect. Conversely, some systems can be integrated by design and such capabilities might drive their selection.

Data engineering is needed mainly in the context of the cloud, respectively of data lake-based architectures, where data needs to be moved, processed and prepared for consumption. Conversely, architectures like Microsoft Fabric minimize data movement, the focus being on data processing, the successive transformations it needs to suffer in moving from bronze to the gold layer, respectively in creating an organizational semantical data model. The complexity of the data processing is dependent on data' structuredness, quality and other data characteristics. 

As I mentioned before, modern databases, including the ones in the cloud, reduce the need for DBAs to a considerable degree. Unless the volume of work is big enough to consider a DBA role as an in-house resource, organizations will more likely consider involving a service provider and a contingent to cover the needs. 

Having in-house one or more people acting under the Data Analyst role, people who know and understand the business, respectively the data tools used in the process, can go a long way. Moreover, it's helpful to have an evangelist-like resource in house, a person who is able to raise awareness and knowhow, help diffuse knowledge about tools, techniques, data, results, best practices, respectively act as a mentor for the Data Analyst citizens. From my point of view, these are the people who form the data-related backbone (foundation) of an organization and this is the minimum of what an organization should have!

Once this established, one can build data warehouses, data integrations and other support architectures, respectively think about BI and Data strategy, Data Governance, etc. Of course, having a Chief Data Officer and a Data Strategy in place can bring more structure in handling the topics at the various levels - strategical, tactical, respectively operational. In constructions one starts with a blueprint and a data strategy can have the same effect, if one knows how to write it and implement it accordingly. However, the strategy is just a tool, while the data-knowledgeable workers are the foundation on which organizations should build upon!

"Build it and they will come" philosophy can work as well, though without knowledgeable and inquisitive people the philosophy has high chances to fail.

Previous Post <<||>> Next Post

Resources:
[1] Christopher Laubenthal (2024) "Why One Person Can’t Do Everything In Data" (link)

01 April 2021

💎SQL Reloaded: Processing JSON Files with Flat Matrix Structure in SQL Server 2016+

Besides the CSV format, many of the data files made available under the open data initiatives are stored in JSON format, which makes data more difficult to process, even if JSON offers a richer structure that goes beyond the tabular structure of CSV files. Fortunately, starting with SQL Server 2016, JSON became a native format, which makes the processing of JSON files relatively easy, the easiness with which one can process the data depending on how they are structured.

Let’s consider as example a JSON file with the world population per country and year that can be downloaded from DataHub (source). The structure behind resembles a tabular model (see the table on the source website), having a flat structure. Just export the data to a file with the JSON extension (e.g. ‘population-figures-by-country.json’) locally (e.g. ‘D:/Data’). The next step is to understand file’s structure. Some repositories provide good documentation in this respect, though there are also many exceptions. Having a JSON editor like Visual Studio which reveals the structure makes easier the process. 

As in the case of CSV files, is needed to infer the data types. There are two alphanumeric fields (Country & Country Code), while the remaining fields are numeric. The only challenge raised by the data seems to be the difference in format between the years 2002 and 2015 in respect to the other years, as the values of the former contain a decimal after comma. All the numeric values should have been whole values. 

It’s recommended to start small and build the logic iteratively. Therefore, for the first step just look at files content via the OPENROWSET function:

-- looking at the JSON file 
SELECT *
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 

In a second step one can add the OPENJSON function by looking only at the first record: 

-- querying a json file (one record)
SELECT *
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$[0]')

In a third step one can add a few columns (e.g. Country & Country Code) to make sure that the select statement works correctly. 

-- querying a json file (all records, a few fields)
SELECT Country 
, CountryCode 
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$')
 WITH ( 
  Country nvarchar(max) '$.Country'
, CountryCode nvarchar(3) '$.Country_Code'
) AS DAT; 

In a next step can be added all the columns and import the data in a table (e.g. dbo.CountryPopulation) on the fly: 

-- importing a json file (all records) on the fly
SELECT DAT.Country
, DAT.CountryCode
, DAT.Y1960
, DAT.Y1961
, DAT.Y1962
, DAT.Y1963
, DAT.Y1964
, DAT.Y1965
, DAT.Y1966
, DAT.Y1967
, DAT.Y1968
, DAT.Y1969
, DAT.Y1970
, DAT.Y1971
, DAT.Y1972
, DAT.Y1973
, DAT.Y1974
, DAT.Y1975
, DAT.Y1976
, DAT.Y1977
, DAT.Y1978
, DAT.Y1979
, DAT.Y1980
, DAT.Y1981
, DAT.Y1982
, DAT.Y1983
, DAT.Y1984
, DAT.Y1985
, DAT.Y1986
, DAT.Y1987
, DAT.Y1988
, DAT.Y1989
, DAT.Y1990
, DAT.Y1991
, DAT.Y1992
, DAT.Y1993
, DAT.Y1994
, DAT.Y1995
, DAT.Y1996
, DAT.Y1997
, DAT.Y1998
, DAT.Y1999
, DAT.Y2000
, DAT.Y2001
, Cast(DAT.Y2002 as bigint) Y2002
, Cast(DAT.Y2003 as bigint) Y2003
, Cast(DAT.Y2004 as bigint) Y2004
, Cast(DAT.Y2005 as bigint) Y2005
, Cast(DAT.Y2006 as bigint) Y2006
, Cast(DAT.Y2007 as bigint) Y2007
, Cast(DAT.Y2008 as bigint) Y2008
, Cast(DAT.Y2009 as bigint) Y2009
, Cast(DAT.Y2010 as bigint) Y2010
, Cast(DAT.Y2011 as bigint) Y2011
, Cast(DAT.Y2012 as bigint) Y2012
, Cast(DAT.Y2013 as bigint) Y2013
, Cast(DAT.Y2014 as bigint) Y2014
, Cast(DAT.Y2015 as bigint) Y2015
, DAT.Y2016
INTO dbo.CountryPopulation
FROM OPENROWSET (BULK 'D:\data\population-figures-by-country.json', SINGLE_CLOB)  as jsonfile 
     CROSS APPLY OPENJSON(BulkColumn,'$')
 WITH ( 
  Country nvarchar(max) '$.Country'
, CountryCode nvarchar(3) '$.Country_Code'
, Y1960 bigint '$.Year_1960'
, Y1961 bigint '$.Year_1961'
, Y1962 bigint '$.Year_1962'
, Y1963 bigint '$.Year_1963'
, Y1964 bigint '$.Year_1964'
, Y1965 bigint '$.Year_1965'
, Y1966 bigint '$.Year_1966'
, Y1967 bigint '$.Year_1967'
, Y1968 bigint '$.Year_1968'
, Y1969 bigint '$.Year_1969'
, Y1970 bigint '$.Year_1970'
, Y1971 bigint '$.Year_1971'
, Y1972 bigint '$.Year_1972'
, Y1973 bigint '$.Year_1973'
, Y1974 bigint '$.Year_1974'
, Y1975 bigint '$.Year_1975'
, Y1976 bigint '$.Year_1976'
, Y1977 bigint '$.Year_1977'
, Y1978 bigint '$.Year_1978'
, Y1979 bigint '$.Year_1979'
, Y1980 bigint '$.Year_1980'
, Y1981 bigint '$.Year_1981'
, Y1982 bigint '$.Year_1982'
, Y1983 bigint '$.Year_1983'
, Y1984 bigint '$.Year_1984'
, Y1985 bigint '$.Year_1985'
, Y1986 bigint '$.Year_1986'
, Y1987 bigint '$.Year_1987'
, Y1988 bigint '$.Year_1988'
, Y1989 bigint '$.Year_1989'
, Y1990 bigint '$.Year_1990'
, Y1991 bigint '$.Year_1991'
, Y1992 bigint '$.Year_1992'
, Y1993 bigint '$.Year_1993'
, Y1994 bigint '$.Year_1994'
, Y1995 bigint '$.Year_1995'
, Y1996 bigint '$.Year_1996'
, Y1997 bigint '$.Year_1997'
, Y1998 bigint '$.Year_1998'
, Y1999 bigint '$.Year_1999'
, Y2000 bigint '$.Year_2000'
, Y2001 bigint '$.Year_2001'
, Y2002 decimal(19,1) '$.Year_2002'
, Y2003 decimal(19,1) '$.Year_2003'
, Y2004 decimal(19,1) '$.Year_2004'
, Y2005 decimal(19,1) '$.Year_2005'
, Y2006 decimal(19,1) '$.Year_2006'
, Y2007 decimal(19,1) '$.Year_2007'
, Y2008 decimal(19,1) '$.Year_2008'
, Y2009 decimal(19,1) '$.Year_2009'
, Y2010 decimal(19,1) '$.Year_2010'
, Y2011 decimal(19,1) '$.Year_2011'
, Y2012 decimal(19,1) '$.Year_2012'
, Y2013 decimal(19,1) '$.Year_2013'
, Y2014 decimal(19,1) '$.Year_2014'
, Y2015 decimal(19,1) '$.Year_2015'
, Y2016 bigint '$.Year_2016'
) AS DAT; 

As can be seen the decimal values were converted to bigint to preserve the same definition. Moreover, this enables data processing later, as no additional (implicit) conversions are necessary. 

Also, the columns’ names were changed either for simplification/convenience or simply taste. 

Writing such a monster query can be time-consuming, though preparing the metadata into Excel can decrease considerably the effort. With copy-past and a few tricks (e.g. replacing values, splitting columns based on a delimiter) one can easily prepare such a structure:

Source fieldTarget fieldDataTypeValueImport ClauseSelect Clause
CountryCountrynvarchar(max) emen Rep., Country nvarchar(max) '$.Country', DAT.Country
Country_CodeCountryCodenvarchar(3) YEM, CountryCode nvarchar(3) '$.Country_Code', DAT.CountryCode
Year_1960Y1960bigint5172135, Y1960 bigint '$.Year_1960', DAT.Y1960
Year_1961Y1961bigint5260501, Y1961 bigint '$.Year_1961', DAT.Y1961
Year_1962Y1962bigint5351799, Y1962 bigint '$.Year_1962', DAT.Y1962
Year_1963Y1963bigint5446063, Y1963 bigint '$.Year_1963', DAT.Y1963
Year_1964Y1964bigint5543339, Y1964 bigint '$.Year_1964', DAT.Y1964
Year_1965Y1965bigint5643643, Y1965 bigint '$.Year_1965', DAT.Y1965
Year_1966Y1966bigint5748588, Y1966 bigint '$.Year_1966', DAT.Y1966
Year_1967Y1967bigint5858638, Y1967 bigint '$.Year_1967', DAT.Y1967
Year_1968Y1968bigint5971407, Y1968 bigint '$.Year_1968', DAT.Y1968
Year_1969Y1969bigint6083619, Y1969 bigint '$.Year_1969', DAT.Y1969
Year_1970Y1970bigint6193810, Y1970 bigint '$.Year_1970', DAT.Y1970
Year_1971Y1971bigint6300554, Y1971 bigint '$.Year_1971', DAT.Y1971
Year_1972Y1972bigint6407295, Y1972 bigint '$.Year_1972', DAT.Y1972
Year_1973Y1973bigint6523452, Y1973 bigint '$.Year_1973', DAT.Y1973
Year_1974Y1974bigint6661566, Y1974 bigint '$.Year_1974', DAT.Y1974
Year_1975Y1975bigint6830692, Y1975 bigint '$.Year_1975', DAT.Y1975
Year_1976Y1976bigint7034868, Y1976 bigint '$.Year_1976', DAT.Y1976
Year_1977Y1977bigint7271872, Y1977 bigint '$.Year_1977', DAT.Y1977
Year_1978Y1978bigint7536764, Y1978 bigint '$.Year_1978', DAT.Y1978
Year_1979Y1979bigint7821552, Y1979 bigint '$.Year_1979', DAT.Y1979
Year_1980Y1980bigint8120497, Y1980 bigint '$.Year_1980', DAT.Y1980
Year_1981Y1981bigint8434017, Y1981 bigint '$.Year_1981', DAT.Y1981
Year_1982Y1982bigint8764621, Y1982 bigint '$.Year_1982', DAT.Y1982
Year_1983Y1983bigint9111097, Y1983 bigint '$.Year_1983', DAT.Y1983
Year_1984Y1984bigint9472170, Y1984 bigint '$.Year_1984', DAT.Y1984
Year_1985Y1985bigint9847899, Y1985 bigint '$.Year_1985', DAT.Y1985
Year_1986Y1986bigint10232733, Y1986 bigint '$.Year_1986', DAT.Y1986
Year_1987Y1987bigint10628585, Y1987 bigint '$.Year_1987', DAT.Y1987
Year_1988Y1988bigint11051504, Y1988 bigint '$.Year_1988', DAT.Y1988
Year_1989Y1989bigint11523267, Y1989 bigint '$.Year_1989', DAT.Y1989
Year_1990Y1990bigint12057039, Y1990 bigint '$.Year_1990', DAT.Y1990
Year_1991Y1991bigint12661614, Y1991 bigint '$.Year_1991', DAT.Y1991
Year_1992Y1992bigint13325583, Y1992 bigint '$.Year_1992', DAT.Y1992
Year_1993Y1993bigint14017239, Y1993 bigint '$.Year_1993', DAT.Y1993
Year_1994Y1994bigint14692686, Y1994 bigint '$.Year_1994', DAT.Y1994
Year_1995Y1995bigint15320653, Y1995 bigint '$.Year_1995', DAT.Y1995
Year_1996Y1996bigint15889449, Y1996 bigint '$.Year_1996', DAT.Y1996
Year_1997Y1997bigint16408954, Y1997 bigint '$.Year_1997', DAT.Y1997
Year_1998Y1998bigint16896210, Y1998 bigint '$.Year_1998', DAT.Y1998
Year_1999Y1999bigint17378098, Y1999 bigint '$.Year_1999', DAT.Y1999
Year_2000Y2000bigint17874725, Y2000 bigint '$.Year_2000', DAT.Y2000
Year_2001Y2001bigint18390135, Y2001 bigint '$.Year_2001', DAT.Y2001
Year_2002Y2002decimal(19,1) 18919179.0, Y2002 decimal(19,1) '$.Year_2002', Cast(DAT.Y2002 as bigint) Y2002
Year_2003Y2003decimal(19,1) 19462086.0, Y2003 decimal(19,1) '$.Year_2003', Cast(DAT.Y2003 as bigint) Y2003
Year_2004Y2004decimal(19,1) 20017068.0, Y2004 decimal(19,1) '$.Year_2004', Cast(DAT.Y2004 as bigint) Y2004
Year_2005Y2005decimal(19,1) 20582927.0, Y2005 decimal(19,1) '$.Year_2005', Cast(DAT.Y2005 as bigint) Y2005
Year_2006Y2006decimal(19,1) 21160534.0, Y2006 decimal(19,1) '$.Year_2006', Cast(DAT.Y2006 as bigint) Y2006
Year_2007Y2007decimal(19,1) 21751605.0, Y2007 decimal(19,1) '$.Year_2007', Cast(DAT.Y2007 as bigint) Y2007
Year_2008Y2008decimal(19,1) 22356391.0, Y2008 decimal(19,1) '$.Year_2008', Cast(DAT.Y2008 as bigint) Y2008
Year_2009Y2009decimal(19,1) 22974929.0, Y2009 decimal(19,1) '$.Year_2009', Cast(DAT.Y2009 as bigint) Y2009
Year_2010Y2010decimal(19,1) 23606779.0, Y2010 decimal(19,1) '$.Year_2010', Cast(DAT.Y2010 as bigint) Y2010
Year_2011Y2011decimal(19,1) 24252206.0, Y2011 decimal(19,1) '$.Year_2011', Cast(DAT.Y2011 as bigint) Y2011
Year_2012Y2012decimal(19,1) 24909969.0, Y2012 decimal(19,1) '$.Year_2012', Cast(DAT.Y2012 as bigint) Y2012
Year_2013Y2013decimal(19,1) 25576322.0, Y2013 decimal(19,1) '$.Year_2013', Cast(DAT.Y2013 as bigint) Y2013
Year_2014Y2014decimal(19,1) 26246327.0, Y2014 decimal(19,1) '$.Year_2014', Cast(DAT.Y2014 as bigint) Y2014
Year_2015Y2015decimal(19,1) 26916207.0, Y2015 decimal(19,1) '$.Year_2015', Cast(DAT.Y2015 as bigint) Y2015
Year_2016Y2016bigint27584213, Y2016 bigint '$.Year_2016', DAT.Y2016

Based on this structure, one can add two further formulas in Excel to prepare the statements as above and then copy the fields (last two columns were generated using the below formulas): 

=", " & TRIM(B2) & " " & C2 & " '$." & TRIM(A2) & "'" 
=", DAT." & TRIM(B2)

Consuming data in which the values are stored in a matrix structure can involve further challenges sometimes, even if this type of storage tends to save space. For example, adding the values for a new year would involve extending the table with one more column, while performing calculations between years would involve referencing each column in formulas. Therefore, transforming the data from a matrix to a normalized structure can have some benefit. This can be achieved by writing a query via the UNPIVOT operator:

-- unpivoting the data 
SELECT RES.Country
, RES.CountryCode
, Cast(Replace(RES.[Year], 'Y', '') as int) [Year]
, RES.Population
--INTO dbo.CountryPopulationPerYear
FROM 
( -- basis data
	SELECT Country
	, CountryCode
	, Y1960, Y1961, Y1962, Y1963, Y1964, Y1965, Y1966, Y1967, Y1968, Y1969
	, Y1970, Y1971, Y1972, Y1973, Y1974, Y1975, Y1976, Y1977, Y1978, Y1979
	, Y1980, Y1981, Y1982, Y1983, Y1984, Y1985, Y1986, Y1987, Y1988, Y1989
	, Y1990, Y1991, Y1992, Y1993, Y1994, Y1995, Y1996, Y1997, Y1998, Y1999
	, Y2000, Y2001, Y2002, Y2003, Y2004, Y2005, Y2006, Y2007, Y2008, Y2009
	, Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016
	FROM dbo.CountryPopulation
) DAT
UNPIVOT  -- unpivot logic
   (Population FOR [Year] IN  (Y1960, Y1961, Y1962, Y1963, Y1964, Y1965, Y1966, Y1967, Y1968, Y1969
, Y1970, Y1971, Y1972, Y1973, Y1974, Y1975, Y1976, Y1977, Y1978, Y1979
, Y1980, Y1981, Y1982, Y1983, Y1984, Y1985, Y1986, Y1987, Y1988, Y1989
, Y1990, Y1991, Y1992, Y1993, Y1994, Y1995, Y1996, Y1997, Y1998, Y1999
, Y2000, Y2001, Y2002, Y2003, Y2004, Y2005, Y2006, Y2007, Y2008, Y2009
, Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016)
) RES

Also this can be performed in two steps, first preparing the query, and in a final step inserting the data into a table (e.g. dbo.CountryPopulationPerYear) on the fly (re-execute the previous query after uncommenting the INSERT clause to generate the table). 

--reviewing the data 
SELECT Country
, CountryCode
, AVG(Population) AveragePopulation
, Max(Population) - Min(Population) RangePopulation
FROM dbo.CountryPopulationPerYear
WHERE [Year] BETWEEN 2010 AND 2019
GROUP BY Country
, CountryCode
ORDER BY Country

On the other side making comparisons between consecutive years is easier when using a matrix structure: 

--reviewing the data 
SELECT Country
, CountryCode
, Y2016
, Y2010
, Y2010-Y2010 [2016-2010]
, Y2011-Y2010 [2011-2010]
, Y2012-Y2011 [2011-2011]
, Y2013-Y2012 [2011-2012]
, Y2014-Y2013 [2011-2013]
, Y2015-Y2014 [2011-2014]
, Y2016-Y2015 [2011-2015]
FROM dbo.CountryPopulation
ORDER BY Country

Unless the storage space is a problem, in theory one can store the data in both formats as there can be requests which can benefit from one structure or the other. 

Happy coding!

04 February 2021

📦Data Migrations (DM): Conceptualization (Part VI: Data Migration Layer)

Data Migration
Data Migrations Series

Besides migrating the master and transactional data from the legacy systems there are usually three additional important business requirements for a Data Migration (DM) – migrate the data within expected timeline, with minimal disruption for the business, respectively within expected quality levels. Hence, DM’ timeline must match and synchronize with main project’s timeline in terms of main milestones, though the DM needs to be executed typically within a small timeframe of a few days during the Go-Live. In what concerns the third requirement, even if the data have high quality as available in the source systems or provided by the business, there are aspects like integration and consistency that rely primarily on the DM logic.

To address these requirements the DM logic must reach a certain level of performance and quality that allows importing the data as expected. From project’s beginning until UAT the DM team will integrate the various information iteratively, will need to test the changes several times, troubleshoot the deviations from expectations. The volume of effort required for these activities can be overwhelming. It’s not only important for the whole solution to be performant but each step must be designed so that besides fast execution, the changes and troubleshooting must involve a minimum of overhead.

For better understanding the importance, imagine a quest game in which the character has to go through a labyrinth with traps. If the player made a mistake he’ll need to restart from a certain distant point in time or even from the beginning. Now imagine that for each mistake he has the possibility of going one step back try a new option and move forward. For some it may look like cheating though in this way one can finish the game relatively quickly. It would be great if executing a DM could allow the same flexibility.

Unfortunately, unless the data are stored between steps or each step is a different package, an ETL solution doesn’t provide the flexibility of changing the code, moving one step behind, rerunning the step and performing troubleshooting, and this over and over again like in the quest game. To better illustrate the impact of such approach let’s consider that the DM has about 40 entities and one needs to perform on average 20 changes per entity. If one is able to move forwards and backwards probably each change will take about a few minutes to execute the code. Otherwise rerunning a whole package can take 5-10 times or even more as this can depend on packages’ size and data volume. For 800 changes only an additional minute per change equates with 800 minutes (about 13 hours).

In exchange, storing the data for an entity in a database for the important points of the processing and implementing the logic as a succession of SQL scripts allows this flexibility. The most important downside is that the steps need to be executed manually though this is a small price to pay for the flexibility and control gained. Moreover, with a few tricks one can load deltas as in the case of a phased DM.

To assure that the consistency of the data is kept one needs to build for each entity a set of validation queries that check for duplicates, for special cases, for data integrity, incorrect format, etc. The queries can be included in the sequence of logic used for the DM. Thus, one can react promptly to each unexpected value. When required, the validation rules can be built within reports and used in the data cleaning process by users, or even logged periodically per entity for tracking the progress.

Previous Post <<||>> Next Post

31 October 2020

🧊Data Warehousing: Architecture (Part III: Data Lakes & other Puddles)

Data Warehousing

One can consider a data lake as a repository of all of an organization’s data found in raw form, however this constraint might be too harsh as the data found at different levels of processing can be imported as well, for example the results of data mining or other Data Science techniques/methods can be considered as raw data for further processing.

In the initial definition provided by James Dixon, the difference between a data lake and a data mart/warehouse was expressed metaphorically as the transition from bottled water to lakes streamed (artificially) from various sources. It’s contrasted thus the objective-oriented, limited and single-purposed role of the data mart/warehouse in respect to the flow of data in nature that could be tapped and harnessed as desired. These are though metaphors intended to sensitize the buyer. Personally, I like to think of the data lake as an extension of the data infrastructure, in which the data mart or warehouse is integrant part. Imposing further constrains seem to have no benefit.  

Probably the most important characteristic of a data lake is that it makes the data of an organization discoverable and consumable, though from there to insight and other benefits is a long road and requires specific knowledge about the techniques used, as well about organization’s processes and data. Without this data lake-based solutions can lead to erroneous results, same as mixing several ingredients without having knowledge about their usage can lead to cooking experiments aloof from the art of cooking.

A characteristic of data is that they go through continuous change and have different timeliness, respectively degrees of quality in respect to the data quality dimensions implied and sources considered. Data need to reflect the reality at the appropriate level of detail and quality required by the processing application(s), this applying to data warehouses/marts as well data lake-based solutions.

Data found in raw form don’t necessarily represent the true/truth and don’t necessarily acquire a good quality no matter how much they are processed. Solutions need to be resilient in respect to the data they handle through their layers, independently of the data quality and transmission problems. Whether one talks about ETL, data migration or other types of data processing, keeping the data integrity at various levels and layers can be maybe the most important demand upon solutions.

Snapshots as moment-in-time recordings of tables, entities, sets of entities, datasets or whole databases, prove to be often the best mechanisms in keeping data integrity when this aspect is essential to their processing (e.g. data migrations, high-accuracy measurements). Unfortunately, the more systems are involved in the process and the broader span of the solutions over the sources, the more difficult it become to take such snapshots.

A SQL query’s output represents a snapshot of the data, therefore SQL-based solutions are usually appropriate for most of the business scenarios in which the characteristics of data (typically volume, velocity and/or variety) make their processing manageable. However, when the data are extracted by other means integrity is harder to obtain, especially when there’s no timestamp to allow data partitioning on a time scale, the handling of data integrity becoming thus in extremis a programmer’s task. In addition, getting snapshots of the data as they are changed can be a costly and futile task.

Further on, maintaining data integrity can prove to be a matter of design in respect not only to the processing of data, but also in respect to the source applications and the business processes they implement. The mastery of the underlying principles, techniques, patterns and methodologies, helps in the process of designing the right solutions.

Note:
Written as answer to a Medium post on data lakes and batch processing in data warehouses. 

18 May 2018

🔬Data Science: Natural Language Processing [NLP] (Definitions)

"Using software to 'understand' the meaning contained within texts. Everyday speech is broken down into patterns. Typically, these systems employ syntactic analysis to infer the semantic meaning embedded in documents. NLP identifies patterns in sample texts and makes predictions about unseen texts." (Craig F Smith & H Peter Alesso, "Thinking on the Web: Berners-Lee, Gödel and Turing", 2008)

"Use of computers to interpret and manipulate words as part of a language." (Dougal Hutchison, "Automated Essay Scoring Systems", 2009)

"It is a subfield of Computational Linguistics (i.e. the field that researches linguistics phenomena that occur in digital data), whose focus is on how to build automatic systems able to interpret/generate information in natural language." (Diana Pérez-Marín et al, "Adaptive Computer Assisted Assessment", 2010)

"the notion that the context of text can be inferred from the text itself." (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"An area of computer science involved with the computational study of human languages." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"Similarly to text mining, NLP is a multidisciplinary research field of computer science, artificial intelligence, and linguistics. However, it mainly focuses on the interaction between computers and human languages." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"Natural Language Processing is prevalently used to analyse the text or speech in order to make machine understand the words like human." (Anumeera Balamurali & Balamurali Ananthanarayanan,"Develop a Neural Model to Score Bigram of Words Using Bag-of-Words Model for Sentiment Analysis", 2020)

 "Natural language processing is the ability of computer program to understand human language as it is spoken or handwritten." (Neha Garg & Kamlesh Sharma, "Machine Learning in Text Analysis", 2020)

"NLP is a field of computer science and linguistics focused on techniques and algorithms for processing data, continuing natural language." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"NLP is a Linguistic approach to interact with human language and computer. This field comes under Artificial Intelligence and Computer Science." (Sayani Ghosal & Amita Jain, "Research Journey of Hate Content Detection From Cyberspace", 2021)

"a field of computer science involved with interactions between computers and human languages." (Analytics Insight)

"is a field of computer science, with the goal to understand or generate human languages, either in text or speech form. There are two primary sub fields of NLP, Natural Language Understanding (NLU), and Natural Language Generation (NLG)." (Accenture)

01 February 2018

🔬Data Science: MapReduce (Definitions)

"A data processing and aggregation paradigm consisting of a 'map' phase that selects data and a 'reduce' phase that transforms the data. In MongoDB, you can run arbitrary aggregations over data using map-reduce." (MongoDb, "Glossary", 2008)

"A divide-and-conquer strategy for processing large data sets in parallel. In the 'map' phase, the data sets are subdivided. The desired computation is performed on each subset. The 'reduce' phase combines the results of the subset calculations into a final result. MapReduce frameworks handle the details of managing the operations and the nodes they run on, including restarting operations that fail for some reason. The user of the framework only has to write the algorithms for mapping and reducing the data sets and computing with the subsets." (Dean Wampler & Alex Payne, "Programming Scala", 2009)

"A method by which computationally intensive problems can be processed on multiple computers in parallel. The method can be divided into a mapping step and a reducing step. In the mapping step, a master computer divides a problem into smaller problems that are distributed to other computers. In the reducing step, the master computer collects the output from the other computers. Although MapReduce is intended for Big Data resources, holding petabytes of data, most Big Data problems do not require MapReduce." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"An early Big Data (before this term became popular) programming solution originally developed by Google for parallel processing using very large data sets distributed across a number of computing and storage systems. A Hadoop implementation of MapReduce is now available." (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)

"Designed by Google as a way of efficiently executing a set of functions against a large amount of data in batch mode. The 'map' component distributes the programming problem or tasks across a large number of systems and handles the placement of the tasks in a way that balances the load and manages recovery from failures. After the distributed computation is completed, another function called 'reduce' aggregates all the elements back together to provide a result." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A programming model consisting of two logical steps - Map and Reduce - for processing massively parallelizable problems across extremely large datasets using a large cluster of commodity computers." (Haoliang Wang et al, "Accessing Big Data in the Cloud Using Mobile Devices", Handbook of Research on Cloud Infrastructures for Big Data Analytics, 2014)

"Algorithm that is used to split massive data sets among many commodity hardware pieces in an effort to reduce computing time." (Billie Anderson & J Michael Hardin, "Harnessing the Power of Big Data Analytics", Encyclopedia of Business Analytics and Optimization, 2014)

"MapReduce is a parallel programming model proposed by Google and is used to distribute computing on clusters of computers for processing large data sets." (Jyotsna T Wassan, "Emergence of NoSQL Platforms for Big Data Needs", Encyclopedia of Business Analytics and Optimization, 2014)

"A concept which is an abstraction of the primitives ‘map’ and ‘reduce’. Most of the computations are carried by applying a ‘map’ operation to each global record in order to generate key/value pairs and then apply the reduce operation in order to combine the derived data appropriately." (P S Shivalkar & B K Tripathy, "Rough Set Based Green Cloud Computing in Emerging Markets", Encyclopedia of Information Science and Technology 3rd Ed., 2015) 

"A programming model that uses a divide and conquer method to speed-up processing large datasets, with a special focus on semi-structured data." (Alfredo Cuzzocrea & Mohamed M Gaber, "Data Science and Distributed Intelligence", Encyclopedia of Information Science and Technology 3rd Ed., 2015) 

"MapReduce is a programming model for general-purpose parallelization of data-intensive processing. MapReduce divides the processing into two phases: a mapping phase, in which data is broken up into chunks that can be processed by separate threads - potentially running on separate machines; and a reduce phase, which combines the output from the mappers into the final result." (Guy Harrison, "Next Generation Databases: NoSQL, NewSQL, and Big Data", 2015)

"MapReduce is a technological framework for processing parallelize-able problems across huge data sets using a large number of computers (nodes). […] MapReduce consists of two major steps: 'Map' and 'Reduce'. They are similar to the original Fork and Join operations in distributed systems, but they can consider a large number of computers that can be constructed based on the Internet cloud. In the Map-step, the master computer (a node) first divides the input into smaller sub-problems and then distributes them to worker computers (worker nodes). A worker node may also be a sub-master node to distribute the sub-problem into even smaller problems that will form a multi-level structure of a task tree. The worker node can solve the sub-problem and report the results back to its upper level master node. In the Reduce-step, the master node will collect the results from the worker nodes and then combine the answers in an output (solution) of the original problem." (Li M Chen et al, "Mathematical Problems in Data Science: Theoretical and Practical Methods", 2015)

"A programming model which process massive amounts of unstructured data in parallel and distributed cluster of processors." (Fatma Mohamed et al, "Data Streams Processing Techniques Data Streams Processing Techniques", Handbook of Research on Machine Learning Innovations and Trends, 2017)

"A data processing framework of Hadoop which provides data intensive computation of large data sets by dividing tasks across several machines and finally combining the result." (Rupali Ahuja, "Hadoop Framework for Handling Big Data Needs", Handbook of Research on Big Data Storage and Visualization Techniques, 2018)

"A high-level programming model, which uses the “map” and “reduce” functions, for processing high volumes of data." (Carson K.-S. Leung, "Big Data Analysis and Mining", Encyclopedia of Information Science and Technology 4th Ed., 2018)

"Is a computational paradigm for processing massive datasets in parallel if the computation fits a three-step pattern: map, shard and reduce. The map process is a parallel one. Each process executes on a different part of data and produces (key, value) pairs. The shard process collects the generated pairs, sorts and partitions them. Each partition is assigned to a different reduce process which produces a single result." (Venkat Gudivada et al, "Database Systems for Big Data Storage and Retrieval", Handbook of Research on Big Data Storage and Visualization Techniques, 2018)

"Is a programming model or algorithm for the processing of data using a parallel programming implementation and was originally used for academic purposes associated with parallel programming techniques. (Soraya Sedkaoui, "Understanding Data Analytics Is Good but Knowing How to Use It Is Better!", Big Data Analytics for Entrepreneurial Success, 2019)

"MapReduce is a style of programming based on functional programming that was the basis of Hadoop." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"Is a specific programming model, which as such represents a new approach to solving the problem of processing large amounts of differently structured data. It consists of two functions - Map (sorting and filtering data) and Reduce (summarizing intermediate results), and it is executed in parallel and distributed." (Savo Stupar et al, "Importance of Applying Big Data Concept in Marketing Decision Making", Handbook of Research on Applied AI for International Business and Marketing Applications, 2021)

"A software framework for processing vast amounts of data." (Analytics Insight)

24 January 2018

🔬Data Science: Data Processing (Definitions)

"The act of turning raw data into meaningful output, generally associated with computers." (Greg Perry, "Sams Teach Yourself Beginning Programming in 24 Hours" 2nd Ed., 2001)

"Any process that converts data into information. The processing is usually assumed to be automated and running on an information system." (Eleutherios A Papathanassiou & Xenia J Mamakou, "Privacy Issues in Public Web Sites", Handbook of Research on Public Information Technology, 2008) 

"Obtaining, recording or holding the data, or carrying out any operation on the data, including organising, adapting or altering it. Retrieval, consultation or use of the data, disclosure of the data, and alignment, combination, blocking, erasure or destruction of the data are all legally classed as processing." (Mark Olive, "SHARE: A European Healthgrid Roadmap", 2009)

"The operation performed on data through capture, transformation, and storage, in order to derive new information according to a given set of rules." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Collection and elaboration of sensing data with the aim to derivate/infer new knowledge from original raw data." (Paolo Bellavista et al, "Crowdsensing in Smart Cities: Technical Challenges, Open Issues, and Emerging Solution Guidelines", 2015)

"The act of data manipulation through integration of mathematical tools, statistics, and computer application to generate information." (Babangida Zubairu, "Security Risks of Biomedical Data Processing in Cloud Computing Environment", 2018)

"Any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction." (Yordanka Ivanova, "Data Controller, Processor, or Joint Controller: Towards Reaching GDPR Compliance in a Data- and Technology-Driven World", 2020)

"Data processing is any action performed to turn raw data into useful information." (Xplenty) [source]

"Data processing occurs when data is collected and translated into usable information. […] Data processing starts with data in its raw form and converts it into a more readable format (graphs, documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized by employees throughout an organization." (Talend) [source]
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.