12 June 2020

🎡SSIS Project: Covid-19 Data

Introduction

I was exploring the Covid-19 data provided by the Johns Hopkins institute and, as usual, I stumbled upon several data issues. Therefore I thought I could share some scripts and ideas in a post. 

The data from the downloaded files cover the timeframe between the 22nd of January and the 10th of June 2020 and reflect the number of confirmed cases, deaths and recoveries per day, country and state. The data seem to be updated on a daily basis. Unfortunately, they are spread over several files (one for each indicator), which makes their consumption more difficult than expected, though the challenges are minor. To begin with, I downloaded the following files from the source mentioned above:
time_series_covid19_recovered_global_narrow.csv
time_series_covid19_confirmed_global_narrow.csv
time_series_covid19_deaths_global_narrow.csv

Before attempting anything with the files, it's recommended to look over them to check whether column names are provided and properly named, how the columns and rows are delimited, and whether anything else stands out during a first review. For example, I needed to delete the second line from each file. 

Data Loading 

When starting this kind of project, it's useful to first check its feasibility in terms of whether the data are usable. Therefore, I usually import the data first via the 'Import Data' wizard and decide later whether it makes sense to build a project for it. Right-click on the database into which you'd like to import the data, then from 'Tasks' choose 'Import Data' to start the wizard:


As data source we will consider the first data file, therefore in the 'Choose a Data Source' step, browse for the file:


The 'Locale' might appear different for you; in that case leave the default value, however make sure that the remaining values match the above screenshot. Look over the 'Columns' section to check whether the formatting was applied correctly. The preview offers a first overview of the data:



It's useful to review the default data types defined by the wizard for each column within the 'Advanced' section. For example, the first two fields could require more than the default 50 characters. The length or the data type can be modified as needed: 


One can attempt in theory to get all the column definitions right from the beginning, though for a first attempt this is less important. In addition, without knowing the data's definition, there's always the possibility for something to go wrong. Personally, I prefer loading the data as text and doing the needed conversions later, if necessary. 
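For example, once the data are loaded, the text columns can be converted on the fly with TRY_CAST (a minimal sketch based on the columns used later in this post; adjust the names if your table differs):

-- converting text columns after the load (sketch)
SELECT [Province State]
, [Country Region]
, TRY_CAST([Lat] as decimal(10,4)) [Lat]
, TRY_CAST([Long] as decimal(10,4)) [Long]
, TRY_CAST([Date] as date) [Date]
, TRY_CAST([Value] as int) [Value]
FROM [dbo].[time_series_covid19_recovered_global_narrow]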

In the next step one can define the 'Destination', the database where the data will be loaded: 



After defining the source and destination, one needs to define the mapping between the two. One can go with the table definition provided by the wizard or modify the table's name directly in the wizard: 



By clicking on 'Edit Mappings' one can review the mappings. As it's a one-to-one import, one can in theory skip this step. However, if there are data already in the table, one can delete the rows from the destination table or append the new rows directly to it. Here the first radio button is selected, as the table will be created as well: 



With this being done, one can run the package as it is - just click 'Next':



If everything is ok, each step from the package will appear in green, and at the end the number of imported records is shown. 


It may look like many steps, though once one gets used to the wizard, the data are loaded in less than 5 minutes.

Data Discovery

Once the data are loaded, it's time for data discovery - looking at the structure of the data and trying to understand their meaning. Before further using the data it's important to identify the attributes which uniquely identify a record within the dataset - in this case the combination of State, Country and Date. At least it should be unique; however, the second query returned some duplicates for the country 'Korea', respectively for 'Sint Eustatius and Saba', which seems to be a region within the Netherlands.
-- looking at the data
SELECT top 1000 *
FROM [dbo].[time_series_covid19_recovered_global_narrow]


-- checking for duplicates 
SELECT [Province State]
, [Country Region]
, Date 
, count(*) NoRecords 
FROM [dbo].[time_series_covid19_recovered_global_narrow]
GROUP BY [Province State]
, [Country Region]
, Date
HAVING count(*)>1

-- checking the aggregated values per country  
SELECT [Country Region]
, count(*) NoRecords 
, SUM(TRY_CAST(Value as int)) NoCases
FROM [dbo].[time_series_covid19_recovered_global_narrow]
GROUP BY [Country Region]
HAVING count(*)>1
ORDER BY [Country Region]
The two duplicates are caused by the fact that a comma was used in a country's, respectively a province's name, while the comma also serves as column delimiter. (Therefore it's better to use a character like "|" as delimiter, as the chances are small for it to appear within the values. Another solution would be to quote the alphanumeric values.) Fortunately, the problem can be easily fixed with an update which uses the dbo.CutLeft, respectively dbo.CutRight functions defined in a previous post. Therefore the functions need to be created first within the same database before running the scripts:
-- review the issue
SELECT *
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE [Country Region]  LIKE '%int Eustatius and Saba%'

-- correct the data
UPDATE [dbo].[time_series_covid19_recovered_global_narrow]
SET [Province State] = [Province State] + ', ' + [Country Region] 
, [Country Region] = [Lat]
, [Lat] = [Long]
, [Long] = [Date]
, [Date] = [Value]
, [Value] = [ISO 3166-1 Alpha 3-Codes]
, [ISO 3166-1 Alpha 3-Codes] = [Region Code]
, [Region Code] = [Sub-region Code]
, [Sub-region Code] = dbo.CutLeft([Intermediate Region Code], ',', 0)
, [Intermediate Region Code] = dbo.CutRight([Intermediate Region Code], ',', 0)
 WHERE [Country Region]  LIKE '%Sint Eustatius and Saba%'

-- review data after correction
SELECT *
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE [Country Region] LIKE '%int Eustatius and Saba%'
A similar solution is used for the second problem: 
-- review the issue
SELECT *
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE [Country Region] LIKE '%Korea%'

-- correct the data
UPDATE [dbo].[time_series_covid19_recovered_global_narrow]
SET [Province State] = '' 
, [Country Region] = Replace( [Country Region] + ', ' + [Lat], '"', '')
, [Lat] = [Long]
, [Long] = [Date]
, [Date] = [Value]
, [Value] = [ISO 3166-1 Alpha 3-Codes]
, [ISO 3166-1 Alpha 3-Codes] = [Region Code]
, [Region Code] = [Sub-region Code]
, [Sub-region Code] = dbo.CutLeft([Intermediate Region Code], ',', 0)
, [Intermediate Region Code] = Replace(dbo.CutRight([Intermediate Region Code], ',', 0), '"', '')
WHERE [Country Region] LIKE '%Korea%'

-- review data after correction
SELECT *
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE [Country Region] LIKE '%Korea%'
With this, the data from the first file are ready to use. The data from the other two files can be loaded following the same steps as above. The tables not only have a similar structure, but they also have the same issues. One can just replace the table name in the scripts to correct them.
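For example, the same duplicate check can be run against the confirmed cases table (and analogously against the deaths table):

-- checking for duplicates in the confirmed cases
SELECT [Province State]
, [Country Region]
, Date 
, count(*) NoRecords 
FROM [dbo].[time_series_covid19_confirmed_global_narrow]
GROUP BY [Province State]
, [Country Region]
, Date
HAVING count(*)>1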

Putting All Together  
As we need the data from all three tables for analysis, we could create a query that joins them together and encapsulate it within a view. The volume of data is negligible, and even without an index the query performs acceptably. However, as soon as the volume of data increases, it's useful to have only one table for consumption. Independently of the approach considered, the query is similar. As we made sure that the key is unique across all the data, we could write the query as follows:

-- combining the data together 
SELECT [Province State]
, [Country Region]
, TRY_CAST([Lat] as decimal(10,4)) [Lat]
, TRY_CAST([Long] as decimal(10,4)) [Long]
, [Date]
, Sum(Confirmed) Confirmed
, Sum(Death) Death
, Sum(Recovered) Recovered
, [ISO 3166-1 Alpha 3-Codes]
, [Region Code]
, [Sub-region Code]
, [Intermediate Region Code]
INTO [dbo].[time_series_covid19_global_narrow]
FROM (
 SELECT [Province State]
 , [Country Region]
 , [Lat]
 , [Long]
 , [Date]
 , TRY_CAST([Value] as int) Confirmed
 , 0 Death 
 , 0 Recovered
 , [ISO 3166-1 Alpha 3-Codes]
 , [Region Code]
 , [Sub-region Code]
 , [Intermediate Region Code]
 FROM [dbo].[time_series_covid19_confirmed_global_narrow]
 UNION ALL
 SELECT [Province State]
 , [Country Region]
 , [Lat]
 , [Long]
 , [Date]
 , 0 Confirmed
 , TRY_CAST([Value] as int) Death 
 , 0 Recovered
 , [ISO 3166-1 Alpha 3-Codes]
 , [Region Code]
 , [Sub-region Code]
 , [Intermediate Region Code]
 FROM [dbo].[time_series_covid19_deaths_global_narrow]
 UNION ALL
 SELECT [Province State]
 , [Country Region]
 , [Lat]
 , [Long]
 , [Date]
 , 0 Confirmed
 , 0 Death 
 , TRY_CAST([Value] as int) Recovered
 , [ISO 3166-1 Alpha 3-Codes]
 , [Region Code]
 , [Sub-region Code]
 , [Intermediate Region Code]
 FROM [dbo].[time_series_covid19_recovered_global_narrow]
  ) DAT
GROUP BY [Province State]
, [Country Region]
, TRY_CAST([Lat] as decimal(10,4)) 
, TRY_CAST([Long] as decimal(10,4)) 
, [Date]
, [ISO 3166-1 Alpha 3-Codes]
, [Region Code]
, [Sub-region Code]
, [Intermediate Region Code]

-- reviewing the data
SELECT *
FROM [dbo].[time_series_covid19_global_narrow]

-- checking for duplicates 
SELECT [Province State]
, [Country Region]
, Date 
, count(*) NoRecords 
FROM [dbo].[time_series_covid19_global_narrow]
GROUP BY [Province State]
, [Country Region]
, Date
HAVING count(*)>1
If everything went smoothly, the last query will return no records. As the latitude was given with a different precision across the files, it was necessary to format the values and cut them after 4 decimals. Before using the data a few further adjustments are needed. As the data are cumulative, adding up to the previous date, it's useful to calculate the increase between two consecutive days. This can be done via the LAG window function:

-- preparing the data for analysis in a view
CREATE VIEW dbo.v_time_series_covid19
AS
SELECT * 
, LAG(Confirmed,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) PrevConfirmed
, LAG(Death,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) PrevDeath
, LAG(Recovered,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) PrevRecovered
, Confirmed-LAG(Confirmed,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) IncreaseConfirmed
, Death-LAG(Death,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) IncreaseDeath
, Recovered - LAG(Recovered,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) IncreaseRecovered
FROM [dbo].[time_series_covid19_global_narrow]

With this the data are ready for consumption:

-- sample query
SELECT *
FROM dbo.v_time_series_covid19
WHERE [Country Region]  LIKE '%China%'
  AND [Province State] LIKE '%Hubei%'
ORDER BY DATE

Of course, one can increase the value of this dataset by pulling in further information, like the size of the population, the population density, or any other factors that could have an impact on the propagation of the disease.
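As a sketch of such an enrichment, assuming a hypothetical dbo.CountryPopulation table (CountryCode, Population) maintained separately, the confirmed cases per 100,000 inhabitants could be derived as follows:

-- enriching the data with population figures (dbo.CountryPopulation is a hypothetical table)
SELECT CVD.[Country Region]
, CVD.[Date]
, Sum(CVD.Confirmed) Confirmed
, Max(CPL.Population) Population
, Cast(Sum(CVD.Confirmed) * 100000.0 / NullIf(Max(CPL.Population), 0) as decimal(10,2)) ConfirmedPer100k
FROM dbo.v_time_series_covid19 CVD
     JOIN dbo.CountryPopulation CPL
       ON CVD.[ISO 3166-1 Alpha 3-Codes] = CPL.CountryCode
GROUP BY CVD.[Country Region]
, CVD.[Date]
ORDER BY CVD.[Country Region]
, CVD.[Date]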

Instead of loading the data via the wizard, one can create an SSIS project, however some of the corrections still need to be done manually, unless one includes the corrections in the logic as well. 

Happy coding!

11 June 2020

🧭🪄☯Business Intelligence: SQL Server Reporting Services (The Good, the Bad and the Ugly)

Business Intelligence

SQL Server Reporting Services (SSRS) is the oldest solution from the modern Microsoft BI stack. Released as an add-on to SQL Server 2000, it allows covering most of an organization's reporting requirements, whether we talk about tables, matrices or crosstab displays, raw data, aggregations, KPIs or visualizations like charts, gauges, sparklines, tree maps or sunbursts.

The Good: Once you have a SQL query based on any standard data sources (SQL Server, Oracle, SharePoint, OData, XML, etc.), it can be used in just a few minutes to create a report with the help of a wizard. Sure, adding the needed formatting, parameters, custom code, drilldown and drill-through functionality might take some effort, though in less than an hour you have a running report. The use of templates and a custom branding allows providing a common experience across the enterprise. 

The whole service is available once you have a SQL Server license, which makes SSRS a cost-effective tool. The shallow learning curve and the integration with SharePoint facilitate the development and consumption of reports.

With its pixel-accurate display of data, SSRS is ideal for printing business documents. This was probably one of the reasons why, starting with Microsoft Dynamics AX 2009, SSRS also became the main reporting platform for the subsequent versions. One can use an AX 2009 class as source for the report, or use the base tables directly, which can increase the reports' performance at the cost of reengineering the logic from AX 2009. With a few exceptions in the finance area, the reporting logic is easy to build.  

With SQL Server 2016 it got an HTML5 rendering engine, while since SSRS 2017 it supports responsive web design. The integration of the SSRS and Power BI environments has the chance to further extend the value provided by this powerful combination, however it also depends on the direction in which Microsoft will develop this idea.   

The Bad: One of the important downsides of SSRS is that it doesn’t allow custom authentication. Even if some examples exist on the Web, it’s hard to understand Microsoft’s stubbornness in not providing this by design. 

Because SSRS still uses an older MS Office driver, it allows exporting only 65,536 records to Excel, which makes data consumption more complicated. In addition, the pixel-perfect rendering isn’t that perfect; the introduction of empty columns when exporting to Excel adds some unnecessary burden.

Overall, the progress made by SSRS between the various releases is small when compared with the changes SQL Server went through. Even if the visualization capabilities cover most of the requests, it loses ground when compared with Power BI and similar visualization tools. 

The Ugly: SSRS, as the typical BI developer knows it, is different from the reporting frameworks provided when working with Business Central, respectively Dynamics 365 and CRM. Even if there are perhaps justified reasons, Microsoft failed to unite the three architectures into one flexible solution. Almost all the examples available on the Web target CRM, and frankly it’s hard to understand that. It feels like Microsoft wants to sabotage its own product?! What’s also hard to understand is that besides SSRS and Power BI Microsoft has several other reporting tools for Dynamics 365. Building reports for Business Central or Dynamics 365 requires certain skills, while the development time has increased considerably, thus SSRS has lost some of the appeal it previously had, allowing other tools to join the landscape (e.g. electronic documents).

SSRS can’t be smoothly integrated with Office 365 Online, remaining mainly a solution for on-premises architectures. This can become a bottleneck when customers move to the cloud, as the BI strategy eventually needs to be rethought as well. 

24 May 2020

🧊🎡☯Data Warehousing: SQL Server Integration Services (The Good, the Bad and the Ugly)

Data Warehousing

Microsoft SQL Server Integration Services (SSIS) is a platform for building (enterprise-level) data integration and data transformation solutions by using a rich set of built-in tasks and transformations, graphical tools for building packages, respectively a catalog for storing the packages. Formerly called Data Transformation Services (DTS), it was introduced with SQL Server 2000 and rebranded as SSIS with SQL Server 2005.

The Good: Since its introduction it was adopted by DBAs and (database) programmers because it allowed the import and export of data on the fly from and to SQL Server, flat files, other relational data sources, in fact any resource exposing a driver for the ODBC or OLEDB libraries. The extract/load functionality was extended by a basic set of transformations, making DTS the ideal ETL tool for data warehousing and integrations. The data from multiple sources and targets could be processed in parallel or sequentially, the ETL logic being encapsulated in one or more packages that could be run manually or scheduled flexibly via the SQL Server Agent.

With SQL Server 2005 and later versions the SSIS framework was extended to support further data sources including XML, CAML-based SharePoint lists, OData, Hadoop or Azure Blob storage. It allows storing the packages on the file system or within the built-in catalog.

One could thus develop rich ETL functionality without writing a single line of code. In theory the packages could be run and modified also by non-IT users, which can be a plus in certain scenarios. On the other side, one could build custom packages programmatically from the beginning, and thus extend the available data processing logic as seen fit, being able to use existing code and whole libraries embedded into the packages or invoked via DLL calls.

The Bad: Despite the rich functionality, a data pipeline usually has lower performance and is more difficult to troubleshoot compared with the built-in RDBMS functionality for data processing. Most, if not all transformations can be handled more efficiently via SQL-based queries as long as the data are available on the same SQL Server instance. In addition, SQL provides better code reuse, maintainability, chances for refactoring and scalability, and the solutions are easier to deploy. Therefore, a common practice is to use SSIS only for import/export, the further logic being encapsulated into stored procedures and other database objects. This isn’t necessarily bad, on the contrary, though specific expertise is then needed to modify the code.

The Ugly: SSIS is in general suitable for data warehousing and integration solutions whose logic is ideally stable and well-defined. Therefore, SSIS is less suitable for ERP data migrations or similar tasks which, at least at the beginning, have an exploratory nature and an overwhelming complexity, with multiple iterations needed before the requirements are fully identified and understood. In extremis each iteration can involve a redesign, which can prove to be time-consuming. One could in theory attempt to understand all the data first, though this could mean starting the development late in the process, while the data for testing are required much earlier. One can still use SSIS for specific tasks, though implementing a whole solution could bring certain challenges that otherwise could have been avoided.

SSIS is not suitable for complex real-time data integrations which require the processing of a considerable amount of data, where specific architectures like SOA, RESTful calls or other solutions could be more efficient. When not adequately implemented, a data integration can lead to more problems than it solves. The best example is the increase in execution time with the volume of data, which can easily lead to time-outs and the locking of data.

🧮💫ERP Implementations: Migrating AdventureWorks to Dynamics 365 - Products

ERP Implementations
ERP Implementations Series

Below is exemplified the migration of Products from the AdventureWorks database to Dynamics 365 (D365), with a minimum of steps considered. Variations (e.g. enrichment of data, successive migrations) and other entities (Product Variants, Released Products, Released Product Variants, etc.) will be considered in future posts.

As the AdventureWorks database is available only for testing and exemplification purposes, there is no need for a data import layer, the data being prepared in the “Map” schema created for this purpose. In theory the same approach can be used in production systems as well, though usually it’s better to detach the migration layer from the source system(s) for performance or security reasons.

-- creating a schema into the AdventureWorks database 
CREATE SCHEMA [Map]

Step 1: Data Discovery

Within this step one attempts to get a thorough understanding of the systems involved in the data migration, in this case AdventureWorks and D365. As a basis for this, the tables for each entity are analyzed, respectively the relations existing between them, the values, their distribution, as well as the relations existing between attributes. One needs to analyze the similarities as well as the differences between the involved data models, both at structural and at value level.

In AdventureWorks the SKU (Stock Keeping Unit) has Color, Size and Style as dimensions, a Product being created for each SKU. In D365 one differentiates between Products and the Dimensions associated with them, having thus two levels defined. In addition, in D365 a Product also has the Configuration as dimension:


In addition, the Products and their Dimensions are defined at master level with a minimal set of attributes like the Dimension Group. After that, the Products and their Dimensions can be released for each Business Unit (aka Data Area), where the detailed attributes for Purchasing, Sales, Production or Inventory are maintained. For those acquainted with Dynamics AX 2009, it’s the same structure.
Once the structural differences are identified, one can start looking at the values that define a product in both systems. The following queries are based on the source system.


-- reviewing the sizes 
SELECT Size 
, count(*) NoSizes
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY Size
ORDER BY Size

-- reviewing the colors 
SELECT Color
, count(*) NoColors
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY Color
ORDER BY Color

-- reviewing the styles  
SELECT Style
, count(*) NoStyles
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY Style
ORDER BY Style
While the above queries show which values are used, the following ones show the dependencies between them:
-- reviewing the sizes, colors, styles  
SELECT Size
, Color
, Style
, count(*) NoValues
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY Size
, Color
, Style
ORDER BY 1,2,3

-- reviewing the dependencies between sizes, colors, styles  
SELECT CASE WHEN IsNull(Size, '') != '' THEN 'x' ELSE '' END HasSize
, CASE WHEN IsNull(Color, '') != '' THEN 'x' ELSE '' END HasColor
, CASE WHEN IsNull(Style, '') != '' THEN 'x' ELSE '' END HasStyle
, count(*) NoValues
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY CASE WHEN IsNull(Size, '') != '' THEN 'x' ELSE '' END 
, CASE WHEN IsNull(Color, '') != '' THEN 'x' ELSE '' END 
, CASE WHEN IsNull(Style, '') != '' THEN 'x' ELSE '' END
ORDER BY 1,2,3

The last query is probably the most important, as it shows how the products need to be configured into the target system:



As can be seen, a product can have only Color, Color and Style, Size and Color, respectively no dimensions or all dimensions. A Dimension Group will need to be defined for each of these cases (e.g. Col, ColSty, SizCol, SizColSty, None). (More information on this in a future post.)

Unfortunately, unless the target system is already in use, there are usually no values in it, though one can attempt entering a few representative values manually over the user interface, at least to see which tables get populated.

Step 2: Data Mapping

Once the main attributes from source and target have been identified, one can create the mapping between them at attribute level. Typically, one includes all the relevant information for a migration, from table, attribute and description to the attributes’ definition (e.g. type, length, precision, mandatory) in all the systems:


The mapping was kept to a minimum to display only the most relevant information. Except for a warning concerning the length of an attribute, respectively a new attribute (the old item number), the mapping doesn’t involve any challenges.

A data dictionary or even a metadata repository for the involved systems can help in the process; otherwise one needs to gather the information from the available documentation or the systems’ metadata and prepare the data manually.

The relevant metadata for D365 can be obtained from the Microsoft documentation. The data can be loaded into the system via the EcoResProductV2Entity (see also data entities or the AX 2012 documentation for tables and enumeration data types).

Step 3: Building the Source Entity

AdventureWorks already provides a view which models the Products entity, though because of its structure it needs some changes; sometimes it’s more advisable to do the changes in a separate view, as follows:
-- Products source entity 
CREATE VIEW Map.vProductDetails 
AS 
SELECT CASE WHEN Size<>'' THEN dbo.CutLeft(ProductNumber, '-',1) ELSE ProductNumber End ItemIdOld 
, CASE WHEN Size<>'' THEN dbo.CutLeft(Name, '-',1) ELSE Name End Name 
, row_number() OVER(PARTITION BY CASE WHEN Size<>'' THEN dbo.CutLeft(ProductNumber, '-',1) ELSE ProductNumber End ORDER BY ProductNumber) Ranking
, ProductId
, ProductNumber 
, Description 
, Color
, Size
, Style
, CultureId 
, Subcategory 
, Category
, MakeFlag
, FinishedGoodsFlag
, SellStartDate 
, SellEndDate 
, StandardCost 
, ListPrice 
, SafetyStockLevel 
, ReorderPoint 
FROM [Production].[vProductDetails]

-- reviewing the data 
SELECT *
FROM Map.vProductDetails 
WHERE CultureId = 'en'
ORDER BY ProductNumber


To prepare the data for the migration, the Size was stripped from the Product Number as well as from the Name, this being done with the help of the dbo.CutLeft function. The row_number ranking window function was used to allow selecting later the first record (Size) for a given Product.

The discovery process continues, this time with respect to the target. It’s useful to understand, for example, whether a Product has more than one Color or Style, whether the prices vary between Sizes, whether attributes like the Subcategory are consistent between Sizes, etc. It’s useful to check anything that could have an impact on the migration logic. The list of tests will be extended while building the logic, as new information is discovered.

-- checking dimensions' definition
SELECT ItemidOld 
, count(Size) NoSizes
, count(DISTINCT Color) NoColors
, count(DISTINCT Style) NoStyles
FROM Map.vProductDetails 
WHERE CultureId = 'en'
  --AND Ranking = 1
GROUP BY ItemidOld
ORDER BY ItemidOld

-- checking the price variances between dimensions 
SELECT ItemidOld 
, Min(StandardCost) MinStandardCost
, Max(StandardCost) MaxStandardCost
FROM Map.vProductDetails 
WHERE CultureId = 'en'
  AND Ranking = 1
GROUP BY ItemidOld
HAVING Min(IsNull(StandardCost, 0)) != Max(IsNull(StandardCost, 0)) 
ORDER BY ItemidOld

-- checking attribute's consistency between dimensions 
SELECT ItemidOld 
, Min(Subcategory) MinSubcategory
, Max(Subcategory) MaxSubcategory
FROM Map.vProductDetails 
WHERE CultureId = 'en'
  AND Ranking = 1
GROUP BY ItemidOld
HAVING Min(IsNull(Subcategory, '')) != Max(IsNull(Subcategory, '')) 
ORDER BY ItemidOld

When the view starts performing poorly, for example because of the number of joins or the volume of data, it might be useful to dump the data into a table and perform the tests on it.
Even if it’s maybe not the case here, it’s useful to apply defensive techniques in the logic by handling the nulls adequately.
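A minimal sketch of such a dump, using a hypothetical Map.ProductDetails staging table (not part of the original logic):

-- dumping the source entity into a staging table for faster testing (Map.ProductDetails is a hypothetical name)
SELECT *
INTO Map.ProductDetails
FROM Map.vProductDetails
WHERE CultureId = 'en'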

Step 4: Implementing the Mapping 

The attributes which need to be considered here are based on the target entity. It might also be necessary to include attributes that are further needed to build the logic.

-- Product Mapping 
CREATE VIEW [Map].vEcoResProductV2Entity
AS
SELECT ProductId 
, ItemidOld ItemId 
, 'Item' ProductType 
, 'ProductMaster' ProductSubtype 
, Left(Replace(Name, ' ', ''), 20) ProductSearchName 
, ItemidOld ProductNumber 
, Name ProductName 
, Description ProductDescription 
, CASE 
 WHEN IsNull(Size, '') !='' AND IsNull(Color, '') !='' AND IsNull(Style, '')!='' THEN 'SizColSty'
 WHEN IsNull(Size, '') !='' AND IsNull(Color, '') !=''  THEN 'SizCol'
        WHEN IsNull(Style, '') !='' AND IsNull(Color, '') !=''  THEN 'ColSty'
 WHEN IsNull(Color, '') !='' THEN 'Col'
 WHEN IsNull(Style, '')!='' THEN 'Sty'
        WHEN IsNull(Size, '')!='' THEN 'Siz'
 ELSE 'None'
  END ProductDimensionGroupName 
, 'WHS' StorageDimensionGroupName 
, CASE WHEN MakeFlag = 1 THEN 'SN' ELSE '' END TrackingDimensionGroupName 
, 'PredefinedVariants' VariantConfigurationTechnology 
, Subcategory ProductCategory 
, ItemidOld ItemIdOld 
, 1 IsNewItem 
FROM Map.vProductDetails 
WHERE CultureId = 'en'
  AND Ranking = 1

The Dimension Group is based on the above observation. It was assumed that all Products are stocked (see the Storage Dimension), while the manufactured products will get a Serial Number (see the Tracking Dimension). IsNewItem will be used further to migrate deltas (and thus to partition migrations).
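As a sketch of how the deltas could be handled (an assumption, not detailed in this post): after a successful export, the already migrated records in the Map.EcoResProductV2Entity table (defined in the next step) can be flagged, so that the next run considers only the new ones:

-- flagging the migrated records so that only deltas are picked up in the next run (sketch)
UPDATE [Map].EcoResProductV2Entity
SET IsNewItem = 0
WHERE IsNewItem = 1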

Step 5: Building the Target Entity

The target entity is in the end only a table in which usually only the attributes in scope are kept. In this case the definition is given by the following DDL:

CREATE TABLE Map.EcoResProductV2Entity(
 Id int IDENTITY(1,1) NOT NULL,
 ProductId int NULL,
 ProductType nvarchar(20) NULL,
 ProductSubtype nvarchar(20) NULL,
 ProductsearchName nvarchar(255) NULL,
 ProductNumber nvarchar(20) NOT NULL,
 ProductName nvarchar(60) NULL,
 ProductDescription nvarchar(1000) NULL,
 ProductDimensionGroupName nvarchar(50) NULL,
 StorageDimensionGroupName nvarchar(50) NULL,
 TrackingDimensionGroupName nvarchar(50) NULL,
 VariantConfigurationTechnology nvarchar(50) NULL,
 ProductCategory nvarchar(255) NULL,
 ItemIdOld nvarchar(20) NULL,
 IsNewItem bit,
 CONSTRAINT I_EcoResProductV2Entity PRIMARY KEY CLUSTERED 
(
 Id ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] 

The table was built to match the name and definition from the target system. The definition is followed by an insert based on the logic defined in the previous step:

-- preparing the data for EcoResProductV2Entity 
INSERT INTO [Map].EcoResProductV2Entity 
SELECT ITM.ProductId
, ITM.ProductType
, ITM.ProductSubtype
, ITM.ProductsearchName
, ITM.ProductNumber
, ITM.ProductName
, ITM.ProductDescription
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.VariantConfigurationTechnology
, ITM.ProductCategory
, ITM.ItemIdOld 
, ITM.IsNewItem
FROM [Map].VEcoResProductV2Entity ITM
WHERE ITM.IsnewItem = 1
ORDER BY ProductType 
, ITM.ItemIdOld 

-- reviewing the data 
SELECT *
FROM [Map].EcoResProductV2Entity 
ORDER BY ItemIdOld

The business might decide to take over the Product Number into the target system as unique identifier, though that’s not always the case. It might opt to create a new sequence number, which could start e.g. with 10000000 (8 characters). In such a case only the logic for the Product Number changes, the value being generated using a ranking window function:

-- preparing the data for EcoResProductV2Entity  
DECLARE @StartItemId as int = 10000000
--INSERT INTO [Map].EcoResProductV2Entity 
SELECT ...
, @StartItemId + Rank() OVER(ORDER BY ITM.ProductType, ITM.ItemIdOld) ProductNumber
, ...
FROM [Map].VEcoResProductV2Entity ITM
WHERE ITM.IsnewItem = 1
ORDER BY ProductType 
, ITM.ItemIdOld 

Step 6: Reviewing the Data

Before exporting the data it makes sense to review them from various perspectives: how many Products of a certain type will be created, whether the current and old product numbers are unique, etc. The scripts make sure that the consistency of the data with respect to the target system is achieved. 

-- checking values' frequency (overview, no implications)
SELECT ITM.ProductType
, ITM.ProductSubtype
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.VariantConfigurationTechnology
, count(*) NoRecords
FROM [Map].EcoResProductV2Entity ITM
WHERE IsNewItem = 1
GROUP BY ITM.ProductType
, ITM.ProductSubtype
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.VariantConfigurationTechnology

-- check ProductNumber's uniqueness (no duplicates allowed)
SELECT ProductNumber
, Min(ItemidOld) 
, max(ItemIdold)
, count(*)
FROM [Map].EcoResProductV2Entity
GROUP BY ProductNumber
HAVING count(*)>1

-- check old Product's uniqueness (no duplicates allowed)
SELECT ItemIdOld
, count(*)
FROM [Map].EcoResProductV2Entity
GROUP BY ItemIdOld
HAVING count(*)>1

This section will grow during the implementation, as further entities will be added.

Step 7: Exporting the Data

The export query usually reflects the entity and can include further formatting of the data, when needed:

-- Export Products
SELECT ITM.ProductType
, ITM.ProductSubtype
, ITM.ProductsearchName
, ITM.ProductNumber
, ITM.ProductName
, ITM.ProductDescription
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.VariantConfigurationTechnology
, ITM.ProductCategory RetailProductCategoryName 
, ITM.ItemIdOld
FROM [Map].EcoResProductV2Entity ITM
WHERE ITM.isNewItem = 1
ORDER BY ProductNumber

Depending on the import needs, the data can be exported to Excel or to a delimited text file (e.g. the “|” pipe is an ideal delimiter).

Step 8: Validating the Data before Import

Before importing the data into the target system, it makes sense to have the data checked by the business or consultants. A visual check at this stage can help save time later.

Step 9: Validating the Data after Import

Unfortunately, Microsoft doesn’t allow direct access to the D365 production database, however one can still access the content of various tables and entities via the table browser. Anyway, the validation usually takes place in the UAT (User Acceptance Testing) system. So, if everything went well in the UAT and all measures were taken to have the same parameters across all systems, there should be no surprises during Go-Live.


