
21 January 2025

🧊🗒️Data Warehousing: Extract, Transform, Load (ETL) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 21-Jan-2025

[Data Warehousing] Extract, Transform, Load (ETL)  

  • {def} automated process which takes raw data, extracts the data required for further processing, transforms it into a format that addresses business' needs, and loads it into the destination repository (e.g. data warehouse)
    • includes 
      • the transportation of data
      • overlaps between stages
      • changes in flow 
        • due to 
          • new technologies
          • changing requirements
      • changes in scope
      • troubleshooting
        • due to data mismatches
  • {step} extraction
    • data is extracted directly from the source systems or intermediate repositories
      • data may be made available in intermediate repositories when direct access to the source system is not possible
        • this approach can add a layer of complexity
    •  {substep} data validation
      • an automated process that validates whether the data pulled from the sources have the expected values (see the T-SQL sketch after these notes)
      • relies on a validation engine
        • rejects data if they fall outside the validation rules
        • analyzes the rejected records on an ongoing basis to
          • identify what went wrong
          • correct the source data
          • modify the extraction to resolve the problem in the next batches
  • {step} transform
    • transforms the data, removing extraneous or erroneous data
    • applies business rules 
    • checks data integrity
      • ensures that the data is not corrupted in the source or corrupted by ETL
      • may ensure no data was dropped in previous stages
    • aggregates the data if necessary
  • {step} load
    • {substep} store the data into a staging layer
      • transformed data are not loaded directly into the target but staged into an intermediate layer (e.g. database)
      • {advantage} makes it easier to roll back if something goes wrong
      • {advantage} allows one to develop the logic iteratively and publish the data only when needed 
      • {advantage} can be used to generate audit reports for 
        • regulatory compliance 
        • diagnosing and repairing data problems
      • modern ETL processes perform transformations in place, instead of in staging areas
    • {substep} publish the data to the target
      • loads the data into the target table(s)
      • {scenario} the existing data are overwritten every time the ETL pipeline loads a new batch
        • this might happen daily, weekly, or monthly
      • {scenario} add new data without overwriting the existing data
        • a timestamp can indicate that the data are new
      • {recommendation} prevent the loading process from erroring out due to disk space and performance limitations
  • {approach} building an ETL infrastructure
    • involves integrating data from one or more data sources and testing the overall processes to ensure the data is processed correctly
      • recurring process
        • e.g. data used for reporting
      • one-time process
        • e.g. data migration
    • may involve 
      • multiple source or destination systems 
      • different types of data
        • e.g. reference, master and transactional data
        • ⇐ may have complex dependencies
      • different levels of structure
        • e.g. structured, semi-structured, unstructured 
      • different data formats
      • data of different quality
      • different ownership 
  • {recommendation} consider ETL best practices
    • {best practice} define requirements upfront in a consolidated and consistent manner
      • allows one to set clear expectations, consolidate the requirements, estimate the effort and costs, respectively obtain the sign-off
      • the requirements may involve all the aspects of the process
        • e.g. data extraction, data transformation, standard formatting, etc.
    • {best practice} define a high level strategy
      • allows to define the road ahead, risks and other aspects 
      • allows to provide transparency
      • this may be part of a broader strategy that can be referenced 
    • {best practice} align the requirements and the various aspects with the strategies existing in the organization
      • allows one to consolidate the various efforts and make sure that the objectives, goals and requirements are aligned
      • e.g. IT, business, Information Security, Data Management strategies
    • {best practice} define the scope upfront
      • allows one to better estimate the effort and validate the outcomes
      • even if the scope may change over time, this provides transparency and can be used as a basis for the time and cost estimations
    • {best practice} manage the effort as a project and use a suitable Project Management methodology
      • allows to apply structured well-established PM practices 
      • it might be suitable to adapt the methodology to the project's particularities 
    • {best practice} convert data to standard formats to standardize data processing
      • reduces the volume of issues resulting from data type mismatches
      • applies mainly to dates, numeric values or other values for which standard formats can be defined
    • {best practice} clean the data in the source systems, when  cost-effective
      • reduces the overall effort, especially when done in advance
      • this should be based ideally on the scope
    • {best practice} define and enforce data ownership
      • allows to enforce clear responsibilities across the various processes
      • allows to reduce the overall effort
    • {best practice} document data dependencies
      • document the dependencies existing in the data at the various levels
    • {best practice} log the data movement from source(s) to destination(s) in terms of data volume
      • provides transparency into the data movement process
      • allows to identify gaps in the data or broader issues
      • can be used for troubleshooting and understanding the overall data growth
  • {recommendation} consider proven systems, architectures and methodologies
    • allows to minimize the overall effort and costs associated with the process
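
As referenced under the data validation substep, below is a minimal T-SQL sketch of such a rule-based validation during extraction. The stg.SalesOrders and stg.RejectedSalesOrders staging tables, the Amount attribute and the rule itself are hypothetical placeholders, not tied to a specific system:

-- rejecting the records that fall outside a validation rule (hypothetical staging tables)
INSERT INTO stg.RejectedSalesOrders (SalesOrderId, RejectReason)
SELECT SalesOrderId
, 'Amount outside the expected range'
FROM stg.SalesOrders
WHERE Amount < 0 OR Amount > 1000000

-- passing only the valid records to the next stage
SELECT *
FROM stg.SalesOrders SOR
WHERE NOT EXISTS (
   SELECT 1
   FROM stg.RejectedSalesOrders REJ
   WHERE REJ.SalesOrderId = SOR.SalesOrderId
  )

The rejected records can then be analyzed on an ongoing basis, as described above.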

14 January 2025

🧭Business Intelligence: Perspectives (Part XXII: Queries' Complexity)

Business Intelligence Series

Whether standalone or encapsulated in database objects, queries can become complex over time, respectively difficult to comprehend and maintain. One can reduce the cognitive load by identifying the aspects that support one's intuition - order, similarity and origin being probably the most important aspects that help in coping with the inherent complexity. 

One should start with the table having the lowest level of detail, usually a transaction table that forms the backbone of a certain feature. For example, for Purchase Orders this could be, depending on the case, the distribution or the line level. If Invoices are added to the logic, and there can be multiple invoice lines for a record from the former logic, then this becomes the new level of detail. Further on, if General Ledger lines are added, these most likely become the lowest level of detail, and so on.

This approach allows one to keep a consistent way of building the queries while making it possible to validate the record count, making sure that no duplicates are added to the logic. Conversely, one can also start from the table with the least detail and add tables successively based on their level of detail, though the database engine may in some cases generate a suboptimal plan. In addition, checking the cardinality may involve writing a second query. 
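
For example, the record count can be validated as each table is added - a minimal sketch, assuming hypothetical POLines and InvoiceLines tables joined on a POLineId key:

-- checking that adding the Invoice Lines doesn't duplicate the backbone records
SELECT count(*) NoRecords
, count(DISTINCT POL.POLineId) NoPOLines -- duplicates exist when NoRecords > NoPOLines
FROM dbo.POLines POL
     LEFT JOIN dbo.InvoiceLines INL
       ON POL.POLineId = INL.POLineId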

One should try to keep the tables belonging to the same entity together, when possible (e.g. Purchase Order vs. Vendor information). This approach reduces the volume of work required to manage, review, test and understand the query later. If the blocks are too big, then occasionally it makes sense to bring pieces of logic into CTEs (Common Table Expressions), or, much better, into views that allow one to better encapsulate and partition the logic.

CTEs allow one to split the logic into logical steps, occasionally making it possible to troubleshoot the logic piece by piece, though one should keep a balance between maintainability and detail. In extremis, people may unnecessarily create an additional CTE for each join. The longer and the more fragmented a query, the more difficult it becomes to troubleshoot and even to understand. Readability is better achieved through indentation, by keeping together the things that belong together, respectively by partitioning the logic into logical blocks that derive from the model. 

Keeping things together should also be applied to the other elements. For example, the join constraints should be limited to the fields participating in the join (and, if possible, all other constraints should be moved to the WHERE clause). Bringing the join constraints into the WHERE clause, an approach used in the past, decreases a query's readability no matter how well the WHERE clause is structured, and can occasionally lead to constraints being duplicated or, worse, to missing pieces of logic. 
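
A minimal sketch of this convention, based on hypothetical PurchaseOrders and Vendors tables:

-- join constraints limited to the join fields, all other constraints in the WHERE clause
SELECT POH.PurchaseOrderNumber
, VND.VendorName
FROM dbo.PurchaseOrders POH
     JOIN dbo.Vendors VND
       ON POH.VendorId = VND.VendorId -- only the fields participating in the join
WHERE POH.OrderDate >= '20240101' -- the other constraints belong here
  AND VND.IsActive = 1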

The order of the attributes should follow a logic that makes sense. One approach is to start from the tables with the lowest cardinality, which identify the entities uniquely, and move toward the attributes that have a higher level of detail. However, not all attributes are equally important, and thus one might have to compromise and create multiple groups of data coming from different levels. 

One should keep in mind that the more random the order of the attributes, the more difficult it becomes to validate the data, as one needs to jump back and forth either in the query or in the input mask. Ideally, one should find a balance between the two perspectives. An intuitive logic for how the attributes are listed increases a query's readability, maintainability and ease of troubleshooting. The more random the attributes' listing, the higher the effort for the same. 


04 February 2021

📦Data Migrations (DM): Conceptualization (Part VII: Data Import Layer)

Data Migrations Series

The data requirements for the Data Migration (DM) and Data Quality (DQ) are driven by the processes implemented in the target system(s). Therefore, a good knowledge of these requirements can decrease the effort needed for these two subprojects considerably. The needed knowledge base starts with the entities and their attributes, the dependencies existing between them and the various rules that apply, and ends with the parametrization requirements, respectively the architecture(s) that can be used to import the data.

The DM process starts with defining the entities in scope and their attributes, respectively with identifying the corresponding entities and attributes from the legacy systems. The attributes that have no correspondent in the legacy system need to be provided by the business and integrated into the DM logic. In addition, one also needs to consider the attributes needed by the business but not available in the target system, some of them most likely available in the legacy systems. For such attributes one needs either to repurpose an attribute from the target or to extend the target system.

For each entity a data mapping is created that documents the data transformations needed for migrating the data. In the process one also needs to consider the attributes' data types, the (standard) formatting, their domain of definition, as well as the various rules that apply. Their implementation belongs in the DM layer, from which the data are exported in the standard format needed by the target system.

Exporting the data from the DM layer directly into the target system's tables has in theory the lowest overhead, even if the rejected records are difficult to track, the rejections resulting only from the records' validation against the database's schema. For this approach to work, one must have a good knowledge of the database schema and of the business rules implemented in the target system.

To solve the issue with error logging, systems have a further layer on top of the database model, which also allows running data validations against the target system's business rules. Modern import frameworks allow loading the data via a set of standard files with a predefined structure. The data can thus be imported manually or via load jobs into the system, a log with the issues being generated in the process. Some frameworks even allow the manual editing of failed records, respectively their re-import. Unfortunately, calling this layer from the DM layer is not possible from a database, though this would seldom bring a benefit. Some third-party tools attempt to improve the import functionality by calling the target system's import layer.

The import files must be generated from the DM layer in the required structure and with the appropriate formatting. The challenge, however, resides in identifying all the attributes that should make up the scope of the load. It's an iterative process which is sometimes backed by trial-and-error heuristics. Unless the target system's validation rules are known beforehand, the rules need to be discovered in this process, which can prove time-consuming. The discoveries also need to be integrated into the DM, and from here results the large number of changes that need to be performed.

Given the dependencies existing between entities, the files need to be generated and loaded in a predefined order. These dependencies are also reflected in the data processing and in the validation rules considered in the DM layer.

A quality checkpoint can be implemented between the export from the DM layer and the import, to enforce the four-eyes principle. It's normally the last opportunity for trapping any remaining issues. A further quality check is performed after the import by validating whether the data were imported as expected.
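
For example, if the imported data are brought back into the migration database, a simple count- and key-based check can be run - a minimal sketch, assuming hypothetical Map.Products (prepared data) and dbo.ImportedProducts (data imported back) tables:

-- comparing the record counts between the prepared and the imported data
SELECT (SELECT count(*) FROM Map.Products) NoRecordsPrepared
, (SELECT count(*) FROM dbo.ImportedProducts) NoRecordsImported

-- identifying the records that were not imported
SELECT PRD.ProductNumber
FROM Map.Products PRD
     LEFT JOIN dbo.ImportedProducts IMP
       ON PRD.ProductNumber = IMP.ProductNumber
WHERE IMP.ProductNumber IS NULL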


24 May 2020

🧮💫ERP Implementations: Migrating AdventureWorks to Dynamics 365 - Products

ERP Implementations Series

Below is exemplified the migration of Products from the AdventureWorks database to Dynamics 365 (D365), where a minimum of steps was considered. Variations (e.g. data enrichment, successive migrations) and other entities (Product Variants, Released Products, Released Product Variants, etc.) will be considered in future posts.

As the AdventureWorks database is available only for testing and exemplification purposes, there is no need for a data import layer, the data being prepared in the “Map” schema created for this purpose. In theory the same approach can be used in production systems as well, though usually it’s better to detach the migration layer from the source system(s) for performance or security reasons.

-- creating a schema into the AdventureWorks database 
CREATE SCHEMA [Map]

Step 1: Data Discovery

Within this step one attempts to get a thorough understanding of the systems involved in the data migration, in this case AdventureWorks and D365. As a basis for this, the tables for each entity are analyzed, respectively the relations existing between them, the values, their distribution, as well as the relations existing between attributes. One needs to analyze the similarities as well as the differences between the involved data models, at the structural as well as at the value level.

In AdventureWorks the SKU (Stock Keeping Unit) has Color, Size and Style as dimensions, a Product being created for each SKU. In D365 one differentiates between a Product and the Dimensions associated with it, having thus two levels defined. In addition, in D365 a Product also has the Configuration as a dimension:


In addition, the Products and their Dimensions are defined at master level with a minimal set of attributes, like the Dimension Group. After that, the Products and their Dimensions can be released for each Business Unit (aka Data Area), where the detailed attributes for Purchasing, Sales, Production or Inventory are maintained. For those acquainted with Dynamics AX 2009, it’s the same structure.

Once the structural differences are identified, one can start looking at the values that define a product in both systems. The following queries are based on the source system.


-- reviewing the sizes 
SELECT Size 
, count(*) NoSizes
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY Size
ORDER BY Size

-- reviewing the colors 
SELECT Color
, count(*) NoColors
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY Color
ORDER BY Color

-- reviewing the styles  
SELECT Style
, count(*) NoStyles
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY Style
ORDER BY Style

While the above queries show what values are used, the following ones show the combinations of values, respectively the dependencies between them:

-- reviewing the sizes, colors, styles  
SELECT Size
, Color
, Style
, count(*) NoValues
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY Size
, Color
, Style
ORDER BY 1,2,3

-- reviewing the dependencies between sizes, colors, styles  
SELECT CASE WHEN IsNull(Size, '') != '' THEN 'x' ELSE '' END HasSize
, CASE WHEN IsNull(Color, '') != '' THEN 'x' ELSE '' END HasColor
, CASE WHEN IsNull(Style, '') != '' THEN 'x' ELSE '' END HasStyle
, count(*) NoValues
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY CASE WHEN IsNull(Size, '') != '' THEN 'x' ELSE '' END 
, CASE WHEN IsNull(Color, '') != '' THEN 'x' ELSE '' END 
, CASE WHEN IsNull(Style, '') != '' THEN 'x' ELSE '' END
ORDER BY 1,2,3

The last query is probably the most important, as it shows how the products need to be configured into the target system:



As can be seen, a product can have only Color, Color and Style, Size and Color, respectively no dimensions or all dimensions. A Dimension Group will need to be defined for each of these cases (e.g. Col, ColSty, SizCol, SizColSty, None). (More information on this in a future post.)

Unfortunately, unless the target system is already in use, there are usually no values available, though one can attempt entering a few representative values manually via the user interface, at least to see which tables get populated.

Step 2: Data Mapping

Once the main attributes from source and target were identified, one can create the mapping at attribute level between them. Typically, one includes all the relevant information for a migration, from table, attribute, description to attributes’ definition (e.g. type, length, precision, mandatory) in all the systems:


The mapping was kept to a minimum to display only the most relevant information. Except for a warning concerning the length of an attribute, respectively a new attribute (the old item number), the mapping doesn’t involve any challenges.

A data dictionary or even a metadata repository for the involved systems can help in the process, otherwise one needs to access the information from the available documentation or system’s metadata and prepare the data manually.

The relevant metadata for D365 can be obtained from the Microsoft documentation. The data can be loaded into system via the EcoResProductV2Entity (see also data entities or the AX 2012 documentation for tables and enumeration data types).

Step 3: Building the source entity

AdventureWorks already provides a view which models the Products entity, though because of its structure it needs some changes; it is often more advisable to make the changes in a separate view, as follows:

-- Products source entity 
CREATE VIEW Map.vProductDetails 
AS 
SELECT CASE WHEN Size<>'' THEN dbo.CutLeft(ProductNumber, '-',1) ELSE ProductNumber End ItemIdOld 
, CASE WHEN Size<>'' THEN dbo.CutLeft(Name, '-',1) ELSE Name End Name 
, row_number() OVER(PARTITION BY CASE WHEN Size<>'' THEN dbo.CutLeft(ProductNumber, '-',1) ELSE ProductNumber End ORDER BY ProductNumber) Ranking
, ProductNumber 
, Description 
, Color
, Size
, Style
, CultureId 
, Subcategory 
, Category
, MakeFlag
, FinishedGoodsFlag
, SellStartDate 
, SellEndDate 
, StandardCost 
, ListPrice 
, SafetyStockLevel 
, ReorderPoint 
FROM [Production].[vProductDetails]

-- reviewing the data 
SELECT *
FROM Map.vProductDetails 
WHERE CultureId = 'en'
ORDER BY ProductNumber


To prepare the data for the migration, the Product Number as well as the Name were stripped of the Size, this being done with the help of the dbo.CutLeft function. The row_number ranking window function was used to allow selecting later the first record (Size) for a given Product.
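
The dbo.CutLeft function is a custom helper and not part of AdventureWorks. A minimal sketch of a possible definition, assuming it cuts the last delimiter-separated segment(s) from the right, so that e.g. 'BK-R93R-62' becomes 'BK-R93R' (the actual implementation may differ):

-- helper function (sketch): cuts the last @times delimiter-separated segments from the right
CREATE FUNCTION dbo.CutLeft(
  @text nvarchar(250)
, @delimiter nvarchar(1)
, @times int)
RETURNS nvarchar(250)
AS
BEGIN
  DECLARE @index int = 0;
  DECLARE @result nvarchar(250) = @text;

  WHILE @index < @times AND CharIndex(@delimiter, @result) > 0
  BEGIN
    -- drop everything from the last occurrence of the delimiter onwards
    SET @result = Left(@result, Len(@result) - CharIndex(@delimiter, Reverse(@result)));
    SET @index += 1;
  END

  RETURN @result;
END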

The discovery process continues, this time with respect to the target. It’s useful to understand, for example, whether a Product has more than one Color or Style, whether the prices vary between Sizes, whether attributes like the Subcategory are consistent between Sizes, etc. It’s useful to verify anything that could have an impact on the migration logic. The list of tests will be extended while building the logic, as new information is discovered.

-- checking dimensions' definition
SELECT ItemidOld 
, count(Size) NoSizes
, count(DISTINCT Color) NoColors
, count(DISTINCT Style) NoStyles
FROM Map.vProductDetails 
WHERE CultureId = 'en'
  --AND Ranking = 1
GROUP BY ItemidOld
ORDER BY ItemidOld

-- checking the price variances between dimensions 
SELECT ItemidOld 
, Min(StandardCost) MinStandardCost
, Max(StandardCost) MaxStandardCost
FROM Map.vProductDetails 
WHERE CultureId = 'en'
  AND Ranking = 1
GROUP BY ItemidOld
HAVING Min(IsNull(StandardCost, 0)) != Max(IsNull(StandardCost, 0)) 
ORDER BY ItemidOld

-- checking attribute's consistency between dimensions 
SELECT ItemidOld 
, Min(Subcategory) MinSubcategory
, Max(Subcategory) MaxSubcategory
FROM Map.vProductDetails 
WHERE CultureId = 'en'
  AND Ranking = 1
GROUP BY ItemidOld
HAVING Min(IsNull(Subcategory, '')) != Max(IsNull(Subcategory, '')) 
ORDER BY ItemidOld

When the view starts performing poorly, for example because of the number of joins or the data volume, it might be useful to dump the data into a table and perform the tests on it.

Even if it may not be the case here, it’s useful to apply defensive techniques in the logic by handling the nulls adequately.

Step 4: Implementing the Mapping 

The attributes which need to be considered here are based on the target entities. It might also be needed to include attributes that are further required to build the logic.

-- Product Mapping 
CREATE VIEW [Map].vEcoResProductV2Entity
AS
SELECT ProductId 
, ItemidOld ItemId 
, 'Item' ProductType 
, 'ProductMaster' ProductSubtype 
, Left(Replace(Name, ' ', ''), 20) ProductSearchName 
, ItemidOld ProductNumber 
, Name ProductName 
, Description ProductDescription 
, CASE 
    WHEN IsNull(Size, '') != '' AND IsNull(Color, '') != '' AND IsNull(Style, '') != '' THEN 'SizColSty'
    WHEN IsNull(Size, '') != '' AND IsNull(Color, '') != '' THEN 'SizCol'
    WHEN IsNull(Style, '') != '' AND IsNull(Color, '') != '' THEN 'ColSty'
    WHEN IsNull(Color, '') != '' THEN 'Col'
    WHEN IsNull(Style, '') != '' THEN 'Sty'
    WHEN IsNull(Size, '') != '' THEN 'Siz'
    ELSE 'None'
  END ProductDimensionGroupName 
, 'WHS' StorageDimensionGroupName 
, CASE WHEN MakeFlag = 1 THEN 'SN' ELSE '' END TrackingDimensionGroupName 
, 'PredefinedVariants' VariantConfigurationTechnology 
, Subcategory ProductCategory 
, ItemidOld ItemIdOld 
, 1 IsNewItem 
FROM Map.vProductDetails 
WHERE CultureId = 'en'
  AND Ranking = 1

The Dimension Group is based on the above observation. It was assumed that all Products carry inventory (see the Storage Dimension), while the manufactured products will get a Serial Number (see the Tracking Dimension). IsNewItem will be used further to migrate deltas (and thus to partition migrations).

Step 5: Building the Target Entity

The target entity is in the end only a table in which usually only the attributes in scope are kept. In this case the definition is given by the following DDL:

CREATE TABLE Map.EcoResProductV2Entity(
 Id int IDENTITY(1,1) NOT NULL,
 ProductId int NULL,
 ProductType nvarchar(20) NULL,
 ProductSubtype nvarchar(20) NULL,
 ProductsearchName nvarchar(255) NULL,
 ProductNumber nvarchar(20) NOT NULL,
 ProductName nvarchar(60) NULL,
 ProductDescription nvarchar(1000) NULL,
 ProductDimensionGroupName nvarchar(50) NULL,
 StorageDimensionGroupName nvarchar(50) NULL,
 TrackingDimensionGroupName nvarchar(50) NULL,
 VariantConfigurationTechnology nvarchar(50) NULL,
 ProductCategory nvarchar(255) NULL,
 ItemIdOld nvarchar(20) NULL,
 IsNewItem bit,
 CONSTRAINT I_EcoResProductV2Entity PRIMARY KEY CLUSTERED 
(
 Id ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] 

The table was built to match the name and definition from the target system. The definition is followed by a few inserts based on the logic defined in the previous step:

-- preparing the data for EcoResProductV2Entity 
-- the explicit column list is needed as the Id identity column is not populated by the select
INSERT INTO [Map].EcoResProductV2Entity (ProductId, ProductType, ProductSubtype, ProductsearchName
, ProductNumber, ProductName, ProductDescription, ProductDimensionGroupName, StorageDimensionGroupName
, TrackingDimensionGroupName, VariantConfigurationTechnology, ProductCategory, ItemIdOld, IsNewItem)
SELECT ITM.ProductId
, ITM.ProductType
, ITM.ProductSubtype
, ITM.ProductsearchName
, ITM.ProductNumber
, ITM.ProductName
, ITM.ProductDescription
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.VariantConfigurationTechnology
, ITM.ProductCategory
, ITM.ItemIdOld 
, ITM.IsNewItem
FROM [Map].VEcoResProductV2Entity ITM
WHERE ITM.IsnewItem = 1
ORDER BY ProductType 
, ITM.ItemIdOld 

-- reviewing the data 
SELECT *
FROM [Map].EcoResProductV2Entity 
ORDER BY ItemIdOld

The business might decide to take over the Product Number into the target system as the unique identifier, though that’s not always the case. It might opt to create a new sequence number, which could start e.g. with 10000000 (8 characters). In such a case only the logic for the Product Number changes, the value being generated using a ranking window function:

-- preparing the data for EcoResProductV2Entity  
DECLARE @StartItemId as int = 10000000
--INSERT INTO [Map].EcoResProductV2Entity 
SELECT ...
, @StartItemId + Rank() OVER(ORDER BY ITM.ProductType, ITM.ItemIdOld) ProductNumber
, ...
FROM [Map].VEcoResProductV2Entity ITM
WHERE ITM.IsnewItem = 1
ORDER BY ProductType 
, ITM.ItemIdOld 

Step 6: Reviewing the Data

Before exporting the data, it makes sense to review them from various perspectives: how many Products of a certain type will be created, whether the current and old product numbers are unique, etc. The scripts help make sure that the data are consistent with respect to the target system. 

-- checking values' frequency (overview, no implications)
SELECT ITM.ProductType
, ITM.ProductSubtype
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.VariantConfigurationTechnology
, count(*) NoRecords
FROM [Map].EcoResProductV2Entity ITM
WHERE IsNewItem = 1
GROUP BY ITM.ProductType
, ITM.ProductSubtype
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.VariantConfigurationTechnology

-- check ProductNumber's uniqueness (no duplicates allowed)
SELECT ProductNumber
, Min(ItemidOld) 
, max(ItemIdold)
, count(*)
FROM [Map].EcoResProductV2Entity
GROUP BY ProductNumber
HAVING count(*)>1

-- check old Product's uniqueness (no duplicates allowed)
SELECT ItemIdOld
, count(*)
FROM [Map].EcoResProductV2Entity
GROUP BY ItemIdOld
HAVING count(*)>1

This section will grow during the implementation, as further entities will be added.

Step 7: Exporting the Data

The export query usually reflects the entity and can include further data formatting when needed:

-- Export Products
SELECT ITM.ProductType
, ITM.ProductSubtype
, ITM.ProductsearchName
, ITM.ProductNumber
, ITM.ProductName
, ITM.ProductDescription
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.VariantConfigurationTechnology
, ITM.ProductCategory RetailProductCategoryName 
, ITM.ItemIdOld
FROM [Map].EcoResProductV2Entity ITM
WHERE ITM.isNewItem = 1
ORDER BY ProductNumber

Depending on the import needs, the data can be exported to Excel or to a delimited text file (e.g. the “|” pipe is an ideal delimiter).

Step 8: Validating the Data before Import

Before importing the data into the target system, it makes sense to have the data checked by the business or consultants. A visual check at this stage can help save time later.

Step 9: Validating the Data after Import

Unfortunately, Microsoft doesn’t allow direct access to the D365 production database; however, one can still access various tables’ and entities’ content via the table browser. Anyway, the validation usually takes place in the UAT (User Acceptance Testing) system. So, if everything went well in UAT and all measures were taken to have the same parameters across all systems, there should be no surprises during Go-Live.


21 May 2020

📦Data Migrations (DM): In-house Built Solutions (Part III: The Data Preparation Layer)

Data Migrations Series

Once the source data (including the data needed for enrichment) were made available in the migration database, one can model the source entities by encapsulating the logic in views or table-valued functions (TVFs). This enables the code’s maintainability (by providing better visibility over the transformations), reuse (various validations are necessary), performance (by taking advantage of the RDBMS’s native functionality), and flexibility in changing the code. The objects thus created can be used in a new set of similar objects supposed to contain the mapping logic between the source and target data models. One thus attempts to prepare the data as needed by the target system. 

As each target entity is modelled, it’s useful to dump the resulting data into tables, which are then used as the source for the further logic instead of using the views or TVFs directly. This allows one to keep a copy of the data and perform a range of validations, and, most importantly, can provide better performance, as indexes can be built on the tables. In addition, one can further manipulate the data in the tables as required, e.g. by including information which only becomes available later (e.g. attributes from the target system), or for testing or correcting the data without affecting the built logic. For the same reasons, such tables can be used in intermediate steps of the migration, including when modelling the source data entities. However, one should avoid their excessive use, as this can complicate the architecture unnecessarily. 
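
A minimal sketch of this pattern, assuming a hypothetical Map.vProducts mapping view whose data are dumped into a Map.Products table:

-- dumping the modelled data into a table (hypothetical view and table names)
SELECT *
INTO Map.Products
FROM Map.vProducts

-- indexing the table to improve the performance of the subsequent logic
CREATE INDEX IX_Products_ProductNumber
ON Map.Products (ProductNumber)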

The last step in the preparation layer is to prepare queries which select only the attributes needed by the target system and include the additional formatting. It’s in general recommended to detach the formatting from the other transformations, as this approach provides better flexibility in addressing the migration requirements. The data can afterwards be exported manually or via an automated job, the latter approach being recommended especially when the data need to be partitioned. 

At this stage, after validating the data, they can be imported into the target system via the available mechanisms. In theory, the boundary of the migration logic lies here; however, this layer can be extended for data validation purposes. It would be helpful, for example, to assure that the data imported into the target entirely reflect the prepared data. It can happen that during the import data are truncated or incorrectly imported (wrong attribute, values changed or incorrectly mapped), or that whole records are not imported, with an impact on data consistency. 

After importing the data into the target system, one can bring the migrated data back into the migration database via ETL packages and build queries which match the data at attribute level. This step may seem redundant, though it’s a way to assure that the migration occurred according to the expectations and thus to minimize later surprises. If not for all entities, this type of import might be the easiest solution for bringing data back into the logic (e.g. when identity values need to be mapped into the logic after migrating an entity).
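
A minimal sketch of such an attribute-level comparison, assuming hypothetical Map.Products (prepared data) and dbo.TargetProducts (data imported back from the target) tables that share the ProductNumber key and a few attributes:

-- identifying the records whose attributes differ between the prepared and the imported data
SELECT ProductNumber
, ProductName
, ProductCategory
FROM Map.Products
EXCEPT
SELECT ProductNumber
, ProductName
, ProductCategory
FROM dbo.TargetProducts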

It can also be helpful to import the tables for dropdown values (typically parameters), to assure thus that the values used for parameterizing the system were built into the migration logic as expected. It may sound surprising, but not all systems perform checks on import, or the checks may have been disabled for various reasons. 

In data migrations it is recommended to assure the data’s internal consistency by design. Even if the various validations for uniqueness, completeness, consistency, conformity, timeliness, referential integrity or even accuracy might seem redundant or involve extra work, in the long run they pay off, as they allow trapping the issues early in the process.

19 May 2020

📦Data Migrations (DM): In-house Built Solutions (Part I - An Introduction)

Data Migrations Series

A Data Migration (DM) is the move of all or a subset of the data available from one or more system(s) into other system(s). For ERP systems this can be achieved by having a database in between, which allows the import and combination of data from various sources, the further processing of the data, respectively the preparation of the data as per the target system(s)’ needs. 

At a deeper level, one needs to model the entities from the source and target systems at a level that allows easier processing, which comes down to removing and adding data, respectively mapping and transforming between the input and output models. In other words, a set of entities is transformed into another set of entities, while at a much lower level the attributes from the source system must be mapped to the target system. 

SQL scripting provides enough flexibility in modelling the source or target system(s), in changing the data and logic on the fly, in building additional functionality, in reusing logic or in cleaning the data. It works with large amounts of structured data, while for unstructured data one can still combine the power of SQL with ad-hoc or even NoSQL processing. It all sounds easy, doesn’t it? Then where does the complexity people talk about come from?

Knowing SQL and the features of a (relational) database (the how) is only a small part of all that is needed. Building a DM solution implies knowledge of the source data models and of the data behind them (the what), knowing the target (the where), how and when to import the data, why the data must be made available, in what way the data have an impact on the target system(s), respectively who needs to be involved. The more of these aspects are known, the higher the chances for the DM to succeed. 

When there’s a gap in the needed knowledge, then the knowledge must be acquired from the business and/or consultants, by investing time in data discovery or similar activities, in experimenting and testing. The more uncertainty, the more iterations are needed to achieve the degree of certainty (aka quality) needed. At minimum one needs at least 3-4 iterations to migrate the data – to build the concept, to test fully the migration, respectively to get the sign-off in a test environment and doing the migration into the production. 

Another type of complexity derives from the interdependencies existing between the DM and vital activities like system parametrization and data cleaning. The system’s parametrization defines how a system behaves, a considerable number of the parametrization values being reflected also in the data prepared for the DM. Wrong parametrization values can be trapped during the various iterations of the process; however, poor data quality will be reflected through the system after Go-Live and will haunt the business in the long term, unless addressed correctly in the project. 

All these aspects are reflected to some degree also in the DM’s architecture (e.g. building restart points, including data validation and automatic cleaning) and process (e.g. sequencing of the steps). Of all the effort involved in a DM, maybe 20-30% needs to be invested in building the solution. These 20-30% can in theory be reduced by using specialized DM tools, though the advantage provided by such tools can be illusory, especially when further limitations are involved. On the other hand, such tools may provide functionality that addresses the other 70-80%. 

As usual in IT, there’s a trade-off between advantages and disadvantages. An in-house built DM solution may provide the needed flexibility at the cost of a small time investment. Whether it pays off is something one has to prove by walking this path. 

10 May 2018

🔬Data Science: Cross-validation (Definitions)

"A method for assessing the accuracy of a regression or classification model. A data set is divided up into a series of test and training sets, and a model is built with each of the training set and is tested with the separate test set." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A method for assessing the accuracy of a regression or classification model." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2007)

"A statistical method derived from cross-classification which main objective is to detect the outlying point in a population set." (Tomasz Ciszkowski & Zbigniew Kotulski, "Secure Routing with Reputation in MANET", 2008)

"Process by which an original dataset d is divided into a training set t and a validation set v. The training set is used to produce an effort estimation model (if applicable), later used to predict effort for each of the projects in v, as if these projects were new projects for which effort was unknown. Accuracy statistics are then obtained and aggregated to provide an overall measure of prediction accuracy." (Emilia Mendes & Silvia Abrahão, "Web Development Effort Estimation: An Empirical Analysis", 2008)

"A method of estimating predictive error of inducers. Cross-validation procedure splits that dataset into k equal-sized pieces called folds. k predictive function are built, each tested on a distinct fold after being trained on the remaining folds." (Gilles Lebrun et al, EA Multi-Model Selection for SVM, 2009)

"Method to estimate the accuracy of a classifier system. In this approach, the dataset, D, is randomly split into K mutually exclusive subsets (folds) of equal size (D1, D2, …, Dk) and K classifiers are built. The i-th classifier is trained on the union of all Dj ¤ j¹i and tested on Di. The estimate accuracy is the overall number of correct classifications divided by the number of instances in the dataset." (M Paz S Lorente et al, "Ensemble of ANN for Traffic Sign Recognition" [in "Encyclopedia of Artificial Intelligence"], 2009)

"The process of assessing the predictive accuracy of a model in a test sample compared to its predictive accuracy in the learning or training sample that was used to make the model. Cross-validation is a primary way to assure that over learning does not take place in the final model, and thus that the model approximates reality as well as can be obtained from the data available." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"Validating a scoring procedure by applying it to another set of data." (Dougal Hutchison, "Automated Essay Scoring Systems", 2009)

"A method for evaluating the accuracy of a data mining model." (Microsoft, "SQL Server 2012 Glossary", 2012)

"Cross-validation is a method of splitting all of your data into two parts: training and validation. The training data is used to build the machine learning model, whereas the validation data is used to validate that the model is doing what is expected. This increases our ability to find and determine the underlying errors in a model." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"A technique used for validation and model selection. The data is randomly partitioned into K groups. The model is then trained K times, each time with one of the groups left out, on which it is evaluated." (Simon Rogers & Mark Girolami, "A First Course in Machine Learning", 2017)

"A model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set." (Adrian Carballal et al, "Approach to Minimize Bias on Aesthetic Image Datasets", 2019)

