SQL Troubles: data mapping

Showing posts with label data mapping. Show all posts

01 February 2021

📦Data Migrations (DM): Quality Assurance (Part IV: Quality Acceptance Criteria IV)

Data Migrations Series

Reliability

Reliability is the degree to which a solution performs its intended functions under stated conditions without failure. In other words, a DM is reliable if it performs what was intended by design. The data should be migrated only when migration’s reliability was confirmed by the users as part of the sign-off process. The dry-runs as well the final iteration for the UAT have the objective of confirming solution’s reliability.

Reversibility

Reversibility is the degree to which a solution can return to a previous state without starting the process from the beginning. For example, it should be possible to reverse the changes made to a table by returning to the previous state. This can involve having a copy of the data stored respectively deleting and reloading the data when necessary.

Considering that the sequence in which the various activities is fix, in theory it’s possible to address reversibility by design, e.g. by allowing to repeat individual steps or by creating rollback points. Rollback points are especially important when loading the data into the target system.

Robustness

Robustness is the degree to which the solution can accommodate invalid input or environmental conditions that might affect data’s processing or other requirements (e,g. performance). If the logic can be stabilized over the various iterations, the variance in data quality can have an important impact on a solutions robustness. One can accommodate erroneous input by relaxing schema’s rules and adding further quality checks.

Security

Security is the degree to which the DM solution protects the data so that only authorized people have access to the respective data to the defined level of authorization as data are moved through the solution. The security provided by a solution needs to be considered against the standards and further requirements defined within the organization. In case no such standards are available, one can in theory consider the industry best practices.

Scalability

Scalability is the degree to which the solution is able to respond to an increased workload. Given that the number of data considered during the various iterations vary in volume, a solution’s scalability needs to be considered in respect to the volume of data to be migrated.

Standardization

Standardization is the degree to which technical standards were implemented for a solution to guarantee certain level of performance or other aspects considered as import. There can be standards for data storage, processing, access, transportation, or other aspects associated with the migration processes. Moreover, especially when multiple DMs are in scope, organizations can define a set of standards and guidelines that should be further considered.

Testability

Testability is the degree to which a solution can be tested in the respect to the set of functional and data-related requirements. Even if for the success of a migration are important the data in their final form, to achieve that is needed to validate the logic and test thoroughly the transformations performed on the data. As the data go trough the data pipelines, they need to be tested in the critical points – points where the data suffer important transformations. Moreover, one can consider record counters for the records processed in each such critical point, to assure that no record was lost in the process.

Traceability

Traceability is the degree to which the changes performed on the data can be traced from the target to the source systems as record, respectively at entity level. In theory, it’s enough to document the changes at attribute level, though upon case it might needed to document also the changes performed on individual values.

Mappings at attribute level allow tracing the data flow, while mappings at value level allow tracing the changes occurrent within values.

Previous Post <> Next Post

30 December 2020

🧊Data Warehousing: ETL (Part V: The Transform Subprocess)

As part of the ETL process, the Transform subprocess is responsible for bridging the gap between source and destination by leveraging SQL or the rich set of (data) transformations available in ETL tools, either to enable the implicit or explicit conversion between source and destination data types, or to transform the data as needed.

Transformations act on data as operators, the challenge being to transform the data in the smallest number of steps in the most efficient way. Some of the transformations available in the ETL tools (e.g. conversions, sorting, sampling, joins, lookups, aggregation, pivoting, unpivoting) can be replaced by SQL-based logic. One can easily prepare the data directly in the extraction query, taking thus advantage of the power provided by the database engines. Moreover, the logic can be encapsulated in views or other objects and called as required by the extraction logic when the source database allows it. This approach allows maintaining the logic independently of the ETL packages.

Unfortunately, SQL can replace the transformations that address sequential logic and not workflow-related logic (e.g. conditional splitd, merges, multicasts, slowly changing dimensions) or logic that includes certain computational complexity (e.g. fuzzy groupings or lookups). Such gaps need to be filled by the ETL tools via the built-in transformations, by allowing developers to build custom logic or simple use COTS solutions, when they prove capable of filling the gap.

Copying the data 1:1 at table or entity-level from the source system(s) involves in theory the simplest transformations, transformations revolving mainly around conversions between data types. The casual troublemakers are the numeric and date values, which can be found in different formats or precisions in the various environments. As this can apply to the ETL environment itself, it’s important to consider environment-agnostic data types when possible (e.g. strings).

Other sources for concerns are the user-defined data types which don’t have equivalents between the systems, needing thus additional transformations for further handing, respectively the invalid values which need to be handled accordingly. Besides the data from the source system(s) and the derived values, upon case one needs to consider the parameter-based or hardcoded metadata created in the process.

Independently of the purpose of the ETL packages it is usually required to document the data flow associated with them and the rules applied in transformations in what is known as a mapping document. Such a document needs to be understandable by the business, as it can serve for Data Management, projects, or other purposes. Even if it’s almost impossible to document everything, at minimum needs to be provided the source and destination tables, the attributes considered in the mappings, respectively the most important rules the business should be aware of. Otherwise, the technical people can always turn back to the SQL queries, when needed.

Some sources consider each non-trivial transformation as a business rule. Even if the rules used in transformations constrain the (business) data, not each rule is relevant for the business to the degree that it constrains some part of the business.

Data Migrations involve transformations between (database) schemas. Therefore, the logic requested to move the data could be handled in theory with a few well-designed packages, though there are considerations like logic complexity, transparency, flexibility, performance or auditability which could be better handled by using other techniques (e.g. saving the data in intermediary tables, breaking down the logic in several steps). Such considerations can apply also to simple ETL packages. Therefore, it’s important to recognize such scenarios, weight the choices and choose what fits best. However, unless one knows what one’s doing, it’s recommended to use the methods one knows best.

Previous Post <<||>> Next Post

24 May 2020

🧮💫ERP: Proof-of-Concept (Part I: Migrating AdventureWorks to Dynamics 365 - Products)

ERP Implementations Series

Below is exemplified the migration of Products from AdventureWorks database to Dynamics 365 (D365), where a minimum of steps were considered. Variations (e.g. enrichment of data, successive migrations) and other Entities (Product Variants, Released Products, Released Product Variants, etc.) will be considered in future posts.

As the AdventureWorks database is available only for testing and exemplification purposes, there is no need for a data import layer, the data being prepared into the “Map” schema created for this purpose. In theory the same approach can be used in production systems as well, though usually it’s better to detach the migration layer from the source system(s) from performance or security reasons.

-- creating a schema into the AdventureWorks database 
CREATE SCHEMA [Map]

Step 1: Data Discovery

Within this step one attempts getting a sorrow understanding of the systems involved within the data migration, in this case AdventureWorks and D365. As basis for this will be analyzed the tables for each entity, respectively the relations existing between them, the values, the distribution as well the relations existing between attributes. Is needed to analyze the similarities as well differences between the involved data models at structural as well at value level.

In AdventureWorks the SKU (Stock Keeping Unit) has a Color, Size and Style as Dimensions, a Product being created for each SKU. In D365 one differentiates between Products and Dimensions associated with it, having thus two levels defined. In addition, in D365 a Product has also the Configuration as dimension:

In addition, the Products and their Dimensions are defined at master level with a minimal of attributes like the Dimension Group. After that the Products and their Dimensions can be released for each Business Unit (aka Data Area), where the detailed attributes for Purchasing, Sales, Production or Inventory are maintained. For those acquainted with Dynamics AX 2009 it’s the same structure.
Once the structural differences identified, one can start looking at the values that define a product in both systems. The following queries are based on the source system.

-- reviewing the sizes 
SELECT Size 
, count(*) NoSizes
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY Size
ORDER BY Size

-- reviewing the colors 
SELECT Color
, count(*) NoColors
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY Color
ORDER BY Color

-- reviewing the styles  
SELECT Style
, count(*) NoStyles
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY Style
ORDER BY Style

If the above queries show what values are used, the following shows the dependencies between them:

-- reviewing the sizes, colors, styles  
SELECT Size
, Color
, Style
, count(*) NoValues
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY Size
, Color
, Style
ORDER BY 1,2,3

-- reviewing the dependencies between sizes, colors, styles  
SELECT CASE WHEN IsNull(Size, '') != '' THEN 'x' ELSE '' END HasSize
, CASE WHEN IsNull(Color, '') != '' THEN 'x' ELSE '' END HasColor
, CASE WHEN IsNull(Style, '') != '' THEN 'x' ELSE '' END HasStyle
, count(*) NoValues
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY CASE WHEN IsNull(Size, '') != '' THEN 'x' ELSE '' END 
, CASE WHEN IsNull(Color, '') != '' THEN 'x' ELSE '' END 
, CASE WHEN IsNull(Style, '') != '' THEN 'x' ELSE '' END
ORDER BY 1,2,3

The last query is probably the most important, as it shows how the products need to be configured into the target system:

As can be seen a product can have only Color, Color and Style, Size and Color, respectively no dimensions or all dimensions. It will be needed to define a Dimension Group for each of these cases (e.g. Col, ColSty, SizCol, SizColSty, None). (More information on this in a future post.)

Unfortunately, unless the target system is already in use, there are no values usually, though one can attempt entering a few representative values manually over the user interface, at least to see what tables get populated.

Step 2: Data Mapping

Once the main attributes from source and target were identified, one can create the mapping at attribute level between them. Typically, one includes all the relevant information for a migration, from table, attribute, description to attributes’ definition (e.g. type, length, precision, mandatory) in all the systems:

The mapping was kept to a minimum to display only the most relevant information. Except a warning concerning the length of an attribute, respectively a new attribute (the old item number), the mapping doesn’t involve any challenges.

A data dictionary or even a metadata repository for the involved systems can help in the process, otherwise one needs to access the information from the available documentation or system’s metadata and prepare the data manually.

The relevant metadata for D365 can be obtained from the Microsoft documentation. The data can be loaded into system via the EcoResProductV2Entity (see also data entities or the AX 2012 documentation for tables and enumeration data types).

Step 3: Building the source entity

AdventureWorks already provides a view which models the Products entity, though because of its structure it needs to suffer some changes, or sometimes more advisable, do the changes in a separate view as follows:

-- Products source entity 
CREATE VIEW Map.vProductDetails 
AS 
SELECT CASE WHEN Size<>'' THEN dbo.CutLeft(ProductNumber, '-',1) ELSE ProductNumber End ItemIdOld 
, CASE WHEN Size<>'' THEN dbo.CutLeft(Name, '-',1) ELSE Name End Name 
, row_number() OVER(PARTITION BY CASE WHEN Size<>'' THEN dbo.CutLeft(ProductNumber, '-',1) ELSE ProductNumber End ORDER BY ProductNumber) Ranking
, ProductNumber 
, Description 
, Color
, Size
, Style
, CultureId 
, Subcategory 
, Category
, MakeFlag
, FinishedGoodsFlag
, SellStartDate 
, SellEndDate 
, StandardCost 
, ListPrice 
, SafetyStockLevel 
, ReorderPoint 
FROM [Production].[vProductDetails]

-- reviewing the data 
SELECT *
FROM Map.vProductDetails 
WHERE CultureId = 'en'
ORDER BY ProductNumber

To prepare the data for the migration the Product Number as well the Name were stripped from the Size, this being done with the help of dbo.CutLeft function. The row_number ranking window function was used to allow later selecting the first size for a given Product.

The discovery process continues, this time in respect to the target. Its useful to understand for example whether a Product has more than one Color or Style, whether the prices vary between Sizes, whether attributes like the Subcategory are consistent between Sizes, etc. It’s useful to prove anything that could have impact on the migration logic. The list of tests will be extended while building the logic, as new information are discovered.

-- checking dimensions' definition
SELECT ItemidOld 
, count(Size) NoSizes
, count(DISTINCT Color) NoColors
, count(DISTINCT Style) NoStyles
FROM Map.vProductDetails 
WHERE CultureId = 'en'
  --AND Ranking = 1
GROUP BY ItemidOld
ORDER BY ItemidOld

-- checking the price variances between dimensions 
SELECT ItemidOld 
, Min(StandardCost) MinStandardCost
, Max(StandardCost) MaxStandardCost
FROM Map.vProductDetails 
WHERE CultureId = 'en'
  AND Ranking = 1
GROUP BY ItemidOld
HAVING Min(IsNull(StandardCost, 0)) != Max(IsNull(StandardCost, 0)) 
ORDER BY ItemidOld

-- checking attribute's consistency between dimensions 
SELECT ItemidOld 
, Min(Subcategory) MinSubcategory
, Max(Subcategory) MaxSubcategory
FROM Map.vProductDetails 
WHERE CultureId = 'en'
  AND Ranking = 1
GROUP BY ItemidOld
HAVING Min(IsNull(Subcategory, '')) != Max(IsNull(Subcategory, '')) 
ORDER BY ItemidOld

When the view starts performing poorly, for example because of the number of joins or data’s volume, it might me useful to dump the data in a table and perform the tests on it.
Even if it’s not maybe the case, it’s useful to apply defensive techniques in the logic by handing adequately the nulls.

Step 4: Implementing the Mapping

The attributes which need to be considered here are based on the target entities. It might be needed to include also attributes that are further needed to build the logic.

-- Product Mapping 
CREATE VIEW [Map].vEcoResProductV2Entity
AS
SELECT ProductId 
, ItemidOld ItemId 
, 'Item' ProductType 
, 'ProductMaster' ProductSubtype 
, Left(Replace(Name, ' ', ''), 20) ProductSearchName 
, ItemidOld ProductNumber 
, Name ProductName 
, Description ProductDescription 
, CASE 
 WHEN IsNull(Size, '') !='' AND IsNull(Color, '') !='' AND IsNull(Style, '')!='' THEN 'SizColSty'
 WHEN IsNull(Size, '') !='' AND IsNull(Color, '') !=''  THEN 'SizCol'

        WHEN IsNull(Style, '') !='' AND IsNull(Color, '') !=''  THEN 'ColSty'
 WHEN IsNull(Color, '') !='' THEN 'Col'
 WHEN IsNull(Style, '')!='' THEN 'Sty'

        WHEN IsNull(Size, '')!='' THEN 'Siz'
 ELSE 'None'
  END ProductDimensionGroupName 
, 'WHS' StorageDimensionGroupName 
, CASE WHEN MakeFlag = 1 THEN 'SN' ELSE '' END TrackingDimensionGroupName 
, 'PredefinedVariants' VariantConfigurationTechnology 
, Subcategory ProductCategory 
, ItemidOld ItemIdOld 
, 1 IsNewItem 
FROM Map.vProductDetails 
WHERE CultureId = 'en'
  AND Ranking = 1

The Dimenstion Group is based on the above observation. It was supposed that all Products have inventory (see Storage Dimension), while the manufactured products will get a Serial Number (see Tracking Dimension). IsNewItem will be used further to migrate deltas (and thus to partition migrations).

Step 5: Building the Target Entity

The target entity is in the end only a table in which usually are kept only the attributes in scope. In this case the definition is given by the following DDL:

CREATE TABLE Map.EcoResProductV2Entity(
 Id int IDENTITY(1,1) NOT NULL,
 ProductId int NULL,
 ProductType nvarchar(20) NULL,
 ProductSubtype nvarchar(20) NULL,
 ProductsearchName nvarchar(255) NULL,
 ProductNumber nvarchar(20) NOT NULL,
 ProductName nvarchar(60) NULL,
 ProductDescription nvarchar(1000) NULL,
 ProductDimensionGroupName nvarchar(50) NULL,
 StorageDimensionGroupName nvarchar(50) NULL,
 TrackingDimensionGroupName nvarchar(50) NULL,
 VariantConfigurationTechnology nvarchar(50) NULL,
 ProductCategory nvarchar(255) NULL,
 ItemIdOld nvarchar(20) NULL,
 IsNewItem bit,
 CONSTRAINT I_EcoResProductV2Entity PRIMARY KEY CLUSTERED 
(
 Id ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

The table was build to match the name and definition from the target systems. The definition is followed by a few inserts based on the logic defined in the previous step:

-- preparing the data for EcoResProductV2Entity 
INSERT INTO [Map].EcoResProductV2Entity 
SELECT ITM.ProductId
, ITM.ProductType
, ITM.ProductSubtype
, ITM.ProductsearchName
, ITM.ProductNumber
, ITM.ProductName
, ITM.ProductDescription
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.VariantConfigurationTechnology
, ITM.ProductCategory
, ITM.ItemIdOld 
, ITM.IsNewItem
FROM [Map].VEcoResProductV2Entity ITM
WHERE ITM.IsnewItem = 1
ORDER BY ProductType 
, ITM.ItemIdOld 

-- reviewing the data 
SELECT *
FROM [Map].EcoResProductV2Entity 
ORDER BY ItemId

The business might decide to take over the Product Number into the target system as unique identifier, though it’s not always the case. It might opt to create a new sequence number, which could start e.g. with 10000000 (8 characters). In such a case is changed only the logic for the Product Number, the value being generated using a ranking window function:

-- preparing the data for EcoResProductV2Entity  
DECLARE @StartItemId as int = 10000000
--INSERT INTO [Map].EcoResProductV2Entity 
SELECT ...
, @StartItemId + Rank() OVER(ORDER BY ITM.ProductType, ITM.ItemIdOld) ProductNumber
, ...
FROM [Map].VEcoResProductV2Entity ITM
WHERE ITM.IsnewItem = 1
ORDER BY ProductType 
, ITM.ItemIdOld

Step 5: Reviewing the Data

Before exporting the data it makes sense to review the data from various perspectives: how many Products of a certain type will be created, whether the current and old product numbers are unique, etc. The scripts make sure that the consistency of the data in respect to the future systems was achieved.

-- checking values' frequency (overview, no implications)
SELECT ITM.ProductType
, ITM.ProductSubtype
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.VariantConfigurationTechnology
, count(*) NoRecords
FROM [Map].EcoResProductV2Entity ITM
WHERE IsNewItem = 1
GROUP BY ITM.ProductType
, ITM.ProductSubtype
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.VariantConfigurationTechnology

-- check ProductNumber's uniqueness (no duplicates allowed)
SELECT ProductNumber
, Min(ItemidOld) 
, max(ItemIdold)
, count(*)
FROM [Map].EcoResProductV2Entity
GROUP BY ProductNumber
HAVING count(*)>1

-- check old Product's uniqueness (no duplicates allowed)
SELECT ItemIdOld
, count(*)
FROM [Map].EcoResProductV2Entity
GROUP BY ItemIdOld
HAVING count(*)>1

This section will grow during the implementation, as further entities will be added.

Step 6: Exporting the data

The export query is usually reflecting the entity and can include further data’s formatting, when needed:

-- Export Products
SELECT ITM.ProductType
, ITM.ProductSubtype
, ITM.ProductsearchName
, ITM.ProductNumber
, ITM.ProductName
, ITM.ProductDescription
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.VariantConfigurationTechnology
, ITM.ProductCategory RetailProductCategoryName 
, ITM.ItemIdOld
FROM [Map].EcoResProductV2Entity ITM
WHERE ITM.isNewItem = 1
ORDER BY ProductNumber

Depending on the import needs, the data can be exported to Excel or a delimited text file (e.g. “|” pipe is an ideal delimiter.

Step 7: Validating the Data before Import

Before importing the data into the target system, it makes sense to have the data checked by the business or consultants. A visual check at this stage can help save time later.

Step 8: Validating the Data after Import

Unfortunately Microsoft doesn’t allow direct access to the D365 Production database, however one can still access various tables and entities’ content via the table browser. Anyway, the validation usually takes place into the UAT (User Acceptance Testing) system. So, if everything went well into the UAT and all measures were taken to have the same parameters across all systems, there should be no surprises during Go-Live.

11 February 2017

⛏️Data Management: Data Mapping (Definitions)

"The process of identifying correspondence between source data elements and target data elements when migrating data." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"The process of noting the relationship of a data element to something or somebody." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"(1) The process of associating one data element, field, or idea with another data element, field, or idea. (2) In source-to-target mapping, the process of determining (and the resulting documentation of) where the data in a source data store will be moved to another (target) data store." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"The assignment of source data entities and attributes to target data entities and attributes, and the resolution of disparate data." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Data mapping is the process of creating data element mappings between two distinct data models. This activity is considered to be part of data integration." (Piethein Strengholt, "Data Management at Scale", 2020)

"The process defining a link between two disparate data models. It is often the first step towards data integration." (MuleSoft)

"The process of assigning a source data element to a target data element." (Information Management)

"Data mapping is the process of creating data element mappings between two different data models and is used as a first step for a wide array of data integration tasks, including data transformation between a data source and a destination." (Solutions Review)

"Data mapping is the process of defining a link between two disparate data models in the aim of future data integration." (kloudless)

"Data mapping is the process of mapping source data fields to destination related target fields." (Adobe)

01 July 2012

📦Data Migrations (DM): An Introduction

Data Migrations Series

Introduction

Basically, Data Migration is the movement of data from one IS (Information System), the legacy system, to a new IS, the target system, supposed to replace entirely or partially the legacy system. In the best scenario there are no differences between the two IS or the differences are minimal, negligible. In the worst scenario, there are multiple legacy systems used as source, and even multiple target systems, with important differences between them, differences that can even be translated in incompatibilities at multiple levels. Such architectures can span geographies, departments, organizations or industries; can involve a multitude of vendors, generations of systems, network types, different regulations, etc. In many Data Migrations the overall picture can be really complex, though for the sake of simplicity it’s enough to focus on the simplest scenario in which there is a single source and a single target system, with some differences between them. Abstraction can be made also of the fact that many migrations are parts of bigger projects, for example ERP implementations or any other type of applications migrations.

Data Migration is quite a complex topic, for many appearing like a black box in which data come in and data come out. That’s valid for the typical user as well for the IT professionals who haven’t been involved in Data Migration projects. There are many books on topics that are tangent to Data Migration – Data Management, Data Quality, Data Integration or Data Warehousing. Excepting some presentations available on the Web, a few methodologies exposed by important companies, one or two books, and a few blogs, there isn’t much material available on Data Migration. The “trend” is also a reflection of the low importance given to Data Migration as subject, even if many professionals working in the field warn about the considerable impact a Data Migration can have on a project in particular, and on business in general.

Approaching a topic like Data Migration can be, upon case, a complex task, however with a little intuition and some guidance its complexity falls apart. Often, when exploring such a topic, of help can be the 5W1H technique or its extended forms. The technique resumes to searching for answers to the “what”, “where”, “why”, “how”, “when”, “who” and “with what” questions. In case of Data Migration the questions are formulated as: what (data) to migrate, where to migrate, why to migrate, how to migrate, when to migrate, who migrates and with what to migrate?

Why to migrate?

A Data Migration occurs as follow up of a need – an old system exists in place and can’t cope anymore with business’ growth, a company made an acquisition and the systems need to be consolidated, or the organization decided to change its infrastructure, the processes, the business model in order address nowadays business requirements like flexibility, availability, manageability, automation, cost cuts, etc. In other words a Data Migration occurs as a need for change, and it can be itself a change in what concerns technical infrastructure, process, procedures, data flow, ways of doing business. A migration has quite an impact on the business, so here is an entitled question: does it really makes sense to migrate? Why not start from 0 with the new system?!

The migration can be a 0 point for an organization, though unless a company is starting anew, there are some data laying there in the old system(s) that need to be further available - for example open Purchase Orders that need to be fulfilled, Invoices that need to be paid, a catalog with all the Products and the available stock, information about Customers, what they bought, what they browsed or what they want to buy for Christmas, etc. At least some of the data need to be made available in one form and another also within the new architecture, if not the new system.

The availability of old data can be solved by keeping the old system(s) in place, functional, even if the system won’t be fed with new data, or maybe it will. Keeping a system alive involves additional costs for maintaining the infrastructure – software and hardware licenses, consultants, administrators and other people responsible for the optimal work of such a system. This can become with time quite an unnecessary burden. It can be an acceptable choice for some organizations, but unlikely as best/good practice. And even if the system is kept, more likely there will be data that need to be available also in the new system. Can be discussed also about integration of the two systems, but again, does it make sense? The bottom line is that in multiple scenarios a Data Migration can prove to be the optimal solution for an organization.

What data to migrate?

Even if it looks like a silly question, it can be one of most complex questions to answer. In theory is needed to migrate all the data, but are really needed all the data? Typically in a database can be found historical data not used anymore by the business, obsolete data marked or not for deletion, garbage data entered by mistake or remained after incomplete deletions, all these having low or no value for the business. Hopefully there are also “good data”, quintessential for the business. Somebody would say “what a hack, why do we need to philosophize so much, let’s migrate all the data!”. The decision can be understandable, though what if the percentage of “good data” is quite small in comparison with the total volume of data which can measure a few terabytes?! Sure, nowadays data centers can handle without problems terabytes of data, though there are some factors to be considered – it can be quite a challenge to migrate so many data, the volume of data affects also the performance of databases in particular, and IS in general, and a more natural reason – why store something that has minimal value for you?!

It makes sense to migrate only the data that have value for an organization, but what data are needed then? Normally this starts by understanding what entities the business deals with and which are the attributes that characterizes them. Many of the entities can be met in organization’s daily activity, and maybe are already defined in organization’s glossary or Data Dictionary, so a review of the available inventory might do. If not, more effort needs to be spent for this purpose; activities specific to Data Discovery, Data Categorization, Data Definition or Data Profiling tasks can help after case to fill the understanding gaps. Except categorization the others are not all necessary, same as the analysis can be deep enough to serve the purpose.

A first categorization was made above when data were considered as valuable, not valuable or in between. A second categorization can be made based on data’s usage: obsolete (not used anymore or marked for deletion), new (not used and recently entered), historical (data used in the past) and actual (data in use). A third categorization can be made on the status of the entities they represent, status that can be associated to the phase of the process the entity represent (e.g. active, inactive, open, invoices, closed, blocked, etc.). There can be considered other meaningful categorizations as long they prove to be important in identifying the useful data.

An important categorization in migrations, in particular, and Data Management, in general, is to split data in master data, transaction data and setup data. Master data are data are data that change only seldom and have a long life (until become obsolete), are referenced through all the system, and are vital to an organization through their meaning (e.g. Customers, Suppliers, Products, Assets, Employees, Accounts, etc.). Transaction data in exchange are data that change often and have a relatively short life, typically are referenced by other transactions and can be associated with documents or movements through the system (e.g. Purchase Orders, Sales Orders, Invoices, Receipts, Assets Movements, etc.). Setup data are data used to configure a system (e.g. Transaction Types, Document Types, Roles, Permissions, etc.). This categorization deserves the full attention, because each of the three elements needs a different handling approach in migration or Data Management.

Based on the identified categories can be considered some rough migration rules in deciding what data (actually records) to migrate, for example: - master data, unless they become obsolete, and open transactions are often considered to be migrated entirely; - historical transaction data spanning a few years back can be migrated in case they are needed in the process; - master data referenced by transaction data migrated need to be migrated too - setup data are entered manually - historical data are archived. There can be also exceptions from the rules, so such possible scenarios need to be considered too.

Each entity is defined by multiple attributes (also called properties, dimensions). They need to go through a similar “categorization” process. In deciding what attributes to migrate is important to consider especially their role in defining the entity. Some of them define uniquely an entity (e.g. Customer Number, Product Number, Serial Number), physical characteristics of the entity (e.g. color, weight, height), categorize the entity (e.g. Category, Type) or its status (e.g. Active, Blocked, Invoiced), imply various events (e.g. Creation Date, Delivery Date, Invoice Date), and so on. It looks like another type of categorization, and it is, though it’s more difficult to create some rough rules based on it, because in the end the business dictates which Attributes are needed. In fact, most of the Attributes used (with distinct not null values) in the legacy system are more likely needed also in the new system, unless the process changed considerably, or the business is supposed to change also its model.

Where to migrate the data?

When the Data Migration subject is brought on the table, a decision was already made about the target system. So the “where” question is partially answered, however it addresses only the peak of the iceberg. It shows that an iceberg lies there, in front of us, though under the deep of the waters there is something more, lot of questions and issues that need to be addressed. Like the source, the target needs to be further detailed in entities and their attributes; the targeted processes and procedures need to be considered together with the constraints imposed by the new system. It’s actually needed to identify the data requirements for the new systems and corroborate them with the requirements of the old system. Mapping the entities and attributes available in the two systems, process known as Data Mapping, can offer a good overview of what lays ahead, what similarities and gaps exist. There will be attributes that are available in the legacy but not in the target system, and therefore the target system needs to be extended or the data associated with the respective attributes can be left out. From the opposed perspective, there can be mandatory attributes in the target system which are not available in the organization, and therefore the associated data must be collected and/or made available for the migration. There can be cases when the data are not available in the legacy system but distributed in various other external or internal sources, so there can be an option to migrate or integrate the respective data, extend the processes to accommodate such scenarios, etc.

Only when the mapping of data is ready and the various related questions addressed, the “where” question is fully answered. Given the continuous changes done to the target system that may still happen a few days before Go Live, Data Mapping can remain a hot topic until then.

With what to migrate?

This question addresses the mix of tools used to migrate the data, and by extension the whole architecture developed for this purpose. As many experts point out, there is no general solution for such an approach because each migration is challenged by different requirements and architectures. ETL (Extract, Transform, Load) and Data Integration tools were mainly designed for this kind of purposes – moving data between data sources – therefore more likely the whole Data Migration architecture will be built around such a tool. In addition is needed to be addressed topics like assessment and reporting of Data Quality, Data Cleaning, Data Enrichment, Data Backup or Data Security. They will technically ensure that the data are migrated within intended level of quality and security.

For each of these topics are available one or more tools on the market. The challenge is to find the right mixture for the overall architecture, to make them work together in an efficient and effective manner. One of the problems such tools have is that they look to the Data Migration or similar problems from their own perspective, making them hard to integrate with other tools. Given the increasing need for Data Migration, more likely exist there tools that cover most of its requirements, each with its own advantages and disadvantages. Starting with a new tool can prove to be quite challenge in itself. Many recommend following a methodology and using tools that already proved their capabilities in other projects. That’s a good approach, though need to be considered also costs, available resources, effort to build the infrastructure, the learning curve, etc. For some migrations MS Excel or Access will do, for others a more complex framework is needed. Keep in mind that there is no perfect architecture, just the architecture that will drive you to achieve your targets.

How to migrate the data?

“How” refers mainly to the migration approach, steps, methodologies, processes and procedures used to migrate the data. Secondly, and not less important, it refers to how the mix of tools is used for migration – in other words the implementation. Despite the huge variety of tools and means of achieving the target, there can be depicted some generalities for each of these topics.

Migration approach refers to the overall strategy considered for a migration – typically on whether the data are migrated all together, the new system becoming functional and replacing the legacy system (the big-bang migration), or the data are migrated in phases, the legacy and target systems functioning in parallel for a certain amount of time (the phased-out migration). Can be met other variations of migration approaches, under various denominations. It’s important to know the advantages and disadvantages of both or all approaches, especially in what concerns their application in your organization.

“Steps” is just a misnomer for the actual Project Plan in which are considered the different phases and activities of such a project. In a general Data Migration project, can be discussed about Data Discovery, Data Definition, Data Collection, Data Consolidation, Data Mapping, Data Conversion, Data Transformation, Data Quality Assessment, Data Cleaning, Data Storage, etc. Some of these steps can be considered as standalone processes, sometimes being already part of the processes’ landscape existing in an organization. Other steps are just simple activities. Both types of steps share some important characteristics – they can be highly iterative and complex, are owned by the business, the IT functioning as facilitator, each of them depends on the input from other steps, and require continuous feedback, etc.

A Data Migration is (should be) managed as any other IT project, and therefore can be discussed about project-specific methodologies like PMBOK, Prince2 or PRISM. Many of the before mentioned steps come with their luggage of methodologies too. In addition, considering that IT functions as a service, could be considered service-specific methodologies like ITIL, ISO/IEC, Six Sigma, etc.

The actual implementation of all these depends entirely on the project’s scope, the knowledge of all those involved, the constraints met and the resources available for such a project. Many of the IT-specific problems and situations are specific across all IT projects.

Who will migrate the data?

There is no Data Migration project that can be done without the business, the de facto owner of such a project and its output. There is lot of input needed from the business, its continuous involvement through the various stages is necessary for the whole duration. Unless the Data Migration resumes to a rudimentary tool like Excel and can be handled without too much expertise, a Data Migration needs technical resources that can elicit the requirements, translate them in technical requirements, built the infrastructure and maybe migrate the data. It entirely depends on the overall architecture and methodology what people are involved. In the best case scenario the migration will resume to one person pushing a button and the data flow as magic from source to the target system. In reality, multiple people will have to take care of migration, pushing some magic buttons in a chain of parallel and even redundant steps, monitoring and validating the process. Data owners, data stewards, data custodians, data architects, database administrators, migration and quality assurance specialists, developers, consultants and many other people can be involved, each of them playing their role.

When to migrate the data?

Intuitively, data are or should be migrated when the target system is ready to receive the new data, thus when the development was finished, the system tested, and all the preparation for Data Migration were made. The statement is valid for any type of migration. How such a date or dates are calculated when a project starts is in itself kind of science or just a matter of needs. There are projects in which the dates for each milestone or phase are calculated back from a desired Go Live date, or projects in which the Go Live is calculated incrementally based on the steps to be performed. For dates’ calculation can be used also benchmarking from the field. The bottom line is that the data must be migrated on time for the Go Live and with a minimum disruption for the business.

Conclusion

Whether standalone or as subproject of another project, a Data Migration can be or become quite a complex thematic that, through its outcomes, affects the business considerably. In the above paragraphs were considered some of the important aspects of such a project, the focus being more on figuring out what a migration implies rather than a detailed exploration. It’s also a mental exercise and an invitation into the thematic.