Besides migrating the master and transactional data from the legacy systems, a Data Migration (DM) usually comes with three additional important business requirements – migrate the data within the expected timeline, with minimal disruption for the business, and within the expected quality levels. Hence, the DM's timeline must match and synchronize with the main project's timeline in terms of main milestones, even though the DM itself is typically executed within a small timeframe of a few days during the Go-Live. As for the third requirement, even if the data are of high quality as available in the source systems or as provided by the business, aspects like integration and consistency rely primarily on the DM logic.
To address these requirements the DM logic must reach a certain level of performance and quality that allows importing the data as expected. From the project's beginning until UAT, the DM team integrates the various data iteratively, needs to test the changes several times, and troubleshoots the deviations from expectations. The volume of effort required for these activities can be overwhelming. It's not only important for the whole solution to be performant, but each step must be designed so that, besides fast execution, changes and troubleshooting involve a minimum of overhead.
To better understand why this matters, imagine a quest game in which the character has to go through a labyrinth with traps. If the player makes a mistake, he needs to restart from a distant checkpoint or even from the beginning. Now imagine that for each mistake he has the possibility of going one step back, trying a new option and moving forward. For some it may look like cheating, though in this way one can finish the game relatively quickly. It would be great if executing a DM allowed the same flexibility.
Unfortunately, unless the data are stored between steps or each step is a different package, an ETL solution doesn't provide the flexibility of changing the code, moving one step back, rerunning the step and troubleshooting, over and over again as in the quest game. To better illustrate the impact of such an approach, let's consider that the DM has about 40 entities and one needs to perform on average 20 changes per entity. If one is able to move forwards and backwards between steps, each change will probably take only a few minutes to execute. Otherwise, rerunning a whole package can take 5-10 times longer or even more, depending on the package's size and the data volume. For 800 changes, only one additional minute per change equates to 800 minutes (about 13 hours).
By contrast, storing the data for an entity in a database at the important points of the processing and implementing the logic as a succession of SQL scripts allows this flexibility. The most important downside is that the steps need to be executed manually, though this is a small price to pay for the flexibility and control gained. Moreover, with a few tricks one can load deltas, as in the case of a phased DM.
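As a minimal sketch of the idea (assuming T-SQL and illustrative names such as Stg_Customers_Raw or Map_Countries), each step materializes its output in a staging table, so a single step can be corrected and rerun from the previous table without repeating the whole chain:

```sql
-- Step 1: raw extract from the legacy source
-- (for a delta run, one would restrict the extract here, e.g. to records
--  created or changed after the previous load – column names are assumed)
SELECT CustomerNumber, CustomerName, CountryCode, CreatedOn
INTO dbo.Stg_Customers_Raw
FROM LegacySystem.dbo.Customers;

-- Step 2: cleaning and harmonization, built on top of Step 1's output
SELECT CustomerNumber
     , LTRIM(RTRIM(CustomerName)) AS CustomerName
     , UPPER(CountryCode) AS CountryCode
     , CreatedOn
INTO dbo.Stg_Customers_Clean
FROM dbo.Stg_Customers_Raw
WHERE CustomerName IS NOT NULL;

-- Step 3: mapping to target values; if the mapping changes, only this step
-- needs rerunning, as the outputs of Steps 1 and 2 are already materialized
SELECT C.CustomerNumber
     , C.CustomerName
     , M.TargetCountryCode
INTO dbo.Stg_Customers_Mapped
FROM dbo.Stg_Customers_Clean C
JOIN dbo.Map_Countries M
  ON C.CountryCode = M.LegacyCountryCode;
```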
To ensure that the consistency of the data is maintained, one needs to build for each entity a set of validation queries that check for duplicates, special cases, data integrity, incorrect formats, etc. The queries can be included in the sequence of logic used for the DM. Thus, one can react promptly to each unexpected value. When required, the validation rules can be built into reports and used by the users in the data cleaning process, or even logged periodically per entity for tracking the progress.
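Continuing the sketch above (same assumed tables and columns), such validation queries are written so that they return no rows when the data are consistent:

```sql
-- duplicates by business key
SELECT CustomerNumber, COUNT(*) AS RecordCount
FROM dbo.Stg_Customers_Mapped
GROUP BY CustomerNumber
HAVING COUNT(*) > 1;

-- data integrity: customers referencing a country missing from the mapping
SELECT C.CustomerNumber, C.CountryCode
FROM dbo.Stg_Customers_Clean C
LEFT JOIN dbo.Map_Countries M
  ON C.CountryCode = M.LegacyCountryCode
WHERE M.LegacyCountryCode IS NULL;

-- incorrect format: country codes that are not two letters
SELECT CustomerNumber, CountryCode
FROM dbo.Stg_Customers_Clean
WHERE LEN(CountryCode) <> 2;
```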