ETL tools are ideal for extracting the needed data from the legacy system(s). They offer a considerable number of connectors to standard databases that leverage the legacy systems' data access layers or their own frameworks, both categories providing acceptable performance for a wide range of solutions. Otherwise, third-party connectors can be considered as well, though their advantage might reside in the extra features they bring out-of-the-box at the expense of performance, and thus they should be used with caution.
Besides that, ETL tools also provide rich visual functionality that allows users to build complex pipelines with transformations that process the data as they flow through the pipeline. Further features like data profiling or cleansing bring additional benefits.
As usually only a subset of the legacy data is needed for the migration, an ETL solution allows extracting only the data in scope, as filtering and other logic can be applied in the extraction mechanism. Whether one loads the tables or entities 1:1 or aggregates the data from multiple tables is a matter of choice, even if the former two approaches (loading tables, respectively entities, 1:1) are usually recommended.
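As a minimal sketch of scoping the extraction at the source, the example below pulls only active customers changed after a given cutoff date; the connection string, table and column names (dbo.Customers, IsActive, ModifiedDate) and the use of pyodbc are assumptions for illustration, not part of any particular legacy system.

```python
import pyodbc

# Assumed connection string and schema; adjust to the actual legacy source.
CONN_STR = ("DRIVER={ODBC Driver 18 for SQL Server};"
            "SERVER=legacy-srv;DATABASE=LegacyDB;Trusted_Connection=yes;")

# Only the data in scope is extracted: the filtering logic lives in the query itself.
QUERY = """
    SELECT CustomerId, Name, Country, ModifiedDate
    FROM dbo.Customers
    WHERE IsActive = 1
      AND ModifiedDate >= ?
"""

def extract_in_scope_customers(cutoff_date):
    """Stream only the in-scope rows, in batches, from the legacy database."""
    conn = pyodbc.connect(CONN_STR)
    try:
        cursor = conn.cursor()
        cursor.execute(QUERY, cutoff_date)
        while True:
            rows = cursor.fetchmany(5000)   # keep memory usage predictable
            if not rows:
                break
            yield from rows
    finally:
        conn.close()
```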
An alternative to an ETL tool is building one's own extraction layer based, for example, on a powerful data access layer like ADO.Net. This might prove to be a cheaper alternative, especially when ETL capabilities aren't needed. This also depends on the overall architectural approach.
Attempting to build a desktop-based application for a DM can prove to be a foolhardy approach, especially when dealing with a considerable volume of data. Moreover, one would need to rebuild features that are already available in ETL tools (transformations, workflows) or databases (indexes for performance optimization, join-based logic).
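ADO.Net belongs to the .NET stack; purely to illustrate the idea of a hand-built extraction layer (and the plumbing one ends up owning), here is a rough sketch in Python over pyodbc. The class, connection strings and statement texts are assumptions, not a reference implementation.

```python
import logging
import pyodbc

class LegacyExtractor:
    """Minimal hand-rolled extraction layer: batching, logging and landing
    rows into a staging table all have to be built and maintained here,
    things an ETL tool would provide out-of-the-box."""

    def __init__(self, source_conn_str, target_conn_str, batch_size=5000):
        self.source_conn_str = source_conn_str
        self.target_conn_str = target_conn_str
        self.batch_size = batch_size

    def copy(self, source_query, insert_stmt):
        """Copy the result of source_query into the target via insert_stmt."""
        src = pyodbc.connect(self.source_conn_str)
        tgt = pyodbc.connect(self.target_conn_str)
        try:
            src_cur = src.cursor()
            tgt_cur = tgt.cursor()
            tgt_cur.fast_executemany = True
            src_cur.execute(source_query)
            total = 0
            while True:
                rows = src_cur.fetchmany(self.batch_size)
                if not rows:
                    break
                tgt_cur.executemany(insert_stmt, [tuple(r) for r in rows])
                tgt.commit()
                total += len(rows)
                logging.info("copied %d rows so far", total)
            return total
        finally:
            src.close()
            tgt.close()
```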
When the volume of data exceeds the capabilities of ETL tools, one can consider ELT tools, which load the data first before applying any transformations to it. Such tools are designed for processing what is known as big data (data characterized by high volume, high velocity, high variety and varying veracity).
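As a rough sketch of the load-first ordering, the fragment below uses PySpark (an assumption, as are the landing path and column names): the legacy extract is landed untouched and the transformations run afterwards as set-based SQL over what was loaded.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dm-elt").getOrCreate()

# Load first: land the legacy extract as-is (path and column names are assumptions).
raw = spark.read.option("header", True).csv("/landing/legacy/customers/")
raw.createOrReplaceTempView("raw_customers")

# Transform afterwards, as set-based SQL over the loaded data.
staged = spark.sql("""
    SELECT CAST(CustomerId AS INT) AS CustomerId,
           TRIM(Name)              AS Name,
           UPPER(Country)          AS Country
    FROM raw_customers
    WHERE CustomerId IS NOT NULL
""")
staged.write.mode("overwrite").parquet("/staging/customers/")
```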
When considering the best data extraction approach, it's important to know where the data will be stored for processing. Given that DMs are data-processing intensive, the best data storage solution for processing would be a modern relational database. Besides performance, scalability, security, concurrency and failover mechanisms, some databases offer the possibility to connect directly to other servers via server links (linked servers) functionality. Despite this latter feature, an ETL tool can still have considerable advantages for data extraction.
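As an illustration of pulling data directly over a server link, the sketch below assumes a SQL Server staging database with a linked server named LEGACY_SRV already configured; the names and the use of pyodbc are assumptions.

```python
import pyodbc

# Assumed: a staging database on SQL Server with a linked server LEGACY_SRV
# already set up (e.g. via sp_addlinkedserver); all names are illustrative.
STAGING_CONN = ("DRIVER={ODBC Driver 18 for SQL Server};"
                "SERVER=staging-srv;DATABASE=Staging;Trusted_Connection=yes;")

# Pull the in-scope legacy data through the linked server, landing it in a
# local staging table for further processing.
PULL_VIA_LINKED_SERVER = """
    INSERT INTO stg.Customers (CustomerId, Name, Country)
    SELECT CustomerId, Name, Country
    FROM [LEGACY_SRV].[LegacyDB].[dbo].[Customers]
    WHERE IsActive = 1;
"""

conn = pyodbc.connect(STAGING_CONN, autocommit=True)
try:
    conn.execute(PULL_VIA_LINKED_SERVER)
finally:
    conn.close()
```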
On the other side, the DM logic can in theory be built entirely in the ETL tool without storing the data in a database, though this adds a high overhead on the resources of the server on which the solution runs, as all the data needed for processing must be loaded into memory. Even if the data are loaded in batches and processed as the batches go through the pipeline, the complexity of the processing can make it challenging to implement any optimization techniques directly in the ETL tool. Moreover, fully ETL-based solutions are difficult to troubleshoot and to change as the requirements change.
To address the high resource consumption of ETL tools, one can store the intermediary results in database tables, on which indexes can be created for performance optimization. Moreover, the logic can be encapsulated in database objects and used in the processing. This approach enables troubleshooting, performing validations and restarting the processing from a given step, at the cost of splitting the logic between multiple ETL packages. This can be an acceptable price to pay for more flexibility. Given that most ETL transformations can be replaced with SQL-based logic, the ETL tool can be used only for data extraction.
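A rough sketch of this staged, restartable style follows; the staging table, the index and the dm.* stored procedures are assumptions standing in for logic encapsulated in database objects.

```python
import pyodbc

STAGING_CONN = ("DRIVER={ODBC Driver 18 for SQL Server};"
                "SERVER=staging-srv;DATABASE=Staging;Trusted_Connection=yes;")

# Each step reads from and writes to intermediary tables, so results can be
# validated after every step and the run restarted from any point.
# Procedure and index names are illustrative only.
STEPS = [
    ("index staging",  "CREATE INDEX IX_stg_Customers_Id ON stg.Customers (CustomerId);"),
    ("cleanse",        "EXEC dm.CleanseCustomers;"),
    ("map and enrich", "EXEC dm.MapCustomers;"),
    ("build load set", "EXEC dm.BuildCustomerLoadSet;"),
]

def run_from(step_name=None):
    """Run the SQL-based processing chain, optionally restarting at a step."""
    start = 0
    if step_name is not None:
        start = next(i for i, (name, _) in enumerate(STEPS) if name == step_name)
    conn = pyodbc.connect(STAGING_CONN, autocommit=True)
    try:
        for name, sql in STEPS[start:]:
            print(f"running step: {name}")
            conn.execute(sql)
    finally:
        conn.close()

# e.g. rerun everything after the cleansing results have been validated:
# run_from("map and enrich")
```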