Data lakes and similar cloud-based repositories drove the requirement of loading the raw data before performing any transformations on it. At least that’s the approach the new wave of ELT (Extract, Load, Transform) technologies takes to handle analytical and data integration workloads, which is probably advisable for the cloud-based contexts mentioned. ELT technologies are especially relevant when there is a need to handle data with high velocity, variety, or varying validity and veracity (aka big data), because they allow processing the workloads over architectures that can be scaled with the workloads’ demands.
This is probably the most important aspect, even if there are further advantages, like using built-in connectors to a wide range of sources or implementing complex data flow controls. ETL (Extract, Transform, Load) tools have the same capabilities, perhaps limited to certain data sources, though their newer versions seem to bridge the gap.
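To make the contrast concrete, below is a minimal sketch of the two approaches applied to the same raw records, using Python’s standard sqlite3 module as a stand-in for the target repository; the table and column names (orders_raw, orders_clean, orders_curated) are illustrative assumptions, not a reference to any particular tool.

```python
# Minimal ETL vs. ELT sketch; sqlite3 stands in for the target repository,
# and the table/column names are purely illustrative.
import sqlite3

raw_orders = [
    {"id": 1, "amount": "125.50", "country": " de "},
    {"id": 2, "amount": "80.00",  "country": "US"},
]

con = sqlite3.connect(":memory:")

# ETL: transform in the pipeline, then load only the prepared result.
con.execute("CREATE TABLE orders_clean (id INTEGER, amount REAL, country TEXT)")
con.executemany(
    "INSERT INTO orders_clean VALUES (?, ?, ?)",
    [(r["id"], float(r["amount"]), r["country"].strip().upper()) for r in raw_orders],
)

# ELT: load the raw records as-is, transform later inside the target.
con.execute("CREATE TABLE orders_raw (id INTEGER, amount TEXT, country TEXT)")
con.executemany(
    "INSERT INTO orders_raw VALUES (?, ?, ?)",
    [(r["id"], r["amount"], r["country"]) for r in raw_orders],
)
con.execute("""
    CREATE VIEW orders_curated AS
    SELECT id, CAST(amount AS REAL) AS amount, UPPER(TRIM(country)) AS country
    FROM orders_raw
""")

print(con.execute("SELECT * FROM orders_curated").fetchall())
```

In the ETL branch the conversions happen in the pipeline and only the prepared result is loaded; in the ELT branch the raw records land untouched and the same conversions are pushed into the target, where they can be scaled with the repository’s compute.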
One of the most stressed advantages of ELT is the possibility of having all the (business) data in the repository, though this is not a technological advantage. The same can be obtained via ETL tools, even if this might, depending on the case, involve more effort, an effort that depends on the functionality available in each tool. It’s true that ETL solutions have a narrower scope by loading only a subset of the available data, or that transformations are made before loading the data, though this depends on the scope considered while building the data warehouse or data mart and on the design of the ETL packages. Both are a matter of choice, and those choices can be traced back to business requirements or technical best practices.
Some of the advantages are context-dependent: they depend on the context in which the technologies are used and on the problems being solved. It is often held against ETL solutions that the available data are already prepared (aggregated, converted) and that new requirements will drive additional effort. On the other hand, in ELT-based solutions all the data are made available and possibly transformed further, but here too the level of transformation depends on the specific requirements. Independently of the approach used, the data are still available if needed, though further processing involves a certain effort.
Building usable and reliable data models depends on good design, and the design process is where the most important challenges reside. In theory, some think that in ETL scenarios the design is done beforehand, though that’s not necessarily true. One can pull the raw data from the source and build the data models in the target repositories.
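As a rough illustration of that last point, the sketch below loads raw sales rows untouched and only then derives a curated model inside the target; sqlite3 again stands in for the repository, and the sales_raw and monthly_sales names are hypothetical.

```python
# Sketch of designing a data model after loading: a curated view is derived
# from raw data already sitting in the target; names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE sales_raw (order_id INTEGER, customer TEXT, amount REAL, order_date TEXT)"
)
con.executemany(
    "INSERT INTO sales_raw VALUES (?, ?, ?, ?)",
    [(1, "Alpha", 100.0, "2024-01-05"),
     (2, "Alpha", 250.0, "2024-01-20"),
     (3, "Beta",   75.0, "2024-02-02")],
)

# The modelling step happens in the target, on top of the untouched raw table.
con.execute("""
    CREATE VIEW monthly_sales AS
    SELECT customer,
           substr(order_date, 1, 7) AS month,
           SUM(amount)              AS total_amount
    FROM sales_raw
    GROUP BY customer, substr(order_date, 1, 7)
""")

for row in con.execute("SELECT * FROM monthly_sales ORDER BY customer, month"):
    print(row)
```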
Data conversion and cleaning are needed under both approaches. In some scenarios it is ideal to do this upfront, minimizing the effect these processes have on the data’s usage, while in other scenarios it’s helpful to address them later in the process, with the risk that each project will address them differently. This can become an issue and should ideally be addressed by design (e.g. by building an intermediate layer) or at least organizationally (e.g. by enforcing best practices).
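One possible shape for such an intermediate layer is a single shared cleaning step that every downstream pipeline reuses, so conversions are defined once rather than per project; the clean_record function and its rules below are assumptions for illustration.

```python
# Sketch of an intermediate cleaning layer: conversions are defined once and
# reused by every downstream pipeline; the specific rules are assumptions.
from datetime import datetime

def clean_record(raw: dict) -> dict:
    """Apply the agreed-upon conversions in one place."""
    return {
        "id": int(raw["id"]),
        "country": raw["country"].strip().upper(),
        "amount": round(float(raw["amount"]), 2),
        "order_date": datetime.strptime(raw["order_date"], "%Y-%m-%d").date(),
    }

raw_rows = [
    {"id": "1", "country": " de", "amount": "125.499", "order_date": "2024-01-05"},
    {"id": "2", "country": "US ", "amount": "80",      "order_date": "2024-02-11"},
]

staged = [clean_record(r) for r in raw_rows]  # output of the intermediate layer
print(staged)
```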
Claiming that ELT is better just because the data are true (being in raw form) can be taken only as a marketing slogan. The degree of truth the data have depends on the way the data reflect the business’s processes and the way the data are maintained, while their quality is judged entirely against their intended use. Even if raw data allow more flexibility in handling the various requests, the challenges involved in processing them can be neglected only at the cost of the consequences that follow.
Looking at the cloud-based analytics and data integration technologies, they seem to allow both approaches, so building optimal solutions relies on professionals’ wisdom to make the appropriate choices.