Data Warehousing Series
As part of the ETL process, with applicability to Data Warehousing, Data Migrations, Data Integrations or similar scenarios, the extraction subprocess is responsible for preparing and implementing the logic required to extract the data from the various source systems at the required level of detail. The extraction is typically based on SQL queries, as long as one deals with relational databases or any OLEDB- or ODBC-based data repositories, including flat or MS Office files.
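As a minimal sketch, an extraction query against a relational source could look as follows; the schema, table and column names are purely hypothetical, and the date filter only illustrates restricting the extraction to the required level of detail:

-- Minimal extraction sketch against a hypothetical relational source;
-- schema, table and column names are illustrative only.
SELECT ord.OrderId
     , ord.CustomerId
     , ord.OrderDate
     , ord.TotalAmount
  FROM dbo.SalesOrders ord
 WHERE ord.OrderDate >= '20240101';   -- restrict to the required subset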
One can consider the preparation of the extraction logic as a separate design subprocess of the targeted solution. Even if high-level design decisions are made at that level, the low-level design needs to be addressed at the ETL package level. As part of this process one identifies the source of the data in terms of systems, tables and attributes to be imported, as well as the joins, business and transformation rules that need to be applied to the data. This can involve reengineering the logic from the source system(s), as well as data profiling, discovery or exploration activities.
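Such profiling or discovery activities can be sketched with simple queries like the one below, run against a hypothetical source table to understand volumes, value distributions and data quality before fixing the extraction logic:

-- Profiling sketch (hypothetical table and columns): row counts, distinct
-- values and missing keys help decide which attributes to import and
-- which business rules need to be applied.
SELECT COUNT(*)                                               AS TotalRows
     , COUNT(DISTINCT ord.OrderStatus)                        AS DistinctStatuses
     , SUM(CASE WHEN ord.CustomerId IS NULL THEN 1 ELSE 0 END) AS MissingCustomers
  FROM dbo.SalesOrders ord;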
A common practice is to copy the source tables 1:1 into the solution, possibly considering only the needed attributes to minimize the necessary space, loading time and the content's complexity, even if this adds more effort in the design phase to identify only the needed attributes. If further attributes are identified at a later stage, the packages need to be modified accordingly. If the data volume or the number of unnecessary attributes is negligible, copying the table 1:1 could prove to be the best strategy.
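A 1:1 copy reduced to the needed attributes could be sketched as below; in an SSIS-like setup this would typically be the source query of the package, and both the staging table and the column list are assumptions made for illustration:

-- 1:1 copy of a source table into a staging table, restricted to the
-- attributes actually needed downstream (hypothetical names).
INSERT INTO stg.Customers (CustomerId, CustomerName, CountryCode, CreatedOn)
SELECT cst.CustomerId
     , cst.CustomerName
     , cst.CountryCode
     , cst.CreatedOn
  FROM dbo.Customers cst;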
A second approach is to model within the extraction the (business) entity as designed within the source system. For example, the entity could be split over multiple tables for design or other reasons. Thus, the extraction query attempts to model the entity. This approach reduces to some degree the number of tables in the targeted solution, as well as the number of ETL packages involved, while providing a clearer depiction of the entities involved.
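For instance, if the source system splits a customer entity over a main table and an extension table, the extraction query might reassemble the entity as sketched below; the two tables and their columns are hypothetical:

-- Modeling one entity split over two source tables; the LEFT JOIN keeps
-- customers that have no extension record (hypothetical schema).
SELECT cst.CustomerId
     , cst.CustomerName
     , cst.CountryCode
     , ext.CreditLimit
     , ext.PaymentTerms
  FROM dbo.Customers         cst
  LEFT JOIN dbo.CustomersExt ext
    ON ext.CustomerId = cst.CustomerId;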
A third approach is to extract the data as needed by the target system, possibly as a mix of master and transaction data, which can easily lead to data redundancy with different timeliness and all the consequences resulting from it. This approach is usually found in solutions which require fast data availability at the expense of design.
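Such an extraction, mixing transaction data with the master data it references, might look like the following sketch; repeating customer and product attributes on every order line is exactly the kind of redundancy mentioned above (the schema is hypothetical):

-- Target-driven extraction mixing transaction data (order lines) with
-- master data (customer, product); redundancy is accepted in exchange
-- for fast availability (hypothetical schema).
SELECT lin.OrderId
     , lin.LineNo
     , lin.Quantity
     , lin.LineAmount
     , cst.CustomerName     -- master data repeated on every line
     , prd.ProductName
  FROM dbo.SalesOrderLines lin
  JOIN dbo.SalesOrders     ord ON ord.OrderId    = lin.OrderId
  JOIN dbo.Customers       cst ON cst.CustomerId = ord.CustomerId
  JOIN dbo.Products        prd ON prd.ProductId  = lin.ProductId;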
Unfortunately, there can be design constraints or deliberate choices that lead to a mix of these approaches. While the impact caused by mixing the first two approaches is minimal, the third approach can cause more challenges, though it might be a small price to pay as long as the considered data are disconnected from other data.
To reduce data redundancy, it's recommended to set as a goal the creation of a unique source of facts, which can be obtained by minimizing as much as possible the overlaps between tables, respectively entities. Ideally there should be no overlaps. On the other side, overlaps can be acceptable when the same data are available in several systems and the solution requires all the data to be available.
While the above approaches consider the vertical partitioning of the data, there can also be horizontal partitioning needs, especially when only a subset of the data is needed or when the data must be partitioned based on a set of values. In addition, one might be forced to include transformation rules directly in the extraction logic, for example to handle conversion issues or to minimize certain design overhead early in the process. In practice it makes sense to link such choices to business rules and document them accordingly.
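A sketch combining horizontal partitioning with a conversion rule handled directly in the extraction could look as follows; the filter values, the legacy text column and the default currency are illustrative assumptions (T-SQL):

-- Horizontal partitioning via a filter on a set of values, plus a
-- conversion rule applied directly in the extraction (hypothetical
-- schema); TRY_CAST returns NULL for values that cannot be converted.
SELECT ord.OrderId
     , ord.CompanyCode
     , TRY_CAST(ord.LegacyAmount AS decimal(19, 4)) AS OrderAmount
     , COALESCE(ord.CurrencyCode, 'EUR')            AS CurrencyCode
  FROM dbo.SalesOrders ord
 WHERE ord.CompanyCode IN ('1000', '2000')   -- only the needed subset
   AND ord.OrderDate  >= '20240101';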