As part of the ETL process, the Load subprocess is responsible for loading the data into the destination table(s). It covers the final steps of the data pipeline and, in most cases, the structure of the destination matches the definition of the query used for data extraction, though this also depends on the transformations used in the solution.
A commonly used approach is to dump the data into an intermediary table in the staging area, a table with no constraints that matches only the data types of the source. Once the data are loaded, they are further copied into the production table. This approach minimizes the unavailability of the production table, as the load from an external data source normally takes longer than copying the data within the same database or instance. That might not be the case when the data are available in the same data center; however, loading the data first into a staging table facilitates troubleshooting and testing. This approach also allows dropping the indexes on the production table before loading the data and recreating them afterwards, which in practice proves to be an efficient way of improving the performance of data loads.
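As a rough sketch of this pattern (Python with sqlite3, all table, column and index names being purely hypothetical), the snippet below dumps the extracted records into a constraint-free staging table, drops the index on the production table, copies the data over and recreates the index afterwards:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the target database

# Staging table: no constraints, only the data types of the source query
conn.execute("CREATE TABLE stg_Orders (OrderId INTEGER, Customer TEXT, Amount REAL)")

# Production table with an index that would slow down large inserts
conn.execute("CREATE TABLE prod_Orders (OrderId INTEGER PRIMARY KEY, Customer TEXT, Amount REAL)")
conn.execute("CREATE INDEX IX_Orders_Customer ON prod_Orders (Customer)")

# 1) Dump the extracted data into the staging table (small sample in place of the source)
extracted = [(1, "Contoso", 120.50), (2, "Fabrikam", 80.00)]
conn.executemany("INSERT INTO stg_Orders VALUES (?, ?, ?)", extracted)

# 2) Drop the index, copy staging into production, recreate the index afterwards
conn.execute("DROP INDEX IX_Orders_Customer")
conn.execute("INSERT INTO prod_Orders SELECT OrderId, Customer, Amount FROM stg_Orders")
conn.execute("CREATE INDEX IX_Orders_Customer ON prod_Orders (Customer)")
conn.execute("DELETE FROM stg_Orders")  # staging can be emptied once the copy succeeded
conn.commit()
```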
In general, it's recommended to import the data 1:1 relative to the source query, though the transformations used can increase or decrease the number of attributes considered. The recommendation applies also to cases in which data come from different sources, primarily to keep the pipelines separate, as systems can have different refresh requirements and other constraints.
One can consider adding a timestamp reflecting the refresh date and, where applicable, additional metadata (e.g. an identifier for the source system, a unique identifier for the record). The timestamp is especially important when the data are imported incrementally - only the data created since the last load are loaded. Except for the unique identifier, these metadata can also be saved in a separate table, either with the same granularity as the load table (1:1) or with one record for each load per table and system, the load table storing a reference to the respective record. There are seldom strong arguments for the former approach, while the latter works well when the metadata are used only for auditing purposes. If the metadata are needed in further data processing and performance is important, then the metadata can be stored directly in the load table(s).
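A minimal sketch of the latter variant, assuming hypothetical table and column names: a load-log table keeps one record per load, table and source system, the load table references it via a LoadId, and the timestamp of the last load drives the incremental extraction:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # stand-in for the target database
conn.executescript("""
CREATE TABLE LoadLog (               -- one record per load, table and source system
    LoadId        INTEGER PRIMARY KEY,
    TableName     TEXT,
    SourceSystem  TEXT,
    LoadTimestamp TEXT
);
CREATE TABLE prod_Orders (           -- load table referencing the audit record
    OrderId INTEGER,
    Amount  REAL,
    LoadId  INTEGER REFERENCES LoadLog(LoadId)
);
""")

def last_load_timestamp(conn, table, system):
    # Timestamp of the last load; fall back to an early date for the initial load
    row = conn.execute(
        "SELECT MAX(LoadTimestamp) FROM LoadLog WHERE TableName = ? AND SourceSystem = ?",
        (table, system)).fetchone()
    return row[0] or "1900-01-01T00:00:00"

def load_increment(conn, rows, system):
    # Register the load and stamp every incoming record with its LoadId
    now = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO LoadLog (TableName, SourceSystem, LoadTimestamp) VALUES (?, ?, ?)",
        ("prod_Orders", system, now))
    load_id = cur.lastrowid
    conn.executemany("INSERT INTO prod_Orders VALUES (?, ?, ?)",
                     [(order_id, amount, load_id) for order_id, amount in rows])
    conn.commit()

# Only records created since this timestamp would be extracted upstream
since = last_load_timestamp(conn, "prod_Orders", "ERP1")
load_increment(conn, [(1, 120.50), (2, 80.00)], "ERP1")
```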
A special approach is taken by the Data Vault methodology for data warehousing, which seems to be gaining acceptance, especially for addressing the various compliance requirements for tracking changes in records at the most granular level. To achieve this, the fact and dimension tables are split into several tables: hub tables store the business keys together with load metadata, link tables store the relationships between business keys, satellite tables store the descriptions of the business keys (the other attributes except the business key), and reference tables store the dropdown values. Besides this splitting of the tables, several other constraints apply. Spreading the data over multiple tables can increase the overall complexity and come with performance penalties, as more tables need to be joined, however it might be the price to pay if traceability and auditability are a must.
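The DDL below (again Python with sqlite3; the names and columns are only illustrative, not the full Data Vault standard) sketches how a customer entity could be split into a hub, a satellite, a link and a reference table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: business key plus load metadata
CREATE TABLE HubCustomer (
    CustomerHashKey TEXT PRIMARY KEY,
    CustomerNumber  TEXT,            -- the business key
    LoadTimestamp   TEXT,
    RecordSource    TEXT
);
-- Reference: dropdown values
CREATE TABLE RefCountry (
    CountryCode TEXT PRIMARY KEY,
    CountryName TEXT
);
-- Satellite: descriptive attributes of the business key, tracked over time
CREATE TABLE SatCustomer (
    CustomerHashKey TEXT REFERENCES HubCustomer(CustomerHashKey),
    LoadTimestamp   TEXT,
    CustomerName    TEXT,
    CountryCode     TEXT REFERENCES RefCountry(CountryCode),
    PRIMARY KEY (CustomerHashKey, LoadTimestamp)
);
-- Link: relationship between business keys (e.g. customer and order)
CREATE TABLE LinkCustomerOrder (
    LinkHashKey     TEXT PRIMARY KEY,
    CustomerHashKey TEXT REFERENCES HubCustomer(CustomerHashKey),
    OrderHashKey    TEXT,
    LoadTimestamp   TEXT,
    RecordSource    TEXT
);
""")
```

Reconstructing the current view of a customer then requires joining the hub with its satellite and reference tables, which is where the performance penalty mentioned above comes from.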
There are scenarios in which the requirements for the ETL packages are driven by the target (load) tables, whose format is already given; one thus needs to fit the data into the existing tables or to extend the respective tables to accommodate more attributes. This is the case for load tables storing data from multiple systems with a similar purpose (e.g. financial data from different ERP systems needed for consolidations).
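A brief sketch of such a case, with purely hypothetical field names: two ERP extracts delivering the same facts under different attribute names are mapped into the same pre-existing consolidation table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The target format is given and shared by all source systems
conn.execute("""CREATE TABLE FinConsolidation (
    SourceSystem TEXT, AccountNo TEXT, PostingDate TEXT, Amount REAL)""")

# Each ERP delivers the same facts under different field names;
# the respective pipelines map them onto the existing structure
erp1_rows = [{"Konto": "4000", "Datum": "2024-01-31", "Betrag": 1500.0}]
erp2_rows = [{"acct": "4000", "post_date": "2024-01-31", "amt": 2300.0}]

conn.executemany("INSERT INTO FinConsolidation VALUES (?, ?, ?, ?)",
    [("ERP1", r["Konto"], r["Datum"], r["Betrag"]) for r in erp1_rows])
conn.executemany("INSERT INTO FinConsolidation VALUES (?, ?, ?, ?)",
    [("ERP2", r["acct"], r["post_date"], r["amt"]) for r in erp2_rows])
conn.commit()
```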