As part of the ETL process, the Load subprocess is responsible for loading the data into the destination table(s). It covers the final steps of the data pipeline and, in most cases, the structure of the destination matches the definition of the query used for data extraction, though this also depends on the transformations used in the solution.
A commonly used approach is to dump the data into an intermediary table in the staging area, a table with no constraints that matches only the data types of the source. Once the data are loaded, they are further copied into the production table. This approach minimizes the unavailability of the production table, as the load from an external data source normally takes longer than copying the data within the same database or instance. That might not be the case when the data are available in the same data center; however, loading the data first into a staging table facilitates troubleshooting and testing. This approach also allows dropping the indexes on the production table before loading the data and recreating them afterwards, which in practice proves to be an efficient way of improving the performance of data loads.
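As a rough sketch of this pattern (Python with sqlite3, all table, column and index names being purely hypothetical), the snippet below dumps the extracted records into a constraint-free staging table, drops the index on the production table, copies the data over and recreates the index afterwards:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the target database

# Staging table: no constraints, only the data types of the source query
conn.execute("CREATE TABLE stg_Orders (OrderId INTEGER, Customer TEXT, Amount REAL)")

# Production table with an index that would slow down large inserts
conn.execute("CREATE TABLE prod_Orders (OrderId INTEGER PRIMARY KEY, Customer TEXT, Amount REAL)")
conn.execute("CREATE INDEX IX_Orders_Customer ON prod_Orders (Customer)")

# 1) Dump the extracted data into the staging table (small sample in place of the source)
extracted = [(1, "Contoso", 120.50), (2, "Fabrikam", 80.00)]
conn.executemany("INSERT INTO stg_Orders VALUES (?, ?, ?)", extracted)

# 2) Drop the index, copy staging into production, recreate the index afterwards
conn.execute("DROP INDEX IX_Orders_Customer")
conn.execute("INSERT INTO prod_Orders SELECT OrderId, Customer, Amount FROM stg_Orders")
conn.execute("CREATE INDEX IX_Orders_Customer ON prod_Orders (Customer)")
conn.execute("DELETE FROM stg_Orders")  # staging can be emptied once the copy succeeded
conn.commit()
```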
In general, it's recommended to import the data 1:1 relative to the source query, though the transformations used can increase or decrease the number of attributes considered. The recommendation applies also to cases in which data come from different sources, primarily to keep the pipelines separate, as systems can have different refresh requirements and other constraints.
One can consider adding a timestamp reflecting the refresh date and, where applicable, additional metadata (e.g. an identifier for the source system, a unique identifier for the record). The timestamp is especially important when the data are imported incrementally - only the data created since the last load are loaded. Except for the unique identifier, these metadata can also be saved in a separate table, either with the same granularity as the load table (1:1) or with one record for each load per table and system, the load table storing a reference to the respective record. There are seldom strong arguments for the former approach, while the latter works well when the metadata are used only for auditing purposes. If the metadata are needed in further data processing and performance is important, then the metadata can be stored directly in the load table(s).
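A minimal sketch of the latter variant, assuming hypothetical table and column names: a load-log table keeps one record per load, table and source system, the load table references it via a LoadId, and the timestamp of the last load drives the incremental extraction:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # stand-in for the target database
conn.executescript("""
CREATE TABLE LoadLog (               -- one record per load, table and source system
    LoadId        INTEGER PRIMARY KEY,
    TableName     TEXT,
    SourceSystem  TEXT,
    LoadTimestamp TEXT
);
CREATE TABLE prod_Orders (           -- load table referencing the audit record
    OrderId INTEGER,
    Amount  REAL,
    LoadId  INTEGER REFERENCES LoadLog(LoadId)
);
""")

def last_load_timestamp(conn, table, system):
    # Timestamp of the last load; fall back to an early date for the initial load
    row = conn.execute(
        "SELECT MAX(LoadTimestamp) FROM LoadLog WHERE TableName = ? AND SourceSystem = ?",
        (table, system)).fetchone()
    return row[0] or "1900-01-01T00:00:00"

def load_increment(conn, rows, system):
    # Register the load and stamp every incoming record with its LoadId
    now = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO LoadLog (TableName, SourceSystem, LoadTimestamp) VALUES (?, ?, ?)",
        ("prod_Orders", system, now))
    load_id = cur.lastrowid
    conn.executemany("INSERT INTO prod_Orders VALUES (?, ?, ?)",
                     [(order_id, amount, load_id) for order_id, amount in rows])
    conn.commit()

# Only records created since this timestamp would be extracted upstream
since = last_load_timestamp(conn, "prod_Orders", "ERP1")
load_increment(conn, [(1, 120.50), (2, 80.00)], "ERP1")
```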
A special approach is taken by the Data Vault methodology for data warehousing, which seems to be gaining acceptance, especially for addressing the various compliance requirements for tracking changes in records at the most granular level. To achieve this, the fact and dimension tables are split into several tables: hub tables store the business keys together with load metadata, link tables store the relationships between business keys, satellite tables store the descriptions of the business keys (the other attributes except the business key), and reference tables store the dropdown values. Besides this splitting of the tables, several other constraints apply. Spreading the data over multiple tables can increase the overall complexity and come with performance penalties, as more tables need to be joined, however it might be the price to pay if traceability and auditability are a must.
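The DDL below (again Python with sqlite3; the names and columns are only illustrative, not the full Data Vault standard) sketches how a customer entity could be split into a hub, a satellite, a link and a reference table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: business key plus load metadata
CREATE TABLE HubCustomer (
    CustomerHashKey TEXT PRIMARY KEY,
    CustomerNumber  TEXT,            -- the business key
    LoadTimestamp   TEXT,
    RecordSource    TEXT
);
-- Reference: dropdown values
CREATE TABLE RefCountry (
    CountryCode TEXT PRIMARY KEY,
    CountryName TEXT
);
-- Satellite: descriptive attributes of the business key, tracked over time
CREATE TABLE SatCustomer (
    CustomerHashKey TEXT REFERENCES HubCustomer(CustomerHashKey),
    LoadTimestamp   TEXT,
    CustomerName    TEXT,
    CountryCode     TEXT REFERENCES RefCountry(CountryCode),
    PRIMARY KEY (CustomerHashKey, LoadTimestamp)
);
-- Link: relationship between business keys (e.g. customer and order)
CREATE TABLE LinkCustomerOrder (
    LinkHashKey     TEXT PRIMARY KEY,
    CustomerHashKey TEXT REFERENCES HubCustomer(CustomerHashKey),
    OrderHashKey    TEXT,
    LoadTimestamp   TEXT,
    RecordSource    TEXT
);
""")
```

Reconstructing the current view of a customer then requires joining the hub with its satellite and reference tables, which is where the performance penalty mentioned above comes from.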
There are scenarios in which the requirements for the ETL packages are driven by the target (load) tables, whose format is already given; one thus needs to fit the data into the existing tables or to extend the respective tables to accommodate more attributes. This is the case for load tables storing data from multiple systems with a similar purpose (e.g. financial data from different ERP systems needed for consolidations).
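A brief sketch of such a case, with purely hypothetical field names: two ERP extracts delivering the same facts under different attribute names are mapped into the same pre-existing consolidation table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The target format is given and shared by all source systems
conn.execute("""CREATE TABLE FinConsolidation (
    SourceSystem TEXT, AccountNo TEXT, PostingDate TEXT, Amount REAL)""")

# Each ERP delivers the same facts under different field names;
# the respective pipelines map them onto the existing structure
erp1_rows = [{"Konto": "4000", "Datum": "2024-01-31", "Betrag": 1500.0}]
erp2_rows = [{"acct": "4000", "post_date": "2024-01-31", "amt": 2300.0}]

conn.executemany("INSERT INTO FinConsolidation VALUES (?, ?, ?, ?)",
    [("ERP1", r["Konto"], r["Datum"], r["Betrag"]) for r in erp1_rows])
conn.executemany("INSERT INTO FinConsolidation VALUES (?, ?, ?, ?)",
    [("ERP2", r["acct"], r["post_date"], r["amt"]) for r in erp2_rows])
conn.commit()
```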