27 December 2020

Data Warehousing: ETL (The Extract Subprocess)

 

Data Warehousing
Data Warehousing Series

As part of the ETL process with applicability to Data Warehousing, Data Migrations, Data Integrations or similar scenarios the extraction subprocess is responsible for preparing and implementing the logic required to extract the data from the various source systems at the required level of detail. The extraction is done typically based on SQL queries as long one deals with relational databases or any OLEDB or ODBC-based data repositories including flat or MS Office files.

One can consider the preparation of the extraction logic as separate design subprocess of the targeted solution. Even if high-level design decisions are considered at the respective level, the low-level design needs to be considered at ETL package level. As part of the process are identified the source of the data in terms of system, tables and attributes to be imported, as well the joins, business and transformation rules that need to be applied on the data. This can involve reengineering the logic from the source system(s) as well data profiling, discovery or exploration activities.

A common practice is to copy the source tables 1:1 into the solution, eventually by considering only the needed attributes to minimize the necessary space, loading time and content’s complexity, even if this would add more effort into the design phase to identify only the needed attributes. If further attributes are identified at a later stage, the packages need to be modified accordingly. If the data volume or the number of unnecessary attributes is neglectable, copying the table 1:1 could prove to be the best strategy.

A second approach is to model within the extraction the (business) entity as designed within the source system. For example, the entity could be split over multiple tables from design or other purposes. Thus, the extraction query will attempt modeling the entity. This approach reduces to some degree the number of tables from the targeted solution, as well the number of ETL packages involved, while providing a clear depiction of the entities involved.

A third approach is to extract the data as needed by the target system, eventually as a mix between master and transaction data, fact which could easily lead to data redundancy with different timeliness and all the consequences resulting from this. This approach is usually met in solutions which require fast data availability in the detriment of design.

Unfortunately, there can be design constraints or choice considerations that could lead to a mix between these approaches. If the impact caused by the mix between the first two approaches is minimal, the third approach can cause more challenges, though it might be a small price to pay as long the considered data are disconnected from other data.

To reduce the redundancy of data, it’s recommended to consider as goal creating a unique source of facts, which can be obtained by minimizing as much as possible the overlaps between tables, respectively entities. Ideally there should be no overlaps. On the other sides the overlaps can be acceptable when the same data are available in more systems and the solution requires all the data to be available.

If the above approaches consider the vertical partitioning of the data, there can be also horizontal partitioning needs especially when a subset of the data is needed or when is needed to partition the data based on a set of values. In addition, one might be forced to include also transformation rules directly into the extraction logic, for example to handle conversion issues or minimize certain design overhead early in the process. In practice it makes sense to link such choices to business rules and document them accordingly.

Previous Post <<||>> Next Post

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.