Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.
Last updated: 21-Jan-2025
[Data Warehousing] Extract, Transform, Load (ETL)
- {def} automated process that takes raw data, extracts the data required for further processing, transforms it into a format that addresses business needs, and loads it into the destination repository (e.g. data warehouse)
- includes
- the transportation of data
- overlaps between stages
- changes in flow
- due to
- new technologies
- changing requirements
- changes in scope
- troubleshooting
- due to data mismatches
- {step} extraction
- data is extracted directly from the source systems or intermediate repositories
- data may be made available in intermediate repositories when direct access to the source system is not possible
- ⇐ this approach can add a layer of complexity
- {substep} data validation
- an automated process that validates whether data pulled from sources has the expected values
- relies on a validation engine
- rejects data if it falls outside the validation rules
- analyzes rejected records on an ongoing basis to
- identify what went wrong
- correct the source data
- modify the extraction to resolve the problem in the next batches
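- {example} a minimal sketch of such a validation engine in Python; the field names and rules below are illustrative assumptions, not a prescribed schema
```python
from datetime import datetime
from typing import Any, Callable

def _is_iso_date(value: Any) -> bool:
    try:
        datetime.strptime(str(value), "%Y-%m-%d")
        return True
    except ValueError:
        return False

# each rule maps a field to a predicate that returns True for valid values
VALIDATION_RULES: dict[str, Callable[[Any], bool]] = {
    "customer_id": lambda v: isinstance(v, int) and v > 0,
    "order_date": _is_iso_date,
    "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def validate(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split extracted records into accepted and rejected ones; rejected
    records keep the failed rules so they can be analyzed later and the
    source data or the extraction logic corrected."""
    accepted, rejected = [], []
    for record in records:
        failures = [field for field, rule in VALIDATION_RULES.items()
                    if not rule(record.get(field))]
        if failures:
            rejected.append({**record, "_failed_rules": failures})
        else:
            accepted.append(record)
    return accepted, rejected

if __name__ == "__main__":
    batch = [
        {"customer_id": 17, "order_date": "2025-01-15", "amount": 120.5},
        {"customer_id": -3, "order_date": "15/01/2025", "amount": 99.0},
    ]
    ok, bad = validate(batch)
    print(f"accepted={len(ok)}, rejected={len(bad)}")  # accepted=1, rejected=1
```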
- {step} transform
- transforms the data, removing extraneous or erroneous records
- applies business rules
- checks data integrity
- ensures that the data is not corrupted in the source or corrupted by ETL
- may ensure no data was dropped in previous stages
- aggregates the data if necessary
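- {example} a minimal sketch of a transform step in Python; the business rules and field names are assumptions chosen only to illustrate cleaning, an integrity check and an aggregation
```python
from collections import defaultdict

def transform(records: list[dict]) -> list[dict]:
    """Apply business rules, drop erroneous rows, check integrity and aggregate."""
    cleaned = []
    for r in records:
        # business rule (assumed): ignore cancelled orders and corrupt rows
        if r.get("status") == "cancelled" or r.get("amount") is None:
            continue
        cleaned.append({**r, "amount": round(float(r["amount"]), 2)})

    # integrity check: no row that should have been kept was silently dropped
    expected = sum(1 for r in records
                   if r.get("status") != "cancelled" and r.get("amount") is not None)
    assert len(cleaned) == expected, "rows lost during transformation"

    # aggregate the data, e.g. total amount per customer
    totals: dict = defaultdict(float)
    for r in cleaned:
        totals[r["customer_id"]] += r["amount"]
    return [{"customer_id": c, "total_amount": round(t, 2)} for c, t in totals.items()]
```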
- {step} load
- {substep} store the data into a staging layer
- transformed data are not loaded directly into the target but staged in an intermediate layer (e.g. a database)
- {advantage} makes it easier to roll back if something goes wrong
- {advantage} allows developing the logic iteratively and publishing the data only when needed
- {advantage} can be used to generate audit reports for
- regulatory compliance
- diagnosing and repairing data problems
- modern ETL processes can perform transformations in place, instead of in staging areas
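- {example} a sketch of loading a transformed batch into a staging table before publishing; sqlite3 is used only to keep the example self-contained, and the stg_sales table is an assumption
```python
import sqlite3

def load_to_staging(conn: sqlite3.Connection, rows: list) -> int:
    """Stage one batch; the returned row count can feed audit reports, and the
    batch can be discarded (rolled back) before anything touches the target."""
    conn.execute("""CREATE TABLE IF NOT EXISTS stg_sales (
                        customer_id INTEGER, total_amount REAL)""")
    conn.execute("DELETE FROM stg_sales")  # staging holds only the current batch
    conn.executemany("INSERT INTO stg_sales VALUES (?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM stg_sales").fetchone()[0]
```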
- {substep} publish the data to the target
- loads the data into the target table(s)
- {scenario} the existing data are overwritten every time the ETL pipeline loads a new batch
- this might happen daily, weekly, or monthly
- {scenario} add new data without overriding
- a timestamp can indicate which data is new
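- {example} a sketch of the two publishing scenarios (full overwrite vs. incremental append), continuing the staging sketch above; the fact_sales table and the load_ts column are assumptions
```python
import sqlite3
from datetime import datetime, timezone

TARGET_DDL = """CREATE TABLE IF NOT EXISTS fact_sales (
                    customer_id INTEGER, total_amount REAL, load_ts TEXT)"""

def publish_full_refresh(conn: sqlite3.Connection) -> None:
    """Scenario 1: the existing data are overwritten by the new batch."""
    conn.execute(TARGET_DDL)
    conn.execute("DELETE FROM fact_sales")
    _append_staged_batch(conn)

def publish_incremental(conn: sqlite3.Connection) -> None:
    """Scenario 2: the new batch is appended; load_ts marks it as new."""
    conn.execute(TARGET_DDL)
    _append_staged_batch(conn)

def _append_staged_batch(conn: sqlite3.Connection) -> None:
    load_ts = datetime.now(timezone.utc).isoformat()
    conn.execute("INSERT INTO fact_sales "
                 "SELECT customer_id, total_amount, ? FROM stg_sales", (load_ts,))
    conn.commit()
```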
- {recommendation} prevent the loading process from erroring out due to disk space and performance limitations
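- {example} one possible way (an assumption, not a prescribed approach) to keep the load within memory and performance limits is to write the data in fixed-size batches instead of one bulk insert, reusing the stg_sales table from the staging sketch
```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator

def batched(rows: Iterable[tuple], size: int) -> Iterator[list]:
    """Yield the rows in chunks of at most `size` elements."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

def load_in_batches(conn: sqlite3.Connection, rows: Iterable[tuple],
                    batch_size: int = 10_000) -> None:
    for chunk in batched(rows, batch_size):
        conn.executemany("INSERT INTO stg_sales VALUES (?, ?)", chunk)
        conn.commit()  # committing per batch keeps transactions and logs small
```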
- {approach} building an ETL infrastructure
- involves integrating data from one or more data sources and testing the overall processes to ensure the data is processed correctly
- recurring process
- e.g. data used for reporting
- one-time process
- e.g. data migration
- may involve
- multiple source or destination systems
- different types of data
- e.g. reference, master and transactional data
- ⇐ may have complex dependencies
- different levels of structure
- e.g. structured, semi-structured, unstructured
- different data formats
- data of different quality
- different ownership
- {recommendation} consider ETL best practices
- {best practice} define requirements upfront in a consolidated and consistent manner
- allows setting clear expectations, consolidating the requirements, estimating the effort and costs, and obtaining sign-off
- the requirements may involve all the aspects of the process
- e.g. data extraction, data transformation, standard formatting, etc.
- {best practice} define a high-level strategy
- allows defining the road ahead, the risks and other aspects
- provides transparency
- this may be part of a broader strategy that can be referenced
- {best practice} align the requirements and the various aspects to the strategies existing in the organization
- allows consolidating the various efforts and making sure that the objectives, goals and requirements are aligned
- e.g. IT, business, Information Security, Data Management strategies
- {best practice} define the scope upfront
- helps to better estimate the effort and validate the outcomes
- even if the scope may change over time, this provides transparency and can be used as a basis for the time and cost estimations
- {best practice} manage the effort as a project and use a suitable Project Management methodology
- allows applying structured, well-established PM practices
- it might be appropriate to adapt the methodology to the project's particularities
- {best practice} convert data to standard formats to standardize data processing
- allows reducing the volume of issues resulting from data type mismatches
- applies mainly to dates, numbers or other values for which standard formats can be defined
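- {example} a sketch of converting dates and numbers to standard formats; the list of accepted source formats is an assumption and would need to reflect the actual sources
```python
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y", "%d.%m.%Y")

def to_iso_date(value: str) -> str:
    """Normalize a date string to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

def to_decimal(value: str) -> float:
    """Normalize numbers like '1.234,56' or '1,234.56' to a float."""
    v = value.strip().replace(" ", "")
    if "," in v and "." in v:
        # whichever separator comes last is treated as the decimal separator
        if v.rfind(",") > v.rfind("."):
            v = v.replace(".", "").replace(",", ".")
        else:
            v = v.replace(",", "")
    elif "," in v:
        v = v.replace(",", ".")
    return float(v)

if __name__ == "__main__":
    print(to_iso_date("15/01/2025"))  # 2025-01-15
    print(to_decimal("1.234,56"))     # 1234.56
```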
- {best practice} clean the data in the source systems, when cost-effective
- helps to reduce the overall effort, especially when this is done in advance
- ideally, this should be based on the scope
- {best practice} define and enforce data ownership
- allows enforcing clear responsibilities across the various processes
- helps to reduce the overall effort
- {best practice} document data dependencies
- document the dependencies existing in the data at the various levels (e.g. between reference, master and transactional data), as sketched below
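- {example} a sketch of documenting data dependencies in code and deriving a load order from them; the entity names are illustrative assumptions
```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# each entity is mapped to the entities it depends on
DEPENDENCIES = {
    "dim_currency": set(),                           # reference data
    "dim_product": set(),                            # reference data
    "dim_customer": {"dim_currency"},                # master data
    "fact_sales": {"dim_customer", "dim_product"},   # transactional data
}

# a valid load order lists every entity after all of its dependencies
load_order = list(TopologicalSorter(DEPENDENCIES).static_order())
print(load_order)  # e.g. ['dim_currency', 'dim_product', 'dim_customer', 'fact_sales']
```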
- {best practice} log the data movement from source(s) to destination(s) in terms of data volume (see the sketch below)
- provides transparency into the data movement process
- helps to identify gaps in the data or broader issues
- can be used for troubleshooting and understanding the overall data growth
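- {example} a sketch of logging the data volumes moved from source to target; the audit table layout and names are assumptions
```python
import sqlite3
from datetime import datetime, timezone

def log_data_movement(conn: sqlite3.Connection, source: str, target: str,
                      rows_read: int, rows_written: int) -> None:
    """Record one movement of data; differences between read and written
    row counts point at dropped or rejected data."""
    conn.execute("""CREATE TABLE IF NOT EXISTS etl_audit_log (
                        run_ts TEXT, source TEXT, target TEXT,
                        rows_read INTEGER, rows_written INTEGER)""")
    conn.execute("INSERT INTO etl_audit_log VALUES (?, ?, ?, ?, ?)",
                 (datetime.now(timezone.utc).isoformat(),
                  source, target, rows_read, rows_written))
    conn.commit()
    if rows_read != rows_written:
        print(f"warning: {rows_read - rows_written} row(s) read from {source} "
              f"were not written to {target}")
```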
- {recommendation} consider proven systems, architectures and methodologies
- helps to minimize the overall effort and costs associated with the process