14 January 2010

Data Management: Data Cleansing


Even if Data Cleansing or Data Scrubbing is often considered synonymous with Data Cleaning, it actually encompasses both Data Validation and Data Cleaning: Data Validation is the process of checking the quality of a given set of data against a set of rules (referred to as data quality rules), while Data Cleaning, or Data Correction, is the process of correcting the data quality issues identified during Data Validation. Data Cleansing consists in general of the following steps:

Step 1: Defining the scope

Defining the scope comes down to identifying the data elements, attributes and records that need to be cleaned. If the intent is only to improve overall data quality, the focus will be mainly on the data elements and attributes perceived as having low data quality. For a data conversion, the elements in scope are defined based on the requirements of the destination system, while the attributes in scope are derived from the overall business requirements; this requires a data mapping exercise between the destination system on one side and the legacy system(s) and other data sources (e.g. Excel or MS Access files) on the other side, as sketched below.
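
To make the mapping exercise more concrete, here is a minimal sketch in Python; the system names, attribute names and sample values are only illustrative assumptions, not taken from a real implementation.

# Hypothetical attribute mapping between the destination system and a legacy
# system plus an Excel source; all names and values are illustrative.
CUSTOMER_MAPPING = {
    # destination attribute: (source, source attribute)
    "CustomerNumber": ("LegacyERP", "CUST_NO"),
    "CustomerName":   ("LegacyERP", "CUST_NAME"),
    "Country":        ("LegacyERP", "CNTRY_CD"),
    "PaymentTerms":   ("Excel",     "Payment Terms"),
}

def map_record(source_record: dict, mapping: dict = CUSTOMER_MAPPING) -> dict:
    """Project a legacy record onto the destination attributes."""
    return {dest: source_record.get(src_attr)
            for dest, (source, src_attr) in mapping.items()}

print(map_record({"CUST_NO": "100045", "CUST_NAME": "Contoso", "CNTRY_CD": "DE"}))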

From case to case, the records in scope for cleansing might be all the records available in the legacy system(s) for a given set of data elements and attributes, or only a fraction of them; usually the focus is on the active and open records, and eventually on some historical records. Active records are those not cancelled or disabled, be they master or transactional data; open records are the (transactional) records not marked as closed, i.e. records on which an action is still pending (e.g. a pending Invoice Receipt or Payment); historical records are the already closed transactional data (e.g. Purchase Orders, Sales Orders, etc.). A possible filter over these categories is sketched below.
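
The following sketch filters a record set down to the active and open records; the record structure and the status values are assumptions made for the sake of the example.

# Illustrative scoping filter over active/open/historical records
legacy_records = [
    {"Id": 1, "Kind": "Master",        "Status": "Active"},
    {"Id": 2, "Kind": "Master",        "Status": "Disabled"},
    {"Id": 3, "Kind": "Transactional", "Status": "Open"},
    {"Id": 4, "Kind": "Transactional", "Status": "Closed"},
]

def in_cleansing_scope(record: dict) -> bool:
    """Keep active master data and open transactional records."""
    if record["Status"] in ("Cancelled", "Disabled"):
        return False                            # not an active record
    if record["Kind"] == "Transactional":
        return record["Status"] != "Closed"     # only open documents
    return True

records_in_scope = [r for r in legacy_records if in_cleansing_scope(r)]
print(records_in_scope)   # records 1 and 3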

Given their importance for the business, master data are often the starting point for cleaning; they are also the data elements with the longest issue-resolution times, the transactional data being in theory easier to clean. The two types of data need to be kept synchronized, and the reference data related to the transactional data in scope needs to be loaded too.

Step 2: Defining the data quality rules and error types.

The data quality rules are defined for the attributes in scope and for the data quality dimensions considered to apply, typically data accuracy, completeness, conformity, consistency, duplication and referential integrity. The issues are eventually further categorized and prioritized. A few illustrative rules are sketched below.
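
As a minimal sketch, data quality rules can be expressed as predicates over a record, one per dimension; the attribute names and the rules themselves are assumptions chosen only to illustrate the idea.

import re

# Illustrative rules for a customer data set; each rule returns True when
# the record passes.
DATA_QUALITY_RULES = {
    "CustomerName is filled (completeness)":
        lambda r: bool(str(r.get("CustomerName", "")).strip()),
    "Country is a 2-letter ISO code (conformity)":
        lambda r: bool(re.fullmatch(r"[A-Z]{2}", str(r.get("Country", "")))),
    "CreditLimit is not negative (accuracy)":
        lambda r: r.get("CreditLimit", 0) >= 0,
}

def find_duplicates(records, key="CustomerName"):
    """Return records sharing the same normalized key value (duplication)."""
    seen, duplicates = set(), []
    for r in records:
        value = str(r.get(key, "")).strip().upper()
        if value in seen:
            duplicates.append(r)
        seen.add(value)
    return duplicates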

Step 3: Validating the data quality rules

The data quality rules are validated against the actual data sets, and the output of the rules is consolidated in reports that are then interpreted. The number of errors is reported and stored for historical trending, in case several iterations are needed for data cleaning; a possible consolidation is sketched below.
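
Assuming a rule set like the one sketched at Step 2, the error counts per rule can be collected into a simple report and kept per iteration for trending; the report structure below is only an assumption.

from datetime import date

def validate(records, rules):
    """Count, per rule, how many records fail it."""
    report = {name: 0 for name in rules}
    for record in records:
        for name, rule in rules.items():
            if not rule(record):
                report[name] += 1
    return report

# One entry per cleansing iteration, kept for historical trending
history = []

def record_run(records, rules):
    report = validate(records, rules)
    history.append({"run_date": date.today(), "errors": report})
    return report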

Step 4: Identifying the data quality issues.

The reports are reviewed by the data owners, the issues are identified and the rules validated, and further steps are performed in order to identify the root causes of the discovered issues. This usually leads to the calibration of the data quality rules, as new scenarios (e.g. exceptions, new error types) can be discovered.

Step 5: Correcting the data quality issues.

The issues are mitigated and corrected. The issues' resolution time, i.e. the time needed to correct the issues, depends directly on the volume of issues, their complexity, the number of resources allocated to the task, whether the issues are corrected in the source or on a copy of the data, and whether the cleansing is done manually or with specific software tools. Specific Data Cleansing tools can considerably decrease the volume of manual work involved in the process, so automatic and semiautomatic data cleansing methods are preferred whenever possible; a simple automatic correction is sketched below.
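
A minimal sketch of an automatic correction pass follows; the synonym table and the attributes it touches are illustrative assumptions, and anything not covered by such rules would remain for manual review.

COUNTRY_SYNONYMS = {"GERMANY": "DE", "DEUTSCHLAND": "DE", "USA": "US"}

def clean_record(record: dict) -> dict:
    """Apply simple automatic corrections to a single record."""
    cleaned = dict(record)
    # Trim stray whitespace in all text attributes
    for attr, value in cleaned.items():
        if isinstance(value, str):
            cleaned[attr] = value.strip()
    # Standardize the country against the (hypothetical) synonym table
    if isinstance(cleaned.get("Country"), str):
        country = cleaned["Country"].upper()
        cleaned["Country"] = COUNTRY_SYNONYMS.get(country, country)
    return cleaned

print(clean_record({"CustomerName": " Contoso ", "Country": "Germany"}))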

Step 6: Validating the changes against the same data quality rules.

Once the issues are corrected, the data sets are validated against the same rules used at Step 3, Steps 3 to 5 being repeated until the expected data quality level is achieved. As already highlighted in a previous post, Data Cleansing is not a one-time endeavor but an iterative process, new iterations being triggered by changes in rules or in scope, by the sequencing/scheduling of the cleansing work, or by the reporting frequency (centered or not on the important milestones). The iteration could look like the sketch below.
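
The cycle of validating, correcting and revalidating can be sketched as a simple loop; the error target, the iteration limit and the idea of applying only automatic corrections inside the loop are assumptions for illustration.

def cleansing_cycle(records, rules, clean, max_errors=0, max_iterations=5):
    """Repeat validate -> correct (Steps 3 to 5) until the error count
    meets the target or the iteration limit is reached."""
    for iteration in range(1, max_iterations + 1):
        errors = sum(1 for r in records
                       for rule in rules.values() if not rule(r))
        print(f"Iteration {iteration}: {errors} open issues")
        if errors <= max_errors:
            break                                 # expected quality reached
        records = [clean(r) for r in records]     # automatic corrections only
    return records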

Step 7: Documenting/Reporting error results and validation outcomes, if needed.

Starting with the data elements, attributes and the logic used to select the records in scope for cleaning, and ending with the data quality reports showing the issues and the improvement trends, it makes sense to document everything and summarize it in a set of best practices and lessons-learned sessions.

Step 8: Modify processes/procedures to reduce issues occurrence.

There are good chances of discovering that the errors found during validation could be avoided by modifying the existing processes, procedures or standards, or by enforcing new validation rules in the legacy systems. When the changes are complex, individual projects might even be started to address them.


