Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.
Last updated: 21-Jan-2025
[Data Warehousing] Extract, Transform, Load (ETL)
- {def} automated process that takes raw data, extracts the data required for further processing, transforms it into a format that addresses business needs, and loads it into the destination repository (e.g. data warehouse)
- includes
- the transportation of data
- overlaps between stages
- changes in flow
- due to
- new technologies
- changing requirements
- changes in scope
- troubleshooting
- due to data mismatches
- {step} extraction
- data is extracted directly from the source systems or intermediate repositories
- data may be made available in intermediate repositories when direct access to the source system is not possible
- ⇐ this approach can add a layer of complexity
- {substep} data validation
- an automated process that validates whether data pulled from sources has the expected values
- relies on a validation engine
- rejects data if it falls outside the validation rules
- analyzes rejected records on an ongoing basis to
- identify what went wrong
- correct the source data
- modify the extraction to resolve the problem in the next batches
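- {example} a minimal sketch of such a validation engine in Python; the field names and rules below are illustrative assumptions, not a prescribed schema
```python
from datetime import datetime
from typing import Any, Callable

def _is_iso_date(value: Any) -> bool:
    try:
        datetime.strptime(str(value), "%Y-%m-%d")
        return True
    except ValueError:
        return False

# each rule maps a field to a predicate that returns True for valid values
VALIDATION_RULES: dict[str, Callable[[Any], bool]] = {
    "customer_id": lambda v: isinstance(v, int) and v > 0,
    "order_date": _is_iso_date,
    "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def validate(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split extracted records into accepted and rejected ones; rejected
    records keep the failed rules so they can be analyzed later and the
    source data or the extraction logic corrected."""
    accepted, rejected = [], []
    for record in records:
        failures = [field for field, rule in VALIDATION_RULES.items()
                    if not rule(record.get(field))]
        if failures:
            rejected.append({**record, "_failed_rules": failures})
        else:
            accepted.append(record)
    return accepted, rejected

if __name__ == "__main__":
    batch = [
        {"customer_id": 17, "order_date": "2025-01-15", "amount": 120.5},
        {"customer_id": -3, "order_date": "15/01/2025", "amount": 99.0},
    ]
    ok, bad = validate(batch)
    print(f"accepted={len(ok)}, rejected={len(bad)}")  # accepted=1, rejected=1
```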
- {step} transform
- transforms the data, removing extraneous or erroneous records
- applies business rules
- checks data integrity
- ensures that the data is not corrupted in the source or corrupted by ETL
- may ensure no data was dropped in previous stages
- aggregates the data if necessary
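- {example} a minimal sketch of a transform step in Python; the business rules and field names are assumptions chosen only to illustrate cleaning, an integrity check and an aggregation
```python
from collections import defaultdict

def transform(records: list[dict]) -> list[dict]:
    """Apply business rules, drop erroneous rows, check integrity and aggregate."""
    cleaned = []
    for r in records:
        # business rule (assumed): ignore cancelled orders and corrupt rows
        if r.get("status") == "cancelled" or r.get("amount") is None:
            continue
        cleaned.append({**r, "amount": round(float(r["amount"]), 2)})

    # integrity check: no row that should have been kept was silently dropped
    expected = sum(1 for r in records
                   if r.get("status") != "cancelled" and r.get("amount") is not None)
    assert len(cleaned) == expected, "rows lost during transformation"

    # aggregate the data, e.g. total amount per customer
    totals: dict = defaultdict(float)
    for r in cleaned:
        totals[r["customer_id"]] += r["amount"]
    return [{"customer_id": c, "total_amount": round(t, 2)} for c, t in totals.items()]
```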
- {step} load
- {substep} store the data into a staging layer
- transformed data are not loaded directly into the target but staged in an intermediate layer (e.g. a database)
- {advantage} makes it easier to roll back if something goes wrong
- {advantage} allows developing the logic iteratively and publishing the data only when needed
- {advantage} can be used to generate audit reports for
- regulatory compliance
- diagnosing and repairing data problems
- modern ETL processes can perform transformations in place, instead of in staging areas
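- {example} a sketch of loading a transformed batch into a staging table before publishing; sqlite3 is used only to keep the example self-contained, and the stg_sales table is an assumption
```python
import sqlite3

def load_to_staging(conn: sqlite3.Connection, rows: list) -> int:
    """Stage one batch; the returned row count can feed audit reports, and the
    batch can be discarded (rolled back) before anything touches the target."""
    conn.execute("""CREATE TABLE IF NOT EXISTS stg_sales (
                        customer_id INTEGER, total_amount REAL)""")
    conn.execute("DELETE FROM stg_sales")  # staging holds only the current batch
    conn.executemany("INSERT INTO stg_sales VALUES (?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM stg_sales").fetchone()[0]
```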
- {substep} publish the data to the target
- loads the data into the target table(s)
- {scenario} the existing data are overwritten every time the ETL pipeline loads a new batch
- this might happen daily, weekly, or monthly
- {scenario} add new data without overriding
- a timestamp can indicate which data is new
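- {example} a sketch of the two publishing scenarios (full overwrite vs. incremental append), continuing the staging sketch above; the fact_sales table and the load_ts column are assumptions
```python
import sqlite3
from datetime import datetime, timezone

TARGET_DDL = """CREATE TABLE IF NOT EXISTS fact_sales (
                    customer_id INTEGER, total_amount REAL, load_ts TEXT)"""

def publish_full_refresh(conn: sqlite3.Connection) -> None:
    """Scenario 1: the existing data are overwritten by the new batch."""
    conn.execute(TARGET_DDL)
    conn.execute("DELETE FROM fact_sales")
    _append_staged_batch(conn)

def publish_incremental(conn: sqlite3.Connection) -> None:
    """Scenario 2: the new batch is appended; load_ts marks it as new."""
    conn.execute(TARGET_DDL)
    _append_staged_batch(conn)

def _append_staged_batch(conn: sqlite3.Connection) -> None:
    load_ts = datetime.now(timezone.utc).isoformat()
    conn.execute("INSERT INTO fact_sales "
                 "SELECT customer_id, total_amount, ? FROM stg_sales", (load_ts,))
    conn.commit()
```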
- {recommendation} prevent the loading process from erroring out due to disk space and performance limitations
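- {example} one possible way (an assumption, not a prescribed approach) to keep the load within memory and performance limits is to write the data in fixed-size batches instead of one bulk insert, reusing the stg_sales table from the staging sketch
```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator

def batched(rows: Iterable[tuple], size: int) -> Iterator[list]:
    """Yield the rows in chunks of at most `size` elements."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

def load_in_batches(conn: sqlite3.Connection, rows: Iterable[tuple],
                    batch_size: int = 10_000) -> None:
    for chunk in batched(rows, batch_size):
        conn.executemany("INSERT INTO stg_sales VALUES (?, ?)", chunk)
        conn.commit()  # committing per batch keeps transactions and logs small
```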
- {approach} building an ETL infrastructure
- involves integrating data from one or more data sources and testing the overall processes to ensure the data is processed correctly
- recurring process
- e.g. data used for reporting
- one-time process
- e.g. data migration
- may involve
- multiple source or destination systems
- different types of data
- e.g. reference, master and transactional data
- ⇐ may have complex dependencies
- different levels of structure
- e.g. structured, semi-structured, unstructured
- different data formats
- data of different quality
- different ownership
- {recommendation} consider ETL best practices
- {best practice} define requirements upfront in a consolidated and consistent manner
- allows setting clear expectations, consolidating the requirements, estimating the effort and costs, and obtaining sign-off
- the requirements may involve all the aspects of the process
- e.g. data extraction, data transformation, standard formatting, etc.
- {best practice} define a high-level strategy
- allows defining the road ahead, the risks and other aspects
- provides transparency
- this may be part of a broader strategy that can be referenced
- {best practice} align the requirements and the various aspects to the strategies existing in the organization
- allows consolidating the various efforts and making sure that the objectives, goals and requirements are aligned
- e.g. IT, business, Information Security, Data Management strategies
- {best practice} define the scope upfront
- helps to better estimate the effort and validate the outcomes
- even if the scope may change over time, this provides transparency and can be used as a basis for the time and cost estimations
- {best practice} manage the effort as a project and use a suitable Project Management methodology
- allows applying structured, well-established PM practices
- it might be appropriate to adapt the methodology to the project's particularities
- {best practice} convert data to standard formats to standardize data processing
- allows reducing the volume of issues resulting from data type mismatches
- applies mainly to dates, numbers or other values for which standard formats can be defined
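- {example} a sketch of converting dates and numbers to standard formats; the list of accepted source formats is an assumption and would need to reflect the actual sources
```python
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y", "%d.%m.%Y")

def to_iso_date(value: str) -> str:
    """Normalize a date string to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

def to_decimal(value: str) -> float:
    """Normalize numbers like '1.234,56' or '1,234.56' to a float."""
    v = value.strip().replace(" ", "")
    if "," in v and "." in v:
        # whichever separator comes last is treated as the decimal separator
        if v.rfind(",") > v.rfind("."):
            v = v.replace(".", "").replace(",", ".")
        else:
            v = v.replace(",", "")
    elif "," in v:
        v = v.replace(",", ".")
    return float(v)

if __name__ == "__main__":
    print(to_iso_date("15/01/2025"))  # 2025-01-15
    print(to_decimal("1.234,56"))     # 1234.56
```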
- {best practice} clean the data in the source systems, when cost-effective
- helps to reduce the overall effort, especially when this is done in advance
- ideally, this should be based on the scope
- {best practice} define and enforce data ownership
- allows enforcing clear responsibilities across the various processes
- helps to reduce the overall effort
- {best practice} document data dependencies
- document the dependencies existing in the data at the various levels (e.g. between reference, master and transactional data), as sketched below
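- {example} a sketch of documenting data dependencies in code and deriving a load order from them; the entity names are illustrative assumptions
```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# each entity is mapped to the entities it depends on
DEPENDENCIES = {
    "dim_currency": set(),                           # reference data
    "dim_product": set(),                            # reference data
    "dim_customer": {"dim_currency"},                # master data
    "fact_sales": {"dim_customer", "dim_product"},   # transactional data
}

# a valid load order lists every entity after all of its dependencies
load_order = list(TopologicalSorter(DEPENDENCIES).static_order())
print(load_order)  # e.g. ['dim_currency', 'dim_product', 'dim_customer', 'fact_sales']
```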
- {best practice} log the data movement from source(s) to destination(s) in terms of data volume (see the sketch below)
- provides transparency into the data movement process
- helps to identify gaps in the data or broader issues
- can be used for troubleshooting and understanding the overall data growth
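- {example} a sketch of logging the data volumes moved from source to target; the audit table layout and names are assumptions
```python
import sqlite3
from datetime import datetime, timezone

def log_data_movement(conn: sqlite3.Connection, source: str, target: str,
                      rows_read: int, rows_written: int) -> None:
    """Record one movement of data; differences between read and written
    row counts point at dropped or rejected data."""
    conn.execute("""CREATE TABLE IF NOT EXISTS etl_audit_log (
                        run_ts TEXT, source TEXT, target TEXT,
                        rows_read INTEGER, rows_written INTEGER)""")
    conn.execute("INSERT INTO etl_audit_log VALUES (?, ?, ?, ?, ?)",
                 (datetime.now(timezone.utc).isoformat(),
                  source, target, rows_read, rows_written))
    conn.commit()
    if rows_read != rows_written:
        print(f"warning: {rows_read - rows_written} row(s) read from {source} "
              f"were not written to {target}")
```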
- {recommendation} consider proven systems, architectures and methodologies
- helps to minimize the overall effort and costs associated with the process