23 February 2017

Data Management: Data Cleaning/Cleansing (Definitions)

"A processing step where missing or inaccurate data is replaced with valid values." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

"The process of validating data prior to a data analysis or Data Mining. This includes both ensuring that the values of the data are valid for a particular attribute or variable (e.g., heights are all positive and in a reasonable range) and that the values for given records or set of records are consistent." (William J Raynor Jr., "The International Dictionary of Artificial Intelligence", 1999)

"The process of correcting errors or omissions in data. This is often part of the extraction, transformation, and loading (ETL) process of extracting data from a source system, usually before attempting to load it into a target system. This is also known as data scrubbing." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"The removal of inconsistencies, errors, and gaps in source data prior to its incorporation into data warehouses or data marts to facilitate data integration and improve data quality." (Steve Williams & Nancy Williams, "The Profit Impact of Business Intelligence", 2007)

"Software used to identify potential data quality problems. For example, if a customer is listed multiple times in a customer database using variations of the spelling of his or her name, the data cleansing software ensures that each data element is consistent so there is no confusion. Such software is used to make corrections to help standardize the data." (Judith Hurwitz et al, "Service Oriented Architecture For Dummies" 2nd Ed., 2009)

"The process of reviewing and improving data to make sure it is correct, up to date, and not duplicated." (Tony Fisher, "The Data Asset", 2009)

"The process of correcting data errors to bring the level of data quality to an acceptable level for the information user needs." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The act of detecting and removing and/or correcting data in a database. Also called data scrubbing." (Craig S Mullins, "Database Administration: The Complete Guide to DBA Practices and Procedures", 2012)

"Synonymous with data fixing or data correcting, data cleaning is the process by which errors, inexplicable anomalies, and missing values are somehow handled. There are three options for data cleaning: correcting the error, deleting the error, or leaving it unchanged." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"The process of detecting, removing, or correcting incorrect data." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"The process of finding and fixing errors and inaccuracies in data" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"The process of removing corrupt, redundant, and inaccurate data in the data governance process." (Robert F Smallwood, "Information Governance: Concepts, Strategies, and Best Practices", 2014) 

"The process of eliminating inaccuracies, irregularities, and discrepancies from data." (Jim Davis & Aiman Zeid, "Business Transformation", 2014)

"The process of reviewing and revising data in order to delete duplicates, correct errors, and provide consistency." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"the processes of identifying and resolving potential data errors." (Meredith Zozus, "The Data Book: Collection and Management of Research Data", 2017)

"A sub-process in data preprocessing, where we remove punctuation, stop words, etc. from the text." (Neha Garg & Kamlesh Sharma, "Machine Learning in Text Analysis", 2020)

"Processing a dataset to make it easier to consume. This may involve fixing inconsistencies and errors, removing non-machine-readable elements such as formatting, using standard labels for row and column headings, ensuring that numbers, dates, and other quantities are represented appropriately, conversion to a suitable file format, reconciliation of labels with another dataset being used (see data integration)." (Open Data Handbook) 

"The process of detecting and correcting faulty records, leading to highly accurate BI-informed decisions, as enormous databases and rapid acquisition of data can lead to inaccurate or faulty data that impacts the resulting BI and analysis. Correcting typographical errors, de-duplicating records, and standardizing syntax are all examples of data cleansing." (Insight Software)

"Transforming data in its native state to a pre-defined standardized format using vendor software." (Solutions Review)

"Data cleansing is the effort to improve the overall quality of data by removing or correcting inaccurate, incomplete, or irrelevant data from a data system.  […] Data cleansing techniques are usually performed on data that is at rest rather than data that is being moved. It attempts to find and remove or correct data that detracts from the quality, and thus the usability, of data. The goal of data cleansing is to achieve consistent, complete, accurate, and uniform data." (Informatica) [source]

"Data cleansing is the process of modifying data to improve accuracy and quality." (Xplenty) [source]

"Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted." (Sisense) [source]

"Data Cleansing (or Data Scrubbing) is the action of identifying and then removing or amending any data within a database that is: incorrect, incomplete, duplicated." (experian) [source]

"Data cleansing, or data scrubbing, is the process of detecting and correcting or removing inaccurate data or records from a database. It may also involve correcting or removing improperly formatted or duplicate data or records. Such data removed in this process is often referred to as 'dirty data'. Data cleansing is an essential task for preserving data quality." (Teradata) [source]

"Data scrubbing, also called data cleansing, is the process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated." (Techtarget) [source]

"the process of reviewing and revising data in order to delete duplicates, correct errors and provide consistency." (Analytics Insight)

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.