29 April 2018

Data Science: Data Standardization (Definitions)

"The process of reaching agreement on common data definitions, formats, representation and structures of all data layers and elements." (United Nations, "Handbook on Geographic Information Systems and Digital Mapping", Studies in Methods No. 79, 2000)

[value standardization:] "Refers to the establishment and adherence of data to standard formatting practices, ensuring a consistent interpretation of data values." (Evan Levy & Jill Dyché, "Customer Data Integration", 2006)

"Converting data into standard formats to facilitate parsing and thus matching, linking, and de-duplication. Examples include: “Avenue” as “Ave.” in addresses; “Corporation” as “Corp.” in business names; and variations of a specific company name as one version." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"Normalizes data values to meet format and semantic definitions. For example, data standardization of address information may ensure that an address includes all of the required pieces of information and normalize abbreviations (for example Ave. for Avenue)." (Martin Oberhofer et al, "Enterprise Master Data Management", 2008)

"Using rules to conform data that is similar into a standard format or structure. Example: taking similar data, which originates in a variety of formats, and transforming it into a single, clearly defined format." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"a process in information systems where data values for a data element are transformed to a consistent representation." (Meredith Zozus, "The Data Book: Collection and Management of Research Data", 2017)

"Data standardization is the process of converting data to a common format to enable users to process and analyze it." (Sisense) [source]

"In the context of data analysis and data mining: Where “V” represents the value of the variable in the original datasets: Transformation of data to have zero mean and unit variance. Techniques used include: (a) Data normalization; (b) z-score scaling; (c) Dividing each value by the range: recalculates each variable as V /(max V – min V). In this case, the means, variances, and ranges of the variables are still different, but at least the ranges are likely to be more similar; and, (d) Dividing each value by the standard deviation. This method produces a set of transformed variables with variances of 1, but different means and ranges." (CODATA)

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.