SQL Troubles

26 January 2018

🔬Data Science: Standard Deviation (Definitions)

"A commonly used measure that defines the variation in a data set." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A measure of the variability in a set of data. It is calculated by taking the square root of the variance. Standard deviations are not additive; the variances are." (Clyde M Creveling, "Six Sigma for Technical Processes", 2006)

"The degree of dispersion of a group of scores around the average. If most scores are close to the average, the standard deviation is low. Conversely, if the scores are widely dispersed, the standard deviation is large." (Ruth C Clark, "Building Expertise: Cognitive Methods for Training and Performance Improvement", 2008)

"The measured range of economic volatility that can occur during the course of doing business." (Annetta Cortez & Bob Yehling, "The Complete Idiot's Guide® To Risk Management", 2010)

"A measure of how distributed the values of a probability curve are, relative to the average." (Jon Radoff, "Game On: Energize Your Business with Social Media Games", 2011)

"The amount of dispersal among test scores or other outcome results. A larger standard deviation indicates greater spread among test scores, while a smaller standard deviation indicates greater consistency among scores." (Ruth C Clark & Richard E Mayer, "e-Learning and the Science of Instruction", 2011)

"Describes dispersion about the data set’s mean. You can think of a standard deviation as an average deviation from the mean. See also average; variance." (E C Nelson & Stephen L Nelson, "Excel Data Analysis For Dummies ", 2015)

"Square root of variance. The standard deviation is an index of variability in the distribution of scores." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"the square root of the variance of a sample or distribution. For well-behaved, reasonably symmetric data distributions without long tails, we would expect most of the observations to lie within two sample standard deviations from the sample mean." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

25 January 2018

🔬Data Science: Regression Analysis (Definitions)

"A set of statistical operations that helps to predict the value of the dependent variable from the values of one or more independent variables." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling 2nd Ed.", 2005)

"A statistical tool that measures the strength of relationship between one or more independent variables with a dependent variable. It builds upon the correlation concepts to develop an empirical, databased model. Correlation describes the X and Y relationship with a single number (the Pearson’s Correlation Coefficient (r)), whereas regression summarizes the relationship with a line - the regression line." (Lynne Hambleton, "Treasure Chest of Six Sigma Growth Methods, Tools, and Best Practices", 2007)

"A statistical procedure for estimating mathematically the average relationship between the dependent variable (e.g., sales) and one or more independent variables (e.g., price and advertising)." (Jae K Shim & Joel G Siegel, "Budgeting Basics and Beyond", 2008)

"Regression analysis is a statistical technique for estimating the relationship between a set of predictors (independent variables) and an outcome variable (dependent variable). Linear least-squares regression, in which the relationship is expressed in a linear form, is the most common type of regression analysis. The mathematical model used in least-squares linear regression is often called the general linear model (GLM)." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

"A statistical technique which seeks to find a line which best fits through a set of data as plotted on a graph, seeking to find the cleanest path which deviates the least from any instance within the set." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[regression] "Using one data set to predict the results of a second." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The statistical process of predicting one or more continuous variables, such as profit or loss, based on other attributes in the dataset." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A family of methods for fitting a line or curve to a dataset, used to simplify or make sense of a number of apparently random data points." (Meta S Brown, "Data Mining For Dummies", 2014)

"An analytic technique where a series of input variables are examined in relation to their corresponding output results in order to develop a mathematical or statistical relationship." (For Dummies, "PMP Certification All-in-One For Dummies" 2nd Ed., 2013)

"A statistical technique for estimating relationships between variables." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"Process to statistically estimate the relationship between different attributes." (Sanjiv K Bhatia & Jitender S Deogun, "Data Mining Tools: Association Rules", 2014)

"Plotting pairs of independent and dependent variables in an XY chart and then finding a linear or exponential equation that best describes the plotted data." (E C Nelson & Stephen L Nelson, "Excel Data Analysis For Dummies", 2015)

"A statistical procedure that produces an equation for predicting a variable (the criterion measure) from one or more other variables (the predictor measures)." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A statistical technique used to estimate the mathematical relationship between a dependent variable, such as quantity demanded, and one or more explanatory variables, such as price and income." (Jeffrey M Perloff & James A Brander, "Managerial Economics and Strategy" 2nd Ed., 2016)

"A statistical process for estimating the relationships between variables, often used to forecast the change in a variable based on changes in other variables. Linear regression is used to analyze continuous variables, and logistic regression is used for discrete variables." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"In a machine learning context, regression is the task of assigning scalar value to examples." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"Algorithms used to predict values for new data based on training data fed into the system. Areas where regression in machine learning is used to predict future values include drug response modeling, marketing, real estate and financial forecasting." (Accenture)

"To define the dependency between variables. It assumes a one-way causal effect from one variable to the response of another variable." (Analytics Insight)

24 January 2018

🔬Data Science: Data Processing (Definitions)

"The act of turning raw data into meaningful output, generally associated with computers." (Greg Perry, "Sams Teach Yourself Beginning Programming in 24 Hours" 2nd Ed., 2001)

"Any process that converts data into information. The processing is usually assumed to be automated and running on an information system." (Eleutherios A Papathanassiou & Xenia J Mamakou, "Privacy Issues in Public Web Sites", Handbook of Research on Public Information Technology, 2008)

"Obtaining, recording or holding the data, or carrying out any operation on the data, including organising, adapting or altering it. Retrieval, consultation or use of the data, disclosure of the data, and alignment, combination, blocking, erasure or destruction of the data are all legally classed as processing." (Mark Olive, "SHARE: A European Healthgrid Roadmap", 2009)

"The operation performed on data through capture, transformation, and storage, in order to derive new information according to a given set of rules." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Collection and elaboration of sensing data with the aim to derivate/infer new knowledge from original raw data." (Paolo Bellavista et al, "Crowdsensing in Smart Cities: Technical Challenges, Open Issues, and Emerging Solution Guidelines", 2015)

"The act of data manipulation through integration of mathematical tools, statistics, and computer application to generate information." (Babangida Zubairu, "Security Risks of Biomedical Data Processing in Cloud Computing Environment", 2018)

"Any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction." (Yordanka Ivanova, "Data Controller, Processor, or Joint Controller: Towards Reaching GDPR Compliance in a Data- and Technology-Driven World", 2020)

"Data processing is any action performed to turn raw data into useful information." (Xplenty) [source]

"Data processing occurs when data is collected and translated into usable information. […] Data processing starts with data in its raw form and converts it into a more readable format (graphs, documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized by employees throughout an organization." (Talend) [source]

19 January 2018

🔬Data Science: Structured Data (Definitions)

"Data that has a strict metadata defined, such as a SQL Server table’s column." (Victor Isakov et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance (70-444) Study Guide", 2007)

"Data that has enforced composition to specified datatypes and relationships and is managed by technology that allows for querying and reporting." (Keith Gordon, "Principles of Data Management", 2007)

"Database data, such as OLTP (Online Transaction Processing System) data, which can be sorted." (David G Hill, "Data Protection: Governance, Risk Management, and Compliance", 2009)

"A collection of records or data that is stored in a computer; records maintained in a database or application." (Robert F Smallwood, "Managing Electronic Records: Methods, Best Practices, and Technologies", 2013)

"Data that has a defined length and format. Examples of structured data include numbers, dates, and groups of words and numbers called strings (for example, a customer’s name, address, and so on)." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"Data that fits cleanly into a predefined structure." (Evan Stubbs, "Big Data, Big Innovation", 2014)

"Data that is described by a data model, for example, business data in a relational database" (Hasso Plattner, "A Course in In-Memory Data Management: The Inner Mechanics of In-Memory Databases" 2nd Ed., 2014)

"Data that is managed by a database management system" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"In statistics and data mining, any type of data whose values have clearly defined meaning, such as numbers and categories." (Meta S Brown, "Data Mining For Dummies", 2014)

"Data that adheres to a strict definition." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"Data that has a defined length and format. Examples of structured data include numbers, dates, and groups of words and numbers called strings (for example, for a customer’s name, address, and so on)." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"Data that resides in a fixed field within a file or individual record, such as a row & column database." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"Information that sits in a database, file, or spreadsheet. It is generally organized and formatted. In retail, this data can be point-of-sale data, inventory, product hierarchies, or others." (Brittany Bullard, "Style and Statistics", 2016)

"A data field of a definable data type, usually of a specified size or range, that can be easily processed by a computer." (George Tillmann, "Usage-Driven Database Design: From Logical Data Modeling through Physical Schmea Definition", 2017)

"Data that can be stored in a table. Every instance in the table has the same set of attributes. Contrast with unstructured data." (John D Kelleher & Brendan Tierney, "Data science", 2018)

"Data that is identifiable as it is organized in structure like rows and columns. The data resides in fixed fields within a record or file or the data is tagged correctly and can be accurately identified." (Analytics Insight)

"Refers to information with a high degree of organization, meaning that it can be seamlessly included in a relational database and quickly searched by straightforward search engine algorithms and/or other search operations. Structured data examples include dates, numbers, and groups of words and number 'strings'. Machine-generated structured data is on the increase and includes sensor data and financial data." (Accenture)

15 January 2018

🔬Data Science: Semi-Structured Data (Definitions)

"Data that has flexible metadata, such as XML." (Marilyn Miller-White et al, "MCITP Administrator: Microsoft® SQL Server™ 2005 Optimization and Maintenance 70-444", 2007)

"'Text' documents, such as e-mail, word processing, presentations, and spreadsheets, whose content can be searched." (David G Hill, "Data Protection: Governance, Risk Management, and Compliance", 2009)

"Data that, although unstructured, still has some degree of structure. A good example is e-mail: Even though it is predominantly text, it has logical blocks with different purposes." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"Data that have already been processed to some extent." (Carlos Coronel & Steven Morris, "Database Systems: Design, Implementation, & Management" 11th Ed., 2014)

"A structured data type that does not have a formal definition, like a document. It has tags or other markers to enforce a hierarchy of records within a particular object, but may be different from another object." (Jason Williamson, Getting a Big Data Job For Dummies, 2015)

"Semi-structured data has some structures that are often manifested in images and data from sensors." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"a form a structured data that does not have a formal structure like structured data. It does however have tags or other markers to enforce hierarchy of records." (Analytics Insight)

🔬Data Science: Big Data (Definitions)

"Big Data: when the size and performance requirements for data management become significant design and decision factors for implementing a data management and analysis system. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration." (Jimmy Guterman, 2009)

"A buzzword for the challenges of and approaches to working with data sets that are too big to manage with traditional tools, such as relational databases. So called NoSQL databases, clustered data processing tools like MapReduce, and other tools are used to gather, store, and analyze such data sets." (Dean Wampler, "Functional Programming for Java Developers", 2011)

"Big data: techniques and technologies that make handling data at extreme scale economical." (Brian Hopkins, "Big Data, Brewer, And A Couple Of Webinars", 2011) [source]

"Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value." (McKinsey & Co., "Big Data: The Next Frontier for Innovation, Competition, and Productivity", 2011)

"Data volumes that are exceptionally large, normally greater than 100 Terabyte and more commonly refer to the Petabyte and Exabyte range. Big data has begun to be used when discussing Data Warehousing and analytic solutions where the volume of data poses specific challenges that are unique to very large volumes of data including: data loading, modeling, cleansing, and analytics, and are often solved using massively parallel processing, or parallel processing and distributed data solutions." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it." (Edd Wilder-James, "What is big data?", 2012) [source]

"A collection of data whose very size, rate of accumulation, or increased complexity makes it difficult to analyze and comprehend in a timely and accurate manner." (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)

"A colloquial term referring to exceedingly large datasets that are otherwise unwieldy to deal with in a reasonable amount of time in the absence of specialized tools. They are different from normal data in terms of volume, velocity, and variety and typically require unique approaches for capture, processing, analysis, search, and visualization." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"Big data is the term increasingly used to describe the process of applying serious computing power – the latest in machine learning and artificial intelligence – to seriously massive and often highly complex sets of information." (Microsoft, 2013) [source]

"Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away." (Tim O’Reilly, [email correspondence, 2013)

"The capability to manage a huge volume of disparate data, at the right speed and within the right time frame, to allow real-time analysis and reaction. Big data is typically broken down by three characteristics, including volume (how much data), velocity (how fast that data is processed), and variety (the various types of data)." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A colloquial term referring to datasets that are otherwise unwieldy to deal with in a reasonable amount of time in the absence of specialized tools. Common characteristics include large amounts of data (volume), different types of data (variety), and ever-increasing speed of generation (velocity). They typically require unique approaches for capture, processing, analysis, search, and visualization." (Evan Stubbs, "Big Data, Big Innovation", 2014)

"An extremely large database which generally defies standard methods of analysis." (Owen P. Hall Jr., "Teaching and Using Analytics in Management Education", 2014)

"Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze." (Xiuli He et al, Supply Chain Analytics: Challenges and Opportunities, 2014)

"More data than can be processed by today's database systems, or acutely high volume, velocity, and variety of information assets that demand IG to manage and leverage for decision-making insights and cost management." (Robert F Smallwood, "Information Governance: Concepts, Strategies, and Best Practices", 2014)

"The term that refers to data that has one or more of the following dimensions, known as the four Vs: Volume, Variety, Velocity, and Veracity." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"A collection of models, techniques and algorithms that aim at representing, managing, querying and mining large-scale amounts of data (mainly semi-structured data) in distributed environments (e.g., Clouds)." (Alfredo Cuzzocrea & Mohamed M Gaber, "Data Science and Distributed Intelligence", 2015)

"A process to deliver decision-making insights. The process uses people and technology to quickly analyze large amounts of data of different types (traditional table structured data and unstructured data, such as pictures, video, email, and Tweets) from a variety of sources to produce a stream of actionable knowledge." (James R Kalyvas & Michael R Overly, "Big Data: A Businessand Legal Guide", 2015)

"A relative term referring to data that is difficult to process with conventional technology due to extreme values in one or more of three attributes: volume (how much data must be processed), variety (the complexity of the data to be processed) and velocity (the speed at which data is produced or at which it arrives for processing). As data management technologies improve, the threshold for what is considered big data rises. For example, a terabyte of slow-moving simple data was once considered big data, but today that is easily managed. In the future, a yottabyte data set may be manipulated on desktop, but for now it would be considered big data as it requires extraordinary measures to process." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"Big data is a discipline that deals with processing, storing, and analyzing heterogeneous (structured/semistructured/unstructured) large data sets that cannot be handled by traditional information management technologies that have been used to process structured data. Gartner defined big data based on the three Vs: volume, velocity, and variety." (Saumya Chaki, "Enterprise Information Management in Practice", 2015)

"Records that are so large (terabytes and exabytes) and diverse (from sensors to social media data) that they require new, powerful technologies for storage, management, analysis and visualization." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"Term used to describe the exponential growth, variety, and availability of data, both structured and unstructured." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"A broad term for large and complex data sets that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set." (Suren Behari, "Data Science and Big Data Analytics in Financial Services: A Case Study", 2016)

"A combination of facts and artifacts drawn from a myriad of sources and stored without regard to rational or normalized disciplines or structures." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"A term that describes a large dataset that grows in size over time. It refers to the size of dataset that exceeds the capturing, storage, management, and analysis of traditional databases. The term refers to the dataset that has large, more varied, and complex structure, accompanies by difficulties of data storage, analysis, and visualization. Big Data are characterized with their high-volume, -velocity and –variety information assets." (Kenneth C C Yang & Yowei Kang, "Real-Time Bidding Advertising: Challenges and Opportunities for Advertising Curriculum, Research, and Practice", 2016)

"Big data is a blanket term for any collection of data sets so large or complex that it becomes difficult to process them using traditional data management techniques such as, for example, the RDBMS (relational database management systems)." (Davy Cielen et al, "Introducing Data Science", 2016)

"For digital resources, inexpensive storage and high bandwidth have largely eliminated capacity as a constraint for organizing systems, with an exception for big data, which is defined as a collection of data that is too big to be managed by typical database software and hardware architectures." (Robert J Glushko, "The Discipline of Organizing: Professional Edition, 4th Ed", 2016)

"Large sets of data that are leveraged to make better business decisions. Retail data can be sales, product inventory, e-mail offers, customer information, competitor pricing, product descriptions, social media, and much more." (Brittany Bullard, "Style and Statistics", 2016)

"A term used to describe large sets of structured and unstructured data. Data sets are continually increasing in size and may grow too large for traditional storage and retrieval. Data may be captured and analyzed as it is created and then stored in files." (Daniel J Power & Ciara Heavin, "Decision Support, Analytics, and Business Intelligence" 3rd Ed., 2017)

"Datasets of structured and unstructured information that are so large and complex that they cannot be adequately processed and analyzed with traditional data tools and applications. |" (Jonathan Ferrar et al, "The Power of People", 2017)

"Big data are often defined in terms of the three Vs: the extreme volume of data, the variety of the data types, and the velocity at which the data must be processed." (John D Kelleher & Brendan Tierney, "Data science", 2018)

"Very large data volumes that are complex and varied, and often collected and must be analyzed in real time." (Daniel J. Power & Ciara Heavin, "Data-Based Decision Making and Digital Transformation", 2018)

"A generic term that designates the massive volume of data that is generated by the increasing use of digital tools and information systems. The term big data is used when the amount of data that an organization has to manage reaches a critical volume that requires new technological approaches in terms of storage, processing, and usage. Volume, velocity, and variety are usually the three criteria used to qualify a database as 'big data'." (Soraya Sedkaoui, "Big Data Analytics for Entrepreneurial Success", 2019)

"Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation." (Thomas Ochs & Ute A Riemann, "IT Strategy Follows Digitalization", 2019)

"The capability to manage a huge volume of disparate data, at the right speed and within the right time frame, to allow real time analysis and reaction." (K Hariharanath, "BIG Data: An Enabler in Developing Business Models in Cloud Computing Environments", 2019)

"A term used to refer to the massive datasets generated in the digital age. Both the volume and speed at which data are generated is far greater than in the past and requires powerful computing technologies." (Osman Kandara & Eugene Kennedy, "Educational Data Mining: A Guide for Educational Researchers", 2020)

"Refers to data sets that are so voluminous and complex that traditional data processing application software is inadequate to deal with them." (James O Odia & Osaheni T Akpata, "Role of Data Science and Data Analytics in Forensic Accounting and Fraud Detection", 2021)

"The evolving term that describes a large volume of structured, semi-structured and unstructured data that has the potential to be mined for information and used in machine learning projects and other advanced analytics applications." (Nenad Stefanovic, "Big Data Analytics in Supply Chain Management", 2021)

"The term 'big data' is related to gathering and storing extra-large volume of structured, semi-structured and unstructured data with high Velocity and Variability to be used in advanced analytics applications." (Ahmad M Kabil, Integrating Big Data Technology Into Organizational Decision Support Systems, 2021)

"A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." (Board International)

"A collection of data so large that it cannot be stored, transmitted or processed by traditional means." (Open Data Handbook)

"an accumulation of data that is too large and complex for processing by traditional database management tools" (Merriam-Webster)

"Extremely large data sets that may be analyzed to reveal patterns and trends and that are typically too complex to be dealt with using traditional processing techniques." (Solutions Review)

"is a term for very large and complex datasets that exceed the ability of traditional data processing applications to deal with them. Big data technologies include data virtualization, data integration tools, and search and knowledge discovery tools." (Accenture)

"The practices and technology that close the gap between the data available and the ability to turn that data into business insight." (Forrester)

"Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage and process the data with low latency. Big data has one or more of the following characteristics: high volume, high velocity or high variety." (IBM) [source]

"Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves." (SAS) [source]

"Big data is a combination of structured, semistructured and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modeling and other advanced analytics applications." (Techtarget)

"Big data is a term used for large data sets that include structured, semi-structured, and unstructured data." (Xplenty) [source]

"Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation." (Gartner)

"Big data is the catch-all term used to describe gathering, analyzing, and storing massive amounts of digital information to improve operations." (Talend) [source]

"Big data refers to the 21st-century phenomenon of exponential growth of business data, and the challenges that come with it, including holistic collection, storage, management, and analysis of all the data that a business owns or uses." (Informatica) [source]

14 January 2018

🔬Data Science: Unstructured Data (Definitions)

"Data that does not neatly fit into a tabular structure with well-defined and bounded definitions. Examples of unstructured data are e-mail messages and video streams. Many customer databases contain comment fields where customer service reps put in additional notes about customers." (Jill Dyché & Evan Levy, "Customer Data Integration: Reaching a Single Version of the Truth", 2006)

"Computerised information which does not have a data structure that is easily readable by a machine, including audio, video and unstructured text such as the body of a word-processed document - effectively this is the same as multimedia data." (Keith Gordon, "Principles of Data Management", 2007)

"Data that has no metadata, such as text files." (Victor Isakov et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance (70-444) Study Guide", 2007)

"Natively bitmapped data, such as video, audio, pictures, and MRI scans, that can be sensed either visually, audibly, or both." (David G Hill, "Data Protection: Governance, Risk Management, and Compliance", 2009)

"Data that does not fit into a structured data model or does not fit well into relational tables. Common examples include binary information such as video or audio and free-text information." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"Data that does not follow a specified data format. Unstructured data can be text, video, images, and so on." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"Unstructured data has no real structure, such as the data in an email and a memo. Interestingly, estimates have 85% of all business information as unstructured data. There are now many products coming on the market that can put some structure into unstructured data so that it can be categorized or organized hierarchically." (Michael M David & Lee Fesperman, "Advanced SQL Dynamic Data Modeling and Hierarchical Processing", 2013)

"Data that exist in their original (raw) state; that is in the format in which they were collected." (Carlos Coronel & Steven Morris, "Database Systems: Design, Implementation, & Management Ed. 11", 2014)

"Data whose logical organization is not apparent to the computer" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"Information (typically stored digitally) that either does not have a predefined data model or is not organized in a predefined manner. Most unstructured data is created by humans and includes email, documents, text messages, tweets, blogs, and more." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"Text, audio, video, and other types of complex data that won’t easily fit into a conventional relational database. Unstructured data isn’t as simple as the numbers and short strings that most data analysts use." (Meta S Brown, "Data Mining For Dummies", 2014)

"Data that cannot fit cleanly into a predefined structure." (Evan Stubbs, "Big Data, Big Innovation", 2014)

"Data without data model or that a computer program cannot easily use (in the sense of understanding its content). Examples are word processing documents or electronic mail" (Hasso Plattner, "A Course in In-Memory Data Management: The Inner Mechanics of In-Memory Databases" 2nd Ed., 2014)

"Data (generally text-based) which is not presented in a structured form such as a database, ontology, table, etc. Newspaper articles, government reports, blogs, and e-mails are all examples of unstructured data." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"Data that doesn’t fit into a fixed and strict definition. Things like sound files, images, text, and web pages can be considered unstructured data." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"Information that does not follow a specified data format. Unstructured data can be text, video, images, and such." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"Data that does not have a specific format. It can be customer reviews, tweets, pictures, or even hashtags." (Brittany Bullard, "Style and Statistics", 2016)

"A type of data where each instance in the data set may have its own internal structure; that is, the structure is not necessarily the same in every instance. For example, text data are often unstructured and require a sequence of operations to be applied to them in order to extract a structured representation for each instance." (John D Kelleher & Brendan Tierney, "Data science", 2018)

03 January 2018

🔬Data Science: Models (Definitions)

"A model is essentially a calculating engine designed to produce some output for a given input." (Richard C Lewontin, "Models, Mathematics and Metaphors", Synthese, Vol. 15, No. 2, 1963)

"A model is an abstract description of the real world. It is a simple representation of more complex forms, processes and functions of physical phenomena and ideas." (Moshe F Rubinstein & Iris R Firstenberg, "Patterns of Problem Solving", 1975)

"A model is an attempt to represent some segment of reality and explain, in a simplified manner, the way the segment operates." (E Frank Harrison, "The managerial decision-making process" , 1975)

"A model is a representation containing the essential structure of some object or event in the real world." (David W Stockburger, "Introductory Statistics", 1996)

"A model is a deliberately simplified representation of a much more complicated situation." (Robert M Solow, "How Did Economics Get That Way and What Way Did It Get?", Daedalus Vol. 126 (1), 1997)

"Models are synthetic sets of rules, pictures, and algorithms providing us with useful representations of the world of our perceptions and of their patterns." (Burton G Malkiel, "A Random Walk Down Wall Street", 1999)

"A model is an imitation of reality" (Ian T Cameron & Katalin M Hangos, "Process Modelling and Model Analysis", 2001)

"Models are replicas or representations of particular aspects and segments of the real world" (Paulraj Ponniah, "Data Modeling Fundamentals", 2007)

"A model is a simplification of reality." (Alexey Voinov, "Systems Science and Modeling for Ecological Economics", 2008)

"a model is a representation of reality intended for some definite purpose." (Michael Pidd, "Tools for Thinking" 3rd Ed., 2009)

"A model is a representation of some subject matter." (Alec Sharp & Patrick McDermott, "Workflow Modeling" 2nd Ed, 2009)

"An abstract representation of how something is built (or is to be built), or how something works (or is observed as working)." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A model is a simplified representation of a system. It can be conceptual, verbal, diagrammatic, physical, or formal (mathematical)." (Hiroki Sayama, "Introduction to the Modeling and Analysis of Complex Systems", 2015)

"A formal set of relationships that can be manipulated to test assumptions. A simulation that tests the number of units that can be processed each hour under a set of conditions is an example of a model. Models do not need to be graphical." (Appian)

"Model is simply a representation or simulation of some real-world phenomenon." (Accenture)

02 January 2018

🔬Data Science: Data (Definitions)

"Facts and figures used in computer programs." (Greg Perry, "Sams Teach Yourself Beginning Programming in 24 Hours" 2nd Ed., 2001)

"A representation of facts, concepts, or instructions suitable to permit communication, interpretation, or processing by humans or by automatic means. (2) Used as a synonym for documentation in U.S. government procurement regulations." (Richard D Stutzke, "Estimating Software-Intensive Systems: Projects, Products, and Processes", 2005)

"A recording of facts, concepts, or instructions on a storage medium for communication, retrieval, and processing by automatic means and presentation as information that is understandable by human beings." (William H Inmon, "Building the Data Warehouse", 2005)

"An atomic element of information. Represented as bits within mass storage devices, memory, and pprocessors." (Tom Petrocelli, "Data Protection and Information Lifecycle Management", 2005)

"Information documented by a language system representing facts, text, graphics, bitmapped images, sound, and analog or digital live-video segments. Data is the raw material of a system supplied by data producers and is used by information consumers to create information." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"A term applied to organized information." (Gavin Powell, "Beginning Database Design", 2006)

"Numeric information or facts collected through surveys or polls, measurements or observations that need to be effectively organized for decision making." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"Raw, unrelated numbers or entries, e.g., in a database; raw forms of transactional representations." (Martin J Eppler, "Managing Information Quality" 2nd Ed., 2006)

"Data is a representation of facts, concepts or instructions in a formalized manner suitable for communication, interpretation or processing by humans or automatic means." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"Numeric information or facts collected through surveys or polls, measurements or observations that need to be effectively organized for decision making." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2007)

"Hub A common approach for a technical implementation of a service-oriented MDM solution. Data Hubs store and manage some data attributes and the metadata containing the location of data attributes in external systems in order to create a single physical or federated trusted source of information about customers, products, and so on." (Alex Berson & Lawrence Dubov, "Master Data Management and Data Governance", 2010)

"Raw facts, that is, facts that have not yet been processed to reveal their meaning to the end user." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"Facts represented as text, numbers, graphics, images, sound, or video (with no additional defining context); the raw material used to create information." (Craig S Mullins, "Database Administration: The Complete Guide to DBA Practices and Procedures 2nd Ed", 2012)

"Data are abstract representations of selected characteristics of real-world objects, events, and concepts, expressed and understood through explicitly definable conventions related to their meaning, collection, and storage. We also use the term data to refer to pieces of information, electronically captured, stored (usually in databases), and capable of being shared and used for a range of organizational purposes."(Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Data are abstract representations of selected characteristics of real-world objects, events, and concepts, expressed and understood through explicitly definable conventions related to their meaning, collection, and storage. We also use the term data to refer to pieces of information, electronically captured, stored (usually in databases), and capable of being shared and used for a range of organizational purposes." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement", 2013)

"A collection of values assigned to base measures, derived measures and/or indicators." (David Sutton, "Information Risk Management: A practitioner’s guide", 2014)

"Raw facts, that is, facts that have not yet been processed to reveal their meaning to the end user." (Carlos Coronel & Steven Morris, "Database Systems: Design, Implementation, & Management" 11th Ed., 2014)

"A formalized (meaning suitable for further processing, interpretation and communication) representation of business objects or transactions." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"Data is a collection of one or more pieces if information." (Robert J Glushko, "The Discipline of Organizing: Professional Edition, 4th Ed", 2016)

"Facts about events, objects, and associations. Example: data about a sale would include date, amount, and method of payment." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"Discrete, unorganized, unprocessed measurements or raw observations." (Project Management Institute, "A Guide to the Project Management Body of Knowledge (PMBOK® Guide )", 2017)

"Any values from an application that can be transformed into facts and eventually information.." (Piethein Strengholt, "Data Management at Scale", 2020)

"A set of collected facts. There are two basic kinds of numerical data: measured or variable data … and counted or attribute data." (ASQ)
"A representation of information as stored or transmitted." (NISTIR 4734)

"A representation of information, including digital and non-digital formats." (NIST Privacy Framework Version 1.0)

"A variable-length string of zero or more (eight-bit) bytes." (NIST SP 800-56B Rev. 2)

"Any piece of information suitable for use in a computer." (NISTIR 7693)

"(1) Anything observed in the documentation or operation of software that deviates from expectations based on previously verified software products or reference documents.(2) A representation of facts, concepts, or instructions in a manner suitable for communication, interpretation, or processing by humans or by automatic means." (IEEE 610.5-1990)

"Data may be thought of as unprocessed atomic statements of fact. It very often refers to systematic collections of numerical information in tables of numbers such as spreadsheets or databases. When data is structured and presented so as to be useful and relevant for a particular purpose, it becomes information available for human apprehension. See also knowledge." (Open Data Handbook)

"Distinct pieces of digital information that have been formatted in a specific way." (NIST SP 800-86)

"Information in a specific representation, usually as a sequence of symbols that have meaning." (CNSSI 4009-2015 IETF RFC 4949 Ver 2)

"Pieces of information from which “understandable information” is derived." (NIST SP 800-88 Rev. 1)

“re-interpretable representation of information in a formalized manner suitable for communication, interpretation, or processing” (ISO 11179)

01 January 2018

🔬Data Science: Data Science (Definitions)

"A set of quantitative and qualitative methods that support and guide the extraction of information and knowledge from data to solve relevant problems and predict outcomes." (Xiuli He et al, "Supply Chain Analytics: Challenges and Opportunities", 2014)

"A collection of models, techniques and algorithms that focus on the issues of gathering, pre-processing, and making sense-out of large repositories of data, which are seen as 'data products'." (Alfredo Cuzzocrea & Mohamed M Gaber, "Data Science and Distributed Intelligence", 2015)

"Data science involves using methods to analyze massive amounts of data and extract the knowledge it contains. […] Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of data produced today. It adds methods from computer science to the repertoire of statistics." (Davy Cielen et al, "Introducing Data Science", 2016)

"The workflows and processes involved in the creation and development of data products." (Benjamin Bengfort & Jenny Kim, "Data Analytics with Hadoop", 2016)

"The discipline of analysis that helps relate data to the events and processes that produce and consume it for different reasons." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"The extraction of knowledge from large volumes of unstructured data which is a continuation of the field data mining and predictive analytics, also known as knowledge discovery and data mining (KDD)." (Suren Behari, "Data Science and Big Data Analytics in Financial Services: A Case Study", 2016)

"A knowledge acquisition from data through scientific method that comprises systematic observation, experiment, measurement, formulation, and hypotheses testing with the aim of discovering new ideas and concepts." (Babangida Zubairu, "Security Risks of Biomedical Data Processing in Cloud Computing Environment", 2018)

"Data science is a collection of techniques used to extract value from data. It has become an essential tool for any organization that collects, stores, and processes data as part of its operations. Data science techniques rely on finding useful patterns, connections, and relationships within data. Being a buzzword, there is a wide variety of definitions and criteria for what constitutes data science. Data science is also commonly referred to as knowledge discovery, machine learning, predictive analytics, and data mining. However, each term has a slightly different connotation depending on the context." (Vijay Kotu & Bala Deshpande, "Data Science" 2nd Ed., 2018)

"A field that builds on and synthesizes a number of relevant disciplines and bodies of knowledge, including statistics, informatics, computing, communication, management, and sociology to translate data into information, knowledge, insight, and intelligence for improving innovation, productivity, and decision making." (Zhaohao Sun, "Intelligent Big Data Analytics: A Managerial Perspective", 2019)

"Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data in various forms, both structured and unstructured similar to data mining." (K Hariharanath, "BIG Data: An Enabler in Developing Business Models in Cloud Computing Environments", 2019)

"Is a broad field that refers to the collective processes, theories, concepts, tools and technologies that enable the review, analysis, and extraction of valuable knowledge and information from raw data. It is geared toward helping individuals and organizations make better decisions from stored, consumed and managed data." (Maryna Nehrey & Taras Hnot, "Data Science Tools Application for Business Processes Modelling in Aviation", 2019)

"It is a new discipline that combines elements of mathematics, statistics, computer science, and data visualization. The objective is to extract information from data sources. In this sense, data science is devoted to database exploration and analysis. This discipline has recently received much attention due to the growing interest in big data." (Soraya Sedkaoui, "Big Data Analytics for Entrepreneurial Success", 2019)

"the study and application of techniques for deriving insights from data, including constructing algorithms for prediction. Traditional statistical science forms part of data science, which also includes a strong element of coding and data management." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"A relatively new term applied to an interdisciplinary field of study focused on methods for collecting, maintaining, processing, analyzing and presenting results from large datasets." (Osman Kandara & Eugene Kennedy, "Educational Data Mining: A Guide for Educational Researchers", 2020)

"Data Science is the branch of science that uses technologies to predict the upcoming nature of different things such as a market or weather conditions. It shows a wide usage in today’s world." (Kirti R Bhatele, "Data Analysis on Global Stratification", 2020)

"Data science is a methodical form of integrating statistics, algorithms, scientific methods, models and visualization methods for interpretation of outcomes in organizational problem solving and fact based decision making." (Tanushri Banerjee & Arindam Banerjee, "Designing a Business Analytics Culture in Organizations in India", 2021)

"Data science is a multi-disciplinary field that follows scientific approaches, methods, and processes to extract knowledge and insights from structured, semi-structured and unstructured data." (Ahmad M Kabil, Integrating Big Data Technology Into Organizational Decision Support Systems, 2021)

Data Science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights." (R Suganya et al, "A Literature Review on Thyroid Hormonal Problems in Women Using Data Science and Analytics: Healthcare Applications", 2021)

"Data Science is the science and art of using computational methods to identify and discover influential patterns in data." (M Govindarajan, "Introduction to Data Science", 2021)

"Data science is the study of data. It involves developing methods of recording, storing, and analyzing data to effectively extract useful information. The goal of data science is to gain insights and knowledge from any type of data - both structured and unstructured." (Pankaj Pathak, "A Survey on Tools for Data Analytics and Data Science", 2021)

"It is a science of multiple disciplines used for exploring knowledge from data using complex scientific algorithms and methods." (Vandana Kalra et al, "Machine Learning and Its Application in Monitoring Diabetes Mellitus", 2021)

"The concept that utilizes scientific and software methods, IT infrastructure, processes, and software systems in order to gather, process, analyze and deliver useful information, knowledge and insights from various data sources." (Nenad Stefanovic, "Big Data Analytics in Supply Chain Management", 2021)

"This is an evolving field that deals with the gathering, preparation, exploration, visualization, organisation, and storage of large groups of data and the extraction of valuable information from large volumes of data that may exist in an unorganised or unstructured form." (James O Odia & Osaheni T Akpata, "Role of Data Science and Data Analytics in Forensic Accounting and Fraud Detection", 2021)

"A field of study involving the processes and systems used to extract insights from data in all of its forms. The profession is seen as a continuation of the other data analysis fields, such as statistics." (Solutions Review)

"The discipline of using data and advanced statistics to make predictions. Data science is also focused on creating understanding among messy and disparate data. The “what” a scientist is tackling will differ greatly by employer." (KDnuggets)

"Unites statistical systems and processes with computer and information science to mine insights with structured and/or unstructured data analytics." (Accenture)

"Data science is a multidisciplinary approach to finding, extracting, and surfacing patterns in data through a fusion of analytical methods, domain expertise, and technology. This approach generally includes the fields of data mining, forecasting, machine learning, predictive analytics, statistics, and text analytics." (Tibco) [source]

"Data science is an interdisciplinary field that combines social sciences, advanced statistics, and computer engineering skills to acquire, store, organize, and analyze information across a variety of sources." (TDWI)

"Data science is the multidisciplinary field that focuses on finding actionable information in large, raw or structured data sets to identify patterns and uncover other insights. The field primarily seeks to discover answers for areas that are unknown and unexpected." (Sisense) [source]

"Data science is the practical application of advanced analytics, statistics, machine learning, and the associated activities involved in those areas in a business context, like data preparation for example." (RapidMiner) [source]

"Data Science unites statistical systems and processes with computer and information science to mine insights with structured and/or unstructured data analytics." (Accenture)

29 December 2017

🗃️Data Management: Numeracy (Just the Quotes)

"The great body of physical science, a great deal of the essential fact of financial science, and endless social and political problems are only accessible and only thinkable to those who have had a sound training in mathematical analysis, and the time may not be very remote when it will be understood that for complete initiation as an efficient citizen of one of the new great complex world-wide States that are now developing, it is as necessary to be able to compute, to think in averages and maxima and minima, as it is now to be able to read and write." (Herbert G Wells, "Mankind in the Making", 1903)

"[…] statistical literacy. That is, the ability to read diagrams and maps; a 'consumer' understanding of common statistical terms, as average, percent, dispersion, correlation, and index number." (Douglas Scates, "Statistics: The Mathematics for Social Problems", 1943)

"Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write." (Samuel S Wilks, 1951 [paraphrasing Herber Wells] )

"Just as by ‘literacy’, in this context, we mean much more than its dictionary sense of the ability to read and write, so by ‘numeracy’ we mean more than mere ability to manipulate the rule of three. When we say that a scientist is ‘illiterate’, we mean that he is not well enough read to be able to communicate effectively with those who have had a literary education. When we say that a historian or a linguist is ‘innumerate’ we mean that he cannot even begin to understand what scientists and mathematicians are talking about." (Sir Geoffrey Crowther, "A Report of the Central Advisory Committee for Education", 1959)

"It is perhaps possible to distinguish two different aspects of numeracy […]. On the one hand is an understanding of the scientific approach to the study of phenomena - observation, hypothesis, experiment, verification. On the other hand, there is the need in the modern world to think quantitatively, to realise how far our problems are problems of degree even when they appear as problems of kind." (Sir Geoffrey Crowther, "A Report of the Central Advisory Committee for Education", 1959)

"Numeracy has two facets - reading and writing, or extracting numerical information and presenting it. The skills of data presentation may at first seem ad hoc and judgemental, a matter of style rather than of technology, but certain aspects can be formalized into explicit rules, the equivalent of elementary syntax." (Andrew Ehrenberg, "Rudiments of Numeracy", Journal of Royal Statistical Society, 1977)

"People often feel inept when faced with numerical data. Many of us think that we lack numeracy, the ability to cope with numbers. […] The fault is not in ourselves, but in our data. Most data are badly presented and so the cure lies with the producers of the data. To draw an analogy with literacy, we do not need to learn to read better, but writers need to be taught to write better." (Andrew Ehrenberg, "The problem of numeracy", American Statistician 35(2), 1981)

"We would wish ‘numerate’ to imply the possession of two attributes. The first of these is an ‘at-homeness’ with numbers and an ability to make use of mathematical skills which enable an individual to cope with the practical mathematical demands of his everyday life. The second is ability to have some appreciation and understanding of information which is presented in mathematical terms, for instance in graphs, charts or tables or by reference to percentage increase or decrease." (Cockcroft Committee, "Mathematics Counts: A Report into the Teaching of Mathematics in Schools", 1982)

"To function in today's society, mathematical literacy - what the British call ‘numeracy' - is as essential as verbal literacy […] Numeracy requires more than just familiarity with numbers. To cope confidently with the demands of today's society, one must be able to grasp the implications of many mathematical concepts - for example, change, logic, and graphs - that permeate daily news and routine decisions - mathematical, scientific, and cultural - provide a common fabric of communication indispensable for modern civilized society. Mathematical literacy is especially crucial because mathematics is the language of science and technology." (National Research Council, "Everybody counts: A report to the nation on the future of mathematics education", 1989)

"Illiteracy and innumeracy are social ills created in part by increased demand for words and numbers. As printing brought words to the masses and made literacy a prerequisite for productive life, so now computing has made numeracy an essential feature of today's society. But it is innumeracy, not numeracy, that dominates the headlines: ignorance of basic quantitative tools is endemic […] and is approaching epidemic levels […]." (Lynn A Steen, "Numeracy", Daedalus Vol. 119 No. 2, 1990)

"[…] data analysis in the context of basic mathematical concepts and skills. The ability to use and interpret simple graphical and numerical descriptions of data is the foundation of numeracy […] Meaningful data aid in replacing an emphasis on calculation by the exercise of judgement and a stress on interpreting and communicating results." (David S Moore, "Statistics for All: Why, What and How?", 1990)

"To be numerate is more than being able to manipulate numbers, or even being able to ‘succeed’ in school or university mathematics. Numeracy is a critical awareness which builds bridges between mathematics and the real world, with all its diversity. […] in this sense […] there is no particular ‘level’ of Mathematics associated with it: it is as important for an engineer to be numerate as it is for a primary school child, a parent, a car driver or gardener. The different contexts will require different Mathematics to be activated and engaged in […] "(Betty Johnston, "Critical Numeracy", 1994)

"We believe that numeracy is about making meaning in mathematics and being critical about maths. This view of numeracy is very different from numeracy just being about numbers, and it is a big step from numeracy or everyday maths that meant doing some functional maths. It is about using mathematics in all its guises - space and shape, measurement, data and statistics, algebra, and of course, number - to make sense of the real world, and using maths critically and being critical of maths itself. It acknowledges that numeracy is a social activity. That is why we can say that numeracy is not less than maths but more. It is why we don’t need to call it critical numeracy being numerate is being critical." (Dave Tout & Beth Marr, "Changing practice: Adult numeracy professional development", 1997)

"To be numerate means to be competent, confident, and comfortable with one’s judgements on whether to use mathematics in a particular situation and if so, what mathematics to use, how to do it, what degree of accuracy is appropriate, and what the answer means in relation to the context." (Diana Coben, "Numeracy, mathematics and adult learning", 2000)

"Numeracy is the ability to process, interpret and communicate numerical, quantitative, spatial, statistical, even mathematical information, in ways that are appropriate for a variety of contexts, and that will enable a typical member of the culture or subculture to participate effectively in activities that they value." (Jeff Evans, "Adults´ Mathematical Thinking and Emotion", 2000)

"Ignorance of relevant risks and miscommunication of those risks are two aspects of innumeracy. A third aspect of innumeracy concerns the problem of drawing incorrect inferences from statistics. This third type of innumeracy occurs when inferences go wrong because they are clouded by certain risk representations. Such clouded thinking becomes possible only once the risks have been communicated." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"In my view, the problem of innumeracy is not essentially 'inside' our minds as some have argued, allegedly because the innate architecture of our minds has not evolved to deal with uncertainties. Instead, I suggest that innumeracy can be traced to external representations of uncertainties that do not match our mind’s design - just as the breakdown of color constancy can be traced to artificial illumination. This argument applies to the two kinds of innumeracy that involve numbers: miscommunication of risks and clouded thinking. The treatment for these ills is to restore the external representation of uncertainties to a form that the human mind is adapted to." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"Overcoming innumeracy is like completing a three-step program to statistical literacy. The first step is to defeat the illusion of certainty. The second step is to learn about the actual risks of relevant events and actions. The third step is to communicate the risks in an understandable way and to draw inferences without falling prey to clouded thinking. The general point is this: Innumeracy does not simply reside in our minds but in the representations of risk that we choose." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"Statistical innumeracy is the inability to think with numbers that represent uncertainties. Ignorance of risk, miscommunication of risk, and clouded thinking are forms of innumeracy. Like illiteracy, innumeracy is curable. Innumeracy is not simply a mental defect 'inside' an unfortunate mind, but is in part produced by inadequate 'outside' representations of numbers. Innumeracy can be cured from the outside." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"Mathematics is often thought to be difficult and dull. Many people avoid it as much as they can and as a result much of the population is mathematically illiterate. This is in part due to the relative lack of importance given to numeracy in our culture, and to the way that the subject has been presented to students." (Julian Havil , "Gamma: Exploring Euler's Constant", 2003)

"One can be highly functionally numerate without being a mathematician or a quantitative analyst. It is not the mathematical manipulation of numbers (or symbols representing numbers) that is central to the notion of numeracy. Rather, it is the ability to draw correct meaning from a logical argument couched in numbers. When such a logical argument relates to events in our uncertain real world, the element of uncertainty makes it, in fact, a statistical argument." (Eric R Sowey, "The Getting of Wisdom: Educating Statisticians to Enhance Their Clients' Numeracy", The American Statistician 57(2), 2003)

"Mathematics and numeracy are not congruent. Nor is numeracy an accidental or automatic by-product of mathematics education at any level. When the goal is numeracy some mathematics will be involved but mathematical skills alone do not constitute numeracy." (Theresa Maguire & John O'Donoghue, "Numeracy concept sophistication - an organizing framework, a useful thinking tool", 2003)

"Mathematical literacy is an individual’s capacity to identify and understand the role that mathematics plays in the world, to make well-founded judgements and to use and engage with mathematics in ways that meet the needs of that individual’s life as a constructive, concerned and reflective citizen." (OECD, "Assessing scientific, reading and mathematical literacy: a framework for PISA 2006", 2006)

"Statistical literacy is more than numeracy. It includes the ability to read and communicate the meaning of data. This quality makes people literate as opposed to just numerate. Wherever words (and pictures) are added to numbers and data in your communication, people need to be able to understand them correctly." (United Nations, "Making Data Meaningful" Part 4: "A guide to improving statistical literacy", 2012)

"When a culture is founded on the principle of immediacy of experience, there is no need for numeracy. It is impossible to consume more than one thing at a time, so differentiating between 'a small amount', 'a larger amount' and 'many' is enough for survival." (The Open University, "Understanding the environment: learning and communication", 2016)

27 December 2017

🗃️Data Management: Data Quality (Just the Quotes)

"[...] it is a function of statistical method to emphasize that precise conclusions cannot be drawn from inadequate data." (Egon S Pearson & H Q Hartley, "Biometrika Tables for Statisticians" Vol. 1, 1914)

"Not even the most subtle and skilled analysis can overcome completely the unreliability of basic data." (Roy D G Allen, "Statistics for Economists", 1951)

"The enthusiastic use of statistics to prove one side of a case is not open to criticism providing the work is honestly and accurately done, and providing the conclusions are not broader than indicated by the data. This type of work must not be confused with the unfair and dishonest use of both accurate and inaccurate data, which too commonly occurs in business. Dishonest statistical work usually takes the form of: (1) deliberate misinterpretation of data; (2) intentional making of overestimates or underestimates; and (3) biasing results by using partial data, making biased surveys, or using wrong statistical methods." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1951)

"Data are of high quality if they are fit for their intended use in operations, decision-making, and planning." (Joseph M Juran, 1964)

"There is no substitute for honest, thorough, scientific effort to get correct data (no matter how much it clashes with preconceived ideas). There is no substitute for actually reaching a correct chain of reasoning. Poor data and good reasoning give poor results. Good data and poor reasoning give poor results. Poor data and poor reasoning give rotten results." (Edmund C Berkeley, "Computers and Automation", 1969)

"Detailed study of the quality of data sources is an essential part of applied work. [...] Data analysts need to understand more about the measurement processes through which their data come. To know the name by which a column of figures is headed is far from being enough." (John W Tukey, "An Overview of Techniques of Data Analysis, Emphasizing Its Exploratory Aspects", 1982)

"We have found that some of the hardest errors to detect by traditional methods are unsuspected gaps in the data collection (we usually discovered them serendipitously in the course of graphical checking)." (Peter Huber, "Huge data sets", Compstat '94: Proceedings, 1994)

"Data obtained without any external disturbance or corruption are called clean; noisy data mean that a small random ingredient is added to the clean data." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"Probability theory is a serious instrument for forecasting, but the devil, as they say, is in the details - in the quality of information that forms the basis of probability estimates." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"Unfortunately, just collecting the data in one place and making it easily available isn’t enough. When operational data from transactions is loaded into the data warehouse, it often contains missing or inaccurate data. How good or bad the data is a function of the amount of input checking done in the application that generates the transaction. Unfortunately, many deployed applications are less than stellar when it comes to validating the inputs. To overcome this problem, the operational data must go through a 'cleansing' process, which takes care of missing or out-of-range values. If this cleansing step is not done before the data is loaded into the data warehouse, it will have to be performed repeatedly whenever that data is used in a data mining operation." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"If the data is usually bad, and you find that you have to gather some data, what can you do to do a better job? First, recognize what I have repeatedly said to you, the human animal was not designed to be reliable; it cannot count accurately, it can do little or nothing repetitive with great accuracy. [...] Second, you cannot gather a really large amount of data accurately. It is a known fact which is constantly ignored. It is always a matter of limited resources and limited time. [...] Third, much social data is obtained via questionnaires. But it a well documented fact the way the questions are phrased, the way they are ordered in sequence, the people who ask them or come along and wait for them to be filled out, all have serious effects on the answers." (Richard Hamming, "The Art of Doing Science and Engineering: Learning to Learn", 1997)

"Blissful data consist of information that is accurate, meaningful, useful, and easily accessible to many people in an organization. These data are used by the organization’s employees to analyze information and support their decision-making processes to strategic action. It is easy to see that organizations that have reached their goal of maximum productivity with blissful data can triumph over their competition. Thus, blissful data provide a competitive advantage." (Margaret Y Chu, "Blissful Data", 2004)

"Let’s define dirty data as: ‘… data that are incomplete, invalid, or inaccurate’. In other words, dirty data are simply data that are wrong. […] Incomplete or inaccurate data can result in bad decisions being made. Thus, dirty data are the opposite of blissful data. Problems caused by dirty data are significant; be wary of their pitfalls." (Margaret Y Chu, "Blissful Data", 2004)

"Processes must be implemented to prevent bad data from entering the system as well as propagating to other systems. That is, dirty data must be intercepted at its source. The operational systems are often the source of informational data; thus dirty data must be fixed at the operational data level. Implementing the right processes to cleanse data is, however, not easy." (Margaret Y Chu, "Blissful Data", 2004)

"Equally critical is to include data quality definition and acceptable quality benchmarks into the conversion specifications. No product design skips quality specifications. including quality metrics and benchmarks. Yet rare data conversion follows suit. As a result, nobody knows how successful the conversion project was until data errors get exposed in the subsequent months and years. The solution is to perform comprehensive data quality assessment of the target data upon conversion and compare the results with pre-defined benchmarks." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"Much data in databases has a long history. It might have come from old 'legacy' systems or have been changed several times in the past. The usage of data fields and value codes changes over time. The same value in the same field will mean totally different thing in different records. Knowledge or these facts allows experts to use the data properly. Without this knowledge, the data may bc used literally and with sad consequences. The same is about data quality. Data users in the trenches usually know good data from bad and can still use it efficiently. They know where to look and what to check. Without these experts, incorrect data quality assumptions are often made and poor data quality becomes exposed." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"The big part of the challenge is that data quality does not improve by itself or as a result of general IT advancements. Over the years, the onus of data quality improvement was placed on modern database technologies and better information systems. [...] In reality, most IT processes affect data quality negatively, Thus, if we do nothing, data quality will continuously deteriorate to the point where the data will become a huge liability." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"While we might attempt to identify and correct most data errors, as well as try to prevent others from entering the database, the data quality will never be perfect. Perfection is practically unattainable in data quality as with the quality of most other products. In truth, it is also unnecessary since at some point improving data quality becomes more expensive than leaving it alone. The more efficient our data quality program, the higher level of quality we will achieve- but never will it reach 100%. However, accepting imperfection is not the same as ignoring it. Knowledge of the data limitations and imperfections can help use the data wisely and thus save time and money, The challenge, of course, is making this knowledge organized and easily accessible to the target users. The solution is a comprehensive integrated data quality meta data warehouse." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"Achieving a high level of data quality is hard and is affected significantly by organizational and ownership issues. In the short term, bandaging problems rather than addressing the root causes is often the path of least resistance." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Communicate loudly and widely where there are data quality problems and the associated risks with deploying BI tools on top of bad data. Also advise the different stakeholders on what can be done to address data quality problems - systematically and organizationally. Complaining without providing recommendations fixes nothing." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Data quality is such an important issue, and yet one that is not well understood or that excites business users. It’s often perceived as being a problem for IT to handle when it’s not: it’s for the business to own and correct." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Depending on the extent of the data quality issues, be careful about where you deploy BI. Without a reasonable degree of confidence in the data quality, BI should be kept in the hands of knowledge workers and not extended to frontline workers and certainly not to customers and suppliers. Deploy BI in this limited fashion as data quality issues are gradually exposed, understood, and ultimately, addressed. Don’t wait for every last data quality issue to be resolved; if you do, you will never deliver any BI capabilities, business users will never see the problem, and quality will never improve." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Our culture, obsessed with numbers, has given us the idea that what we can measure is more important than what we can't measure. Think about that for a minute. It means that we make quantity more important than quality." (Donella Meadows, "Thinking in Systems: A Primer", 2008)

"The data architecture is the most important technical aspect of your business intelligence initiative. Fail to build an information architecture that is flexible, with consistent, timely, quality data, and your BI initiative will fail. Business users will not trust the information, no matter how powerful and pretty the BI tools. However, sometimes it takes displaying that messy data to get business users to understand the importance of data quality and to take ownership of a problem that extends beyond business intelligence, to the source systems and to the organizational structures that govern a company’s data." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Many new data scientists tend to rush past it to get their data into a minimally acceptable state, only to discover that the data has major quality issues after they apply their (potentially computationally intensive) algorithm and get a nonsense answer as output. (Sandy Ryza, "Advanced Analytics with Spark: Patterns for Learning from Data at Scale", 2009)

"Access to more information isn’t enough - the information needs to be correct, timely, and presented in a manner that enables the reader to learn from it. The current network is full of inaccurate, misleading, and biased information that often crowds out the valid information. People have not learned that 'popular' or 'available' information is not necessarily valid." (Gene Spafford, 2010)

"Are data quality and data governance the same thing? They share the same goal, essentially striving for the same outcome of optimizing data and information results for business purposes. Data governance plays a very important role in achieving high data quality. It deals primarily with orchestrating the efforts of people, processes, objectives, technologies, and lines of business in order to optimize outcomes around enterprise data assets. This includes, among other things, the broader cross-functional oversight of standards, architecture, business processes, business integration, and risk and compliance. Data governance is an organizational structure that oversees the compliance and standards of enterprise data." (Neera Bhansali, "Data Governance: Creating Value from Information Assets", 2014)

"Data governance is about putting people in charge of fixing and preventing data issues and using technology to help aid the process. Any time data is synchronized, merged, and exchanged, there have to be ground rules guiding this. Data governance serves as the method to organize the people, processes, and technologies for data-driven programs like data quality; they are a necessary part of any data quality effort." (Neera Bhansali, "Data Governance: Creating Value from Information Assets", 2014)

"Having data quality as a focus is a business philosophy that aligns strategy, business culture, company information, and technology in order to manage data to the benefit of the enterprise. Data quality is an elusive subject that can defy measurement and yet be critical enough to derail a single IT project, strategic initiative, or even an entire company." (Neera Bhansali, "Data Governance: Creating Value from Information Assets", 2014)

"Accuracy and coherence are related concepts pertaining to data quality. Accuracy refers to the comprehensiveness or extent of missing data, performance of error edits, and other quality assurance strategies. Coherence is the degree to which data - item value and meaning are consistent over time and are comparable to similar variables from other routinely used data sources." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"How good the data quality is can be looked at both subjectively and objectively. The subjective component is based on the experience and needs of the stakeholders and can differ by who is being asked to judge it. For example, the data managers may see the data quality as excellent, but consumers may disagree. One way to assess it is to construct a survey for stakeholders and ask them about their perception of the data via a questionnaire. The other component of data quality is objective. Measuring the percentage of missing data elements, the degree of consistency between records, how quickly data can be retrieved on request, and the percentage of incorrect matches on identifiers (same identifier, different social security number, gender, date of birth) are some examples." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"When we find data quality issues due to valid data during data exploration, we should note these issues in a data quality plan for potential handling later in the project. The most common issues in this regard are missing values and outliers, which are both examples of noise in the data." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"A popular misconception holds that the era of Big Data means the end of a need for sampling. In fact, the proliferation of data of varying quality and relevance reinforces the need for sampling as a tool to work efficiently with a variety of data, and minimize bias. Even in a Big Data project, predictive models are typically developed and piloted with samples." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Metadata is the key to effective data governance. Metadata in this context is the data that defines the structure and attributes of data. This could mean data types, data privacy attributes, scale, and precision. In general, quality of data is directly proportional to the amount and depth of metadata provided. Without metadata, consumers will have to depend on other sources and mechanisms." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"The quality of data that flows within a data pipeline is as important as the functionality of the pipeline. If the data that flows within the pipeline is not a valid representation of the source data set(s), the pipeline doesn’t serve any real purpose. It’s very important to incorporate data quality checks within different phases of the pipeline. These checks should verify the correctness of data at every phase of the pipeline. There should be clear isolation between checks at different parts of the pipeline. The checks include checks like row count, structure, and data type validation." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Are your insights based on data that is accurate and reliable? Trustworthy data is correct or valid, free from significant defects and gaps. The trustworthiness of your data begins with the proper collection, processing, and maintenance of the data at its source. However, the reliability of your numbers can also be influenced by how they are handled during the analysis process. Clean data can inadvertently lose its integrity and true meaning depending on how it is analyzed and interpreted." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"First, from an ethos perspective, the success of your data story will be shaped by your own credibility and the trustworthiness of your data. Second, because your data story is based on facts and figures, the logos appeal will be integral to your message. Third, as you weave the data into a convincing narrative, the pathos or emotional appeal makes your message more engaging. Fourth, having a visualized insight at the core of your message adds the telos appeal, as it sharpens the focus and purpose of your communication. Fifth, when you share a relevant data story with the right audience at the right time (kairos), your message can be a powerful catalyst for change." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"The one unique characteristic that separates a data story from other types of stories is its fundamental basis in data. [...] The building blocks of every data story are quantitative or qualitative data, which are frequently the results of an analysis or insightful observation. Because each data story is formed from a collection of facts, each one represents a work of nonfiction. While some creativity may be used in how the story is structured and delivered, a true data story won’t stray too far from its factual underpinnings. In addition, the quality and trustworthiness of the data will determine how credible and powerful the data story is." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"Data is dirty. Let's just get that out there. How is it dirty? In all sorts of ways. Misspelled text values, date format problems, mismatching units, missing values, null values, incompatible geospatial coordinate formats, the list goes on and on." (Ben Jones, "Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations", 2020)

"Bad data makes bad models. Bad models instruct people to make ineffective or harmful interventions. Those bad interventions produce more bad data, which is fed into more bad models." (Cory Doctorow, "Machine Learning’s Crumbling Foundations", 2021)

"[...] data mesh introduces a fundamental shift that the owners of the data products must communicate and guarantee an acceptable level of quality and trustworthiness - specific to their domain - as an intrinsic characteristic of their data product. This means cleansing and running automated data integrity tests at the point of the creation of a data product." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Ensure you build into your data literacy strategy learning on data quality. If the individuals who are using and working with data do not understand the purpose and need for data quality, we are not sitting in a strong position for great and powerful insight. What good will the insight be, if the data has no quality within the model?" (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"[...] the governance function is accountable to define what constitutes data quality and how each data product communicates that in a standard way. It’s no longer accountable for the quality of each data product. The platform team is accountable to build capabilities to validate the quality of the data and communicate its quality metrics, and each domain (data product owner) is accountable to adhere to the quality standards and provide quality data products." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Bad data is costly to fix, and it’s more costly the more widespread it is. Everyone who has accessed, used, copied, or processed the data may be affected and may require mitigating action on their part. The complexity is further increased by the fact that not every consumer will “fix” it in the same way. This can lead to divergent results that are divergent with others and can be a nightmare to detect, track down, and rectify." (Adam Bellemare, "Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures", 2023)

"Data has historically been treated as a second-class citizen, as a form of exhaust or by-product emitted by business applications. This application-first thinking remains the major source of problems in today’s computing environments, leading to ad hoc data pipelines, cobbled together data access mechanisms, and inconsistent sources of similar-yet-different truths. Data mesh addresses these shortcomings head-on, by fundamentally altering the relationships we have with our data. Instead of a secondary by-product, data, and the access to it, is promoted to a first-class citizen on par with any other business service." (Adam Bellemare, "Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures", 2023)

"In truth, no one knows how much bad data quality costs a company – even companies with mature data quality initiatives in place, who are measuring hundreds of data points for their quality struggle to accurately measure quantitative impact. This is often a deal-breaker for senior leaders when trying to get approval for a budget for data quality work. Data quality initiatives often seek substantial budgets and are up against projects with more tangible benefits." (Robert Hawker, "Practical Data Quality", 2023)

"The biggest mistake that can be made in a data quality initiative is focusing on the wrong data. If you fix data that does not impact a critical business process or drive important decisions, your initiative simply will not make the difference that you want it to." (Robert Hawker, "Practical Data Quality", 2023)

"The data should be monitored in the source, it should be corrected in the source, and it should then feed the secondary source(s) with high-quality data that can be used without workarounds. The reduction in workarounds will make the data engineers, scientists, and data visualization specialists much more productive." (Robert Hawker, "Practical Data Quality", 2023)

"The problem of bad data has existed for a very long time. Data copies diverge as their original source changes. Copies get stale. Errors detected in one data set are not fixed in duplicate ones. Domain knowledge related to interpreting and understanding data remains incomplete, as does support from the owners of the original data." (Adam Bellemare, "Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures", 2023)

"Errors using inadequate data are much less than those using no data at all." (Charles Babbage)