01 February 2018

🔬Data Science: Exploratory Data Analysis (Definitions)

"Exploratory data analysis (EDA) is a collection of techniques that reveal (or search for) structure in a data set before calculating any probabilistic model. Its purpose is to obtain information about the data distribution (univariate or multivariate), about the presence of outliers and clusters, to disclose relationships and correlations between objects and/or variables." (Ildiko E  Frank & Roberto Todeschini, "The Data Analysis Handbook", 1994)

"Processes and methods for exploring patterns and trends in the data that are not known prior to the analysis. It makes heavy use of graphs, tables, and statistics." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2007)

"The process of analyzing data to suggest hypotheses using statistical tools, which can then be tested." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"In statistics, exploratory data analysis is an approach to analyzing datasets to summarize their main characteristics, often with visual methods." (Keith Holdaway, "Harness Oil and Gas Big Data with Analytics", 2014)

"Process in which data patterns guide the analysis or suggest revisions to the preliminary data analysis plan." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"Exploratory Data Analysis is about taking a dataset and extracting the most important information from it, in such a way that it is possible to get an idea of what the data looks like." (Richard M Reese et al, Java: Data Science Made Easy, 2017)

🔬Data Science: MapReduce (Definitions)

"A data processing and aggregation paradigm consisting of a 'map' phase that selects data and a 'reduce' phase that transforms the data. In MongoDB, you can run arbitrary aggregations over data using map-reduce." (MongoDb, "Glossary", 2008)

"A divide-and-conquer strategy for processing large data sets in parallel. In the 'map' phase, the data sets are subdivided. The desired computation is performed on each subset. The 'reduce' phase combines the results of the subset calculations into a final result. MapReduce frameworks handle the details of managing the operations and the nodes they run on, including restarting operations that fail for some reason. The user of the framework only has to write the algorithms for mapping and reducing the data sets and computing with the subsets." (Dean Wampler & Alex Payne, "Programming Scala", 2009)

"A method by which computationally intensive problems can be processed on multiple computers in parallel. The method can be divided into a mapping step and a reducing step. In the mapping step, a master computer divides a problem into smaller problems that are distributed to other computers. In the reducing step, the master computer collects the output from the other computers. Although MapReduce is intended for Big Data resources, holding petabytes of data, most Big Data problems do not require MapReduce." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"An early Big Data (before this term became popular) programming solution originally developed by Google for parallel processing using very large data sets distributed across a number of computing and storage systems. A Hadoop implementation of MapReduce is now available." (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)

"Designed by Google as a way of efficiently executing a set of functions against a large amount of data in batch mode. The 'map' component distributes the programming problem or tasks across a large number of systems and handles the placement of the tasks in a way that balances the load and manages recovery from failures. After the distributed computation is completed, another function called 'reduce' aggregates all the elements back together to provide a result." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A programming model consisting of two logical steps - Map and Reduce - for processing massively parallelizable problems across extremely large datasets using a large cluster of commodity computers." (Haoliang Wang et al, "Accessing Big Data in the Cloud Using Mobile Devices", Handbook of Research on Cloud Infrastructures for Big Data Analytics, 2014)

"Algorithm that is used to split massive data sets among many commodity hardware pieces in an effort to reduce computing time." (Billie Anderson & J Michael Hardin, "Harnessing the Power of Big Data Analytics", Encyclopedia of Business Analytics and Optimization, 2014)

"MapReduce is a parallel programming model proposed by Google and is used to distribute computing on clusters of computers for processing large data sets." (Jyotsna T Wassan, "Emergence of NoSQL Platforms for Big Data Needs", Encyclopedia of Business Analytics and Optimization, 2014)

"A concept which is an abstraction of the primitives ‘map’ and ‘reduce’. Most of the computations are carried by applying a ‘map’ operation to each global record in order to generate key/value pairs and then apply the reduce operation in order to combine the derived data appropriately." (P S Shivalkar & B K Tripathy, "Rough Set Based Green Cloud Computing in Emerging Markets", Encyclopedia of Information Science and Technology 3rd Ed., 2015) 

"A programming model that uses a divide and conquer method to speed-up processing large datasets, with a special focus on semi-structured data." (Alfredo Cuzzocrea & Mohamed M Gaber, "Data Science and Distributed Intelligence", Encyclopedia of Information Science and Technology 3rd Ed., 2015) 

"MapReduce is a programming model for general-purpose parallelization of data-intensive processing. MapReduce divides the processing into two phases: a mapping phase, in which data is broken up into chunks that can be processed by separate threads - potentially running on separate machines; and a reduce phase, which combines the output from the mappers into the final result." (Guy Harrison, "Next Generation Databases: NoSQL, NewSQL, and Big Data", 2015)

"MapReduce is a technological framework for processing parallelize-able problems across huge data sets using a large number of computers (nodes). […] MapReduce consists of two major steps: 'Map' and 'Reduce'. They are similar to the original Fork and Join operations in distributed systems, but they can consider a large number of computers that can be constructed based on the Internet cloud. In the Map-step, the master computer (a node) first divides the input into smaller sub-problems and then distributes them to worker computers (worker nodes). A worker node may also be a sub-master node to distribute the sub-problem into even smaller problems that will form a multi-level structure of a task tree. The worker node can solve the sub-problem and report the results back to its upper level master node. In the Reduce-step, the master node will collect the results from the worker nodes and then combine the answers in an output (solution) of the original problem." (Li M Chen et al, "Mathematical Problems in Data Science: Theoretical and Practical Methods", 2015)

"A programming model which process massive amounts of unstructured data in parallel and distributed cluster of processors." (Fatma Mohamed et al, "Data Streams Processing Techniques Data Streams Processing Techniques", Handbook of Research on Machine Learning Innovations and Trends, 2017)

"A data processing framework of Hadoop which provides data intensive computation of large data sets by dividing tasks across several machines and finally combining the result." (Rupali Ahuja, "Hadoop Framework for Handling Big Data Needs", Handbook of Research on Big Data Storage and Visualization Techniques, 2018)

"A high-level programming model, which uses the “map” and “reduce” functions, for processing high volumes of data." (Carson K.-S. Leung, "Big Data Analysis and Mining", Encyclopedia of Information Science and Technology 4th Ed., 2018)

"Is a computational paradigm for processing massive datasets in parallel if the computation fits a three-step pattern: map, shard and reduce. The map process is a parallel one. Each process executes on a different part of data and produces (key, value) pairs. The shard process collects the generated pairs, sorts and partitions them. Each partition is assigned to a different reduce process which produces a single result." (Venkat Gudivada et al, "Database Systems for Big Data Storage and Retrieval", Handbook of Research on Big Data Storage and Visualization Techniques, 2018)

"Is a programming model or algorithm for the processing of data using a parallel programming implementation and was originally used for academic purposes associated with parallel programming techniques. (Soraya Sedkaoui, "Understanding Data Analytics Is Good but Knowing How to Use It Is Better!", Big Data Analytics for Entrepreneurial Success, 2019)

"MapReduce is a style of programming based on functional programming that was the basis of Hadoop." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"Is a specific programming model, which as such represents a new approach to solving the problem of processing large amounts of differently structured data. It consists of two functions - Map (sorting and filtering data) and Reduce (summarizing intermediate results), and it is executed in parallel and distributed." (Savo Stupar et al, "Importance of Applying Big Data Concept in Marketing Decision Making", Handbook of Research on Applied AI for International Business and Marketing Applications, 2021)

"A software framework for processing vast amounts of data." (Analytics Insight)

29 January 2018

🔬Data Science: Data Products (Definitions)

"Broadly defined, data means events that are captured and made available for analysis. A data source is a consistent record of these events. And a data product translates this record of events into something that can easily be understood." (Richard Galentino; et al, "Data Fluency: Empowering Your Organization with Effective Data Communication", 2014)

"Self-adapting, broadly applicable economic engines that derive their value from data and generate more data by influencing human behavior or by making inferences or predictions upon new data." (Benjamin Bengfort & Jenny Kim, "Data Analytics with Hadoop", 2016)

"Data products are software applications that derive value from data and in turn generate new data." (Rebecca Bilbro et al, "Applied Text Analysis with Python", 2018)

"[...] a product that facilitates an end goal through the use of data." (Ulrika Jägare, "Data Science Strategy For Dummies", 2019)

"Any computer software that uses data as inputs, produces outputs, and provides feedback based on the output to control the environment is referred to as a data product. A data product is generally based on a model developed during data analysis, for example, a recommendation model that inputs user purchase history and recommends a related item that the user is highly likely to buy." (Suresh K Mukhiya; Usman Ahmed, Hands-On Exploratory Data Analysis with Python, 2020)

"A data product is a product or service whose value is derived from using algorithmic methods on data, and which in turn produces data to be used in the same product, or tangential data products." (Statistics.com)

"A data product, in general terms, is any tool or application that processes data and generates results. […] Data products have one primary objective: to manage, organize and make sense of the vast amount of data that organizations collect and generate. It’s the users’ job to put the insights to use that they gain from these data products, take actions and make better decisions based on these insights." (Sisense) [source]

" A data product is digital information that can be purchased." (Techtarget) [source]

"A strategy for monetizing an organization’s data by offering it as a product to other parties." (Izenda) 

"An information product that is derived from observational data through any kind of computation or processing. This includes aggregation, analysis, modelling, or visualization processes." (Fixed-Point Open Ocean Observatories) 

"Data set or data set series that conforms to a data product specification." (ISO 19131)

28 January 2018

🔬Data Science: Regularization (Definitions)

"It is a formal concept based on fuzzy topology that removes geometric anomalies on fuzzy regions." (Markus Schneider, "Fuzzy Spatial Data Types for Spatial Uncertainty Management in Databases", 2008)

"It is any method of preventing overfitting of data by a model and it is used for solving ill-conditioned parameter-estimation problems." (Cecilio Angulo & Luis Gonzalez-Abril, "Support Vector Machines", 2009)

"Optimization of both complexity and performance of a neural network following a linear aggregation or a multi-objective algorithm." (M P Cuéllar et al, "Multi-Objective Training of Neural Networks", 2009)

"Including a term in the error function such that the training process favours networks of moderate size and complexity, that is, networks with small weights and few hidden units. The goal is to avoid overfitting and support generalization." (Frank Padberg, "Counting the Hidden Defects in Software Documents", 2010)

"It refers to the procedure of bringing in additional knowledge to solve an ill-posed problem or to avoid overfitting. This information appears habitually as a penalty term for complexity, such as constraints for smoothness or bounds on the norm." (Vania V Estrela et al, "Total Variation Applications in Computer Vision", 2016)

"This is a general method to avoid overfitting by applying additional constraints to the model that is learned. A common approach is to make sure the model weights are, on average, small in magnitude." (Rayid Ghani & Malte Schierholz, "Machine Learning", 2017)

"Regularization is a method of penalizing complex models to reduce their variance. Specifically, a penalty term is added to the loss function we are trying to minimize [...]" (Chris Albon, "Machine Learning with Python Cookbook", 2018)

"Regularization, generally speaking, is a wide range of ML techniques aimed at reducing overfitting of the models while maintaining theoretical expressive power." (Jonas Teuwen & Nikita Moriakov, "Convolutional neural networks", 2020)

26 January 2018

🔬Data Science: Standard Deviation (Definitions)

"A commonly used measure that defines the variation in a data set." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A measure of the variability in a set of data. It is calculated by taking the square root of the variance. Standard deviations are not additive; the variances are." (Clyde M Creveling, "Six Sigma for Technical Processes", 2006)

"The degree of dispersion of a group of scores around the average. If most scores are close to the average, the standard deviation is low. Conversely, if the scores are widely dispersed, the standard deviation is large." (Ruth C Clark, "Building Expertise: Cognitive Methods for Training and Performance Improvement", 2008)

"The measured range of economic volatility that can occur during the course of doing business." (Annetta Cortez & Bob Yehling, "The Complete Idiot's Guide® To Risk Management", 2010)

"A measure of how distributed the values of a probability curve are, relative to the average." (Jon Radoff, "Game On: Energize Your Business with Social Media Games", 2011)

"The amount of dispersal among test scores or other outcome results. A larger standard deviation indicates greater spread among test scores, while a smaller standard deviation indicates greater consistency among scores." (Ruth C Clark & Richard E Mayer, "e-Learning and the Science of Instruction", 2011)

"Describes dispersion about the data set’s mean. You can think of a standard deviation as an average deviation from the mean. See also average; variance." (E C Nelson & Stephen L Nelson, "Excel Data Analysis For Dummies ", 2015)

"Square root of variance. The standard deviation is an index of variability in the distribution of scores." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"the square root of the variance of a sample or distribution. For well-behaved, reasonably symmetric data distributions without long tails, we would expect most of the observations to lie within two sample standard deviations from the sample mean." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

25 January 2018

🔬Data Science: Regression Analysis (Definitions)

"A set of statistical operations that helps to predict the value of the dependent variable from the values of one or more independent variables." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling 2nd Ed.", 2005)

"A statistical tool that measures the strength of relationship between one or more independent variables with a dependent variable. It builds upon the correlation concepts to develop an empirical, databased model. Correlation describes the X and Y relationship with a single number (the Pearson’s Correlation Coefficient (r)), whereas regression summarizes the relationship with a line - the regression line." (Lynne Hambleton, "Treasure Chest of Six Sigma Growth Methods, Tools, and Best Practices", 2007)

"A statistical procedure for estimating mathematically the average relationship between the dependent variable (e.g., sales) and one or more independent variables (e.g., price and advertising)." (Jae K Shim & Joel G Siegel, "Budgeting Basics and Beyond", 2008)

"Regression analysis is a statistical technique for estimating the relationship between a set of predictors (independent variables) and an outcome variable (dependent variable). Linear least-squares regression, in which the relationship is expressed in a linear form, is the most common type of regression analysis. The mathematical model used in least-squares linear regression is often called the general linear model (GLM)." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

"A statistical technique which seeks to find a line which best fits through a set of data as plotted on a graph, seeking to find the cleanest path which deviates the least from any instance within the set." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[regression] "Using one data set to predict the results of a second." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The statistical process of predicting one or more continuous variables, such as profit or loss, based on other attributes in the dataset." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A family of methods for fitting a line or curve to a dataset, used to simplify or make sense of a number of apparently random data points." (Meta S Brown, "Data Mining For Dummies", 2014)

"An analytic technique where a series of input variables are examined in relation to their corresponding output results in order to develop a mathematical or statistical relationship." (For Dummies, "PMP Certification All-in-One For Dummies" 2nd Ed., 2013)

"A statistical technique for estimating relationships between variables." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

 "Process to statistically estimate the relationship between different attributes." (Sanjiv K Bhatia & Jitender S Deogun, "Data Mining Tools: Association Rules", 2014)

"Plotting pairs of independent and dependent variables in an XY chart and then finding a linear or exponential equation that best describes the plotted data." (E C Nelson & Stephen L Nelson, "Excel Data Analysis For Dummies", 2015)

"A statistical procedure that produces an equation for predicting a variable (the criterion measure) from one or more other variables (the predictor measures)." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A statistical technique used to estimate the mathematical relationship between a dependent variable, such as quantity demanded, and one or more explanatory variables, such as price and income." (Jeffrey M Perloff & James A Brander, "Managerial Economics and Strategy" 2nd Ed., 2016)

"A statistical process for estimating the relationships between variables, often used to forecast the change in a variable based on changes in other variables. Linear regression is used to analyze continuous variables, and logistic regression is used for discrete variables." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"In a machine learning context, regression is the task of assigning scalar value to examples." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"Algorithms used to predict values for new data based on training data fed into the system. Areas where regression in machine learning is used to predict future values include drug response modeling, marketing, real estate and financial forecasting." (Accenture)

"To define the dependency between variables. It assumes a one-way causal effect from one variable to the response of another variable." (Analytics Insight)

24 January 2018

🔬Data Science: Data Processing (Definitions)

"The act of turning raw data into meaningful output, generally associated with computers." (Greg Perry, "Sams Teach Yourself Beginning Programming in 24 Hours" 2nd Ed., 2001)

"Any process that converts data into information. The processing is usually assumed to be automated and running on an information system." (Eleutherios A Papathanassiou & Xenia J Mamakou, "Privacy Issues in Public Web Sites", Handbook of Research on Public Information Technology, 2008) 

"Obtaining, recording or holding the data, or carrying out any operation on the data, including organising, adapting or altering it. Retrieval, consultation or use of the data, disclosure of the data, and alignment, combination, blocking, erasure or destruction of the data are all legally classed as processing." (Mark Olive, "SHARE: A European Healthgrid Roadmap", 2009)

"The operation performed on data through capture, transformation, and storage, in order to derive new information according to a given set of rules." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Collection and elaboration of sensing data with the aim to derivate/infer new knowledge from original raw data." (Paolo Bellavista et al, "Crowdsensing in Smart Cities: Technical Challenges, Open Issues, and Emerging Solution Guidelines", 2015)

"The act of data manipulation through integration of mathematical tools, statistics, and computer application to generate information." (Babangida Zubairu, "Security Risks of Biomedical Data Processing in Cloud Computing Environment", 2018)

"Any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction." (Yordanka Ivanova, "Data Controller, Processor, or Joint Controller: Towards Reaching GDPR Compliance in a Data- and Technology-Driven World", 2020)

"Data processing is any action performed to turn raw data into useful information." (Xplenty) [source]

"Data processing occurs when data is collected and translated into usable information. […] Data processing starts with data in its raw form and converts it into a more readable format (graphs, documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized by employees throughout an organization." (Talend) [source]

19 January 2018

🔬Data Science: Structured Data (Definitions)

"Data that has a strict metadata defined, such as a SQL Server table’s column." (Victor Isakov et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance (70-444) Study Guide", 2007)

"Data that has enforced composition to specified datatypes and relationships and is managed by technology that allows for querying and reporting." (Keith Gordon, "Principles of Data Management", 2007)

"Database data, such as OLTP (Online Transaction Processing System) data, which can be sorted." (David G Hill, "Data Protection: Governance, Risk Management, and Compliance", 2009)

"A collection of records or data that is stored in a computer; records maintained in a database or application." (Robert F Smallwood, "Managing Electronic Records: Methods, Best Practices, and Technologies", 2013)

"Data that has a defined length and format. Examples of structured data include numbers, dates, and groups of words and numbers called strings (for example, a customer’s name, address, and so on)." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"Data that fits cleanly into a predefined structure." (Evan Stubbs, "Big Data, Big Innovation", 2014)

"Data that is described by a data model, for example, business data in a relational database" (Hasso Plattner, "A Course in In-Memory Data Management: The Inner Mechanics of In-Memory Databases" 2nd Ed., 2014)

"Data that is managed by a database management system" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"In statistics and data mining, any type of data whose values have clearly defined meaning, such as numbers and categories." (Meta S Brown, "Data Mining For Dummies", 2014)

"Data that adheres to a strict definition." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"Data that has a defined length and format. Examples of structured data include numbers, dates, and groups of words and numbers called strings (for example, for a customer’s name, address, and so on)." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"Data that resides in a fixed field within a file or individual record, such as a row & column database." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"Information that sits in a database, file, or spreadsheet. It is generally organized and formatted. In retail, this data can be point-of-sale data, inventory, product hierarchies, or others." (Brittany Bullard, "Style and Statistics", 2016)

"A data field of a definable data type, usually of a specified size or range, that can be easily processed by a computer." (George Tillmann, "Usage-Driven Database Design: From Logical Data Modeling through Physical Schmea Definition", 2017)

"Data that can be stored in a table. Every instance in the table has the same set of attributes. Contrast with unstructured data." (John D Kelleher & Brendan Tierney, "Data science", 2018)

"Data that is identifiable as it is organized in structure like rows and columns. The data resides in fixed fields within a record or file or the data is tagged correctly and can be accurately identified." (Analytics Insight)

"Refers to information with a high degree of organization, meaning that it can be seamlessly included in a relational database and quickly searched by straightforward search engine algorithms and/or other search operations. Structured data examples include dates, numbers, and groups of words and number 'strings'. Machine-generated structured data is on the increase and includes sensor data and financial data." (Accenture)

15 January 2018

🔬Data Science: Semi-Structured Data (Definitions)

"Data that has flexible metadata, such as XML." (Marilyn Miller-White et al, "MCITP Administrator: Microsoft® SQL Server™ 2005 Optimization and Maintenance 70-444", 2007)

"'Text' documents, such as e-mail, word processing, presentations, and spreadsheets, whose content can be searched." (David G Hill, "Data Protection: Governance, Risk Management, and Compliance", 2009)

"Data that, although unstructured, still has some degree of structure. A good example is e-mail: Even though it is predominantly text, it has logical blocks with different purposes." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"Data that have already been processed to some extent." (Carlos Coronel & Steven Morris, "Database Systems: Design, Implementation, & Management" 11th Ed., 2014)

"A structured data type that does not have a formal definition, like a document. It has tags or other markers to enforce a hierarchy of records within a particular object, but may be different from another object." (Jason Williamson, Getting a Big Data Job For Dummies, 2015)

"Semi-structured data has some structures that are often manifested in images and data from sensors." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"a form a structured data that does not have a formal structure like structured data. It does however have tags or other markers to enforce hierarchy of records." (Analytics Insight)

🔬Data Science: Big Data (Definitions)

"Big Data: when the size and performance requirements for data management become significant design and decision factors for implementing a data management and analysis system. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration." (Jimmy Guterman, 2009)

"A buzzword for the challenges of and approaches to working with data sets that are too big to manage with traditional tools, such as relational databases. So called NoSQL databases, clustered data processing tools like MapReduce, and other tools are used to gather, store, and analyze such data sets." (Dean Wampler, "Functional Programming for Java Developers", 2011)

"Big data: techniques and technologies that make handling data at extreme scale economical." (Brian Hopkins, "Big Data, Brewer, And A Couple Of Webinars", 2011) [source]

"Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value." (McKinsey & Co., "Big Data: The Next Frontier for Innovation, Competition, and Productivity", 2011)

"Data volumes that are exceptionally large, normally greater than 100 Terabyte and more commonly refer to the Petabyte and Exabyte range. Big data has begun to be used when discussing Data Warehousing and analytic solutions where the volume of data poses specific challenges that are unique to very large volumes of data including: data loading, modeling, cleansing, and analytics, and are often solved using massively parallel processing, or parallel processing and distributed data solutions." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it." (Edd Wilder-James, "What is big data?", 2012) [source]

"A collection of data whose very size, rate of accumulation, or increased complexity makes it difficult to analyze and comprehend in a timely and accurate manner." (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)

"A colloquial term referring to exceedingly large datasets that are otherwise unwieldy to deal with in a reasonable amount of time in the absence of specialized tools. They are different from normal data in terms of volume, velocity, and variety and typically require unique approaches for capture, processing, analysis, search, and visualization." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"Big data is the term increasingly used to describe the process of applying serious computing power – the latest in machine learning and artificial intelligence – to seriously massive and often highly complex sets of information." (Microsoft, 2013) [source]

"Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away." (Tim O’Reilly, [email correspondence, 2013)

"The capability to manage a huge volume of disparate data, at the right speed and within the right time frame, to allow real-time analysis and reaction. Big data is typically broken down by three characteristics, including volume (how much data), velocity (how fast that data is processed), and variety (the various types of data)." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A colloquial term referring to datasets that are otherwise unwieldy to deal with in a reasonable amount of time in the absence of specialized tools. Common characteristics include large amounts of data (volume), different types of data (variety), and ever-increasing speed of generation (velocity). They typically require unique approaches for capture, processing, analysis, search, and visualization." (Evan Stubbs, "Big Data, Big Innovation", 2014)

"An extremely large database which generally defies standard methods of analysis." (Owen P. Hall Jr., "Teaching and Using Analytics in Management Education", 2014)

"Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze." (Xiuli He et al, Supply Chain Analytics: Challenges and Opportunities, 2014)

"More data than can be processed by today's database systems, or acutely high volume, velocity, and variety of information assets that demand IG to manage and leverage for decision-making insights and cost management." (Robert F Smallwood, "Information Governance: Concepts, Strategies, and Best Practices", 2014)

"The term that refers to data that has one or more of the following dimensions, known as the four Vs: Volume, Variety, Velocity, and Veracity." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"A collection of models, techniques and algorithms that aim at representing, managing, querying and mining large-scale amounts of data (mainly semi-structured data) in distributed environments (e.g., Clouds)." (Alfredo Cuzzocrea & Mohamed M Gaber, "Data Science and Distributed Intelligence", 2015)

"A process to deliver decision-making insights. The process uses people and technology to quickly analyze large amounts of data of different types (traditional table structured data and unstructured data, such as pictures, video, email, and Tweets) from a variety of sources to produce a stream of actionable knowledge." (James R Kalyvas & Michael R Overly, "Big Data: A Businessand Legal Guide", 2015)

"A relative term referring to data that is difficult to process with conventional technology due to extreme values in one or more of three attributes: volume (how much data must be processed), variety (the complexity of the data to be processed) and velocity (the speed at which data is produced or at which it arrives for processing). As data management technologies improve, the threshold for what is considered big data rises. For example, a terabyte of slow-moving simple data was once considered big data, but today that is easily managed. In the future, a yottabyte data set may be manipulated on desktop, but for now it would be considered big data as it requires extraordinary measures to process." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"Big data is a discipline that deals with processing, storing, and analyzing heterogeneous (structured/semistructured/unstructured) large data sets that cannot be handled by traditional information management technologies that have been used to process structured data. Gartner defined big data based on the three Vs: volume, velocity, and variety." (Saumya Chaki, "Enterprise Information Management in Practice", 2015)

"Records that are so large (terabytes and exabytes) and diverse (from sensors to social media data) that they require new, powerful technologies for storage, management, analysis and visualization." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"Term used to describe the exponential growth, variety, and availability of data, both structured and unstructured." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"A broad term for large and complex data sets that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set." (Suren Behari, "Data Science and Big Data Analytics in Financial Services: A Case Study", 2016)

"A combination of facts and artifacts drawn from a myriad of sources and stored without regard to rational or normalized disciplines or structures." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"A term that describes a large dataset that grows in size over time. It refers to the size of dataset that exceeds the capturing, storage, management, and analysis of traditional databases. The term refers to the dataset that has large, more varied, and complex structure, accompanies by difficulties of data storage, analysis, and visualization. Big Data are characterized with their high-volume, -velocity and –variety information assets." (Kenneth C C Yang & Yowei Kang, "Real-Time Bidding Advertising: Challenges and Opportunities for Advertising Curriculum, Research, and Practice", 2016)

"Big data is a blanket term for any collection of data sets so large or complex that it becomes difficult to process them using traditional data management techniques such as, for example, the RDBMS (relational database management systems)." (Davy Cielen et al, "Introducing Data Science", 2016)

"For digital resources, inexpensive storage and high bandwidth have largely eliminated capacity as a constraint for organizing systems, with an exception for big data, which is defined as a collection of data that is too big to be managed by typical database software and hardware architectures." (Robert J Glushko, "The Discipline of Organizing: Professional Edition, 4th Ed", 2016)

"Large sets of data that are leveraged to make better business decisions. Retail data can be sales, product inventory, e-mail offers, customer information, competitor pricing, product descriptions, social media, and much more." (Brittany Bullard, "Style and Statistics", 2016)

"A term used to describe large sets of structured and unstructured data. Data sets are continually increasing in size and may grow too large for traditional storage and retrieval. Data may be captured and analyzed as it is created and then stored in files." (Daniel J Power & Ciara Heavin, "Decision Support, Analytics, and Business Intelligence" 3rd Ed., 2017)

"Datasets of structured and unstructured information that are so large and complex that they cannot be adequately processed and analyzed with traditional data tools and applications. |" (Jonathan Ferrar et al, "The Power of People", 2017)

"Big data are often defined in terms of the three Vs: the extreme volume of data, the variety of the data types, and the velocity at which the data must be processed." (John D Kelleher & Brendan Tierney, "Data science", 2018)

"Very large data volumes that are complex and varied, and often collected and must be analyzed in real time." (Daniel J. Power & Ciara Heavin, "Data-Based Decision Making and Digital Transformation", 2018)

"A generic term that designates the massive volume of data that is generated by the increasing use of digital tools and information systems. The term big data is used when the amount of data that an organization has to manage reaches a critical volume that requires new technological approaches in terms of storage, processing, and usage. Volume, velocity, and variety are usually the three criteria used to qualify a database as 'big data'." (Soraya Sedkaoui, "Big Data Analytics for Entrepreneurial Success", 2019)

"Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation." (Thomas Ochs & Ute A Riemann, "IT Strategy Follows Digitalization", 2019)

"The capability to manage a huge volume of disparate data, at the right speed and within the right time frame, to allow real time analysis and reaction." (K Hariharanath, "BIG Data: An Enabler in Developing Business Models in Cloud Computing Environments", 2019)

"A term used to refer to the massive datasets generated in the digital age. Both the volume and speed at which data are generated is far greater than in the past and requires powerful computing technologies." (Osman Kandara & Eugene Kennedy, "Educational Data Mining: A Guide for Educational Researchers", 2020)

"Refers to data sets that are so voluminous and complex that traditional data processing application software is inadequate to deal with them." (James O Odia & Osaheni T Akpata, "Role of Data Science and Data Analytics in Forensic Accounting and Fraud Detection", 2021)

"The evolving term that describes a large volume of structured, semi-structured and unstructured data that has the potential to be mined for information and used in machine learning projects and other advanced analytics applications." (Nenad Stefanovic, "Big Data Analytics in Supply Chain Management", 2021)

"The term 'big data' is related to gathering and storing extra-large volume of structured, semi-structured and unstructured data with high Velocity and Variability to be used in advanced analytics applications." (Ahmad M Kabil, Integrating Big Data Technology Into Organizational Decision Support Systems, 2021)

"A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." (Board International) 

"A collection of data so large that it cannot be stored, transmitted or processed by traditional means." (Open Data Handbook) 

"an accumulation of data that is too large and complex for processing by traditional database management tools" (Merriam-Webster)

"Extremely large data sets that may be analyzed to reveal patterns and trends and that are typically too complex to be dealt with using traditional processing techniques." (Solutions Review)

"is a term for very large and complex datasets that exceed the ability of traditional data processing applications to deal with them. Big data technologies include data virtualization, data integration tools, and search and knowledge discovery tools." (Accenture)

"The practices and technology that close the gap between the data available and the ability to turn that data into business insight." (Forrester)

"Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage and process the data with low latency. Big data has one or more of the following characteristics: high volume, high velocity or high variety." (IBM) [source]

"Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves." (SAS) [source]

"Big data is a combination of structured, semistructured and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modeling and other advanced analytics applications." (Techtarget)

"Big data is a term used for large data sets that include structured, semi-structured, and unstructured data." (Xplenty) [source]

"Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation." (Gartner)

"Big data is the catch-all term used to describe gathering, analyzing, and storing massive amounts of digital information to improve operations." (Talend) [source]

"Big data refers to the 21st-century phenomenon of exponential growth of business data, and the challenges that come with it, including holistic collection, storage, management, and analysis of all the data that a business owns or uses." (Informatica) [source]

14 January 2018

🔬Data Science: Unstructured Data (Definitions)

"Data that does not neatly fit into a tabular structure with well-defined and bounded definitions. Examples of unstructured data are e-mail messages and video streams. Many customer databases contain comment fields where customer service reps put in additional notes about customers." (Jill Dyché & Evan Levy, "Customer Data Integration: Reaching a Single Version of the Truth", 2006)

"Computerised information which does not have a data structure that is easily readable by a machine, including audio, video and unstructured text such as the body of a word-processed document - effectively this is the same as multimedia data." (Keith Gordon, "Principles of Data Management", 2007)

"Data that has no metadata, such as text files." (Victor Isakov et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance (70-444) Study Guide", 2007)

"Natively bitmapped data, such as video, audio, pictures, and MRI scans, that can be sensed either visually, audibly, or both." (David G Hill, "Data Protection: Governance, Risk Management, and Compliance", 2009)

"Data that does not fit into a structured data model or does not fit well into relational tables. Common examples include binary information such as video or audio and free-text information." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"Data that does not follow a specified data format. Unstructured data can be text, video, images, and so on." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"Unstructured data has no real structure, such as the data in an email and a memo. Interestingly, estimates have 85% of all business information as unstructured data. There are now many products coming on the market that can put some structure into unstructured data so that it can be categorized or organized hierarchically." (Michael M David & Lee Fesperman, "Advanced SQL Dynamic Data Modeling and Hierarchical Processing", 2013)

"Data that exist in their original (raw) state; that is in the format in which they were collected." (Carlos Coronel & Steven Morris, "Database Systems: Design, Implementation, & Management  Ed. 11", 2014)

"Data whose logical organization is not apparent to the computer" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"Information (typically stored digitally) that either does not have a predefined data model or is not organized in a predefined manner. Most unstructured data is created by humans and includes email, documents, text messages, tweets, blogs, and more." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"Text, audio, video, and other types of complex data that won’t easily fit into a conventional relational database. Unstructured data isn’t as simple as the numbers and short strings that most data analysts use." (Meta S Brown, "Data Mining For Dummies", 2014)

"Data that cannot fit cleanly into a predefined structure." (Evan Stubbs, "Big Data, Big Innovation", 2014)

"Data without data model or that a computer program cannot easily use (in the sense of understanding its content). Examples are word processing documents or electronic mail" (Hasso Plattner, "A Course in In-Memory Data Management: The Inner Mechanics of In-Memory Databases" 2nd Ed., 2014)

"Data (generally text-based) which is not presented in a structured form such as a database, ontology, table, etc. Newspaper articles, government reports, blogs, and e-mails are all examples of unstructured data." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"Data that doesn’t fit into a fixed and strict definition. Things like sound files, images, text, and web pages can be considered unstructured data." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"Information that does not follow a specified data format. Unstructured data can be text, video, images, and such." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"Data that does not have a specific format. It can be customer reviews, tweets, pictures, or even hashtags." (Brittany Bullard, "Style and Statistics", 2016)

"A type of data where each instance in the data set may have its own internal structure; that is, the structure is not necessarily the same in every instance. For example, text data are often unstructured and require a sequence of operations to be applied to them in order to extract a structured representation for each instance." (John D Kelleher & Brendan Tierney, "Data science", 2018)

03 January 2018

🔬Data Science: Models (Definitions)

"A model is essentially a calculating engine designed to produce some output for a given input." (Richard C Lewontin, "Models, Mathematics and Metaphors", Synthese, Vol. 15, No. 2, 1963)

"A model is an abstract description of the real world. It is a simple representation of more complex forms, processes and functions of physical phenomena and ideas." (Moshe F Rubinstein & Iris R Firstenberg, "Patterns of Problem Solving", 1975)

"A model is an attempt to represent some segment of reality and explain, in a simplified manner, the way the segment operates." (E Frank Harrison, "The managerial decision-making process" , 1975)

"A model is a representation containing the essential structure of some object or event in the real world." (David W Stockburger, "Introductory Statistics", 1996)

"A model is a deliberately simplified representation of a much more complicated situation." (Robert M Solow, "How Did Economics Get That Way and What Way Did It Get?", Daedalus Vol. 126 (1), 1997)

"Models are synthetic sets of rules, pictures, and algorithms providing us with useful representations of the world of our perceptions and of their patterns." (Burton G Malkiel, "A Random Walk Down Wall Street", 1999)

"A model is an imitation of reality" (Ian T Cameron & Katalin M Hangos, "Process Modelling and Model Analysis", 2001)

"Models are replicas or representations of particular aspects and segments of the real world" (Paulraj Ponniah, "Data Modeling Fundamentals", 2007)

"A model is a simplification of reality." (Alexey Voinov, "Systems Science and Modeling for Ecological Economics", 2008)

"a model is a representation of reality intended for some definite purpose." (Michael Pidd, "Tools for Thinking" 3rd Ed., 2009)
"A model is a representation of some subject matter." (Alec Sharp & Patrick McDermott, "Workflow Modeling" 2nd Ed, 2009)

"An abstract representation of how something is built (or is to be built), or how something works (or is observed as working)." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A model is a simplified representation of a system. It can be conceptual, verbal, diagrammatic, physical, or formal (mathematical)." (Hiroki Sayama, "Introduction to the Modeling and Analysis of Complex Systems", 2015)

"A formal set of relationships that can be manipulated to test assumptions. A simulation that tests the number of units that can be processed each hour under a set of conditions is an example of a model. Models do not need to be graphical." (Appian)

"Model is simply a representation or simulation of some real-world phenomenon." (Accenture)

02 January 2018

🔬Data Science: Data (Definitions)

"Facts and figures used in computer programs." (Greg Perry, "Sams Teach Yourself Beginning Programming in 24 Hours" 2nd Ed., 2001)

"A representation of facts, concepts, or instructions suitable to permit communication, interpretation, or processing by humans or by automatic means. (2) Used as a synonym for documentation in U.S. government procurement regulations." (Richard D Stutzke, "Estimating Software-Intensive Systems: Projects, Products, and Processes", 2005)

"A recording of facts, concepts, or instructions on a storage medium for communication, retrieval, and processing by automatic means and presentation as information that is understandable by human beings." (William H Inmon, "Building the Data Warehouse", 2005)

"An atomic element of information. Represented as bits within mass storage devices, memory, and pprocessors." (Tom Petrocelli, "Data Protection and Information Lifecycle Management", 2005)

"Information documented by a language system representing facts, text, graphics, bitmapped images, sound, and analog or digital live-video segments. Data is the raw material of a system supplied by data producers and is used by information consumers to create information." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"A term applied to organized information." (Gavin Powell, "Beginning Database Design", 2006)

"Numeric information or facts collected through surveys or polls, measurements or observations that need to be effectively organized for decision making." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"Raw, unrelated numbers or entries, e.g., in a database; raw forms of transactional representations." (Martin J Eppler, "Managing Information Quality" 2nd Ed., 2006)

"Data is a representation of facts, concepts or instructions in a formalized manner suitable for communication, interpretation or processing by humans or automatic means." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"Numeric information or facts collected through surveys or polls, measurements or observations that need to be effectively organized for decision making." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2007)

"Hub A common approach for a technical implementation of a service-oriented MDM solution. Data Hubs store and manage some data attributes and the metadata containing the location of data attributes in external systems in order to create a single physical or federated trusted source of information about customers, products, and so on." (Alex Berson & Lawrence Dubov, "Master Data Management and Data Governance", 2010)

"Raw facts, that is, facts that have not yet been processed to reveal their meaning to the end user." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"Facts represented as text, numbers, graphics, images, sound, or video (with no additional defining context); the raw material used to create information." (Craig S Mullins, "Database Administration: The Complete Guide to DBA Practices and Procedures 2nd Ed", 2012)

"Data are abstract representations of selected characteristics of real-world objects, events, and concepts, expressed and understood through explicitly definable conventions related to their meaning, collection, and storage. We also use the term data to refer to pieces of information, electronically captured, stored (usually in databases), and capable of being shared and used for a range of organizational purposes."(Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Data are abstract representations of selected characteristics of real-world objects, events, and concepts, expressed and understood through explicitly definable conventions related to their meaning, collection, and storage. We also use the term data to refer to pieces of information, electronically captured, stored (usually in databases), and capable of being shared and used for a range of organizational purposes." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement", 2013)

"A collection of values assigned to base measures, derived measures and/or indicators." (David Sutton, "Information Risk Management: A practitioner’s guide", 2014)

"Raw facts, that is, facts that have not yet been processed to reveal their meaning to the end user." (Carlos Coronel & Steven Morris, "Database Systems: Design, Implementation, & Management"  11th Ed., 2014)

"A formalized (meaning suitable for further processing, interpretation and communication) representation of business objects or transactions." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"Data is a collection of one or more pieces if information." (Robert J Glushko, "The Discipline of Organizing: Professional Edition, 4th Ed", 2016)

"Facts about events, objects, and associations. Example: data about a sale would include date, amount, and method of payment." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"Discrete, unorganized, unprocessed measurements or raw observations." (Project Management Institute, "A Guide to the Project Management Body of Knowledge (PMBOK® Guide )", 2017)

"Any values from an application that can be transformed into facts and eventually information.." (Piethein Strengholt, "Data Management at Scale", 2020)

"A set of collected facts. There are two basic kinds of numerical data: measured or variable data … and counted or attribute data." (ASQ)
"A representation of information as stored or transmitted." (NISTIR 4734)

"A representation of information, including digital and non-digital formats." (NIST Privacy Framework Version 1.0)

"A variable-length string of zero or more (eight-bit) bytes." (NIST SP 800-56B Rev. 2)

"Any piece of information suitable for use in a computer." (NISTIR 7693)

"(1) Anything observed in the documentation or operation of software that deviates from expectations based on previously verified software products or reference documents.(2) A representation of facts, concepts, or instructions in a manner suitable for communication, interpretation, or processing by humans or by automatic means." (IEEE 610.5-1990)

"Data may be thought of as unprocessed atomic statements of fact. It very often refers to systematic collections of numerical information in tables of numbers such as spreadsheets or databases. When data is structured and presented so as to be useful and relevant for a particular purpose, it becomes information available for human apprehension. See also knowledge." (Open Data Handbook)

"Distinct pieces of digital information that have been formatted in a specific way." (NIST SP 800-86)

"Information in a specific representation, usually as a sequence of symbols that have meaning." (CNSSI 4009-2015 IETF RFC 4949 Ver 2)

"Pieces of information from which “understandable information” is derived." (NIST SP 800-88 Rev. 1)

"re-interpretable representation of information in a formalized manner suitable for communication, interpretation, or processing" (ISO 11179)

01 January 2018

🔬Data Science: Data Science (Definitions)

"A set of quantitative and qualitative methods that support and guide the extraction of information and knowledge from data to solve relevant problems and predict outcomes." (Xiuli He et al, "Supply Chain Analytics: Challenges and Opportunities", 2014)

"A collection of models, techniques and algorithms that focus on the issues of gathering, pre-processing, and making sense-out of large repositories of data, which are seen as 'data products'." (Alfredo Cuzzocrea & Mohamed M Gaber, "Data Science and Distributed Intelligence", 2015)

"Data science involves using methods to analyze massive amounts of data and extract the knowledge it contains. […] Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of data produced today. It adds methods from computer science to the repertoire of statistics." (Davy Cielen et al, "Introducing Data Science", 2016)

"The workflows and processes involved in the creation and development of data products." (Benjamin Bengfort & Jenny Kim, "Data Analytics with Hadoop", 2016)

"The discipline of analysis that helps relate data to the events and processes that produce and consume it for different reasons." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"The extraction of knowledge from large volumes of unstructured data which is a continuation of the field data mining and predictive analytics, also known as knowledge discovery and data mining (KDD)." (Suren Behari, "Data Science and Big Data Analytics in Financial Services: A Case Study", 2016)

"A knowledge acquisition from data through scientific method that comprises systematic observation, experiment, measurement, formulation, and hypotheses testing with the aim of discovering new ideas and concepts." (Babangida Zubairu, "Security Risks of Biomedical Data Processing in Cloud Computing Environment", 2018)

"Data science is a collection of techniques used to extract value from data. It has become an essential tool for any organization that collects, stores, and processes data as part of its operations. Data science techniques rely on finding useful patterns, connections, and relationships within data. Being a buzzword, there is a wide variety of definitions and criteria for what constitutes data science. Data science is also commonly referred to as knowledge discovery, machine learning, predictive analytics, and data mining. However, each term has a slightly different connotation depending on the context." (Vijay Kotu & Bala Deshpande, "Data Science" 2nd Ed., 2018)

"A field that builds on and synthesizes a number of relevant disciplines and bodies of knowledge, including statistics, informatics, computing, communication, management, and sociology to translate data into information, knowledge, insight, and intelligence for improving innovation, productivity, and decision making." (Zhaohao Sun, "Intelligent Big Data Analytics: A Managerial Perspective", 2019)

"Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data in various forms, both structured and unstructured similar to data mining." (K Hariharanath, "BIG Data: An Enabler in Developing Business Models in Cloud Computing Environments", 2019)

"Is a broad field that refers to the collective processes, theories, concepts, tools and technologies that enable the review, analysis, and extraction of valuable knowledge and information from raw data. It is geared toward helping individuals and organizations make better decisions from stored, consumed and managed data." (Maryna Nehrey & Taras Hnot, "Data Science Tools Application for Business Processes Modelling in Aviation", 2019)

"It is a new discipline that combines elements of mathematics, statistics, computer science, and data visualization. The objective is to extract information from data sources. In this sense, data science is devoted to database exploration and analysis. This discipline has recently received much attention due to the growing interest in big data." (Soraya Sedkaoui, "Big Data Analytics for Entrepreneurial Success", 2019)

"the study and application of techniques for deriving insights from data, including constructing algorithms for prediction. Traditional statistical science forms part of data science, which also includes a strong element of coding and data management." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"A relatively new term applied to an interdisciplinary field of study focused on methods for collecting, maintaining, processing, analyzing and presenting results from large datasets." (Osman Kandara & Eugene Kennedy, "Educational Data Mining: A Guide for Educational Researchers", 2020)

"Data Science is the branch of science that uses technologies to predict the upcoming nature of different things such as a market or weather conditions. It shows a wide usage in today’s world." (Kirti R Bhatele, "Data Analysis on Global Stratification", 2020)

"Data science is a methodical form of integrating statistics, algorithms, scientific methods, models and visualization methods for interpretation of outcomes in organizational problem solving and fact based decision making." (Tanushri Banerjee & Arindam Banerjee, "Designing a Business Analytics Culture in Organizations in India", 2021)

"Data science is a multi-disciplinary field that follows scientific approaches, methods, and processes to extract knowledge and insights from structured, semi-structured and unstructured data." (Ahmad M Kabil, Integrating Big Data Technology Into Organizational Decision Support Systems, 2021)

"Data Science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights." (R Suganya et al, "A Literature Review on Thyroid Hormonal Problems in Women Using Data Science and Analytics: Healthcare Applications", 2021)

"Data Science is the science and art of using computational methods to identify and discover influential patterns in data." (M Govindarajan, "Introduction to Data Science", 2021)

"Data science is the study of data. It involves developing methods of recording, storing, and analyzing data to effectively extract useful information. The goal of data science is to gain insights and knowledge from any type of data - both structured and unstructured." (Pankaj Pathak, "A Survey on Tools for Data Analytics and Data Science", 2021)

"It is a science of multiple disciplines used for exploring knowledge from data using complex scientific algorithms and methods." (Vandana Kalra et al, "Machine Learning and Its Application in Monitoring Diabetes Mellitus", 2021)

"The concept that utilizes scientific and software methods, IT infrastructure, processes, and software systems in order to gather, process, analyze and deliver useful information, knowledge and insights from various data sources." (Nenad Stefanovic, "Big Data Analytics in Supply Chain Management", 2021)

"This is an evolving field that deals with the gathering, preparation, exploration, visualization, organisation, and storage of large groups of data and the extraction of valuable information from large volumes of data that may exist in an unorganised or unstructured form." (James O Odia & Osaheni T Akpata, "Role of Data Science and Data Analytics in Forensic Accounting and Fraud Detection", 2021)

"A field of study involving the processes and systems used to extract insights from data in all of its forms. The profession is seen as a continuation of the other data analysis fields, such as statistics." (Solutions Review)

"The discipline of using data and advanced statistics to make predictions. Data science is also focused on creating understanding among messy and disparate data. The “what” a scientist is tackling will differ greatly by employer." (KDnuggets)

"Unites statistical systems and processes with computer and information science to mine insights with structured and/or unstructured data analytics." (Accenture)

"Data science is a multidisciplinary approach to finding, extracting, and surfacing patterns in data through a fusion of analytical methods, domain expertise, and technology. This approach generally includes the fields of data mining, forecasting, machine learning, predictive analytics, statistics, and text analytics." (Tibco) [source]

"Data science is an interdisciplinary field that combines social sciences, advanced statistics, and computer engineering skills to acquire, store, organize, and analyze information across a variety of sources." (TDWI)

"Data science is the multidisciplinary field that focuses on finding actionable information in large, raw or structured data sets to identify patterns and uncover other insights. The field primarily seeks to discover answers for areas that are unknown and unexpected." (Sisense) [source]

"Data science is the practical application of advanced analytics, statistics, machine learning, and the associated activities involved in those areas in a business context, like data preparation for example." (RapidMiner) [source]

"Data Science unites statistical systems and processes with computer and information science to mine insights with structured and/or unstructured data analytics." (Accenture)

29 December 2017

🗃️Data Management: Numeracy (Just the Quotes)

"The great body of physical science, a great deal of the essential fact of financial science, and endless social and political problems are only accessible and only thinkable to those who have had a sound training in mathematical analysis, and the time may not be very remote when it will be understood that for complete initiation as an efficient citizen of one of the new great complex world-wide States that are now developing, it is as necessary to be able to compute, to think in averages and maxima and minima, as it is now to be able to read and write." (Herbert G Wells, "Mankind in the Making",  1903) 

"[…] statistical literacy. That is, the ability to read diagrams and maps; a 'consumer' understanding of common statistical terms, as average, percent, dispersion, correlation, and index number."  (Douglas Scates, "Statistics: The Mathematics for Social Problems", 1943) 

"Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write." (Samuel S Wilks, 1951 [paraphrasing Herber Wells] ) 

"Just as by ‘literacy’, in this context, we mean much more than its dictionary sense of the ability to read and write, so by ‘numeracy’ we mean more than mere ability to manipulate the rule of three. When we say that a scientist is ‘illiterate’, we mean that he is not well enough read to be able to communicate effectively with those who have had a literary education. When we say that a historian or a linguist is ‘innumerate’ we mean that he cannot even begin to understand what scientists and mathematicians are talking about." (Sir Geoffrey Crowther, "A Report of the Central Advisory Committee for Education", 1959) 

"It is perhaps possible to distinguish two different aspects of numeracy […]. On the one hand is an understanding of the scientific approach to the study of phenomena - observation, hypothesis, experiment, verification. On the other hand, there is the need in the modern world to think quantitatively, to realise how far our problems are problems of degree even when they appear as problems of kind." (Sir Geoffrey Crowther, "A Report of the Central Advisory Committee for Education", 1959) 

"Numeracy has two facets - reading and writing, or extracting numerical information and presenting it. The skills of data presentation may at first seem ad hoc and judgemental, a matter of style rather than of technology, but certain aspects can be formalized into explicit rules, the equivalent of elementary syntax." (Andrew Ehrenberg, "Rudiments of Numeracy", Journal of Royal Statistical Society, 1977)

"People often feel inept when faced with numerical data. Many of us think that we lack numeracy, the ability to cope with numbers. […] The fault is not in ourselves, but in our data. Most data are badly presented and so the cure lies with the producers of the data. To draw an analogy with literacy, we do not need to learn to read better, but writers need to be taught to write better." (Andrew Ehrenberg, "The problem of numeracy", American Statistician 35(2), 1981)

"We would wish ‘numerate’ to imply the possession of two attributes. The first of these is an ‘at-homeness’ with numbers and an ability to make use of mathematical skills which enable an individual to cope with the practical mathematical demands of his everyday life. The second is ability to have some appreciation and understanding of information which is presented in mathematical terms, for instance in graphs, charts or tables or by reference to percentage increase or decrease." (Cockcroft Committee, "Mathematics Counts: A Report into the Teaching of Mathematics in Schools", 1982) 

"To function in today's society, mathematical literacy - what the British call ‘numeracy' - is as essential as verbal literacy […] Numeracy requires more than just familiarity with numbers. To cope confidently with the demands of today's society, one must be able to grasp the implications of many mathematical concepts - for example, change, logic, and graphs - that permeate daily news and routine decisions - mathematical, scientific, and cultural - provide a common fabric of communication indispensable for modern civilized society. Mathematical literacy is especially crucial because mathematics is the language of science and technology." (National Research Council, "Everybody counts: A report to the nation on the future of mathematics education", 1989)

"Illiteracy and innumeracy are social ills created in part by increased demand for words and numbers. As printing brought words to the masses and made literacy a prerequisite for productive life, so now computing has made numeracy an essential feature of today's society. But it is innumeracy, not numeracy, that dominates the headlines: ignorance of basic quantitative tools is endemic […] and is approaching epidemic levels […]." (Lynn A Steen, "Numeracy", Daedalus Vol. 119 No. 2, 1990) 

"[…] data analysis in the context of basic mathematical concepts and skills. The ability to use and interpret simple graphical and numerical descriptions of data is the foundation of numeracy […] Meaningful data aid in replacing an emphasis on calculation by the exercise of judgement and a stress on interpreting and communicating results." (David S Moore, "Statistics for All: Why, What and How?", 1990) 

 "To be numerate is more than being able to manipulate numbers, or even being able to ‘succeed’ in school or university mathematics. Numeracy is a critical awareness which builds bridges between mathematics and the real world, with all its diversity. […] in this sense […] there is no particular ‘level’ of Mathematics associated with it: it is as important for an engineer to be numerate as it is for a primary school child, a parent, a car driver or gardener. The different contexts will require different Mathematics to be activated and engaged in […] "(Betty Johnston, "Critical Numeracy", 1994)

"We believe that numeracy is about making meaning in mathematics and being critical about maths. This view of numeracy is very different from numeracy just being about numbers, and it is a big step from numeracy or everyday maths that meant doing some functional maths. It is about using mathematics in all its guises - space and shape, measurement, data and statistics, algebra, and of course, number - to make sense of the real world, and using maths critically and being critical of maths itself. It acknowledges that numeracy is a social activity. That is why we can say that numeracy is not less than maths but more. It is why we don’t need to call it critical numeracy being numerate is being critical." (Dave Tout & Beth Marr, "Changing practice: Adult numeracy professional development", 1997)

"To be numerate means to be competent, confident, and comfortable with one’s judgements on whether to use mathematics in a particular situation and if so, what mathematics to use, how to do it, what degree of accuracy is appropriate, and what the answer means in relation to the context." (Diana Coben, "Numeracy, mathematics and adult learning", 2000)

"Numeracy is the ability to process, interpret and communicate numerical, quantitative, spatial, statistical, even mathematical information, in ways that are appropriate for a variety of contexts, and that will enable a typical member of the culture or subculture to participate effectively in activities that they value." (Jeff Evans, "Adults´ Mathematical Thinking and Emotion", 2000)

"Ignorance of relevant risks and miscommunication of those risks are two aspects of innumeracy. A third aspect of innumeracy concerns the problem of drawing incorrect inferences from statistics. This third type of innumeracy occurs when inferences go wrong because they are clouded by certain risk representations. Such clouded thinking becomes possible only once the risks have been communicated." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"In my view, the problem of innumeracy is not essentially 'inside' our minds as some have argued, allegedly because the innate architecture of our minds has not evolved to deal with uncertainties. Instead, I suggest that innumeracy can be traced to external representations of uncertainties that do not match our mind’s design - just as the breakdown of color constancy can be traced to artificial illumination. This argument applies to the two kinds of innumeracy that involve numbers: miscommunication of risks and clouded thinking. The treatment for these ills is to restore the external representation of uncertainties to a form that the human mind is adapted to." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"Overcoming innumeracy is like completing a three-step program to statistical literacy. The first step is to defeat the illusion of certainty. The second step is to learn about the actual risks of relevant events and actions. The third step is to communicate the risks in an understandable way and to draw inferences without falling prey to clouded thinking. The general point is this: Innumeracy does not simply reside in our minds but in the representations of risk that we choose." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"Statistical innumeracy is the inability to think with numbers that represent uncertainties. Ignorance of risk, miscommunication of risk, and clouded thinking are forms of innumeracy. Like illiteracy, innumeracy is curable. Innumeracy is not simply a mental defect 'inside' an unfortunate mind, but is in part produced by inadequate 'outside' representations of numbers. Innumeracy can be cured from the outside." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"Mathematics is often thought to be difficult and dull. Many people avoid it as much as they can and as a result much of the population is mathematically illiterate. This is in part due to the relative lack of importance given to numeracy in our culture, and to the way that the subject has been presented to students." (Julian Havil , "Gamma: Exploring Euler's Constant", 2003)

"One can be highly functionally numerate without being a mathematician or a quantitative analyst. It is not the mathematical manipulation of numbers (or symbols representing numbers) that is central to the notion of numeracy. Rather, it is the ability to draw correct meaning from a logical argument couched in numbers. When such a logical argument relates to events in our uncertain real world, the element of uncertainty makes it, in fact, a statistical argument." (Eric R Sowey, "The Getting of Wisdom: Educating Statisticians to Enhance Their Clients' Numeracy", The American Statistician 57(2), 2003)

"Mathematics and numeracy are not congruent. Nor is numeracy an accidental or automatic by-product of mathematics education at any level. When the goal is numeracy some mathematics will be involved but mathematical skills alone do not constitute numeracy." (Theresa Maguire & John O'Donoghue, "Numeracy concept sophistication - an organizing framework, a useful thinking tool", 2003)

"Mathematical literacy is an individual’s capacity to identify and understand the role that mathematics plays in the world, to make well-founded judgements and to use and engage with mathematics in ways that meet the needs of that individual’s life as a constructive, concerned and reflective citizen." (OECD, "Assessing scientific, reading and mathematical literacy: a framework for PISA 2006", 2006)

"Statistical literacy is more than numeracy. It includes the ability to read and communicate the meaning of data. This quality makes people literate as opposed to just numerate. Wherever words (and pictures) are added to numbers and data in your communication, people need to be able to understand them correctly." (United Nations, "Making Data Meaningful" Part 4: "A guide to improving statistical literacy", 2012)

"When a culture is founded on the principle of immediacy of experience, there is no need for numeracy. It is impossible to consume more than one thing at a time, so differentiating between 'a small amount', 'a larger amount' and 'many' is enough for survival." (The Open University, "Understanding the environment: learning and communication", 2016)
