SQL Troubles

11 February 2018

🔬Data Science: Gaussian Distribution (Definitions)

"Represents a conventional scale for a normally distributed bell-shaped curve that has a central tendency of zero and a standard deviation of one unit, wherein the units are called sigma (σ)." (Lynne Hambleton, "Treasure Chest of Six Sigma Growth Methods, Tools, and Best Practices", 2007)

"Also called the standard normal distribution, is the normal distribution with mean zero and variance one." (Dimitrios G Tsalikakis et al, "Segmentation of Cardiac Magnetic Resonance Images", 2009)

"A normal distribution with the parameters μ = 0 and σ = 1. The random variable for this distribution is denoted by Z. The z-tables (values of the random variable Z and the corresponding probabilities) are widely used for normal distributions." (Peter Oakander et al, "CPM Scheduling for Construction: Best Practices and Guidelines", 2014)

🔬Data Science: K-nearest neighbors (Definitions)

"A modeling technique that assigns values to points based on the values of the k nearby points, such as average value, or most common value." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A simple and popular classifier algorithm that assigns a class (in a preexisting classification) to an object whose class is unknown. [...] From a collection of data objects whose class is known, the algorithm computes the distances from the object of unknown class to k (a number chosen by the user) objects of known class. The most common class (i.e., the class that is assigned most often to the nearest k objects) is assigned to the object of unknown class." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"A method used for classification and regression. Cases are analyzed, and class membership is assigned based on similarity to other cases, where cases that are similar (or 'near' in characteristics) are known as neighbors." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"A prediction method, which uses a function of the k most similar observations from the training set to generate a prediction, such as the mean." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"K-Nearest Neighbors classification is an instance-based supervised learning method that works well with distance-sensitive data." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"An algorithm that estimates an unknown data item as being like the majority of the k-closest neighbors to that item." (David Natingga, "Data Science Algorithms in a Week" 2nd Ed., 2018)

"K-nearest neighbourhood is a algorithm which stores all available cases and classifies new cases based on a similarity measure. It is used in statistical estimation and pattern recognition." (Aman Tyagi, "Healthcare-Internet of Things and Its Components: Technologies, Benefits, Algorithms, Security, and Challenges", 2021)

10 February 2018

🔬Data Science: Data Mining (Definitions)

"The non-trivial extraction of implicit, previously unknown, and potentially useful information from data" (Frawley et al., "Knowledge discovery in databases: An overview", 1991)

"Data mining is the efficient discovery of valuable, nonobvious information from a large collection of data." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Data mining is the process of examining large amounts of aggregated data. The objective of data mining is to either predict what may happen based on trends or patterns in the data or to discover interesting correlations in the data." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"A data-driven approach to analysis and prediction by applying sophisticated techniques and algorithms to discover knowledge." (Paulraj Ponniah, "Data Warehousing Fundamentals", 2001)

"A class of undirected queries, often against the most atomic data, that seek to find unexpected patterns in the data. The most valuable results from data mining are clustering, classifying, estimating, predicting, and finding things that occur together. There are many kinds of tools that play a role in data mining. The principal tools include decision trees, neural networks, memory- and cased-based reasoning tools, visualization tools, genetic algorithms, fuzzy logic, and classical statistics. Generally, data mining is a client of the data warehouse." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"The discovery of information hidden within data." (William A Giovinazzo, "Internet-Enabled Business Intelligence", 2002)

"the process of extracting valid, authentic, and actionable information from large databases." (Seth Paul et al. "Preparing and Mining Data with Microsoft SQL Server 2000 and Analysis", 2002)

"Advanced analysis or data mining is the analysis of detailed data to detect patterns, behaviors, and relationships in data that were previously only partially known or at times totally unknown." (Margaret Y Chu, "Blissful Data", 2004)

"Analysis of detail data to discover relationships, patterns, or associations between values." (Margaret Y Chu, "Blissful Data ", 2004)

"An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques, and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"the process of analyzing large amounts of data in search of previously undiscovered business patterns." (William H Inmon, "Building the Data Warehouse", 2005)

"A type of advanced analysis used to determine certain patterns within data. Data mining is most often associated with predictive analysis based on historical detail, and the generation of models for further analysis and query." (Jill Dyché & Evan Levy, "Customer Data Integration", 2006)

"Refers to the process of identifying nontrivial facts, patterns and relationships from large databases. The databases have often been put together for a different purpose from the data mining exercise." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"Data mining is the process of discovering implicit patterns in data stored in data warehouse and using those patterns for business advantage such as predicting future trends." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"Digging through data (usually in a data warehouse or data mart) to identify interesting patterns." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"Intelligently analyzing data to extract hidden trends, patterns, and information. Commonly used by statisticians, data analysts and Management Information Systems communities." (Craig F Smith & H Peter Alesso, "Thinking on the Web: Berners-Lee, Gödel and Turing", 2008)

"The process of extracting valid, authentic, and actionable information from large databases." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"The process of retrieving relevant data to make intelligent decisions." (Robert D Schneider & Darril Gibson, "Microsoft SQL Server 2008 All-in-One Desk Reference For Dummies", 2008)

"A process that minimally has four stages: (1) data preparation that may involve 'data cleaning' and even 'data transformation', (2) initial exploration of the data, (3) model building or pattern identification, and (4) deployment, which means subjecting new data to the 'model' to predict outcomes of cases found in the new data." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"Automatically searching large volumes of data for patterns or associations." (Mark Olive, "SHARE: A European Healthgrid Roadmap", 2009)

"The use of machine learning algorithms to find faint patterns of relationship between data elements in large, noisy, and messy data sets, which can lead to actions to increase benefit in some form (diagnosis, profit, detection, etc.)." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"A data-driven approach to analysis and prediction by applying sophisticated techniques and algorithms to discover knowledge." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A way of extracting knowledge from a database by searching for correlations in the data and presenting promising hypotheses to the user for analysis and consideration." (Toby J Teorey, "Database Modeling and Design" 4th Ed., 2010)

"The process of using mathematical algorithms (usually implemented in computer software) to attempt to transform raw data into information that is not otherwise visible (for example, creating a query to forecast sales for the future based on sales from the past)." (Ken Withee, "Microsoft Business Intelligence For Dummies", 2010)

"A process that employs automated tools to analyze data in a data warehouse and other sources and to proactively identify possible relationships and anomalies." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"Process of analyzing data from different perspectives and summarizing it into useful information (e.g., information that can be used to increase revenue, cuts costs, or both)." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

"The process of sifting through large amounts of data using pattern recognition, fuzzy logic, and other knowledge discovery statistical techniques to identify previously unknown, unsuspected, and potentially meaningful data content relationships and trends." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Data mining, a branch of computer science, is the process of extracting patterns from large data sets by combining statistical analysis and artificial intelligence with database management. Data mining is seen as an increasingly important tool by modern business to transform data into business intelligence giving an informational advantage." (T T Wong & Loretta K W Sze, "A Neuro-Fuzzy Partner Selection System for Business Social Networks", 2012)

"Field of analytics with structured data. The model inference process minimally has four stages: data preparation, involving data cleaning, transformation and selection; initial exploration of the data; model building or pattern identification; and deployment, putting new data through the model to obtain their predicted outcomes." (Gary Miner et al, "Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications", 2012)

"The process of identifying commercially useful patterns or relationships in databases or other computer repositories through the use of advanced statistical tools." (Microsoft, "SQL Server 2012 Glossary", 2012)

"The process of exploring and analyzing large amounts of data to find patterns." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"An umbrella term for analytic techniques that facilitate fast pattern discovery and model building, particularly with large datasets." (Meta S Brown, "Data Mining For Dummies", 2014)

"Analysis of large quantities of data to find patterns such as groups of records, unusual records, and dependencies" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"The practice of analyzing big data using mathematical models to develop insights, usually including machine learning algorithms as opposed to statistical methods."(Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"Data mining is the analysis of data for relationships that have not previously been discovered." (Piyush K Shukla & Madhuvan Dixit, "Big Data: An Emerging Field of Data Engineering", Handbook of Research on Security Considerations in Cloud Computing, 2015)

"A methodology used by organizations to better understand their customers, products, markets, or any other phase of the business." (Adam Gordon, "Official (ISC)2 Guide to the CISSP CBK" 4th Ed., 2015)

"Extracting information from a database to zero in on certain facts or summarize a large amount of data." (Faithe Wempen, "Computing Fundamentals: Introduction to Computers", 2015)

"It refers to the process of identifying and extracting patterns in large data sets based on artificial intelligence, machine learning, and statistical techniques." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"The process of exploring and analyzing large amounts of data to find patterns." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"Term used to describe analyzing large amounts of data to find patterns, correlations, and similarities." (Brittany Bullard, "Style and Statistics", 2016)

"The process of extracting meaningful knowledge from large volumes of data contained in data warehouses." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A class of analytical applications that help users search for hidden patterns in a data set. Data mining is a process of analyzing large amounts of data to identify data–content relationships. Data mining is one tool used in decision support special studies. This process is also known as data surfing or knowledge discovery." (Daniel J Power & Ciara Heavin, "Decision Support, Analytics, and Business Intelligence" 3rd Ed., 2017)

"The process of collecting, searching through, and analyzing a large amount of data in a database to discover patterns or relationships." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"Data mining involves finding meaningful patterns and deriving insights from large data sets. It is closely related to analytics. Data mining uses statistics, machine learning, and artificial intelligence techniques to derive meaningful patterns." (Amar Sahay, "Business Analytics" Vol. I, 2018)

"The analysis of the data held in data warehouses in order to produce new and useful information." (Shon Harris & Fernando Maymi, "CISSP All-in-One Exam Guide" 8th Ed., 2018)

"The process of collecting critical business information from a data source, correlating the information, and uncovering associations, patterns, and trends." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"The process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems." (Dmitry Korzun et al, "Semantic Methods for Data Mining in Smart Spaces", 2019)

"A technique using software tools geared for the user who typically does not know exactly what he's searching for, but is looking for particular patterns or trends. Data mining is the process of sifting through large amounts of data to produce data content relationships. It can predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. This is also known as data surfing." (Information Management)

"An analytical process that attempts to find correlations or patterns in large data sets for the purpose of data or knowledge discovery." (NIST SP 800-53)

"Extracting previously unknown information from databases and using that data for important business decisions, in many cases helping to create new insights." (Solutions Review)

"is the process of collecting data, aggregating it according to type and sorting through it to identify patterns and predict future trends." (Accenture)

"the process of analyzing large batches of data to find patterns and instances of statistical significance. By utilizing software to look for patterns in large batches of data, businesses can learn more about their customers and develop more effective strategies for acquisition, as well as increase sales and decrease overall costs." (Insight Software)

"The process of identifying commercially useful patterns or relationships in databases or other computer repositories through the use of advanced statistical tools." (Microsoft)

"The process of pulling actionable insight out of a set of data and putting it to good use. This includes everything from cleaning and organizing the data; to analyzing it to find meaningful patterns and connections; to communicating those connections in a way that helps decision-makers improve their product or organization." (KDnuggets)

"Data mining is the process of analyzing hidden patterns of data according to different perspectives for categorization into useful information, which is collected and assembled in common areas, such as data warehouses, for efficient analysis, data mining algorithms, facilitating business decision making and other information requirements to ultimately cut costs and increase revenue. Data mining is also known as data discovery and knowledge discovery." (Techopedia)

"Data mining is an automated analytical method that lets companies extract usable information from massive sets of raw data. Data mining combines several branches of computer science and analytics, relying on intelligent methods to uncover patterns and insights in large sets of information." (Sisense) [source]

"Data mining is the process of analyzing data from different sources and summarizing it into relevant information that can be used to help increase revenue and decrease costs. Its primary purpose is to find correlations or patterns among dozens of fields in large databases." (Logi Analytics) [source]

"Data mining is the process of analyzing massive volumes of data to discover business intelligence that helps companies solve problems, mitigate risks, and seize new opportunities." (Talend) [source]

"Data Mining is the process of collecting data, aggregating it according to type and sorting through it to identify patterns and predict future trends." (Accenture)

"Data mining is the process of discovering meaningful correlations, patterns and trends by sifting through large amounts of data stored in repositories. Data mining employs pattern recognition technologies, as well as statistical and mathematical techniques." (Gartner)

"Data mining is the process of extracting relevant patterns, deviations and relationships within large data sets to predict outcomes and glean insights. Through it, companies convert big data into actionable information, relying upon statistical analysis, machine learning and computer science." (snowflake) [source]

"Data mining is the work of analyzing business information in order to discover patterns and create predictive models that can validate new business insights. […] Unlike data analytics, in which discovery goals are often not known or well defined at the outset, data mining efforts are usually driven by a specific absence of information that can’t be satisfied through standard data queries or reports. Data mining yields information from which predictive models can be derived and then tested, leading to a greater understanding of the marketplace." (Informatica) [source]

09 February 2018

🔬Data Science: Normalization (Definitions)

"Mathematical transformations to generate a new set of values that map onto a different range." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

[Min–max normalization:] "Normalizing a variable value to a predetermined range." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)
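
A minimal sketch of min–max normalization, assuming the common linear rescaling of the observed minimum and maximum onto a target range. The function name `min_max_normalize` and its parameters are illustrative.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values so the smallest becomes new_min and the largest new_max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("values must not all be equal")
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


scaled = min_max_normalize([10, 20, 30, 50])  # → [0.0, 0.25, 0.5, 1.0]
```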

[function point normalization:] "Dividing a metric by the project’s function points to allow you to compare projects of different sizes and complexities." (Rod Stephens, "Beginning Software Engineering", 2015)

"For metrics, performing some calculation on a metric to account for possible differences in project size or complexity. Two general approaches are size normalization and function point normalization." (Rod Stephens, "Beginning Software Engineering", 2015)

[size normalization:] "For metrics, dividing a metric by an indicator of size such as lines of code or days of work. For example, bugs/KLOC tells you how buggy the code is normalized for the size of the project." (Rod Stephens, "Beginning Software Engineering", 2015)
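
The bugs/KLOC example can be worked through as follows. The project names and figures are invented for illustration; function point normalization would follow the same pattern with function points as the divisor.

```python
def bugs_per_kloc(bug_count: int, lines_of_code: int) -> float:
    """Size-normalize a bug count by thousands of lines of code (KLOC)."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return bug_count / (lines_of_code / 1000)


# Hypothetical projects: (bugs found, lines of code).
projects = {"small": (30, 12_000), "large": (120, 80_000)}
density = {name: bugs_per_kloc(bugs, loc) for name, (bugs, loc) in projects.items()}
# The large project has more bugs in total but fewer per KLOC.
```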

07 February 2018

🔬Data Science: Hadoop (Definitions)

"An Apache-managed software framework derived from MapReduce and Bigtable. Hadoop allows applications based on MapReduce to run on large clusters of commodity hardware. Hadoop is designed to parallelize data processing across computing nodes to speed computations and hide latency. Two major components of Hadoop exist: a massively scalable distributed file system that can support petabytes of data and a massively scalable MapReduce engine that computes results in batch." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"An open-source software platform developed by Apache Software Foundation for data-intensive applications where the data are often widely distributed across different hardware systems and geographical locations." (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)

"Technology designed to house Big Data; a framework for managing data" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"an Apache-managed software framework derived from MapReduce. Big Table Hadoop enables applications based on MapReduce to run on large clusters of commodity hardware. Hadoop is designed to parallelize data processing across computing nodes to speed up computations and hide latency. The two major components of Hadoop are a massively scalable distributed file system that can support petabytes of data and a massively scalable MapReduce engine that computes results in batch." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"An open-source framework that is built to process and store huge amounts of data across a distributed file system." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"Open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"A batch processing infrastructure that stores fi les and distributes work across a group of servers. The infrastructure is composed of HDFS and MapReduce components. Hadoop is an open source software platform designed to store and process quantities of data that are too large for just one particular device or server. Hadoop’s strength lies in its ability to scale across thousands of commodity servers that don’t share memory or disk space." (Benoy Antony et al, "Professional Hadoop®", 2016)

"Apache Hadoop is an open-source framework for processing large volume of data in a clustered environment. It uses simple MapReduce programming model for reliable, scalable and distributed computing. The storage and computation both are distributed in this framework." (Kaushik Pal, 2016)

"A framework that allow for the distributed processing for large datasets." (Neha Garg & Kamlesh Sharma, "Machine Learning in Text Analysis", 2020)

"Hadoop is an open source implementation of the MapReduce paper. Initially, Hadoop required that the map, reduce, and any custom format readers be implemented and deployed to the cluster. Eventually, higher level abstractions were developed, like Apache Hive and Apache Pig." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"A batch processing infrastructure that stores files and distributes work across a group of servers." (Oracle)

"an open-source framework that is built to enable the process and storage of big data across a distributed file system." (Analytics Insight)

"Apache Hadoop is an open-source, Java-based software platform that manages data processing and storage for big data applications. Hadoop works by distributing large data sets and analytics jobs across nodes in a computing cluster, breaking them down into smaller workloads that can be run in parallel. Hadoop can process both structured and unstructured data, and scale up reliably from a single server to thousands of machines." (Databricks) [source]

"Hadoop is an open source software framework for storing and processing large volumes of distributed data. It provides a set of instructions that organizes and processes data on many servers rather than from a centralized management nexus." (Informatica) [source]

🔬Data Science: Semantics (Definitions)

"The meaning of a model that is well-formed according to the syntax of a language." (Anneke Kleppe et al, "MDA Explained: The Model Driven Architecture: Practice and Promise", 2003)

"The part of language concerned with meaning. For example, the phrases 'my mother’s brother' and 'my uncle' are two ways of saying the same thing and, therefore, have the same semantic value." (Craig F Smith & H Peter Alesso, "Thinking on the Web: Berners-Lee, Gödel and Turing", 2008)

"The study of meaning (often the meaning of words). In business systems we are concerned with making the meaning of data explicit (structuring unstructured data), as well as making it explicit enough that an agent could reason about it." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"The branch of philosophy concerned with describing meaning." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Having to do with meaning, usually of words and/or symbols (the syntax). Part of semiotic theory." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The study of the meaning behind the syntax (signs and symbols) of a language or graphical expression of something. The semantics can only be understood through the syntax. The syntax is like the encoded representation of the semantics." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The study of meaning. In the context of Big Data, semantics is the technique of creating meaningful assertions about data objects. A meaningful assertion, as used here, is a triple consisting of an identified data object, a data value, and a descriptor for the data value. In practical terms, semantics involves making assertions about data objects (i.e., making triples), combining assertions about data objects (i.e., merging triples), and assigning data objects to classes; hence relating triples to other triples. As a word of warning, few informaticians would define semantics in these terms, but I would suggest that most definitions for semantics would be functionally equivalent to the definition offered here." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"Set of mappings forming a representation in order to define the meaningful information of the data." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"Semantics is a branch of linguistics focused on the meaning communicated by language." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

06 February 2018

🔬Data Science: Data Profiling (Definitions)

"A process focused on generating data metrics and measuring data quality. The data metrics can be collected at the column level, e.g., value frequency, nullity measurements, and uniqueness/match quality measurements; at the table level, e.g., primary key violations; or cross-table relationships, e.g., foreign key violations." (Alex Berson & Lawrence Dubov, "Master Data Management and Customer Data Integration for a Global Enterprise", 2007)

"A set of techniques for searching through data looking for potential errors and anomalies, such as similar data with different spellings, data outside boundaries and missing values." (Keith Gordon, "Principles of Data Management", 2007)

"Data profiling (and analysis services) provides functionality to understand the quality, structure, and relationships of data across enterprise systems, from which data cleansing and standardization rules can be determined for improving the overall data quality and consistency." (Martin Oberhofer et al,"Enterprise Master Data Management", 2008)

"A process for looking at the data within the source systems and understanding the data elements and the anomalies." (Tony Fisher, "The Data Asset", 2009)

"An approach to data quality analysis, using statistics to show patterns of usage, and patterns of contents, and automated as much as possible. Some profiling activities must be done manually, but most can be automated." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Data profiling is used to assess the existing state of data quality. It is also used to understand the duplicates in the master data or the gaps in linkages. It can be used to understand the scope of data enrichment to enhance the value of customer data assets." (Saumya Chaki, "Enterprise Information Management in Practice", 2015)

"An automated method of analyzing large amounts of data to determine its quality and integrity." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"Data profiling assesses a set of data and provides information on the values, the length of strings, the level of completeness, and the distribution patterns of each column." (Robert Hawker, "Practical Data Quality", 2023)

"The process of examining the data available in different data sources and collecting statistics and information about this data. Data profiling helps to assess the quality level of the data according to a defined goal." (Talend)

"Data profiling, a critical first step in data migration, automates the identification of problematic data and metadata and enables companies to correct inconsistencies, redundancies and inaccuracies in corporate databases." (Information Management)

"Data profiling is the act of examining, cleansing and analyzing an existing data source to generate actionable summaries. Proper techniques of data profiling verify the accuracy and validity of data, leading to better data-driven decision making that customers can use to their advantage." (snowflake) [source]

🔬Data Science: Pig (Definitions)

"A programming interface for programmers to create MapReduce jobs within Hadoop." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"A programming language designed to handle any type of data. Pig helps users to focus more on analyzing large datasets and less time writing map programs and reduce programs. Like Hive and Impala, Pig is a high-level platform used for creating MapReduce programs more easily. The programming language Pig uses is called Pig Latin, and it allows you to extract, transform, and load (ETL) data at a very high level. This greatly reduces the effort if this was written in JAVA code; PIG is only a fraction of that." (Benoy Antony et al, "Professional Hadoop®", 2016)

"An open-source platform for analyzing large data sets that consists of the following: (1) Pig Latin scripting language; (2) Pig interpreter that converts Pig Latin scripts into MapReduce jobs. Pig runs as a client application." (Oracle)

05 February 2018

🔬Data Science: Machine Learning [ML] (Definitions)

"Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed." (Arthur Samuel, 1959) [attributed]

"Computer methods for accumulating, changing, and updating knowledge in an AI computer system." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"A term often used to denote the application of generic model-fitting or classification algorithms for predictive data mining. This differs from traditional statistical data analysis, which is usually concerned with the estimation of population parameters by statistical inference and p-values. The emphasis in data mining machine learning algorithms is usually on the accuracy of the prediction as opposed to discovering the relationship and influences of different variables." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"A discipline grounded in computer science, statistics, and psychology that includes algorithms that learn or improve their performance based on exposure to patterns in data, rather than by explicit programming." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"Machine learning is the intersection between theoretically sound computer science and practically noisy data. Essentially, it’s about machines making sense out of data in much the same way that humans do." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"Computer programs that have the ability to learn over time as new data becomes available. This type of analytical programming can learn more about a customer’s online shopping behavior over time and start to predict which items the customer will likely click on and purchase." (Brittany Bullard, "Style and Statistics", 2016)

"Machine learning is home to numerous techniques for creating classifiers by training them with already correctly categorized examples. This training is called supervised learning; it is supervised because it starts with instances labeled by category, and it involves learning because over time the classifier improves its performance by adjusting the weights for features that distinguish the categories. But strictly speaking, supervised learning techniques do not learn the categories; they implement and apply categories that they inherit or are given to them." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed., 2016)

"A subdiscipline of computer science that addresses similar challenges to traditional statistical modeling, but with different techniques and a stronger focus on predictive accuracy." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"Machine learning describes a broad set of methods for extracting meaningful patterns from existing data and applying those patterns to make decisions or predictions on future data." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"Machine learning is a method of designing systems that can learn, adjust, and improve based on the data fed to them. Machine learning works based on predictive and statistical algorithms that are provided to these machines. The algorithms are designed to learn and improve as more data flows through the system." (Amar Sahay, "Business Analytics" Vol. I, 2018)

"The field of computer science research that focuses on developing and evaluating algorithms that can extract useful patterns from data sets. A machine learning algorithm takes a data set as input and returns a model that encodes the patterns the algorithm extracted from the data." (John D Kelleher & Brendan Tierney, "Data science", 2018)

[In-Database Machine Learning:] "Using machine-learning algorithms that are built into the database solution. The benefit of in-database machine learning is that it reduces the time spent on moving data in and out of databases for analysis." (John D Kelleher & Brendan Tierney, "Data science", 2018)

"The science of developing techniques to give the computer inference and deduction capabilities to achieve diverse processing tasks autonomously." (Jorge Manjarrez-Sanchez, "In-Memory Analytics", 2018)

"A facet of AI that focuses on algorithms, allowing machines to learn without being programmed and change when exposed to new data." (Kirti R Bhatele et al, "The Role of Artificial Intelligence in Cyber Security", 2019)

"A field of artificial intelligence that uses statistical techniques to give computer systems the ability to learn." (Nil Goksel & Aras Bozkurt, "Artificial Intelligence in Education: Current Insights and Future Perspectives", 2019)

"A method of designing a sequence of actions to solve a problem that optimizes automatically through experience and with limited or no human intervention." (Soraya Sedkaoui, "Big Data Analytics for Entrepreneurial Success", 2019)

"The methods used to understand the patterns in the data and to obtain results from these patterns using various algorithms." (Tolga Ensari et al, "Overview of Machine Learning Approaches for Wireless Communication", 2019)

"A branch of artificial intelligence that focuses on data analysis methods that allow for automation of the process of analytical model building." (Timofei Bogomolov et al, "Identifying Patterns in Fresh Produce Purchases: The Application of Machine Learning Techniques", 2020)

"A discipline focused on the development and evaluation of algorithms that permit computers to use patterns, trends, and associations in data to perform tasks without being programmed by a human." (Osman Kandara & Eugene Kennedy, "Educational Data Mining: A Guide for Educational Researchers", 2020)

"A field of study of algorithms and statistical methods that allows software application to predict the accurate result." (S Kayalvizhi & D Thenmozhi, "Deep Learning Approach for Extracting Catch Phrases from Legal Documents", 2020)

"Is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed." (Rajandeep Kaur & Rajneesh Rani, "Comparative Study on ASD Identification Using Machine and Deep Learning", 2020)

"Is one of many subfields of artificial intelligence concerning the ways that computers learn from experience to improve their ability to think, plan, decide and act." (Lejla Banjanović-Mehmedović & Fahrudin Mehmedović, "Intelligent Manufacturing Systems Driven by Artificial Intelligence in Industry 4.0", 2020)

"It is an application of the artificial intelligence in which machines can automatically learn and solve problems using the learned experience." (Shouvik Chakraborty & Kalyani Mali, "An Overview of Biomedical Image Analysis From the Deep Learning Perspective", 2020)

"It refers to an application of artificial intelligence focusing on algorithms which can be used for building models (e.g., based on statistics) from input data. Such automatic analytical models need to provide outputs based on the learning relations between input and output values. The algorithms are often categorized as supervised, semi-supervised or unsupervised." (Ana Gavrovska & Andreja Samčović, "Intelligent Automation Using Machine and Deep Learning in Cybersecurity of Industrial IoT", 2020)

"Machine learning, in the simplest terms, is the analysis of statistics to help computers make decisions base on repeatable characteristics found in the data." (Vardhan K Agrawal, "Mastering Machine Learning with Core ML and Python", 2020)

"Machine learning is a field of computer science and mathematics that focuses on algorithms for building and using models “learned” from data." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed." (Mohammad Haroon et al, Application of Machine Learning In Forensic Science, 2020)

"Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it learn for themselves." (R Murugan, "Implementation of Deep Learning Neural Network for Retinal Images", 2020)

"Machine learning is branch of data science which has concern with the design and development of algorithm to develop a system that can learn from data, identify the complex patterns and provide intelligent, reliable, repeatable decisions and results with minimal human interaction based on the provided input." (Neha Garg & Kamlesh Sharma, "Machine Learning in Text Analysis", 2020)

"A computer program having the capability to learn and adapt to new data without human assistance." (Sue Milton, "Data Privacy vs. Data Security", 2021)

"A rising area in computer science, where the computer systems are programmed to learn information from rich data sets to produce reliable results to a given problem." (Jinnie Shin et al, "Automated Essay Scoring Using Deep Learning Algorithms", 2021)

"Ability of a machine to learn from the data it is presented using different techniques that are supervised or non-supervised." (Sujata Ramnarayan, "Marketing and Artificial Intelligence: Personalization at Scale", 2021)

"Is a type of artificial intelligence where computer teaches itself the solution to a query discovering patterns in sets of data and matching fresh parts of data the based on probability." (James O Odia & Osaheni T Akpata, "Role of Data Science and Data Analytics in Forensic Accounting and Fraud Detection", 2021)

"It is again a sub set of AI in which we classify the data with the help of input data set, ANN, SVM, Random Forest are some of the algorithm used in this case." (Ajay Sharma, "Smart Agriculture Services Using Deep Learning, Big Data, and IoT", 2021)

"It refers to developing the ability in computers to use available data to train themselves automatically, and to learn from its own experiences without being explicitly programmed." (Shatakshi Singhet al, "A Survey on Intelligence Tools for Data Analytics", 2021)

"Machine learning is a scientific approach to analyse available data using algorithms and statistical models to accomplish a specific task by utilizing the patterns evolved." (Vandana Kalra et al, "Machine Learning and Its Application in Monitoring Diabetes Mellitus", 2021)

"Machine Learning is a statistical or mathematical model that performs data analysis, prediction, and clustering. This science is a subfield of Artificial Intelligence." (Sayani Ghosal & Amita Jain, "Research Journey of Hate Content Detection From Cyberspace", 2021)

"Machine learning is an application of artificial intelligence that provides systems the ability to automatically learn and improve from experience without being explicitly programmed." (Sercan Demirci et al, "Detection of Diabetic Retinopathy With Mobile Application Using Deep Learning", 2021)

"Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed and in the process developing computer programs that can access data and use it to learn for themselves." (Hari K Kondaveeti et al, "Deep Learning Applications in Agriculture: The Role of Deep Learning in Smart Agriculture", 2021)

"Set of knowledge discovery techniques for intelligent data analysis in order to find hidden patterns and associations, devise rules and make predictions." (Nenad Stefanovic, "Big Data Analytics in Supply Chain Management", 2021)

"The study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as 'training data', to make predictions or decisions without being explicitly programmed to do so." (Jan Bosch et al, "Engineering AI Systems: A Research Agenda", Artificial Intelligence Paradigms for Smart Cyber-Physical Systems, 2021)

"This can be regarded as a subset of AI which refers to analyzing structured data and identifying trends (correlations) for specific outcomes and using that information to predict future values (causation)." (Vijayaraghavan Varadharajan & Akanksha Rajendra Singh, "Building Intelligent Cities: Concepts, Principles, and Technologies", 2021)

"A discipline that studies methods and algorithms of automated learning from data through which computer systems can adjust their operations according to feedback they receive. A term strongly related to artificial intelligence, data mining, statistical methods." (KDnuggets)

"A process where a computer uses an algorithm to gain understanding about a set of data, then makes predictions based on its understanding." (KDnuggets)

"A type of artificial intelligence that provides computers with the ability to learn without being specifically programmed to do so, focusing on the development of computer applications that can teach themselves to change when exposed to new data." (Solutions Review)

"is a type of artificial intelligence that enable systems to learn patterns from data and subsequently improve from experience. It is an interdisciplinary field that includes information theory, control theory, statistics, and computer science. As it gathers and sorts more information, machine learning constantly gets better at identifying types and forms of data with little or no hard coded rules." (Accenture)

"Machine learning is a branch of artificial intelligence that deals with self-improving algorithms. The algorithms 'learn' by recording the results of vast quantities of data processing actions. Over time, the algorithm improves its functionality without being explicitly programmed." (Xplenty) [source]

"Machine learning is a subset of artificial intelligence (AI) that deals with the extracting of patterns from data, and then uses those patterns to enable algorithms to improve themselves with experience. This type of learning can be used to help computers recognize patterns and associations in massive amounts of data, and make predictions and forecasts based on its findings." (RapidMiner) [source]

"Machine learning (ML) is a type of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. Machine learning algorithms use historical data as input to predict new output values." (Techtarget) [source]

"Machine Learning is a type of artificial intelligence that enable systems to learn patterns from data and subsequently improve from experience. It is an interdisciplinary field that includes information theory, control theory, statistics, and computer science. As it gathers and sorts more information, machine learning constantly gets better at identifying types and forms of data with little or no hard coded rules." (Accenture)

"Machine learning is a cutting-edge programming technique used to automate the construction of analytical models and enable applications to perform specified tasks more efficiently without being explicitly programmed. Machine learning allows system to automatically learn and increase their accuracy in task performance through experience." (Sumo Logic) [source]

"[Machine Learning is] a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. It focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. The process of machine learning is similar to that of data mining. Both systems search through data to look for patterns. However, instead of extracting data for human comprehension - as is the case in data mining applications - machine learning uses that data to improve the program's own understanding. Machine learning programs detect patterns in data and adjust program actions accordingly." (Teradata) [source]

"Machine learning is the field of study that enables computers the ability to learn without being explicitly programmed." (Adobe)

"Machine learning is the subset of artificial intelligence (AI) that focuses on building systems that learn - or improve performance - based on the data they consume." (Oracle)

"Part of artificial intelligence where machines learn from what they are doing and become better over time." (Analytics Insight)

04 February 2018

🔬Data Science: Artificial Intelligence [AI] (Definitions)

"A computer would deserve to be called intelligent if it could deceive a human into believing that it was human." (Alan Turing, "Computing Machinery and Intelligence", 1950)

"Artificial intelligence is the science of making machines do things that would require intelligence if done by men." (Marvin Minsky, 1968)

"Artificial intelligence comprises methods, tools, and systems for solving problems that normally require the intelligence of humans. The term intelligence is always defined as the ability to learn effectively, to react adaptively, to make proper decisions, to communicate in language or images in a sophisticated way, and to understand." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"AI views the mind as a type of logical symbol processor that works with strings of text or symbols much as a computer works with strings of Os and Is. In practice, AI means expert systems or decision support systems." (Guido Deboeck & Teuvo Kohonen (Eds), "Visual Explorations in Finance with Self-Organizing Maps" 2nd Ed., 2000)

"Software that performs a function previously ascribed only to human beings, such as natural language processing." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The branch of computer science that is concerned with making computers behave and 'think' like humans." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

"A field of computer science focused on the development of intelligent-acting agents. Often guided by the theory of how humans solve problems. Has a reputation for overpromising. Wryly definable as all computational problems not yet solved." (Gary Miner et al, "Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications", 2012)

"Artificial intelligence is the mimicking of human thought and cognitive processes to solve complex problems automatically. AI uses techniques for writing computer code to represent and manipulate knowledge." (Radian Belu, "Artificial Intelligence Techniques for Solar Energy and Photovoltaic Applications", 2013)

"It is the investigation exploring whether intelligence can be replicated in machines, to perform tasks that humans can successfully carry out." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"The study of computer systems that model and apply the intelligence of the human mind" (Nell Dale & John Lewis, "Computer Science Illuminated" 6th Ed., 2015)

"Machines that are designed to evaluate and respond to situations in an appropriate manner. Most artificial intelligence machines are computer based and many of them have achieved remarkable levels of performance in specific areas." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A discipline with the goal to develop technology that solves complex problems with skill and creativity that rivals that of the human brain." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)

"A machine’s ability to make decisions and perform tasks that simulate human intelligence and behavior." (Kirti R Bhatele et al, "The Role of Artificial Intelligence in Cyber Security", 2019)

"An attempt to recreate a living intellect, especially human intelligence, in a computer environment." (Tolga Ensari et al, "Overview of Machine Learning Approaches for Wireless Communication", 2019)

"The theory governing the development of computer systems that are able to perform tasks which normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages." (Nil Goksel & Aras Bozkurt, "Artificial Intelligence in Education: Current Insights and Future Perspectives", 2019)

"Algorithms which make machines learn from experience, adjust to new inputs and perform human-like tasks." (Lejla Banjanović-Mehmedović & Fahrudin Mehmedović, "Intelligent Manufacturing Systems Driven by Artificial Intelligence in Industry 4.0", 2020)

"It is the method of mimicking the human intelligence by the machines." (Shouvik Chakraborty & Kalyani Mali, "An Overview of Biomedical Image Analysis From the Deep Learning Perspective", 2020)

"AI is a simulation of human intelligence through the progress of intelligent machines that think and work like humans carrying out such human activities as speech recognition, problem-solving, learning, and planning." (Hari K Kondaveeti et al, "Deep Learning Applications in Agriculture: The Role of Deep Learning in Smart Agriculture", 2021)

"Artificial intelligence (AI) refers to the ability of machines to have cognitive capabilities similar to humans using advanced algorithms and quality data." (Vijayaraghavan Varadharajan & Akanksha Rajendra Singh, "Building Intelligent Cities: Concepts, Principles, and Technologies", 2021)

"Domain of science that deals with the development of computer systems to perform actions like speech-recognition, decision-making, understanding human’s natural language, etc., like humans." (Shatakshi Singhet al, "A Survey on Intelligence Tools for Data Analytics", 2021)

"It is a set of software and hardware systems with many capabilities such as behaving human-like or numerical logic, motion, speech, and sound perception. In other words, AI makes machines think and percept like humans." (Mehmet A Cifci, "Optimizing WSNs for CPS Using Machine Learning Techniques", 2021)

"Machines that work and react like humans using computer programs known as algorithms Algorithms must remain current for AI to work properly, so they rely on machine learning to update them with changes in the worldwide economy and society." (Sue Milton, "Data Privacy vs. Data Security", Global Business Leadership Development for the Fourth Industrial Revolution, 2021)

"Science of simulating intelligence in machines and program them to mimic human actions." (Revathi Rajendran et al, "Convergence of AI, ML, and DL for Enabling Smart Intelligence: Artificial Intelligence, Machine Learning, Deep Learning, Internet of Things", 2021)

"The theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages." (Jan Bosch et al, "Engineering AI Systems: A Research Agenda", Artificial Intelligence Paradigms for Smart Cyber-Physical Systems, 2021)

"AI is any set of concepts, applications or technologies that allow a computer to perform tasks that mimic human behavior." (RapidMiner) [source]

"Artificial intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems. Specific applications of AI include expert systems, natural language processing (NLP), speech recognition and machine vision." (Techtarget) [source]

"A discipline involving research and development of machines that are aware of their surroundings. Most work in A.I. centers on using machine awareness to solve problems or accomplish some task." (KDnuggets)

"An area of computer science which refers to the creation of intelligent machines that can react to scenarios and make decisions as a human would." (Board International)

"A set of sciences, theories and techniques whose purpose is to reproduce by a machine the cognitive abilities of a human being." (Council of Europe)

"The theory and capabilities that strive to mimic human intelligence through experience and learning." (Forrester)

"Artificial Intelligence (AI) is the broad term used to describe the set of technologies that enable machines to sense, comprehend, act and learn." (Accenture)

"Artificial intelligence (AI) applies advanced analysis and logic-based techniques, including machine learning, to interpret events, support and automate decisions, and take actions." (Gartner)

🔬Data Science: Metamodel (Definitions)

"Model of a model that dictates the rules for creation of modeling mechanisms like the UML" (Bhuvan Unhelkar, "Process Quality Assurance for UML-Based Projects", 2002)

"A description or definition of a well-defined language in the form of a model." (Anneke Kleppe et al, "MDA Explained: The Model Driven Architecture™: Practice and Promise", 2003)

"A model that defines other models. The UML metamodel defines the element types of the UML, such as Classifier." (Craig Larman, "Applying UML and Patterns", 2004)

"A description of a model. A meta model refers to the rules that define the structure a model can have. In other words, a meta model defines the formal structure and elements of a model." (Nicolai M Josuttis, "SOA in Practice", 2007)

"The model of a language used to develop systems. In the case of UML, the definition of UML itself is the metamodel." (Bruce P Douglass, "Real-Time Agility: The Harmony/ESW Method for Real-Time and Embedded Systems Development", 2009)

"A description of a model. A meta-model refers to the rules that define the structure a model can have. In other words, a meta-model defines the formal structure and elements of a model." (David Lyle & John G Schmidt, "Lean Integration", 2010)

"1.Generally, a model that specifies one or more other models. 2.In Meta-data Management, a model of a meta-data system or a data model for a meta-data repository." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Model that describes how and with what the architecture will be described in a structural way (model of the model)." (Gilbert Raymond & Philippe Desfray, "Modeling Enterprise Architecture with TOGAF", 2014)

"When common sets of design decisions can be identified that are not specific to any one domain, they often become systematized in textbooks and in design practices, and may eventually be designed into standard formats and architectures for creating organizing systems. These formally recognized sets of design decisions are known as abstract models or metamodels. Metamodels describe structures commonly found in resource descriptions and other information resources, regardless of the specific domain." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed., 2016)

02 February 2018

🔬Data Science: Sensitivity Analysis (Definitions)

"The practice of changing a variable in a financial model or forecast to determine how a change in that variable affects the overall outcome. For example, to consider the way in which a change in price might affect the gross profit in a product forecast, one might vary the price in small increments and recompute the figures to see how gross profit changes." (Steven Haines, "The Product Manager's Desk Reference", 2008)

"Sensitivity analysis is a methodology for assessing whether an empirical effect is a valid causal effect. The basic idea is to simulate the change in the empirical effect that would result under plausible assumptions about the possible impact of the most likely sources of bias." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

"Use of quantitative and qualitative information to study changes in results that would occur with changes in various assumptions. Also see best-case and worst-case scenario." (Leslie G Eldenburg & Susan K Wolcott, "Cost Management 2nd Ed", 2011)

"Study of the impact that changes in one or more parts of a model have on other parts or the outcome." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed, 2011)

"A quantitative risk analysis and modeling technique used to help determine which risks have the most potential impact on the project. It examines the extent to which the uncertainty of each project element affects the objective being examined when all other uncertain elements are held at their baseline values. The typical display of results is in the form of a tornado diagram." (Cynthia Stackpole, "PMP® Certification All-in-One For Dummies®", 2011)

"A form of simulation modeling that focuses specifically on identifying the upper and lower bounds of model outputs given a series of inputs with specific variance." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"An analysis used in mathematical modelling, where the sensitivity of model results to variations in a particular variable is studied." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"An analysis technique to determine which individual project risks or other sources of uncertainty have the most potential impact on project outcomes, by correlating variations in project outcomes with variations in elements of a quantitative risk analysis model." (Project Management Institute, "A Guide to the Project Management Body of Knowledge (PMBOK® Guide )", 2017)

"An analysis that involves calculating a decision model multiple times with different inputs so a modeler can analyze the alternative results." (Ciara Heavin & Daniel J Power, "Decision Support, Analytics, and Business Intelligence 3rd Ed.", 2017)

"A technique used to determine how different values of an independent variable will impact a particular dependent variable under a given set of assumptions. It allows an analyst to determine whether a statistical finding will remain consistent under a variety of conditions. |" (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

01 February 2018

🔬Data Science: Data Analysis (Definitions)

"Obtaining information from measured or observed data." (Ildiko E Frank & Roberto Todeschini, "The Data Analysis Handbook", 1994)

"Refers to the process of organizing, summarizing and visualizing data in order to draw conclusions and make decisions." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A combination of human activities and computer processes that answer a research question or confirm a research hypotheses. It answers the question from data files, using empirical methods such as correlation, t-test, content analysis, or Mill’s method of agreement." (Jens Mende, "Data Flow Diagram Use to Plan Empirical Research Projects", 2009)

"The study and presentation of data to create information and knowledge." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Process of applying statistical techniques to evaluate data." (Sally-Anne Pitt, "Internal Audit Quality", 2014)

"Research phase in which data gathered from observing participants are analysed, usually with statistical procedures." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"Data analysis is the process of creating meaning from data. […] Data analysis is the process of creating information from data through the creation of data models and mathematics to find patterns." (Michael Heydt, "Learning Pandas" 2nd Ed, 2017)

"Data analysis is the process of organizing, cleaning, transforming, and modeling data to obtain useful information and ultimately, new knowledge." (John R. Hubbard, Java Data Analysis, 2017)

"Techniques used to organize, assess, and evaluate data and information." (Project Management Institute, "A Guide to the Project Management Body of Knowledge (PMBOK® Guide )", 2017)

"This is a class of statistical methods that make it possible to process a very large volume of data and identify the most interesting aspects of its structure. Some methods help to extract relations between different sets of data, and thus, draw statistical information that makes it possible to describe the most important information contained in the data in the most succinct manner possible. Other techniques make it possible to group data in order to identify its common denominators clearly, and thereby understand them better." (Soraya Sedkaoui, "Big Data Analytics for Entrepreneurial Success", 2019)

"The process and techniques for transforming and evaluating information using qualitative or quantitative tools to discover findings or inform conclusions." (Tiffany J Cresswell-Yeager & Raymond J Bandlow, "Transformation of the Dissertation: From an End-of-Program Destination to a Program-Embedded Process", 2020)

"Data Analysis is a process of gathering and extracting information from the data already present in different ways and order to study the pattern occurs." (Kirti R Bhatele, "Data Analysis on Global Stratification", 2020)

"A data lifecycle stage that involves the techniques that produce synthesized knowledge from organized information. A process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains." (CODATA)

"This discipline is the little brother of data science. Data analysis is focused more on answering questions about the present and the past. It uses less complex statistics and generally tries to identify patterns that can improve an organization." (KDnuggets)

"Data Analysis is the process of inspecting, cleansing, transforming, and modeling data to discover useful information, and support decision-making. The many different types of data analysis include data mining, a predictive technique used for modeling and knowledge discovery, and business intelligence, which relies on aggregation and focuses on business information." (Accenture)

🔬Data Science: Exploratory Data Analysis (Definitions)

"Exploratory data analysis (EDA) is a collection of techniques that reveal (or search for) structure in a data set before calculating any probabilistic model. Its purpose is to obtain information about the data distribution (univariate or multivariate), about the presence of outliers and clusters, to disclose relationships and correlations between objects and/or variables." (Ildiko E Frank & Roberto Todeschini, "The Data Analysis Handbook", 1994)

"Processes and methods for exploring patterns and trends in the data that are not known prior to the analysis. It makes heavy use of graphs, tables, and statistics." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2007)

"The process of analyzing data to suggest hypotheses using statistical tools, which can then be tested." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"In statistics, exploratory data analysis is an approach to analyzing datasets to summarize their main characteristics, often with visual methods." (Keith Holdaway, "Harness Oil and Gas Big Data with Analytics", 2014)

"Process in which data patterns guide the analysis or suggest revisions to the preliminary data analysis plan." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"Exploratory Data Analysis is about taking a dataset and extracting the most important information from it, in such a way that it is possible to get an idea of what the data looks like." (Richard M Reese et al, "Java: Data Science Made Easy", 2017)
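As a rough illustration of the definitions above, the following Python sketch summarizes a single numeric variable and flags possible outliers before any model is fitted. The function name `summarize` and the 1.5 × IQR outlier rule are assumptions chosen for the example; none of the quoted sources prescribes them.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers.

    Hypothetical helper for illustration: it reports the center and
    spread of the data and flags points outside 1.5 * IQR of the
    quartiles, a common (but not the only) rule of thumb for outliers.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "outliers": [v for v in values if v < low or v > high],
    }


if __name__ == "__main__":
    sample = [10, 12, 11, 13, 12, 11, 95]
    for name, value in summarize(sample).items():
        print(f"{name}: {value}")
```

In practice such summaries are usually paired with plots (histograms, box plots, scatter plots), which is the visual side of EDA several of the definitions emphasize.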

🔬Data Science: MapReduce (Definitions)

"A data processing and aggregation paradigm consisting of a 'map' phase that selects data and a 'reduce' phase that transforms the data. In MongoDB, you can run arbitrary aggregations over data using map-reduce." (MongoDB, "Glossary", 2008)

"A divide-and-conquer strategy for processing large data sets in parallel. In the 'map' phase, the data sets are subdivided. The desired computation is performed on each subset. The 'reduce' phase combines the results of the subset calculations into a final result. MapReduce frameworks handle the details of managing the operations and the nodes they run on, including restarting operations that fail for some reason. The user of the framework only has to write the algorithms for mapping and reducing the data sets and computing with the subsets." (Dean Wampler & Alex Payne, "Programming Scala", 2009)

"A method by which computationally intensive problems can be processed on multiple computers in parallel. The method can be divided into a mapping step and a reducing step. In the mapping step, a master computer divides a problem into smaller problems that are distributed to other computers. In the reducing step, the master computer collects the output from the other computers. Although MapReduce is intended for Big Data resources, holding petabytes of data, most Big Data problems do not require MapReduce." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"An early Big Data (before this term became popular) programming solution originally developed by Google for parallel processing using very large data sets distributed across a number of computing and storage systems. A Hadoop implementation of MapReduce is now available." (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)

"Designed by Google as a way of efficiently executing a set of functions against a large amount of data in batch mode. The 'map' component distributes the programming problem or tasks across a large number of systems and handles the placement of the tasks in a way that balances the load and manages recovery from failures. After the distributed computation is completed, another function called 'reduce' aggregates all the elements back together to provide a result." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A programming model consisting of two logical steps - Map and Reduce - for processing massively parallelizable problems across extremely large datasets using a large cluster of commodity computers." (Haoliang Wang et al, "Accessing Big Data in the Cloud Using Mobile Devices", Handbook of Research on Cloud Infrastructures for Big Data Analytics, 2014)

"Algorithm that is used to split massive data sets among many commodity hardware pieces in an effort to reduce computing time." (Billie Anderson & J Michael Hardin, "Harnessing the Power of Big Data Analytics", Encyclopedia of Business Analytics and Optimization, 2014)

"MapReduce is a parallel programming model proposed by Google and is used to distribute computing on clusters of computers for processing large data sets." (Jyotsna T Wassan, "Emergence of NoSQL Platforms for Big Data Needs", Encyclopedia of Business Analytics and Optimization, 2014)

"A concept which is an abstraction of the primitives ‘map’ and ‘reduce’. Most of the computations are carried by applying a ‘map’ operation to each global record in order to generate key/value pairs and then apply the reduce operation in order to combine the derived data appropriately." (P S Shivalkar & B K Tripathy, "Rough Set Based Green Cloud Computing in Emerging Markets", Encyclopedia of Information Science and Technology 3rd Ed., 2015)

"A programming model that uses a divide and conquer method to speed-up processing large datasets, with a special focus on semi-structured data." (Alfredo Cuzzocrea & Mohamed M Gaber, "Data Science and Distributed Intelligence", Encyclopedia of Information Science and Technology 3rd Ed., 2015)

"MapReduce is a programming model for general-purpose parallelization of data-intensive processing. MapReduce divides the processing into two phases: a mapping phase, in which data is broken up into chunks that can be processed by separate threads - potentially running on separate machines; and a reduce phase, which combines the output from the mappers into the final result." (Guy Harrison, "Next Generation Databases: NoSQL, NewSQL, and Big Data", 2015)

"MapReduce is a technological framework for processing parallelize-able problems across huge data sets using a large number of computers (nodes). […] MapReduce consists of two major steps: 'Map' and 'Reduce'. They are similar to the original Fork and Join operations in distributed systems, but they can consider a large number of computers that can be constructed based on the Internet cloud. In the Map-step, the master computer (a node) first divides the input into smaller sub-problems and then distributes them to worker computers (worker nodes). A worker node may also be a sub-master node to distribute the sub-problem into even smaller problems that will form a multi-level structure of a task tree. The worker node can solve the sub-problem and report the results back to its upper level master node. In the Reduce-step, the master node will collect the results from the worker nodes and then combine the answers in an output (solution) of the original problem." (Li M Chen et al, "Mathematical Problems in Data Science: Theoretical and Practical Methods", 2015)

"A programming model which process massive amounts of unstructured data in parallel and distributed cluster of processors." (Fatma Mohamed et al, "Data Streams Processing Techniques", Handbook of Research on Machine Learning Innovations and Trends, 2017)

"A data processing framework of Hadoop which provides data intensive computation of large data sets by dividing tasks across several machines and finally combining the result." (Rupali Ahuja, "Hadoop Framework for Handling Big Data Needs", Handbook of Research on Big Data Storage and Visualization Techniques, 2018)

"A high-level programming model, which uses the “map” and “reduce” functions, for processing high volumes of data." (Carson K.-S. Leung, "Big Data Analysis and Mining", Encyclopedia of Information Science and Technology 4th Ed., 2018)

"Is a computational paradigm for processing massive datasets in parallel if the computation fits a three-step pattern: map, shard and reduce. The map process is a parallel one. Each process executes on a different part of data and produces (key, value) pairs. The shard process collects the generated pairs, sorts and partitions them. Each partition is assigned to a different reduce process which produces a single result." (Venkat Gudivada et al, "Database Systems for Big Data Storage and Retrieval", Handbook of Research on Big Data Storage and Visualization Techniques, 2018)

"Is a programming model or algorithm for the processing of data using a parallel programming implementation and was originally used for academic purposes associated with parallel programming techniques." (Soraya Sedkaoui, "Understanding Data Analytics Is Good but Knowing How to Use It Is Better!", Big Data Analytics for Entrepreneurial Success, 2019)

"MapReduce is a style of programming based on functional programming that was the basis of Hadoop." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"Is a specific programming model, which as such represents a new approach to solving the problem of processing large amounts of differently structured data. It consists of two functions - Map (sorting and filtering data) and Reduce (summarizing intermediate results), and it is executed in parallel and distributed." (Savo Stupar et al, "Importance of Applying Big Data Concept in Marketing Decision Making", Handbook of Research on Applied AI for International Business and Marketing Applications, 2021)

"A software framework for processing vast amounts of data." (Analytics Insight)
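The map, shuffle (or "shard") and reduce steps described in the definitions above can be sketched in a few lines of Python. This is a single-process word count meant only to show the data flow; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are hypothetical, and a real framework such as Hadoop would run the map and reduce calls in parallel on separate nodes and handle failures.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (key, value) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key so each key is reduced once."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    """Run the three steps in sequence over a list of text chunks."""
    # Each chunk could be mapped independently, which is what makes the
    # map step parallelizable; here the chunks are processed one by one.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "data sets"]))
```

Because each chunk is mapped on its own and each key is reduced on its own, both steps can be distributed without changing the result, which is the property the definitions point to.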