06 May 2018

🔬Data Science: Variance (Definitions)

"The mean squared deviation of the measured response values from their average value." (Clyde M Creveling, "Six Sigma for Technical Processes: An Overview for R Executives, Technical Leaders, and Engineering Managers", 2006)

"The variance reflects the amount of variation in a set of observations." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"Describes dispersion about the data set’s mean. The variance is the square of the standard deviation. Conversely, the standard deviation is the square root of the variance." (E C Nelson & Stephen L Nelson, "Excel Data Analysis For Dummies ", 2015)

"Summary statistic that indicates the degree of variability among participants for a given variable. The variance is essentially the average squared deviation from the mean and is the square of the standard deviation." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A statistical measure of how spread (or varying) the values of a variable are around a central value such as the mean." (Jonathan Ferrar et al, "The Power of People", 2017)

🔬Data Science: Swarm Intelligence (Definitions)

"Swarm systems generate novelty for three reasons: (1) They are 'sensitive to initial conditions' - a scientific shorthand for saying that the size of the effect is not proportional to the size of the cause - so they can make a surprising mountain out of a molehill. (2) They hide countless novel possibilities in the exponential combinations of many interlinked individuals. (3) They don’t reckon individuals, so therefore individual variation and imperfection can be allowed. In swarm systems with heritability, individual variation and imperfection will lead to perpetual novelty, or what we call evolution." (Kevin Kelly, "Out of Control: The New Biology of Machines, Social Systems and the Economic World", 1995)

"Dumb parts, properly connected into a swarm, yield smart results." (Kevin Kelly, "New Rules for the New Economy", 1999)

"It is, however, fair to say that very few applications of swarm intelligence have been developed. One of the main reasons for this relative lack of success resides in the fact that swarm-intelligent systems are hard to 'program', because the paths to problem solving are not predefined but emergent in these systems and result from interactions among individuals and between individuals and their environment as much as from the behaviors of the individuals themselves. Therefore, using a swarm-intelligent system to solve a problem requires a thorough knowledge not only of what individual behaviors must be implemented but also of what interactions are needed to produce such or such global behavior." (Eric Bonabeau et al, "Swarm Intelligence: From Natural to Artificial Systems", 1999)

"Just what valuable insights do ants, bees, and other social insects hold? Consider termites. Individually, they have meager intelligence. And they work with no supervision. Yet collectively they build mounds that are engineering marvels, able to maintain ambient temperature and comfortable levels of oxygen and carbon dioxide even as the nest grows. Indeed, for social insects teamwork is largely self-organized, coordinated primarily through the interactions of individual colony members. Together they can solve difficult problems (like choosing the shortest route to a food source from myriad possible pathways) even though each interaction might be very simple (one ant merely following the trail left by another). The collective behavior that emerges from a group of social insects has been dubbed 'swarm intelligence'." (Eric Bonabeau & Christopher Meyer, Swarm Intelligence: A Whole New Way to Think About Business, Harvard Business Review, 2001)

"[…] swarm intelligence is becoming a valuable tool for optimizing the operations of various businesses. Whether similar gains will be made in helping companies better organize themselves and develop more effective strategies remains to be seen. At the very least, though, the field provides a fresh new framework for solving such problems, and it questions the wisdom of certain assumptions regarding the need for employee supervision through command-and-control management. In the future, some companies could build their entire businesses from the ground up using the principles of swarm intelligence, integrating the approach throughout their operations, organization, and strategy. The result: the ultimate self-organizing enterprise that could adapt quickly - and instinctively - to fast-changing markets." (Eric Bonabeau & Christopher Meyer, "Swarm Intelligence: A Whole New Way to Think About Business", Harvard Business Review, 2001)

"Swarm Intelligence can be defined more precisely as: Any attempt to design algorithms or distributed problem-solving methods inspired by the collective behavior of the social insect colonies or other animal societies. The main properties of such systems are flexibility, robustness, decentralization and self-organization." (Ajith Abraham et al, "Swarm Intelligence in Data Mining", 2006)

"Swarm intelligence can be effective when applied to highly complicated problems with many nonlinear factors, although it is often less effective than the genetic algorithm approach discussed later in this chapter. Swarm intelligence is related to swarm optimization […]. As with swarm intelligence, there is some evidence that at least some of the time swarm optimization can produce solutions that are more robust than genetic algorithms. Robustness here is defined as a solution’s resistance to performance degradation when the underlying variables are changed." (Michael J North & Charles M Macal, "Managing Business Complexity: Discovering Strategic Solutions with Agent-Based Modeling and Simulation", 2007)

[swarm intelligence] "Refers to a class of algorithms inspired by the collective behaviour of insect swarms, ant colonies, the flocking behaviour of some bird species, or the herding behaviour of some mammals, such that the behaviour of the whole can be considered as exhibiting a rudimentary form of 'intelligence'." (John Fulcher, "Intelligent Information Systems", 2009)

"The property of a system whereby the collective behaviors of unsophisticated agents interacting locally with their environment cause coherent functional global patterns to emerge." (M L Gavrilova, "Adaptive Algorithms for Intelligent Geometric Computing", 2009) 

[swarm intelligence] "Is a discipline that deals with natural and artificial systems composed of many individuals that coordinate using decentralized control and self-organization. In particular, SI focuses on the collective behaviors that result from the local interactions of the individuals with each other and with their environment." (Elina Pacini et al, "Schedulers Based on Ant Colony Optimization for Parameter Sweep Experiments in Distributed Environments", 2013). 

"Swarm intelligence (SI) is a branch of computational intelligence that discusses the collective behavior emerging within self-organizing societies of agents. SI was inspired by the observation of the collective behavior in societies in nature such as the movement of birds and fish. The collective behavior of such ecosystems, and their artificial counterpart of SI, is not encoded within the set of rules that determines the movement of each isolated agent, but it emerges through the interaction of multiple agents." (Maximos A Kaliakatsos-Papakostas et al, "Intelligent Music Composition", 2013)

"Collective intelligence of societies of biological (social animals) or artificial (robots, computer agents) individuals. In artificial intelligence, it gave rise to a computational paradigm based on decentralisation, self-organisation, local interactions, and collective emergent behaviours." (D T Pham & M Castellani, "The Bees Algorithm as a Biologically Inspired Optimisation Method", 2015)

"It is the field of artificial intelligence in which the population is in the form of agents which search in a parallel fashion with multiple initialization points. The swarm intelligence-based algorithms mimic the physical and natural processes for mathematical modeling of the optimization algorithm. They have the properties of information interchange and non-centralized control structure." (Sajad A Rather & P Shanthi Bala, "Analysis of Gravitation-Based Optimization Algorithms for Clustering and Classification", 2020)

"It [swarm intelligence] is the discipline dealing with natural and artificial systems consisting of many individuals who coordinate through decentralized monitoring and self-organization." (Mehmet A Cifci, "Optimizing WSNs for CPS Using Machine Learning Techniques", 2021)

Resources:
More quotes on "Swarm Intelligence" at the-web-of-knowledge.blogspot.com.

05 May 2018

🔬Data Science: Clustering (Definitions)

"Grouping of similar patterns together. In this text the term 'clustering' is used only for unsupervised learning problems in which the desired groupings are not known in advance." (Laurene V Fausett, "Fundamentals of Neural Networks: Architectures, Algorithms, and Applications", 1994)

"The process of grouping similar input patterns together using an unsupervised training algorithm." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

"Clustering attempts to identify groups of observations with similar characteristics." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"The process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects, which are 'similar' between them and are 'dissimilar' to the objects belonging to other clusters." (Juan R González et al, "Nature-Inspired Cooperative Strategies for Optimization", 2008)

"Grouping the nodes of an ad hoc network such that each group is a self-organized entity having a cluster-head which is responsible for formation and management of its cluster." (Prayag Narula, "Evolutionary Computing Approach for Ad-Hoc Networks", 2009)

"The process of assigning individual data items into groups (called clusters) so that items from the same cluster are more similar to each other than items from different clusters. Often similarity is assessed according to a distance measure." (Alfredo Vellido & Iván Olie, "Clustering and Visualization of Multivariate Time Series", 2010)

"Verb. To output a smaller data set based on grouping criteria of common attributes." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The process of partitioning the data attributes of an entity or table into subsets or clusters of similar attributes, based on subject matter or characteristic (domain)." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A data mining technique that analyzes data to group records together according to their location within the multidimensional attribute space." (SQL Server 2012 Glossary, "Microsoft", 2012)

"Clustering aims to partition data into groups called clusters. Clustering is usually unsupervised in the sense that the training data is not labeled. Some clustering algorithms require a guess for the number of clusters, while other algorithms don't." (Ivan Idris, "Python Data Analysis", 2014)

"Form of data analysis that groups observations to clusters. Similar observations are grouped in the same cluster, whereas dissimilar observations are grouped in different clusters. As opposed to classification, there is not a class attribute and no predefined classes exist." (Efstathios Kirkos, "Composite Classifiers for Bankruptcy Prediction", 2014)

"Organization of data in some semantically meaningful way such that each cluster contains related data while the unrelated data are assigned to different clusters. The clusters may not be predefined." (Sanjiv K Bhatia & Jitender S Deogun, "Data Mining Tools: Association Rules", 2014)

"Techniques for organizing data into groups of similar cases." (Meta S Brown, "Data Mining For Dummies", 2014)

[cluster analysis:] "A technique that identifies homogenous subgroups or clusters of subjects or study objects." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"Clustering is a classification technique where similar kinds of objects are grouped together. The similarity between the objects maybe determined in different ways depending upon the use case. Therefore, clustering in measurement space may be an indicator of similarity of image regions, and may be used for segmentation purposes." (Shiwangi Chhawchharia, "Improved Lymphocyte Image Segmentation Using Near Sets for ALL Detection", 2016)

"Clustering techniques share the goal of creating meaningful categories from a collection of items whose properties are hard to directly perceive and evaluate, which implies that category membership cannot easily be reduced to specific property tests and instead must be based on similarity. The end result of clustering is a statistically optimal set of categories in which the similarity of all the items within a category is larger than the similarity of items that belong to different categories." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed., 2016)

[cluster analysis:]"A statistical technique for finding natural groupings in data; it can also be used to assign new cases to groupings or categories." (Jonathan Ferrar et al, "The Power of People", 2017)

"Clustering or cluster analysis is a set of techniques of multivariate data analysis aimed at selecting and grouping homogeneous elements in a data set. Clustering techniques are based on measures relating to the similarity between the elements. In many approaches this similarity, or better, dissimilarity, is designed in terms of distance in a multidimensional space. Clustering algorithms group items on the basis of their mutual distance, and then the belonging to a set or not depends on how the element under consideration is distant from the collection itself." (Crescenzio Gallo, "Building Gene Networks by Analyzing Gene Expression Profiles", 2018)

"Unsupervised learning or clustering is a way of discovering hidden structure in unlabeled data. Clustering algorithms aim to discover latent patterns in unlabeled data using features to organize instances into meaningfully dissimilar groups." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"The term clustering refers to the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters." (Satyadhyan Chickerur et al, "Forecasting the Demand of Agricultural Crops/Commodity Using Business Intelligence Framework", 2019)

"In the machine learning context, clustering is the task of grouping examples into related groups. This is generally an unsupervised task, that is, the algorithm does not use preexisting labels, though there do exist some supervised clustering algorithms." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"A cluster is a group of data objects which have similarities among them. It's a group of the same or similar elements gathered or occurring closely together." (Hari K Kondaveeti et al, "Deep Learning Applications in Agriculture: The Role of Deep Learning in Smart Agriculture", 2021)

"Clustering describes an unsupervised machine learning technique for identifying structures among unstructured data. Clustering algorithms group sets of similar objects into clusters, and are widely used in areas including image analysis, information retrieval, and bioinformatics." (Accenture)

"Describes an unsupervised machine learning technique for identifying structures among unstructured data. Clustering algorithms group sets of similar objects into clusters, and are widely used in areas including image analysis, information retrieval, and bioinformatics." (Accenture)

"The process of identifying objects that are similar to each other and cluster them in order to understand the differences as well as the similarities within the data." (Analytics Insight)

🔬Data Science: Classification (Definitions)

"Classification is the process of arranging data into sequences and groups according to their common characteristics, or separating them into different but related parts." (Horace Secrist, "An Introduction to Statistical Methods", 1917)

"A classification is a scheme for breaking a category into a set of parts, called classes, according to some precisely defined differing characteristics possessed by all the elements of the category." (Alva M Tuttle, "Elementary Business and Economic Statistics", 1957)

"The process of learning to distinguish and discriminate between different input patterns using a supervised training algorithm." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

"1.Generally, a set of discrete, exhaustive, and mutually exclusive observations that can be assigned to one or more variables to be measured in the collation and/or presentation of data. 2.In data modeling, the arrangement of entities into supertypes and subtypes. 3.In object-oriented design, the arrangement of objects into classes, and the assignment of objects to these categories." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Form of data analysis that models the relationships between a number of variables and a target feature. The target feature contains nominal values that indicate the class to which each observation belongs." (Efstathios Kirkos, "Composite Classifiers for Bankruptcy Prediction", 2014)

"Systematic identification and arrangement of business activities and/or records into categories according to logically structured conventions, methods, and procedural rules represented in a classification system. A coding of content items as members of a group for the purposes of cataloging them or associating them with a taxonomy." (Robert F Smallwood, "Information Governance: Concepts, Strategies, and Best Practices", 2014)

"Techniques for organizing data into groups associated with a particular outcome, such as the likelihood to purchase a product or earn a college degree." (Meta S Brown, "Data Mining For Dummies", 2014)

"The systematic assignment of resources to a system of intentional categories, often institutional ones. Classification is applied categorization - the assignment of resources to a system of categories, called classes, using a predetermined set of principles." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed., 2016)

"A systematic arrangement of objects into groups or categories according to a set of established criteria. Data and resources can be assigned a level of sensitivity as they are being created, amended, enhanced, stored, or transmitted. The classification level then determines the extent to which the resource needs to be controlled and secured, and is indicative of its value in terms of information assets." (Shon Harris & Fernando Maymi, "CISSP All-in-One Exam Guide" 8th Ed., 2018)

"In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known." (Soraya Sedkaoui, "Big Data Analytics for Entrepreneurial Success", 2018)

"Systematic identification and arrangement of business activities and/or records into categories according to logically structured conventions, methods, and procedural rules represented in a classification system. A coding of content items as members of a group for the purposes of cataloging them or associating them with a taxonomy." (Robert F Smallwood, "Information Governance for Healthcare Professionals", 2018)

"It is task of classifying the data into predefined number of classes. It is a supervised approach. The tagged data is used to create classification model that will be used for classification on unknown data." (Siddhartha Kumar Arjaria & Abhishek S Rathore, "Heart Disease Diagnosis: A Machine Learning Approach", 2019)

"In a machine learning context, classification is the task of assigning classes to examples. The simplest form is the binary classification task where each example can have one of two classes. The binary classification task is a special case of the multiclass classification task where each example can have one of a fixed set of classes. There is also the multilabel classification task where each example can have zero or more labels from a fixed set of labels." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"The act of assigning a category to something" (ITIL)

29 April 2018

🔬Data Science: Data Standardization (Definitions)

"The process of reaching agreement on common data definitions, formats, representation and structures of all data layers and elements." (United Nations, "Handbook on Geographic Information Systems and Digital Mapping", Studies in Methods No. 79, 2000)

[value standardization:] "Refers to the establishment and adherence of data to standard formatting practices, ensuring a consistent interpretation of data values." (Evan Levy & Jill Dyché, "Customer Data Integration", 2006)

"Converting data into standard formats to facilitate parsing and thus matching, linking, and de-duplication. Examples include: “Avenue” as “Ave.” in addresses; “Corporation” as “Corp.” in business names; and variations of a specific company name as one version." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"Normalizes data values to meet format and semantic definitions. For example, data standardization of address information may ensure that an address includes all of the required pieces of information and normalize abbreviations (for example Ave. for Avenue)." (Martin Oberhofer et al, "Enterprise Master Data Management", 2008)

"Using rules to conform data that is similar into a standard format or structure. Example: taking similar data, which originates in a variety of formats, and transforming it into a single, clearly defined format." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"a process in information systems where data values for a data element are transformed to a consistent representation." (Meredith Zozus, "The Data Book: Collection and Management of Research Data", 2017)

"Data standardization is the process of converting data to a common format to enable users to process and analyze it." (Sisense) [source]

"In the context of data analysis and data mining: Where “V” represents the value of the variable in the original datasets: Transformation of data to have zero mean and unit variance. Techniques used include: (a) Data normalization; (b) z-score scaling; (c) Dividing each value by the range: recalculates each variable as V /(max V – min V). In this case, the means, variances, and ranges of the variables are still different, but at least the ranges are likely to be more similar; and, (d) Dividing each value by the standard deviation. This method produces a set of transformed variables with variances of 1, but different means and ranges." (CODATA)

27 April 2018

🔬Data Science: Validity (Definitions)

"An argument that explains the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of decisions made from an assessment." (Asao B Inoue, "The Technology of Writing Assessment and Racial Validity", 2009)

[external *]: "The extent to which the results obtained can be generalized to other individuals and/or contexts not studied." (Joan Hawthorne et al, "Method Development for Assessing a Diversity Goal", 2009)

[external *:] "A study has external validity when its results are generalizable to the target population of interest. Formally, external validity means that the causal effect based on the study population equals the causal effect in the target population. In counterfactual terms, external validity requires that the study population be exchangeable with the target population." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

[internal *:] "A study has internal validity when it provides an unbiased estimate of the causal effect of interest. Formally, internal validity means that the empirical effect from the study is equal to the causal effect in the study population." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

"Construct validity is a term developed by psychometricians to describe the ability of a variable to represent accurately an underlying characteristic of interest." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

[operational validity:] "is defined as a model result behavior has enough correctness for a model intended aim over the area of system intended applicability." (Sattar J Aboud et al, "Verification and Validation of Simulation Models", 2010)

"Validity is the ability of the study to produce correct results. There are various specific types of validity (see internal validity, external validity, construct validity). Threats to validity include primarily what we have termed bias, but encompass a wider range of methodological problems, including random error and lack of construct validity." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

[internal validity:] "Accuracy of the research study in determining the relationship between independent and the dependent variables. Internal validity can be assured only if all potential confounding variables have been properly controlled." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

[external *:] "Extent to which the results of a study accurately indicate the true nature of a relationship between variables in the real world. If a study has external validity, the results are said to be generalisable to the real world." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"The degree to which inferences made from data are appropriate to the context being examined. A variety of evidence can be used to support interpretation of scores." (Anne H Cash, "A Call for Mixed Methods in Evaluating Teacher Preparation Programs", 2016)

[construct *:] "Validity of a theory is also known as construct validity. Most theories in science present broad conceptual explanations of relationship between variables and make many different predictions about the relationships between particular variables in certain situations. Construct validity is established by verifying the accuracy of each possible prediction that might be made from the theory. Because the number of predictions is usually infinite, construct validity can never be fully established. However, the more independent predictions for the theory verified as accurate, the stronger the construct validity of the theory." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

23 April 2018

🔭Data Science: Independence (Just the Quotes)

"To apply the category of cause and effect means to find out which parts of nature stand in this relation. Similarly, to apply the gestalt category means to find out which parts of nature belong as parts to functional wholes, to discover their position in these wholes, their degree of relative independence, and the articulation of larger wholes into sub-wholes." (Kurt Koffka, 1931)

"If significance tests are required for still larger samples, graphical accuracy is insufficient, and arithmetical methods are advised. A word to the wise is in order here, however. Almost never does it make sense to use exact binomial significance tests on such data - for the inevitable small deviations from the mathematical model of independence and constant split have piled up to such an extent that the binomial variability is deeply buried and unnoticeable. Graphical treatment of such large samples may still be worthwhile because it brings the results more vividly to the eye." (Frederick Mosteller & John W Tukey, "The Uses and Usefulness of Binomial Probability Paper?", Journal of the American Statistical Association 44, 1949)

"A satisfactory prediction of the sequential properties of learning data from a single experiment is by no means a final test of a model. Numerous other criteria - and some more demanding - can be specified. For example, a model with specific numerical parameter values should be invariant to changes in independent variables that explicitly enter in the model." (Robert R Bush & Frederick Mosteller,"A Comparison of Eight Models?", Studies in Mathematical Learning Theory, 1959)

"[A] sequence is random if it has every property that is shared by all infinite sequences of independent samples of random variables from the uniform distribution." (Joel N Franklin, 1962)

"So we pour in data from the past to fuel the decision-making mechanisms created by our models, be they linear or nonlinear. But therein lies the logician's trap: past data from real life constitute a sequence of events rather than a set of independent observations, which is what the laws of probability demand. [...] It is in those outliers and imperfections that the wildness lurks." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"In error analysis the so-called 'chi-squared' is a measure of the agreement between the uncorrelated internal and the external uncertainties of a measured functional relation. The simplest such relation would be time independence. Theory of the chi-squared requires that the uncertainties be normally distributed. Nevertheless, it was found that the test can be applied to most probability distributions encountered in practice." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is important that uncertainty components that are independent of each other are added quadratically. This is also true for correlated uncertainty components, provided they are independent of each other, i.e., as long as there is no correlation between the components." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"The fact that the same uncertainty (e.g., scale uncertainty) is uncorrelated if we are dealing with only one measurement, but correlated (i.e., systematic) if we look at more than one measurement using the same instrument shows that both types of uncertainties are of the same nature. Of course, an uncertainty keeps its characteristics (e.g., Poisson distributed), independent of the fact whether it occurs only once or more often." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"To fulfill the requirements of the theory underlying uncertainties, variables with random uncertainties must be independent of each other and identically distributed. In the limiting case of an infinite number of such variables, these are called normally distributed. However, one usually speaks of normally distributed variables even if their number is finite." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Bayesian networks provide a more flexible representation for encoding the conditional independence assumptions between the features in a domain. Ideally, the topology of a network should reflect the causal relationships between the entities in a domain. Properly constructed Bayesian networks are relatively powerful models that can capture the interactions between descriptive features in determining a prediction." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015) 

"Bayesian networks use a graph-based representation to encode the structural relationships - such as direct influence and conditional independence - between subsets of features in a domain. Consequently, a Bayesian network representation is generally more compact than a full joint distribution (because it can encode conditional independence relationships), yet it is not forced to assert a global conditional independence between all descriptive features. As such, Bayesian network models are an intermediary between full joint distributions and naive Bayes models and offer a useful compromise between model compactness and predictive accuracy." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"The main differences between Bayesian networks and causal diagrams lie in how they are constructed and the uses to which they are put. A Bayesian network is literally nothing more than a compact representation of a huge probability table. The arrows mean only that the probabilities of child nodes are related to the values of parent nodes by a certain formula (the conditional probability tables) and that this relation is sufficient. That is, knowing additional ancestors of the child will not change the formula. Likewise, a missing arrow between any two nodes means that they are independent, once we know the values of their parents. [...] If, however, the same diagram has been constructed as a causal diagram, then both the thinking that goes into the construction and the interpretation of the final diagram change." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

16 April 2018

🔬Data Science: Classification Tree (Definitions)

"A decision tree that is used for prediction of categorical data." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"One of the main 'workhorse' techniques in data mining; used to predict membership of cases in the classes of a categorical dependent variable from their measurements predictor variables. Classification trees typically split the sample on simple rules and then resplit the subsamples, etc., until the data can’t sustain further complexity." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"A machine learning approach that uses training data to create a model that can then be used for assigning cases (for example, workers) in a dataset to different possible groupings (for example, leavers or stayers)." (Jonathan Ferrar et al, "The Power of People", 2017)

"a form of classification algorithm in which features are examined in sequence, with the response indicating the next feature to examine, until a classification is made." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"A tree showing equivalence partitions hierarchically ordered, which is used to design test cases in the classification tree method. See also classification tree method." (SQA)

13 April 2018

🔬Data Science: Text Mining (Definitions)

"The application of data mining techniques to discover actionable and meaningful patterns, profiles, and trends from documents or other text data." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed, 2011)

"The process of evaluating unstructured text for patterns, extract actionable data and sentiment via semantic analysis, statistical methods, etc." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Performing detailed full–text searches on the content of document." (Robert F Smallwood, "Managing Electronic Records: Methods, Best Practices, and Technologies", 2013)

"Data-mining techniques applied to text. Because these rely on the same underlying analytic approaches as text analysis, text mining is synonymous with text analysis, and the use of the term mining is primarily a matter of style and context." (Meta S Brown, "Data Mining For Dummies", 2014)

"Performing detailed full-text searches on the content of document." (Robert F Smallwood, "Information Governance: Concepts, Strategies, and Best Practices", 2014)

"It is the process of extracting information from textual sources, via their grammatical and statistical properties. Applications of text mining include security monitoring and analysis of online texts such as blogs, web-pages, web-posts, etc." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"The analysis of raw data to produce results specific to a particular inquiry (e.g., how often a particular word is used, whether a particular product is in demand, how a particular consumer reacts to advertisements)." (James R Kalyvas & Michael R Overly, "Big Data: A Businessand Legal Guide", 2015)

"Performing detailed full-text searches on the content of document." (Robert F Smallwood, "Information Governance for Healthcare Professionals", 2018)

"The search and extraction of text, and its possible conversion to numerical data that is used for data analysis." (David Natingga, "Data Science Algorithms in a Week" 2nd Ed., 2018)

"The process of extracting information from collections of textual data and utilizing it for business objectives." (Gartner) 

10 April 2018

🔬Data Science: Abstraction (Definitions)

"A broad and general term indicating (1) a less detailed model that conforms to (defines a subset of the properties of) another model, and (2) the process through which a less detailed but conforming model is made, that is, the process of removing details that are not relevant to the purpose of the model." (Anneke Kleppe et al, "MDA Explained: The Model Driven Architecture™: Practice and Promise", 2003)

"The process of ignoring or suppressing levels of detail to provide a simpler, more generalized view." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"The process of moving from the specific to the general by neglecting minor differences or stressing common elements. Also used as a synonym for summarization." (Martin J Eppler, "Managing Information Quality" 2nd Ed., 2006)

"Data abstraction means the storage details of the data are hidden from the user and the user is provided with the conceptual view of the database." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"In data modeling, the redefinition of data entities, attributes, and relationships by removing details to broaden the applicability of data structures to a wider class of situations, often by implementing supertypes rather than subtypes." (DAMA International, "The DAMA Dictionary of Data Management" 1st Ed., 2010)

[horizontal abstraction:] "The process of partitioning a model into smaller subparts for presentation. Used in data modeling to show related areas in a more readable scale." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[vertical abstraction:] "The presentation of all or part of a model detail. Used in data modeling to show higher levels of entities and relationships to illustrate the basic subject area contents." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The separation of the logical view of data from its implementation." (Nell Dale & John Lewis, "Computer Science Illuminated" 6th Ed., 2015)

"The separation of a data type’s logical properties from its implementation." (Nell Dale et al, "Object-Oriented Data Structures Using Java" 4th Ed., 2016)

05 April 2018

🔬Data Science: Genetic Algorithms [GA] (Definitions)

"A method for solving optimization problems using parallel search, based on the biological paradigm of natural selection and 'survival of the fittest'." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

"Algorithms for solving complex combinatorial and organizational problems with many variants, by employing analogy with nature's evolution. The general steps a genetic algorithm cycles through are: generate a new population (crossover) starting at the beginning with initial one; select the best individuals; mutate, if necessary; repeat the same until a satisfactory solution is found according to a goodness (fitness) function." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"The type of algorithm that locates optimal binary strings by processing an initially random population of strings using artificial mutation, crossover, and selection operators, in an analogy with the process of natural selection." (David E Goldberg, "Genetic Algorithms", 1989)

"A technique for estimating computer models (e.g., Machine Learning) based on methods adapted from the field of genetics in biology. To use this technique, one encodes possible model behaviors into a 'genes'. After each generation, the current models are rated and allowed to mate and breed based on their fitness. In the process of mating, the genes are exchanged, and crossovers and mutations can occur. The current population is discarded and its offspring forms the next generation." (William J Raynor Jr., "The International Dictionary of Artificial Intelligence", 1999)

"Genetic algorithms are problem-solving techniques that solve problems by evolving solutions as nature does, rather than by looking for solutions in a more principled way. Genetic algorithms, sometimes hybridized with other optimization algorithms, are the best optimization algorithms available across a wide range of problem types." (Guido Deboeck & Teuvo Kohonen (Eds), "Visual Explorations in Finance with Self-Organizing Maps" 2nd Ed., 2000)

"learning principle, in which learning results are foully from generations of solutions by crossing and eliminating their members. An improved behavior usually ensues from selective stochastic replacements in subsets of system parameters." (Teuvo Kohonen, "Self-Organizing Maps 3rd Ed.", 2001)

"A genetic algorithm is a search method used in computational intelligence to find true or approximate solutions to optimization and search problems." (Omar F El-Gayar et al, "Current Issues and Future Trends of Clinical Decision Support Systems", 2008)

"A method of evolutionary computation for problem solving. There are states also called sequences and a set of possibility final states. Methods of mutation are used on genetic sequences to achieve better sequences." (Attila Benko & Cecília S Lányi, "History of Artificial Intelligence", 2009) 

"Genetic algorithms are derivative free, stochastic optimization methods based on the concepts of natural selection and evolutionary processes." (Yorgos Goletsis et al, Bankruptcy Prediction through Artificial Intelligence, 2009)

"Genetic Algorithms (GAs) are algorithms that use operations found in natural genetics to guide their way through a search space and are increasingly being used in the field of optimisation. The robust nature and simple mechanics of genetic algorithms make them inviting tools for search learning and optimization. Genetic algorithms are based on computational models of fundamental evolutionary processes such as selection, recombination and mutation." (Masoud Mohammadian, Supervised Learning of Fuzzy Logic Systems, 2009)

"The algorithms that are modelled on the natural process of evolution. These algorithms employ methods such as crossover, mutation and natural selection and provide the best possible solutions after analyzing a group of sub-optimal solutions which are provided as inputs." (Prayag Narula, "Evolutionary Computing Approach for Ad-Hoc Networks", 2009)

"The type of algorithm that locates optimal binary strings by processing an initially random population of strings using artificial mutation, crossover, and selection operators, in an analogy with the process of natural selection." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"These algorithms mimic the process of natural evolution and perform explorative search. The main component of this method is chromosomes that represent solutions to the problem. It uses selection, crossover, and mutation to obtain chromosomes of highest quality." (Indranil Bose, "Data Mining in Tourism", 2009)

"Search algorithms used in machine learning which involve iteratively generating new candidate solutions by combining two high scoring earlier (or parent) solutions in a search for a better solution." (Radian Belu, "Artificial Intelligence Techniques for Solar Energy and Photovoltaic Applications", 2013)

"Genetic algorithms (GAs) is a stochastic search methodology belonging to the larger family of artificial intelligence procedures and evolutionary algorithms (EA). They are used to generate useful solutions to optimization and search problems mimicking Darwinian evolution." (Niccolò Gordini, "Genetic Algorithms for Small Enterprises Default Prediction: Empirical Evidence from Italy", 2014)

"Genetic algorithms are based on the biological theory of evolution. This type of algorithms is useful for searching and optimization." (Ivan Idris, "Python Data Analysis", 2014)

"A Stochastic optimization algorithms based on the principles of natural evolution." (Harish Garg, "A Hybrid GA-GSA Algorithm for Optimizing the Performance of an Industrial System by Utilizing Uncertain Data", 2015)

"It is a stochastic but not random method of search used for optimization or learning. Genetic algorithm is basically a search technique that simulates biological evolution during optimization process." (Salim Lahmir, "Prediction of International Stock Markets Based on Hybrid Intelligent Systems", 2016)

"Machine learning algorithms inspired by genetic processes, for example, an evolution where classifiers with the greatest accuracy are trained further." (David Natingga, "Data Science Algorithms in a Week" 2nd Ed., 2018)

04 April 2018

🔬Data Science: Fuzzy Logic (Definitions)

"[Fuzzy logic is] a logic whose distinguishing features are (1) fuzzy truth-values expressed in linguistic terms, e. g., true, very true, more or less true, or somewhat true, false, nor very true and not very false, etc.; (2) imprecise truth tables; and (3) rules of inference whose validity is relative to a context rather than exact." (Lotfi A. Zadeh, "Fuzzy logic and approximate reasoning", 1975)

"A logic using fuzzy sets, that is, in which elements can have partial set membership." (Bruce P Douglass, "Real-Time Agility", 2009)

"A mathematical technique that classifies subjective reasoning and assigns data to a particular group, or cluster, based on the degree of possibility the data has of being in that group." (Mary J Lenard & Pervaiz Alam, "Application of Fuzzy Logic to Fraud Detection", 2009)

"A type of logic that recognizes more than simple true and false values. With fuzzy logic, propositions can be represented with degrees of truthfulness and falsehood thus it can deal with imprecise or ambiguous data. Boolean logic is considered to be a special case of fuzzy logic." (Lior Rokach, "Incorporating Fuzzy Logic in Data Mining Tasks", 2009)

"Fuzzy logic is an application area of fuzzy set theory dealing with uncertainty in reasoning. It utilizes concepts, principles, and methods developed within fuzzy set theory for formulating various forms of sound approximate reasoning. Fuzzy logic allows for set membership values to range (inclusively) between 0 and 1, and in its linguistic form, imprecise concepts like 'slightly', 'quite' and 'very'. Specifically, it allows partial membership in a set." (Larbi Esmahi et al,  Adaptive Neuro-Fuzzy Systems, 2009)

"It is a Knowledge representation technique and computing framework whose approach is based on degrees of truth rather than the usual 'true' or 'false' of classical logic." (Juan C González-Castolo & Ernesto López-Mellado, "Fuzzy Approximation of DES State", 2009)

"Fuzzy logic is a theory that deals with reasoning that is approximate rather than precisely deduced from classical predicate logic. In other words, fuzzy logic deals with well thought out real world expert values in relation to a complex problem." (Goh B Hua, "A BIM Based Application to Support Cost Feasible ‘Green Building' Concept Decisions", 2010)

"We use the term fuzzy logic to refer to all aspects of representing and manipulating knowledge that employ intermediary truth-values. This general, commonsense meaning of the term fuzzy logic encompasses, in particular, fuzzy sets, fuzzy relations, and formal deductive systems that admit intermediary truth-values, as well as the various methods based on them." (Radim Belohlavek & George J Klir, "Concepts and Fuzzy Logic", 2011)

"Fuzzy logic is a form of many-valued logic derived from fuzzy set theory to deal with uncertainty in subjective belief. In contrast with 'crisp logic', where binary sets have two-valued logic, fuzzy logic variables can have a value that ranges between 0 and 1. Furthermore, when linguistic variables are used, these unit-interval numerical values may be described by specific functions." (T T Wong & Loretta K W Sze, "A Neuro-Fuzzy Partner Selection System for Business Social Networks", 2012)

"Fuzzy logic is a problem-solving methodology that is inspired by human decision-making, taking advantage of our ability to reason with vague or approximate data." (Filipe Quinaz et al, Soft Methods for Automatic Drug Infusion in Medical Care Environment, 2013)

"Approach of using approximate reasoning based on degrees of truth for computation analysis." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"It is a type of reasoning designed to mathematically represent uncertainty and vagueness where logical statements are not only true or false. Fuzzy logic is a formalized mathematical tool which is useful to deal with imprecise problems." (Salim Lahmir, "Prediction of International Stock Markets Based on Hybrid Intelligent Systems", 2016)

"Fuzzy logic is a problem solving tool of artificial intelligence which deals with approximate reasoning rather than fixed and exact reasoning." (Narendra K Kamila & Pradeep K Mallick, "A Novel Fuzzy Logic Classifier for Classification and Quality Measurement of Apple Fruit", 2016)

"A form of many-valued logic. Fuzzy logic deals with reasoning that is approximate rather than fixed and exact. Compared to traditional true or false values, fuzzy logic variables may have a truth value that ranges in degree from 0 to 1. Fuzzy logic has been extended to handle the concept of partial truth, where the truth value may range between completely true and completely false." (Roanna Lun & Wenbing Zhao, "Kinect Applications in Healthcare", 2018)

"Fuzzy logic is a computing approach based on multi-valued logic where the variable can take any real number between 0 and 1 as a value based on degree of truthness." (Kavita Pandey & Shikha Jain, "A Fuzzy-Based Sustainable Solution for Smart Farming", 2020)

"Fuzzy Logic is a form of mathematical logic in which the truth values of variables may be any real number between 0 and 1. It is employed to handle the concept of partial truth, where the truth value may range between completely true and completely false. By contrast, in Boolean logic, the truth values of variables may only be the integer values 0 or 1." (Alexander P Ryjov & Igor F Mikhalevich, "Hybrid Intelligence Framework for Improvement of Information Security of Critical Infrastructures", 2021)

"Fuzzy Logic is a form of logic system, where the distinction between truth and false values is not binary but multi valued, therefore allowing for a richer expression of logical statements. " (Accenture)

🔬Data Science: Normal Distribution (Definitions)

"A frequency distribution for a continuous variable, which exhibits a bell-shaped curve." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"The symmetric distribution of data about an average point. The normal distribution takes on the form of a bell-shaped curve. It is a graphic illustration of how randomly selected data points from a product or process response will mostly fall close to the average response, with fewer and fewer data points falling farther and farther away from the mean. The normal distribution can also be expressed as a mathematical function and is often called a Gaussian distribution." (Clyde M Creveling, "Six Sigma for Technical Processes: An Overview for R Executives, Technical Leaders, and Engineering Managers", 2006)

"A probability distribution forming a symmetrical bell-shaped curve." (Peter Oakander et al, "CPM Scheduling for Construction: Best Practices and Guidelines", 2014)

"Distribution of scores that are characterised by a bell-shaped curve in which the probability of a score drops off rapidly from the midpoint to the tails of the distribution. A true normal curve is defined by a mathematical equation and is a function of two variables (the mean and variance of the distribution)." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"Also known as a bell-shaped curve or Gaussian curve, this is a distribution of data that is symmetrical around the mean: The mean, median, and mode are all equal, with more density in the center and less in the tails." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"Also known as normal or the bell curve, is a type of continuous probability distribution which is defined by two parameters, the mean µ, and the standard deviation s." (Accenture)

🔬Data Science: Graph (Definitions)

"Informally, a graph is a finite set of dots called vertices (or nodes) connected by links called edges (or arcs). More formally: a simple graph is a (usually finite) set of vertices V and set of unordered pairs of distinct elements of V called edges." (Craig F Smith & H Peter Alesso, "Thinking on the Web: Berners-Lee, Gödel and Turing", 2008)

"A computation object that is used to model relationships among things. A graph is defined by two finite sets: a set of nodes and a set of edges. Each node has a label to identify it and distinguish it from other nodes. Edges in a graph connect exactly two nodes and are denoted by the pair of labels of nodes that are related." (Clay Breshears, "The Art of Concurrency", 2009)

"A graph in mathematics is a set of nodes and a set of edges between pairs of those nodes; the edges are ordered or nonordered pairs, or a relation, that defines the pairs of nodes for which the relation being examined is valid. […] The edges can either be undirected or directed; directed edges depict a relation that requires the nodes to be ordered while an undirected edge defines a relation in which no ordering of the edges is implied." (Dennis M Buede, "The Engineering Design of Systems: Models and methods", 2009)

[undirected graph:] "A graph in which the nodes of an edge are unordered. This implies that the edge can be thought of as a two-way path." (Clay Breshears, "The Art of Concurrency", 2009)

[directed graph:] "A graph whose edges are ordered pairs of nodes; this allows connections between nodes in one direction. When drawn, the edges of a directed graph are commonly shown as arrows to indicate the “direction” of the edge." (Clay Breshears, "The Art of Concurrency", 2009)

"1.Generally, a set of homogeneous nodes (vertices) and edges (arcs) between pairs of nodes." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[directed acyclic graph:] "A graph that defines a partial order so that nodes can be sorted into a linear sequence with references only going in one direction. A directed acyclic graph has, as its name suggests, directed edges and no cycles." (Michael McCool et al, "Structured Parallel Programming", 2012)

"A data structure that consists of a set of nodes and a set of edges that relate the nodes to each other" (Nell Dale & John Lewis, "Computer Science Illuminated" 6th Ed., 2015)

[directed graph:] "A directed graph is one in which the edges have a specified direction from one vertex to another." (Dan Sullivan, "NoSQL for Mere Mortals", 2015)

[directed graph (digraph):] "A graph in which each edge is directed from one vertex to another (or the same) vertex" (Nell Dale & John Lewis, "Computer Science Illuminated" 6th Ed., 2015)

[undirected graph:] "A graph in which the edges have no direction" (Nell Dale & John Lewis, "Computer Science Illuminated" 6th Ed., 2015)

[undirected graph:] "An undirected graph is one in which the edges do not indicate a direction (such as from-to) between two vertices." (Dan Sullivan, "NoSQL for Mere Mortals®", 2015)

"Like a tree, a graph consists of a set of nodes connected by edges. These edges may or may not have a direction. If they do, the graph is referred to as a 'directed graph'. If a graph is directed, it may be possible to start at a node and follow edges in a path that leads back to the starting node. Such a path is called a 'cycle'. If a directed graph has no cycles, it is referred to as an 'acyclic graph'." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed., 2016)

"In a computer science or mathematics context, a graph is a set of nodes and edges that connect the nodes." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

Undirected graph "A graph in which the edges have no direction" (Nell Dale et al, "Object-Oriented Data Structures Using Java" 4th Ed., 2016)

🔬Data Science: Heuristic (Definitions)

"Problem solving or analysis by experimental and especially trial-and-error methods." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"The mode of analysis in which the next step is determined by the results of the current step. Used for decision support processing." (Margaret Y Chu, "Blissful Data ", 2004)

"A type of analysis in which the next step is determined by the results of the current step of analysis." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling 2nd Ed.", 2005)

"The mode of analysis in which the next step is determined by the results of the current step of analysis. Used for decision-support processing." (William H Inmon, "Building the Data Warehouse", 2005)

"An algorithmic technique designed to solve a problem that ignores whether the solution can be proven to be correct." (Omar F El-Gayar et al, "Current Issues and Future Trends of Clinical Decision Support Systems", 2008)

"General advice that is usually efficient but sometimes cannot be used; also it is a validate function that adds a number to the state of the problem." (Attila Benko & Cecília S Lányi, "History of Artificial Intelligence", 2009) 

"These methods, found through discovery and observation, are known to produce incorrect or inexact results at times but likely to produce correct or sufficiently exact results when applied in commonly occurring conditions." (Vineet R Khare & Frank Z Wang, "Bio-Inspired Grid Resource Management", Handbook of Research on Grid Technologies and Utility Computing, 2009)

"Refers to a search and discovery approach, in which we proceed gradually, without trying to find out immediately whether the partial result, which is only adopted on a provisional basis, is true or false. This method is founded on a gradual approach to a given question, using provisional hypotheses and successive evaluations." (Humbert Lesca & Nicolas Lesca, "Weak Signals for Strategic Intelligence: Anticipation Tool for Managers", 2011)

"'Rules of thumb' and approximation methods for obtaining a goal, a high quality solution, or improved performance. It sacrifices completeness to increase efficiency, as some potential solutions would not be practicable or acceptable due to their 'rareness' or 'complexity'. This method may not always find the best solution, but it will find an acceptable solution within a reasonable timeframe for problems that will require almost infinite or longer than acceptable times to compute." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"An experience-based technique for solving problems that emphasizes personal knowledge and quick decision making." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

[heuristic process:] "An iterative process, where the next step of analysis depends on the results attained in the current level of analysis" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"An algorithm that gives a good solution for a problem but that doesn’t guarantee to give you the best solution possible." (Rod Stephens, "Beginning Software Engineering", 2015)

"Rules of thumb derived by experience, intuition, and simple logic." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"Problem-solving technique that yields a sub-optimal solution judged to be sufficient." (Karl Beecher, "Computational Thinking - A beginner's guide to problem-solving and programming", 2017)

"An algorithm to solve a problem simply and quickly with an approximate solution, as compared to a complex algorithm that provides a precise solution, but may take a prohibitively long time." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)

