10 May 2018

🔬Data Science: Support Vector Machines [SVM] (Definitions)

"A supervised machine learning classification approach with the objective to find the hyperplane maximizing the minimum distance between the plane and the training data points." (Xiaoyan Yu et al, "Automatic Syllabus Classification Using Support Vector Machines", 2009)

"Support vector machines [SVM] is a methodology used for classification and regression. SVMs select a small number of critical boundary instances called support vectors from each class and build a linear discriminant function that separates them as widely as possible." (Yorgos Goletsis et al, "Bankruptcy Prediction through Artificial Intelligence", 2009)

"SVM is a data mining method useful for classification problems. It uses training data and kernel functions to build a model that can appropriately predict the class of an unclassified observation." (Indranil Bose, "Data Mining in Tourism", 2009)

"A modeling technique that assigns points to classes based on the assignment of previous points, and then determines the gap dividing the classes where the gap is furthest from points in both classes." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A machine-learning technique that classifies objects. The method starts with a training set consisting of two classes of objects as input. The SVA computes a hyperplane, in a multidimensional space, that separates objects of the two classes. The dimension of the hyperspace is determined by the number of dimensions or attributes associated with the objects. Additional objects (i.e., test set objects) are assigned membership in one class or the other, depending on which side of the hyperplane they reside." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"A machine learning algorithm that works with labeled training data and outputs results to an optimal hyperplane. A hyperplane is a subspace of the dimension minus one (that is, a line in a plane)." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"A classification algorithm that finds the hyperplane dividing the training data into given classes. This division by the hyperplane is then used to classify the data further." (David Natingga, "Data Science Algorithms in a Week" 2nd Ed., 2018)

"Machine learning techniques that are used to make predictions of continuous variables and classifications of categorical variables based on patterns and relationships in a set of training data for which the values of predictors and outcomes for all cases are known." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"It is a supervised machine learning tool utilized for data analysis, regression, and classification." (Shradha Verma, "Deep Learning-Based Mobile Application for Plant Disease Diagnosis", 2019)

"It is a supervised learning algorithm in ML used for problems in both classification and regression. This uses a technique called the kernel trick to transform the data and then determines an optimal limit between the possible outputs, based on those transformations." (Mehmet A Cifci, "Optimizing WSNs for CPS Using Machine Learning Techniques", 2021)

"Support Vector Machines (SVM) are supervised machine learning algorithms used for classification and regression analysis. Employed in classification analysis, support vector machines can carry out text categorization, image classification, and handwriting recognition." (Accenture)

🔬Data Science: Cross-validation (Definitions)

"A method for assessing the accuracy of a regression or classification model. A data set is divided up into a series of test and training sets, and a model is built with each of the training set and is tested with the separate test set." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A method for assessing the accuracy of a regression or classification model." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2007)

"A statistical method derived from cross-classification which main objective is to detect the outlying point in a population set." (Tomasz Ciszkowski & Zbigniew Kotulski, "Secure Routing with Reputation in MANET", 2008)

"Process by which an original dataset d is divided into a training set t and a validation set v. The training set is used to produce an effort estimation model (if applicable), later used to predict effort for each of the projects in v, as if these projects were new projects for which effort was unknown. Accuracy statistics are then obtained and aggregated to provide an overall measure of prediction accuracy." (Emilia Mendes & Silvia Abrahão, "Web Development Effort Estimation: An Empirical Analysis", 2008)

"A method of estimating predictive error of inducers. Cross-validation procedure splits that dataset into k equal-sized pieces called folds. k predictive function are built, each tested on a distinct fold after being trained on the remaining folds." (Gilles Lebrun et al, EA Multi-Model Selection for SVM, 2009)

"Method to estimate the accuracy of a classifier system. In this approach, the dataset, D, is randomly split into K mutually exclusive subsets (folds) of equal size (D1, D2, …, Dk) and K classifiers are built. The i-th classifier is trained on the union of all Dj ¤ j¹i and tested on Di. The estimate accuracy is the overall number of correct classifications divided by the number of instances in the dataset." (M Paz S Lorente et al, "Ensemble of ANN for Traffic Sign Recognition" [in "Encyclopedia of Artificial Intelligence"], 2009)

"The process of assessing the predictive accuracy of a model in a test sample compared to its predictive accuracy in the learning or training sample that was used to make the model. Cross-validation is a primary way to assure that over learning does not take place in the final model, and thus that the model approximates reality as well as can be obtained from the data available." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"Validating a scoring procedure by applying it to another set of data." (Dougal Hutchison, "Automated Essay Scoring Systems", 2009)

"A method for evaluating the accuracy of a data mining model." (Microsoft, "SQL Server 2012 Glossary", 2012)

"Cross-validation is a method of splitting all of your data into two parts: training and validation. The training data is used to build the machine learning model, whereas the validation data is used to validate that the model is doing what is expected. This increases our ability to find and determine the underlying errors in a model." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"A technique used for validation and model selection. The data is randomly partitioned into K groups. The model is then trained K times, each time with one of the groups left out, on which it is evaluated." (Simon Rogers & Mark Girolami, "A First Course in Machine Learning", 2017)

"A model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set." (Adrian Carballal et al, "Approach to Minimize Bias on Aesthetic Image Datasets", 2019)

09 May 2018

🔬Data Science: Meta-Analysis (Definitions)

"A set of statistical procedures designed to accumulate experimental and correlational results across independent studies that address related sets of research questions." (Ying-Chieh Liu et al, "Meta-Analysis Research on Virtual Team Performance", 2008)

"A statistical technique in which the outcomes from multiple experimental comparisons are synthesized by evaluating effect sizes. Because the recommendations are based on multiple experiments, practitioners can have greater confidence in the results from an effective meta-analysis." (Ruth C Clark, "Building Expertise: Cognitive Methods for Training and Performance Improvement", 2008)

"Study characteristics can be thought of as the independent variable." (Ernest W Brewer, "Using Meta-Analysis as a Research Tool in Making Educational and Organizational Decisions", 2009)

"The exhaustive search process which comprises numerous and versatile algorithmic procedures to exploit the gene expression results by combining or further processing them with sophisticated statistical learning and data mining techniques coupled with annotated information concerning functional properties of these genes residing in large databases." (Aristotelis Chatziioannou & Panagiotis Moulos, "DNA Microarrays: Analysis and Interpretation", 2009)

"The statistical analysis of a group of relevantly similar experimental studies, in order to summarize their results considered as a whole." (Saul Fisher, "Cost-Effectiveness", 2009)

"A quantitative research review that applies statistical techniques to examine, standardize and combine the results of different empirical studies that investigate a set of related research hypotheses." (Olusola O Adesope & John C Nesbit, "A Systematic Review of Research on Collaborative Learning with Concept Maps", 2010)

"Analysis of a number of comparable studies with the aim to combine those studies in a statistically valid way to test hypotheses (about the effect of an intervention)." (Cor van Dijkum  & Laura Vegter, "A Client Perspective on E-Health: Illustrated with an Example from The Netherlands", 2010)

"A computation of average effect sizes among many experiments. Data based on a meta-analysis give us greater confidence in the results because they reflect many research studies." (Ruth C Clark & Richard E Mayer, "e-Learning and the Science of Instruction", 2011)

"Analysis of previously analyzed data relating to the same or similar biological phenomena or treatment studied across the same or similar technology platforms." (Padmalatha S Reddy et al, "Knowledge-Driven, Data-Assisted Integrative Pathway Analytics", 2011)

"A set of techniques for the quantitative analysis of results from two or more studies on the same or similar issues." (Geoff Cumming, "Understanding The New Statistics", 2013)

"A method of combining effect sizes from individual studies into a single composite effect size." (Jonathan van‘t Riet et al, "The Effects of Active Videogames on BMI among Young People: A Meta-Analysis", 2016)

"A procedure that allows the statistical averaging of results from independent studies of the same phenomena. Meta-analysis essentially combines studies on the same topic into a single large study, providing an index of how strongly the independent variable affected the dependent variable on an average in the set of studies." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A research design that combines and synthesize different types of data from multiple sources." (Mzoli Mncanca & Chinedu Okeke, "Early Exposure to Domestic Violence and Implications for Early Childhood Education Services", 2019)

"A quantitative, formal, epidemiological study design used to systematically assess the results of previous research to derive conclusions about that body of research." (Helena H Borba et al, "Challenges in Evidence-Based Practice Education: From Teaching Concepts Towards Decision-Making Learning", 2021)

08 May 2018

📦Data Migrations (DM): Facts, Principles and Practices

Data Migration
Data Migrations Series

Introduction

Ask a person who is familiar with cars how a car works – you’ll get an answer even if it doesn’t entirely reflect reality. Of course, the deeper one's knowledge about cars, the more elaborate or exact the answer. One doesn't have to be a mechanic to give an acceptable explanation, though in order to design, repair or build a car one needs extensive knowledge about a car’s inner workings.

Keeping the proportions, the same holds for the inner workings of Data Migrations (DM) – almost everybody in IT knows what a DM is, though to design or perform one you might need an expert.

The good news about DMs is that their inner workings are less complex than those of cars. Basically, a DM requires some understanding of data architecture, data modelling and data manipulation, plus some knowledge of the business data and processes. A data architect, a database developer, a data modeler or any other data specialist can approach such an endeavor. In theory, with some guidance, a person with knowledge of the business data and processes could also do the work. Even if DMs imply a certain complexity, they are not rocket science! In fact, there are tools that can be used to do most of the work, and there are general principles and best practices about the architecture, planning and execution that can help in the process.

Principles and Best Practices

It might be useful to explain the difference between principles and best practices, because they are more likely to lead you to success if you understand them and incorporate them into your solutions. Principles, as patterns of advice, are general or fundamental ideas, truths or values stated in a context-independent manner. Practices, on the other hand, are specific actions or applications of these principles, stated in a context-dependent way. The difference between them is relatively thin, and therefore they are easy to confuse, though by looking at their generality one can easily identify which is which.

For example, the “keep it simple, stupid” (aka KISS) principle became known in the 1960s; it states that a simple solution works better than a complex one, and that simplicity in design should therefore be a key goal. Even if somewhat pejorative, it’s a much simpler restatement of Occam’s razor – do something in the simplest manner possible, because simpler is usually better. To apply it one must understand what simplicity means, and how it can be translated into designs. According to Hans Hofmann, “the ability to simplify means to eliminate the unnecessary so that the necessary may speak”, or, in a quote attributed to Einstein, “everything should be made as simple as possible, but not simpler”. This is the range within which the best practices derived from KISS can be defined.

There are multiple practices that allow reducing the complexity of DM solutions: start with a Proof-of-Concept (PoC), start small and build incrementally, use off-the-shelf software, use the best tool for the purpose, use incremental data loads, split big data files into smaller ones, and so on. As can be seen all of them are direct actions that address specific aspects of the DM architecture or process.
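
As a small, hypothetical illustration of two of these practices – incremental loads and splitting big data files into smaller ones – here is a sketch using pandas (assumed available); the file name and chunk size are placeholders, not recommendations.

```python
# Minimal sketch: splitting a big data file into smaller, incrementally loadable pieces
# (pandas assumed installed); 'legacy_extract.csv' and the chunk size are hypothetical.
import pandas as pd

chunk_size = 50_000
for i, chunk in enumerate(pd.read_csv("legacy_extract.csv", chunksize=chunk_size)):
    # Each chunk can be cleaned/transformed and loaded separately,
    # so a failure affects only one slice of the data.
    chunk.to_csv(f"legacy_extract_part_{i:03d}.csv", index=False)
```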


Data Migration Truths

When looking at principles and best practices they seem to be further rooted in some basic truths or facts common to most DMs. When considered together, they offer a broader view and understanding of what a DM is about.  Here are some of the most important facts about DMs:

DM as a project:

  • A DM is a subproject with specific characteristics
  • A DM is typically a one-time activity before Go live
  • A DM’s success is entirely dependent on an organization’s capability of running projects
  • Responsibilities are not always clear
  • Requirements change as the project progresses
  • Resources aren't available when needed
  • Parallel migrations require a common strategy
  • A successful DM can be used as a recipe for further migrations
  • A DM's success is a matter of perception
  • The volume of work increases toward the end


DM Architecture

  • A DM is more complex and messier than initially thought
  • A DM needs to be repeatable
  • A DM requires experts from various areas
  • There are several architectures to be considered
  • The migration approach is dependent on the future architecture
  • Management Systems have their own requirements
  • No matter how detailed the planning, something is always forgotten
  • Knowledge of the source and target systems isn't always available
  • DMs are too big to be performed manually
  • Some tasks are easier to perform manually
  • Some steps in the migration need to be rerun
  • It takes several iterations before arriving at the final solution
  • Several data regulations apply
  • Fall-back is always an alternative
  • IT supports the migration project/processes
  • Technologies are enablers and not guarantees for success
  • Tools address only a set of the needed functionality
  • Troubleshooting needs to be performed before, during and after migrations
  • Failure/nonconformities need to be documented
  • A DM is an opportunity to improve the quality of the data
  • A DM needs to be transparent for the business


DM implications for the Business:

  • A DM requires downtime for the systems involved
  • The business has several expectations/assumptions
  • Some expectations are considered self-evident
  • The initial assumptions are almost always wrong
  • A DM's success/failure depends on business' perception
  • Business' knowledge about the data and processes is relative
  • The business is involved for the whole project’s duration
  • The business needs continuous communication
  • Data migration is mainly a business rather than a technical challenge
  • Business’ expertise in every data area is needed
  • DM and Data Quality (DQ) need to be part of a Data Management strategy
  • Old legacy system data have further value for the business
  • Reporting requirements come with their own data requirements


DM and Data Quality:

  • Not all required data are available
  • Data don't match the expectations
  • Quality of the data needs to be judged based on the target system
  • DQ is usually performed as a separate project with different timelines
  • Data don't have the same importance for the business
  • Improving DQ is a collective effort
  • Data cleaning needs to be done at the source (when possible)
  • Data cleaning is a business activity
  • The business is responsible for the data
  • Quality improvement is governed by the 80-20 rule
  • No organization is willing to pay for perfect data quality
  • If it can’t be counted, it isn’t visible

More to come, stay tuned…

🔬Data Science: Cluster Analysis (Definitions)

"Generally, cluster analysis, or clustering, comprises a wide array of mathematical methods and algorithms for grouping similar items in a sample to create classifications and hierarchies through statistical manipulation of given measures of samples from the population being clustered. (Hannu Kivijärvi et al, "A Support System for the Strategic Scenario Process", 2008) 

"Defining groups based on the 'degree' to which an item belongs in a category. The degree may be determined by indicating a percentage amount." (Mary J Lenard & Pervaiz Alam, "Application of Fuzzy Logic to Fraud Detection", 2009)

"A technique that identifies homogenous subgroups or clusters of subjects or study objects." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A statistical technique for finding natural groupings in data; it can also be used to assign new cases to groupings or categories." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"Techniques for organizing data into groups of similar cases." (Meta S Brown, "Data Mining For Dummies", 2014)

"A statistical technique whereby data or objects are classified into groups (clusters) that are similar to one another but different from data or objects in other clusters." (Soraya Sedkaoui, "Big Data Analytics for Entrepreneurial Success", 2018)

"Clustering or cluster analysis is a set of techniques of multivariate data analysis aimed at selecting and grouping homogeneous elements in a data set. Clustering techniques are based on measures relating to the similarity between the elements. In many approaches this similarity, or better, dissimilarity, is designed in terms of distance in a multidimensional space. Clustering algorithms group items on the basis of their mutual distance, and then the belonging to a set or not depends on how the element under consideration is distant from the collection itself." (Crescenzio Gallo, "Building Gene Networks by Analyzing Gene Expression Profiles", 2018)

"A type of an unsupervised learning that aims to partition a set of objects in such a way that objects in the same group (called a cluster) are more similar, whereas characteristics of objects assigned into different clusters are quite distinct." (Timofei Bogomolov et al, "Identifying Patterns in Fresh Produce Purchases: The Application of Machine Learning Techniques", 2020)

"Cluster analysis is the process of identifying objects that are similar to each other and cluster them in order to understand the differences as well as the similarities within the data." (Analytics Insight)

🔬Data Science: Simulation Model (Definitions)

"A 'what-if' model that attempts to simulate the effects of alternative management policies and assumptions about the firm's external environment. It is basically a tool for management's laboratory." (Jae K Shim & Joel G Siegel, "Budgeting Basics and Beyond", 2008)

"Simulation models are formal representations of a portion of reality. Simulation models allow managers to share and test assumptions about problem causes and solutions." (Luis F Luna-Reyes, "System Dynamics to Understand Public Information Technology", 2008)

"A simplified, computer, simulation-based construction (model) of some real world phenomenon (or the problem task)." (Hassan Qudrat-Ullah, "System Dynamics Based Technology for Decision Support", 2009)

"A model that shows the expected operation of a system based solely on the model." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"An analytical technique that often involves running models repeatedly using a variety of inputs to determine the upper and lower bounds of possible outcomes. This simulation process is also sometimes used to identify the likely distribution of outputs given a series of assumptions around how the inputs are distributed." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"A representation of a system that can be used to mimic the processes of the system under varying circumstances. It is usually operated subject to stochastic disturbances." (Özgür Yalçınkaya, "A General Simulation Modelling Framework for Train Timetabling Problem", 2016)

"A model that represents an actual procedure over time." (Rania Tegou, "Excess Inventories and Stock Out Events Through Advanced Demand Analysis and Emergency Deliveries",  2018)

"technique that created a detailed model to predict the behavior of CI/service" (ITIL)

07 May 2018

🔬Data Science: Fuzzy Set (Definitions)

"A fuzzy set is a class of objects with a continuum of grades of membership. Such a set is characterized by a membership (characteristic) function which assigns to each object a grade of membership ranging between zero and one. The notions of inclusion, union, intersection, complement, relation, convexity, etc., are extended to such sets, and various properties of these notions in the context of fuzzy sets are established. In particular, a separation theorem for convex fuzzy sets is proved without requiring that the fuzzy sets be disjoint." (Lotfi A Zadeh, "Fuzzy Sets", 1965)

"A fuzzy set can be defined mathematically by assigning to each possible individual in the universe of discourse a value representing its grade of membership in the fuzzy set. This grade corresponds to the degree to which that individual is similar or compatible with the concept represented by the fuzzy set. Thus, individuals may belong in the fuzzy act to a greater or lesser degree as indicated by a larger or smaller membership grade. As already mentioned, these membership grades are very often represented by real-number values ranging in the closed interval between 0 and 1." (George J Klir & Bo Yuan, "Fuzzy Sets and Fuzzy Logic: Theory and Applications", 1995)

"A set of items whose degree of membership in the set may range from 0 to l." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

"A set whose members belong to it to some degree. In contrast, a standard or nonfuzzy set contains its members all or none. The set of even numbers has no fuzzy members." (Guido Deboeck & Teuvo Kohonen (Eds), "Visual Explorations in Finance with Self-Organizing Maps" 2nd Ed., 2000)

"A fuzzy set is any set that allows its members to have different grades of membership (membership function) in the interval [0,1]. A numerical value between 0 and 1 that represents the degree to which an element belongs to a particular set, also referred to as membership value." (Harish Garg,  "Predicting Uncertain Behavior and Performance Analysis of the Pulping System in a Paper Industry using PSO and Fuzzy Methodology", 2014)

"A set whose elements have degrees of membership, as opposed to a classical set." (Michael Mutingi et al, Fuzzy System Dynamics of Manpower Systems, 2014)

"Any set that allows its members to have different grades of membership (membership function) in the interval [0,1]. A numerical value between 0 and 1 that represents the degree to which an element belongs to a particular set, also referred to as membership value." (Harish Garg, "A Hybrid GA-GSA Algorithm for Optimizing the Performance of an Industrial System by Utilizing Uncertain Data", 2015)

"It is a set of elements which have no strict boundaries." (Alexander P Ryjov & Igor F Mikhalevich, "Hybrid Intelligence Framework for Improvement of Information Security of Critical Infrastructures", 2021)

06 May 2018

🔬Data Science: Precision (Definitions)

"Precision is the ‘spread’ or variability of repeated measures of the same value." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"Defines the variation in repeated measurements of the same item. There are two major ways to measure precision - repeatability and reproducibility." (Lynne Hambleton, "Treasure Chest of Six Sigma Growth Methods, Tools, and Best Practices", 2007)

"An inherent quality characteristic that is a measure of an attribute’s having the right level of granularity in the data values." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Largest likely estimation error, measured by MOE." (Geoff Cumming, "Understanding The New Statistics", 2013)

"The level of detail included in information, such as the number of decimal places in a number, the number of pixels/inch in an image (resolution), or other measure reflecting how closely information is observed. Not to be confused with Accuracy defined elsewhere in this glossary." (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)

"Within the quality management system, precision is a measure of exactness. |" (For Dummies, "PMP Certification All-in-One For Dummies, 2nd Ed.", 2013)

"Precision easures the accuracy of a result set, that is, how many of the retrieved resources for a query are relevant." (Robert J Glushko, "The Discipline of Organizing: Professional Edition, 4th Ed", 2016)


🔬Data Science: Variance (Definitions)

"The mean squared deviation of the measured response values from their average value." (Clyde M Creveling, "Six Sigma for Technical Processes: An Overview for R Executives, Technical Leaders, and Engineering Managers", 2006)

"The variance reflects the amount of variation in a set of observations." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"Describes dispersion about the data set’s mean. The variance is the square of the standard deviation. Conversely, the standard deviation is the square root of the variance." (E C Nelson & Stephen L Nelson, "Excel Data Analysis For Dummies ", 2015)

"Summary statistic that indicates the degree of variability among participants for a given variable. The variance is essentially the average squared deviation from the mean and is the square of the standard deviation." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A statistical measure of how spread (or varying) the values of a variable are around a central value such as the mean." (Jonathan Ferrar et al, "The Power of People", 2017)

🔬Data Science: Swarm Intelligence (Definitions)

"Swarm systems generate novelty for three reasons: (1) They are 'sensitive to initial conditions' - a scientific shorthand for saying that the size of the effect is not proportional to the size of the cause - so they can make a surprising mountain out of a molehill. (2) They hide countless novel possibilities in the exponential combinations of many interlinked individuals. (3) They don’t reckon individuals, so therefore individual variation and imperfection can be allowed. In swarm systems with heritability, individual variation and imperfection will lead to perpetual novelty, or what we call evolution." (Kevin Kelly, "Out of Control: The New Biology of Machines, Social Systems and the Economic World", 1995)

"Dumb parts, properly connected into a swarm, yield smart results." (Kevin Kelly, "New Rules for the New Economy", 1999)

"It is, however, fair to say that very few applications of swarm intelligence have been developed. One of the main reasons for this relative lack of success resides in the fact that swarm-intelligent systems are hard to 'program', because the paths to problem solving are not predefined but emergent in these systems and result from interactions among individuals and between individuals and their environment as much as from the behaviors of the individuals themselves. Therefore, using a swarm-intelligent system to solve a problem requires a thorough knowledge not only of what individual behaviors must be implemented but also of what interactions are needed to produce such or such global behavior." (Eric Bonabeau et al, "Swarm Intelligence: From Natural to Artificial Systems", 1999)

"Just what valuable insights do ants, bees, and other social insects hold? Consider termites. Individually, they have meager intelligence. And they work with no supervision. Yet collectively they build mounds that are engineering marvels, able to maintain ambient temperature and comfortable levels of oxygen and carbon dioxide even as the nest grows. Indeed, for social insects teamwork is largely self-organized, coordinated primarily through the interactions of individual colony members. Together they can solve difficult problems (like choosing the shortest route to a food source from myriad possible pathways) even though each interaction might be very simple (one ant merely following the trail left by another). The collective behavior that emerges from a group of social insects has been dubbed 'swarm intelligence'." (Eric Bonabeau & Christopher Meyer, Swarm Intelligence: A Whole New Way to Think About Business, Harvard Business Review, 2001)

"[…] swarm intelligence is becoming a valuable tool for optimizing the operations of various businesses. Whether similar gains will be made in helping companies better organize themselves and develop more effective strategies remains to be seen. At the very least, though, the field provides a fresh new framework for solving such problems, and it questions the wisdom of certain assumptions regarding the need for employee supervision through command-and-control management. In the future, some companies could build their entire businesses from the ground up using the principles of swarm intelligence, integrating the approach throughout their operations, organization, and strategy. The result: the ultimate self-organizing enterprise that could adapt quickly - and instinctively - to fast-changing markets." (Eric Bonabeau & Christopher Meyer, "Swarm Intelligence: A Whole New Way to Think About Business", Harvard Business Review, 2001)

"Swarm Intelligence can be defined more precisely as: Any attempt to design algorithms or distributed problem-solving methods inspired by the collective behavior of the social insect colonies or other animal societies. The main properties of such systems are flexibility, robustness, decentralization and self-organization." (Ajith Abraham et al, "Swarm Intelligence in Data Mining", 2006)

"Swarm intelligence can be effective when applied to highly complicated problems with many nonlinear factors, although it is often less effective than the genetic algorithm approach discussed later in this chapter. Swarm intelligence is related to swarm optimization […]. As with swarm intelligence, there is some evidence that at least some of the time swarm optimization can produce solutions that are more robust than genetic algorithms. Robustness here is defined as a solution’s resistance to performance degradation when the underlying variables are changed." (Michael J North & Charles M Macal, "Managing Business Complexity: Discovering Strategic Solutions with Agent-Based Modeling and Simulation", 2007)

[swarm intelligence] "Refers to a class of algorithms inspired by the collective behaviour of insect swarms, ant colonies, the flocking behaviour of some bird species, or the herding behaviour of some mammals, such that the behaviour of the whole can be considered as exhibiting a rudimentary form of 'intelligence'." (John Fulcher, "Intelligent Information Systems", 2009)

"The property of a system whereby the collective behaviors of unsophisticated agents interacting locally with their environment cause coherent functional global patterns to emerge." (M L Gavrilova, "Adaptive Algorithms for Intelligent Geometric Computing", 2009) 

[swarm intelligence] "Is a discipline that deals with natural and artificial systems composed of many individuals that coordinate using decentralized control and self-organization. In particular, SI focuses on the collective behaviors that result from the local interactions of the individuals with each other and with their environment." (Elina Pacini et al, "Schedulers Based on Ant Colony Optimization for Parameter Sweep Experiments in Distributed Environments", 2013). 

"Swarm intelligence (SI) is a branch of computational intelligence that discusses the collective behavior emerging within self-organizing societies of agents. SI was inspired by the observation of the collective behavior in societies in nature such as the movement of birds and fish. The collective behavior of such ecosystems, and their artificial counterpart of SI, is not encoded within the set of rules that determines the movement of each isolated agent, but it emerges through the interaction of multiple agents." (Maximos A Kaliakatsos-Papakostas et al, "Intelligent Music Composition", 2013)

"Collective intelligence of societies of biological (social animals) or artificial (robots, computer agents) individuals. In artificial intelligence, it gave rise to a computational paradigm based on decentralisation, self-organisation, local interactions, and collective emergent behaviours." (D T Pham & M Castellani, "The Bees Algorithm as a Biologically Inspired Optimisation Method", 2015)

"It is the field of artificial intelligence in which the population is in the form of agents which search in a parallel fashion with multiple initialization points. The swarm intelligence-based algorithms mimic the physical and natural processes for mathematical modeling of the optimization algorithm. They have the properties of information interchange and non-centralized control structure." (Sajad A Rather & P Shanthi Bala, "Analysis of Gravitation-Based Optimization Algorithms for Clustering and Classification", 2020)

"It [swarm intelligence] is the discipline dealing with natural and artificial systems consisting of many individuals who coordinate through decentralized monitoring and self-organization." (Mehmet A Cifci, "Optimizing WSNs for CPS Using Machine Learning Techniques", 2021)

Resources:
More quotes on "Swarm Intelligence" at the-web-of-knowledge.blogspot.com.

05 May 2018

🔬Data Science: Clustering (Definitions)

"Grouping of similar patterns together. In this text the term 'clustering' is used only for unsupervised learning problems in which the desired groupings are not known in advance." (Laurene V Fausett, "Fundamentals of Neural Networks: Architectures, Algorithms, and Applications", 1994)

"The process of grouping similar input patterns together using an unsupervised training algorithm." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

"Clustering attempts to identify groups of observations with similar characteristics." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"The process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects, which are 'similar' between them and are 'dissimilar' to the objects belonging to other clusters." (Juan R González et al, "Nature-Inspired Cooperative Strategies for Optimization", 2008)

"Grouping the nodes of an ad hoc network such that each group is a self-organized entity having a cluster-head which is responsible for formation and management of its cluster." (Prayag Narula, "Evolutionary Computing Approach for Ad-Hoc Networks", 2009)

"The process of assigning individual data items into groups (called clusters) so that items from the same cluster are more similar to each other than items from different clusters. Often similarity is assessed according to a distance measure." (Alfredo Vellido & Iván Olie, "Clustering and Visualization of Multivariate Time Series", 2010)

"Verb. To output a smaller data set based on grouping criteria of common attributes." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The process of partitioning the data attributes of an entity or table into subsets or clusters of similar attributes, based on subject matter or characteristic (domain)." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A data mining technique that analyzes data to group records together according to their location within the multidimensional attribute space." (SQL Server 2012 Glossary, "Microsoft", 2012)

"Clustering aims to partition data into groups called clusters. Clustering is usually unsupervised in the sense that the training data is not labeled. Some clustering algorithms require a guess for the number of clusters, while other algorithms don't." (Ivan Idris, "Python Data Analysis", 2014)

"Form of data analysis that groups observations to clusters. Similar observations are grouped in the same cluster, whereas dissimilar observations are grouped in different clusters. As opposed to classification, there is not a class attribute and no predefined classes exist." (Efstathios Kirkos, "Composite Classifiers for Bankruptcy Prediction", 2014)

"Organization of data in some semantically meaningful way such that each cluster contains related data while the unrelated data are assigned to different clusters. The clusters may not be predefined." (Sanjiv K Bhatia & Jitender S Deogun, "Data Mining Tools: Association Rules", 2014)

"Techniques for organizing data into groups of similar cases." (Meta S Brown, "Data Mining For Dummies", 2014)

[cluster analysis:] "A technique that identifies homogenous subgroups or clusters of subjects or study objects." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"Clustering is a classification technique where similar kinds of objects are grouped together. The similarity between the objects maybe determined in different ways depending upon the use case. Therefore, clustering in measurement space may be an indicator of similarity of image regions, and may be used for segmentation purposes." (Shiwangi Chhawchharia, "Improved Lymphocyte Image Segmentation Using Near Sets for ALL Detection", 2016)

"Clustering techniques share the goal of creating meaningful categories from a collection of items whose properties are hard to directly perceive and evaluate, which implies that category membership cannot easily be reduced to specific property tests and instead must be based on similarity. The end result of clustering is a statistically optimal set of categories in which the similarity of all the items within a category is larger than the similarity of items that belong to different categories." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed., 2016)

[cluster analysis:]"A statistical technique for finding natural groupings in data; it can also be used to assign new cases to groupings or categories." (Jonathan Ferrar et al, "The Power of People", 2017)

"Clustering or cluster analysis is a set of techniques of multivariate data analysis aimed at selecting and grouping homogeneous elements in a data set. Clustering techniques are based on measures relating to the similarity between the elements. In many approaches this similarity, or better, dissimilarity, is designed in terms of distance in a multidimensional space. Clustering algorithms group items on the basis of their mutual distance, and then the belonging to a set or not depends on how the element under consideration is distant from the collection itself." (Crescenzio Gallo, "Building Gene Networks by Analyzing Gene Expression Profiles", 2018)

"Unsupervised learning or clustering is a way of discovering hidden structure in unlabeled data. Clustering algorithms aim to discover latent patterns in unlabeled data using features to organize instances into meaningfully dissimilar groups." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"The term clustering refers to the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters." (Satyadhyan Chickerur et al, "Forecasting the Demand of Agricultural Crops/Commodity Using Business Intelligence Framework", 2019)

"In the machine learning context, clustering is the task of grouping examples into related groups. This is generally an unsupervised task, that is, the algorithm does not use preexisting labels, though there do exist some supervised clustering algorithms." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"A cluster is a group of data objects which have similarities among them. It's a group of the same or similar elements gathered or occurring closely together." (Hari K Kondaveeti et al, "Deep Learning Applications in Agriculture: The Role of Deep Learning in Smart Agriculture", 2021)

"Clustering describes an unsupervised machine learning technique for identifying structures among unstructured data. Clustering algorithms group sets of similar objects into clusters, and are widely used in areas including image analysis, information retrieval, and bioinformatics." (Accenture)

"Describes an unsupervised machine learning technique for identifying structures among unstructured data. Clustering algorithms group sets of similar objects into clusters, and are widely used in areas including image analysis, information retrieval, and bioinformatics." (Accenture)

"The process of identifying objects that are similar to each other and cluster them in order to understand the differences as well as the similarities within the data." (Analytics Insight)

🔬Data Science: Classification (Definitions)

"Classification is the process of arranging data into sequences and groups according to their common characteristics, or separating them into different but related parts." (Horace Secrist, "An Introduction to Statistical Methods", 1917)

"A classification is a scheme for breaking a category into a set of parts, called classes, according to some precisely defined differing characteristics possessed by all the elements of the category." (Alva M Tuttle, "Elementary Business and Economic Statistics", 1957)

"The process of learning to distinguish and discriminate between different input patterns using a supervised training algorithm." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

"1.Generally, a set of discrete, exhaustive, and mutually exclusive observations that can be assigned to one or more variables to be measured in the collation and/or presentation of data. 2.In data modeling, the arrangement of entities into supertypes and subtypes. 3.In object-oriented design, the arrangement of objects into classes, and the assignment of objects to these categories." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Form of data analysis that models the relationships between a number of variables and a target feature. The target feature contains nominal values that indicate the class to which each observation belongs." (Efstathios Kirkos, "Composite Classifiers for Bankruptcy Prediction", 2014)

"Systematic identification and arrangement of business activities and/or records into categories according to logically structured conventions, methods, and procedural rules represented in a classification system. A coding of content items as members of a group for the purposes of cataloging them or associating them with a taxonomy." (Robert F Smallwood, "Information Governance: Concepts, Strategies, and Best Practices", 2014)

"Techniques for organizing data into groups associated with a particular outcome, such as the likelihood to purchase a product or earn a college degree." (Meta S Brown, "Data Mining For Dummies", 2014)

"The systematic assignment of resources to a system of intentional categories, often institutional ones. Classification is applied categorization - the assignment of resources to a system of categories, called classes, using a predetermined set of principles." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed., 2016)

"A systematic arrangement of objects into groups or categories according to a set of established criteria. Data and resources can be assigned a level of sensitivity as they are being created, amended, enhanced, stored, or transmitted. The classification level then determines the extent to which the resource needs to be controlled and secured, and is indicative of its value in terms of information assets." (Shon Harris & Fernando Maymi, "CISSP All-in-One Exam Guide" 8th Ed., 2018)

"In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known." (Soraya Sedkaoui, "Big Data Analytics for Entrepreneurial Success", 2018)

"Systematic identification and arrangement of business activities and/or records into categories according to logically structured conventions, methods, and procedural rules represented in a classification system. A coding of content items as members of a group for the purposes of cataloging them or associating them with a taxonomy." (Robert F Smallwood, "Information Governance for Healthcare Professionals", 2018)

"It is task of classifying the data into predefined number of classes. It is a supervised approach. The tagged data is used to create classification model that will be used for classification on unknown data." (Siddhartha Kumar Arjaria & Abhishek S Rathore, "Heart Disease Diagnosis: A Machine Learning Approach", 2019)

"In a machine learning context, classification is the task of assigning classes to examples. The simplest form is the binary classification task where each example can have one of two classes. The binary classification task is a special case of the multiclass classification task where each example can have one of a fixed set of classes. There is also the multilabel classification task where each example can have zero or more labels from a fixed set of labels." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"The act of assigning a category to something" (ITIL)

29 April 2018

🔬Data Science: Data Standardization (Definitions)

"The process of reaching agreement on common data definitions, formats, representation and structures of all data layers and elements." (United Nations, "Handbook on Geographic Information Systems and Digital Mapping", Studies in Methods No. 79, 2000)

[value standardization:] "Refers to the establishment and adherence of data to standard formatting practices, ensuring a consistent interpretation of data values." (Evan Levy & Jill Dyché, "Customer Data Integration", 2006)

"Converting data into standard formats to facilitate parsing and thus matching, linking, and de-duplication. Examples include: “Avenue” as “Ave.” in addresses; “Corporation” as “Corp.” in business names; and variations of a specific company name as one version." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"Normalizes data values to meet format and semantic definitions. For example, data standardization of address information may ensure that an address includes all of the required pieces of information and normalize abbreviations (for example Ave. for Avenue)." (Martin Oberhofer et al, "Enterprise Master Data Management", 2008)

"Using rules to conform data that is similar into a standard format or structure. Example: taking similar data, which originates in a variety of formats, and transforming it into a single, clearly defined format." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"a process in information systems where data values for a data element are transformed to a consistent representation." (Meredith Zozus, "The Data Book: Collection and Management of Research Data", 2017)

"Data standardization is the process of converting data to a common format to enable users to process and analyze it." (Sisense) [source]

"In the context of data analysis and data mining: Where “V” represents the value of the variable in the original datasets: Transformation of data to have zero mean and unit variance. Techniques used include: (a) Data normalization; (b) z-score scaling; (c) Dividing each value by the range: recalculates each variable as V /(max V – min V). In this case, the means, variances, and ranges of the variables are still different, but at least the ranges are likely to be more similar; and, (d) Dividing each value by the standard deviation. This method produces a set of transformed variables with variances of 1, but different means and ranges." (CODATA)

27 April 2018

🔬Data Science: Validity (Definitions)

"An argument that explains the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of decisions made from an assessment." (Asao B Inoue, "The Technology of Writing Assessment and Racial Validity", 2009)

[external *:] "The extent to which the results obtained can be generalized to other individuals and/or contexts not studied." (Joan Hawthorne et al, "Method Development for Assessing a Diversity Goal", 2009)

[external *:] "A study has external validity when its results are generalizable to the target population of interest. Formally, external validity means that the causal effect based on the study population equals the causal effect in the target population. In counterfactual terms, external validity requires that the study population be exchangeable with the target population." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

[internal *:] "A study has internal validity when it provides an unbiased estimate of the causal effect of interest. Formally, internal validity means that the empirical effect from the study is equal to the causal effect in the study population." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

"Construct validity is a term developed by psychometricians to describe the ability of a variable to represent accurately an underlying characteristic of interest." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

[operational validity:] "is defined as a model result behavior has enough correctness for a model intended aim over the area of system intended applicability." (Sattar J Aboud et al, "Verification and Validation of Simulation Models", 2010)

"Validity is the ability of the study to produce correct results. There are various specific types of validity (see internal validity, external validity, construct validity). Threats to validity include primarily what we have termed bias, but encompass a wider range of methodological problems, including random error and lack of construct validity." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

[internal validity:] "Accuracy of the research study in determining the relationship between independent and the dependent variables. Internal validity can be assured only if all potential confounding variables have been properly controlled." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

[external *:] "Extent to which the results of a study accurately indicate the true nature of a relationship between variables in the real world. If a study has external validity, the results are said to be generalisable to the real world." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"The degree to which inferences made from data are appropriate to the context being examined. A variety of evidence can be used to support interpretation of scores." (Anne H Cash, "A Call for Mixed Methods in Evaluating Teacher Preparation Programs", 2016)

[construct *:] "Validity of a theory is also known as construct validity. Most theories in science present broad conceptual explanations of relationship between variables and make many different predictions about the relationships between particular variables in certain situations. Construct validity is established by verifying the accuracy of each possible prediction that might be made from the theory. Because the number of predictions is usually infinite, construct validity can never be fully established. However, the more independent predictions for the theory verified as accurate, the stronger the construct validity of the theory." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

23 April 2018

🔭Data Science: Independence (Just the Quotes)

"To apply the category of cause and effect means to find out which parts of nature stand in this relation. Similarly, to apply the gestalt category means to find out which parts of nature belong as parts to functional wholes, to discover their position in these wholes, their degree of relative independence, and the articulation of larger wholes into sub-wholes." (Kurt Koffka, 1931)

"If significance tests are required for still larger samples, graphical accuracy is insufficient, and arithmetical methods are advised. A word to the wise is in order here, however. Almost never does it make sense to use exact binomial significance tests on such data - for the inevitable small deviations from the mathematical model of independence and constant split have piled up to such an extent that the binomial variability is deeply buried and unnoticeable. Graphical treatment of such large samples may still be worthwhile because it brings the results more vividly to the eye." (Frederick Mosteller & John W Tukey, "The Uses and Usefulness of Binomial Probability Paper?", Journal of the American Statistical Association 44, 1949)

"A satisfactory prediction of the sequential properties of learning data from a single experiment is by no means a final test of a model. Numerous other criteria - and some more demanding - can be specified. For example, a model with specific numerical parameter values should be invariant to changes in independent variables that explicitly enter in the model." (Robert R Bush & Frederick Mosteller,"A Comparison of Eight Models?", Studies in Mathematical Learning Theory, 1959)

"[A] sequence is random if it has every property that is shared by all infinite sequences of independent samples of random variables from the uniform distribution." (Joel N Franklin, 1962)

"So we pour in data from the past to fuel the decision-making mechanisms created by our models, be they linear or nonlinear. But therein lies the logician's trap: past data from real life constitute a sequence of events rather than a set of independent observations, which is what the laws of probability demand. [...] It is in those outliers and imperfections that the wildness lurks." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"In error analysis the so-called 'chi-squared' is a measure of the agreement between the uncorrelated internal and the external uncertainties of a measured functional relation. The simplest such relation would be time independence. Theory of the chi-squared requires that the uncertainties be normally distributed. Nevertheless, it was found that the test can be applied to most probability distributions encountered in practice." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is important that uncertainty components that are independent of each other are added quadratically. This is also true for correlated uncertainty components, provided they are independent of each other, i.e., as long as there is no correlation between the components." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"The fact that the same uncertainty (e.g., scale uncertainty) is uncorrelated if we are dealing with only one measurement, but correlated (i.e., systematic) if we look at more than one measurement using the same instrument shows that both types of uncertainties are of the same nature. Of course, an uncertainty keeps its characteristics (e.g., Poisson distributed), independent of the fact whether it occurs only once or more often." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"To fulfill the requirements of the theory underlying uncertainties, variables with random uncertainties must be independent of each other and identically distributed. In the limiting case of an infinite number of such variables, these are called normally distributed. However, one usually speaks of normally distributed variables even if their number is finite." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Bayesian networks provide a more flexible representation for encoding the conditional independence assumptions between the features in a domain. Ideally, the topology of a network should reflect the causal relationships between the entities in a domain. Properly constructed Bayesian networks are relatively powerful models that can capture the interactions between descriptive features in determining a prediction." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015) 

"Bayesian networks use a graph-based representation to encode the structural relationships - such as direct influence and conditional independence - between subsets of features in a domain. Consequently, a Bayesian network representation is generally more compact than a full joint distribution (because it can encode conditional independence relationships), yet it is not forced to assert a global conditional independence between all descriptive features. As such, Bayesian network models are an intermediary between full joint distributions and naive Bayes models and offer a useful compromise between model compactness and predictive accuracy." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"The main differences between Bayesian networks and causal diagrams lie in how they are constructed and the uses to which they are put. A Bayesian network is literally nothing more than a compact representation of a huge probability table. The arrows mean only that the probabilities of child nodes are related to the values of parent nodes by a certain formula (the conditional probability tables) and that this relation is sufficient. That is, knowing additional ancestors of the child will not change the formula. Likewise, a missing arrow between any two nodes means that they are independent, once we know the values of their parents. [...] If, however, the same diagram has been constructed as a causal diagram, then both the thinking that goes into the construction and the interpretation of the final diagram change." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)
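
As a small practical illustration of testing independence between two categorical variables, here is a chi-squared contingency-test sketch, assuming SciPy is available; the counts are hypothetical.

```python
# Minimal sketch of testing independence between two categorical variables with a
# chi-squared contingency test (SciPy assumed installed); the counts are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: group A / group B; columns: outcome yes / no
table = np.array([[30, 70],
                  [45, 55]])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}")
# A small p-value would argue against independence of group and outcome.
```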

