16 May 2018

🔬Data Science: Supervised Learning (Definitions)

"A training paradigm where the neural network is presented with an input pattern and a desired output pattern. The desired output is compared with the neural network output, and the error information is used to adjust the connection weights." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

"Learning in which a system is trained by using a teacher to show the system the desired response to an input stimulus, usually in the form of a desired output." (Guido J Deboeck and Teuvo Kohonen, "Visual explorations in finance with self-organizing maps", 2000)

"learning with a teacher; learning scheme in which the average expected difference between wanted output for training samples, and the true output, respectively, is decreased." (Teuvo Kohonen, "Self-Organizing Maps" 3rd Ed., 2001)

"Supervised learning, or learning from examples, refers to systems that are trained instead of programmed with a set of examples, that is, a set of input-output pairs." (Tomaso Poggio & Steve Smale, "The Mathematics of Learning: Dealing with Data", Notices of the AMS, 2003)

"Methods, which use a response variable to guide the analysis." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A learning method in which there are two distinct phases to the operation. In the first phase each possible solution to a problem is assessed based on the input signal that is propagated through the system producing an output response. The actual response produced is then compared with a desired response, generating error signals that are then used as a guide to solve the given problems using supervised learning algorithms." (Masoud Mohammadian, "Supervised Learning of Fuzzy Logic Systems", 2009)

"The set of learning algorithms in which the samples in the training dataset are all labelled." (Jun Jiang & Horace H S Ip, "Active Learning with SVM", Encyclopedia of Artificial Intelligence, 2009) 

"type of learning where the objective is to learn a function that associates a desired output (‘label’) to each input pattern. Supervised learning techniques require a training dataset of examples with their respective desired outputs. Supervised learning is traditionally divided into regression (the desired output is a continuous variable) and classification (the desired output is a class label)." (Óscar Pérez & Manuel Sánchez-Montañés, "Class Prediction in Test Sets with Shifted Distributions", 2009)

"Supervised learning is a type of machine learning that requires labeled training data." (Ivan Idris, "Python Data Analysis", 2014)

"Supervised learning refers to an approach that teaches the system to detect or match patterns in data based on examples it encounters during training with sample data." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"The knowledge is obtained through a training which includes a data set called the training sample which is structured according to the knowledge base supported by human experts as physicians in medical context, and databases. It is assumed that the user knows beforehand the classes and the instances of each class." (Nuno Pombo et al, "Machine Learning Approaches to Automated Medical Decision Support Systems", 2015)

"In supervised learning, a machine learning program is trained with sample items or documents that are labeled by category, and the program learns to assign new items to the correct categories." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed., 2016)

"A form of machine learning in which the goal is to learn a function that maps from a set of input attribute values for an instance to an estimate of the missing value for the target attribute of the same instance." (John D Kelleher & Brendan Tierney, "Data science", 2018)

"A supervised learning algorithm applies a known set of input data and drives a model to produce reasonable predictions for responses to new data. Supervised learning develops predictive models using classification and regression techniques." (Soraya Sedkaoui, "Big Data Analytics for Entrepreneurial Success", 2018)

"It consists in learning from data with a known-in-advance outcome that is predicted based on a set of inputs, referred to as 'features'." (Iva Mihaylova, "Applications of Artificial Neural Networks in Economics and Finance", 2018)

"Supervised learning is the data mining task of inferring a function from labeled training data." (Dharmendra S Rajput et al, "Investigation on Deep Learning Approach for Big Data: Applications and Challenges", 2018)

"A particular form of learning process that takes place under supervision and that affects the training of an artificial neural network." (Gaetano B Ronsivalle & Arianna Boldi, "Artificial Intelligence Applied: Six Actual Projects in Big Organizations", 2019)

"A type of machine learning in which output datasets train the machine to generate the desired algorithms, like a teacher supervising a student." (Kirti R Bhatele et al, "The Role of Artificial Intelligence in Cyber Security", 2019)

"In this learning, the model needs labeled data for training. The model knows in advance the answer to the questions it must predict and tries to learn the relationship between input and output." (Aman Kamboj et al, "EarLocalizer: A Deep-Learning-Based Ear Localization Model for Side Face Images in the Wild", 2019)

"A machine learning task designed to learn a function that maps an input onto an output based on a set of training examples (training data). Each training example is a pair consisting of a vector of inputs and an output value. A supervised learning algorithm analyzes the training data and infers a mapping function. A simple example of supervised learning is a regression model." (Timofei Bogomolov et al, "Identifying Patterns in Fresh Produce Purchases: The Application of Machine Learning Techniques", 2020)

"Supervised algorithms mean that a system is developed or modeled on predetermined set of sample data." (Neha Garg & Kamlesh Sharma, "Machine Learning in Text Analysis", 2020)

"A machine learning technique that involves providing a machine with data that is labeled." (Sujata Ramnarayan, "Marketing and Artificial Intelligence: Personalization at Scale", 2021)

"It is machine learning algorithm in which the model learns from ample amount of available labeled data to predict the class of unseen instances." (Gunjan Ansari et al, "Natural Language Processing in Online Reviews", 2021)

"Supervised learning aims at developing a function for a set of labeled data and outputs." (Hari K Kondaveeti et al, "Deep Learning Applications in Agriculture: The Role of Deep Learning in Smart Agriculture", 2021)

"The supervised learning algorithms are trained with a complete set of data and thus, the supervised learning algorithms are used to predict/forecast." (M Govindarajan, "Big Data Mining Algorithms", 2021)

"Supervised Learning is a type of machine learning in which an algorithm takes a labelled data set (data that’s been organized and described), deduces key features characterizing each label, and learns to recognize them in new unseen data." (Accenture)
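
Several of the definitions above note that supervised learning is traditionally divided into regression and classification, and that training always means fitting a function to labeled input-output pairs. A minimal regression sketch, assuming NumPy (the data, the underlying relationship y = 3x + 1, and the test input are all invented for illustration):

```python
import numpy as np

# Labeled training data: each input x is paired with a desired output y.
# The (hypothetical) true relationship is y = 3x + 1 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(0, 0.01, size=50)

# "Training" = fitting a function to the labeled pairs.
# Here: ordinary least squares via np.linalg.lstsq.
A = np.column_stack([X[:, 0], np.ones(len(X))])   # add a bias column
w, b = np.linalg.lstsq(A, y, rcond=None)[0]

# The learned function is then applied to an input it has never seen.
prediction = w * 0.5 + b   # should be close to 3 * 0.5 + 1 = 2.5
```

The "teacher" in the quotes is the label vector y: the fit is driven entirely by the input-output pairs, and the recovered coefficients approximate the relationship that generated the labels.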

15 May 2018

🔬Data Science: Artificial Neural Network [ANN] (Definitions)

"An artificial neural network (or simply a neural network) is a biologically inspired computational model that consists of processing elements (neurons) and connections between them, as well as of training and recall algorithms." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"Biologically inspired computational model consisting of processing elements (called neurons) and connections between them with coefficients (weights) bound to the connections, which constitute the neuronal structure. Training and recall algorithms are also attached to the structure." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"massively parallel interconnected network of simple (usually adaptive) elements and their hierarchical organizations, intended to interact with the objects of the real world in the same way as the biological nervous systems do. In a more general sense, artificial neural networks also encompass abstract schemata, such as mathematical estimators and systems of symbolic rules, constructed automatically from masses of examples, without heuristic design or other human intervention. Such schemata are supposed to describe the operation of biological or artificial neural networks in a highly idealized form and define certain performance limits." (Teuvo Kohonen, "Self-Organizing Maps" 3rd Ed., 2001)

"A collaboration of simple, primitive processing elements that self-organize and self-optimize to achieve computation goals. While these occur in biological systems, in this context we usually mean artificial neural networks such as might be used in optical character recognition applications." (Bruce P Douglass, "Real-Time Agility", 2009)

"An artificial neural network, often just called a “neural network” (NN), is an interconnected group of artificial neurons that uses a mathematical model or computational model for information processing based on a connectionist approach to computation. Knowledge is acquired by the network from its environment through a learning process, and interneuron connection strengths (synaptic weighs) are used to store the acquired knowledge." (Larbi Esmahi et al, "Adaptive Neuro-Fuzzy Systems", Encyclopedia of Artificial Intelligence, 2009)

"An interconnected group of units or neurons that uses a mathematical model for information processing based on a connectionist approach to computation." (Soledad Delgado et al, "Growing Self-Organizing Maps for Data Analysis", Encyclopedia of Artificial Intelligence, 2009)

"Artificial neural networks (ANNs) are non-linear mapping structures based on the function of the human brain. They are powerful tools for modeling, especially when the underlying data relationship is unknown." (Siddhartha Bhattacharjee et al, "Quantum Backpropagation Neural Network Approach for Modeling of Phenol Adsorption from Aqueous Solution by Orange Peel Ash", 2013)

"A computer representation of knowledge that attempts to mimic the neural networks of the human body" (Nell Dale & John Lewis, "Computer Science Illuminated" 6th Ed., 2015)

"a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experimental knowledge and making it available for use." (Pablo Escandell-Montero et al, "Artificial Neural Networks in Physical Therapy", 2015)

"Computational models inspired by brain's nervous systems which are capable of machine learning and pattern recognition. ANN are composed by simple, and highly interconnected processing elements that process information by their dynamic state response to external inputs." (Nuno Pombo et al, "Machine Learning Approaches to Automated Medical Decision Support Systems", 2015)

"Computational models inspired by the properties of biological nervous systems. Usually composed of layers of highly interconnected simple processing units, they are characterised by learning capabilities and can be implemented in software and hardware." (D T Pham & M Castellani, "The Bees Algorithm as a Biologically Inspired Optimisation Method", 2015)

"Computer models of interconnected neurons that can be trained to carry out pattern recognition and other low-level cognitive functions through supervised or unsupervised learning." (Eitan Gross, "Stochastic Neural Network Classifiers", Encyclopedia of Information Science and Technology 3rd Ed., 2015)

"Is non-parametric tool that learns from the surroundings, retains the learning and uses it subsequently." (Kandarpa K Sarma, "Learning Aided Digital Image Compression Technique for Medical Application", 2016)

"A computational graph for machine learning or simulation of a biological neural network (brain)." (Hobson Lane et al, "Natural Language Processing in Action: Understanding, analyzing, and generating text with Python", 2019)

"A machine learning algorithm that is created by mimicking the information transmission and problem-solving mechanism in the human brain." (Tolga Ensari et al, "Overview of Machine Learning Approaches for Wireless Communication", 2019)

"Information elaboration system, software, or hardware that is based on the biological nervous systems, and it is composed of code units called 'nodes' or 'artificial neurons'." (Gaetano B Ronsivalle & Arianna Boldi, "Artificial Intelligence Applied: Six Actual Projects in Big Organizations", 2019)

"A predictive computer algorithm inspired by the biology of the human brain that can learn linear and non-linear functions from data. Artificial neural networks are particularly useful when the complexity of the data or the modelling task makes the design of a function that maps inputs to outputs by hand impractical." (Timofei Bogomolov et al, "Identifying Patterns in Fresh Produce Purchases: The Application of Machine Learning Techniques", 2020)

"An artificial neural network is a collection of neurons connected by weights." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"Is a computational model based on the structure and functions of biological neural networks." (Heorhii Kuchuk et al, "Application of Deep Learning in the Processing of the Aerospace System's Multispectral Images", 2020)

"It mimics animal neural networks and is useful in taking some action by observing some example instead of being explicitly programmed." (Shouvik Chakraborty & Kalyani Mali, "An Overview of Biomedical Image Analysis From the Deep Learning Perspective", 2020)

"An artificial neural network is based on a simplification of neurons in an animal brain which is a group of interconnected neurons." (Hari K Kondaveeti et al, "Deep Learning Applications in Agriculture: The Role of Deep Learning in Smart Agriculture", 2021)

"Artificial neural networks (ANNs) are a type of computing system that is inspired by biological neural networks present in the animal brain." (Hari K Kondaveeti et al, "Deep Learning Applications in Agriculture: The Role of Deep Learning in Smart Agriculture", 2021)

"Artificial neural networks (ANNs), also simply called neural networks (NNs), are bionic systems of neurons for vaguely computing and responding like human brains. ANNs show their power in the field of prediction and classification for a long time by a black-box system. ANNs enter a new era with the assistance of GPU for deep learning nowadays." (Yuh-Wen Chen, "Social Network Analysis: Self-Organizing Map and WINGS by Multiple-Criteria Decision Making", 2021)

"It is a computing model based on the structure of the human brain with many interconnected processing nodes that model input-output relationships. The model is organized in layers of nodes that interconnect to each other." (Mário P Véstias, "Convolutional Neural Network", 2021)

"It is an information processing model inspired by the form of the brain in which biological nervous systems, such as the brain, process information." (Mehmet A Cifci, "Optimizing WSNs for CPS Using Machine Learning Techniques", 2021)

"An artificial neural network (ANN) is a computing system patterned after the operation of neurons in the human brain." (Databricks)
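
Nearly every definition above describes the same structure: simple processing elements (neurons) joined by weighted connections, with the weights storing the acquired knowledge. A minimal forward pass for a fully connected network, assuming NumPy (the layer sizes, weights, and input vector are arbitrary illustrations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A tiny fully connected network: 3 inputs -> 4 hidden neurons -> 1 output.
# Each layer is "neurons plus connection weights", as in the definitions.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

def forward(x):
    h = sigmoid(W1 @ x + b1)      # hidden-layer activations
    return sigmoid(W2 @ h + b2)   # network output, squashed into (0, 1)

out = forward(np.array([0.2, -0.5, 0.9]))
```

Training (adjusting W1, W2, b1, b2 from examples) is what the supervised-learning and backpropagation sections of this series cover; here the weights are just random.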

14 May 2018

🔭Data Science: Reinforcement Learning (Just the Quotes)

"A neural network training method based on presenting input vector x and looking at the output vector calculated by the network. If it is considered 'good', then a 'reward' is given to the network in the sense that the existing connection weights get increased, otherwise the network is "punished"; the connection weights, being considered as 'not appropriately set,' decrease." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"A training paradigm where the neural network is presented with a sequence of input data, followed by a reinforcement signal." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

"learning mode in which adaptive changes of the parameters due to reward or punishment depend on the final outcome of a whole sequence of behavior. The results of learning are evaluated by some performance index." (Teuvo Kohonen, "Self-Organizing Maps" 3rd Ed., 2001)

"A learning method which interprets feedback from an environment to learn optimal sets of condition/response relationships for problem solving within that environment" (Pi-Sheng Deng, "Genetic Algorithm Applications to Optimization Modeling", Encyclopedia of Artificial Intelligence, 2009)

"A sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. Reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states. Differently from supervised learning, in this case there is no target value for each input pattern, only a reward based of how good or bad was the action taken by the agent in the existent environment." (Marley Vellasco et al, "Hierarchical Neuro-Fuzzy Systems" Part II, Encyclopedia of Artificial Intelligence, 2009)

"a type of machine learning in which an agent learns, through its own experience, to navigate through an environment, choosing actions in order to maximize the sum of rewards." (Lisa Torrey & Jude Shavlik, "Transfer Learning",  2010)

"a machine learning technique whereby actions are associated with credits or penalties, sometimes with delay, and whereby, after a series of learning episodes, the learning agent has developed a model of which action to choose in a particular environment, based on the expectation of accumulated rewards." (Apostolos Georgas, "Scientific Workflows for Game Analytics", Encyclopedia of Business Analytics and Optimization", 2014)

"A type of machine learning in which the machine learns what to do by discovering through trial and error the way to maximize a reward." (Gloria Phillips-Wren, "Intelligent Systems to Support Human Decision Making", 2014)

"it stands, in the context of computational learning, for a family of algorithms aimed at approximating the best policy to play in a certain environment (without building an explicit model of it) by increasing the probability of playing actions that improve the rewards received by the agent." (Fernando S Oliveira, "Reinforcement Learning for Business Modeling", 2014)

"a special case of supervised learning in which the cognitive computing system receives feedback on its performance to guide it to a goal or good outcome." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"The knowledge is obtained using rewards and punishments which there is an agent (learner) that acts autonomously and receives a scalar reward signal that is used to evaluate the consequences of its actions." (Nuno Pombo et al, "Machine Learning Approaches to Automated Medical Decision Support Systems", 2015)

"It is also known as learning with a critic. The agent takes a sequence of actions and receives a reward/penalty only at the very end, with no feedback during the intermediate actions. Using this limited information, the agent should learn to generate the actions to maximize the reward in later trials. For example, in chess, we do a set of moves, and at the very end, we win or lose the game; so we need to figure out what the actions that led us to this result were and correspondingly credit them." (Ethem Alpaydın, "Machine learning : the new AI", 2016)

"A learning algorithm for a robot or a software agent to take actions in an environment so as to maximize the sum of rewards through trial and error." (Tomohiro Yamaguchi et al, "Analyzing the Goal-Finding Process of Human Learning With the Reflection Subtask", 2018)

"Training/learning method aiming to automatically determine the ideal behavior within a specific context based on rewarding desired behaviors and/or punishing undesired one." (Ioan-Sorin Comşa et al, "Guaranteeing User Rates With Reinforcement Learning in 5G Radio Access Networks", 2019)

"Branch of the Artificial Intelligence field devoted to obtaining optimal control sequences for agents only by interacting with a concrete dynamical system." (Juan Parras & Santiago Zazo, "The Threat of Intelligent Attackers Using Deep Learning: The Backoff Attack Case", 2020)

"Machine learning approaches often used in robotics. A reward is used to teach a system a desired behavior." (Jörg Frochte et al, "Concerning the Integration of Machine Learning Content in Mechatronics Curricula", 2020)

"This area of deep learning includes methods which iterate over various steps in a process to get the desired results. Steps that yield desirable outcomes are rewarded and steps that yield undesired outcomes are reprimanded until the algorithm is able to learn the optimal process. In simple terms, learning is carried out on its own, based on feedback or reward-based learning." (Amit K Tyagi & Poonam Chahal, "Artificial Intelligence and Machine Learning Algorithms", 2020)

"A machine learning paradigm that utilizes evaluative feedback to cultivate desired behavior." (Marten H L Kaas, "Raising Ethical Machines: Bottom-Up Methods to Implementing Machine Ethics", 2021)

"Is an area of machine learning that learns from experience in order to maximize the rewards." (Walaa Alnasser et al, "An Overview on Protecting User Private-Attribute Information on Social Networks", 2021)

"Reinforcement learning is also a subset of AI algorithms which creates independent, self-learning systems through trial and error. Any positive action is assigned a reward and any negative action would result in a punishment. Reinforcement learning can be used in training autonomous vehicles where the goal would be obtaining the maximum rewards." (Vijayaraghavan Varadharajan & Akanksha Rajendra Singh, "Building Intelligent Cities: Concepts, Principles, and Technologies", 2021)

"Reinforcement Learning uses a kind of algorithm that works by trial and error, where the learning is enabled using a feedback loop of 'rewards' and 'punishments'. When the algorithm is fed a dataset, it treats the environment like a game, and is told whether it has won or lost each time it performs an action. In this way, reinforcement learning algorithms build up a picture of the 'moves' that result in success, and those that don't." (Accenture)
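
The quotes above keep returning to the same ingredients: an agent, an environment, trial and error, and a reward to maximize. Those ingredients can be sketched with tabular Q-learning, one of the classic reinforcement-learning algorithms. The corridor environment and all parameters below are invented for illustration:

```python
import numpy as np

# Tabular Q-learning on a toy 5-state corridor: the agent starts in state 0
# and receives a reward of 1 only on reaching state 4.
N_STATES = 5
ACTIONS = (-1, +1)                     # step left / step right
alpha, gamma, eps = 0.5, 0.9, 0.3      # learning rate, discount, exploration
Q = np.zeros((N_STATES, len(ACTIONS)))
rng = np.random.default_rng(2)

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # Epsilon-greedy: mostly exploit the current estimate, sometimes explore.
        a = int(rng.integers(2)) if rng.random() < eps else int(Q[s].argmax())
        s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else 0.0
        # Trial-and-error update toward reward plus discounted future value.
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

# After training, "right" (action index 1) should dominate in every
# non-terminal state, even though the reward arrives only at the end.
policy = Q[:-1].argmax(axis=1)
```

Note how this matches the definitions: there is no labeled target for each input, only a delayed scalar reward, and the credit for reaching the goal is propagated back through the Q-values over repeated episodes.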

13 May 2018

🔬Data Science: Self-Organizing Map (Definitions)

"A clustering neural net, with topological structure among cluster units." (Laurene V Fausett, "Fundamentals of Neural Networks: Architectures, Algorithms, and Applications", 1994)

"A self organizing map is a form of Kohonen network that arranges its clusters in a (usually) two-dimensional grid so that the codebook vectors (the cluster centers) that are close to each other on the grid are also close in the k-dimensional feature space. The converse is not necessarily true, as codebook vectors that are close in feature-space might not be close on the grid. The map is similar in concept to the maps produced by descriptive techniques such as multi-dimensional scaling (MDS)." (William J Raynor Jr., "The International Dictionary of Artificial Intelligence", 1999)

"result of a nonparametric regression process that is mainly used to represent high-dimensional, nonlinearly related data items in an illustrative, often two-dimensional display, and to perform unsupervised classification and clustering." (Teuvo Kohonen, "Self-Organizing Maps" 3rd Ed., 2001)

"a method of organizing and displaying textual information according to the frequency of occurrence of text and the relationship of text from one document to another." (William H Inmon, "Building the Data Warehouse", 2005)

"A type of unsupervised neural network used to group similar cases in a sample. SOMs are unsupervised (see supervised network) in that they do not require a known dependent variable. They are typically used for exploratory analysis and to reduce dimensionality as an aid to interpretation of complex data. SOMs are similar in purpose to k-means clustering and factor analysis." (David Scarborough & Mark J Somers, "Neural Networks in Organizational Research: Applying Pattern Recognition to the Analysis of Organizational Behavior", 2006)

"A method to learn to cluster input vectors according to how they are naturally grouped in the input space. In its simplest form, the map consists of a regular grid of units and the units learn to represent statistical data described by model vectors. Each map unit contains a vector used to represent the data. During the training process, the model vectors are changed gradually and then the map forms an ordered non-linear regression of the model vectors into the data space." (Atiq Islam et al, "CNS Tumor Prediction Using Gene Expression Data Part II", Encyclopedia of Artificial Intelligence, 2009)

"A neural-network method that reduces the dimensions of data while preserving the topological properties of the input data. SOM is suitable for visualizing high-dimensional data such as microarray data." (Emmanuel Udoh & Salim Bhuiyan, "C-MICRA: A Tool for Clustering Microarray Data", 2009)

"A neural network unsupervised method of vector quantization widely used in classification. Self-Organizing Maps are much appreciated for their topology preservation property and their associated data representation system. These two additive properties come from a pre-defined organization of the network that is at the same time a support for the topology learning and its representation." (Patrick Rousset & Jean-Francois Giret, "A Longitudinal Analysis of Labour Market Data with SOM", Encyclopedia of Artificial Intelligence, 2009)

"A simulated neural network based on a grid of artificial neurons by means of prototype vectors. In an unsupervised training the prototype vectors are adapted to match input vectors in a training set. After completing this training the SOM provides a generalized K-means clustering as well as topological order of neurons." (Laurence Mukankusi et al, "Relationships between Wireless Technology Investment and Organizational Performance", 2009)

"A subtype of artificial neural network. It is trained using unsupervised learning to produce low dimensional representation of the training samples while preserving the topological properties of the input space." (Soledad Delgado et al, "Growing Self-Organizing Maps for Data Analysis", 2009)

"An unsupervised neural network providing a topology-preserving mapping from a high-dimensional input space onto a two-dimensional output space." (Thomas Lidy & Andreas Rauber, "Music Information Retrieval", 2009)

"Category of algorithms based on artificial neural networks that searches, by means of self-organization, to create a map of characteristics that represents the involved samples in a determined problem." (Paulo E Ambrósio, "Artificial Intelligence in Computer-Aided Diagnosis", 2009)

"Self-organizing maps (SOMs) are a data visualization technique which reduce the dimensions of data through the use of self-organizing neural networks." (Lluís Formiga & Francesc Alías, "GTM User Modeling for aIGA Weight Tuning in TTS Synthesis", Encyclopedia of Artificial Intelligence, 2009)

"SOFM [self-organizing feature map] is a data mining method used for unsupervised learning. The architecture consists of an input layer and an output layer. By adjusting the weights of the connections between input and output layer nodes, this method identifies clusters in the data." (Indranil Bose, "Data Mining in Tourism", 2009)

"The self-organizing map is a subtype of artificial neural networks. It is trained using unsupervised learning to produce low dimensional representation of the training samples while preserving the topological properties of the input space. The self-organizing map is a single layer feed-forward network where the output syntaxes are arranged in low dimensional (usually 2D or 3D) grid. Each input is connected to all output neurons. Attached to every neuron there is a weight vector with the same dimensionality as the input vectors. The number of input dimensions is usually a lot higher than the output grid dimension. SOMs are mainly used for dimensionality reduction rather than expansion." (Larbi Esmahi et al, "Adaptive Neuro-Fuzzy Systems", Encyclopedia of Artificial Intelligence, 2009)

"A type of neural network that uses unsupervised learning to produce two-dimensional representations of an input space." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The Self-organizing map is a non-parametric and non-linear neural network that explores data using unsupervised learning. The SOM can produce output that maps multidimensional data onto a two-dimensional topological map. Moreover, since the SOM requires little a priori knowledge of the data, it is an extremely useful tool for exploratory analyses. Thus, the SOM is an ideal visualization tool for analyzing complex time-series data." (Peter Sarlin, "Visualizing Indicators of Debt Crises in a Lower Dimension: A Self-Organizing Maps Approach", 2012)

"SOMs or Kohonen networks have a grid topology, with unequal grid weights. The topology of the grid provides a low dimensional visualization of the data distribution." (Siddhartha Bhattacharjee et al, "Quantum Backpropagation Neural Network Approach for Modeling of Phenol Adsorption from Aqueous Solution by Orange Peel Ash", 2013)

"An unsupervised neural network widely used in exploratory data analysis and to visualize multivariate object relationships." (Manuel Martín-Merino, "Semi-Supervised Dimension Reduction Techniques to Discover Term Relationships", 2015)

"ANN used for visualizing low-dimensional views of high-dimensional data." (Pablo Escandell-Montero et al, "Artificial Neural Networks in Physical Therapy", 2015)

"Is an unsupervised learning ANN, which means that no human intervention is needed during the learning and that little needs to be known about the characteristics of the input data." (Nuno Pombo et al, "Machine Learning Approaches to Automated Medical Decision Support Systems", 2015)

"A kind of artificial neural network which attempts to mimic brain functions to provide learning and pattern recognition techniques. SOM have the ability to extract patterns from large datasets without explicitly understanding the underlying relationships. They transform nonlinear relations among high dimensional data into simple geometric connections among their image points on a low-dimensional display." (Felix Lopez-Iturriaga & Iván Pastor-Sanz, "Using Self Organizing Maps for Banking Oversight: The Case of Spanish Savings Banks", 2016)

"Neural network which simulated some cerebral functions in elaborating visual information. It is usually used to classify a large amount of data." (Gaetano B Ronsivalle & Arianna Boldi, "Artificial Intelligence Applied: Six Actual Projects in Big Organizations", 2019)

"Classification technique based on unsupervised-learning artificial neural networks allowing to group data into clusters." (Julián Sierra-Pérez & Joham Alvarez-Montoya, "Strain Field Pattern Recognition for Structural Health Monitoring Applications", 2020)

"It is a type of artificial neural network (ANN) trained using unsupervised learning for dimensionality reduction by discretized representation of the input space of the training samples called as map." (Dinesh Bhatia et al, "A Novel Artificial Intelligence Technique for Analysis of Real-Time Electro-Cardiogram Signal for the Prediction of Early Cardiac Ailment Onset", 2020)

"Being a particular type of ANNs, the Self Organizing Map is a simple mapping from inputs: attributes directly to outputs: clusters by the algorithm of unsupervised learning. SOM is a clustering and visualization technique in exploratory data analysis." (Yuh-Wen Chen, "Social Network Analysis: Self-Organizing Map and WINGS by Multiple-Criteria Decision Making", 2021)
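
The definitions above agree on the essentials: a grid of units, each holding a model (codebook) vector; a best-matching unit for each input; and neighborhood-based updates that make units close on the grid end up close in the data space. A minimal 1-D SOM sketch, assuming NumPy (grid size, data clusters, and the learning schedule are all illustrative choices):

```python
import numpy as np

# Three tight 2-D clusters as training data (invented for illustration).
rng = np.random.default_rng(3)
data = np.vstack([rng.normal(loc, 0.05, size=(100, 2))
                  for loc in ([0, 0], [1, 0], [0, 1])])

n_units = 10                                   # a line of 10 map units
W = rng.uniform(0, 1, size=(n_units, 2))       # codebook vectors

T = 2000
for t in range(T):
    x = data[rng.integers(len(data))]          # pick a random sample
    bmu = np.argmin(((W - x) ** 2).sum(axis=1))    # best-matching unit
    lr = 0.5 * (1 - t / T)                     # decaying learning rate
    sigma = max(2.0 * (1 - t / T), 0.5)        # shrinking neighborhood width
    grid_dist = np.abs(np.arange(n_units) - bmu)   # distance along the grid
    h = np.exp(-grid_dist ** 2 / (2 * sigma ** 2)) # neighborhood function
    W += lr * h[:, None] * (x - W)             # pull BMU and neighbors toward x
```

Because neighbors of the best-matching unit are dragged along with it, units adjacent on the grid tend to end up close in the data space as well, which is the topology-preservation property the definitions emphasize.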

12 May 2018

🔬Data Science: Backpropagation (Definitions)

"A learning algorithm for multilayer neural nets based on minimizing the mean, or total, squared error." (Laurene V Fausett, "Fundamentals of Neural Networks: Architectures, Algorithms, and Applications", 1994)

"A learning scheme by which a multi-layer feedforward network is organized for pattern recognition or classification utilizing an external teacher, and error feedback (or propagation)." (Guido J Deboeck and Teuvo Kohonen, "Visual explorations in finance with self-organizing maps", 2000)

"weight-vector optimization method used in multilayered feed-forward networks. The corrective steps arc made starting at the output layer and proceeding toward the input layer." (Teuvo Kohonen, "Self-Organizing Maps" 3rd Ed., 2001)

"A class of feed-forward neural networks used for classification, forecasting, and estimation. Backpropagation is the process by which connection weights between neurons are modified using a backward pass of error derivatives." (David Scarborough & Mark J Somers, "Neural Networks in Organizational Research: Applying Pattern Recognition to the Analysis of Organizational Behavior", 2006)

"A method for training a neural network by adjusting the weights using errors between the current prediction and the training set." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A supervised learning algorithm used to train artificial neural networks, where the network learns from many inputs, similar to the way a child learns to identify a bird from examples of birds and birds attributes." (Eitan Gross, "Stochastic Neural Network Classifiers", 2015)

"A learning algorithm for artificial neural networks used for supervised learning, where connection weights are iteratively updated to decrease the approximation error at the output units." (Ethem Alpaydın, "Machine learning: the new AI", 2016)

"Learning algorithm that optimizes a neural network by gradient descent to minimize a cost function and improve performance." (Terrence J Sejnowski, "The Deep Learning Revolution", 2018)

"The backpropagation algorithm is an ML algorithm used to train neural networks. The algorithm calculates for each neuron in a network the contribution |  the neuron makes to the error of the network. Using this error calculation for each neuron it is possible to update the weights on the inputs to each neuron so as to reduce the overall error of the network. The backpropagation algorithm is so named because it works in a two stage process. In the first stage an instance is input to the network and the information flows forward through the network until the network generates a prediction for that instance. In the second stage the error of the network on that instance is calculated by comparing the network's prediction to the correct output for that instance (as specified by the training data) and then this error is then shared back (or backpropagated) through the neurons in the network on a layer by layer basis beginning at the output layer." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Backpropagation is short for 'backward propagation of errors'. Backpropagation in convolutional neural networks is a way of training these networks based on a known, desired output for a specific sample case." (Accenture)

11 May 2018

Application Support: One Database, Two Vendors, No Love Story or Maybe…

Data Warehousing

Introduction

Situation: An organization has several BI tools provisioned with data from the same data warehouse (DW), the BI infrastructure being supported by the same service provider (vendor). The organization wants to adopt a new BI technology, though for that another vendor must be brought into the picture. The data the tool requires are already available in the DW, though the DW needs to be extended with logic and other components to support the new tool. This means that two vendors will be active in the same DW, more generally in the same environment.

Question(s): What is the best approach for making this work? What are the challenges of making it work, considering that two vendors are involved?


Preliminary

    When you ask IT people about this situation, many will tell you that it’s not a good idea, being circumspect about having two vendors within the same environment. Some will recall previous experiences in which things went really bad or escalated to some degree. They will even show their disagreement through body language or a raised tone. Even if they also had good experiences with two vendors supporting the same environment, the negative experiences will prevail. It’s the typical reaction to an idea that recalls something that caused considerable trouble. This behavior is understandable, as humans generally tend to remember the issues they had more than their successes. Problems leave deeper marks than successes, especially when challenges are seen as burdens.

    Reacting defensively is a result of the “I’ve been burned once” syndrome. People react adversely and tend to avoid situations in which they were burned, instead of dealing with them, of recognizing the circumstances that led to the situation in the first place, and of seeing the opportunities for healing and rising above the challenges.


   Personally, at first glance, caution would make me advise as well against having two or more vendors playing within the same playground. I had my share of extreme cases in which something went wrong and the vendors started acting like kids. Parents (in general, people who work with children) know what I’m talking about: children don’t like to share their toys, and parents often find themselves in the position of mediating between them. When the toy gets broken it’s easy to blame the other kid for it, just as somebody else must put the toy back in its place, because that somebody played with it last. It’s a mix between “I’m in charge” and the blame game. Who needs that?

  At second sight, if parents made it work, why wouldn’t professionals succeed in making two vendors work together? Sure, parents have more practice in dealing with kids, face such situations on a daily basis, and have fewer variables to think about… I have seen vendors sitting together until they came up with a solution; I’ve seen vendors open to communication, putting the customer first, even if that meant leaving the ego behind. Where there’s a will there’s a way.


The Solution Space

    In IT there are seldom general recipes that always lead to success, and whether a solution works or not depends on a series of factors – environment, skills, communication, human behavior and quite often chance, the chance of doing the right thing at the right time. However, a recipe can be used as a starting point, eventually to define the best-case scenario – what will happen when everything goes well. At the opposite side there is the worst-case scenario – what will happen when everything goes south. These two opposite scenarios are in general the frame within which a solution can be defined.

    Within this frame one can add several other reference points or paths, made of the experience of people handling similar situations – what worked, what didn’t, what could work, what the challenges are, and so on. In general, people’s experience and knowledge prove to be a good estimator in decision making, and even if relative, it provides some insight into the problem at hand.


    Let’s reconsider the parents put in the situation of dealing with children fighting over the same toy, though from the perspective of all the toys available to play with. There are several options available: the kids could take (supervised) turns in playing with the toys, which could be a win-win situation if they are willing to cooperate. One can take the toys (temporarily) away, though this could lead to other issues. One can reaffirm who’s the owner of each toy, each kid being allowed to play only with his toy. One could buy a second toy, and thus break the bank, even if this will not make the issue entirely go away. Probably there are other solutions inventive parents might find.

    Similarly, in the above-stated problem, one option, and maybe the best, is having the vendors share ownership of the DW by finding a way to work together. Defining the ownership for each tool can alleviate some of the problems but not all, the same as building a second DW. We can probably all agree that taking the tools away is not the right thing to do, and even if it’s a solution, it doesn’t serve the purpose.


Sharing Ownership

    Complex IT environments like the one of a DW depend on the vendors’ capability of working together toward the same goal, even if different interests are in play. This presumes the parties’ disposition to relinquish some control and share responsibilities. Unfortunately, not all vendors are willing to do that. That’s the point where imaginary obstacles are built, and where effort needs to be put into eliminating such obstacles.

    When working together, often one of the parties must play the coordinator role. In theory, this role can be played by any of the vendors, and the roles can even change from case to case. Another approach is for the coordinator role to be taken by a person or a board from the customer side. In the case of a data warehouse it can be an IT professional, a Project Manager or a BI Competency Center (BICC). This allows the activities to be coordinated smoothly, as well as the communication and other types of challenges faced to be mediated.


    How will ownership sharing work? Let’s suppose vendor A wants to change something in the infrastructure. The change is first formulated, shortly reviewed, and approved by both vendors and the customer, and will then be implemented and documented by vendor A as needed. Vendor B is involved in the process by validating the concept and reviewing the documentation, its involvement being kept to a minimum. There can still be some delays in the process, though the overhead is somewhat minimized. There will also be scenarios in which vendor B only needs to be informed that a change has occurred, or sometimes it is enough if the change was properly documented.

    This approach also involves a greater need for documentation, versioning and established processes, their role being to facilitate the communication and track the changes occurring in the environment.


Splitting Ownership

    Splitting the ownership involves setting clear boundaries and responsibilities within which each vendor can perform the work. One is thus forced to draw a line and say which components or activities belong to each vendor.

    The architecture of existing solutions sometimes makes it hard to split the ownership when the architecture was not designed for it. A solution would be to redesign the whole architecture, though even then it might not be possible to draw a clear line when there are grey areas. One eventually needs to consider the advantages and disadvantages and decide which vendor the responsibility suits best.


    For example, in the context of a DW, security can be enforced via schemas within the same or different databases, though there are also objects (e.g. tables with basis data) used by multiple applications. One of the vendors (vendor A) will get the ownership of the objects, thus when vendor B needs a change to those tables, it must request the change from vendor A. Once the changes are done, vendor B needs to validate them, and if there are problems further communication occurs. In total this approach will take more time than if vendor B had done the changes alone. However, it works, even if it comes with some challenges.

    There’s also the possibility of giving vendor B temporary permissions to do the changes, which will shorten the effort needed. Vendor A will still be in charge, and will have to provide the documentation and eventually do some validation as well.


Separating Ownership

    Giving each vendor its own playground is a costly solution, though it can be the only solution in certain scenarios – for example, when an architecture is supposed to replace (in time) another, or when the existing architecture has certain limitations. In the context of a DW this involves duplicating the data loads, the data themselves, as well as the logic, eventually processes, and so on.

    Pushing this just to solve a communication problem is the wrong thing to do. What happens if a third or a fourth vendor joins the party? Would a new environment be created for each vendor? Hopefully not…

    On the other side, there are also vendors that don’t want to relinquish ownership, and will play their cards so as not to do it. The overhead of dealing with such issues may, in extreme cases, surpass the costs of having a second environment. In the end, the final decision belongs to the customer.


Hybrid Approach


    A hybrid between sharing and splitting ownership can prove to give the best of the two scenarios. It’s useful and even recommended to define the boundaries of work for each vendor, and then to share ownership in the areas where concerns intersect, the grey areas. For sensitive areas there could be some restrictions on cooperation.

    A hybrid solution can also involve splitting some parts of the architecture, though performance and security are mainly the driving factors.


Conclusion

   With this post I wanted to make the reader question some of the hot-headed decisions made when two or more vendors are involved in supporting an architecture. Even if the problem is put within the context of a DW, its occurrence extends far beyond this context. We are enablers and problem solvers. Instead of avoiding challenges we should rather make sure that we’re removing or minimizing the risks.

🔬Data Science: K-Means Algorithm (Definitions)

"A top-down grouping method where the number of clusters is defined prior to grouping." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"An algorithm used to assign K centers to represent the clustering of N points (K< N). The points are iteratively adjusted so that each of the N points is assigned to one of the K clusters, and each of the K clusters is the mean of its assigned points." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"The k-means algorithm is an algorithm to cluster n objects based on attributes into k partitions, k = n. The algorithm minimizes the total intra-cluster variance or the squared error function." (Dimitrios G Tsalikakis et al, "Segmentation of Cardiac Magnetic Resonance Images", 2009)

"The k-means algorithm assigns any number of data objects to one of k clusters." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"The clustering algorithm that divides a dataset into k groups such that the members in each group are as similar as possible, that is, closest to one another." (David Natingga, "Data Science Algorithms in a Week" 2nd Ed., 2018)

"K-Means is a technique for clustering. It works by randomly placing K points, called centroids, and iteratively moving them to minimize the squared distance of elements of a cluster to their centroid." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"It is an iterative algorithm that partition the hole data set into K non overlaping subsets (Clusters). Each data point belongs to only one subset." (Aman Tyagi, "Healthcare-Internet of Things and Its Components: Technologies, Benefits, Algorithms, Security, and Challenges", 2021)

[Non-scalable K-means:] "A Microsoft Clustering algorithm method that uses a distance measure to assign a data point to its closest cluster." (Microsoft Technet)

"An algorithm that places each value in the cluster with the nearest mean, and in which clusters are formed by minimizing the within-cluster deviation from the mean." (Microsoft, "SSAS Glossary")

10 May 2018

🔬Data Science: Support Vector Machines [SVM] (Definitions)

"A supervised machine learning classification approach with the objective to find the hyperplane maximizing the minimum distance between the plane and the training data points." (Xiaoyan Yu et al, "Automatic Syllabus Classification Using Support Vector Machines", 2009)

"Support vector machines [SVM] is a methodology used for classification and regression. SVMs select a small number of critical boundary instances called support vectors from each class and build a linear discriminant function that separates them as widely as possible." (Yorgos Goletsis et al, "Bankruptcy Prediction through Artificial Intelligence", 2009)

"SVM is a data mining method useful for classification problems. It uses training data and kernel functions to build a model that can appropriately predict the class of an unclassified observation." (Indranil Bose, "Data Mining in Tourism", 2009)

"A modeling technique that assigns points to classes based on the assignment of previous points, and then determines the gap dividing the classes where the gap is furthest from points in both classes." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A machine-learning technique that classifies objects. The method starts with a training set consisting of two classes of objects as input. The SVA computes a hyperplane, in a multidimensional space, that separates objects of the two classes. The dimension of the hyperspace is determined by the number of dimensions or attributes associated with the objects. Additional objects (i.e., test set objects) are assigned membership in one class or the other, depending on which side of the hyperplane they reside." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"A machine learning algorithm that works with labeled training data and outputs results to an optimal hyperplane. A hyperplane is a subspace of the dimension minus one (that is, a line in a plane)." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"A classification algorithm that finds the hyperplane dividing the training data into given classes. This division by the hyperplane is then used to classify the data further." (David Natingga, "Data Science Algorithms in a Week" 2nd Ed., 2018)

"Machine learning techniques that are used to make predictions of continuous variables and classifications of categorical variables based on patterns and relationships in a set of training data for which the values of predictors and outcomes for all cases are known." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"It is a supervised machine learning tool utilized for data analysis, regression, and classification." (Shradha Verma, "Deep Learning-Based Mobile Application for Plant Disease Diagnosis", 2019)

"It is a supervised learning algorithm in ML used for problems in both classification and regression. This uses a technique called the kernel trick to transform the data and then determines an optimal limit between the possible outputs, based on those transformations." (Mehmet A Cifci, "Optimizing WSNs for CPS Using Machine Learning Techniques", 2021)

"Support Vector Machines (SVM) are supervised machine learning algorithms used for classification and regression analysis. Employed in classification analysis, support vector machines can carry out text categorization, image classification, and handwriting recognition." (Accenture)

🔬Data Science: Cross-validation (Definitions)

"A method for assessing the accuracy of a regression or classification model. A data set is divided up into a series of test and training sets, and a model is built with each of the training set and is tested with the separate test set." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A method for assessing the accuracy of a regression or classification model." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2007)

"A statistical method derived from cross-classification which main objective is to detect the outlying point in a population set." (Tomasz Ciszkowski & Zbigniew Kotulski, "Secure Routing with Reputation in MANET", 2008)

"Process by which an original dataset d is divided into a training set t and a validation set v. The training set is used to produce an effort estimation model (if applicable), later used to predict effort for each of the projects in v, as if these projects were new projects for which effort was unknown. Accuracy statistics are then obtained and aggregated to provide an overall measure of prediction accuracy." (Emilia Mendes & Silvia Abrahão, "Web Development Effort Estimation: An Empirical Analysis", 2008)

"A method of estimating predictive error of inducers. Cross-validation procedure splits that dataset into k equal-sized pieces called folds. k predictive function are built, each tested on a distinct fold after being trained on the remaining folds." (Gilles Lebrun et al, EA Multi-Model Selection for SVM, 2009)

"Method to estimate the accuracy of a classifier system. In this approach, the dataset, D, is randomly split into K mutually exclusive subsets (folds) of equal size (D1, D2, …, Dk) and K classifiers are built. The i-th classifier is trained on the union of all Dj ¤ j¹i and tested on Di. The estimate accuracy is the overall number of correct classifications divided by the number of instances in the dataset." (M Paz S Lorente et al, "Ensemble of ANN for Traffic Sign Recognition" [in "Encyclopedia of Artificial Intelligence"], 2009)

"The process of assessing the predictive accuracy of a model in a test sample compared to its predictive accuracy in the learning or training sample that was used to make the model. Cross-validation is a primary way to assure that over learning does not take place in the final model, and thus that the model approximates reality as well as can be obtained from the data available." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"Validating a scoring procedure by applying it to another set of data." (Dougal Hutchison, "Automated Essay Scoring Systems", 2009)

"A method for evaluating the accuracy of a data mining model." (Microsoft, "SQL Server 2012 Glossary", 2012)

"Cross-validation is a method of splitting all of your data into two parts: training and validation. The training data is used to build the machine learning model, whereas the validation data is used to validate that the model is doing what is expected. This increases our ability to find and determine the underlying errors in a model." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"A technique used for validation and model selection. The data is randomly partitioned into K groups. The model is then trained K times, each time with one of the groups left out, on which it is evaluated." (Simon Rogers & Mark Girolami, "A First Course in Machine Learning", 2017)

"A model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set." (Adrian Carballal et al, "Approach to Minimize Bias on Aesthetic Image Datasets", 2019)

09 May 2018

🔬Data Science: Meta-Analysis (Definitions)

"A set of statistical procedures designed to accumulate experimental and correlational results across independent studies that address related sets of research questions." (Ying-Chieh Liu et al, "Meta-Analysis Research on Virtual Team Performance", 2008)

"A statistical technique in which the outcomes from multiple experimental comparisons are synthesized by evaluating effect sizes. Because the recommendations are based on multiple experiments, practitioners can have greater confidence in the results from an effective meta-analysis." (Ruth C Clark, "Building Expertise: Cognitive Methods for Training and Performance Improvement", 2008)

"Study characteristics can be thought of as the independent variable." (Ernest W Brewer, "Using Meta-Analysis as a Research Tool in Making Educational and Organizational Decisions", 2009)

"The exhaustive search process which comprises numerous and versatile algorithmic procedures to exploit the gene expression results by combining or further processing them with sophisticated statistical learning and data mining techniques coupled with annotated information concerning functional properties of these genes residing in large databases." (Aristotelis Chatziioannou & Panagiotis Moulos, "DNA Microarrays: Analysis and Interpretation", 2009)

"The statistical analysis of a group of relevantly similar experimental studies, in order to summarize their results considered as a whole." (Saul Fisher, "Cost-Effectiveness", 2009)

"A quantitative research review that applies statistical techniques to examine, standardize and combine the results of different empirical studies that investigate a set of related research hypotheses." (Olusola O Adesope & John C Nesbit, "A Systematic Review of Research on Collaborative Learning with Concept Maps", 2010)

"Analysis of a number of comparable studies with the aim to combine those studies in a statistically valid way to test hypotheses (about the effect of an intervention)." (Cor van Dijkum  & Laura Vegter, "A Client Perspective on E-Health: Illustrated with an Example from The Netherlands", 2010)

"A computation of average effect sizes among many experiments. Data based on a meta-analysis give us greater confidence in the results because they reflect many research studies." (Ruth C Clark & Richard E Mayer, "e-Learning and the Science of Instruction", 2011)

"Analysis of previously analyzed data relating to the same or similar biological phenomena or treatment studied across the same or similar technology platforms." (Padmalatha S Reddy et al, "Knowledge-Driven, Data-Assisted Integrative Pathway Analytics", 2011)

"A set of techniques for the quantitative analysis of results from two or more studies on the same or similar issues." (Geoff Cumming, "Understanding The New Statistics", 2013)

"A method of combining effect sizes from individual studies into a single composite effect size." (Jonathan van‘t Riet et al, "The Effects of Active Videogames on BMI among Young People: A Meta-Analysis", 2016)

"A procedure that allows the statistical averaging of results from independent studies of the same phenomena. Meta-analysis essentially combines studies on the same topic into a single large study, providing an index of how strongly the independent variable affected the dependent variable on an average in the set of studies." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A research design that combines and synthesize different types of data from multiple sources." (Mzoli Mncanca & Chinedu Okeke, "Early Exposure to Domestic Violence and Implications for Early Childhood Education Services", 2019)

"A quantitative, formal, epidemiological study design used to systematically assess the results of previous research to derive conclusions about that body of research." (Helena H Borba et al, "Challenges in Evidence-Based Practice Education: From Teaching Concepts Towards Decision-Making Learning", 2021)

08 May 2018

📦Data Migrations (DM): Facts, Principles and Practices

Data Migration
Data Migrations Series

Introduction

Ask a person who is familiar with cars ‘how a car works’ - you’ll get an answer even if it doesn’t entirely reflect reality. Of course, the deeper one's knowledge about cars, the more elaborate or exact the answer. One doesn't have to be a mechanic to give an acceptable explanation, though in order to design, repair or build a car one needs extensive knowledge about a car’s inner workings.

Keeping the proportions, the same statements are valid for the inner workings of Data Migrations (DM) – almost everybody in IT knows what a DM is, though to design or perform one you might need an expert.

The good news about DMs is that their inner workings are less complex than the ones of cars. Basically, a DM requires some understanding of data architecture, data modelling and data manipulation, and some knowledge of business data and processes. A data architect, a database developer, a data modeler or any other data specialist can approach such an endeavor. In theory, with some guidance, a person with knowledge of business data and processes can also do the work. Even if DMs imply certain complexity, they are not rocket science! In fact, there are tools that can be used to do most of the work, and there are some general principles and best practices about the architecture, planning and execution that can help in the process.

Principles and Best Practices

It might be useful to explain the difference between principles and best practices, because they’ll more likely lead you to success if you understand and incorporate them in your solutions. Principles, as patterns of advice, are general or fundamental ideas, truths or values stated in a context-independent manner. Practices, on the other side, are specific actions or applications of these principles stated in a context-dependent way. The difference between them is relatively thin, and therefore they are easy to confound, though by looking at their generality one can easily identify which is which.

For example, in the 1960s the “keep it simple, stupid” (aka KISS) principle became known; it states that a simple solution works better than a complex one, and therefore as a key goal one should search for simplicity in design. Even if kind of pejorative, it’s a much simpler restatement of Occam’s razor – do something in the simplest manner possible because simpler is usually better. To apply it one must understand what simplicity means, and how it can be translated into designs. According to Hans Hofmann, “the ability to simplify means to eliminate the unnecessary so that the necessary may speak”, or in a quote attributed to Einstein: “everything should be made as simple as possible, but not simpler”. This is the range within which the best practices derived from KISS can be defined.

There are multiple practices that allow reducing the complexity of DM solutions: start with a Proof-of-Concept (PoC), start small and build incrementally, use off-the-shelf software, use the best tool for the purpose, use incremental data loads, split big data files into smaller ones, and so on. As can be seen, all of them are direct actions that address specific aspects of the DM architecture or process.


Data Migration Truths

When looking at principles and best practices they seem to be further rooted in some basic truths or facts common to most DMs. When considered together, they offer a broader view and understanding of what a DM is about.  Here are some of the most important facts about DMs:

DM as a project:

  • A DM is a subproject with specific characteristics
  • A DM is typically a one-time activity before Go live
  • A DM’s success is entirely dependent on an organization’s capability of running projects
  • Responsibilities are not always clear
  • Requirements change as the project progresses
  • Resources aren't available when needed
  • Parallel migrations require a common strategy
  • A successful DM can be used as recipe for further migrations
  • A DM's success is a matter of perception
  • The volume of work increases toward the end


DM Architecture

  • A DM is more complex and messier than initially thought
  • A DM needs to be repeatable
  • A DM requires experts from various areas
  • There are several architectures to be considered
  • The migration approach is dependent on the future architecture
  • Management Systems have their own requirements
  • No matter how detailed the planning something is always forgotten
  • Knowledge of the source and target systems isn't always available
  • DMs are too big to be performed manually
  • Some tasks are easier to perform manually
  • Some steps in the migration need to be rerun
  • It takes several iterations before arriving at the final solution
  • Several data regulations apply
  • Fall-back is always an alternative
  • IT supports the migration project/processes
  • Technologies are enablers and not guarantees for success
  • Tools address only a set of the needed functionality
  • Troubleshooting needs to be performed before, during and after migrations
  • Failure/nonconformities need to be documented
  • A DM is an opportunity to improve the quality of the data
  • A DM needs to be transparent for the business


DM implications for the Business:

  • A DM requires a downtime for the system involved
  • The business has several expectations/assumptions
  • Some expectations are considered as self-evident
  • The initial assumptions are almost always wrong
  • A DM's success/failure depends on business' perception
  • Business' knowledge about the data and processes is relative
  • The business is involved for the whole project’s duration
  • Business needs continuous communication
  • Data migration is mainly a business rather than a technical challenge
  • Business expertise in every data area is needed
  • DM and Data Quality (DQ) need to be part of a Data Management strategy
  • Old legacy system data have further value for the business
  • Reporting requirements come with their own data requirements


DM and Data Quality:

  • Not all required data are available
  • Data don't match the expectations
  • Quality of the data needs to be judged based on the target system
  • DQ is usually performed as a separate project with different timelines
  • Data don't have the same importance for the business
  • Improving DQ is a collective effort
  • Data cleaning needs to be done at the source (when possible)
  • Data cleaning is a business activity
  • The business is responsible for the data
  • Quality improvement is governed by 80-20 rule
  • No organization is willing to pay for perfect data quality
  • If it can’t be counted, it isn’t visible

More to come, stay tuned…

🔬Data Science: Cluster Analysis (Definitions)

"Generally, cluster analysis, or clustering, comprises a wide array of mathematical methods and algorithms for grouping similar items in a sample to create classifications and hierarchies through statistical manipulation of given measures of samples from the population being clustered." (Hannu Kivijärvi et al, "A Support System for the Strategic Scenario Process", 2008) 

"Defining groups based on the 'degree' to which an item belongs in a category. The degree may be determined by indicating a percentage amount." (Mary J Lenard & Pervaiz Alam, "Application of Fuzzy Logic to Fraud Detection", 2009)

"A technique that identifies homogenous subgroups or clusters of subjects or study objects." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A statistical technique for finding natural groupings in data; it can also be used to assign new cases to groupings or categories." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"Techniques for organizing data into groups of similar cases." (Meta S Brown, "Data Mining For Dummies", 2014)

"A statistical technique whereby data or objects are classified into groups (clusters) that are similar to one another but different from data or objects in other clusters." (Soraya Sedkaoui, "Big Data Analytics for Entrepreneurial Success", 2018)

"Clustering or cluster analysis is a set of techniques of multivariate data analysis aimed at selecting and grouping homogeneous elements in a data set. Clustering techniques are based on measures relating to the similarity between the elements. In many approaches this similarity, or better, dissimilarity, is designed in terms of distance in a multidimensional space. Clustering algorithms group items on the basis of their mutual distance, and then the belonging to a set or not depends on how the element under consideration is distant from the collection itself." (Crescenzio Gallo, "Building Gene Networks by Analyzing Gene Expression Profiles", 2018)

"A type of an unsupervised learning that aims to partition a set of objects in such a way that objects in the same group (called a cluster) are more similar, whereas characteristics of objects assigned into different clusters are quite distinct." (Timofei Bogomolov et al, "Identifying Patterns in Fresh Produce Purchases: The Application of Machine Learning Techniques", 2020)

"Cluster analysis is the process of identifying objects that are similar to each other and cluster them in order to understand the differences as well as the similarities within the data." (Analytics Insight)
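The distance-based grouping that several of the definitions above describe can be illustrated with a naive k-means sketch. The sample points, the number of clusters and the fixed iteration count below are arbitrary choices for illustration, not a production algorithm:

```python
import math
import random

def kmeans(points, k, iterations=20, seed=42):
    """Naive k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster, repeating
    a fixed number of times."""
    random.seed(seed)
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: group points by nearest centroid (Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [math.dist(p, c) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: recompute each centroid as the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(x) / len(cluster) for x in zip(*cluster))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 1.5),      # one natural group
          (8, 8), (8.5, 9), (9, 8.5)]      # another natural group
centroids, clusters = kmeans(points, k=2)
```

With two well-separated groups like these, the items end up in clusters whose members are close to one another but distant from the other cluster, exactly the property the definitions emphasize.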

🔬Data Science: Simulation Model (Definitions)

"A 'what-if' model that attempts to simulate the effects of alternative management policies and assumptions about the firm's external environment. It is basically a tool for management's laboratory." (Jae K Shim & Joel G Siegel, "Budgeting Basics and Beyond", 2008)

"Simulation models are formal representations of a portion of reality. Simulation models allow managers to share and test assumptions about problem causes and solutions." (Luis F Luna-Reyes, "System Dynamics to Understand Public Information Technology", 2008)

"A simplified, computer, simulation-based construction (model) of some real world phenomenon (or the problem task)." (Hassan Qudrat-Ullah, "System Dynamics Based Technology for Decision Support", 2009)

"A model that shows the expected operation of a system based solely on the model." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"An analytical technique that often involves running models repeatedly using a variety of inputs to determine the upper and lower bounds of possible outcomes. This simulation process is also sometimes used to identify the likely distribution of outputs given a series of assumptions around how the inputs are distributed." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"A representation of a system that can be used to mimic the processes of the system under varying circumstances. It is usually operated subject to stochastic disturbances." (Özgür Yalçınkaya, "A General Simulation Modelling Framework for Train Timetabling Problem", 2016)

"A model that represents an actual procedure over time." (Rania Tegou, "Excess Inventories and Stock Out Events Through Advanced Demand Analysis and Emergency Deliveries",  2018)

"A technique that creates a detailed model to predict the behavior of a CI/service." (ITIL)
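The "running models repeatedly using a variety of inputs" that Stubbs describes can be sketched as a tiny Monte Carlo simulation. The toy cost model, the chosen distributions and all figures below are purely hypothetical, picked only to show the mechanics:

```python
import random
import statistics

def simulate_project_cost(n_runs=10_000, seed=7):
    """Rerun a toy cost model with randomly drawn inputs to estimate
    the likely spread of outcomes (all figures are made up)."""
    random.seed(seed)
    outcomes = []
    for _ in range(n_runs):
        labour = random.uniform(80, 120)    # labour cost, in k€
        licences = random.gauss(30, 5)      # licence cost, in k€
        overrun = random.random() < 0.3     # assumed 30% chance of a delay
        total = labour + licences + (25 if overrun else 0)
        outcomes.append(total)
    outcomes.sort()
    return {
        "mean": statistics.mean(outcomes),
        "p05": outcomes[int(0.05 * n_runs)],  # lower bound (5th percentile)
        "p95": outcomes[int(0.95 * n_runs)],  # upper bound (95th percentile)
    }

result = simulate_project_cost()
```

The percentiles give the "upper and lower bounds of possible outcomes" from the definition; changing the input distributions lets one test alternative assumptions, which is the "what-if" use Shim and Siegel mention.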

07 May 2018

🔬Data Science: Fuzzy Set (Definitions)

"A fuzzy set is a class of objects with a continuum of grades of membership. Such a set is characterized by a membership (characteristic) function which assigns to each object a grade of membership ranging between zero and one. The notions of inclusion, union, intersection, complement, relation, convexity, etc., are extended to such sets, and various properties of these notions in the context of fuzzy sets are established. In particular, a separation theorem for convex fuzzy sets is proved without requiring that the fuzzy sets be disjoint." (Lotfi A Zadeh, "Fuzzy Sets", 1965)

"A fuzzy set can be defined mathematically by assigning to each possible individual in the universe of discourse a value representing its grade of membership in the fuzzy set. This grade corresponds to the degree to which that individual is similar or compatible with the concept represented by the fuzzy set. Thus, individuals may belong in the fuzzy set to a greater or lesser degree as indicated by a larger or smaller membership grade. As already mentioned, these membership grades are very often represented by real-number values ranging in the closed interval between 0 and 1." (George J Klir & Bo Yuan, "Fuzzy Sets and Fuzzy Logic: Theory and Applications", 1995)

"A set of items whose degree of membership in the set may range from 0 to 1." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

"A set whose members belong to it to some degree. In contrast, a standard or nonfuzzy set contains its members all or none. The set of even numbers has no fuzzy members." (Guido Deboeck & Teuvo Kohonen (Eds), "Visual Explorations in Finance with Self-Organizing Maps" 2nd Ed., 2000)

"A fuzzy set is any set that allows its members to have different grades of membership (membership function) in the interval [0,1]. A numerical value between 0 and 1 that represents the degree to which an element belongs to a particular set, also referred to as membership value." (Harish Garg,  "Predicting Uncertain Behavior and Performance Analysis of the Pulping System in a Paper Industry using PSO and Fuzzy Methodology", 2014)

"A set whose elements have degrees of membership, as opposed to a classical set." (Michael Mutingi et al, "Fuzzy System Dynamics of Manpower Systems", 2014)

"Any set that allows its members to have different grades of membership (membership function) in the interval [0,1]. A numerical value between 0 and 1 that represents the degree to which an element belongs to a particular set, also referred to as membership value." (Harish Garg, "A Hybrid GA-GSA Algorithm for Optimizing the Performance of an Industrial System by Utilizing Uncertain Data", 2015)

"It is a set of elements which have no strict boundaries." (Alexander P Ryjov & Igor F Mikhalevich, "Hybrid Intelligence Framework for Improvement of Information Security of Critical Infrastructures", 2021)
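The membership functions described above can be shown with a minimal sketch. The fuzzy set "tall people" and its 160/190 cm cutoffs are arbitrary illustrative choices:

```python
def tall_membership(height_cm: float) -> float:
    """Membership grade in the fuzzy set 'tall people': 0 below 160 cm,
    1 above 190 cm, and a linear ramp in between."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 160) / 30

# Unlike a crisp set, membership is a matter of degree in [0, 1]:
grades = {h: round(tall_membership(h), 2) for h in (150, 165, 175, 185, 195)}
```

A crisp set would force every height to 0 or 1; here, 175 cm belongs to the set with grade 0.5, directly illustrating Zadeh's "continuum of grades of membership".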

06 May 2018

🔬Data Science: Precision (Definitions)

"Precision is the ‘spread’ or variability of repeated measures of the same value." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"Defines the variation in repeated measurements of the same item. There are two major ways to measure precision - repeatability and reproducibility." (Lynne Hambleton, "Treasure Chest of Six Sigma Growth Methods, Tools, and Best Practices", 2007)

"An inherent quality characteristic that is a measure of an attribute’s having the right level of granularity in the data values." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Largest likely estimation error, measured by MOE." (Geoff Cumming, "Understanding The New Statistics", 2013)

"The level of detail included in information, such as the number of decimal places in a number, the number of pixels/inch in an image (resolution), or other measure reflecting how closely information is observed. Not to be confused with Accuracy defined elsewhere in this glossary." (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)

"Within the quality management system, precision is a measure of exactness." (For Dummies, "PMP Certification All-in-One For Dummies, 2nd Ed.", 2013)

"Precision measures the accuracy of a result set, that is, how many of the retrieved resources for a query are relevant." (Robert J Glushko, "The Discipline of Organizing: Professional Edition, 4th Ed", 2016)
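Glushko's information-retrieval sense of precision can be sketched in a few lines; the document identifiers below are hypothetical:

```python
def precision(retrieved: set, relevant: set) -> float:
    """Information-retrieval precision: the share of retrieved items
    that are actually relevant (0.0 for an empty result set)."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc1", "doc3", "doc7"}
p = precision(retrieved, relevant)  # 2 of the 4 retrieved are relevant
```

Note that this usage differs from the measurement sense in the earlier definitions, where precision is about the variability of repeated measurements rather than the relevance of results.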



About Me

Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.