SQL Troubles

23 September 2018

🔭Data Science: Computation (Just the Quotes)

"If the system exhibits a structure which can be represented by a mathematical equivalent, called a mathematical model, and if the objective can be also so quantified, then some computational method may be evolved for choosing the best schedule of actions among alternatives. Such use of mathematical models is termed mathematical programming." (George Dantzig, "Linear Programming and Extensions", 1959)

"Computers do not decrease the need for mathematical analysis, but rather greatly increase this need. They actually extend the use of analysis into the fields of computers and computation, the former area being almost unknown until recently, the latter never having been as intensively investigated as its importance warrants. Finally, it is up to the user of computational equipment to define his needs in terms of his problems, In any case, computers can never eliminate the need for problem-solving through human ingenuity and intelligence." (Richard E Bellman & Paul Brock, "On the Concepts of a Problem and Problem-Solving", American Mathematical Monthly 67, 1960)

"Cellular automata are discrete dynamical systems with simple construction but complex self-organizing behaviour. Evidence is presented that all one-dimensional cellular automata fall into four distinct universality classes. Characterizations of the structures generated in these classes are discussed. Three classes exhibit behaviour analogous to limit points, limit cycles and chaotic attractors. The fourth class is probably capable of universal computation, so that properties of its infinite time behaviour are undecidable." (Stephen Wolfram, "Nonlinear Phenomena, Universality and complexity in cellular automata", Physica 10D, 1984)

"The formal structure of a decision problem in any area can be put into four parts: (1) the choice of an objective function denning the relative desirability of different outcomes; (2) specification of the policy alternatives which are available to the agent, or decisionmaker, (3) specification of the model, that is, empirical relations that link the objective function, or the variables that enter into it, with the policy alternatives and possibly other variables; and (4) computational methods for choosing among the policy alternatives that one which performs best as measured by the objective function." (Kenneth Arrow, "The Economics of Information", 1984)

"In spite of the insurmountable computational limits, we continue to pursue the many problems that possess the characteristics of organized complexity. These problems are too important for our well being to give up on them. The main challenge in pursuing these problems narrows down fundamentally to one question: how to deal with systems and associated problems whose complexities are beyond our information processing limits? That is, how can we deal with these problems if no computational power alone is sufficient?" (George Klir, "Fuzzy sets and fuzzy logic", 1995)

"Small changes in the initial conditions in a chaotic system produce dramatically different evolutionary histories. It is because of this sensitivity to initial conditions that chaotic systems are inherently unpredictable. To predict a future state of a system, one has to be able to rely on numerical calculations and initial measurements of the state variables. Yet slight errors in measurement combined with extremely small computational errors (from roundoff or truncation) make prediction impossible from a practical perspective. Moreover, small initial errors in prediction grow exponentially in chaotic systems as the trajectories evolve. Thus, theoretically, prediction may be possible with some chaotic processes if one is interested only in the movement between two relatively close points on a trajectory. When longer time intervals are involved, the situation becomes hopeless."(Courtney Brown, "Chaos and Catastrophe Theories", 1995)

"An artificial neural network (or simply a neural network) is a biologically inspired computational model that consists of processing elements (neurons) and connections between them, as well as of training and recall algorithms." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"In science, it is a long-standing tradition to deal with perceptions by converting them into measurements. But what is becoming increasingly evident is that, to a much greater extent than is generally recognized, conversion of perceptions into measurements is infeasible, unrealistic or counter-productive. With the vast computational power at our command, what is becoming feasible is a counter-traditional move from measurements to perceptions. […] To be able to compute with perceptions it is necessary to have a means of representing their meaning in a way that lends itself to computation." (Lotfi A Zadeh, "The Birth and Evolution of Fuzzy Logic: A personal perspective", 1999)

"Theories of choice are at best approximate and incomplete. One reason for this pessimistic assessment is that choice is a constructive and contingent process. When faced with a complex problem, people employ a variety of heuristic procedures in order to simplify the representation and the evaluation of prospects. These procedures include computational shortcuts and editing operations, such as eliminating common components and discarding nonessential differences. The heuristics of choice do not readily lend themselves to formal analysis because their application depends on the formulation of the problem, the method of elicitation, and the context of choice." (Amos Tversky & Daniel Kahneman, "Advances in Prospect Theory: Cumulative Representation of Uncertainty" [in "Choices, Values, and Frames"], 2000)

"Prime numbers belong to an exclusive world of intellectual conceptions. We speak of those marvelous notions that enjoy simple, elegant description, yet lead to extreme - one might say unthinkable - complexity in the details. The basic notion of primality can be accessible to a child, yet no human mind harbors anything like a complete picture. In modern times, while theoreticians continue to grapple with the profundity of the prime numbers, vast toil and resources have been directed toward the computational aspect, the task of finding, characterizing, and applying the primes in other domains." (Richard Crandall & Carl Pomerance, "Prime Numbers: A Computational Perspective", 2001)

"Complexity Theory is concerned with the study of the intrinsic complexity of computational tasks. Its 'final' goals include the determination of the complexity of any well-defined task. Additional goals include obtaining an understanding of the relations between various computational phenomena (e.g., relating one fact regarding computational complexity to another). Indeed, we may say that the former type of goal is concerned with absolute answers regarding specific computational phenomena, whereas the latter type is concerned with questions regarding the relation between computational phenomena." (Oded Goldreich, "Computational Complexity: A Conceptual Perspective", 2008)

"Granular computing is a general computation theory for using granules such as subsets, classes, objects, clusters, and elements of a universe to build an efficient computational model for complex applications with huge amounts of data, information, and knowledge. Granulation of an object a leads to a collection of granules, with a granule being a clump of points (objects) drawn together by indiscernibility, similarity, proximity, or functionality. In human reasoning and concept formulation, the granules and the values of their attributes are fuzzy rather than crisp. In this perspective, fuzzy information granulation may be viewed as a mode of generalization, which can be applied to any concept, method, or theory." (Salvatore Greco et al, "Granular Computing and Data Mining for Ordered Data: The Dominance-Based Rough Set Approach", 2009)

"How are we to explain the contrast between the matter-of-fact way in which v-1 and other imaginary numbers are accepted today and the great difficulty they posed for learned mathematicians when they first appeared on the scene? One possibility is that mathematical intuitions have evolved over the centuries and people are generally more willing to see mathematics as a matter of manipulating symbols according to rules and are less insistent on interpreting all symbols as representative of one or another aspect of physical reality. Another, less self-congratulatory possibility is that most of us are content to follow the computational rules we are taught and do not give a lot of thought to rationales." (Raymond S Nickerson, "Mathematical Reasoning: Patterns, Problems, Conjectures, and Proofs", 2009)

"It should also be noted that the novel information generated by interactions in complex systems limits their predictability. Without randomness, complexity implies a particular non-determinism characterized by computational irreducibility. In other words, complex phenomena cannot be known a priori." (Carlos Gershenson, "Complexity", 2011)

"The notion of emergence is used in a variety of disciplines such as evolutionary biology, the philosophy of mind and sociology, as well as in computational and complexity theory. It is associated with non-reductive naturalism, which claims that a hierarchy of levels of reality exist. While the emergent level is constituted by the underlying level, it is nevertheless autonomous from the constituting level. As a naturalistic theory, it excludes non-natural explanations such as vitalistic forces or entelechy. As non-reductive naturalism, emergence theory claims that higher-level entities cannot be explained by lower-level entities." (Martin Neumann, "An Epistemological Gap in Simulation Technologies and the Science of Society", 2011)

"Black Swans (capitalized) are large-scale unpredictable and irregular events of massive consequence - unpredicted by a certain observer, and such un - predictor is generally called the 'turkey' when he is both surprised and harmed by these events. [...] Black Swans hijack our brains, making us feel we 'sort of' or 'almost' predicted them, because they are retrospectively explainable. We don’t realize the role of these Swans in life because of this illusion of predictability. […] An annoying aspect of the Black Swan problem - in fact the central, and largely missed, point - is that the odds of rare events are simply not computable." (Nassim N Taleb, "Antifragile: Things that gain from disorder", 2012)

"[…] there exists a close relation between design analysis of algorithm and computational complexity theory. The former is related to the analysis of the resources (time and/or space) utilized by a particular algorithm to solve a problem and the later is related to a more general question about all possible algorithms that could be used to solve the same problem. There are different types of time complexity for different algorithms." (Shyamalendu Kandar, "Introduction to Automata Theory, Formal Languages and Computation", 2013)

"These nature-inspired algorithms gradually became more and more attractive and popular among the evolutionary computation research community, and together they were named swarm intelligence, which became the little brother of the major four evolutionary computation algorithms." (Yuhui Shi, "Emerging Research on Swarm Intelligence and Algorithm Optimization", Information Science Reference, 2014)

"The higher the dimension, in other words, the higher the number of possible interactions, and the more disproportionally difficult it is to understand the macro from the micro, the general from the simple units. This disproportionate increase of computational demands is called the curse of dimensionality." (Nassim N Taleb, "Skin in the Game: Hidden Asymmetries in Daily Life", 2018)

"Computational complexity theory, or just complexity theory, is the study of the difficulty of computational problems. Rather than focusing on specific algorithms, complexity theory focuses on problems." (Rod Stephens, "Essential Algorithms" 2nd Ed., 2019)

19 September 2018

🔭Data Science: Features (Just the Quotes)

"The preliminary examination of most data is facilitated by the use of diagrams. Diagrams prove nothing, but bring outstanding features readily to the eye; they are therefore no substitutes for such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in explaining the conclusions founded upon them." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"Every bit of knowledge we gain and every conclusion we draw about the universe or about any part or feature of it depends finally upon some observation or measurement. Mankind has had again and again the humiliating experience of trusting to intuitive, apparently logical conclusions without observations, and has seen Nature sail by in her radiant chariot of gold in an entirely different direction." (Oliver J Lee, "Measuring Our Universe: From the Inner Atom to Outer Space", 1950)

"Probability is the mathematics of uncertainty. Not only do we constantly face situations in which there is neither adequate data nor an adequate theory, but many modem theories have uncertainty built into their foundations. Thus learning to think in terms of probability is essential. Statistics is the reverse of probability (glibly speaking). In probability you go from the model of the situation to what you expect to see; in statistics you have the observations and you wish to estimate features of the underlying model." (Richard W Hamming, "Methods of Mathematics Applied to Calculus, Probability, and Statistics", 1985)

"Complexity is not an objective factor but a subjective one. Supersignals reduce complexity, collapsing a number of features into one. Consequently, complexity must be understood in terms of a specific individual and his or her supply of supersignals. We learn supersignals from experience, and our supply can differ greatly from another individual's. Therefore there can be no objective measure of complexity." (Dietrich Dorner, "The Logic of Failure: Recognizing and Avoiding Error in Complex Situations", 1989)

"Formulation of a mathematical model is the first step in the process of analyzing the behaviour of any real system. However, to produce a useful model, one must first adopt a set of simplifying assumptions which have to be relevant in relation to the physical features of the system to be modelled and to the specific information one is interested in. Thus, the aim of modelling is to produce an idealized description of reality, which is both expressible in a tractable mathematical form and sufficiently close to reality as far as the physical mechanisms of interest are concerned." (Francois Axisa, "Discrete Systems" Vol. I, 2001)

"Graphical displays are often constructed to place principal focus on the individual observations in a dataset, and this is particularly helpful in identifying both the typical positions of data points and unusual or influential cases. However, in many investigations, principal interest lies in identifying the nature of underlying trends and relationships between variables, and so it is often helpful to enhance graphical displays in ways which give deeper insight into these features. This can be very beneficial both for small datasets, where variation can obscure underlying patterns, and large datasets, where the volume of data is so large that effective representation inevitably involves suitable summaries." (Adrian W Bowman, "Smoothing Techniques for Visualisation" [in "Handbook of Data Visualization"], 2008)

"It is impossible to construct a model that provides an entirely accurate picture of network behavior. Statistical models are almost always based on idealized assumptions, such as independent and identically distributed (i.i.d.) interarrival times, and it is often difficult to capture features such as machine breakdowns, disconnected links, scheduled repairs, or uncertainty in processing rates." (Sean Meyn, "Control Techniques for Complex Networks", 2008)

"In order to deal with these phenomena, we abstract from details and attempt to concentrate on the larger picture - a particular set of features of the real world or the structure that underlies the processes that lead to the observed outcomes. Models are such abstractions of reality. Models force us to face the results of the structural and dynamic assumptions that we have made in our abstractions." (Bruce Hannon and Matthias Ruth, "Dynamic Modeling of Diseases and Pests", 2009)

"Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression." (Pankaj Mehta & David J Schwab, "An exact mapping between the Variational Renormalization Group and Deep Learning", 2014)

"A predictive model overfits the training set when at least some of the predictions it returns are based on spurious patterns present in the training data used to induce the model. Overfitting happens for a number of reasons, including sampling variance and noise in the training set. The problem of overfitting can affect any machine learning algorithm; however, the fact that decision tree induction algorithms work by recursively splitting the training data means that they have a natural tendency to segregate noisy instances and to create leaf nodes around these instances. Consequently, decision trees overfit by splitting the data on irrelevant features that only appear relevant due to noise or sampling variance in the training data. The likelihood of overfitting occurring increases as a tree gets deeper because the resulting predictions are based on smaller and smaller subsets as the dataset is partitioned after each feature test in the path." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"Bayesian networks provide a more flexible representation for encoding the conditional independence assumptions between the features in a domain. Ideally, the topology of a network should reflect the causal relationships between the entities in a domain. Properly constructed Bayesian networks are relatively powerful models that can capture the interactions between descriptive features in determining a prediction." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"Bayesian networks use a graph-based representation to encode the structural relationships - such as direct influence and conditional independence - between subsets of features in a domain. Consequently, a Bayesian network representation is generally more compact than a full joint distribution (because it can encode conditional independence relationships), yet it is not forced to assert a global conditional independence between all descriptive features. As such, Bayesian network models are an intermediary between full joint distributions and naive Bayes models and offer a useful compromise between model compactness and predictive accuracy." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"Decision trees are also discriminative models. Decision trees are induced by recursively partitioning the feature space into regions belonging to the different classes, and consequently they define a decision boundary by aggregating the neighboring regions belonging to the same class. Decision tree model ensembles based on bagging and boosting are also discriminative models." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"There are two kinds of mistakes that an inappropriate inductive bias can lead to: underfitting and overfitting. Underfitting occurs when the prediction model selected by the algorithm is too simplistic to represent the underlying relationship in the dataset between the descriptive features and the target feature. Overfitting, by contrast, occurs when the prediction model selected by the algorithm is so complex that the model fits to the dataset too closely and becomes sensitive to noise in the data." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"The power of deep learning models comes from their ability to classify or predict nonlinear data using a modest number of parallel nonlinear steps4. A deep learning model learns the input data features hierarchy all the way from raw data input to the actual classification of the data. Each layer extracts features from the output of the previous layer." (N D Lewis, "Deep Learning Made Easy with R: A Gentle Introduction for Data Science", 2016)

"Decision trees are important for a few reasons. First, they can both classify and regress. It requires literally one line of code to switch between the two models just described, from a classification to a regression. Second, they are able to determine and share the feature importance of a given training set." (Russell Jurney, "Agile Data Science 2.0: Building Full-Stack Data Analytics Applications with Spark", 2017)

"Extracting good features is the most important thing for getting your analysis to work. It is much more important than good machine learning classifiers, fancy statistical techniques, or elegant code. Especially if your data doesn’t come with readily available features (as is the case with web pages, images, etc.), how you reduce it to numbers will make the difference between success and failure." (Field Cady, "The Data Science Handbook", 2017)

"Feature extraction is also the most creative part of data science and the one most closely tied to domain expertise. Typically, a really good feature will correspond to some real‐world phenomenon. Data scientists should work closely with domain experts and understand what these phenomena mean and how to distill them into numbers." (Field Cady, "The Data Science Handbook", 2017)

"Variables which follow symmetric, bell-shaped distributions tend to be nice as features in models. They show substantial variation, so they can be used to discriminate between things, but not over such a wide range that outliers are overwhelming." (Steven S Skiena, "The Data Science Design Manual", 2017)

"The idea behind deeper architectures is that they can better leverage repeated regularities in the data patterns in order to reduce the number of computational units and therefore generalize the learning even to areas of the data space where one does not have examples. Often these repeated regularities are learned by the neural network within the weights as the basis vectors of hierarchical features." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"We humans are reasonably good at defining rules that check one, two, or even three attributes (also commonly referred to as features or variables), but when we go higher than three attributes, we can start to struggle to handle the interactions between them. By contrast, data science is often applied in contexts where we want to look for patterns among tens, hundreds, thousands, and, in extreme cases, millions of attributes." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Any machine learning model is trained based on certain assumptions. In general, these assumptions are the simplistic approximations of some real-world phenomena. These assumptions simplify the actual relationships between features and their characteristics and make a model easier to train. More assumptions means more bias. So, while training a model, more simplistic assumptions = high bias, and realistic assumptions that are more representative of actual phenomena = low bias." (Imran Ahmad, "40 Algorithms Every Programmer Should Know", 2020)

"Smoothing and aggregating can help us see important features and relationships, but when we have only a handful of observations, smoothing techniques can be misleading. With just a few observations, we prefer rug plots over histograms, box plots, and density curves, and we use scatterplots rather than smooth curves and density contours. This may seem obvious, but when we have a large amount of data, the amount of data in a subgroup can quickly dwindle. This phenomenon is an example of the curse of dimensionality." (Sam Lau et al, "Learning Data Science: Data Wrangling, Exploration, Visualization, and Modeling with Python", 2023)

18 September 2018

🔭Data Science: Regularities (Just the Quotes)

"By sampling we can learn only about collective properties of populations, not about properties of individuals. We can study the average height, the percentage who wear hats, or the variability in weight of college juniors [...]. The population we study may be small or large, but there must be a population - and what we are studying must be a population characteristic. By sampling, we cannot study individuals as particular entities with unique idiosyncrasies; we can study regularities (including typical variabilities as well as typical levels) in a population as exemplified by the individuals in the sample." (Frederick Mosteller et al, "Principles of Sampling", Journal of the American Statistical Association Vol. 49 (265), 1954)

"Theories are usually introduced when previous study of a class of phenomena has revealed a system of uniformities. […] Theories then seek to explain those regularities and, generally, to afford a deeper and more accurate understanding of the phenomena in question. To this end, a theory construes those phenomena as manifestations of entities and processes that lie behind or beneath them, as it were." (Carl G Hempel, "Philosophy of Natural Science", 1966)

"System' is the concept that refers both to a complex of interdependencies between parts, components, and processes, that involves discernible regularities of relationships, and to a similar type of interdependency between such a complex and its surrounding environment." (Talcott Parsons, "Systems Analysis: Social Systems", 1968)

"The dynamics of any system can be explained by showing the relations between its parts and the regularities of their interactions so as to reveal its organization. For us to fully understand it, however, we need not only to see it as a unity operating in its internal dynamics, but also to see it in its circumstances, i.e., in the context to which its operation connects it. This understanding requires that we adopt a certain distance for observation, a perspective that in the case of historical systems implies a reference to their origin. This can be easy, for instance, in the case of man-made machines, for we have access to every detail of their manufacture. The situation is not that easy, however, as regards living beings: their genesis and their history are never directly visible and can be reconstructed only by fragments." (Humberto Maturana, "The Tree of Knowledge", 1987)

"The term chaos is used in a specific sense where it is an inherently random pattern of behaviour generated by fixed inputs into deterministic (that is fixed) rules (relationships). The rules take the form of non-linear feedback loops. Although the specific path followed by the behaviour so generated is random and hence unpredictable in the long-term, it always has an underlying pattern to it, a 'hidden' pattern, a global pattern or rhythm. That pattern is self-similarity, that is a constant degree of variation, consistent variability, regular irregularity, or more precisely, a constant fractal dimension. Chaos is therefore order (a pattern) within disorder (random behaviour)." (Ralph D Stacey, "The Chaos Frontier: Creative Strategic Control for Business", 1991)

"A measure that corresponds much better to what is usually meant by complexity in ordinary conversation, as well as in scientific discourse, refers not to the length of the most concise description of an entity (which is roughly what AIC [algorithmic information content] is), but to the length of a concise description of a set of the entity’s regularities. Thus something almost entirely random, with practically no regularities, would have effective complexity near zero. So would something completely regular, such as a bit string consisting entirely of zeroes. Effective complexity can be high only a region intermediate between total order and complete." (Murray Gell-Mann, "What is Complexity?", Complexity Vol 1 (1), 1995)

"The second law of thermodynamics, which requires average entropy (or disorder) to increase, does not in any way forbid local order from arising through various mechanisms of self-organization, which can turn accidents into frozen ones producing extensive regularities. Again, such mechanisms are not restricted to complex adaptive systems." (Murray Gell-Mann, "What is Complexity?", Complexity Vol 1 (1), 1995)

"A form of machine learning in which the goal is to identify regularities in the data. These regularities may include clusters of similar instances within the data or regularities between attributes. In contrast to supervised learning, in unsupervised learning no target attribute is defined in the data set." (John D Kelleher & Brendan Tierney, "Data science", 2018)

"Unexpected phenomena appearing (and often having a regularity or pattern) from a collection of apparently unrelated elements and where the elements themselves do not have the characteristics of the phenomena and that phenomena itself is not contained deductively within the elements." (Jeremy Horne, "Visualizing Big Data From a Philosophical Perspective", Handbook of Research on Big Data Storage and Visualization Techniques, 2018)

17 September 2018

🔭Data Science: Underfitting (Just the Quotes)

"A smaller model with fewer covariates has two advantages: it might give better predictions than a big model and it is more parsimonious (simpler). Generally, as you add more variables to a regression, the bias of the predictions decreases and the variance increases. Too few covariates yields high bias; this called underfitting. Too many covariates yields high variance; this called overfitting. Good predictions result from achieving a good balance between bias and variance. […] finding a good model involves trading of fit and complexity." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"When generating trees, it is usually optimal to grow a larger tree than is justifiable and then prune it back. The main reason this works well is that stop splitting rules do not look far enough forward. That is, stop splitting rules tend to underfit, meaning that even if a rule stops at a split for which the next candidate splits give little improvement, it may be that splitting them one layer further will give a large improvement in accuracy." (Bertrand Clarke et al, "Principles and Theory for Data Mining and Machine Learning", 2009)

"Briefly speaking, to solve a Machine Learning problem means you optimize a model to fit all the data from your training set, and then you use the model to predict the results you want. Therefore, evaluating a model need to see how well it can be used to predict the data out of the training set. Usually there are three types of the models: underfitting, fair and overfitting model [...]. If we want to predict a value, both (a) and (c) in this figure cannot work well. The underfitting model does not capture the structure of the problem at all, and we say it has high bias. The overfitting model tries to fit every sample in the training set and it did it, but we say it is of high variance. In other words, it fails to generalize new data." (Shudong Hao, "A Beginner’s Tutorial for Machine Learning Beginners", 2014)

"There are two kinds of mistakes that an inappropriate inductive bias can lead to: underfitting and overfitting. Underfitting occurs when the prediction model selected by the algorithm is too simplistic to represent the underlying relationship in the dataset between the descriptive features and the target feature. Overfitting, by contrast, occurs when the prediction model selected by the algorithm is so complex that the model fits to the dataset too closely and becomes sensitive to noise in the data."(John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"Underfitting is when a model doesn’t take into account enough information to accurately model real life. For example, if we observed only two points on an exponential curve, we would probably assert that there is a linear relationship there. But there may not be a pattern, because there are only two points to reference. [...] It seems that the best way to mitigate underfitting a model is to give it more information, but this actually can be a problem as well. More data can mean more noise and more problems. Using too much data and too complex of a model will yield something that works for that particular data set and nothing else." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"Bias is error from incorrect assumptions built into the model, such as restricting an interpolating function to be linear instead of a higher-order curve. [...] Errors of bias produce underfit models. They do not fit the training data as tightly as possible, were they allowed the freedom to do so. In popular discourse, I associate the word 'bias' with prejudice, and the correspondence is fairly apt: an apriori assumption that one group is inferior to another will result in less accurate predictions than an unbiased one. Models that perform lousy on both training and testing data are underfit." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Bias occurs normally when the model is underfitted and has failed to learn enough from the training data. It is the difference between the mean of the probability distribution and the actual correct value. Hence, the accuracy of the model is different for different data sets (test and training sets). To reduce the bias error, data scientists repeat the model-building process by resampling the data to obtain better prediction values." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"High-bias models typically produce simpler models that do not overfit and in those cases the danger is that of underfitting. Models with low-bias are typically more complex and that complexity enables us to represent the training data in a more accurate way. The danger here is that the flexibility provided by higher complexity may end up representing not only a relationship in the data but also the noise. Another way of portraying the bias-variance trade-off is in terms of complexity v simplicity." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"If either bias or variance is high, the model can be very far off from reality. In general, there is a trade-off between bias and variance. The goal of any machine-learning algorithm is to achieve low bias and low variance such that it gives good prediction performance. In reality, because of so many other hidden parameters in the model, it is hard to calculate the real bias and variance error. Nevertheless, the bias and variance provide a measure to understand the behavior of the machine-learning algorithm so that the model model can be adjusted to provide good prediction performance." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Overfitting and underfitting are two important factors that could impact the performance of machine-learning models. Overfitting occurs when the model performs well with training data and poorly with test data. Underfitting occurs when the model is so simple that it performs poorly with both training and test data. [...] When the model does not capture and fit the data, it results in poor performance. We call this underfitting. Underfitting is the result of a poor model that typically does not perform well for any data." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Overfitting refers to the phenomenon where a model is highly fitted on a dataset. This generalization thus deprives the model from making highly accurate predictions about unseen data. [...] Underfitting is a phenomenon where the model is not trained with high precision on data at hand. The treatment of underfitting is subject to bias and variance. A model will have a high bias if both train and test errors are high [...] If a model has a high bias type underfitting, then the remedy can be to increase the model complexity, and if a model is suffering from high variance type underfitting, then the cure can be to bring in more data or otherwise make the model less complex." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"The tension between bias and variance, simplicity and complexity, or underfitting and overfitting is an area in the data science and analytics process that can be closer to a craft than a fixed rule. The main challenge is that not only is each dataset different, but also there are data points that we have not yet seen at the moment of constructing the model. Instead, we are interested in building a strategy that enables us to tell something about data from the sample used in building the model." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"Even though a natural way of avoiding overfitting is to simply build smaller networks (with fewer units and parameters), it has often been observed that it is better to build large networks and then regularize them in order to avoid overfitting. This is because large networks retain the option of building a more complex model if it is truly warranted. At the same time, the regularization process can smooth out the random artifacts that are not supported by sufficient data. By using this approach, we are giving the model the choice to decide what complexity it needs, rather than making a rigid decision for the model up front (which might even underfit the data)." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"One of the most common problems that you will encounter when training deep neural networks will be overfitting. What can happen is that your network may, owing to its flexibility, learn patterns that are due to noise, errors, or simply wrong data. [...] The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented the underlying model structure. The opposite is called underfitting - when the model cannot capture the structure of the data." (Umberto Michelucci, "Applied Deep Learning: A Case-Based Approach to Understanding Deep Neural Networks", 2018)

"The high generalization error in a neural network may be caused by several reasons. First, the data itself might have a lot of noise, in which case there is little one can do in order to improve accuracy. Second, neural networks are hard to train, and the large error might be caused by the poor convergence behavior of the algorithm. The error might also be caused by high bias, which is referred to as underfitting. Finally, overfitting (i.e., high variance) may cause a large part of the generalization error. In most cases, the error is a combination of more than one of these different factors." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"The trick is to walk the line between underfitting and overfitting. An underfit model has low variance, generally making the same predictions every time, but with extremely high bias, because the model deviates from the correct answer by a significant amount. Underfitting is symptomatic of not having enough data points, or not training a complex enough model. An overfit model, on the other hand, has memorized the training data and is completely accurate on data it has seen before, but varies widely on unseen data. Neither an overfit nor underfit model is generalizable - that is, able to make meaningful predictions on unseen data." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"Any fool can fit a statistical model, given the data and some software. The real challenge is to decide whether it actually fits the data adequately. It might be the best that can be obtained, but still not good enough to use." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

16 September 2018

🔭Data Science: Statistical Modeling (Just the Quotes)

"The most widely used mathematical tools in the social sciences are statistical, and the prevalence of statistical methods has given rise to theories so abstract and so hugely complicated that they seem a discipline in themselves, divorced from the world outside learned journals. Statistical theories usually assume that the behavior of large numbers of people is a smooth, average 'summing-up' of behavior over a long period of time. It is difficult for them to take into account the sudden, critical points of important qualitative change. The statistical approach leads to models that emphasize the quantitative conditions needed for equilibrium - a balance of wages and prices, say, or of imports and exports. These models are ill suited to describe qualitative change and social discontinuity, and it is here that catastrophe theory may be especially helpful." (Alexander Woodcock & Monte Davis, "Catastrophe Theory", 1978)

"Statistical models for data are never true. The question whether a model is true is irrelevant. A more appropriate question is whether we obtain the correct scientific conclusion if we pretend that the process under study behaves according to a particular statistical model." (Scott Zeger, "Statistical reasoning in epidemiology", American Journal of Epidemiology, 1991)

"[…] it does not seem helpful just to say that all models are wrong. The very word model implies simplification and idealization. The idea that complex physical, biological or sociological systems can be exactly described by a few formulae is patently absurd. The construction of idealized representations that capture important stable aspects of such systems is, however, a vital part of general scientific analysis and statistical models, especially substantive ones, do not seem essentially different from other kinds of model." (Sir David Cox, "Comment on ‘Model uncertainty, data mining and statistical inference’", Journal of the Royal Statistical Society, Series A 158, 1995)

"Building statistical models is just like this. You take a real situation with real data, messy as this is, and build a model that works to explain the behavior of real data." (Martha Stocking, New York Times, 2000)

"The role of graphs in probabilistic and statistical modeling is threefold: (1) to provide convenient means of expressing substantive assumptions; (2) to facilitate economical representation of joint probability functions; and (3) to facilitate efficient inferences from observations." (Judea Pearl, "Causality: Models, Reasoning, and Inference", 2000)

"Statistical cognition is concerned with obtaining cognitive evidence about various statistical techniques and ways to present data. It’s certainly important to choose an appropriate statistical model, use the correct formulas, and carry out accurate calculations. It’s also important, however, to focus on understanding, and to consider statistics as communication between researchers and readers." (Geoff Cumming, "Understanding the New Statistics", 2012)

"Statistical models in the social sciences rely on correlations, generally not causes, of our behavior. It is inevitable that such models of reality do not capture reality well. This explains the excess of false positives and false negatives." (Kaiser Fung, "Numbersense: How To Use Big Data To Your Advantage", 2013

"In general, when building statistical models, we must not forget that the aim is to understand something about the real world. Or predict, choose an action, make a decision, summarize evidence, and so on, but always about the real world, not an abstract mathematical world: our models are not the reality - a point well made by George Box in his oft-cited remark that "all models are wrong, but some are useful". (David Hand, "Wonderful examples, but let's not close our eyes", Statistical Science 29, 2014)

"Once a model has been fitted to the data, the deviations from the model are the residuals. If the model is appropriate, then the residuals mimic the true errors. Examination of the residuals often provides clues about departures from the modeling assumptions. Lack of fit - if there is curvature in the residuals, plotted versus the fitted values, this suggests there may be whole regions where the model overestimates the data and other whole regions where the model underestimates the data. This would suggest that the current model is too simple relative to some better model." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"Prediction about the future assumes that the statistical model will continue to fit future data. There are several reasons this is often implausible, but it also seems clear that the model will often degenerate slowly in quality, so that the model will fit data only a few periods in the future almost as well as the data used to fit the model. To some degree, the reliability of extrapolation into the future involves subject-matter expertise." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"The random element in most data analysis is assumed to be white noise - normal errors independent of each other. In a time series, the errors are often linked so that independence cannot be assumed (the last examples). Modeling the nature of this dependence is the key to time series." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"A statistical model is a relatively simple approximation to account for complex phenomena that generate data. A statistical model consists of one or more equations involving both random variables and parameters. The random variables have stated or assumed distributions. The parameters are unknown fixed quantities. The random components of statistical models account for the inherent variability in most observed phenomena." (Richard M Heiberger & Burt Holland, "Statistics Concepts", 2015)

"An oft-repeated rule of thumb in any sort of statistical model fitting is 'you can't fit a model with more parameters than data points'. This idea appears to be as wide-spread as it is incorrect. On the contrary, if you construct your models carefully, you can fit models with more parameters than datapoints [...]. A model with more parameters than datapoints is known as an under-determined system, and it's a common misperception that such a model cannot be solved in any circumstance. [...] this misconception, which I like to call the 'model complexity myth' [...] is not true in general, it is true in the specific case of simple linear models, which perhaps explains why the myth is so pervasive." (Jake Vanderplas, "The Model Complexity Myth", 2015) [source]

"Machine learning takes many different forms and goes by many different names: pattern recognition, statistical modeling, data mining, knowledge discovery, predictive analytics, data science, adaptive systems, self-organizing systems, and more. Each of these is used by different communities and has different associations. Some have a long half-life, some less so." (Pedro Domingos, "The Master Algorithm", 2015)

"In machine learning, knowledge is often in the form of statistical models, because most knowledge is statistical [...] Machine learning is a kind of knowledge pump: we can use it to extract a lot of knowledge from data, but first we have to prime the pump." (Pedro Domingos, "The Master Algorithm", 2015)

"One final warning about the use of statistical models (whether linear or otherwise): The estimated model describes the structure of the data that have been observed. It is unwise to extend this model very far beyond the observed data." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"The central limit conjecture states that most errors are the result of many small errors and, as such, have a normal distribution. The assumption of a normal distribution for error has many advantages and has often been made in applications of statistical models." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"When we use algebraic notation in statistical models, the problem becomes more complicated because we cannot 'observe' a probability and know its exact number. We can only estimate probabilities on the basis of observations." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Statistical models have two main components. First, a mathematical formula that expresses a deterministic, predictable component, for example the fitted straight line that enables us to make a prediction [...]. But the deterministic part of a model is not going to be a perfect representation of the observed world [...] and the difference between what the model predicts, and what actually happens, is the second component of a model and is known as the residual error - although it is important to remember that in statistical modelling, ‘error’ does not refer to a mistake, but the inevitable inability of a model to exactly represent what we observe." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

🔭Data Science: Principles (Just the Quotes)

"In all disciplines in which there is systematic knowledge of things with principles, causes, or elements, it arises from a grasp of those: we think we have knowledge of a thing when we have found its primary causes and principles, and followed it back to its elements." (Aristotle, "Physics", cca. 350 BC)

"[…] the least initial deviation from the truth is multiplied later a thousand-fold. Admit, for instance, the existence of a minimum magnitude, and you will find that the minimum which you have introduced, small as it is, causes the greatest truths of mathematics to totter. The reason is that a principle is great rather in power than in extent; hence that which was small at the start turns out a giant at the end." (St. Thomas Aquinas, "De Ente et Essentia", cca. 1252)

"It is superfluous to suppose that what can be accounted for by a few principles has been produced by many." (Thomas Aquinas, "Summa Theologica", cca. 1266-1273)

"Reality cannot be found except in One single source, because of the interconnection of all things with one another. […] It is a good thing to proceed in order and to establish propositions (principles). This is the way to gain ground and to progress with certainty." (Gottfried Leibniz, 1670)

"Every science has for its basis a system of principles as fixed and unalterable as those by which the universe is regulated and governed. Man cannot make principles; he can only discover them." (Thomas Paine, "The Age of Reason", 1794)

"A maxim is a conclusion upon observation of matters of fact, and is merely speculative; a ‘principle’ carries knowledge within itself, and is prospective." (Samuel T Coleridge, "The Table Talk and Omniana of Samuel Taylor Coleridge", 1831)

"The function of theory is to put all this in systematic order, clearly and comprehensively, and to trace each action to an adequate, compelling cause. […] Theory should cast a steady light on all phenomena so that we can more easily recognize and eliminate the weeds that always spring from ignorance; it should show how one thing is related to another, and keep the important and the unimportant separate. If concepts combine of their own accord to form that nucleus of truth we call a principle, if they spontaneously compose a pattern that becomes a rule, it is the task of the theorist to make this clear." (Carl von Clausewitz, "On War", 1832)

"In the original discovery of a proposition of practical utility, by deduction from general principles and from experimental data, a complex algebraical investigation is often not merely useful, but indispensable; but in expounding such a proposition as a part of practical science, and applying it to practical purposes, simplicity is of the importance: - and […] the more thoroughly a scientific man has studied higher mathematics, the more fully does he become aware of this truth – and […] the better qualified does he become to free the exposition and application of principles from mathematical intricacy." (William J M Rankine, "On the Harmony of Theory and Practice in Mechanics", 1856)

"The more man inquires into the laws which regulate the material universe, the more he is convinced that all its varied forms arise from the action of a few simple principles." (Charles Babbage, "Passages From the Life of a Philosopher", 1864)

"As in the experimental sciences, truth cannot be distinguished from error as long as firm principles have not been established through the rigorous observation of facts." (Louis Pasteur, "Étude sur la maladie des vers à soie", 1870)

"It is of the nature of true science to take nothing on trust or on authority. Every fact must be established by accurate observation, experiment, or calculation. Every law and principle must rest on inductive argument." (Sir John W Dawson, "The Chain of Life in Geological Time", 1880)

"A modern mathematical proof is not very different from a modern machine, or a modern test setup: the simple fundamental principles are hidden and almost invisible under a mass of technical details." (Hermann Weyl, "Unterrichtsblätter für Mathematik und Naturwissenschaften", 1932)

"The fundamental gospel of statistics is to push back the domain of ignorance, prejudice, rule-of-thumb, arbitrary or premature decisions, tradition, and dogmatism and to increase the domain in which decisions are made and principles are formulated on the basis of analyzed quantitative facts." (Robert W Burgess, "The Whole Duty of the Statistical Forecaster", Journal of the American Statistical Association 32 (200), 1937)

"It is always more easy to discover and proclaim general principles than it is to apply them." (Winston Churchill, "The Second World War: The gathering storm", 1948)

"The method of guessing the equation seems to be a pretty effective way of guessing new laws. This shows again that mathematics is a deep way of expressing nature, and any attempt to express nature in philosophical principles, or in seat-of-the-pants mechanical feelings, is not an efficient way." (Richard Feynman, "The Character of Physical Law", 1965)

"No theory ever agrees with all the facts in its domain, yet it is not always the theory that is to blame. Facts are constituted by older ideologies, and a clash between facts and theories may be proof of progress. It is also a first step in our attempt to find the principles implicit in familiar observational notions." (Paul K Feyerabend, "Against Method: Outline of an Anarchistic Theory of Knowledge", 1975)

"Real progress in understanding nature is rarely incremental. All important advances are sudden intuitions, new principles, new ways of seeing." (Marilyn Ferguson, "The Aquarian Conspiracy: Personal and Social Transformation in the 1980s", 1980)

"The word theory, as used in the natural sciences, doesn’t mean an idea tentatively held for purposes of argument - that we call a hypothesis. Rather, a theory is a set of logically consistent abstract principles that explain a body of concrete facts. It is the logical connections among the principles and the facts that characterize a theory as truth. No one element of a theory [...] can be changed without creating a logical contradiction that invalidates the entire system. Thus, although it may not be possible to substantiate directly a particular principle in the theory, the principle is validated by the consistency of the entire logical structure." (Alan Cromer, "Uncommon Sense: The Heretical Nature of Science", 1993)

"Engineering is the application of scientific principles toward practical ends. If the engineering isn't practical, it's bad engineering." (Steve McConnell, "After the Gold Rush: Creating a True Profession of Software Engineering", 1999)

"A small error in the beginning (or in principles) leads to a big error in the end (or in conclusions)." (ancient axiom)

15 September 2018

🔭Data Science: Simplicity (Just the Quotes)

"We consider it a good principle to explain the phenomena by the simplest hypothesis possible." (Ptolemy)

"Simplicity of structure means organic unity, whether the organism be simple or complex; and hence in all times the emphasis which critics have laid upon Simplicity, though they have not unfrequently confounded it with narrowness of range." (George H Lewes, "The Principles of Success in Literature", 1865)

"The first obligation of Simplicity is that of using the simplest means to secure the fullest effect. But although the mind instinctlvely rejects all needless complexity, we shall greatly err if we fail to recognise the fact, that what the mind recoils from is not the complexity, but the needlessness." (George H Lewes, "The Principles of Success in Literature", 1865)

"Simplicity and precision ought to be the characteristics of a scientific nomenclature: words should signify things, or the analogies of things, and not opinions." (Sir Humphry Davy, Elements of Chemical Philosophy", 1812)

"The aim of science is always to reduce complexity to simplicity." (William James, "The Principles of Psychology", 1890)

"Let us notice first of all, that every generalization implies in some measure the belief in the unity and simplicity of nature." (Jules H Poincaré, "Science and Hypothesis", 1905)

"Simplicity is the soul of efficiency." (Austin Freeman, "The Eye of Osiris", 1911)

"A theory is the more impressive the greater the simplicity of its premises is, the more different kinds of things it relates, and the more extended its area of applicability." (Albert Einstein, "Autobiographical Notes", 1949)

"As shorthand, when the phenomena are suitably simple, words such as equilibrium and stability are of great value and convenience. Nevertheless, it should be always borne in mind that they are mere shorthand, and that the phenomena will not always have the simplicity that these words presuppose." (W Ross Ashby, "An Introduction to Cybernetics", 1956)

"Scientists whose work has no clear, practical implications would want to make their decisions considering such things as: the relative worth of (1) more observations, (2) greater scope of his conceptual model, (3) simplicity, (4) precision of language, (5) accuracy of the probability assignment." (C West Churchman, "Costs, Utilities, and Values", 1956)

"The central task of a natural science is to make the wonderful commonplace: to show that complexity, correctly viewed, is only a mask for simplicity; to find pattern hidden in apparent chaos." (Herbert A Simon, "The Sciences of the Artificial", 1969)

"For if as scientists we seek simplicity, then obviously we try the simplest surviving theory first, and retreat from it only when it proves false. Not this course, but any other, requires explanation. If you want to go somewhere quickly, and several alternate routes are equally likely to be open, no one asks why you take the shortest. The simplest theory is to be chosen not because it is the most likely to be true but because it is scientifically the most rewarding among equally likely alternatives. We aim at simplicity and hope for truth." (Nelson Goodman, "Problems and Projects", 1972)

"Simplicity does not precede complexity, but follows it." (Alan Perlis, "Epigrams on Programming", 1982)

"Organized simplicity occurs where a small number of significant factors and a large number of insignificant factors appear initially to be complex, but on investigation display hidden simplicity." (Robert L Flood & Ewart R Carson, "Dealing with Complexity: An introduction to the theory and application of systems", 1988)

"The amount of understanding produced by a theory is determined by how well it meets the criteria of adequacy - testability, fruitfulness, scope, simplicity, conservatism - because these criteria indicate the extent to which a theory systematizes and unifies our knowledge." (Theodore Schick Jr., "How to Think about Weird Things: Critical Thinking for a New Age", 1995)

"[…] the simplest hypothesis proposed as an explanation of phenomena is more likely to be the true one than is any other available hypothesis, that its predictions are more likely to be true than those of any other available hypothesis, and that it is an ultimate a priori epistemic principle that simplicity is evidence for truth." (Richard Swinburne, "Simplicity as Evidence for Truth", 1997)

"Simplicity isn’t just about reduction. It can also be about augmentation. It consists of removing what isn’t relevant from our models but also of bringing in those elements that are essential to making those models truer." (John Maeda, "The Laws of Simplicity", 2006)

More quotes on "Simplicity" at the-web-of-knowledge.blogspot.com.

30 July 2018

🧱IT: Performance (Definitions)

"quantitative measures of the execution of some element or set of elements, such as average execution time or memory usage." (Bruce P Douglass, "Real-Time Agility: The Harmony/ESW Method for Real-Time and Embedded Systems Development", 2009)

"Key characteristic of a computer system upon which it is compared to other systems." (Max Domeika, "Software Development for Embedded Multi-core Systems", 2011)

"A measure of the degree to which a system or component accomplishes designated functions within given constraints, such as accuracy, time, or resource usage." (Richard D Stutzke, "Estimating Software-Intensive Systems: Projects, Products, and Processes", 2005)

"The degree to which a system or component accomplishes its designated functions within given constraints regarding processing time and throughput rate" (IEEE 610)

"The capability of the service-oriented systems to achieve its functionality well, which can be measured by the throughput and the response time." (Ye Wang et al, "Towards an Understanding of Performance, Reliability, and Security", Encyclopedia of Information Science and Technology, 4th Ed., 2018)

22 June 2018

📦Data Migrations (DM): Approaches to Planning a Data Migration for an ERP Implementation

Data Migrations Series

Introduction

ERP implementations are one of the most complex projects to plan as they often imply changes/transformations at different levels (e.g. strategic, processes, data, cultural, technological), span over one or more years, involve many resources that need to be efficiently managed, and often come with important costs for the organization.

One way of handling complexity is to ignore the nonessential in planning by focusing on the important activities/phases, following to go deeper as the project progresses. Another way to handle complexity is to split it at manageable parts – identifying and grouping together components. For example, Data Migration (DM) and Data Quality (DQ) are managed as subprojects, with their own planning. The two strategies can be combined to increase the effect.

Planning a DM cannot be done without looking at the timelines of the ERP implementation and considering the various interfaces to the DQ, however in this post I will focus only on the first two.

The Context

In the context of an ERP implementation there are three main approaches to the planning of a DM – pushing the activities toward the end of the implementation, pushing most activities toward the beginning of the implementation, or splitting the various activities over the whole timeline of the ERP implementation. Borrowing a term from statistics, we can talk about a left-skewed plan, a right-skewed plan, respectively a uniform-distributed plan.

For exemplification I will use a set of Lego pieces grouped together in 3 rows and representing the main phases of an ERP implementation, DM, respectively DQ:

The Lego pieces are a good tool for representing the phases, even they can be mischievous because there’s often no clear delimitation between certain phases as they overlap or repeat over several iterations, and bricks’ length doesn’t necessarily represent the actual duration of the phases. In addition, the phases are oversimplified in order not to clutter the diagrams. The detailed phases will be considered in further posts. The color changes gradually as the activities get closer toward the end.

Left-Skewed Planning

One way to plan a DM is backwards from the Go-Live, the DM activities flowing continuously backwards (the DM bricks are arranged from the end over the implementation bricks) and thus accumulating toward the end of the ERP implementation. This approach is natural considering that the requirements stabilize in the second half of the implementation, the code freeze occurring toward the end. Stabilizing means that the most important code changes were performed, and only minor changes need to be performed, typically bug fixing, refactoring or last-minute changes.

The DM starts with a set of requirements in what concerns the business processes, the data and configuration. Each change to these requirements equate with overwork that need to be performed. Typically, this happens only at entity level, however there can be changes that have impact for the whole or important parts of the migration. Additionally, after each set of changes another dry-run needs to be performed in order test the changes. Therefore, to minimize the volume of rework a DM needs a stable environment – in other words a stabile data model, configuration and requirements.

Thus, the conceptualizing of the migration including the prototyping can start in the first half of the implementation or, for long-running projects, even in the second half of the implementation. Performing the DM without interruptions assures an optimal use and planning of the resources – the resources are continuously working on the project, they are focused toward the end.

On the other side, the accumulation of the activities toward the end can easily lead to problems in what concerns resources’ availability. This type of approach needs a good planning, otherwise the project runs into the risk of having the Go-Live delayed until the stability of the DM is assured, or of going Live with data that don’t have the expected quality. These risks can be alleviated by adding a puffer to DM’s timeline, or by considering one of the other two approaches.

This approach minimizes the various types of waste associated with software projects, and thus the costs associated with waste.

Right-Skewed Planning

To release the pressure existing toward the end of the project, some of the DM activities can be performed toward the beginning of the project – the conceptualization and prototyping, as well the data mapping. One can in theory build also an important part of the DM for the standard functionality, following to address the changes in data model, processes and configuration in subsequent iterations. This could involve a higher volume of rework and more dry-runs, however this depends on the complexity and number of the changes. If a small number of customizations are expected, then this approach may be the best approach. Even in the case of many customization this approach might be something to consider, however the DM costs increase with the number of customization made, and in certain contexts the increase can be exponential.

This approach pushes some of the costs toward the beginning of the implementation, and this can have positive as well negative aspects. For example, it is well-known that ERP implementations involve cost overruns. With this approach the DM costs are assured toward the beginning and one can better get a hold of the budget, at least in theory. As negative aspect could be considered the cases in which an ERP implementation is stopped toward the middle of the project, the incurred costs being thus higher. In the end the main cost-driver are the volume of customizations.

Breaking a DM in two can have several other negative aspects. The data cleaning needs to be broken eventually as well, most probably in the second phase more data enrichment activities need to be considered.

The resources that worked on the first phase might not be available for the second phase. An adequate knowledge transfer might be hard to make, so the second team might need good documentation or time to understand the solution. This can lead to other type of behavior, e.g. rewriting unnecessarily the code, the push for a redesign, and so on.

As the environment stabilizes much later, there is the risk that an important part of the migration need to be reworked/redesigned. In extreme cases might be needed to start from zero. The chances for this to happen are small, though such a case can occur. Probably some of the code, transformation can be reused, though this depends from case to case. Without knowing implementation’s details it’s difficult to estimate the chances for something like this to happen. Sometimes is enough to invalidate a premise considered in design phase. Usually the interplay between several new requirements lead to redesign.

Uniform-distributed Plan

To alleviate the risks from the first two approaches, some of the activities could be uniformly distributed over the whole duration of the ERP implementation. This approach works well when same resources from the vendor side are involved in activities from all the three layers, the nature of the tasks allowing them to work continuously in the project. For example, the consultants working on the DM concept are helping on the mapping of the attributes, as well on data cleansing. When the work on multiple activities isn’t possible then the vendor(s) more likely will have problems in assigning resources to the project. Either the same resources will be assigned for big parts from the projects, incurring thus higher costs, or the resources will be replaced by others, additional learning being involved. In either case the costs are higher.

One of the main dangers of this approach is that certain activities will expand taking the time available, incurring thus higher costs. When the Implementation time is much higher than DM’s duration, the distance between DM’s phases can increase dramatically, being almost impossible to manage resources adequately. Keeping the metaphor of the Lego pieces, it will be thus also more difficult to build a structure on which an edifice can be built. With proper planning and adequate use of resources and knowledge the empty spaces can be incorporated in the structure for project’s advantage.

Even if this approach attempts to even the DM effort over the whole duration of the ERP implementation, performing the activities too early, before requirements stabilized can have an adverse effect.

Personal Approach

Looking back at the projects I worked on, I think I used a hybrid between the 3 approaches. The DM was planed backwards from the Go Live, however the first draft of the DM concept and prototyping was performed at the beginning of the implementation. This assured that the technical solution was working. Being involved in the creation of the data mappings as well in data cleansing, the jumps between activities allowed me to smoothly switch between the various activities, however toward the end of the project this became a bottleneck, the activities being harder to synchronize, and the volume of work could be addressed at that time only with overtime.

With a few exceptions I worked mainly alone on the technical activities, being responsible for the data mappings, design, prototyping, implementation, testing, protocolling, and execution of the DM. I think that more resources would have removed some of the burden but made the planning more complicated and the synchronization even harder. Probably a team of 2-3 people that could cover these activities would provide the optimum balance between costs, effort and quality.

Conclusion

I suppose there is no best solution that will work for all. The three approaches are more an attempt to highlight some of the extreme usages of planning. In an ERP implementation there are so many factors, so many chances for a decision to be an opportunity or a threat. My advice – ponder the various aspects/constraints, choose an approach, and adjust it as the project advances.

Previous Post <<||>> Next Post

20 June 2018

💫ERP Systems: Dynamics AX 2009 (Part I: Deleting Obsolete Companies)

Introduction

During implementations, migrations and other projects are created in Dynamics AX temporary companies (aka legal entities, data areas) that aren’t needed anymore once they fulfilled their purpose. Excepting the fact that obsolete companies occupy space in the data center, under certain circumstances they can lead to performance problems. The logical thing to do would be to delete the obsolete companies as long there’s no further demand from the business.

In what follows we will look at several methods for deleting obsolete companies. The scripts were tested in Dynamics AX 2009, and more likely they’ll work in coming versions as long the data model behind was kept.

Warning:
Please note that the scripts are provided “AS IS” only to exemplify a technique and they come without any warranty! Before attempting any of the methods described here, review the comments from “Further Considerations” section!

Method 1: Using DynamcsAX Built-In Functionality

Dynamics AX 2009 provides built-in functionality for deleting a company, however when the volume of data in the system goes above a certain limit the functionality starts to perform poorly, even when run directly on an AOS. (It is recommended to run long-running administration jobs directly on the AOS rather than clients.) For example, it was attempted to use this method to delete several companies in Dynamics AX Test environment. By the first company the deletion job needed a few hours, while by the second company the job hasn’t finished after two days, being thus forced to stop it. After two further failed attempts it came the time to look for another solution.

Warning:
It seems that this solution can lead to orphaned data (see [1]). So, even if you are using this method, you might need to consider one of the following methods as well.

Method 2: Using sp_MSforEachTable

In almost all tables in AX the company is stored in a DataAreaId attribute. Over this attribute the records belonging to a company are logically partitioned. This allows writing a script via the undocumented sp_MSforEachTable stored procedure:

--delete the data for one data area
sp_MSforEachTable @command1 = 'DELETE FROM ? WHERE DataAreaId = ''m01'''

An error with be thrown for the tables that don’t contain the DataAreaId attribute:

"Msg 207, Level 16, State 1, Line 1"

Invalid column name 'DataAreaId'.The script can be extended to delete in the same step two or more companies:

--delete the data for multiple data areas
 sp_MSforEachTable @command1 = 'DELETE FROM ? WHERE DataAreaId IN (''m01'', ''m02'')'

During the first test the script needed half of hour to run, however a few tables in which the company is stored in other attributes remained untouched. One can either search for such tables manually, via a script, or run the built-in AX functionality. We opted for running the built-in functionality, which managed to delete the remaining data relatively fast.

Warning:
Microsoft doesn’t support this method and can be used when the volume of obsolete data is relatively small! What does it mean relatively small? The most important limitation of this method is the transaction log, considering that the deleted data are logged. One can either change log’s size to accommodate the volume of data to be deleted or run the deletion only for a subset of the tables. (Changing the recovery model to “simple” or “bulk-logged” won’t make a difference.)

The second important limitation is the available memory, once the available memory is reached SQL Server having to paginate the data, fact that could lead to further disk space consumed. Other limitations have more with the performance to do, e.g. each deletion is reflected also in the indexes. One might consider for example dropping the indexes before deletion and recreating them afterwards.

Method 3: Using a Cursor

Instead of using the undocumented sp_MSforEachTable stored procedure, the loop can be performed via a cursor (see [1]). This method is advantageous when the deletion needs to be performed only for a subset of tables one could use a cursor. The deletion can be grouped together with other activities and run together.

Method 4: Using „Shadow“ Tables

When the volume of data available is huge, and the volume of data that remain in the table is small compared with the overall data, it might be useful to consider using “shadow” tables. One can take advantage of the fact that a truncate command performs incomparable better than a delete command. To use a truncate on a table, the records that need to be kept could be saved temporarily to a copy (aka “shadow”) of the table, the truncate then applied, and the copied records could be moved back. The following scripts exemplify the logic needed to delete the records from InventDim (inventory dimensions) table:

-- (optional) prove the number of records
SELECT count(*) 
FROM dbo.InventDim 
WHERE DataAreaId = 'm01'

-- create the “shadow” table
CREATE TABLE [dbo].[INVENTDIM_Dump](
[INVENTDIMID] [nvarchar](30) NOT NULL,
[INVENTBATCHID] [nvarchar](21) NOT NULL,
[WMSLOCATIONID] [nvarchar](12) NOT NULL,
[INVENTSERIALID] [nvarchar](21) NOT NULL,
[INVENTLOCATIONID] [nvarchar](10) NOT NULL,
[CONFIGID] [nvarchar](10) NOT NULL,
[INVENTSIZEID] [nvarchar](10) NOT NULL,
[INVENTCOLORID] [nvarchar](10) NOT NULL,
[INVENTSITEID] [nvarchar](10) NOT NULL,
[DATAAREAID] [nvarchar](4) NOT NULL,
[RECVERSION] [int] NOT NULL,
[RECID] [bigint] NOT NULL,
[WMSPALLETID] [nvarchar](18) NOT NULL,
[INVENTSTYLEID] [nvarchar](10) NOT NULL
) ON [PRIMARY]

-- copy the data into the “shadow” table
INSERT INTO [dbo].[InventDim_Dump] WITH (TABLOCK)
SELECT *
FROM [dbo].[InventDim] 
WHERE DataAreaId = 'm01'

-- truncate the data frome the main table 
--TRUNCATE TABLE [dbo].[InventDim]

-- copy the data back
INSERT INTO [dbo].[InventDim] WITH (TABLOCK)
SELECT *
FROM [dbo].[InventDim_Dump]

-- (optional) prove whether the IDs were correctly copied 
SELECT count(*)
FROM [dbo].[InventDim] A
JOIN [dbo].[InventDim_Dump] B
ON A.recid = B.RECID 
AND A. DATAAREAID = B.DATAAREAID 
WHERE A.DataAreaId = 'm01'

-- drop the „shadow“ table 
--DROP TABLE[dbo].[InventDim_Dump]

As can be seen the “shadow” tables are simplified versions of the original tables, without constraints or indexes. They can be eventually created in another schema or even other database.

Except the script for table’s creation in the other scripts table’s name can be easily replaced in the editor via the search and replace functionality, trick that reduces considerably the time needed for development. I needed on average 5 minutes for each table, plus 3-4 hours for further tests.

The optional steps are more for exemplification and can be eventually removed.

The Tablock hint used in inserts provides better performance and minimizes the volume of data logged.

I used this method only for the tables having more than 3 million records, around 50 tables in total. Between them there were a few tables having 20-200 GB worth of records. I started with these big tables and figured out that also smaller tables could benefit from this method. A few minutes gained for each small table resulted in the end in a gain of a couple of hours.

The remained records were 0-25% of the initial tables.

In theory, these steps could be performed within a cursor in which the creation of the “shadow” tables could be automated via table metadata as well. This approach will pay-off especially when the schema is not fixed, or the procedure needs to be repeated on different schemas.

Method 5: Delete Records in Batches

There will be a point beyond which the performance provided by the fourth method will deprecate considerably. This point is based on the volume of records available in the table, and the records needed to be inserted back and forth. Without further tests, I suppose that this point lies in the 50-75% interval. Beyond this point for big tables in range of 10x or 100x GB it might be useful to delete the data in batches. A push in this direction might be constrained by the need to shrink the transaction log in between the deletes. The query could be written as follow:

-- deleting top x records 
DELETE top 10000
FROM dbo.InventDim WITH (TABLOCK)
WHERE DataAreaId = ‘m01’

The query can be included in a loop or run manually until no records are returned. It can be tested with different batch sizes to determine the best solution. In between is recommended to check also the growth of the log file and truncate it accordingly when needed.

Method 6: Using X++ code

For those having some basic knowledge of X++ and Dynamics AX classes, a solution based on deleting data via AX code could prove to be a better solution as standard functionality can be leveraged, functionality that eventually considers also the business logic implemented. The downside is the code that need to be written for this purpose, however there are already some examples available on the web (see [4]).

Hint:
In AX 2012 built-in support for batch deletes was added via the delete_from statement (see [3]).

Further Aspects

Before attempting a deletion, it might be useful to analyze how many records will be deleted from each table, and eventually devise different scenarios for specific table categories. To get the number of records one can use either the built-in functionality from AX or use the sp_MSforEachTable stored procedure and export the results to text, following to overwork the data further in Excel:

-- listing the number of records per company 
sp_MSforEachTable @command1 = 'SELECT dataareaid, ''?'' table_name, count(*) no_records FROM ? WHERE DataAreaId IN (''m01'', ''m02'') GROUP BY dataareaid'

The results can be used also to approximate the space occupied by the data.

Independently of the method used it is recommended to restrict users‘ access to the system and to deactivate the scheduled AX or SQL Server jobs. This will ensure that no blockings will occur in the system during the respective time.

As data are synchronized between the AOS’s and the database, it is recommended to shut down the not needed AOS services before the deletions are performed, and restart them once all activities were performed.

To minimize the risks associated with the loss of data it’s recommended to perform a backup of the database(s) before performing any changes.

By deleting the data directly on the database, the business logic from AX (including customizations) is skipped. In theory this can lead to logical inconsistencies, however considering that all the data for a company are deleted, the risks are very small, unless intercompanies are involved.

After the data are deleted it is recommended to recreate the indexes and update the statistics on the tables.

Check whether the transaction log can accommodate the volume of records to be deleted! In extreme cases your SQL Server might crash! From this consideration it might be advantageous to delete only a company at a time.

Based on the volume of data available in the transaction log it might be needed to truncate the log(s) between the steps, as well at the end.

After the principle “better safe than sorry”, it might be a good idea to check the physical and logical consistency of the data before letting the users in.

To minimize the impact on the business, it is recommended to perform the deletion outside the working hours, otherwise the action can lead to blocking and even deadlocks in the system. Always attempt to use standard functionality and resort to other methods only when there’s no way around it.

It is recommended to always test the scripts thoroughly in the test environment before attempting their productive usage!