
31 October 2020

🧊Data Warehousing: Architecture (Part III: Data Lakes & other Puddles)

Data Warehousing

One can consider a data lake as a repository of all of an organization’s data in raw form. However, this constraint might be too harsh, as data at different levels of processing can be imported as well; for example, the results of data mining or other Data Science techniques/methods can be considered raw data for further processing.

In the initial definition provided by James Dixon, the difference between a data lake and a data mart/warehouse was expressed metaphorically as the transition from bottled water to a lake fed (artificially) from various sources. The metaphor contrasts the objective-oriented, limited and single-purposed role of the data mart/warehouse with the natural flow of data that can be tapped and harnessed as desired. These are, though, metaphors intended to resonate with the buyer. Personally, I like to think of the data lake as an extension of the data infrastructure, of which the data mart or warehouse is an integral part. Imposing further constraints seems to bring no benefit.

Probably the most important characteristic of a data lake is that it makes an organization’s data discoverable and consumable, though from there to insight and other benefits is a long road that requires specific knowledge about the techniques used, as well as about the organization’s processes and data. Without this, data lake-based solutions can lead to erroneous results, just as mixing several ingredients without knowing how they are used leads to cooking experiments far removed from the art of cooking.

A characteristic of data is that they change continuously and differ in timeliness, respectively in degrees of quality with respect to the data quality dimensions implied and the sources considered. Data need to reflect reality at the level of detail and quality required by the processing application(s), and this applies to data warehouses/marts as well as to data lake-based solutions.

Data in raw form don’t necessarily represent the truth, nor do they necessarily acquire good quality no matter how much they are processed. Solutions need to be resilient with respect to the data they handle through their layers, independently of data quality and transmission problems. Whether one talks about ETL, data migration or other types of data processing, maintaining data integrity at the various levels and layers is perhaps the most important demand placed upon solutions.

Snapshots, as moment-in-time recordings of tables, entities, sets of entities, datasets or whole databases, often prove to be the best mechanism for preserving data integrity when this aspect is essential to the processing (e.g. data migrations, high-accuracy measurements). Unfortunately, the more systems are involved in the process and the broader the span of the solution over the sources, the more difficult it becomes to take such snapshots.

A SQL query’s output represents a snapshot of the data, therefore SQL-based solutions are usually appropriate for most business scenarios in which the characteristics of the data (typically volume, velocity and/or variety) keep their processing manageable. However, when the data are extracted by other means, integrity is harder to obtain, especially when there’s no timestamp to allow partitioning the data on a time scale; handling data integrity thus becomes, in extremis, a programmer’s task. In addition, taking snapshots of the data as they change can be a costly and futile undertaking.
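
To make the timestamp idea more tangible, below is a minimal Python sketch of a delta extraction driven by a modification date. The table and column names (Orders, ModifiedDate) are purely illustrative, and a real solution would persist the watermark between runs and handle late-arriving changes:

```python
import sqlite3
from datetime import datetime, timezone

def extract_delta(conn: sqlite3.Connection, last_watermark: str):
    """Return the rows changed since the last load, plus a new watermark."""
    # Hypothetical table/columns: only rows modified after the last watermark
    # are extracted, partitioning the data on a time scale.
    cursor = conn.execute(
        "SELECT OrderId, Amount, ModifiedDate FROM Orders WHERE ModifiedDate > ?",
        (last_watermark,),
    )
    rows = cursor.fetchall()
    # The extraction moment becomes the watermark for the next run.
    new_watermark = datetime.now(timezone.utc).isoformat()
    return rows, new_watermark
```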

Further on, maintaining data integrity can prove to be a matter of design, not only with respect to the processing of the data, but also with respect to the source applications and the business processes they implement. Mastery of the underlying principles, techniques, patterns and methodologies helps in designing the right solutions.

Note:
Written as an answer to a Medium post on data lakes and batch processing in data warehouses.

29 November 2019

🧭Business Intelligence: Perspectives (Part V: Data Soup - From BI to Analytics)

Business Intelligence Series

The days when everything was reduced to simple terminology like reports or queries are gone. One can see it in the market trends related to reporting or data, as well as in the jargon soup IT people use on a daily basis – Business Intelligence (BI), Data Mining (DM), Analytics, Data Science, Data Warehousing (DW), Machine Learning (ML), Artificial Intelligence (AI) and so on. What’s more confusing for the users and other spectators is the ease with which all these concepts are used, sometimes interchangeably, and often it feels like nothing makes sense.

BI is used nowadays to refer to the technologies, architectures, methodologies, processes and practices used to transform data into meaningful and useful information. Since its early beginnings in the 60s, the intelligence in Business Intelligence (BI) has referred to the ability to apprehend the interrelationships of the facts to be processed (aka data) in such a way as to guide action towards a desired goal.

The main purpose of BI was and is to guide actions and provide a solid basis for decision making, an aspect not necessarily reflected in the way organizations use their BI infrastructure. Beyond basic operational/tactical/strategic reports and metrics that reflect to a greater or lesser degree organizations’ goals, BI often fails to provide the expected value. The causes are multiple, ranging from an organization’s maturity in devising a strategy and dividing it into SMART goals and objectives, to the misuse of technologies for the wrong purposes.

Despite the basic data analysis techniques, the rich visualizations and the navigation functionality, BI often fails to deliver by itself more than ordinary and already known information. Information becomes valuable when it brings novelty, when it can be easily transformed into knowledge or, even better, when knowledge is extracted directly. To address the limitations of BI, a series of techniques appeared in parallel and were coined in the 90s as Data Mining.

Mining is the process of obtaining something valuable from a resource. What DM tries to achieve as a process is the extraction of knowledge in the form of patterns from the data by categorizing, clustering, and identifying dependencies or anomalies. Compared with data analysis, the main characteristics of DM are that it is used to test models and hypotheses, and that it uses a set of semiautomatic and automatic out-of-the-box statistical packages, AI or predictive algorithms with applicability in different areas – Web, text, speech, business processes, etc.
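
As a rough illustration of such out-of-the-box algorithms, the sketch below (Python with scikit-learn, on synthetic data) uses an isolation forest to flag anomalous records; it stands for the kind of semiautomatic pattern extraction mentioned above, not for any particular product:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic data: mostly "normal" two-dimensional records plus a few outliers.
normal = rng.normal(loc=100.0, scale=10.0, size=(500, 2))
outliers = rng.uniform(low=200.0, high=300.0, size=(5, 2))
X = np.vstack([normal, outliers])

# Fit an out-of-the-box anomaly detector and flag the suspicious records.
detector = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = detector.predict(X)  # -1 marks an anomaly, 1 a normal record
print("flagged rows:", np.where(labels == -1)[0])
```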

DM proved to be useful by allowing models to be built from historical data, models which allow predicting outcomes or behavior; however, the models are fairly basic and there’s always a threshold beyond which they can’t go. Furthermore, the costs of preparing the data and of the needed infrastructure seem high compared with the benefits data mining provides. There are scenarios in which DM brings benefit, while in others it raises more challenges than it can solve. Privacy, security, misuse of information and the blind use of techniques without understanding the data or the models behind them are just some of these challenges.

Information seems too common, while knowledge can become expensive to obtain. The middle way between the two found its future in another buzzword – analytics, the systematic analysis of data or statistics using specific mathematical methods. Analytics combines the agility of data analysis techniques with the power of the predictive and prescriptive techniques used in DM to discover patterns in the data. Analytics attempts to identify why something happens by using a chain of inferences resulting from analyzing and understanding the data. From another perspective, analytics seems to be a rebranded and slightly enhanced version of BI.

24 December 2018

🔭Data Science: Data Mining (Just the Quotes)

"Data mining is the efficient discovery of valuable, nonobvious information from a large collection of data. […] Data mining centers on the automated discovery of new facts and relationships in data. The idea is that the raw material is the business data, and the data mining algorithm is the excavator, sifting through the vast quantities of raw data looking for the valuable nuggets of business information." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Data mining is more of an art than a science. No one can tell you exactly how to choose columns to include in your data mining models. There are no hard and fast rules you can follow in deciding which columns either help or hinder the final model. For this reason, it is important that you understand how the data behaves before beginning to mine it. The best way to achieve this level of understanding is to see how the data is distributed across columns and how the different columns relate to one another. This is the process of exploring the data." (Seth Paul et al. "Preparing and Mining Data with Microsoft SQL Server 2000 and Analysis", 2002)

"Things are changing. Statisticians now recognize that computer scientists are making novel contributions while computer scientists now recognize the generality of statistical theory and methodology. Clever data mining algorithms are more scalable than statisticians ever thought possible. Formal statistical theory is more pervasive than computer scientists had realized." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"Most mainstream data-mining techniques ignore the fact that real-world datasets are combinations of underlying data, and build single models from them. If such datasets can first be separated into the components that underlie them, we might expect that the quality of the models will improve significantly. (David Skillicorn, "Understanding Complex Datasets: Data Mining with Matrix Decompositions", 2007)

"The name ‘data mining’ derives from the metaphor of data as something that is large, contains far too much detail to be used as it is, but contains nuggets of useful information that can have value. So data mining can be defined as the extraction of the valuable information and actionable knowledge that is implicit in large amounts of data. (David Skillicorn, "Understanding Complex Datasets: Data Mining with Matrix Decompositions", 2007)

"Compared to traditional statistical studies, which are often hindsight, the field of data mining finds patterns and classifications that look toward and even predict the future. In summary, data mining can (1) provide a more complete understanding of data by finding patterns previously not seen and (2) make models that predict, thus enabling people to make better decisions, take action, and therefore mold future events." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"Traditional statistical studies use past information to determine a future state of a system (often called prediction), whereas data mining studies use past information to construct patterns based not solely on the input data, but also on the logical consequences of those data. This process is also called prediction, but it contains a vital element missing in statistical analysis: the ability to provide an orderly expression of what might be in the future, compared to what was in the past (based on the assumptions of the statistical method)." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"The difference between human dynamics and data mining boils down to this: Data mining predicts our behaviors based on records of our patterns of activity; we don't even have to understand the origins of the patterns exploited by the algorithm. Students of human dynamics, on the other hand, seek to develop models and theories to explain why, when, and where we do the things we do with some regularity." (Albert-László Barabási, "Bursts: The Hidden Pattern Behind Everything We Do", 2010)

"Data mining is a craft. As with many crafts, there is a well-defined process that can help to increase the likelihood of a successful result. This process is a crucial conceptual tool for thinking about data science projects. [...] data mining is an exploratory undertaking closer to research and development than it is to engineering." (Foster Provost, "Data Science for Business", 2013)

"There is another important distinction pertaining to mining data: the difference between (1) mining the data to find patterns and build models, and (2) using the results of data mining. Students often confuse these two processes when studying data science, and managers sometimes confuse them when discussing business analytics. The use of data mining results should influence and inform the data mining process itself, but the two should be kept distinct." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"Unfortunately, creating an objective function that matches the true goal of the data mining is usually impossible, so data scientists often choose based on faith and experience." (Foster Provost, "Data Science for Business", 2013)

"Data Mining is the art and science of discovering useful innovative patterns from data. (Anil K. Maheshwari, "Business Intelligence and Data Mining", 2015)

"Machine learning takes many different forms and goes by many different names: pattern recognition, statistical modeling, data mining, knowledge discovery, predictive analytics, data science, adaptive systems, self-organizing systems, and more. Each of these is used by different communities and has different associations. Some have a long half-life, some less so." (Pedro Domingos, "The Master Algorithm", 2015)

"Today we routinely learn models with millions of parameters, enough to give each elephant in the world his own distinctive wiggle. It’s even been said that data mining means 'torturing the data until it confesses'." (Pedro Domingos, "The Master Algorithm", 2015)

"Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. The data sets themselves are explicitly linked as a form of representation to an observational or otherwise empirical domain of interest. 'Structure' has long been understood as symmetry which can take many forms with respect to any transformation, including point, translational, rotational, and many others. Symmetries directly point to invariants, which pinpoint intrinsic properties of the data and of the background empirical domain of interest. As our data models change, so too do our perspectives on analysing data." (Fionn Murtagh, "Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics", 2018)

"The goal of data science is to improve decision making by basing decisions on insights extracted from large data sets. As a field of activity, data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from large data sets. It is closely related to the fields of data mining and machine learning, but it is broader in scope." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

20 November 2018

🔭Data Science: Overfitting (Just the Quotes)

"When training a neural network, it is important to understand when to stop. […] If the same training patterns or examples are given to the neural network over and over, and the weights are adjusted to match the desired outputs, we are essentially telling the network to memorize the patterns, rather than to extract the essence of the relationships. What happens is that the neural network performs extremely well on the training data. However, when it is presented with patterns it hasn't seen before, it cannot generalize and does not perform well. What is the problem? It is called overtraining." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

[Over-fitting fallacy:] "The error of designing an over-complex trading strategy with too many parameters that performs well on the in-sample-data, but is actually no more than a close description of the past data. This is a problem often encountered in time-series analysis and modelling." (Kermit Zieg & Heinrich Weber, "The Complete Guide to Point-and-Figure Charting", 2003)

"A smaller model with fewer covariates has two advantages: it might give better predictions than a big model and it is more parsimonious (simpler). Generally, as you add more variables to a regression, the bias of the predictions decreases and the variance increases. Too few covariates yields high bias; this called underfitting. Too many covariates yields high variance; this called overfitting. Good predictions result from achieving a good balance between bias and variance. […] finding a good model involves trading of fit and complexity." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"Learning a complicated function that matches the training data closely but fails to recognize the underlying process that generates the data. As a result of overfitting, the model performs poor on new input. Overfitting occurs when the training patterns are sparse in input space and/or the trained networks are too complex." (Frank Padberg, "Counting the Hidden Defects in Software Documents", 2010)

"A forecaster should almost never ignore data, especially when she is studying rare events […]. Ignoring data is often a tip-off that the forecaster is overconfident, or is overfitting her model - that she is interested in showing off rather than trying to be accurate."  (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"A problem in data mining when random variations in data are misclassified as important patterns. Overfitting often occurs when the data set is too small to represent the real world." (Microsoft, "SQL Server 2012 Glossary", 2012)

"If you look too hard at a set of data, you will find something - but it might not generalize beyond the data you’re looking at. This is referred to as overfitting a dataset. Data mining techniques can be very powerful, and the need to detect and avoid overfitting is one of the most important concepts to grasp when applying data mining to real problems. The concept of overfitting and its avoidance permeates data science processes, algorithms, and evaluation methods." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"Overfitting occurs when a formula describes a set of data very closely, but does not lead to any sensible explanation for the behavior of the data and does not predict the behavior of comparable data sets. In the case of overfitting, the formula is said to describe the noise of the system rather than the characteristic behavior of the system. Overfitting occurs frequently with models that perform iterative approximations on training data, coming closer and closer to the training data set with each iteration. Neural networks are an example of a data modeling strategy that is prone to overfitting." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"Briefly speaking, to solve a Machine Learning problem means you optimize a model to fit all the data from your training set, and then you use the model to predict the results you want. Therefore, evaluating a model need to see how well it can be used to predict the data out of the training set. Usually there are three types of the models: underfitting, fair and overfitting model [...]. If we want to predict a value, both (a) and (c) in this figure cannot work well. The underfitting model does not capture the structure of the problem at all, and we say it has high bias. The overfitting model tries to fit every sample in the training set and it did it, but we say it is of high variance. In other words, it fails to generalize new data." (Shudong Hao, "A Beginner’s Tutorial for Machine Learning Beginners", 2014)

"Neural networks can model very complex patterns and decision boundaries in the data and, as such, are very powerful. In fact, they are so powerful that they can even model the noise in the training data, which is something that definitely should be avoided. One way to avoid this overfitting is by using a validation set in a similar way as with decision trees.[...] Another scheme to prevent a neural network from overfitting is weight regularization, whereby the idea is to keep the weights small in absolute sense because otherwise they may be fitting the noise in the data. This is then implemented by adding a weight size term (e.g., Euclidean norm) to the objective function of the neural network." (Bart Baesens, "Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications", 2014)

"Underfitting is when a model doesn’t take into account enough information to accurately model real life. For example, if we observed only two points on an exponential curve, we would probably assert that there is a linear relationship there. But there may not be a pattern, because there are only two points to reference. [...] It seems that the best way to mitigate underfitting a model is to give it more information, but this actually can be a problem as well. More data can mean more noise and more problems. Using too much data and too complex of a model will yield something that works for that particular data set and nothing else." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"Neural nets are typically over-parametrized, and hence are prone to overfitting. Originally early stopping was set up as the primary tuning parameter, and the stopping time was determined using a held-out set of validation data. In modern networks the regularization is tuned adaptively to avoid overfitting, and hence it is less of a problem." (Bradley Efron & Trevor Hastie, "Computer Age Statistical Inference: Algorithms, Evidence, and Data Science", 2016)

"The greater the uncertainty, the bigger the gap between what you can measure and what matters, the more you should watch out for overfitting - that is, the more you should prefer simplicity." (Brian Christian & Thomas L Griffiths, "Algorithms to Live By: The Computer Science of Human Decisions", 2016)

"When memorization happens, you may have the illusion that everything is working well because your machine learning algorithm seems to have fitted the in sample data so well. Instead, problems can quickly become evident when you start having it work with out-of-sample data and you notice that it produces errors in its predictions as well as errors that actually change a lot when you relearn from the same data with a slightly different approach. Overfitting occurs when your algorithm has learned too much from your data, up to the point of mapping curve shapes and rules that do not exist [...]. Any slight change in the procedure or in the training data produces erratic predictions." (John P Mueller & Luca Massaron, Machine Learning for Dummies, 2016)

"By far the greatest headache in machine learning is the problem of overfitting. This means that your results look great for the data you trained them on, but they don’t generalize to other data in the future. [...] The solution is to train on some of your data and assess performance on other data." (Field Cady, "The Data Science Handbook", 2017) 

"Cross-validation means we split our data into test and training sets, and then train the model on the training set before testing it on the test set. Cross-validation prevents overfitting, which is when a model seems quite accurate but fails to actually predict future events well." (Russell Jurney, "Agile Data Science 2.0: Building Full-Stack Data Analytics Applications with Spark", 2017)

"Multilayer perceptrons share with polynomial classifiers one unpleasant property. Theoretically speaking, they are capable of modeling any decision surface, and this makes them prone to overfitting the training data."  (Miroslav Kubat," An Introduction to Machine Learning" 2nd Ed., 2017)

"The main reason why pruning tends to improve classification performance on future examples is that the removal of low-level tests, which have poor statistical support, usually reduces the danger of overfitting. This, however, works only up to a certain point. If overdone, a very high extent of pruning can (in the extreme) result in the decision being replaced with a single leaf labeled with the majority class." (Miroslav Kubat," An Introduction to Machine Learning" 2nd Ed., 2017)

"From a typical training set, many alternative decision trees can be created. As a rule, smaller trees are to be preferred, their main advantages being interpretability, removal of irrelevant and redundant attributes, and lower danger of overfitting noisy training data." (Miroslav Kubat, "An Introduction to Machine Learning" 2nd Ed., 2017)

"High-bias models typically produce simpler models that do not overfit and in those cases the danger is that of underfitting. Models with low-bias are typically more complex and that complexity enables us to represent the training data in a more accurate way. The danger here is that the flexibility provided by higher complexity may end up representing not only a relationship in the data but also the noise. Another way of portraying the bias-variance trade-off is in terms of complexity v simplicity." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017) 

"If either bias or variance is high, the model can be very far off from reality. In general, there is a trade-off between bias and variance. The goal of any machine-learning algorithm is to achieve low bias and low variance such that it gives good prediction performance. In reality, because of so many other hidden parameters in the model, it is hard to calculate the real bias and variance error. Nevertheless, the bias and variance provide a measure to understand the behavior of the machine-learning algorithm so that the model model can be adjusted to provide good prediction performance." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Overfitting and underfitting are two important factors that could impact the performance of machine-learning models. Overfitting occurs when the model performs well with training data and poorly with test data. Underfitting occurs when the model is so simple that it performs poorly with both training and test data. [...]  When the model does not capture and fit the data, it results in poor performance. We call this underfitting. Underfitting is the result of a poor model that typically does not perform well for any data." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Overfitting refers to the phenomenon where a model is highly fitted on a dataset. This generalization thus deprives the model from making highly accurate predictions about unseen data. [...] Underfitting is a phenomenon where the model is not trained with high precision on data at hand. The treatment of underfitting is subject to bias and variance. A model will have a high bias if both train and test errors are high [...] If a model has a high bias type underfitting, then the remedy can be to increase the model complexity, and if a model is suffering from high variance type underfitting, then the cure can be to bring in more data or otherwise make the model less complex." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"The danger of overfitting is particularly severe when the training data is not a perfect gold standard. Human class annotations are often subjective and inconsistent, leading boosting to amplify the noise at the expense of the signal. The best boosting algorithms will deal with overfitting though regularization. The goal will be to minimize the number of non-zero coefficients, and avoid large coefficients that place too much faith in any one classifier in the ensemble." (Steven S Skiena, "The Data Science Design Manual", 2017)

"The tension between bias and variance, simplicity and complexity, or underfitting and overfitting is an area in the data science and analytics process that can be closer to a craft than a fixed rule. The main challenge is that not only is each dataset different, but also there are data points that we have not yet seen at the moment of constructing the model. Instead, we are interested in building a strategy that enables us to tell something about data from the sample used in building the model." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017) 

"Variance is a prediction error due to different sets of training samples. Ideally, the error should not vary from one training sample to another sample, and the model should be stable enough to handle hidden variations between input and output variables. Normally this occurs with the overfitted model." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Variance is error from sensitivity to fluctuations in the training set. If our training set contains sampling or measurement error, this noise introduces variance into the resulting model. [...] Errors of variance result in overfit models: their quest for accuracy causes them to mistake noise for signal, and they adjust so well to the training data that noise leads them astray. Models that do much better on testing data than training data are overfit." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Even though a natural way of avoiding overfitting is to simply build smaller networks (with fewer units and parameters), it has often been observed that it is better to build large networks and then regularize them in order to avoid overfitting. This is because large networks retain the option of building a more complex model if it is truly warranted. At the same time, the regularization process can smooth out the random artifacts that are not supported by sufficient data. By using this approach, we are giving the model the choice to decide what complexity it needs, rather than making a rigid decision for the model up front (which might even underfit the data)." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"One of the most common problems that you will encounter when training deep neural networks will be overfitting. What can happen is that your network may, owing to its flexibility, learn patterns that are due to noise, errors, or simply wrong data. [...] The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented the underlying model structure. The opposite is called underfitting - when the model cannot capture the structure of the data." (Umberto Michelucci, "Applied Deep Learning: A Case-Based Approach to Understanding Deep Neural Networks", 2018)

"The high generalization error in a neural network may be caused by several reasons. First, the data itself might have a lot of noise, in which case there is little one can do in order to improve accuracy. Second, neural networks are hard to train, and the large error might be caused by the poor convergence behavior of the algorithm. The error might also be caused by high bias, which is referred to as underfitting. Finally, overfitting (i.e., high variance) may cause a large part of the generalization error. In most cases, the error is a combination of more than one of these different factors." (Charu C Aggarwal, "Neural Networks and Deep Learning: A Textbook", 2018)

"The trick is to walk the line between underfitting and overfitting. An underfit model has low variance, generally making the same predictions every time, but with extremely high bias, because the model deviates from the correct answer by a significant amount. Underfitting is symptomatic of not having enough data points, or not training a complex enough model. An overfit model, on the other hand, has memorized the training data and is completely accurate on data it has seen before, but varies widely on unseen data. Neither an overfit nor underfit model is generalizable - that is, able to make meaningful predictions on unseen data." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"Any fool can fit a statistical model, given the data and some software. The real challenge is to decide whether it actually fits the data adequately. It might be the best that can be obtained, but still not good enough to use." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"The classifier accuracy would be extra ordinary when the test data and the training data are overlapping. But when the model is applied to a new data it will fail to show acceptable accuracy. This condition is called as overfitting." (Jesu V  Nayahi J & Gokulakrishnan K, "Medical Image Classification", 2019)

"We over-fit when we go too far in adapting to local circumstances, in a worthy but misguided effort to be ‘unbiased’ and take into account all the available information. Usually we would applaud the aim of being unbiased, but this refinement means we have less data to work on, and so the reliability goes down. Over-fitting therefore leads to less bias but at a cost of more uncertainty or variation in the estimates, which is why protection against over-fitting is sometimes known as the bias/variance trade-off." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Well, in statistics we develop models from a sample of data and are trying to make inferences to a broader population. […] If you use a lot of parameters to explain the data in hand (the sample), you may have captured your particular dataset but completely miss the mark for the population as a whole! This is known as 'overfitting'." (Therese M Donovan & Ruth M Mickey, "Bayesian Statistics for Beginners: A Step-by-Step Approach", 2019)

"In machine learning, our data has biases as well as useful information for our task. The more exactly our machine learning model fits the data, the more it reflects these biases. This means that the predictions may be based on spurious relationships that incidentally occur in the training data." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"[...] with four parameters I can fit an elephant, and with five I can make him wiggle his trunk." (John von Neymann) [attributed]

13 May 2018

🔬Data Science: Self-Organizing Map (Definitions)

"A clustering neural net, with topological structure among cluster units." (Laurene V Fausett, "Fundamentals of Neural Networks: Architectures, Algorithms, and Applications", 1994)

"A self organizing map is a form of Kohonen network that arranges its clusters in a (usually) two-dimensional grid so that the codebook vectors (the cluster centers) that are close to each other on the grid are also close in the k-dimensional feature space. The converse is not necessarily true, as codebook vectors that are close in feature-space might not be close on the grid. The map is similar in concept to the maps produced by descriptive techniques such as multi-dimensional scaling (MDS)." (William J Raynor Jr., "The International Dictionary of Artificial Intelligence", 1999)

"result of a nonparametric regression process that is mainly used to represent high-dimensional, nonlinearly related data items in an illustrative, often two-dimensional display, and to perform unsupervised classification and clustering." (Teuvo Kohonen, "Self-Organizing Maps" 3rd Ed., 2001)

"a method of organizing and displaying textual information according to the frequency of occurrence of text and the relationship of text from one document to another." (William H Inmon, "Building the Data Warehouse", 2005)

"A type of unsupervised neural network used to group similar cases in a sample. SOMs are unsupervised (see supervised network) in that they do not require a known dependent variable. They are typically used for exploratory analysis and to reduce dimensionality as an aid to interpretation of complex data. SOMs are similar in purpose to Ic-means clustering and factor analysis." (David Scarborough & Mark J Somers, "Neural Networks in Organizational Research: Applying Pattern Recognition to the Analysis of Organizational Behavior", 2006)

"A method to learn to cluster input vectors according to how they are naturally grouped in the input space. In its simplest form, the map consists of a regular grid of units and the units learn to represent statistical data described by model vectors. Each map unit contains a vector used to represent the data. During the training process, the model vectors are changed gradually and then the map forms an ordered non-linear regression of the model vectors into the data space." (Atiq Islam et al, "CNS Tumor Prediction Using Gene Expression Data Part II", Encyclopedia of Artificial Intelligence, 2009)

"A neural-network method that reduces the dimensions of data while preserving the topological properties of the input data. SOM is suitable for visualizing high-dimensional data such as microarray data." (Emmanuel Udoh & Salim Bhuiyan, "C-MICRA: A Tool for Clustering Microarray Data", 2009)

"A neural network unsupervised method of vector quantization widely used in classification. Self-Organizing Maps are a much appreciated for their topology preservation property and their associated data representation system. These two additive properties come from a pre-defined organization of the network that is at the same time a support for the topology learning and its representation. (Patrick Rousset & Jean-Francois Giret, "A Longitudinal Analysis of Labour Market Data with SOM" Encyclopedia of Artificial Intelligence, 2009)

"A simulated neural network based on a grid of artificial neurons by means of prototype vectors. In an unsupervised training the prototype vectors are adapted to match input vectors in a training set. After completing this training the SOM provides a generalized K-means clustering as well as topological order of neurons." (Laurence Mukankusi et al, "Relationships between Wireless Technology Investment and Organizational Performance", 2009)

"A subtype of artificial neural network. It is trained using unsupervised learning to produce low dimensional representation of the training samples while preserving the topological properties of the input space." (Soledad Delgado et al, "Growing Self-Organizing Maps for Data Analysis", 2009)

"An unsupervised neural network providing a topology-preserving mapping from a high-dimensional input space onto a two-dimensional output space." (Thomas Lidy & Andreas Rauber, "Music Information Retrieval", 2009)

"Category of algorithms based on artificial neural networks that searches, by means of self-organization, to create a map of characteristics that represents the involved samples in a determined problem." (Paulo E Ambrósio, "Artificial Intelligence in Computer-Aided Diagnosis", 2009)

"Self-organizing maps (SOMs) are a data visualization technique which reduce the dimensions of data through the use of self-organizing neural networks." (Lluís Formiga & Francesc Alías, "GTM User Modeling for aIGA Weight Tuning in TTS Synthesis", Encyclopedia of Artificial Intelligence, 2009)

"SOFM [self-organizing feature map] is a data mining method used for unsupervised learning. The architecture consists of an input layer and an output layer. By adjusting the weights of the connections between input and output layer nodes, this method identifies clusters in the data." (Indranil Bose, "Data Mining in Tourism", 2009)

"The self-organizing map is a subtype of artificial neural networks. It is trained using unsupervised learning to produce low dimensional representation of the training samples while preserving the topological properties of the input space. The self-organizing map is a single layer feed-forward network where the output syntaxes are arranged in low dimensional (usually 2D or 3D) grid. Each input is connected to all output neurons. Attached to every neuron there is a weight vector with the same dimensionality as the input vectors. The number of input dimensions is usually a lot higher than the output grid dimension. SOMs are mainly used for dimensionality reduction rather than expansion." (Larbi Esmahi et al, "Adaptive Neuro-Fuzzy Systems", Encyclopedia of Artificial Intelligence, 2009)

"A type of neural network that uses unsupervised learning to produce two-dimensional representations of an input space." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The Self-organizing map is a non-parametric and non-linear neural network that explores data using unsupervised learning. The SOM can produce output that maps multidimensional data onto a two-dimensional topological map. Moreover, since the SOM requires little a priori knowledge of the data, it is an extremely useful tool for exploratory analyses. Thus, the SOM is an ideal visualization tool for analyzing complex time-series data." (Peter Sarlin, "Visualizing Indicators of Debt Crises in a Lower Dimension: A Self-Organizing Maps Approach", 2012)

"SOMs or Kohonen networks have a grid topology, with unequal grid weights. The topology of the grid provides a low dimensional visualization of the data distribution." (Siddhartha Bhattacharjee et al, "Quantum Backpropagation Neural Network Approach for Modeling of Phenol Adsorption from Aqueous Solution by Orange Peel Ash", 2013)

"An unsupervised neural network widely used in exploratory data analysis and to visualize multivariate object relationships." (Manuel Martín-Merino, "Semi-Supervised Dimension Reduction Techniques to Discover Term Relationships", 2015)

"ANN used for visualizing low-dimensional views of high-dimensional data." (Pablo Escandell-Montero et al, "Artificial Neural Networks in Physical Therapy", 2015)

"Is a unsupervised learning ANN, which means that no human intervention is needed during the learning and that little needs to be known about the characteristics of the input data." (Nuno Pombo et al, "Machine Learning Approaches to Automated Medical Decision Support Systems", 2015)

"A kind of artificial neural network which attempts to mimic brain functions to provide learning and pattern recognition techniques. SOM have the ability to extract patterns from large datasets without explicitly understanding the underlying relationships. They transform nonlinear relations among high dimensional data into simple geometric connections among their image points on a low-dimensional display." (Felix Lopez-Iturriaga & Iván Pastor-Sanz, "Using Self Organizing Maps for Banking Oversight: The Case of Spanish Savings Banks", 2016)

"Neural network which simulated some cerebral functions in elaborating visual information. It is usually used to classify a large amount of data." (Gaetano B Ronsivalle & Arianna Boldi, "Artificial Intelligence Applied: Six Actual Projects in Big Organizations", 2019)

"Classification technique based on unsupervised-learning artificial neural networks allowing to group data into clusters." Julián Sierra-Pérez & Joham Alvarez-Montoya, "Strain Field Pattern Recognition for Structural Health Monitoring Applications", 2020)

"It is a type of artificial neural network (ANN) trained using unsupervised learning for dimensionality reduction by discretized representation of the input space of the training samples called as map." (Dinesh Bhatia et al, "A Novel Artificial Intelligence Technique for Analysis of Real-Time Electro-Cardiogram Signal for the Prediction of Early Cardiac Ailment Onset", 2020)

"Being a particular type of ANNs, the Self Organizing Map is a simple mapping from inputs: attributes directly to outputs: clusters by the algorithm of unsupervised learning. SOM is a clustering and visualization technique in exploratory data analysis." (Yuh-Wen Chen, "Social Network Analysis: Self-Organizing Map and WINGS by Multiple-Criteria Decision Making", 2021)

10 May 2018

🔬Data Science: Cross-validation (Definitions)

"A method for assessing the accuracy of a regression or classification model. A data set is divided up into a series of test and training sets, and a model is built with each of the training set and is tested with the separate test set." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A method for assessing the accuracy of a regression or classification model." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2007)

"A statistical method derived from cross-classification which main objective is to detect the outlying point in a population set." (Tomasz Ciszkowski & Zbigniew Kotulski, "Secure Routing with Reputation in MANET", 2008)

"Process by which an original dataset d is divided into a training set t and a validation set v. The training set is used to produce an effort estimation model (if applicable), later used to predict effort for each of the projects in v, as if these projects were new projects for which effort was unknown. Accuracy statistics are then obtained and aggregated to provide an overall measure of prediction accuracy." (Emilia Mendes & Silvia Abrahão, "Web Development Effort Estimation: An Empirical Analysis", 2008)

"A method of estimating predictive error of inducers. Cross-validation procedure splits that dataset into k equal-sized pieces called folds. k predictive function are built, each tested on a distinct fold after being trained on the remaining folds." (Gilles Lebrun et al, EA Multi-Model Selection for SVM, 2009)

"Method to estimate the accuracy of a classifier system. In this approach, the dataset, D, is randomly split into K mutually exclusive subsets (folds) of equal size (D1, D2, …, Dk) and K classifiers are built. The i-th classifier is trained on the union of all Dj ¤ j¹i and tested on Di. The estimate accuracy is the overall number of correct classifications divided by the number of instances in the dataset." (M Paz S Lorente et al, "Ensemble of ANN for Traffic Sign Recognition" [in "Encyclopedia of Artificial Intelligence"], 2009)

"The process of assessing the predictive accuracy of a model in a test sample compared to its predictive accuracy in the learning or training sample that was used to make the model. Cross-validation is a primary way to assure that over learning does not take place in the final model, and thus that the model approximates reality as well as can be obtained from the data available." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"Validating a scoring procedure by applying it to another set of data." (Dougal Hutchison, "Automated Essay Scoring Systems", 2009)

"A method for evaluating the accuracy of a data mining model." (Microsoft, "SQL Server 2012 Glossary", 2012)

"Cross-validation is a method of splitting all of your data into two parts: training and validation. The training data is used to build the machine learning model, whereas the validation data is used to validate that the model is doing what is expected. This increases our ability to find and determine the underlying errors in a model." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"A technique used for validation and model selection. The data is randomly partitioned into K groups. The model is then trained K times, each time with one of the groups left out, on which it is evaluated." (Simon Rogers & Mark Girolami, "A First Course in Machine Learning", 2017)

"A model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set." (Adrian Carballal et al, "Approach to Minimize Bias on Aesthetic Image Datasets", 2019)

05 May 2018

🔬Data Science: Clustering (Definitions)

"Grouping of similar patterns together. In this text the term 'clustering' is used only for unsupervised learning problems in which the desired groupings are not known in advance." (Laurene V Fausett, "Fundamentals of Neural Networks: Architectures, Algorithms, and Applications", 1994)

"The process of grouping similar input patterns together using an unsupervised training algorithm." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

"Clustering attempts to identify groups of observations with similar characteristics." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"The process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects, which are 'similar' between them and are 'dissimilar' to the objects belonging to other clusters." (Juan R González et al, "Nature-Inspired Cooperative Strategies for Optimization", 2008)

"Grouping the nodes of an ad hoc network such that each group is a self-organized entity having a cluster-head which is responsible for formation and management of its cluster." (Prayag Narula, "Evolutionary Computing Approach for Ad-Hoc Networks", 2009)

"The process of assigning individual data items into groups (called clusters) so that items from the same cluster are more similar to each other than items from different clusters. Often similarity is assessed according to a distance measure." (Alfredo Vellido & Iván Olie, "Clustering and Visualization of Multivariate Time Series", 2010)

"Verb. To output a smaller data set based on grouping criteria of common attributes." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The process of partitioning the data attributes of an entity or table into subsets or clusters of similar attributes, based on subject matter or characteristic (domain)." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A data mining technique that analyzes data to group records together according to their location within the multidimensional attribute space." (SQL Server 2012 Glossary, "Microsoft", 2012)

"Clustering aims to partition data into groups called clusters. Clustering is usually unsupervised in the sense that the training data is not labeled. Some clustering algorithms require a guess for the number of clusters, while other algorithms don't." (Ivan Idris, "Python Data Analysis", 2014)

"Form of data analysis that groups observations to clusters. Similar observations are grouped in the same cluster, whereas dissimilar observations are grouped in different clusters. As opposed to classification, there is not a class attribute and no predefined classes exist." (Efstathios Kirkos, "Composite Classifiers for Bankruptcy Prediction", 2014)

"Organization of data in some semantically meaningful way such that each cluster contains related data while the unrelated data are assigned to different clusters. The clusters may not be predefined." (Sanjiv K Bhatia & Jitender S Deogun, "Data Mining Tools: Association Rules", 2014)

"Techniques for organizing data into groups of similar cases." (Meta S Brown, "Data Mining For Dummies", 2014)

[cluster analysis:] "A technique that identifies homogenous subgroups or clusters of subjects or study objects." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"Clustering is a classification technique where similar kinds of objects are grouped together. The similarity between the objects maybe determined in different ways depending upon the use case. Therefore, clustering in measurement space may be an indicator of similarity of image regions, and may be used for segmentation purposes." (Shiwangi Chhawchharia, "Improved Lymphocyte Image Segmentation Using Near Sets for ALL Detection", 2016)

"Clustering techniques share the goal of creating meaningful categories from a collection of items whose properties are hard to directly perceive and evaluate, which implies that category membership cannot easily be reduced to specific property tests and instead must be based on similarity. The end result of clustering is a statistically optimal set of categories in which the similarity of all the items within a category is larger than the similarity of items that belong to different categories." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed., 2016)

[cluster analysis:] "A statistical technique for finding natural groupings in data; it can also be used to assign new cases to groupings or categories." (Jonathan Ferrar et al, "The Power of People", 2017)

"Clustering or cluster analysis is a set of techniques of multivariate data analysis aimed at selecting and grouping homogeneous elements in a data set. Clustering techniques are based on measures relating to the similarity between the elements. In many approaches this similarity, or better, dissimilarity, is designed in terms of distance in a multidimensional space. Clustering algorithms group items on the basis of their mutual distance, and then the belonging to a set or not depends on how the element under consideration is distant from the collection itself." (Crescenzio Gallo, "Building Gene Networks by Analyzing Gene Expression Profiles", 2018)

"Unsupervised learning or clustering is a way of discovering hidden structure in unlabeled data. Clustering algorithms aim to discover latent patterns in unlabeled data using features to organize instances into meaningfully dissimilar groups." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"The term clustering refers to the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters." (Satyadhyan Chickerur et al, "Forecasting the Demand of Agricultural Crops/Commodity Using Business Intelligence Framework", 2019)

"In the machine learning context, clustering is the task of grouping examples into related groups. This is generally an unsupervised task, that is, the algorithm does not use preexisting labels, though there do exist some supervised clustering algorithms." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"A cluster is a group of data objects which have similarities among them. It's a group of the same or similar elements gathered or occurring closely together." (Hari K Kondaveeti et al, "Deep Learning Applications in Agriculture: The Role of Deep Learning in Smart Agriculture", 2021)

"Clustering describes an unsupervised machine learning technique for identifying structures among unstructured data. Clustering algorithms group sets of similar objects into clusters, and are widely used in areas including image analysis, information retrieval, and bioinformatics." (Accenture)

"Describes an unsupervised machine learning technique for identifying structures among unstructured data. Clustering algorithms group sets of similar objects into clusters, and are widely used in areas including image analysis, information retrieval, and bioinformatics." (Accenture)

"The process of identifying objects that are similar to each other and cluster them in order to understand the differences as well as the similarities within the data." (Analytics Insight)

13 April 2018

🔬Data Science: Text Mining (Definitions)

"The application of data mining techniques to discover actionable and meaningful patterns, profiles, and trends from documents or other text data." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed, 2011)

"The process of evaluating unstructured text for patterns, extract actionable data and sentiment via semantic analysis, statistical methods, etc." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Performing detailed full–text searches on the content of document." (Robert F Smallwood, "Managing Electronic Records: Methods, Best Practices, and Technologies", 2013)

"Data-mining techniques applied to text. Because these rely on the same underlying analytic approaches as text analysis, text mining is synonymous with text analysis, and the use of the term mining is primarily a matter of style and context." (Meta S Brown, "Data Mining For Dummies", 2014)

"Performing detailed full-text searches on the content of document." (Robert F Smallwood, "Information Governance: Concepts, Strategies, and Best Practices", 2014)

"It is the process of extracting information from textual sources, via their grammatical and statistical properties. Applications of text mining include security monitoring and analysis of online texts such as blogs, web-pages, web-posts, etc." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"The analysis of raw data to produce results specific to a particular inquiry (e.g., how often a particular word is used, whether a particular product is in demand, how a particular consumer reacts to advertisements)." (James R Kalyvas & Michael R Overly, "Big Data: A Businessand Legal Guide", 2015)

"Performing detailed full-text searches on the content of document." (Robert F Smallwood, "Information Governance for Healthcare Professionals", 2018)

"The search and extraction of text, and its possible conversion to numerical data that is used for data analysis." (David Natingga, "Data Science Algorithms in a Week" 2nd Ed., 2018)

"The process of extracting information from collections of textual data and utilizing it for business objectives." (Gartner) 

29 March 2018

🔬Data Science: Mining Model (Definitions)

"An object that contains the definition of a data mining process and the results of the training activity." (Microsoft Technet)

"Built from a mining structure, the mining model applies an algorithm to the data and processes it so that predictions can be made." (Sara Morganand & Tobias Thernstrom , "MCITP Self-Paced Training Kit : Designing and Optimizing Data Access by Using Microsoft SQL Server 2005 - Exam 70-442", 2007)

"An object that contains the definition of a data mining process and the results of the training activity. For example, a data mining model may specify the input, output, algorithm, and other properties of the process and hold the information gathered during the training activity, such as a decision tree." (Microsoft, SQL Server 2012 Glossary", 2012)

"The output of a data mining function that describes patterns and relationships that are discovered in historical data. A data mining model can be applied to new data for predicting likely new outcomes." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

15 March 2018

🔬Data Science: Training (Definitions)

"A step by step procedure for adjusting the weights in a neural net." (Laurene V Fausett, "Fundamentals of Neural Networks: Architectures, Algorithms, and Applications", 1994)

[supervised training:] "Process of adjusting the weights in a neural net using a learning algorithm; the desired output for each of a set of training input vectors is presented to the net. Many iterations through the training data may be required." (Laurene V Fausett, "Fundamentals of Neural Networks: Architectures, Algorithms, and Applications", 1994)

[unsupervised training:] "A training procedure in which only input vectors x are supplied to a neural network; the network learns some internal features of the whole set of all the input vectors presented to it." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"The process of adjusting the connection weights in a neural network under the control of a learning algorithm." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

[supervised training:] "Training of a neural network when the training examples comprise input vectors x and the desired output vectors y; training is performed until the neural network 'learns' to associate each input vector x with its corresponding and desired output vector y." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"Exposing a neural computing system to a set of example stimuli to achieve a particular user-defined goal." (Guido J Deboeck and Teuvo Kohonen, "Visual explorations in finance with self-organizing maps", 2000)

"The process used to configure an artificial neural network by repeatedly exposing it to sample data. In feed-forward networks, as each incoming vector or individual input is processed, the network produces an output for that case. With each pass of every case vector in a sample (see epoch), connection weights between neurons are modified. A typical training regime may require tens to thousands of complete epochs before the network converges (see convergence)." (David Scarborough & Mark J Somers, "Neural Networks in Organizational Research: Applying Pattern Recognition to the Analysis of Organizational Behavior", 2006)

"The process a data mining model uses to estimate model parameters by evaluating a set of known and predictable data." (Microsoft, "SQL Server 2012 Glossary", 2012)

"In data mining, the process of fitting a model to data. This is an iterative process and may involve thousands of iterations or more." (Meta S Brown, "Data Mining For Dummies", 2014)

"The process of adjusting the weights and threshold values in a neural net to get a desired outcome" (Nell Dale & John Lewis, "Computer Science Illuminated" 6th Ed., 2015)

"Model training is the process of fitting a model to data." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"Model Training is how artificial intelligence (AI) is taught to perform its tasks, and in many ways follows the same process that new human recruits must also undergo. AI training data needs to be unbiased and comprehensive to ensure that the AI’s actions and decisions do not unintentionally disadvantage a set of people. A key feature of responsible AI is the ability to demonstrate how an AI has been trained." (Accenture)

10 February 2018

🔬Data Science: Data Mining (Definitions)

"The non-trivial extraction of implicit, previously unknown, and potentially useful information from data" (Frawley et al., "Knowledge discovery in databases: An overview", 1991)

"Data mining is the efficient discovery of valuable, nonobvious information from a large collection of data." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Data mining is the process of examining large amounts of aggregated data. The objective of data mining is to either predict what may happen based on trends or patterns in the data or to discover interesting correlations in the data." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"A data-driven approach to analysis and prediction by applying sophisticated techniques and algorithms to discover knowledge." (Paulraj Ponniah, "Data Warehousing Fundamentals", 2001)

"A class of undirected queries, often against the most atomic data, that seek to find unexpected patterns in the data. The most valuable results from data mining are clustering, classifying, estimating, predicting, and finding things that occur together. There are many kinds of tools that play a role in data mining. The principal tools include decision trees, neural networks, memory- and cased-based reasoning tools, visualization tools, genetic algorithms, fuzzy logic, and classical statistics. Generally, data mining is a client of the data warehouse." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"The discovery of information hidden within data." (William A Giovinazzo, "Internet-Enabled Business Intelligence", 2002)

"the process of extracting valid, authentic, and actionable information from large databases." (Seth Paul et al. "Preparing and Mining Data with Microsoft SQL Server 2000 and Analysis", 2002)

"Advanced analysis or data mining is the analysis of detailed data to detect patterns, behaviors, and relationships in data that were previously only partially known or at times totally unknown." (Margaret Y Chu, "Blissful Data", 2004)

"Analysis of detail data to discover relationships, patterns, or associations between values." (Margaret Y Chu, "Blissful Data ", 2004)

"An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques, and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"the process of analyzing large amounts of data in search of previously undiscovered business patterns." (William H Inmon, "Building the Data Warehouse", 2005)

"A type of advanced analysis used to determine certain patterns within data. Data mining is most often associated with predictive analysis based on historical detail, and the generation of models for further analysis and query." (Jill Dyché & Evan Levy, "Customer Data Integration", 2006)

"Refers to the process of identifying nontrivial facts, patterns and relationships from large databases. The databases have often been put together for a different purpose from the data mining exercise." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"Data mining is the process of discovering implicit patterns in data stored in data warehouse and using those patterns for business advantage such as predicting future trends." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"Digging through data (usually in a data warehouse or data mart) to identify interesting patterns." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"Intelligently analyzing data to extract hidden trends, patterns, and information. Commonly used by statisticians, data analysts and Management Information Systems communities." (Craig F Smith & H Peter Alesso, "Thinking on the Web: Berners-Lee, Gödel and Turing", 2008)

"The process of extracting valid, authentic, and actionable information from large databases." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"The process of retrieving relevant data to make intelligent decisions." (Robert D Schneider & Darril Gibson, "Microsoft SQL Server 2008 All-in-One Desk Reference For Dummies", 2008)

"A process that minimally has four stages: (1) data preparation that may involve 'data cleaning' and even 'data transformation', (2) initial exploration of the data, (3) model building or pattern identification, and (4) deployment, which means subjecting new data to the 'model' to predict outcomes of cases found in the new data." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"Automatically searching large volumes of data for patterns or associations." (Mark Olive, "SHARE: A European Healthgrid Roadmap", 2009)

"The use of machine learning algorithms to find faint patterns of relationship between data elements in large, noisy, and messy data sets, which can lead to actions to increase benefit in some form (diagnosis, profit, detection, etc.)." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"A data-driven approach to analysis and prediction by applying sophisticated techniques and algorithms to discover knowledge." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010) 

"A way of extracting knowledge from a database by searching for correlations in the data and presenting promising hypotheses to the user for analysis and consideration." (Toby J Teorey, "Database Modeling and Design" 4th Ed., 2010)

"The process of using mathematical algorithms (usually implemented in computer software) to attempt to transform raw data into information that is not otherwise visible (for example, creating a query to forecast sales for the future based on sales from the past)." (Ken Withee, "Microsoft Business Intelligence For Dummies", 2010)

"A process that employs automated tools to analyze data in a data warehouse and other sources and to proactively identify possible relationships and anomalies." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"Process of analyzing data from different perspectives and summarizing it into useful information (e.g., information that can be used to increase revenue, cuts costs, or both)." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

"The process of sifting through large amounts of data using pattern recognition, fuzzy logic, and other knowledge discovery statistical techniques to identify previously unknown, unsuspected, and potentially meaningful data content relationships and trends." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Data mining, a branch of computer science, is the process of extracting patterns from large data sets by combining statistical analysis and artificial intelligence with database management. Data mining is seen as an increasingly important tool by modern business to transform data into business intelligence giving an informational advantage." (T T Wong & Loretta K W Sze, "A Neuro-Fuzzy Partner Selection System for Business Social Networks", 2012)

"Field of analytics with structured data. The model inference process minimally has four stages: data preparation, involving data cleaning, transformation and selection; initial exploration of the data; model building or pattern identification; and deployment, putting new data through the model to obtain their predicted outcomes." (Gary Miner et al, "Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications", 2012)

"The process of identifying commercially useful patterns or relationships in databases or other computer repositories through the use of advanced statistical tools." (Microsoft, "SQL Server 2012 Glossary", 2012)

"The process of exploring and analyzing large amounts of data to find patterns." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"An umbrella term for analytic techniques that facilitate fast pattern discovery and model building, particularly with large datasets." (Meta S Brown, "Data Mining For Dummies", 2014)

"Analysis of large quantities of data to find patterns such as groups of records, unusual records, and dependencies" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"The practice of analyzing big data using mathematical models to develop insights, usually including machine learning algorithms as opposed to statistical methods."(Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"Data mining is the analysis of data for relationships that have not previously been discovered." (Piyush K Shukla & Madhuvan Dixit, "Big Data: An Emerging Field of Data Engineering", Handbook of Research on Security Considerations in Cloud Computing, 2015)

"A methodology used by organizations to better understand their customers, products, markets, or any other phase of the business." (Adam Gordon, "Official (ISC)2 Guide to the CISSP CBK" 4th Ed., 2015)

"Extracting information from a database to zero in on certain facts or summarize a large amount of data." (Faithe Wempen, "Computing Fundamentals: Introduction to Computers", 2015)

"It refers to the process of identifying and extracting patterns in large data sets based on artificial intelligence, machine learning, and statistical techniques." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"The process of exploring and analyzing large amounts of data to find patterns." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"Term used to describe analyzing large amounts of data to find patterns, correlations, and similarities." (Brittany Bullard, "Style and Statistics", 2016)

"The process of extracting meaningful knowledge from large volumes of data contained in data warehouses." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A class of analytical applications that help users search for hidden patterns in a data set. Data mining is a process of analyzing large amounts of data to identify data–content relationships. Data mining is one tool used in decision support special studies. This process is also known as data surfing or knowledge discovery." (Daniel J Power & Ciara Heavin, "Decision Support, Analytics, and Business Intelligence" 3rd Ed., 2017)

"The process of collecting, searching through, and analyzing a large amount of data in a database to discover patterns or relationships." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"Data mining involves finding meaningful patterns and deriving insights from large data sets. It is closely related to analytics. Data mining uses statistics, machine learning, and artificial intelligence techniques to derive meaningful patterns." (Amar Sahay, "Business Analytics" Vol. I, 2018)

"The analysis of the data held in data warehouses in order to produce new and useful information." (Shon Harris & Fernando Maymi, "CISSP All-in-One Exam Guide" 8th Ed., 2018)

"The process of collecting critical business information from a data source, correlating the information, and uncovering associations, patterns, and trends." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"The process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems." (Dmitry Korzun et al, "Semantic Methods for Data Mining in Smart Spaces", 2019)

"A technique using software tools geared for the user who typically does not know exactly what he's searching for, but is looking for particular patterns or trends. Data mining is the process of sifting through large amounts of data to produce data content relationships. It can predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. This is also known as data surfing." (Information Management)

"An analytical process that attempts to find correlations or patterns in large data sets for the purpose of data or knowledge discovery." (NIST SP 800-53)

"Extracting previously unknown information from databases and using that data for important business decisions, in many cases helping to create new insights." (Solutions Review)

"is the process of collecting data, aggregating it according to type and sorting through it to identify patterns and predict future trends." (Accenture)

"the process of analyzing large batches of data to find patterns and instances of statistical significance. By utilizing software to look for patterns in large batches of data, businesses can learn more about their customers and develop more effective strategies for acquisition, as well as increase sales and decrease overall costs." (Insight Software)

"The process of identifying commercially useful patterns or relationships in databases or other computer repositories through the use of advanced statistical tools." (Microsoft)

"The process of pulling actionable insight out of a set of data and putting it to good use. This includes everything from cleaning and organizing the data; to analyzing it to find meaningful patterns and connections; to communicating those connections in a way that helps decision-makers improve their product or organization." (KDnuggets)

"Data mining is the process of analyzing hidden patterns of data according to different perspectives for categorization into useful information, which is collected and assembled in common areas, such as data warehouses, for efficient analysis, data mining algorithms, facilitating business decision making and other information requirements to ultimately cut costs and increase revenue. Data mining is also known as data discovery and knowledge discovery." (Techopedia)

"Data mining is an automated analytical method that lets companies extract usable information from massive sets of raw data. Data mining combines several branches of computer science and analytics, relying on intelligent methods to uncover patterns and insights in large sets of information." (Sisense) [source]

"Data mining is the process of analyzing data from different sources and summarizing it into relevant information that can be used to help increase revenue and decrease costs. Its primary purpose is to find correlations or patterns among dozens of fields in large databases." (Logi Analytics) [source]

"Data mining is the process of analyzing massive volumes of data to discover business intelligence that helps companies solve problems, mitigate risks, and seize new opportunities." (Talend) [source]

"Data Mining is the process of collecting data, aggregating it according to type and sorting through it to identify patterns and predict future trends." (Accenture)

"Data mining is the process of discovering meaningful correlations, patterns and trends by sifting through large amounts of data stored in repositories. Data mining employs pattern recognition technologies, as well as statistical and mathematical techniques." (Gartner)

"Data mining is the process of extracting relevant patterns, deviations and relationships within large data sets to predict outcomes and glean insights. Through it, companies convert big data into actionable information, relying upon statistical analysis, machine learning and computer science." (snowflake) [source]

"Data mining is the work of analyzing business information in order to discover patterns and create predictive models that can validate new business insights. […] Unlike data analytics, in which discovery goals are often not known or well defined at the outset, data mining efforts are usually driven by a specific absence of information that can’t be satisfied through standard data queries or reports. Data mining yields information from which predictive models can be derived and then tested, leading to a greater understanding of the marketplace." (Informatica) [source]

01 February 2018

🔬Data Science: Data Analysis (Definitions)

"Obtaining information from measured or observed data." (Ildiko E  Frank & Roberto Todeschini, "The Data Analysis Handbook", 1994)

"Refers to the process of organizing, summarizing and visualizing data in order to draw conclusions and make decisions." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A combination of human activities and computer processes that answer a research question or confirm a research hypotheses. It answers the question from data files, using empirical methods such as correlation, t-test, content analysis, or Mill’s method of agreement." (Jens Mende, "Data Flow Diagram Use to Plan Empirical Research Projects", 2009)

"The study and presentation of data to create information and knowledge." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Process of applying statistical techniques to evaluate data." (Sally-Anne Pitt, "Internal Audit Quality", 2014)

"Research phase in which data gathered from observing participants are analysed, usually with statistical procedures." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"Data analysis is the process of creating meaning from data. […] Data analysis is the process of creating information from data through the creation of data models and mathematics to find patterns." (Michael Heydt, "Learning Pandas" 2nd Ed, 2017)

"Data analysis is the process of organizing, cleaning, transforming, and modeling data to obtain useful information and ultimately, new knowledge." (John R. Hubbard, Java Data Analysis, 2017)

"Techniques used to organize, assess, and evaluate data and information." (Project Management Institute, "A Guide to the Project Management Body of Knowledge (PMBOK® Guide )", 2017)

"This is a class of statistical methods that make it possible to process a very large volume of data and identify the most interesting aspects of its structure. Some methods help to extract relations between different sets of data, and thus, draw statistical information that makes it possible to describe the most important information contained in the data in the most succinct manner possible. Other techniques make it possible to group data in order to identify its common denominators clearly, and thereby understand them better." (Soraya Sedkaoui, "Big Data Analytics for Entrepreneurial Success", 2019)

"The process and techniques for transforming and evaluating information using qualitative or quantitative tools to discover findings or inform conclusions." (Tiffany J Cresswell-Yeager & Raymond J Bandlow, "Transformation of the Dissertation: From an End-of-Program Destination to a Program-Embedded Process", 2020)

"Data Analysis is a process of gathering and extracting information from the data already present in different ways and order to study the pattern occurs." (Kirti R Bhatele, "Data Analysis on Global Stratification", 2020)

"A data lifecycle stage that involves the techniques that produce synthesized knowledge from organized information. A process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains." (CODATA)

"is the process of inspecting, cleansing, transforming, and modeling data to discover useful information, and support decision-making. The many different types of data analysis include data mining, a predictive technique used for modeling and knowledge discovery, and business intelligence, which relies on aggregation and focuses on business information." (Accenture)

"This discipline is the little brother of data science. Data analysis is focused more on answering questions about the present and the past. It uses less complex statistics and generally tries to identify patterns that can improve an organization." (KDnuggets)

"Data Analysis is the process of inspecting, cleansing, transforming, and modeling data to discover useful information, and support decision-making. The many different types of data analysis include data mining, a predictive technique used for modeling and knowledge discovery, and business intelligence, which relies on aggregation and focuses on business information." (Accenture)


17 June 2015

📊Business Intelligence: Advanced Analytics (Definitions)

"A subset of analytical techniques that, among other things, often uses statistical methods to identify and quantify the influence and significance of relationships between items of interest, groups similar items together, creates predictions, and identifies mathematical optimal or near-optimal answers to business problems." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"Algorithms for complex analysis of either structured or unstructured data. It includes sophisticated statistical models, machine learning, neural networks, text analytics, and other advanced data-mining techniques Advanced analytics does not include database query and reporting and OLAP cubes." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A subset of analytical techniques that, among other things, often uses statistical methods to identify and quantify the influence and significant of relationships between items of interest, group similar items together, create predictions, and identify mathematical optimal or near-optimal answers to business problems." (Evan Stubbs, "Big Data, Big Innovation", 2014)

"Advanced Analytics is the autonomous or semi-autonomous examination of data or content using sophisticated techniques and tools, typically beyond those of traditional business intelligence (BI), to discover deeper insights, make predictions, or generate recommendations. Advanced analytic techniques include those such as data/text mining, machine learning, pattern matching, forecasting, visualization, semantic analysis, sentiment analysis, network and cluster analysis, multivariate statistics, graph analysis, simulation, complex event processing, neural networks. (Gartner)

"Analytic techniques and technologies that apply statistical and/or machine learning algorithms that allow firms to discover, evaluate, and optimize models that reveal and/or predict new insights." (Forrester)

"Advanced analytics describes data analysis that goes beyond simple mathematical calculations such as sums and averages, or filtering and sorting. Advanced analyses use mathematical and statistical formulas and algorithms to generate new information, to recognize patterns, and also to predict outcomes and their respective probabilities." (BI-Survey) [source]

"Advanced analytics is an umbrella term for a group of high-level methods and tools that can help you get more out of your data. The predictive capabilities of advanced analytics can be used to forecast trends, events, and behaviors. This gives organizations the ability to perform advanced statistical models such as 'what-if' calculations, as well as to future-proof various aspects of their operations." (Sisense) [source]

15 April 2015

📊Business Intelligence: Text Analytics (Definitions)

"A technique whereby software employs linguistics and pattern detection techniques to impute some larger meaning to the words in a document. Entity extraction and document categorization are two emerging types of text analytics." (Mike Moran & Bill Hunt , "Search Engine Marketing, Inc", 2005)

"Transforms unstructured text into structured 'text data' that can then be searched, mined, or discovered." (Linda Volonino & Efraim Turban, "Information Technology for Management 8th Ed", 2011)

"The process of analyzing unstructured text, extracting relevant information, and transforming it into structured information that can be leveraged in various ways." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"Refers generally to the process of deriving patterns and trends from unstructured content such as notes, reports, and comments." (Jim Davis & Aiman Zeid, "Business Transformation: A Roadmap for Maximizing Organizational Insights", 2014)

"The practice of analyzing unstructured data." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"Text analytics a variety of computer-based techniques designed to deriving information from text sources." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information that can be leveraged in various ways." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"The process of deriving insights from large volumes of text, typically through the use of specialized software to identify patterns, trends, and sentiment. " (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

[AI-based text analytics:] "Machine-learning and rules-based analytics technology that mines semistructured and unstructured text data sources and extracts structured information (such as keywords, concepts, entities, topics, sentiment, emotion, and intent) to analyze the findings for correlations, trends, outliers, patterns, and anomalies." (Forrester)

"A subset of natural language processing (NLP) technologies that identifies structures and patterns in text and transforms them into actionable insights to drive better business outcomes." (Forrester)

"Text analytics is the process of deriving information from text sources. It is used for several purposes, such as: summarization (trying to find the key content across a larger body of information or a single document), sentiment analysis (what is the nature of commentary on an issue), explicative (what is driving that commentary), investigative (what are the particular cases of a specific issue) and classification (what subject or what key content pieces does the text talk about)." (Gartner) 

