SQL Troubles: variables

Showing posts with label variables. Show all posts

05 May 2026

🤖〽️Prompt Engineering: Copilot Unabridged (Part 179: How AI Can Detect Interactions Between Multiple Variables in Complex Patterns)

Prompt Engineering Series

Prompt: "write a post of 600 words on how AI can detect interactions between multiple variables in complex patterns"

Introduction

One of the most powerful capabilities of modern Artificial Intelligence (AI) is its ability to detect interactions between multiple variables - interactions that are subtle, nonlinear, and often invisible to human intuition. Humans are good at spotting simple relationships: when one variable increases, another tends to rise or fall. But real‑world systems rarely behave so cleanly. Instead, outcomes often emerge from the interplay of many factors acting together, sometimes reinforcing each other, sometimes canceling each other out, and sometimes producing effects that only appear under very specific conditions. AI excels in precisely this territory. Its architecture allows it to uncover complex, multi‑variable interactions that would otherwise remain hidden.

The first reason AI can detect these interactions is its ability to analyze high‑dimensional data without cognitive limits. Humans can reason about two or three variables at a time, but beyond that, our intuition collapses. AI systems, especially deep learning models, can process hundreds or thousands of variables simultaneously. They can map how changes in one variable influence another, not in isolation, but in combination with many others. This is essential in fields like genomics, where the effect of a single gene may depend on the presence of dozens of others, or in economics, where market behavior emerges from the interplay of countless signals.

A second advantage lies in AI’s capacity to model nonlinear relationships. Interactions between variables are rarely linear. The effect of one variable may depend on the level of another, creating curved, threshold‑based, or conditional relationships. Traditional statistical methods often struggle with these nonlinearities unless explicitly instructed to look for them. AI models, by contrast, naturally capture nonlinear interactions through their layered structure. Neural networks, for example, learn complex transformations at each layer, allowing them to detect relationships that bend, twist, or reverse depending on context. This flexibility enables AI to uncover interactions that humans would never think to test.

Another key factor is AI’s ability to detect higher‑order interactions - relationships that involve not just pairs of variables, but combinations of three, four, or more. These higher‑order interactions are common in complex systems. For example, a medical treatment might be effective only when a patient has a specific genetic profile and a particular environmental exposure and a certain lifestyle pattern. Humans rarely detect such interactions because they require examining an enormous number of possible combinations. AI, however, can explore these combinations efficiently, identifying the rare configurations that produce meaningful effects.

AI also excels at local pattern detection, which is crucial for identifying interactions that appear only under specific conditions. Humans tend to look for global rules that apply everywhere. AI can break a dataset into many small regions and learn different relationships in each one. A variable might matter only when another variable crosses a certain threshold, or only within a particular subgroup. Models like decision trees, random forests, and gradient boosting machines are particularly good at uncovering these conditional interactions. They reveal patterns that are invisible when looking at the dataset as a whole.

A further strength comes from AI’s ability to integrate heterogeneous data sources. Interactions often span different types of information - numerical measurements, text, images, signals, or categorical variables. Humans struggle to combine such diverse inputs. AI systems, however, can fuse them into a unified representation, allowing interactions to emerge across modalities. This is especially valuable in fields like healthcare, where symptoms, lab results, imaging data, and patient history interact in complex ways.

Finally, AI’s ability to detect multi‑variable interactions is amplified by continuous learning. As new data arrives, AI systems can update their internal models, refining their understanding of how variables interact. This dynamic adaptation allows them to track evolving systems where interactions shift over time.

AI’s ability to detect interactions between multiple variables is not a replacement for human insight. Instead, it expands our analytical reach, revealing structures that lie beyond the limits of intuition. When humans and AI collaborate - combining human judgment with machine‑level pattern detection - we gain a deeper, more accurate understanding of the complex systems that shape our world.

Disclaimer: The whole text was generated by Copilot (under Windows 11) at the first attempt. This is just an experiment to evaluate feature's ability to answer standard general questions, independently on whether they are correctly or incorrectly posed. Moreover, the answers may reflect hallucinations and other types of inconsistent or incorrect reasoning.

Previous Post <<||>> Next Post

01 May 2026

🤖〽️Prompt Engineering: Copilot Unabridged (Part 175: The Power of Scale: How AI Detects Weak Correlations Humans Miss)

Prompt Engineering Series

Prompt: "write a post of 600 words on how AI can detect weak correlations that appear only across large samples"

Introduction

Artificial Intelligence (AI) is exceptionally good at uncovering weak correlations that only emerge when you analyze massive datasets, and this ability is reshaping how organizations understand patterns, make predictions, and uncover hidden drivers of behavior. At its core, the challenge with weak correlations is that they are often too subtle to detect with traditional statistical methods, especially when analysts are limited by human attention, computational constraints, or the tendency to focus on variables that seem intuitively important. AI changes that dynamic by bringing scale, speed, and pattern‑recognition capabilities that far exceed what humans can do manually.

Weak correlations typically hide in high‑dimensional data - datasets with hundreds or thousands of variables, each interacting in complex ways. A single variable might show almost no predictive power on its own, but when combined with dozens of others, it can contribute meaningfully to a model’s accuracy. Humans struggle to reason about these multi‑variable interactions because our intuition tends to focus on strong, obvious relationships. AI, especially machine learning models, has no such limitation. It can evaluate millions of combinations of features, test them against historical outcomes, and identify subtle signals that would otherwise be lost in noise.

One of the most powerful techniques for detecting weak correlations is ensemble learning, where multiple models - each with different strengths - work together. A single decision tree might miss a faint pattern, but a forest of hundreds of trees can collectively detect it. Similarly, gradient boosting methods build models sequentially, with each new model focusing on the errors of the previous ones. This iterative refinement allows the system to pick up on small, incremental improvements that accumulate into meaningful predictive power.

Deep learning takes this even further. Neural networks excel at identifying non‑linear relationships, where the effect of one variable depends on the value of another. These relationships often appear weak or nonexistent when viewed in isolation. But when a neural network processes them through multiple layers of transformations, the combined effect becomes clear. This is why deep learning models can detect faint signals in areas like fraud detection, medical imaging, and natural language processing - domains where the patterns are too subtle or complex for traditional analytics.

Another advantage of AI is its ability to work with large sample sizes without being overwhelmed. Weak correlations often require millions of data points before they become statistically meaningful. For humans, analyzing such datasets is impractical. For AI, it’s routine. Modern machine learning frameworks can process enormous datasets efficiently, allowing models to learn from patterns that only emerge at scale. This is particularly valuable in fields like e‑commerce, where tiny behavioral signals - such as the time between clicks or the order in which products are viewed - can predict customer intent when aggregated across millions of sessions.

AI also benefits from techniques like regularization, which help prevent models from overfitting to noise. When searching for weak correlations, the risk is that a model might latch onto random fluctuations rather than meaningful patterns. Regularization methods penalize overly complex models, ensuring that only correlations that consistently improve predictive accuracy across many samples are retained. This balance between flexibility and discipline is essential for detecting subtle but real relationships.

Finally, AI’s ability to detect weak correlations has profound implications for decision‑making. It enables organizations to identify early warning signals, personalize experiences at scale, and uncover hidden drivers of outcomes. These insights often lead to competitive advantages because they reveal opportunities that competitors overlook.

In a world where data continues to grow exponentially, the ability to detect faint patterns across massive samples is becoming one of the most valuable capabilities in analytics. AI doesn’t just make this possible - it makes it practical, reliable, and increasingly essential for anyone seeking deeper understanding in complex environments.

Previous Post <<||>> Next Post

20 February 2025

💠🛠️🗒️SQL Server: Folding [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources. It considers only on-premise SQL Server, for other platforms please refer to the documentation.

Last updated: 20-Feb-2024

[SQL Server] Folding

{def} the process by which the optimizer is able to properly determine certain types of indexable expressions even when the column in the expression is involved in a subexpression or nestled in a function

is an optimization over older versions of SQL Server in which the optimizer was unable to use an index to service a query clause when the table column was involved in an expression or buried in a function [1]

{type} constant folding

some constant expression is evaluated early
foldable constant expressions [2]

arithmetical expressions
logical expressions
built-in functions whose input doesn’t depend of contextual information,

e. g. SET options, language settings, database options, encryption keys
deterministic built-in functions are foldable, with some exceptions

certain forms of the LIKE predicate
[SQL Server 2012+] deterministic methods of CLR user-defined types [3]
[SQL Server 2012+] deterministic scalar-valued CLR user-defined functions [3]

nonfoldable expressions [2]

expressions whose results depend on a local variable or parameter
user-defined functions

both T-SQL and CLR

expressions whose results depend on language settings.
expressions whose results depend on SET options.
expressions whose results depend on server configuration options.
nonconstant expressions such as an expression whose result depends on the value of a column.
nondeterministic functions
if the output is a large object type, then the expressions are not folded

e.g. text, image, nvarchar(max), varchar(max), varbinary(max), XML

{benefit} the expression does not have to be evaluated repeatedly at run time [2]
{benefit} the value of the expression after it is evaluated is used by the query optimizer to estimate the size of the result set of the portion of the query [2]

e.g. TotalDue > 117.00 + 1000.00

{type} nonconstant folding

some expressions that are not constant folded but whose arguments are known at compile time, whether the arguments are parameters or constants, are evaluated by the result-set size (cardinality) estimator that is part of the optimizer during optimization [2]
deterministic functions:

e.g. UPPER, LOWER, RTRIM
e.g. DATEPART( YY only ), GetDate, CAST, CONVERT

operators

arithmetic operators: +, -, *, /, unary -,
logical Operators: AND, OR, NOT
comparison operators: <, >, <=, >=, <>, LIKE, IS NULL, IS NOT NULL

Previous Post <<||>> Next Post

References:

[1] Ken Henderson (2003) Guru's Guide to SQL Server Architecture and Internals

[2] Microsoft Learn (2012) SQL Server: Troubleshooting Poor Query Performance: Constant Folding and Expression Evaluation During Cardinality Estimation [link]
[3] Microsoft Learn (2025) SQL Server: Query processing architecture guide [link]
[4] SQLShack (2021) Query Optimization in SQL Server for beginners, by Esat Erkec [link]

05 December 2018

🔭Data Science: Variables (Just the Quotes)

"Every scientific problem can be stated most clearly if it is thought of as a search for the nature of the relation between two definitely stated variables. Very often a scientific problem is felt and stated in other terms, but it cannot be so clearly stated in any way as when it is thought of as a function by which one variable is shown to be dependent upon or related to some other variable." (Louis L Thurstone, "The Fundamentals of Statistics", 1925)

"[Disorganized complexity] is a problem in which the number of variables is very large, and one in which each of the many variables has a behavior which is individually erratic, or perhaps totally unknown. However, in spite of this helter-skelter, or unknown, behavior of all the individual variables, the system as a whole possesses certain orderly and analyzable average properties. [...] [Organized complexity is] not problems of disorganized complexity, to which statistical methods hold the key. They are all problems which involve dealing simultaneously with a sizable number of factors which are interrelated into an organic whole. They are all, in the language here proposed, problems of organized complexity." (Warren Weaver, "Science and Complexity", American Scientist Vol. 36, 1948)

"The primary purpose of a graph is to show diagrammatically how the values of one of two linked variables change with those of the other. One of the most useful applications of the graph occurs in connection with the representation of statistical data." (John F Kenney & E S Keeping, "Mathematics of Statistics" Vol. I 3rd Ed., 1954)

"The well-known virtue of the experimental method is that it brings situational variables under tight control. It thus permits rigorous tests of hypotheses and confidential statements about causation. The correlational method, for its part, can study what man has not learned to control. Nature has been experimenting since the beginning of time, with a boldness and complexity far beyond the resources of science. The correlator’s mission is to observe and organize the data of nature’s experiments." (Lee J Cronbach, "The Two Disciplines of Scientific Psychology", The American Psychologist Vol. 12, 1957)

"A satisfactory prediction of the sequential properties of learning data from a single experiment is by no means a final test of a model. Numerous other criteria - and some more demanding - can be specified. For example, a model with specific numerical parameter values should be invariant to changes in independent variables that explicitly enter in the model." (Robert R Bush & Frederick Mosteller,"A Comparison of Eight Models?", Studies in Mathematical Learning Theory, 1959)

"[A] sequence is random if it has every property that is shared by all infinite sequences of independent samples of random variables from the uniform distribution." (Joel N Franklin, 1962)

"The most valuable use of such [mathematical] models usually lies less in turning out the answer in an uncertain world than in shedding light on how much difference an alteration in the assumptions and/or variables used would make in the answer yielded by the models." (Edward G. Bennion, "New Decision-Making Tools for Managers", 1963)

"Most of our beliefs about complex organizations follow from one or the other of two distinct strategies. The closed-system strategy seeks certainty by incorporating only those variables positively associated with goal achievement and subjecting them to a monolithic control network. The open-system strategy shifts attention from goal achievement to survival and incorporates uncertainty by recognizing organizational interdependence with environment. A newer tradition enables us to conceive of the organization as an open system, indeterminate and faced with uncertainty, but subject to criteria of rationality and hence needing certainty." (James D Thompson, "Organizations in Action", 1967)

"The less we understand a phenomenon, the more variables we require to explain it." (Russell L Ackoff, "Management Science", 1967)

"To model the dynamic behavior of a system, four hierarchies of structure should be recognized: closed boundary around the system; feedback loops as the basic structural elements within the boundary; level variables representing accumulations within the feedback loops; rate variables representing activity within the feedback loops." (Jay W Forrester, "Urban Dynamics", 1969)

"However, and conversely, our models fall far short of representing the world fully. That is why we make mistakes and why we are regularly surprised. In our heads, we can keep track of only a few variables at one time. We often draw illogical conclusions from accurate assumptions, or logical conclusions from inaccurate assumptions. Most of us, for instance, are surprised by the amount of growth an exponential process can generate. Few of us can intuit how to damp oscillations in a complex system." (Donella H Meadows, "Limits to Growth", 1972)

"It is not always appreciated that the problem of theory building is a constant interaction between constructing laws and finding an appropriate set of descriptive state variables such that laws can be constructed." (Richard C Lewontin, "The Genetic Basis of Evolutionary Change", 1974)

"A mature science, with respect to the matter of errors in variables, is not one that measures its variables without error, for this is impossible. It is, rather, a science which properly manages its errors, controlling their magnitudes and correctly calculating their implications for substantive conclusions." (Otis D Duncan, "Introduction to Structural Equation Models", 1975)

"A system may be specified in either of two ways. In the first, which we shall call a state description, sets of abstract inputs, outputs and states are given, together with the action of the inputs on the states and the assignments of outputs to states. In the second, which we shall call a coordinate description, certain input, output and state variables are given, together with a system of dynamical equations describing the relations among the variables as functions of time. Modern mathematical system theory is formulated in terms of state descriptions, whereas the classical formulation is typically a coordinate description, for example a system of differential equations." (E S Bainbridge, "The Fundamental Duality of System Theory", 1975)

"Managers construct, rearrange, single out, and demolish many objective features of their surroundings. When people act they unrandomize variables, insert vestiges of orderliness, and literally create their own constraints." (Karl E Weick, "Social Psychology of Organizing", 1979)

"The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"The formal structure of a decision problem in any area can be put into four parts: (1) the choice of an objective function denning the relative desirability of different outcomes; (2) specification of the policy alternatives which are available to the agent, or decisionmaker, (3) specification of the model, that is, empirical relations that link the objective function, or the variables that enter into it, with the policy alternatives and possibly other variables; and (4) computational methods for choosing among the policy alternatives that one which performs best as measured by the objective function." (Kenneth Arrow, "The Economics of Information", 1984)

"A mechanistic model has the following advantages: 1. It contributes to our scientiﬁc understanding of the phenomenon under study. 2. It usually provides a better basis for extrapolation (at least to conditions worthy of further experimental investigation if not through the entire range of all input variables). 3. It tends to be parsimonious (i.e, frugal) in the use of parameters and to provide better estimates of the response." (George E P Box, "Empirical Model-Building and Response Surfaces", 1987)

"A system of variables is 'interrelated' if an action that affects or meant to affect one part of the system will also affect other parts of it. Interrelatedness guarantees that an action aimed at one variable will have side effects and long-term repercussions. A large number of variables will make it easy to overlook them." (Dietrich Dorner, "The Logic of Failure: Recognizing and Avoiding Error in Complex Situations", 1989)

"The real leverage in most management situations lies in understanding dynamic complexity, not detail complexity. […] Unfortunately, most 'systems analyses' focus on detail complexity not dynamic complexity. Simulations with thousands of variables and complex arrays of details can actually distract us from seeing patterns and major interrelationships. In fact, sadly, for most people 'systems thinking' means 'fighting complexity with complexity', devising increasingly 'complex' (we should really say 'detailed') solutions to increasingly 'complex' problems. In fact, this is the antithesis of real systems thinking." (Peter M Senge, "The Fifth Discipline: The Art and Practice of the Learning Organization", 1990)

"Industrial managers faced with a problem in production control invariably expect a solution to be devised that is simple and unidimensional. They seek the variable in the situation whose control will achieve control of the whole system: tons of throughput, for example. Business managers seek to do the same thing in controlling a company; they hope they have found the measure of the entire system when they say 'everything can be reduced to monetary terms'." (Stanford Beer, "Decision and Control", 1994)

"Complex adaptive systems have the property that if you run them - by just letting the mathematical variable of 'time' go forward - they'll naturally progress from chaotic, disorganized, undifferentiated, independent states to organized, highly differentiated, and highly interdependent states. Organized structures emerge spontaneously. [...]A weak system gives rise only to simpler forms of self-organization; a strong one gives rise to more complex forms, like life. (J Doyne Farmer, "The Third Culture: Beyond the Scientific Revolution", 1995)

"In addition to dimensionality requirements, chaos can occur only in nonlinear situations. In multidimensional settings, this means that at least one term in one equation must be nonlinear while also involving several of the variables. With all linear models, solutions can be expressed as combinations of regular and linear periodic processes, but nonlinearities in a model allow for instabilities in such periodic solutions within certain value ranges for some of the parameters." (Courtney Brown, "Chaos and Catastrophe Theories", 1995)

"The greatest plus of data modeling is that it produces a simple and understandable picture of the relationship between the input variables and responses [...] different models, all of them equally good, may give different pictures of the relation between the predictor and response variables [...] One reason for this multiplicity is that goodness-of-fit tests and other methods for checking fit give a yes–no answer. With the lack of power of these tests with data having more than a small number of dimensions, there will be a large number of models whose fit is acceptable. There is no way, among the yes–no methods for gauging fit, of determining which is the better model." (Leo Breiman, "Statistical Modeling: The two cultures" Statistical Science 16(3), 2001)

"Trimming potentially theoretically meaningful variables is not advisable unless one is quite certain that the coefficient for the variable is near zero, that the variable is inconsequential, and that trimming will not introduce misspecification error." (James Jaccard, "Interaction Effects in Logistic Regression", 2001)

"A smaller model with fewer covariates has two advantages: it might give better predictions than a big model and it is more parsimonious (simpler). Generally, as you add more variables to a regression, the bias of the predictions decreases and the variance increases. Too few covariates yields high bias; this called underfitting. Too many covariates yields high variance; this called overfitting. Good predictions result from achieving a good balance between bias and variance. […] finding a good model involves trading of fit and complexity." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"Nonetheless, the basic principles regarding correlations between variables are not that difficult to understand. We must look for patterns that reveal potential relationships and for evidence that variables are actually related. But when we do spot those relationships, we should not jump to conclusions about causality. Instead, we need to weigh the strength of the relationship and the plausibility of our theory, and we must always try to discount the possibility of spuriousness." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Correlation analysis can help us find the size of the formal relation between two properties. An equidirectional variation is present if we observe high values of one variable together with high values of the other variable (or low ones combined with low ones). In this case there is a positive correlation. If high values are combined with low values and low values with high values, the variation is counterdirectional, and the correlation is negative." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Humans have difficulty perceiving variables accurately […]. However, in general, they tend to have inaccurate perceptions of system states, including past, current, and future states. This is due, in part, to limited ‘mental models’ of the phenomena of interest in terms of both how things work and how to influence things. Consequently, people have difficulty determining the full implications of what is known, as well as considering future contingencies for potential systems states and the long-term value of addressing these contingencies. " (William B. Rouse, "People and Organizations: Explorations of Human-Centered Design", 2007)

"To fulfill the requirements of the theory underlying uncertainties, variables with random uncertainties must be independent of each other and identically distributed. In the limiting case of an infinite number of such variables, these are called normally distributed. However, one usually speaks of normally distributed variables even if their number is finite." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Swarm intelligence can be effective when applied to highly complicated problems with many nonlinear factors, although it is often less effective than the genetic algorithm approach discussed later in this chapter. Swarm intelligence is related to swarm optimization […]. As with swarm intelligence, there is some evidence that at least some of the time swarm optimization can produce solutions that are more robust than genetic algorithms. Robustness here is defined as a solution’s resistance to performance degradation when the underlying variables are changed." (Michael J North & Charles M Macal, "Managing Business Complexity: Discovering Strategic Solutions with Agent-Based Modeling and Simulation", 2007)

"Graphical displays are often constructed to place principal focus on the individual observations in a dataset, and this is particularly helpful in identifying both the typical positions of data points and unusual or influential cases. However, in many investigations, principal interest lies in identifying the nature of underlying trends and relationships between variables, and so it is often helpful to enhance graphical displays in ways which give deeper insight into these features. This can be very beneficial both for small datasets, where variation can obscure underlying patterns, and large datasets, where the volume of data is so large that effective representation inevitably involves suitable summaries." (Adrian W Bowman, "Smoothing Techniques for Visualisation" [in "Handbook of Data Visualization"], 2008)

"All forms of complex causation, and especially nonlinear transformations, admittedly stack the deck against prediction. Linear describes an outcome produced by one or more variables where the effect is additive. Any other interaction is nonlinear. This would include outcomes that involve step functions or phase transitions. The hard sciences routinely describe nonlinear phenomena. Making predictions about them becomes increasingly problematic when multiple variables are involved that have complex interactions. Some simple nonlinear systems can quickly become unpredictable when small variations in their inputs are introduced." (Richard N Lebow, "Forbidden Fruit: Counterfactuals and International Relations", 2010)

"Given the important role that correlation plays in structural equation modeling, we need to understand the factors that affect establishing relationships among multivariable data points. The key factors are the level of measurement, restriction of range in data values (variability, skewness, kurtosis), missing data, nonlinearity, outliers, correction for attenuation, and issues related to sampling variation, confidence intervals, effect size, significance, sample size, and power." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"Outliers or influential data points can be defined as data values that are extreme or atypical on either the independent (X variables) or dependent (Y variables) variables or both. Outliers can occur as a result of observation errors, data entry errors, instrument errors based on layout or instructions, or actual extreme values from self-report data. Because outliers affect the mean, the standard deviation, and correlation coefficient values, they must be explained, deleted, or accommodated by using robust statistics." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"System dynamics is an approach to understanding the behaviour of over time. It deals with internal feedback loops and time delays that affect the behaviour of the entire system. It also helps the decision maker untangle the complexity of the connections between various policy variables by providing a new language and set of tools to describe. Then it does this by modeling the cause and effect relationships among these variables." (Raed M Al-Qirem & Saad G Yaseen, "Modelling a Small Firm in Jordan Using System Dynamics", 2010)

"There are several key issues in the field of statistics that impact our analyses once data have been imported into a software program. These data issues are commonly referred to as the measurement scale of variables, restriction in the range of data, missing data values, outliers, linearity, and nonnormality." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"There are three possible reasons for [the] absence of predictive power. First, it is possible that the models are misspecified. Second, it is possible that the model’s explanatory factors are measured at too high a level of aggregation [...] Third, [...] the search for statistically significant relationships may not be the strategy best suited for evaluating our model’s ability to explain real world events [...] the lack of predictive power is the result of too much emphasis having been placed on finding statistically significant variables, which may be overdetermined. Statistical significance is generally a flawed way to prune variables in regression models [...] Statistically significant variables may actually degrade the predictive accuracy of a model [...] [By using] models that are constructed on the basis of pruning undertaken with the shears of statistical significance, it is quite possible that we are winnowing our models away from predictive accuracy." (Michael D Ward et al, "The perils of policy by p-value: predicting civil conflicts" Journal of Peace Research 47, 2010)

"[…] a conceptual model is a diagram connecting variables and constructs based on theory and logic that displays the hypotheses to be tested." (Mary W Celsi et al, "Essentials of Business Research Methods", 2011)

"Complexity is a relative term. It depends on the number and the nature of interactions among the variables involved. Open loop systems with linear, independent variables are considered simpler than interdependent variables forming nonlinear closed loops with a delayed response." (Jamshid Gharajedaghi, "Systems Thinking: Managing Chaos and Complexity A Platform for Designing Business Architecture" 3rd Ed., 2011)

"Simplicity in a system tends to increase that system's efficiency. Because less can go wrong with fewer parts, less will. Complexity in a system tends to increase that system's inefficiency; the greater the number of variables, the greater the probability of those variables clashing, and in turn, the greater the potential for conflict and disarray. Because more can go wrong, more will. That is why centralized systems are inclined to break down quickly and become enmeshed in greater unintended consequences." (Lawrence K Samuels, "Defense of Chaos: The Chaology of Politics, Economics and Human Action", 2013)

"When statisticians, trained in math and probability theory, try to assess likely outcomes, they demand a plethora of data points. Even then, they recognize that unless it’s a very simple and controlled action such as flipping a coin, unforeseen variables can exert significant influence." (Zachary Karabell, "The Leading Indicators: A short history of the numbers that rule our world", 2014)

"A basic problem with MRA is that it typically assumes that the independent variables can be regarded as building blocks, with each variable taken by itself being logically independent of all the others. This is usually not the case, at least for behavioral data. […] Just as correlation doesn’t prove causation, absence of correlation fails to prove absence of causation. False-negative findings can occur using MRA just as false-positive findings do—because of the hidden web of causation that we’ve failed to identify." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"Accuracy and coherence are related concepts pertaining to data quality. Accuracy refers to the comprehensiveness or extent of missing data, performance of error edits, and other quality assurance strategies. Coherence is the degree to which data - item value and meaning are consistent over time and are comparable to similar variables from other routinely used data sources." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"One technique employing correlational analysis is multiple regression analysis (MRA), in which a number of independent variables are correlated simultaneously (or sometimes sequentially, but we won’t talk about that variant of MRA) with some dependent variable. The predictor variable of interest is examined along with other independent variables that are referred to as control variables. The goal is to show that variable A influences variable B 'net of' the effects of all the other variables. That is to say, the relationship holds even when the effects of the control variables on the dependent variable are taken into account." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The fundamental problem with MRA, as with all correlational methods, is self-selection. The investigator doesn’t choose the value for the independent variable for each subject (or case). This means that any number of variables correlated with the independent variable of interest have been dragged along with it. In most cases, we will fail to identify all these variables. In the case of behavioral research, it’s normally certain that we can’t be confident that we’ve identified all the plausibly relevant variables." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The correlational technique known as multiple regression is used frequently in medical and social science research. This technique essentially correlates many independent (or predictor) variables simultaneously with a given dependent variable (outcome or output). It asks, 'Net of the effects of all the other variables, what is the effect of variable A on the dependent variable?' Despite its popularity, the technique is inherently weak and often yields misleading results. The problem is due to self-selection. If we don’t assign cases to a particular treatment, the cases may differ in any number of ways that could be causing them to differ along some dimension related to the dependent variable. We can know that the answer given by a multiple regression analysis is wrong because randomized control experiments, frequently referred to as the gold standard of research techniques, may give answers that are quite different from those obtained by multiple regression analysis." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"The theory behind multiple regression analysis is that if you control for everything that is related to the independent variable and the dependent variable by pulling their correlations out of the mix, you can get at the true causal relation between the predictor variable and the outcome variable. That’s the theory. In practice, many things prevent this ideal case from being the norm." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"Validity of a theory is also known as construct validity. Most theories in science present broad conceptual explanations of relationship between variables and make many different predictions about the relationships between particular variables in certain situations. Construct validity is established by verifying the accuracy of each possible prediction that might be made from the theory. Because the number of predictions is usually infinite, construct validity can never be fully established. However, the more independent predictions for the theory verified as accurate, the stronger the construct validity of the theory." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"Decision trees are considered a good predictive model to start with, and have many advantages. Interpretability, variable selection, variable interaction, and the flexibility to choose the level of complexity for a decision tree all come into play." (Ralph Winters, "Practical Predictive Analytics", 2017)

"Multivariate analysis refers to incorporation of multiple exploratory variables to understand the behavior of a response variable. This seems to be the most feasible and realistic approach considering the fact that entities within this world are usually interconnected. Thus the variability in response variable might be affected by the variability in the interconnected exploratory variables." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"The degree to which one variable can be predicted from another can be calculated as the correlation between them. The square of the correlation (R^2) is the proportion of the variance of one that can be 'explained' by knowledge of the other." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"To be any good, a sample has to be representative. A sample is representative if every person or thing in the group you’re studying has an equally likely chance of being chosen. If not, your sample is biased. […] The job of the statistician is to formulate an inventory of all those things that matter in order to obtain a representative sample. Researchers have to avoid the tendency to capture variables that are easy to identify or collect data on - sometimes the things that matter are not obvious or are difficult to measure." (Daniel J Levitin, "Weaponized Lies", 2017)

"Variables which follow symmetric, bell-shaped distributions tend to be nice as features in models. They show substantial variation, so they can be used to discriminate between things, but not over such a wide range that outliers are overwhelming." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Bayesian networks inhabit a world where all questions are reducible to probabilities, or (in the terminology of this chapter) degrees of association between variables; they could not ascend to the second or third rungs of the Ladder of Causation. Fortunately, they required only two slight twists to climb to the top." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"We humans are reasonably good at defining rules that check one, two, or even three attributes (also commonly referred to as features or variables), but when we go higher than three attributes, we can start to struggle to handle the interactions between them. By contrast, data science is often applied in contexts where we want to look for patterns among tens, hundreds, thousands, and, in extreme cases, millions of attributes." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Decision trees show the breakdown of the data by one variable then another in a very intuitive way, though they are generally just diagrams that don’t actually encode data visually." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"One very common problem in data visualization is that encoding numerical variables to area is incredibly popular, but readers can’t translate it back very well." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"Random forests are essentially an ensemble of trees. They use many short trees, fitted to multiple samples of the data, and the predictions are averaged for each observation. This helps to get around a problem that trees, and many other machine learning techniques, are not guaranteed to find optimal models, in the way that linear regression is. They do a very challenging job of fitting non-linear predictions over many variables, even sometimes when there are more variables than there are observations. To do that, they have to employ 'greedy algorithms', which find a reasonably good model but not necessarily the very best model possible." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"Exponentially growing systems are prevalent in nature, spanning all scales from biochemical reaction networks in single cells to food webs of ecosystems. How exponential growth emerges in nonlinear systems is mathematically unclear. […] The emergence of exponential growth from a multivariable nonlinear network is not mathematically intuitive. This indicates that the network structure and the flux functions of the modeled system must be subjected to constraints to result in long-term exponential dynamics." (Wei-Hsiang Lin et al, "Origin of exponential growth in nonlinear reaction networks", PNAS 117 (45), 2020)

"Mathiness refers to formulas and expressions that may look and feel like math-even as they disregard the logical coherence and formal rigor of actual mathematics. […] These equations make mathematical claims that cannot be supported by positing formal relationships - variables interacting multiplicatively or additively, for example - between ill-defined and impossible-to-measure quantities. In other words, mathiness, like truthiness and like bullshit, involves a disregard for logic or factual accuracy." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"This problem with adding additional variables is referred to as the curse of dimensionality. If you add enough variables into your black box, you will eventually find a combination of variables that performs well - but it may do so by chance. As you increase the number of variables you use to make your predictions, you need exponentially more data to distinguish true predictive capacity from luck." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

More quotes on "Variables" at the-web-of-knowledge.blogspot.com.

03 December 2018

🔭Data Science: Regression (Just the Quotes)

"One feature [...] which requires much more justification than is usually given, is the setting up of unplausible null hypotheses. For example, a statistician may set out a test to see whether two drugs have exactly the same effect, or whether a regression line is exactly straight. These hypotheses can scarcely be taken literally." (Cedric A B Smith, "Book review of Norman T. J. Bailey: Statistical Methods in Biology", Applied Statistics 9, 1960)

"The method of least squares is used in the analysis of data from planned experiments and also in the analysis of data from unplanned happenings. The word 'regression' is most often used to describe analysis of unplanned data. It is the tacit assumption that the requirements for the validity of least squares analysis are satisfied for unplanned data that produces a great deal of trouble." (George E P Box, "Use and Abuse of Regression", 1966)

"[…] fitting lines to relationships between variables is often a useful and powerful method of summarizing a set of data. Regression analysis fits naturally with the development of causal explanations, simply because the research worker must, at a minimum, know what he or she is seeking to explain." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Logging size transforms the original skewed distribution into a more symmetrical one by pulling in the long right tail of the distribution toward the mean. The short left tail is, in addition, stretched. The shift toward symmetrical distribution produced by the log transform is not, of course, merely for convenience. Symmetrical distributions, especially those that resemble the normal distribution, fulfill statistical assumptions that form the basis of statistical significance testing in the regression model." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Logging skewed variables also helps to reveal the patterns in the data. […] the rescaling of the variables by taking logarithms reduces the nonlinearity in the relationship and removes much of the clutter resulting from the skewed distributions on both variables; in short, the transformation helps clarify the relationship between the two variables. It also […] leads to a theoretically meaningful regression coefficient." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"The logarithmic transformation serves several purposes: (1) The resulting regression coefficients sometimes have a more useful theoretical interpretation compared to a regression based on unlogged variables. (2) Badly skewed distributions - in which many of the observations are clustered together combined with a few outlying values on the scale of measurement - are transformed by taking the logarithm of the measurements so that the clustered values are spread out and the large values pulled in more toward the middle of the distribution. (3) Some of the assumptions underlying the regression model and the associated significance tests are better met when the logarithm of the measured variables is taken." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Graphical methodology provides powerful diagnostic tools for conveying properties of the fitted regression, for assessing the adequacy of the fit, and for suggesting improvements. There is seldom any prior guarantee that a hypothesized regression model will provide a good description of the mechanism that generated the data. Standard regression models carry with them many specific assumptions about the relationship between the response and explanatory variables and about the variation in the response that is not accounted for by the explanatory variables. In many applications of regression there is a substantial amount of prior knowledge that makes the assumptions plausible; in many other applications the assumptions are made as a starting point simply to get the analysis off the ground. But whatever the amount of prior knowledge, fitting regression equations is not complete until the assumptions have been examined." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Stepwise regression is probably the most abused computerized statistical technique ever devised. If you think you need stepwise regression to solve a particular problem you have, it is almost certain that you do not. Professional statisticians rarely use automated stepwise regression." (Leland Wilkinson, "SYSTAT", 1984)

"Someone has characterized the user of stepwise regression as a person who checks his or her brain at the entrance of the computer center." (Dick R Wittink, "The application of regression analysis", 1988)

"Data analysis is rarely as simple in practice as it appears in books. Like other statistical techniques, regression rests on certain assumptions and may produce unrealistic results if those assumptions are false. Furthermore it is not always obvious how to translate a research question into a regression model." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Exploratory regression methods attempt to reveal unexpected patterns, so they are ideal for a first look at the data. Unlike other regression techniques, they do not require that we specify a particular model beforehand. Thus exploratory techniques warn against mistakenly fitting a linear model when the relation is curved, a waxing curve when the relation is S-shaped, and so forth." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Linear regression assumes that in the population a normal distribution of error values around the predicted Y is associated with each X value, and that the dispersion of the error values for each X value is the same. The assumptions imply normal and similarly dispersed error distributions." (Fred C Pampel, "Linear Regression: A primer", 2000)

"Whereas regression is about attempting to specify the underlying relationship that summarises a set of paired data, correlation is about assessing the strength of that relationship. Where there is a very close match between the scatter of points and the regression line, correlation is said to be 'strong' or 'high' . Where the points are widely scattered, the correlation is said to be 'weak' or 'low'." (Alan Graham, "Developing Thinking in Statistics", 2006)

"Before best estimates are extracted from data sets by way of a regression analysis, the uncertainties of the individual data values must be determined.In this case care must be taken to recognize which uncertainty components are common to all the values, i.e., those that are correlated (systematic)." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"For linear dependences the main information usually lies in the slope. It is obvious that those points that lie far apart have the strongest influence on the slope if all points have the same uncertainty. In this context we speak of the strong leverage of distant points; when determining the parameter 'slope' these distant points carry more effective weight. Naturally, this weight is distinct from the 'statistical' weight usually used in regression analysis." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Regression toward the mean. That is, in any series of random events an extraordinary event is most likely to be followed, due purely to chance, by a more ordinary one." (Leonard Mlodinow, "The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"There are three possible reasons for [the] absence of predictive power. First, it is possible that the models are misspecified. Second, it is possible that the model’s explanatory factors are measured at too high a level of aggregation [...] Third, [...] the search for statistically significant relationships may not be the strategy best suited for evaluating our model’s ability to explain real world events [...] the lack of predictive power is the result of too much emphasis having been placed on finding statistically significant variables, which may be overdetermined. Statistical significance is generally a flawed way to prune variables in regression models [...] Statistically significant variables may actually degrade the predictive accuracy of a model [...] [By using]models that are constructed on the basis of pruning undertaken with the shears of statistical significance, it is quite possible that we are winnowing our models away from predictive accuracy." (Michael D Ward et al, "The perils of policy by p-value: predicting civil conflicts" Journal of Peace Research 47, 2010)

"Regression analysis, like all forms of statistical inference, is designed to offer us insights into the world around us. We seek patterns that will hold true for the larger population. However, our results are valid only for a population that is similar to the sample on which the analysis has been done." (Charles Wheelan, "Naked Statistics: Stripping the Dread from the Data", 2012)

"Multiple regression, like all statistical techniques based on correlation, has a severe limitation due to the fact that correlation doesn't prove causation. And no amount of measuring of 'control' variables can untangle the web of causality. What nature hath joined together, multiple regression cannot put asunder." (Richard Nisbett, "2014 : What scientific idea is ready for retirement?", 2013)

"What nature hath joined together, multiple regression cannot put asunder." (Richard Nisbett, "2014 : What scientific idea is ready for retirement?", 2013)

"A wide variety of statistical procedures (regression, t-tests, ANOVA) require three assumptions: (i) Normal observations or errors. (ii) Independent observations (or independent errors, which is equivalent, in normal linear models to independent observations). (iii) Equal variance - when that is appropriate (for the one-sample t-test, for example, there is nothing being compared, so equal variances do not apply)." (DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"Regression does not describe changes in ability that happen as time passes […]. Regression is caused by performances fluctuating about ability, so that performances far from the mean reflect abilities that are closer to the mean." (Gary Smith, "Standard Deviations", 2014)

"We encounter regression in many contexts - pretty much whenever we see an imperfect measure of what we are trying to measure. Standardized tests are obviously an imperfect measure of ability. [...] Each experimental score is an imperfect measure of “ability,” the benefits from the layout. To the extent there is randomness in this experiment - and there surely is - the prospective benefits from the layout that has the highest score are probably closer to the mean than was the score." (Gary Smith, "Standard Deviations", 2014))

"When a trait, such as academic or athletic ability, is measured imperfectly, the observed differences in performance exaggerate the actual differences in ability. Those who perform the best are probably not as far above average as they seem. Nor are those who perform the worst as far below average as they seem. Their subsequent performances will consequently regress to the mean." (Gary Smith, "Standard Deviations", 2014)

"Working an integral or performing a linear regression is something a computer can do quite effectively. Understanding whether the result makes sense - or deciding whether the method is the right one to use in the first place - requires a guiding human hand. When we teach mathematics we are supposed to be explaining how to be that guide. A math course that fails to do so is essentially training the student to be a very slow, buggy version of Microsoft Excel." (Jordan Ellenberg, "How Not to Be Wrong: The Power of Mathematical Thinking", 2014)

"Regression describes the relationship between an exploratory variable (i.e., independent) and a response variable (i.e., dependent). Exploratory variables are also referred to as predictors and can have a frequency of more than 1. Regression is being used within the realm of predictions and forecasting. Regression determines the change in response variable when one exploratory variable is varied while the other independent variables are kept constant. This is done to understand the relationship that each of those exploratory variables exhibits." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"Any time you run regression analysis on arbitrary real-world observational data, there’s a significant risk that there’s hidden confounding in your dataset and so causal conclusions from such analysis are likely to be (causally) biased." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"Multiple regression provides scientists and analysts with a tool to perform statistical control - a procedure to remove unwanted influence from certain variables in the model." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

"The causal interpretation of linear regression only holds when there are no spurious relationships in your data. This is the case in two scenarios: when you control for a set of all necessary variables (sometimes this set can be empty) or when your data comes from a properly designed randomized experiment." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

More quotes on "Regression" at the-web-of-knowledge.blogspot.com.

01 June 2018

🔬Data Science: Data Model (Definitions)

"simplified and approximative description of a system or process, based on a finite set of essential variables and their analytically definable behavior." (Teuvo Kohonen, "Self-Organizing Maps" 3rd Ed., 2001)

"(i) An abstract, self-contained logical definition of the data structures and associated operators that make up the abstract machine with which users interact (such as the relational model of data). (ii) A model of the persistent data of some enterprise" (Keith Gordon, "Principles of Data Management", 2007)

"A means of encapsulating the data elements that were decided on by the business experts, in conjunction with the data stewards and IT professionals. The data models reflect an organization’s business as represented in the data." (Tony Fisher, "The Data Asset", 2009)

"An abstraction of how individual data elements relate to each other. It visually depicts how the data is to be organized and stored in a database. A data model provides the mechanism to document and understand how data is organized." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"An organization of data that describes the relationships among primitive and composite data elements." (Toby J. Teorey, "Database Modeling and Design" 4th Ed., 2010)

"A model of the structure (and to some extent the content) of a database and at least some of the rules governing the data therein." (Graham Witt, "Writing Effective Business Rules", 2012)

"A data model is a visual representation of data content and the relationships, created for purposes of understanding how data is or might be organized, and for ensuring the comprehensibility and usability of that way of organizing data." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement", 2013)

"Represents data objects and their relationships with each other. Data models form the basis for data integration at the conceptual level as well as the improvement of data quality, such as with regard to the reduction of data redundancy. Data models are one component of the data architecture." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"A template formalizing the relationship between an input and an output. Its structure is fixed but it also has parameters that are modifiable; the parameters are adjusted so that the same model with different parameters can be trained on different data to implement different relationships in different tasks." (Ethem Alpaydın, "Machine learning : the new AI", 2016)

"A visual means of depicting data and its relationship to other data." (Gregory Lampshire et al, "The Data and Analytics Playbook", 2016)

"An abstract representation of a subject that looks and/or behaves like all or part of the original." (George Tillmann, "Usage-Driven Database Design: From Logical Data Modeling through Physical Schmea Definition", 2017)

"In the context of machine learning, a model is a representation of a pattern extracted using machine learning from a data set. Consequently, models are trained, fitted to a data set, or created by running a machine learning algorithm on a data set. Popular model representations include decision trees and neural networks. A prediction model defines a mapping (or function) from a set of input attributes to a value for a target attribute. Once a model has been created, it can then be applied to new instances from the domain. For example, in order to train a spam filter model, we would apply a machine learning algorithm to a data set of historic emails that have been labeled as spam or not spam. Once the model has been trained it can be used to label (or filter) new emails that were not in the original data set." (John D Kelleher & Brendan Tierney, "Data science", 2018)

"An abstract model that describes how data is presented and used." (Piethein Strengholt, "Data Management at Scale", 2020)

"A logical map that represents the inherent properties of the data independent of software, hardware or machine performance considerations. The model shows data elements grouped into records, as well as the association around those records." (Information Management)

"Defines how data is structured, related, and standardized for the purpose of extracting meaningful insight." (Insight Software)

"A data model is an abstract model that organizes elements of data and standardizes how they relate to one another. Their properties are generally governed by properties of the real world entities." (kloudless)

09 May 2018

🔬Data Science: Meta-Analysis (Definitions)

"A set of statistical procedures designed to accumulate experimental and correlational results across independent studies that address related sets of research questions." (Ying-Chieh Liu et al, "Meta-Analysis Research on Virtual Team Performance", 2008)

"A statistical technique in which the outcomes from multiple experimental comparisons are synthesized by evaluating effect sizes. Because the recommendations are based on multiple experiments, practitioners can have greater confidence in the results from an effective meta-analysis." (Ruth C Clark, "Building Expertise: Cognitive Methods for Training and Performance Improvement", 2008)

"Study characteristics can be thought of as the independent variable." (Ernest W Brewer, "Using Meta-Analysis as a Research Tool in Making Educational and Organizational Decisions", 2009)

"The exhaustive search process which comprises numerous and versatile algorithmic procedures to exploit the gene expression results by combining or further processing them with sophisticated statistical learning and data mining techniques coupled with annotated information concerning functional properties of these genes residing in large databases." (Aristotelis Chatziioannou & Panagiotis Moulos, "DNA Microarrays: Analysis and Interpretation", 2009)

"The statistical analysis of a group of relevantly similar experimental studies, in order to summarize their results considered as a whole." (Saul Fisher, "Cost-Effectiveness", 2009)

"A quantitative research review that applies statistical techniques to examine, standardize and combine the results of different empirical studies that investigate a set of related research hypotheses." (Olusola O Adesope & John C Nesbit, "A Systematic Review of Research on Collaborative Learning with Concept Maps", 2010)

"Analysis of a number of comparable studies with the aim to combine those studies in a statistically valid way to test hypotheses (about the effect of an intervention)." (Cor van Dijkum & Laura Vegter, "A Client Perspective on E-Health: Illustrated with an Example from The Netherlands", 2010)

"A computation of average effect sizes among many experiments. Data based on a meta-analysis give us greater confidence in the results because they reflect many research studies." (Ruth C Clark & Richard E Mayer, "e-Learning and the Science of Instruction", 2011)

"Analysis of previously analyzed data relating to the same or similar biological phenomena or treatment studied across the same or similar technology platforms." (Padmalatha S Reddy et al, "Knowledge-Driven, Data-Assisted Integrative Pathway Analytics", 2011)

"A set of techniques for the quantitative analysis of results from two or more studies on the same or similar issues." (Geoff Cumming, "Understanding The New Statistics", 2013)

"A method of combining effect sizes from individual studies into a single composite effect size." (Jonathan van‘t Riet et al, "The Effects of Active Videogames on BMI among Young People: A Meta-Analysis", 2016)

"A procedure that allows the statistical averaging of results from independent studies of the same phenomena. Meta-analysis essentially combines studies on the same topic into a single large study, providing an index of how strongly the independent variable affected the dependent variable on an average in the set of studies." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A research design that combines and synthesize different types of data from multiple sources." (Mzoli Mncanca & Chinedu Okeke, "Early Exposure to Domestic Violence and Implications for Early Childhood Education Services", 2019)

"A quantitative, formal, epidemiological study design used to systematically assess the results of previous research to derive conclusions about that body of research." (Helena H Borba et al, "Challenges in Evidence-Based Practice Education: From Teaching Concepts Towards Decision-Making Learning", 2021)

27 April 2018

🔬Data Science: Validity (Definitions)

"An argument that explains the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of decisions made from an assessment." (Asao B Inoue, "The Technology of Writing Assessment and Racial Validity", 2009)

[external *]: "The extent to which the results obtained can be generalized to other individuals and/or contexts not studied." (Joan Hawthorne et al, "Method Development for Assessing a Diversity Goal", 2009)

[external *:] "A study has external validity when its results are generalizable to the target population of interest. Formally, external validity means that the causal effect based on the study population equals the causal effect in the target population. In counterfactual terms, external validity requires that the study population be exchangeable with the target population." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

[internal *:] "A study has internal validity when it provides an unbiased estimate of the causal effect of interest. Formally, internal validity means that the empirical effect from the study is equal to the causal effect in the study population." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

"Construct validity is a term developed by psychometricians to describe the ability of a variable to represent accurately an underlying characteristic of interest." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

[operational validity:] "is defined as a model result behavior has enough correctness for a model intended aim over the area of system intended applicability." (Sattar J Aboud et al, "Verification and Validation of Simulation Models", 2010)

"Validity is the ability of the study to produce correct results. There are various specific types of validity (see internal validity, external validity, construct validity). Threats to validity include primarily what we have termed bias, but encompass a wider range of methodological problems, including random error and lack of construct validity." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

[internal validity:] "Accuracy of the research study in determining the relationship between independent and the dependent variables. Internal validity can be assured only if all potential confounding variables have been properly controlled." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

[external *:] "Extent to which the results of a study accurately indicate the true nature of a relationship between variables in the real world. If a study has external validity, the results are said to be generalisable to the real world." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"The degree to which inferences made from data are appropriate to the context being examined. A variety of evidence can be used to support interpretation of scores." (Anne H Cash, "A Call for Mixed Methods in Evaluating Teacher Preparation Programs", 2016)

[construct *:] "Validity of a theory is also known as construct validity. Most theories in science present broad conceptual explanations of relationship between variables and make many different predictions about the relationships between particular variables in certain situations. Construct validity is established by verifying the accuracy of each possible prediction that might be made from the theory. Because the number of predictions is usually infinite, construct validity can never be fully established. However, the more independent predictions for the theory verified as accurate, the stronger the construct validity of the theory." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

12 February 2018

🔬Data Science: Correlation (Definitions)

[correlation coefficient:] "A measure to determine how closely a scatterplot of two continuous variables falls on a straight line." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A metric that measures the linear relationship between two process variables. Correlation describes the X and Y relationship with a single number (the Pearson’s Correlation Coefficient (r)), whereas regression summarizes the relationship with a line - the regression line." (Lynne Hambleton, "Treasure Chest of Six Sigma Growth Methods, Tools, and Best Practices", 2007)

[correlation coefficient:] "A measure of the degree of correlation between the two variables. The range of values it takes is between −1 and +1. A negative value of r indicates an inverse relationship. A positive value of r indicates a direct relationship. A zero value of r indicates that the two variables are independent of each other. The closer r is to +1 and −1, the stronger the relationship between the two variables." (Jae K Shim & Joel G Siegel, "Budgeting Basics and Beyond", 2008)

"The degree of relationship between business and economic variables such as cost and volume. Correlation analysis evaluates cause/effect relationships. It looks consistently at how the value of one variable changes when the value of the other is changed. A prediction can be made based on the relationship uncovered. An example is the effect of advertising on sales. A degree of correlation is measured statistically by the coefficient of determination (R-squared)." (Jae K Shim & Joel G Siegel, "Budgeting Basics and Beyond", 2008)

"A figure quantifying the correlation between risk events. This number is between negative one and positive one." (Annetta Cortez & Bob Yehling, "The Complete Idiot's Guide® To Risk Management", 2010)

"A mechanism used to associate messages with the correct workflow service instance. Correlation is also used to associate multiple messaging activities with each other within a workflow." (Bruce Bukovics, "Pro WF: Windows Workflow in .NET 4", 2010)

"Correlation is sometimes used informally to mean a statistical association between two variables, or perhaps the strength of such an association. Technically, the correlation can be interpreted as the degree to which a linear relationship between the variables exists (i.e., each variable is a linear function of the other) as measured by the correlation coefficient." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

"The degree of relationship between two variables; in risk management, specifically the degree of relationship between potential risks." (Annetta Cortez & Bob Yehling, "The Complete Idiot's Guide® To Risk Management", 2010)

"A predictive relationship between two factors, such that when one factor changes, you can predict the nature, direction and/or amount of change in the other factor. Not necessarily a cause-and-effect relationship." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Organizing and recognizing one related event threat out of several reported, but previously distinct, events." (Mark Rhodes-Ousley, "Information Security: The Complete Reference" 2nd Ed., 2013)

"Association in the values of two or more variables." (Meta S Brown, "Data Mining For Dummies", 2014)

[correlation coefficient:] "A statistic that quantifies the degree of association between two or more variables. There are many kinds of correlation coefficients, depending on the type of data and relationship predicted." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"The degree of association between two or more variables." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A statistical measure that indicates the extent to which two variables are related. A positive correlation indicates that, as one variable increases, the other increases as well. For a negative correlation, as one variable increases, the other decreases." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

07 February 2014

🕸Systems Engineering: Entropy (Definitions)

"The Entropy of a system is the mechanical work it can perform without communication of heat, or alteration of its total volume, all transference of heat being performed by reversible engines." (James C Maxwell, "Theory of Heat", 1899)

"Entropy is the measure of randomness." (Lincoln Barnett, "The Universe and Dr. Einstein", 1948)

"Entropy is a measure of the heat energy in a substance that has been lost and is no longer available for work. It is a measure of the deterioration of a system." (William B. Sill & Norman Hoss (Eds.), "Popular Science Encyclopedia of the Sciences", 1963)

"Entropy [...] is the amount of disorder or randomness present in any system." (Lars Skyttner, "General Systems Theory: Ideas and Applications", 2001)

"A measurement of the disorder of a data set." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"[...] entropy is the amount of hidden microscopic information." (Leonard Susskind, "The Black Hole War", 2008)

"A measure of the uncertainty associated with a random variable. Entropy quantifies information in a piece of data." (Radu Mutihac, "Bayesian Neural Networks for Image Restoration" [in "Encyclopedia of Artificial Intelligence", 2009)

"Measurement that can be used in machine learning on a set of data that is to be classified. In this setting it can be defined as the amount of uncertainty or randomness (or noise) in the data. If all data is classified with the same class, the entropy of that set would be 0." (Isak Taksa et al, "Machine Learning Approach to Search Query Classification", 2009)

"A measure of uncertainty associated with the predictable value of information content. The highest information entropy is when the ambiguity or uncertainty of the outcome is the greatest." (Alex Berson & Lawrence Dubov, "Master Data Management and Data Governance", 2010)

"Refers to the inherent unknowability of data to external observers. If a bit is just as likely to be a 1 as a 0 and a user does not know which it is, then the bit contains 1 bit of entropy." (Mark S Merkow & Lakshmikanth Raghavan, "Secure and Resilient Software Development", 2010)

"The measurement of uncertainty in an outcome, or randomness in a system." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A metric used to evaluate and describe the amount of randomness associated with a random variable."(Wenbing Zhao, "Increasing the Trustworthiness of Online Gaming Applications", 2015)

"Anti-entropy is the process of detecting differences in replicas. From a performance perspective, it is important to detect and resolve inconsistencies with a minimum amount of data exchange." (Dan Sullivan, "NoSQL for Mere Mortals®", 2015)

"Average amount of information contained in a sample drawn from a distribution or data stream. Measure of uncertainty of the source of information." (Anwesha Sengupta et al, "Alertness Monitoring System for Vehicle Drivers using Physiological Signals", 2016)

"In information theory this notion, introduced by Claude Shannon, is used to express unpredictability of information content. For instance, if a data set containing n items was divided into k groups each comprising n i items, then the entropy of such a partition is H = p 1 log( p 1 ) + … + p k log( p k ), where p i = n i / n . In case of two alternative partitions, the mutual information is a measure of the mutual dependence between these partitions." (Slawomir T Wierzchon, "Ensemble Clustering Data Mining and Databases", 2018) [where i is used as index]

"Entropy is a measure of amount of uncertainty or disorder present in the system within the possible probability distribution." ("G Suseela & Y Asnath V Phamila, "Security Framework for Smart Visual Sensor Networks", 2019)

"Lack of order or predictability; gradual decline into disorder." (Adrian Carballal et al, "Approach to Minimize Bias on Aesthetic Image Datasets", 2019)

"It is the quantity which is used to describe the amount of information which must be coded for compression algorithm." (Arockia Sukanya & Kamalanand Krishnamurthy, "Thresholding Techniques for Dental Radiographic Images: A Comparative Study", 2019)

"In the physics - rate of system´s messiness or disorder in a physical system. In the social systems theory - social entropy is a sociological theory that evaluates social behaviors using a method based on the second law of thermodynamics." (Justína Mikulášková et al, "Spiral Management: New Concept of the Social Systems Management", 2020)

25 December 2011

📉Graphical Representation: Variables (Just the Quotes)

"Two dimensional charts for the representation of mathematical equations or experimental data are in very common use nowadays and are everywhere recognized as valuable devices for giving a clear conception of the manner in which the variables are related. Their application is generally restricted, however, to cases where there is but one variable and its function, if the variation to be shown is continuous. Nevertheless cases often arise in which there are two variables and a function to be represented and where it is desirable to show a continuousvariation for all three." (John B Peddle, "The Construction of Graphical Charts", 1910)

"It should be a strict rule for all kinds of curve plotting that the horizontal scale must be used for the independent variable and the vertical scale for the dependent variable. When the curves are plotted by this rule the reader can instantly select a set of conditions from the horizontal scale and read the information from the vertical scale. If there were no rule relating to the arrangement of scales for the independent and dependent variables, the reader would never be able to tell whether he should approach a chart from the vertical scale and read the information from the horizontal scale, or the reverse." (Willard C Brinton, "Graphic Methods for Presenting Facts", 1919)

"Besides being easier to construct than a bar chart, the line chart possesses other advantages. It is easier to read, for while the bars stand out more prominently than the line, they tend to become confusing if numerous, and especially so when they record alternate increase and decrease. It is easier for the eye to follow a line across the face of the chart than to jump from bar top to bar top, and the slope of the line connecting two points is a great aid in detecting minor changes. The line is also more suggestive of movement than arc bars, and movement is the very essence of a time series. Again, a line chart permits showing two or more related variables on the same chart, or the same variable over two or more corresponding periods." (Walter E Weld, "How to Chart; Facts from Figures with Graphs", 1959)

"In certain respects, line graphs are uniquely applicable to particular graphic requirements for which a bar or circle chart could not be substituted. Strictly speaking, the line graph must be used to portray changes in a continuous variable, since technically such a variable must be represented by a line and not by 'points' or 'bars'. Line graphs are often uniquely applicable to problems of analysis, particularly when it is essential to visualize a trend, observe the behavior of a set of variables through time, or portray the same variable in differing time periods." (Cecil H Meyers, "Handbook of Basic Graphs: A modern approach", 1970)

"Almost all efforts at data analysis seek, at some point, to generalize the results and extend the reach of the conclusions beyond a particular set of data. The inferential leap may be from past experiences to future ones, from a sample of a population to the whole population, or from a narrow range of a variable to a wider range. The real difficulty is in deciding when the extrapolation beyond the range of the variables is warranted and when it is merely naive. As usual, it is largely a matter of substantive judgment - or, as it is sometimes more delicately put, a matter of 'a priori nonstatistical considerations'." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Remember, the primary function of a graph of any kind is to illustrate the relationship between two variables. [...] To draw any graph we must have established some relationship between the two variables. This relationship can be in the form of a formula" (equation is the more mathematical term), as we have just seen, or simply a set of observations, as is common in all types of statistical work. Sometimes we develop set of observations and then try to find an equation that expresses, in mathematical language, the relationship between the two variables." (Peter H Selby, "Interpreting Graphs and Tables", 1976)

"Missing data values pose a particularly sticky problem for symbols. For instance, if the ray corresponding to a missing value is simply left off of a star symbol, the result will be almost indistinguishable from a minimum (i.e., an extreme) value. It may be better either (i) to impute a value, perhaps a median for that variable, or a fitted value from some regression on other variables, (ii) to indicate that the value is missing, possibly with a dashed line, or (iii) not to draw the symbol for a particular observation if any value is missing." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"Relational graphics are essential to competent statistical analysis since they confront statements about cause and effect with evidence, showing how one variable affects another." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"When visualization tools act as a catalyst to early visual thinking about a relatively unexplored problem, neither the semantics nor the pragmatics of map signs is a dominant factor. On the other hand, syntactics" (or how the sign-vehicles, through variation in the visual variables used to construct them, relate logically to one another) are of critical importance." (Alan M MacEachren, "How Maps Work: Representation, Visualization, and Design", 1995)

"The rule is that a graph of a change in a variable with time should always have a vertical scale that starts with zero. Otherwise, it is inherently misleading." (Douglas A Downing & Jeffrey Clark, "Forgotten Statistics: A Self-Teaching Refresher Course", 1996)

"Stacked bar graphs do not show data structure well. A trend in one of the stacked variables has to be deduced by scanning along the vertical bars. This becomes especially difficult when the categories do not move in the same direction." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"By showing recent change in relation to many past changes, sparklines provide a context for nuanced analysis - and, one hopes, better decisions. [...] Sparklines efficiently display and narrate binary data" (presence/absence, occurrence/non-occurrence, win/loss). [...] Sparklines can simultaneously accommodate several variables. [...] Sparklines can narrate on-going results detail for any process producing sequential binary outcomes." (Edward R Tufte, "Beautiful Evidence", 2006)

"Show multivariate data; that is, show more than 1 or 2 variables." (Edward R Tufte, "Beautiful Evidence", 2006)

"Heatmaps are two-dimensional graphical representations of data where the values of a variable are shown as colors. Heatmaps are compelling for two reasons. First, the intuitive nature of the color scale as it relates to temperature minimizes the amount of learning necessary to understand it. From experience, we know that yellow is warmer than green, orange is warmer than yellow, and red is hot. It is not difficult to then figure out that the amount of heat is proportional to the level of the represented variable. Second, heatmaps show the data directly over the stimulus. Because the data could not be any closer to the elements to which they pertain, little mental effort is required to read a heatmap." (Agnieszka Bojkon, "Informative or Misleading? Heatmaps Deconstructed", [in "Human-Computer Interaction: New Trends, 13th International Conference"] 2009)

"Whereas charts generally focus on a trend or comparison, tables organize data for the reader to scan. Tables present data in an easy-read-format, or matrix. Tables arrange data in columns or rows so readers can make side-by-side comparisons. Tables work for many situations because they convey large amounts of data and have several variables for each item. Tables allow the reader to focus quickly on a specific item by scanning the matrix or to compare multiple items by scanning the rows or columns." (Dennis K Lieu & Sheryl Sorby, "Visualization, Modeling, and Graphics for Engineering Design", 2009)

"Using colour, itʼs possible to increase the density of information even further. A single colour can be used to represent two variables simultaneously. The difficulty, however, is that there is a limited amount of information that can be packed into colour without confusion." (Brian Suda, "A Practical Guide to Designing with Data", 2010)

"Pie charts can be used effectively to summarize a single categorical data set if there are not too many different categories. However, pie charts are not usually the best tool if the goal is to compare groups on the basis of a categorical variable." (Roxy Peck et al, "Introduction to Statistics and Data Analysis" 4th Ed., 2012)

"The process of visual analysis can potentially go on endlessly, with seemingly infinite combinations of variables to explore, especially with the rich opportunities bigger data sets give us. However, by deploying a disciplined and sensible balance between deductive and inductive enquiry you should be able to efficiently and effectively navigate towards the source of the most compelling stories." (Andy Kirk, "Data Visualization: A successful design process", 2012)

"Visual metaphors are about integrating a certain visual quality in your work that somehow conveys that extra bit of connection between the data, the design, and the topic. It goes beyond just the choice of visual variable, though this will have a strong influence. Deploying the best visual metaphor is something that really requires a strong design instinct and a certain amount of experience." (Andy Kirk, "Data Visualization: A successful design process", 2012)

"Early exploration of a dataset can be overwhelming, because you don’t know where to start. Ask questions about the data and let your curiosities guide you. […] Make multiple charts, compare all your variables, and see if there are interesting bits that are worth a closer look. Look at your data as a whole and then zoom in on categories and individual data points. […] Subcategories, the categories within categories" (within categories), are often more revealing than the main categories. As you drill down, there can be higher variability and more interesting things to see." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"The relevance to data visualization is that we are always conveying a message to some extent, and in the case of associations between variables, that message is sometimes a step removed from the data itself. If you are making visualizations, be careful not to impose your own interpretation too much when showing associations. If you are reading them, don’t assume that the message accompanying the data is as sound and scientifically based as the data themselves." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"There is a fundamental difference between circular charts and bar charts. The brain is sensitive to angular change and (by comparison) quite numb to linear change. This is particularly true when considering motion, and sensitivity to small changes. If in your narrative, highlighting minute changes in a variable is important for the story, then circular pie charts (speed needle gauges) are the way to go. If on the contrary, too much attention to change is a distraction, avoid pie charts and needles."(Jose Berengueres & Marybeth Sandell, "Introduction to Data Visualization & Storytelling: A Guide For The Data Scientist" 2nd. Ed., 2019)

"With each variable being added to the visual mapping, the richness of the visual representation is increased. Theoretically, we could add yet another visual variable, for example, by texturing the shapes. However, from a practical point of view, there are limits. While a rich visual mapping opens up the possibility to make a wider range of analytic discoveries, the downside is that the mental effort required to digest the visual representation increases as well. Therefore, it is really important to balance the visual mapping according to the task and the data." (Christian Tominski & Heidrun Schumann, "Interactive Visual Data Analysis", 2019)

SQL Troubles

Pages