SQL Troubles

13 May 2006

🖋️George Siemens - Collected Quotes

"An ecology provides the special formations needed by organizations. Ecologies are: loose, free, dynamic, adaptable, messy, and chaotic. Innovation does not arise through hierarchies. As a function of creativity, innovation requires trust, openness, and a spirit of experimentation - where random ideas and thoughts can collide for re-creation." (George Siemens, "Knowing Knowledge", 2006)

"Change pressures arise from different sectors of a system. At times it is mandated from the top of a hierarchy, other times it forms from participants at a grass-roots level. Some changes are absorbed by the organization without significant impact on, or alterations of, existing methods. In other cases, change takes root. It causes the formation of new methods (how things are done and what is possible) within the organization." (George Siemens, "Knowing Knowledge", 2006)

"Complexity and diversity results in specialized nodes (a single entity can no longer know all required elements). The act of knowledge growth and learning involves connected specialized nodes." (George Siemens, "Knowing Knowledge", 2006)

"Connections create structures. Structures do not create (though they may facilitate) connections. Our approaches today reflect this error in thinking. We have tried to do the wrong thing first with knowledge. We determine that we will have a certification before we determine what it is that we want to certify. We need to enable the growth of connections and observe the structures that emerge." (George Siemens, "Knowing Knowledge", 2006)

"Context is not as simple as being in a different space [...] context includes elements like our emotions, recent experiences, beliefs, and the surrounding environment - each element possesses attributes, that when considered in a certain light, informs what is possible in the discussion." (George Siemens, "Knowing Knowledge", 2006)

"Knowledge flow can be likened to a river that meanders through the ecology of an organization. In certain areas, the river pools and in other areas it ebbs. The health of the learning ecology of the organization depends on effective nurturing of flow." (George Siemens, "Knowing Knowledge", 2006)

"Learning is a multi-faceted, integrated process where changes with any one element alters the larger network. Knowledge is subject to the nuances of complex, adaptive systems." (George Siemens, "Knowing Knowledge", 2006)

"Hierarchy adapts knowledge to the organization; a network adapts the organization to the knowledge." (George Siemens, "Knowing Knowledge", 2006)

"Learning is the process of creating networks. Nodes are external entities which we can use to form a network. Or nodes may be people, organizations, libraries, web sites, books, journals, database, or any other source of information. The act of learning (things become a bit tricky here) is one of creating an external network of nodes - where we connect and form information and knowledge sources. The learning that happens in our heads is an internal network (neural). Learning networks can then be perceived as structures that we create in order to stay current and continually acquire, experience, create, and connect new knowledge (external). And learning networks can be perceived as structures that exist within our minds (internal) in connecting and creating patterns of understanding." (George Siemens, "Knowing Knowledge", 2006)

"Nodes and connectors comprise the structure of a network. In contrast, an ecology is a living organism. It influences the formation of the network itself." (George Siemens, "Knowing Knowledge", 2006)

"Our pre-conceived structures of interpreting knowledge sometimes interfere with new knowledge." (George Siemens, "Knowing Knowledge", 2006)

"When we focus on designing ecologies in which people can forage for knowledge, we are less concerned about communicating the minutiae of changing knowledge. Instead, we are creating the conduit through which knowledge will flow." (George Siemens, "Knowing Knowledge", 2006)

06 May 2006

🎯William Smith - Collected Quotes

"Achieving a gold standard for data quality at ingestion involves a multifaceted approach: defining explicit schemas and contracts, implementing rigorous input validation reflecting domain semantics, supporting immediate rejection or secure quarantine of low-quality data, and embedding these capabilities into high-throughput, low-latency pipelines. This first line of defense not only prevents downstream data pollution but also establishes an enterprise-wide culture and infrastructure aimed at preserving data trust from the point of entry onward." (William Smith, "Great Expectations for Modern Data Quality: The Complete Guide for Developers and Engineers", 2025)

"Accuracy denotes the degree to which data correctly represents the real-world entities or events to which it refers." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"At its core, data quality encompasses multiple dimensions-including accuracy, completeness, consistency, timeliness, validity, uniqueness, and relevance-that require rigorous assessment and control. The progression from traditional data management practices to cloud-native, real-time, and federated ecosystems introduces both challenges challenges and opportunities for embedding quality assurance seamlessly across the entire data value chain." (William Smith, "Great Expectations for Modern Data Quality: The Complete Guide for Developers and Engineers", 2025)

"At its core, observability rests on three fundamental pillars: metrics, logs, and traces. In the context of data systems, these pillars translate into quantitative measurements (such as data volume, processing latency, and schema changes), detailed event records (including data pipeline execution logs and error messages), and lineage traces that map the flow of data through interconnected processes. Together, they enable a granular and multidimensional understanding of data system behavior, facilitating not just detection but also rapid root-cause analysis." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"Completeness refers to the extent to which required data attributes or records are present in a dataset." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"Consistency signifies the absence of conflicting data within or across sources. As data ecosystems become distributed and federated, ensuring consistency transcends simple referential integrity checks."(William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"Data drift refers to shifts in the statistical properties or distributions of incoming data compared to those observed during training or baseline establishment. Common variants include covariate drift (changes in feature distributions), prior probability drift (changes in class or label proportions), and concept drift (changes in the relationship between features and targets)." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"Data governance establishes the overarching policies, standards, and strategic directives that define how data assets are to be managed across the enterprise. This top-level framework sets the boundaries of authority, compliance requirements, and key performance indicators for data quality." (William Smith, "Great Expectations for Modern Data Quality: The Complete Guide for Developers and Engineers", 2025)

"Data Lakes embrace a schema-on-read approach, storing vast volumes of raw or lightly processed data in native formats with minimal upfront constraints. This design significantly enhances ingestion velocity and accommodates diverse, unstructured, or semi-structured datasets. However, enforcing data quality at scale becomes more complex, as traditional static constraints are absent." (William Smith, "Great Expectations for Modern Data Quality: The Complete Guide for Developers and Engineers", 2025)

"Data mesh fundamentally reframes data governance and validation by distributing accountability to domain-oriented teams who act as custodians and producers of their respective data products. These teams possess intimate domain knowledge, which is essential for nuanced validation criteria that adapt to the semantics, context, and evolution of their datasets. By treating datasets as first-class products with clear ownership, interfaces, and service-level objectives, data mesh encourages autonomous validation workflows embedded directly within the domains where data originates and is consumed." (William Smith, "Great Expectations for Modern Data Quality: The Complete Guide for Developers and Engineers", 2025)

"Data quality insights generated through automated profiling and baseline analysis are only as valuable as their visibility and actionability within the broader organizational decision-making context." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"Data quality verification, when executed as a set of static, invariant rules, often fails to accommodate the inherent fluidity of real-world datasets and evolving analytical contexts. To ensure robustness and relevance, quality checks must evolve beyond static constraints, incorporating adaptability driven by metadata, runtime information, and domain-specific business logic. This transformation enables the development of dynamic and context-aware validation systems capable of offering intelligent, self-tuning quality enforcement with reduced false positives and operational noise." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"Effective management of data quality at scale requires a clear delineation of organizational roles and operational frameworks that ensure accountability, consistency, and continuous improvement. Central to this structure are the interrelated concepts of data governance, data stewardship, and operational ownership. Each serves distinct, yet complementary purposes in embedding responsibility within technology platforms, business processes, and organizational culture." (William Smith, "Great Expectations for Modern Data Quality: The Complete Guide for Developers and Engineers", 2025)

"Establishing a comprehensive observability architecture necessitates a systematic approach that spans the entirety of the data pipeline, from initial telemetry collection to actionable insights accessible by diverse stakeholders. The core objective is to unify distributed data sources - metrics, logs, traces, and quality signals - into a coherent framework that enables rapid diagnosis, continuous monitoring, and strategic decision-making." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"Governance sets the strategic framework, stewardship bridges strategy with execution, and operational ownership grounds responsibility within systems and processes. Advanced organizations achieve sustainable data quality by establishing clear roles, defined escalation channels, embedded tooling, standardized processes, and a culture that prioritizes data excellence as a collective, enforceable mandate." (William Smith, "Great Expectations for Modern Data Quality: The Complete Guide for Developers and Engineers", 2025)

"Modern complex organizations increasingly confront the challenge of ensuring data quality at scale without centralizing validation activities into a single bottlenecked team. The data mesh paradigm and federated controls emerge as pivotal architectural styles and organizational patterns that enable decentralized, self-serve data quality validation while preserving coherence and reliability across diverse data products." (William Smith, "Great Expectations for Modern Data Quality: The Complete Guide for Developers and Engineers", 2025)

"Observability [...] requires that systems be instrumented to expose rich telemetry, enabling ad hoc exploration and hypothesis testing regarding system health. Thus, observability demands design considerations at the architecture level, insisting on standardization of instrumentation, consistent metadata management, and tight integration across data processing, storage, and orchestration layers." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"Quality gates embody a comprehensive strategy for continuous data assurance by enforcing hierarchical checks, asserting dynamic SLAs, and automating compliance decisions grounded in explicit policies. Their architecture and operationalization directly address the complex interplay between technical robustness and regulatory compliance, ensuring that only trusted data permeates downstream systems." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"Robust access control forms the cornerstone of observability system security. At the core lies the principle of least privilege, wherein users and service identities are granted the minimal set of permissions required to perform their designated tasks. This principle substantially reduces the attack surface by minimizing unnecessary access and potential lateral movement paths within the system. Implementing least privilege necessitates fine-grained role-based access control (RBAC) models tailored to organizational roles and operational workflows. RBAC configurations should be explicit regarding the scopes and data domains accessible to each role, avoiding overly broad privileges." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"Relevance gauges the appropriateness of data for the given analytical or business context. Irrelevant data, though possibly accurate and complete, can introduce noise and degrade model performance or decision quality." (William Smith, "Great Expectations for Modern Data Quality: The Complete Guide for Developers and Engineers", 2025)

"Robust methodologies to measure and prioritize data quality dimensions involve composite metrics and scoring systems that combine quantitative indicators-such as error rates, completeness percentages, latency distributions-with qualitative assessments from domain experts." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"The architecture of a robust data quality framework hinges fundamentally on three interconnected pillars: open standards, extensible application programming interfaces (APIs), and interoperable protocols. These pillars collectively enable the seamless exchange, validation, and enhancement of data across diverse platforms and organizational boundaries." (William Smith, "Great Expectations for Modern Data Quality: The Complete Guide for Developers and Engineers", 2025)

"The data swamp anti-pattern arises from indiscriminate ingestion of uncurated data, which rapidly dilutes data warehouse utility and complicates quality monitoring." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"The selection of KPIs should be driven by a rigorous alignment with business objectives and user requirements. This mandates close collaboration with stakeholders spanning data scientists, operations teams, compliance officers, and executive sponsors." " (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"Timeliness captures the degree to which data is available when needed and reflects the relevant time frame of the underlying phenomena." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"Uniqueness ensures that each entity or event is captured once and only once, preventing duplication that can distort analysis and decision-making." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025

"Validity reflects whether data conforms to the syntactic and semantic rules predefined for its domain." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

04 May 2006

Programming: Array (Definitions)

"A group of cells arranged by dimensions. A table is a two-dimensional array in which the cells are arranged in rows and columns, with one dimension forming the rows and the other dimension forming the columns. A cube is a three-dimensional array and can be visualized as a cube, with each dimension of the array forming one edge of the cube." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"A collection of objects all of the same type." (Jesse Liberty, "Sams Teach Yourself C++ in 24 Hours 3rd Ed.", 2001)

"A list of variables that have the same name and data type." (Greg Perry, "Sams Teach Yourself Beginning Programming in 24 Hours" 2nd Ed., 2001)

"Values whose members, called elements, are accessed by an index rather than by name. An array has a rank that specifies the number of indices needed to locate an element (sometimes called the number of dimensions) within the array. It may have either zero or nonzero lower bounds in each dimension." (Damien Watkins et al, "Programming in the .NET Environment", 2002)

"A collection of data items, all of the same type, in which each item is uniquely addressed by a 32-bit integer index. Java arrays behave like objects but have some special syntax. Java arrays begin with the index value 0." (Marcus Green & Bill Brogden, "Java 2™ Programmer Exam Cram™ 2 (Exam CX-310-035)", 2003)

"A device that aggregates large collections of hard drives into a logical whole." (Tom Petrocelli, "Data Protection and Information Lifecycle Management", 2005)

"An arithmetically derived matrix or table of rows and columns that is used to impose an order for efficient experimentation. The rows contain the individual experiments. The columns contain the experimental factors and their individual levels or set points." (Clyde M Creveling, "Six Sigma for Technical Processes: An Overview for R Executives, Technical Leaders, and Engineering Managers", 2006)

"A data structure containing an ordered list of elements - any Ruby object - starting with an index of 0. Compare hash." (Michael Fitzgerald, "Learning Ruby", 2007)

"An arithmetically derived matrix or table of rows and columns that is used to impose an order for efficient experimentation. The rows contain the individual experiments. The columns contain the experimental factors and their individual levels or set points." (Lynne Hambleton, "Treasure Chest of Six Sigma Growth Methods, Tools, and Best Practices", 2007)

"In a SQL database, an ordered collection of elements of the same data type stored in a single column and row of a table." (Jan L Harrington, "SQL Clearly Explained 3rd Ed. ", 2010)

"A group of values stored together in a single variable and accessed by index." (Rod Stephens, "Stephens' Visual Basic® Programming 24-Hour Trainer", 2011)

"A grouping of similar items of the same storage type in a sequential pattern, and referenced by a sequential index value." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A variable that holds a series of values with the same data type. An index into the array lets the program select a particular value." (Rod Stephens, "Start Here!™ Fundamentals of Microsoft® .NET Programming", 2011)

"A basic collection of values that is a sequence represented by a single block of memory. Arrays have efficient direct access, but do not easily grow or shrink." (Mark C Lewis, "Introduction to the Art of Programming Using Scala", 2012)

"An ordered sequence of values, stored such that you can easily access any of the values using an integer subscript that specifies the value’s offset in the sequence." (Jon Orwant et al, "Programming Perl" 4th Ed., 2012)

"A group of variables stored under a single name." (Matt Telles, "Beginning Programming", 2014)

"A structure composed of multiple identical variables that can be individually addressed." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"A structure that contains an ordered collection of elements of the same data type in which each element can be referenced by its index value or ordinal position in the collection." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

30 April 2006

🖍️Ronald A Fisher - Collected Quotes

"It may often happen that an inefficient statistic is accurate enough to answer the particular questions at issue. There is however, one limitation to the legitimate use of inefficient statistics which should be noted in advance. If we are to make accurate tests of goodness of fit, the methods of fitting employed must not introduce errors of fitting comparable to the errors of random sampling; when this requirement is investigated, it appears that when tests of goodness of fit are required, the statistics employed in fitting must be not only consistent, but must be of 100 percent efficiency. This is a very serious limitation to the use of inefficient statistics, since in the examination of any body of data it is desirable to be able at any time to test the validity of one or more of the provisional assumptions which have been made." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"No human mind is capable of grasping in its entirety the meaning of any considerable quantity of numerical data." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"Statistics may be regarded as (i) the study of populations, (ii) as the study of variation, and (iii) as the study of methods of the reduction of data." (Sir Ronald A Fisher, "Statistical Methods for Research Worker", 1925)

"The conception of statistics as the study of variation is the natural outcome of viewing the subject as the study of populations; for a population of individuals in all respects identical is completely described by a description of anyone individual, together with the number in the group. The populations which are the object of statistical study always display variations in one or more respects. To speak of statistics as the study of variation also serves to emphasise the contrast between the aims of modern statisticians and those of their predecessors." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"The preliminary examination of most data is facilitated by the use of diagrams. Diagrams prove nothing, but bring outstanding features readily to the eye; they are therefore no substitutes for such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in explaining the conclusions founded upon them." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"The problems which arise in the reduction of data may thus conveniently be divided into three types: (i) Problems of Specification, which arise in the choice of the mathematical form of the population. (ii) When a specification has been obtained, problems of Estimation arise. These involve the choice among the methods of calculating, from our sample, statistics fit to estimate the unknow n parameters of the population. (iii) Problems of Distribution include the mathematical deduction of the exact nature of the distributions in random samples of our estimates of the parameters, and of other statistics designed to test the validity of our specification (tests of Goodness of Fit)." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"The statistical examination of a body of data is thus logically similar to the general alternation of inductive and deductive methods throughout the sciences. A hypothesis is conceived and defined with all necessary exactitude; its logical consequences are ascertained by a deductive argument; these consequences are compared with the available observations; if these are completely in, accord with the deductions, the hypothesis is justified at least until fresh and more stringent observations are available." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"In expositions of the scientific use of experimentation it is frequent to find an excessive stress laid on the importance of varying the essential conditions one at a time [...] in the state of knowledge or ignorance in which genuine research, intended to advance knowledge, has to be carried on, this simple formula is not very helpful. We are usually ignorant which, out of innumerable possible factors, may prove ultimately to be the most important, though we may have strong presuppositions that some few of them are particularly worthy of study." (Sir Ronald A Fisher, "The Design of Experiments", 1935)

"In relation to any experiment we may speak of this hypothesis as the 'null hypothesis', and it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis." (Sir Ronald A Fisher, "The Design of Experiments", 1935)

"Inductive inference is the only process known to us by which essential new knowledge comes into the world." (Sir Ronald A Fisher, "The Design of Experiments", 1935)

"[…] no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the ‘one chance in a million’ will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us." (Sir Ronald A Fisher, "The Design of Experiments", 1935)

"Statistical procedure and experimental design are only two different aspects of the same whole, and that whole is the logical requirements of the complete process of adding to natural knowledge by experimentation." (Sir Ronald A Fisher, "The Design of Experiments", 1935)

"The statistician cannot excuse himself from the duty of getting his head clear on the principles of scientific inference, but equally no other thinking man can avoid a like obligation." (Sir Ronald A Fisher, "The Design of Experiments", 1935)

"To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." (Sir Ronald A Fisher, [presidential address] 1938)

"The effects of chance are the most accurately calculable, and therefore the least doubtful of all the factors of an evolutionary situation." (Sir Ronald A Fisher, "Croonian Lecture: Population Genetics", Proceedings of the Royal Society of London Vol. 141, 1955)

"The precise specification of our knowledge is, however, the same as the precise specification of our ignorance." (Sir Ronald A Fisher, "Statistical Methods and Scientific Inference", 1959)

29 April 2006

🖍️Randall E Schumacker - Collected Quotes

"Given the important role that correlation plays in structural equation modeling, we need to understand the factors that affect establishing relationships among multivariable data points. The key factors are the level of measurement, restriction of range in data values (variability, skewness, kurtosis), missing data, nonlinearity, outliers, correction for attenuation, and issues related to sampling variation, confidence intervals, effect size, significance, sample size, and power." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"Need to consider outliers as they can affect statistics such as means, standard deviations, and correlations. They can either be explained, deleted, or accommodated (using either robust statistics or obtaining additional data to fill-in). Can be detected by methods such as box plots, scatterplots, histograms or frequency distributions." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"Outliers or influential data points can be defined as data values that are extreme or atypical on either the independent (X variables) or dependent (Y variables) variables or both. Outliers can occur as a result of observation errors, data entry errors, instrument errors based on layout or instructions, or actual extreme values from self-report data. Because outliers affect the mean, the standard deviation, and correlation coefficient values, they must be explained, deleted, or accommodated by using robust statistics." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"Structural equation modeling is a correlation research method; therefore, the measurement scale, restriction of range in the data values, missing data, outliers, nonlinearity, and nonnormality of data affect the variance–covariance among variables and thus can impact the SEM analysis." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"Structural equation modeling (SEM) uses various types of models to depict relationships among observed variables, with the same basic goal of providing a quantitative test of a theoretical model hypothesized by the researcher. More specifically, various theoretical models can be tested in SEM that hypothesize how sets of variables define constructs and how these constructs are related to each other." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"There are several key issues in the field of statistics that impact our analyses once data have been imported into a software program. These data issues are commonly referred to as the measurement scale of variables, restriction in the range of data, missing data values, outliers, linearity, and nonnormality." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

🖍️John W Tukey - Collected Quotes

"[We] need men who can practice science - not a particular science - in a word, we need scientific generalists." (John W Tukey, "The Education of a Scientific Generalist", 1949)

"[...] the whole of modern statistics, philosophy and methods alike, is based on the principle of interpreting what did happen in terms of what might have happened." (John W Tukey, "Standard Methods of Analyzing Data, 1951)

"Just remember that not all statistics has been mathematized - and that we may not have to wait for its mathematization in order to use it." (John W Tukey, "The Growth of Experimental Design in a Research Laboratory, 1953)

"Difficulties in identifying problems have delayed statistics far more than difficulties in solving problems." (John W Tukey, Unsolved Problems of Experimental Statistics, 1954)

"Predictions, prophecies, and perhaps even guidance - those who suggested this title to me must have hoped for such-even though occasional indulgences in such actions by statisticians has undoubtedly contributed to the characterization of a statistician as a man who draws straight lines from insufficient data to foregone conclusions!" (John W Tukey, "Where do We Go From Here?", Journal of the American Statistical Association, Vol. 55 (289), 1960)

"Today one of statistics' great needs is a body of able investigators who make it clear to the intellectual world that they are scientific statisticians. and they are proud of that fact that to them mathematics is incidental, though perhaps indispensable." (John W Tukey, "Statistical and Quantitative Methodology, 1961)

"If data analysis is to be well done, much of it must be a matter of judgment, and ‘theory’ whether statistical or non-statistical, will have to guide, not command." (John W Tukey, "The Future of Data Analysis", Annals of Mathematical Statistics, Vol. 33 (1), 1962)

"The most important maxim for data analysis to heed, and one which many statisticians seem to have shunned is this: ‘Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.’ Data analysis must progress by approximate answers, at best, since its knowledge of what the problem really is will at best be approximate." (John W Tukey, "The Future of Data Analysis", Annals of Mathematical Statistics, Vol. 33 (1), 1962)

"The physical sciences are used to ‘praying over’ their data, examining the same data from a variety of points of view. This process has been very rewarding, and has led to many extremely valuable insights. Without this sort of flexibility, progress in physical science would have been much slower. Flexibility in analysis is often to be had honestly at the price of a willingness not to demand that what has already been observed shall establish, or prove, what analysis suggests. In physical science generally, the results of praying over the data are thought of as something to be put to further test in another experiment, as indications rather than conclusions." (John W Tukey, "The Future of Data Analysis", Annals of Mathematical Statistics, Vol. 33 (1), 1962)

"The histogram, with its columns of area proportional to number, like the bar graph, is one of the most classical of statistical graphs. Its combination with a fitted bell-shaped curve has been common since the days when the Gaussian curve entered statistics. Yet as a graphical technique it really performs quite poorly. Who is there among us who can look at a histogram-fitted Gaussian combination and tell us, reliably, whether the fit is excellent, neutral, or poor? Who can tell us, when the fit is poor, of what the poorness consists? Yet these are just the sort of questions that a good graphical technique should answer at least approximately." (John W Tukey, "The Future of Processes of Data Analysis", 1965)

"The first step in data analysis is often an omnibus step. We dare not expect otherwise, but we equally dare not forget that this step, and that step, and other step, are all omnibus steps and that we owe the users of such techniques a deep and important obligation to develop ways, often varied and competitive, of replacing omnibus procedures by ones that are more sharply focused." (John W Tukey, "The Future of Processes of Data Analysis", 1965)

"The basic general intent of data analysis is simply stated: to seek through a body of data for interesting relationships and information and to exhibit the results in such a way as to make them recognizable to the data analyzer and recordable for posterity. Its creative task is to be productively descriptive, with as much attention as possible to previous knowledge, and thus to contribute to the mysterious process called insight." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Comparable objectives in data analysis are (1) to achieve more specific description of what is loosely known or suspected; (2) to find unanticipated aspects in the data, and to suggest unthought-of models for the data's summarization and exposure; (3) to employ the data to assess the (always incomplete) adequacy of a contemplated model; (4) to provide both incentives and guidance for further analysis of the data; and (5) to keep the investigator usefully stimulated while he absorbs the feeling of his data and considers what to do next." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"The science and art of data analysis concerns the process of learning from quantitative records of experience. By its very nature it exists in relation to people. Thus, the techniques and the technology of data analysis must be harnessed to suit human requirements and talents. Some implications for effective data analysis are: (1) that it is essential to have convenience of interaction of people and intermediate results and (2) that at all stages of data analysis the nature and detail of output, both actual and potential, need to be matched to the capabilities of the people who use it and want it." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"In many instances, a picture is indeed worth a thousand words. To make this true in more diverse circumstances, much more creative effort is needed to pictorialize the output from data analysis. Naive pictures are often extremely helpful, but more sophisticated pictures can be both simple and even more informative." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Data analysis must be iterative to be effective. [...] The iterative and interactive interplay of summarizing by fit and exposing by residuals is vital to effective data analysis. Summarizing and exposing are complementary and pervasive." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Summarizing data is a process of constrained and partial description - a process that essentially and inevitably corresponds to some sort of fitting, though it need not necessarily involve formal criteria or well-defined computations." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"The typical statistician has learned from bitter experience that negative results are just as important as positive ones, sometimes more so." (John W Tukey, "A Statistician's Comment", 1967)

"It is fair to say that statistics has made its greatest progress by having to move away from certainty [...] If we really want to make progress, we need to identify our next step away from certainty." (John W Tukey, "What Have Statisticians Been Forgetting", 1967)

"Every student of the art of data analysis repeatedly needs to build upon his previous statistical knowledge and to reform that foundation through fresh insights and emphasis." (John W Tukey, "Data Analysis, Including Statistics", 1968)

"Every graph is at least an indication, by contrast with some common instances of numbers." (John W Tukey, "Data Analysis, Including Statistics", 1968)

"Nothing can substitute for relatively direct assessment of variability." (John W Tukey, "Data Analysis, Including Statistics", 1968)

"No one knows how to appraise a procedure safely except by using different bodies of data from those that determined it." (John W Tukey, "Data Analysis, Including Statistics", 1968)

"The problems of different fields are much more alike than their practitioners think, much more alike than different." (John W Tukey, "Analyzing Data: Sanctification or Detective Work?", 1969)

"[...] bending the question to fit the analysis is to be shunned at all costs." (John W Tukey, "Analyzing Data: Sanctification or Detective Work?", 1969)

"Data analysis is in important ways an antithesis of pure mathematics." (John W Tukey, "Data Analysis, Computation and Mathematics", 1972)

"Undoubtedly, the swing to exploratory data analysis will go somewhat too far. However: It is better to ride a damped pendulum than to be stuck in the mud." (John W Tukey, "Exploratory Data Analysis as Part of a Larger Whole", 1973)

"The greatest value of a picture is when it forces us to notice what we never expected to see." (John W Tukey, "Exploratory Data Analysis", 1977)

"[...] exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as for those we believe might be there. Except for its emphasis on graphs, its tools are secondary to its purpose." (John W Tukey, [comment] 1979)

"There is NO question of teaching confirmatory OR exploratory - we need to teach both." (John W Tukey, "We Need Both Exploratory and Confirmatory", 1980)

"Finding the question is often more important than finding the answer." (John W Tukey, "We Need Both Exploratory and Confirmatory", 1980)

"[...] any hope that we are smart enough to find even transiently optimum solutions to our data analysis problems is doomed to failure, and, indeed, if taken seriously, will mislead us in the allocation of effort, thus wasting both intellectual and computational effort." (John W Tukey, "Choosing Techniques for the Analysis of Data", 1981)

"Detailed study of the quality of data sources is an essential part of applied work. [...] Data analysts need to understand more about the measurement processes through which their data come. To know the name by which a column of figures is headed is far from being enough." (John W Tukey, "An Overview of Techniques of Data Analysis, Emphasizing Its Exploratory Aspects", 1982)

"Exploratory data analysis, EDA, calls for a relatively free hand in exploring the data, together with dual obligations: (•) to look for all plausible alternatives and oddities - and a few implausible ones, (graphic techniques can be most helpful here) and (•) to remove each appearance that seems large enough to be meaningful - ordinarily by some form of fitting, adjustment, or standardization [...] so that what remains, the residuals, can be examined for further appearances." (John W Tukey, "Introduction to Styles of Data Analysis Techniques", 1982)

"The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." (John W Tukey, "Sunset Salvo", The American Statistician Vol. 40 (1), 1986)

"The worst, i.e., most dangerous, feature of 'accepting the null hypothesis' is the giving up of explicit uncertainty. […] Mathematics can sometimes be put in such black-and-white terms, but our knowledge or belief about the external world never can." (John W Tukey, "The Philosophy of Multiple Comparisons", Statistical Science Vol. 6 (1), 1991)

"Statistics is the science, the art, the philosophy, and the technique of making inferences from the particular to the general." (John W Tukey)

28 April 2006

🖍️William E Deming - Collected Quotes

"It is important to realize that it is not the one measurement, alone, but its relation to the rest of the sequence that is of interest." (William E Deming, "Statistical Adjustment of Data", 1938)

"The definition of random in terms of a physical operation is notoriously without effect on the mathematical operations of statistical theory because so far as these mathematical operations are concerned random is purely and simply an undefined term." (Walter A Shewhart & William E Deming, "Statistical Method from the Viewpoint of Quality Control", 1939)

"Experience without theory teaches nothing." (William E Deming, "Out of the Crisis", 1986)

"Sampling is the science and art of controlling and measuring the reliability of useful statistical information through the theory of probability." (William E Deming, "Some Theory of Sampling", 1950)

"Why waste knowledge?… No company can afford to waste knowledge. Failure of management to break down barriers between activities… is one way to waste knowledge. People that are not working together are not contributing their best to the company. People as they work together, feeling secure in the job reinforce their knowledge and efforts. Their combined output, when they are working together, is more than the sum of their separate abilities." (W Edwards Deming, "Quality, Productivity and Competitive Position", 1982)

"Experience by itself teaches nothing [...] Without theory, experience has no meaning. Without theory, one has no questions to ask. Hence without theory there is no learning." (William E Deming, "The New Economics for Industry, Government, Education", 1993)

"Knowledge is theory. We should be thankful if action of management is based on theory. Knowledge has temporal spread. Information is not knowledge. The world is drowning in information but is slow in acquisition of knowledge. There is no substitute for knowledge." (William E Deming, "The New Economics for Industry, Government, Education", 1993)

"What is a system? A system is a network of interdependent components that work together to try to accomplish the aim of the system. A system must have an aim. Without an aim, there is no system. The aim of the system must be clear to everyone in the system. The aim must include plans for the future. The aim is a value judgment." (William E Deming, "The New Economics for Industry, Government, Education", 1993)

"The only useful function of a statistician is to make predictions, and thus to provide a basis for action." (William E Deming)

"Too little attention is given to the need for statistical control, or to put it more pertinently, since statistical control (randomness) is so rarely found, too little attention is given to the interpretation of data that arise from conditions not in statistical control." (William E Deming)

🖍️Donald J Wheeler - Collected Quotes

"Averages, ranges, and histograms all obscure the time-order for the data. If the time-order for the data shows some sort of definite pattern, then the obscuring of this pattern by the use of averages, ranges, or histograms can mislead the user. Since all data occur in time, virtually all data will have a time-order. In some cases this time-order is the essential context which must be preserved in the presentation." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Before you can improve any system you must listen to the voice of the system (the Voice of the Process). Then you must understand how the inputs affect the outputs of the system. Finally, you must be able to change the inputs (and possibly the system) in order to achieve the desired results. This will require sustained effort, constancy of purpose, and an environment where continual improvement is the operating philosophy." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Data are collected as a basis for action. Yet before anyone can use data as a basis for action the data have to be interpreted. The proper interpretation of data will require that the data be presented in context, and that the analysis technique used will filter out the noise." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Data are generally collected as a basis for action. However, unless potential signals are separated from probable noise, the actions taken may be totally inconsistent with the data. Thus, the proper use of data requires that you have simple and effective methods of analysis which will properly separate potential signals from probable noise." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"No comparison between two values can be global. A simple comparison between the current figure and some previous value cannot fully capture and convey the behavior of any time series. […] While it is simple and easy to compare one number with another number, such comparisons are limited and weak. They are limited because of the amount of data used, and they are weak because both of the numbers are subject to the variation that is inevitably present in real world data. Since both the current value and the earlier value are subject to this variation, it will always be difficult to determine just how much of the difference between the values is due to variation in the numbers, and how much, if any, of the difference is due to real changes in the process." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"No matter what the data, and no matter how the values are arranged and presented, you must always use some method of analysis to come up with an interpretation of the data. While every data set contains noise, some data sets may contain signals. Therefore, before you can detect a signal within any given data set, you must first filter out the noise." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"We analyze numbers in order to know when a change has occurred in our processes or systems. We want to know about such changes in a timely manner so that we can respond appropriately. While this sounds rather straightforward, there is a complication - the numbers can change even when our process does not. So, in our analysis of numbers, we need to have a way to distinguish those changes in the numbers that represent changes in our process from those that are essentially noise." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"When a system is predictable, it is already performing as consistently as possible. Looking for assignable causes is a waste of time and effort. Instead, you can meaningfully work on making improvements and modifications to the process. When a system is unpredictable, it will be futile to try and improve or modify the process. Instead you must seek to identify the assignable causes which affect the system. The failure to distinguish between these two different courses of action is a major source of confusion and wasted effort in business today." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"When a process displays unpredictable behavior, you can most easily improve the process and process outcomes by identifying the assignable causes of unpredictable variation and removing their effects from your process." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"While all data contain noise, some data contain signals. Before you can detect a signal, you must filter out the noise." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Without meaningful data there can be no meaningful analysis. The interpretation of any data set must be based upon the context of those data. Unfortunately, much of the data reported to executives today are aggregated and summed over so many different operating units and processes that they cannot be said to have any context except a historical one - they were all collected during the same time period. While this may be rational with monetary figures, it can be devastating to other types of data." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"[…] you simply cannot make sense of any number without a contextual basis. Yet the traditional attempts to provide this contextual basis are often flawed in their execution. [...] Data have no meaning apart from their context. Data presented without a context are effectively rendered meaningless." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"A control chart is a tool for maintaining the status-quo - it was created to monitor a process after that process has been brought to a satisfactory level of operation." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Data analysis is not generally thought of as being simple or easy, but it can be. The first step is to understand that the purpose of data analysis is to separate any signals that may be contained within the data from the noise in the data. Once you have filtered out the noise, anything left over will be your potential signals. The rest is just details." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Descriptive statistics are built on the assumption that we can use a single value to characterize a single property for a single universe. […] Probability theory is focused on what happens to samples drawn from a known universe. If the data happen to come from different sources, then there are multiple universes with different probability models. If you cannot answer the homogeneity question, then you will not know if you have one probability model or many. [...] Statistical inference assumes that you have a sample that is known to have come from one universe." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"In order to be effective a descriptive statistic has to make sense - it has to distill some essential characteristic of the data into a value that is both appropriate and understandable. […] the justification for computing any given statistic must come from the nature of the data themselves - it cannot come from the arithmetic, nor can it come from the statistic. If the data are a meaningless collection of values, then the summary statistics will also be meaningless - no arithmetic operation can magically create meaning out of nonsense. Therefore, the meaning of any statistic has to come from the context for the data, while the appropriateness of any statistic will depend upon the use we intend to make of that statistic." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "It has been said that process behavior charts work because of the central limit theorem." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "It has been said that the data must be normally distributed before they can be placed on a process behavior chart." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "It has been said that the observations must be independent - data with autocorrelation are inappropriate for process behavior charts." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "It has been said that the process must be operating in control before you can place the data on a process behavior chart." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"The four questions of data analysis are the questions of description, probability, inference, and homogeneity. Any data analyst needs to know how to organize and use these four questions in order to obtain meaningful and correct results. [...] THE DESCRIPTION QUESTION: Given a collection of numbers, are there arithmetic values that will summarize the information contained in those numbers in some meaningful way?
THE PROBABILITY QUESTION: Given a known universe, what can we say about samples drawn from this universe? [...]
THE INFERENCE QUESTION: Given an unknown universe, and given a sample that is known to have been drawn from that unknown universe, and given that we know everything about the sample, what can we say about the unknown universe? [...]
THE HOMOGENEITY QUESTION: Given a collection of observations, is it reasonable to assume that they came from one universe, or do they show evidence of having come from multiple universes?" (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"The simplicity of the process behavior chart can be deceptive. This is because the simplicity of the charts is based on a completely different concept of data analysis than that which is used for the analysis of experimental data. When someone does not understand the conceptual basis for process behavior charts they are likely to view the simplicity of the charts as something that needs to be fixed. Out of these urges to fix the charts all kinds of myths have sprung up resulting in various levels of complexity and obstacles to the use of one of the most powerful analysis techniques ever invented." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "The standard deviation statistic is more efficient than the range and therefore we should use the standard deviation statistic when computing limits for a process behavior chart." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)