Showing posts with label estimation.

08 April 2024

Business Intelligence: Why Data Projects Fail to Deliver Real-Life Impact (Part III: Failure through the Looking Glass)

Business Intelligence
Business Intelligence Series

There’s a huge volume of material available on project failure – resources that document why individual projects failed, why projects fail in general, and why project members, managers and/or executives think projects fail – and there seems to be no more pleasant activity at the end of a project than theorizing about why it failed, a topic that occasionally culminates in the blaming game. Success may generate applause, though it is failure that attracts and stirs the most waves (irony, disapproval, and other similar behavior), and everybody seems to be an expert once the endeavor is consumed. 

The very definition of project failure – not fulfilling the project’s objectives within the set budget and timeframe – is misleading, because budgets and timelines are estimated based on the information available at the beginning of the project, and the amount of uncertainty for many projects is considerable, data projects being no exception. The higher the uncertainty, the less reliable the two estimates become. Even simple projects can reveal uncertainty, especially when their broader context is considered. 

Even if it’s not common practice, one way to cope with uncertainty is to add a tolerance to the estimates, though even this will not always accommodate the full extent of the unknown, as the tolerances are usually small. The general expectation is an accurate and precise landing, which for big or exploratory projects is seldom possible. 

Moreover, the assumptions under which the estimates hold are easily invalidated in praxis – resources’ availability, first time right, executives’ support in setting priorities, requirements’ quality, technologies’ maturity, etc. If one looks beyond the reasons why projects fail in general, quite often the issues are more organizational than technological, the lack of knowledge and experience being one of the factors. 

Conversely, many projects will not get approved if the estimates don’t look positive, and therefore people are pressured in one way or another to make the numbers fit the expectations. Some projects, given their importance, need to be done even if the numbers don’t look good or can’t be quantified correctly. Other projects represent people’s subsistence on the job, respectively people’s self-occupation to create motion, though they can occasionally also have a positive impact on organizations. These kinds of aspects almost never make it into statistics or surveys. Neither do the big issues people are afraid to talk about. Add to this that, in the light of politics and the office grapevine, the facts get distorted.

Data projects show all the symptoms of failure that projects have in general, though when words like AI, Statistics or Machine Learning are used, the chances of failure are even higher, given that the respective fields require a higher level of expertise, the appropriate use of technologies and adherence to the scientific process for the results to be valid. If projects can benefit from general recipes, respectively established procedures and methods, the range of applicability of these decreases when the mentioned areas are involved. 

Many data projects have an exploratory nature – seeing what’s possible – and therefore a considerable percentage will not reach production. Moreover, even those that reach that far might still be stopped or discarded sooner or later if they don’t deliver the expected value, and probably many of the models created in the process are biased, irrelevant, or apply the theory incorrectly. Add to this that the mere use of tools and algorithms is not Data Science or Data Analysis. 

The challenge for many data projects is to identify which Project Management (PM) best practices to consider. Following all of them, or none at all, just increases the risk of failure!


06 April 2024

Business Intelligence: Why Data Projects Fail to Deliver Real-Life Impact (Part II: There's Value in Failure)

Business Intelligence
Business Intelligence Series

"Results are nothing; the energies which produce them
and which again spring from them are everything."
(Wilhelm von Humboldt,  "On Language", 1836)

When data is not available but is needed on a continuous basis, the usual solution is to redesign the processes and make sure the data becomes available at the needed quality level. Redesign involves additional costs for the business; therefore, it might be tempting to cancel or postpone data projects, at least until they become feasible, though they seldom become feasible by waiting. 

Just because there’s a set of data, this doesn’t mean that there is important knowledge to be extracted from it, respectively that the investment is worthwhile. There’s however value in building experience in the internal resources, in identifying the challenges and the opportunities, in identifying what needs to be changed for harnessing the data. Unfortunately, organizations expect that somebody else will do the work for them instead of making the jump by themselves, and this approach will more likely fail. It’s like expecting to become enlightened after a few theoretical sessions with a guru rather than walking the path oneself. 

This is also reflected in organizations’ readiness to make the effort required for moving up the maturity scale. If organizations can’t approach such topics systematically, address the assumptions, opportunities, and risks adequately, respectively manage the various aspects involved, it’s hard to believe that their data journey will be positive. 

A data journey shouldn’t be about politics, even if some minds need to be changed in the process, at management as well as at lower levels. If the leadership doesn’t recognize the importance of becoming an enabler for such initiatives, then the organization probably deserves to keep the status quo. The drive for change should come from the leadership, whether we talk about data culture, data strategy, decision-making, or any other critical aspect.

An organization will always need to find the balance between time, scope, cost, and quality, and this applies to operations, tactics, and strategies as well as to projects. There are hard limits and a lot of uncertainty associated with data projects and the tasks involved, limits reflected in cost and time estimations (which frankly are just experts’ rough guesses that can change for the worse in the light of new information). Therefore, especially in data projects one needs to be able to compromise, to change scope and timelines as seen fit, and, why not, to cancel projects if the objectives aren’t feasible anymore, respectively if compromises can’t be reached.

An organization must be able to take risks and invest in failure, otherwise the opportunities for growth don’t materialize. Being able to split a roadmap into small iterative steps – steps that, besides breaking down the complexity and making progress, allow evaluating the progress and the knowledge gained, respectively incorporating the feedback and knowledge into the next steps – can prove to be what organizations lack in coping with the high uncertainty. Instead, organizations seem to be fascinated by the big bang, thinking that technology can automatically fill the organizational gaps.

Doing the same thing repeatedly and expecting different results is called insanity. Unfortunately, this is what organizations and service providers do in what concerns Project Management in general and data projects in particular. Building something without a foundation, without making sure that the employees have the skillset, maturity and culture to manage the data-related tasks, challenges and opportunities is pure insanity!

Bottom line, harnessing the data requires a certain maturity, and it starts with recognizing and pursuing opportunities, setting goals, following roadmaps, learning to fail and getting value from failure, respectively controlling the failure. Growth or instant enlightenment without a fair amount of sweat is possible, though that’s an exception granted to few!


07 March 2021

Project Management: Agile Manifesto Reloaded II (Requirements Management)

Project Management

Independently of its scope and the methodology used, each software development project is made of the same blocks/phases, possibly arranged differently. It starts with Requirements Management (RM) subprocesses in which the functional and non-functional requirements are gathered, consolidated, prioritized and brought to a form which facilitates their understanding and estimation. It’s an iterative process, as there can be overlapping functionality, requirements that don’t bring any significant benefit compared with the investment, respectively new aspects discovered during the internal discussions or with the implementer.

As output of this phase, it’s important to have a list of requirements that reflects the customer’s needs in respect to the product(s) to be implemented. Once frozen, the list defines the project’s scope and is used for estimating the costs, sketching a draft of the final solution, respectively reaching a contractual agreement with the implementer. Ideally, the set of requirements should be complete and coherent while reflecting the customer’s needs. In theory, this allows agreeing upon costs as well as upon an architecture and other important aspects (responsibilities/accountability).

Typically, each new requirement considered after this stage needs to go through a Change Management (CM) process in which it gets formulated to the needed level of detail, a cost, effort and impact analysis is performed, and the budget for it is approved or the change gets rejected. Ideally, small changes can be covered by a buffer budget planned upfront, however in the end each change comes with a cost and with project delays.

Some changes come late in the project and can have an important impact on the whole architecture when important aspects were missed upfront. Moreover, when the number of changes goes beyond a certain limit it can lead to what is known as scope creep, with important consequences on the project’s costs, timeline and quality. Therefore, to minimize the impact on the project, the number of changes needs to be kept to a minimum, typically considering only the critical changes, while the others can still be implemented after the project’s end.

The agile manifesto’s principles impose an important constraint on the requirements – changing requirements are welcome even late in the process – an assumption – the best requirements emerge from self-organizing teams – and probably one implication – the requirements need to be defined together with the implementer.

The way changing requirements are handled seems to provide more flexibility, though it’s actually a constraint imposed on the CM process, which interfaces with the RM processes. Without a proper CM in place, any requirement might end up being implemented, independently of whether it is feasible or not. This can easily make the project’s costs explode, sometimes unnecessarily, while accommodating extreme behavior like changing the same functionality frequently, handling exceptions extensively, etc.

It’s usually helpful to define the requirements together with the implementer, as this can bring more quality into the process, even if more time needs to be invested. However, starting from a solid set of requirements is a critical factor for the project’s success. The manifesto makes no direct statement about this; it merely reiterates that the best requirements emerge from self-organizing teams, which is not necessarily the case. 

The users who in theory can define the requirements best are the ones who have the deepest knowledge of an organization’s processes and IT architecture, typically the key users and/or IT experts. Self-organization revolves around how a team organizes itself and handles the various activities, though there’s no guarantee that it will address the important aspects, no matter how motivated the team is, how constant the pace, how excellently the technical details were handled or how well the final product works.


04 March 2021

Project Management: Projects' Dynamics II (Motion)

Project Management

Motion is the action or process of moving or being moved between an initial and a final or intermediate point. From the tiniest endeavors to the movement of the planets and beyond, everything is governed by motion. If the laws of nature seem to reveal an inner structural perfection, the activities people perform are quite often far from perfect, which is acceptable if we consider that (almost) everything is a learning process. What is probably less acceptable is the volume of inefficient motion, which we can sometimes easily categorize as waste.

The waste associated with motion can take many forms: sorting through a pile of tools to find the right one, searching for information, moving back and forth to reach a destination or achieve a goal, etc. Suboptimal motion can have important effects on an organization, resulting in reduced productivity, respectively higher costs.

If for repetitive activities that involve a certain degree of similarity a way to optimize the motion can typically be found, the higher the uncertainty of the steps involved, the more difficult it becomes to optimize them. It’s the case of discovery endeavors, in which the path between start and destination can’t be traced beforehand, respectively when the destination or the path in between can’t be depicted to the needed level of detail. A strategy’s implementation, ERP implementations and other complex projects, especially the ones dealing with new technologies and/or incomplete knowledge, tend to be exploratory in nature and thus fall under this latter type of motion.

In other words, one must know at minimum the starting point, the destination, how to reach it and what it takes to reach it – resources, knowledge, skillset. When one has all this information, one can go on and estimate how long it will take to reach the destination, though the estimate reflects the information available as well as the estimator’s skill in translating that information into a realistic roadmap. Each new piece of information has the potential of impacting the whole process considerably, in extremis to the degree that one must start the journey anew. The complexity of such projects and the volume of uncertainty can make estimation difficult if not impossible, no matter how good the estimator’s skills are. At best an estimator can come up with a best- and worst-case estimate, both however dependent on the assumptions made.
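
To make the best-/worst-case idea more tangible, here is a minimal sketch (not from the original post) that turns per-task best/most-likely/worst guesses into a range for the whole project via Monte Carlo sampling. The task list and durations are hypothetical, and the sketch assumes independent tasks and triangular distributions – strong simplifications compared with real projects.

```python
# Minimal sketch: turning per-task best/most-likely/worst guesses into a
# project-level range via Monte Carlo sampling. Assumes independent tasks
# and triangular distributions; all task names and numbers are hypothetical.
import random

tasks = {                       # (best, most likely, worst) in days
    "Clarify requirements": (3, 5, 12),
    "Prepare data":         (5, 10, 25),
    "Build and test model": (8, 15, 40),
}

def simulate_totals(n_runs: int = 10_000) -> list[float]:
    totals = []
    for _ in range(n_runs):
        total = sum(random.triangular(low, high, mode)
                    for low, mode, high in tasks.values())
        totals.append(total)
    return sorted(totals)

totals = simulate_totals()
p10, p50, p90 = (totals[int(len(totals) * q)] for q in (0.10, 0.50, 0.90))
print(f"optimistic-ish (P10): {p10:.0f} days")
print(f"median (P50):         {p50:.0f} days")
print(f"pessimistic-ish (P90): {p90:.0f} days")
```

Even such a toy simulation makes the point of the paragraph: the output is a range driven entirely by the assumptions fed into it, not a single number one can commit to.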

Moreover, complex projects are sensitive to the initial conditions or auspices under which they start. This sensitivity can push a project in a totally different direction or pace, a shift that can be reinforced positively or negatively as the project progresses. It’s a continuous interplay between internal and external factors and components that can create synergies or have adverse effects, with the potential of reaching tipping points.

Related to the initial conditions, as praxis sometimes shows, for entities in continuous movement (like organizations) it’s also important to know where one is coming from (and at what speed), as the previous impulse (driving force) can be further used or steered as needed. Metaphorically, a project will need a certain time to find the right pace if it lacks the proper impulse.

Unless the team is trained to play and plays like an orchestra, the impact of deviations from expectations can hardly be quantified. To minimize the waste, ideally a project’s journey should deviate minimally from the optimal path, which can be challenging to achieve, as a project’s mass can pull the project in one direction or the other. The more the project advances, the bigger the mass, a fact which can make a project unstoppable. When such high-mass projects are stopped, their impulse can continue to haunt the organization for years afterward.


22 December 2018

Data Science: Significance (Just the Quotes)

"What the use of P [the significance level] implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred." (Harold Jeffreys, "Theory of Probability", 1939)

"As usual we may make the errors of I) rejecting the null hypothesis when it is true, II) accepting the null hypothesis when it is false. But there is a third kind of error which is of interest because the present test of significance is tied up closely with the idea of making a correct decision about which distribution function has slipped furthest to the right. We may make the error of III) correctly rejecting the null hypothesis for the wrong reason." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"Errors of the third kind happen in conventional tests of differences of means, but they are usually not considered, although their existence is probably recognized. It seems to the author that there may be several reasons for this among which are 1) a preoccupation on the part of mathematical statisticians with the formal questions of acceptance and rejection of null hypotheses without adequate consideration of the implications of the error of the third kind for the practical experimenter, 2) the rarity with which an error of the third kind arises in the usual tests of significance." (Frederick Mosteller, "A k-Sample Slippage Test for an Extreme Population", The Annals of Mathematical Statistics 19, 1948)

"If significance tests are required for still larger samples, graphical accuracy is insufficient, and arithmetical methods are advised. A word to the wise is in order here, however. Almost never does it make sense to use exact binomial significance tests on such data - for the inevitable small deviations from the mathematical model of independence and constant split have piled up to such an extent that the binomial variability is deeply buried and unnoticeable. Graphical treatment of such large samples may still be worthwhile because it brings the results more vividly to the eye." (Frederick Mosteller & John W Tukey, "The Uses and Usefulness of Binomial Probability Paper?", Journal of the American Statistical Association 44, 1949)

"It will, of course, happen but rarely that the proportions will be identical, even if no real association exists. Evidently, therefore, we need a significance test to reassure ourselves that the observed difference of proportion is greater than could reasonably be attributed to chance. The significance test will test the reality of the association, without telling us anything about the intensity of association. It will be apparent that we need two distinct things: (a) a test of significance, to be used on the data first of all, and (b) some measure of the intensity of the association, which we shall only be justified in using if the significance test confirms that the association is real." (Michael J Moroney, "Facts from Figures", 1951)

"The main purpose of a significance test is to inhibit the natural enthusiasm of the investigator." (Frederick Mosteller, "Selected Quantitative Techniques", 1954)

"The null-hypothesis significance test treats ‘acceptance’ or ‘rejection’ of a hypothesis as though these were decisions one makes. But a hypothesis is not something, like a piece of pie offered for dessert, which can be accepted or rejected by a voluntary physical action. Acceptance or rejection of a hypothesis is a cognitive process, a degree of believing or disbelieving which, if rational, is not a matter of choice but determined solely by how likely it is, given the evidence, that the hypothesis is true." (William W Rozeboom, "The fallacy of the null–hypothesis significance test", Psychological Bulletin 57, 1960)

"The null hypothesis of no difference has been judged to be no longer a sound or fruitful basis for statistical investigation. […] Significance tests do not provide the information that scientists need, and, furthermore, they are not the most effective method for analyzing and summarizing data." (Cherry A Clark, "Hypothesis Testing in Relation to Statistical Methodology", Review of Educational Research Vol. 33, 1963)

"Science usually amounts to a lot more than blind trial and error. Good statistics consists of much more than just significance tests; there are more sophisticated tools available for the analysis of results, such as confidence statements, multiple comparisons, and Bayesian analysis, to drop a few names. However, not all scientists are good statisticians, or want to be, and not all people who are called scientists by the media deserve to be so described." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"The idea of statistical significance is valuable because it often keeps us from announcing results that later turn out to be nonresults. A significant result tells us that enough cases were observed to provide reasonable assurance of a real effect. It does not necessarily mean, though, that the effect is big enough to be important." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"A tendency to drastically underestimate the frequency of coincidences is a prime characteristic of innumerates, who generally accord great significance to correspondences of all sorts while attributing too little significance to quite conclusive but less flashy statistical evidence." (John A Paulos, "Innumeracy: Mathematical Illiteracy and its Consequences", 1988)

"A little thought reveals a fact widely understood among statisticians: The null hypothesis, taken literally (and that’s the only way you can take it in formal hypothesis testing), is always false in the real world. [...] If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what’s the big deal about rejecting it?" (Jacob Cohen,"Things I Have Learned (So Far)", American Psychologist, 1990)

"Statistical significance testing can involve a tautological logic in which tired researchers, having collected data on hundreds of subjects, then conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they are tired. This tautology has created considerable damage as regards the cumulation of knowledge." (Bruce Thompson, "Two and One-Half Decades of Leadership in Measurement and Evaluation", Journal of Counseling & Development 70 (3), 1992)

"[…] an honest exploratory study should indicate how many comparisons were made […] most experts agree that large numbers of comparisons will produce apparently statistically significant findings that are actually due to chance. The data torturer will act as if every positive result confirmed a major hypothesis. The honest investigator will limit the study to focused questions, all of which make biologic sense. The cautious reader should look at the number of ‘significant’ results in the context of how many comparisons were made." (James L Mills, "Data torturing", New England Journal of Medicine, 1993)

"Graphic misrepresentation is a frequent misuse in presentations to the nonprofessional. The granddaddy of all graphical offenses is to omit the zero on the vertical axis. As a consequence, the chart is often interpreted as if its bottom axis were zero, even though it may be far removed. This can lead to attention-getting headlines about 'a soar' or 'a dramatic rise (or fall)'. A modest, and possibly insignificant, change is amplified into a disastrous or inspirational trend." (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"When significance tests are used and a null hypothesis is not rejected, a major problem often arises - namely, the result may be interpreted, without a logical basis, as providing evidence for the null hypothesis." (David F Parkhurst, "Statistical Significance Tests: Equivalence and Reverse Tests Should Reduce Misinterpretation", BioScience Vol. 51 (12), 2001)

"If you flip a coin three times and it lands on heads each time, it's probably chance. If you flip it a hundred times and it lands on heads each time, you can be pretty sure the coin has heads on both sides. That's the concept behind statistical significance - it's the odds that the correlation (or other finding) is real, that it isn't just random chance." (T Colin Campbell, "The China Study", 2004)

"The dual meaning of the word significant brings into focus the distinction between drawing a mathematical inference and practical inference from statistical results." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"A type of error used in hypothesis testing that arises when incorrectly rejecting the null hypothesis, although it is actually true. Thus, based on the test statistic, the final conclusion rejects the Null hypothesis, but in truth it should be accepted. Type I error equates to the alpha (α) or significance level, whereby the generally accepted default is 5%." (Lynne Hambleton, "Treasure Chest of Six Sigma Growth Methods, Tools, and Best Practices", 2007)

"For the study of the topology of the interactions of a complex system it is of central importance to have proper random null models of networks, i.e., models of how a graph arises from a random process. Such models are needed for comparison with real world data. When analyzing the structure of real world networks, the null hypothesis shall always be that the link structure is due to chance alone. This null hypothesis may only be rejected if the link structure found differs significantly from an expectation value obtained from a random model. Any deviation from the random null model must be explained by non-random processes." (Jörg Reichardt, "Structure in Complex Networks", 2009)

"There are three possible reasons for [the] absence of predictive power. First, it is possible that the models are misspecified. Second, it is possible that the model’s explanatory factors are measured at too high a level of aggregation [...] Third, [...] the search for statistically significant relationships may not be the strategy best suited for evaluating our model’s ability to explain real world events [...] the lack of predictive power is the result of too much emphasis having been placed on finding statistically significant variables, which may be overdetermined. Statistical significance is generally a flawed way to prune variables in regression models [...] Statistically significant variables may actually degrade the predictive accuracy of a model [...] [By using] models that are constructed on the basis of pruning undertaken with the shears of statistical significance, it is quite possible that we are winnowing our models away from predictive accuracy." (Michael D Ward et al, "The perils of policy by p-value: predicting civil conflicts" Journal of Peace Research 47, 2010)

"If the group is large enough, even very small differences can become statistically significant." (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

"Another way to secure statistical significance is to use the data to discover a theory. Statistical tests assume that the researcher starts with a theory, collects data to test the theory, and reports the results - whether statistically significant or not. Many people work in the other direction, scrutinizing the data until they find a pattern and then making up a theory that fits the pattern." (Gary Smith, "Standard Deviations", 2014)

"These practices - selective reporting and data pillaging - are known as data grubbing. The discovery of statistical significance by data grubbing shows little other than the researcher’s endurance. We cannot tell whether a data grubbing marathon demonstrates the validity of a useful theory or the perseverance of a determined researcher until independent tests confirm or refute the finding. But more often than not, the tests stop there. After all, you won’t become a star by confirming other people’s research, so why not spend your time discovering new theories? The data-grubbed theory consequently sits out there, untested and unchallenged." (Gary Smith, "Standard Deviations", 2014)

"With fast computers and plentiful data, finding statistical significance is trivial. If you look hard enough, it can even be found in tables of random numbers." (Gary Smith, "Standard Deviations", 2014)

"In short, statistical significance does not mean your result has any practical significance. As for statistical insignificance, it doesn’t tell you much. A statistically insignificant difference could be nothing but noise, or it could represent a real effect that can be pinned down only with more data." (Alex Reinhart, "Statistics Done Wrong: The Woefully Complete Guide", 2015)

"Statistical significance is a concept used by scientists and researchers to set an objective standard that can be used to determine whether or not a particular relationship 'statistically' exists in the data. Scientists test for statistical significance to distinguish between whether an observed effect is present in the data (given a high degree of probability), or just due to chance. It is important to note that finding a statistically significant relationship tells us nothing about whether a relationship is a simple correlation or a causal one, and it also can’t tell us anything about whether some omitted factor is driving the result." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Statistical significance refers to the probability that something is true. It’s a measure of how probable it is that the effect we’re seeing is real (rather than due to chance occurrence), which is why it’s typically measured with a p-value. P, in this case, stands for probability. If you accept p-values as a measure of statistical significance, then the lower your p-value is, the less likely it is that the results you’re seeing are due to chance alone." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

More quotes on "Significance" at the-web-of-knowledge.blogspot.com.

21 December 2018

Data Science: Estimation (Just the Quotes)

"The scientific value of a theory of this kind, in which we make so many assumptions, and introduce so many adjustable constants, cannot be estimated merely by its numerical agreement with certain sets of experiments. If it has any value it is because it enables us to form a mental image of what takes place in a piece of iron during magnetization." (James C Maxwell, "Treatise on Electricity and Magnetism" Vol. II, 1873)

"It [probability] is the very guide of life, and hardly can we take a step or make a decision of any kind without correctly or incorrectly making an estimation of probabilities." (William S Jevons, "The Principles of Science: A Treatise on Logic and Scientific Method", 1874)

"A statistical estimate may be good or bad, accurate or the reverse; but in almost all cases it is likely to be more accurate than a casual observer’s impression, and the nature of things can only be disproved by statistical methods." (Arthur L Bowley, "Elements of Statistics", 1901)

"Great numbers are not counted correctly to a unit, they are estimated; and we might perhaps point to this as a division between arithmetic and statistics, that whereas arithmetic attains exactness, statistics deals with estimates, sometimes very accurate, and very often sufficiently so for their purpose, but never mathematically exact." (Arthur L Bowley, "Elements of Statistics", 1901)

“Some of the common ways of producing a false statistical argument are to quote figures without their context, omitting the cautions as to their incompleteness, or to apply them to a group of phenomena quite different to that to which they in reality relate; to take these estimates referring to only part of a group as complete; to enumerate the events favorable to an argument, omitting the other side; and to argue hastily from effect to cause, this last error being the one most often fathered on to statistics. For all these elementary mistakes in logic, statistics is held responsible.” (Sir Arthur L Bowley, “Elements of Statistics”, 1901)

"A good estimator will be unbiased and will converge more and more closely (in the long run) on the true value as the sample size increases. Such estimators are known as consistent. But consistency is not all we can ask of an estimator. In estimating the central tendency of a distribution, we are not confined to using the arithmetic mean; we might just as well use the median. Given a choice of possible estimators, all consistent in the sense just defined, we can see whether there is anything which recommends the choice of one rather than another. The thing which at once suggests itself is the sampling variance of the different estimators, since an estimator with a small sampling variance will be less likely to differ from the true value by a large amount than an estimator whose sampling variance is large." (Michael J Moroney, "Facts from Figures", 1951)

"The enthusiastic use of statistics to prove one side of a case is not open to criticism providing the work is honestly and accurately done, and providing the conclusions are not broader than indicated by the data. This type of work must not be confused with the unfair and dishonest use of both accurate and inaccurate data, which too commonly occurs in business. Dishonest statistical work usually takes the form of: (1) deliberate misinterpretation of data; (2) intentional making of overestimates or underestimates; and (3) biasing results by using partial data, making biased surveys, or using wrong statistical methods." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1951)

"Statistics is the fundamental and most important part of inductive logic. It is both an art and a science, and it deals with the collection, the tabulation, the analysis and interpretation of quantitative and qualitative measurements. It is concerned with the classifying and determining of actual attributes as well as the making of estimates and the testing of various hypotheses by which probable, or expected, values are obtained. It is one of the means of carrying on scientific research in order to ascertain the laws of behavior of things - be they animate or inanimate. Statistics is the technique of the Scientific Method." (Bruce D Greenschields & Frank M Weida, "Statistics with Applications to Highway Traffic Analyses", 1952)

"We realize that if someone just 'grabs a handful', the individuals in the handful almost always resemble one another (on the average) more than do the members of a simple random sample. Even if the 'grabs' [sampling] are randomly spread around so that every individual has an equal chance of entering the sample, there are difficulties. Since the individuals of grab samples resemble one another more than do individuals of random samples, it follows (by a simple mathematical argument) that the means of grab samples resemble one another less than the means of random samples of the same size. From a grab sample, therefore, we tend to underestimate the variability in the population, although we should have to overestimate it in order to obtain valid estimates of variability of grab sample means by substituting such an estimate into the formula for the variability of means of simple random samples. Thus using simple random sample formulas for grab sample means introduces a double bias, both parts of which lead to an unwarranted appearance of higher stability." (Frederick Mosteller et al, "Principles of Sampling", Journal of the American Statistical Association Vol. 49 (265), 1954)

"The usefulness of the models in constructing a testable theory of the process is severely limited by the quickly increasing number of parameters which must be estimated in order to compare the predictions of the models with empirical results" (Anatol Rapoport, "Prisoner's Dilemma: A study in conflict and cooperation", 1965)

"Pencil and paper for construction of distributions, scatter diagrams, and run-charts to compare small groups and to detect trends are more efficient methods of estimation than statistical inference that depends on variances and standard errors, as the simple techniques preserve the information in the original data." (William E Deming, "On Probability as Basis for Action" American Statistician Vol. 29 (4), 1975)

"In physics it is usual to give alternative theoretical treatments of the same phenomenon. We construct different models for different purposes, with different equations to describe them. Which is the right model, which the 'true' set of equations? The question is a mistake. One model brings out some aspects of the phenomenon; a different model brings out others. Some equations give a rougher estimate for a quantity of interest, but are easier to solve. No single model serves all purposes best." (Nancy Cartwright, "How the Laws of Physics Lie", 1983)

"Probability is the mathematics of uncertainty. Not only do we constantly face situations in which there is neither adequate data nor an adequate theory, but many modem theories have uncertainty built into their foundations. Thus learning to think in terms of probability is essential. Statistics is the reverse of probability (glibly speaking). In probability you go from the model of the situation to what you expect to see; in statistics you have the observations and you wish to estimate features of the underlying model." (Richard W Hamming, "Methods of Mathematics Applied to Calculus, Probability, and Statistics", 1985)

"A mechanistic model has the following advantages: 1. It contributes to our scientific understanding of the phenomenon under study. 2. It usually provides a better basis for extrapolation (at least to conditions worthy of further experimental investigation if not through the entire range of all input variables). 3. It tends to be parsimonious (i. e, frugal) in the use of parameters and to provide better estimates of the response." (George E P Box, "Empirical Model-Building and Response Surfaces", 1987)

"A tendency to drastically underestimate the frequency of coincidences is a prime characteristic of innumerates, who generally accord great significance to correspondences of all sorts while attributing too little significance to quite conclusive but less flashy statistical evidence." (John A Paulos, "Innumeracy: Mathematical Illiteracy and its Consequences", 1988)

"A model for simulating dynamic system behavior requires formal policy descriptions to specify how individual decisions are to be made. Flows of information are continuously converted into decisions and actions. No plea about the inadequacy of our understanding of the decision-making processes can excuse us from estimating decision-making criteria. To omit a decision point is to deny its presence - a mistake of far greater magnitude than any errors in our best estimate of the process." (Jay W Forrester, "Policies, decisions and information sources for modeling", 1994)

"In constructing a model, we always attempt to maximize its usefulness. This aim is closely connected with the relationship among three key characteristics of every systems model: complexity, credibility, and uncertainty. This relationship is not as yet fully understood. We only know that uncertainty (predictive, prescriptive, etc.) has a pivotal role in any efforts to maximize the usefulness of systems models. Although usually (but not always) undesirable when considered alone, uncertainty becomes very valuable when considered in connection to the other characteristics of systems models: in general, allowing more uncertainty tends to reduce complexity and increase credibility of the resulting model. Our challenge in systems modelling is to develop methods by which an optimal level of allowable uncertainty can be estimated for each modelling problem." (George J Klir & Bo Yuan, "Fuzzy Sets and Fuzzy Logic: Theory and Applications", 1995)

"Delay time, the time between causes and their impacts, can highly influence systems. Yet the concept of delayed effect is often missed in our impatient society, and when it is recognized, it’s almost always underestimated. Such oversight and devaluation can lead to poor decision making as well as poor problem solving, for decisions often have consequences that don’t show up until years later. Fortunately, mind mapping, fishbone diagrams, and creativity/brainstorming tools can be quite useful here." (Stephen G Haines, "The Manager's Pocket Guide to Strategic and Business Planning", 1998)

“Accurate estimates depend at least as much upon the mental model used in forming the picture as upon the number of pieces of the puzzle that have been collected.” (Richards J. Heuer Jr, “Psychology of Intelligence Analysis”, 1999)

“[…] we underestimate the share of randomness in about everything […]  The degree of resistance to randomness in one’s life is an abstract idea, part of its logic counterintuitive, and, to confuse matters, its realizations nonobservable.” (Nassim N Taleb, “Fooled by Randomness”, 2001)

"Most long-range forecasts of what is technically feasible in future time periods dramatically underestimate the power of future developments because they are based on what I call the 'intuitive linear' view of history rather than the 'historical exponential' view." (Ray Kurzweil, "The Singularity is Near", 2005)

"[myth:] Accuracy is more important than precision. For single best estimates, be it a mean value or a single data value, this question does not arise because in that case there is no difference between accuracy and precision. (Think of a single shot aimed at a target.) Generally, it is good practice to balance precision and accuracy. The actual requirements will differ from case to case." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"As uncertainties of scientific data values are nearly as important as the data values themselves, it is usually not acceptable that a best estimate is only accompanied by an estimated uncertainty. Therefore, only the size of nondominant uncertainties should be estimated. For estimating the size of a nondominant uncertainty we need to find its upper limit, i.e., we want to be as sure as possible that the uncertainty does not exceed a certain value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Before best estimates are extracted from data sets by way of a regression analysis, the uncertainties of the individual data values must be determined.In this case care must be taken to recognize which uncertainty components are common to all the values, i.e., those that are correlated (systematic)." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Counting can be done without error. Usually, the counted number is an integer and therefore without (rounding) error. However, the best estimate of a scientifically relevant value obtained by counting will always have an error. These errors can be very small in cases of consecutive counting, in particular of regular events, e.g., when measuring frequencies." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Due to the theory that underlies uncertainties an infinite number of data values would be necessary to determine the true value of any quantity. In reality the number of available data values will be relatively small and thus this requirement can never be fully met; all one can get is the best estimate of the true value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is the aim of all data analysis that a result is given in form of the best estimate of the true value. Only in simple cases is it possible to use the data value itself as result and thus as best estimate." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is the nature of an uncertainty that it is not known and can never be known, whether the best estimate is greater or less than the true value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"The methodology of feedback design is borrowed from cybernetics (control theory). It is based upon methods of controlled system model’s building, methods of system states and parameters estimation (identification), and methods of feedback synthesis. The models of controlled system used in cybernetics differ from conventional models of physics and mechanics in that they have explicitly specified inputs and outputs. Unlike conventional physics results, often formulated as conservation laws, the results of cybernetical physics are formulated in the form of transformation laws, establishing the possibilities and limits of changing properties of a physical system by means of control." (Alexander L Fradkov, "Cybernetical Physics: From Control of Chaos to Quantum Control", 2007)

"A good estimator has to be more than just consistent. It also should be one whose variance is less than that of any other estimator. This property is called minimum variance. This means that if we run the experiment several times, the 'answers' we get will be closer to one another than 'answers' based on some other estimator." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"An estimate (the mathematical definition) is a number derived from observed values that is as close as we can get to the true parameter value. Useful estimators are those that are 'better' in some sense than any others." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Estimators are functions of the observed values that can be used to estimate specific parameters. Good estimators are those that are consistent and have minimum variance. These properties are guaranteed if the estimator maximizes the likelihood of the observations." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"GIGO is a famous saying coined by early computer scientists: garbage in, garbage out. At the time, people would blindly put their trust into anything a computer output indicated because the output had the illusion of precision and certainty. If a statistic is composed of a series of poorly defined measures, guesses, misunderstandings, oversimplifications, mismeasurements, or flawed estimates, the resulting conclusion will be flawed." (Daniel J Levitin, "Weaponized Lies", 2017)

"One final warning about the use of statistical models (whether linear or otherwise): The estimated model describes the structure of the data that have been observed. It is unwise to extend this model very far beyond the observed data." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"One kind of probability - classic probability - is based on the idea of symmetry and equal likelihood […] In the classic case, we know the parameters of the system and thus can calculate the probabilities for the events each system will generate. […] A second kind of probability arises because in daily life we often want to know something about the likelihood of other events occurring […]. In this second case, we need to estimate the parameters of the system because we don’t know what those parameters are. […] A third kind of probability differs from these first two because it’s not obtained from an experiment or a replicable event - rather, it expresses an opinion or degree of belief about how likely a particular event is to occur. This is called subjective probability […]." (Daniel J Levitin, "Weaponized Lies", 2017)

"Samples give us estimates of something, and they will almost always deviate from the true number by some amount, large or small, and that is the margin of error. […] The margin of error does not address underlying flaws in the research, only the degree of error in the sampling procedure. But ignoring those deeper possible flaws for the moment, there is another measurement or statistic that accompanies any rigorously defined sample: the confidence interval." (Daniel J Levitin, "Weaponized Lies", 2017)

"The margin of error is how accurate the results are, and the confidence interval is how confident you are that your estimate falls within the margin of error." (Daniel J Levitin, "Weaponized Lies", 2017)

23 November 2018

Data Science: Missing Data (Just the Quotes)

"Place little faith in an average or a graph or a trend when those important figures are missing."  (Darell Huff, "How to Lie with Statistics", 1954)

"Missing data values pose a particularly sticky problem for symbols. For instance, if the ray corresponding to a missing value is simply left off of a star symbol, the result will be almost indistinguishable from a minimum (i.e., an extreme) value. It may be better either (i) to impute a value, perhaps a median for that variable, or a fitted value from some regression on other variables, (ii) to indicate that the value is missing, possibly with a dashed line, or (iii) not to draw the symbol for a particular observation if any value is missing." (John M Chambers et al, "Graphical Methods for Data Analysis", 1983)

"The progress of science requires more than new data; it needs novel frameworks and contexts. And where do these fundamentally new views of the world arise? They are not simply discovered by pure observation; they require new modes of thought. And where can we find them, if old modes do not even include the right metaphors? The nature of true genius must lie in the elusive capacity to construct these new modes from apparent darkness. The basic chanciness and unpredictability of science must also reside in the inherent difficulty of such a task." (Stephen J Gould, "The Flamingo's Smile: Reflections in Natural History", 1985)

"We often think, naïvely, that missing data are the primary impediments to intellectual progress - just find the right facts and all problems will dissipate. But barriers are often deeper and more abstract in thought. We must have access to the right metaphor, not only to the requisite information. Revolutionary thinkers are not, primarily, gatherers of facts, but weavers of new intellectual structures." (Stephen J Gould, "The Flamingo's Smile: Reflections in Natural History", 1985)

"[...] as the planning process proceeds to a specific financial or marketing state, it is usually discovered that a considerable body of 'numbers' is missing, but needed numbers for which there has been no regular system of collection and reporting; numbers that must be collected outside the firm in some cases. This serendipity usually pays off in a much better management information system in the form of reports which will be collected and reviewed routinely." (William H. Franklin Jr., Financial Strategies, 1987)

"Unfortunately, just collecting the data in one place and making it easily available isn’t enough. When operational data from transactions is loaded into the data warehouse, it often contains missing or inaccurate data. How good or bad the data is a function of the amount of input checking done in the application that generates the transaction. Unfortunately, many deployed applications are less than stellar when it comes to validating the inputs. To overcome this problem, the operational data must go through a 'cleansing' process, which takes care of missing or out-of-range values. If this cleansing step is not done before the data is loaded into the data warehouse, it will have to be performed repeatedly whenever that data is used in a data mining operation." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"If you have only a small proportion of cases with missing data, you can simply throw out those cases for purposes of estimation; if you want to make predictions for cases with missing inputs, you don’t have the option of throwing those cases out." (Warren S Sarle, "Prediction with missing inputs", 1998)

"Every statistical analysis is an interpretation of the data, and missingness affects the interpretation. The challenge is that when the reasons for the missingness cannot be determined there is basically no way to make appropriate statistical adjustments. Sensitivity analyses are designed to model and explore a reasonable range of explanations in order to assess the robustness of the results." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"The best rule is: Don't have any missing data, Unfortunately, that is unrealistic. Therefore, plan for missing data and develop strategies to account for them. Do this before starting the study. The strategy should state explicitly how the type of missingness will be examined, how it will be handled, and how the sensitivity of the results to the missing data will be assessed." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Statistics depend on collecting information. If questions go unasked, or if they are asked in ways that limit responses, or if measures count some cases but exclude others, information goes ungathered, and missing numbers result. Nevertheless, choices regarding which data to collect and how to go about collecting the information are inevitable." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"A sin of omission – leaving something out – is a strong one and not always recognized; itʼs hard to ask for something you donʼt know is missing. When looking into the data, even before it is graphed and charted, there is potential for abuse. Simply not having all the data or the correct data before telling your story can cause problems and unhappy endings." (Brian Suda, "A Practical Guide to Designing with Data", 2010)

"Having NUMBERSENSE means: (•) Not taking published data at face value; (•) Knowing which questions to ask; (•) Having a nose for doctored statistics. [...] NUMBERSENSE is that bit of skepticism, urge to probe, and desire to verify. It’s having the truffle hog’s nose to hunt the delicacies. Developing NUMBERSENSE takes training and patience. It is essential to know a few basic statistical concepts. Understanding the nature of means, medians, and percentile ranks is important. Breaking down ratios into components facilitates clear thinking. Ratios can also be interpreted as weighted averages, with those weights arranged by rules of inclusion and exclusion. Missing data must be carefully vetted, especially when they are substituted with statistical estimates. Blatant fraud, while difficult to detect, is often exposed by inconsistency." (Kaiser Fung, "Numbersense: How To Use Big Data To Your Advantage", 2013)

"Quality without science and research is absurd. You can't make inferences that something works when you have 60 percent missing data." (Peter Pronovost, "Safe Patients, Smart Hospitals", 2010)

"There are several key issues in the field of statistics that impact our analyses once data have been imported into a software program. These data issues are commonly referred to as the measurement scale of variables, restriction in the range of data, missing data values, outliers, linearity, and nonnormality." (Randall E Schumacker & Richard G Lomax, "A Beginner’s Guide to Structural Equation Modeling" 3rd Ed., 2010)

"Missing data is the blind spot of statisticians. If they are not paying full attention, they lose track of these little details. Even when they notice, many unwittingly sway things our way. Most ranking systems ignore missing values." (Kaiser Fung, "Numbersense: How To Use Big Data To Your Advantage", 2013)

"Accuracy and coherence are related concepts pertaining to data quality. Accuracy refers to the comprehensiveness or extent of missing data, performance of error edits, and other quality assurance strategies. Coherence is the degree to which data - item value and meaning are consistent over time and are comparable to similar variables from other routinely used data sources." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"How good the data quality is can be looked at both subjectively and objectively. The subjective component is based on the experience and needs of the stakeholders and can differ by who is being asked to judge it. For example, the data managers may see the data quality as excellent, but consumers may disagree. One way to assess it is to construct a survey for stakeholders and ask them about their perception of the data via a questionnaire. The other component of data quality is objective. Measuring the percentage of missing data elements, the degree of consistency between records, how quickly data can be retrieved on request, and the percentage of incorrect matches on identifiers (same identifier, different social security number, gender, date of birth) are some examples." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"When we find data quality issues due to valid data during data exploration, we should note these issues in a data quality plan for potential handling later in the project. The most common issues in this regard are missing values and outliers, which are both examples of noise in the data." (John D. Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"There are other problems with Big Data. In any large data set, there are bound to be inconsistencies, misclassifications, missing data - in other words, errors, blunders, and possibly lies. These problems with individual items occur in any data set, but they are often hidden in a large mass of numbers even when these numbers are generated out of computer interactions." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Unless we’re collecting data ourselves, there’s a limit to how much we can do to combat the problem of missing data. But we can and should remember to ask who or what might be missing from the data we’re being told about. Some missing numbers are obvious […]. Other omissions show up only when we take a close look at the claim in question." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"[Making reasoned macro calls] starts with having the best and longest-time-series data you can find. You may have to take some risks in terms of the quality of data sources, but it amazes me how people are often more willing to act based on little or no data than to use data that is a challenge to assemble." (Robert J Shiller)

12 April 2012

Project Management: Program Evaluation and Review Technique (Definitions)

"A method of depicting, scheduling, and prioritizing a complex set of activities in a way that supports effective project management. This method provides excellent visibility into a project's progress and potential obstacles and risks." (Steven Haines, "The Product Manager's Desk Reference", 2008)

"A technique for estimating that applies as weighted average of optimistic, pessimistic, and most likely estimates when there is uncertainty with the individual activity estimates." (Project Management Institute, "Practice Standard for Project Estimating", 2010)

"A model for project or process management to evaluate tasks involved in the project or process in order to find the shortest duration possible." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Evaluates the probable duration of a project by calculating a weighted average of the best-case estimate, most likely case estimate, and worst-case estimate." (Bonnie Biafore, "Successful Project Management: Applying Best Practices and Real-World Techniques with Microsoft® Project", 2011)

"A statistical approach to estimating that uses a weighted average of three values: optimistic, pessimistic, and most likely." (Bonnie Biafore & Teresa Stover, "Your Project Management Coach: Best Practices for Managing Projects in the Real World", 2012)

"An estimating technique that starts with a network chart and combines optimistic, best estimate and pessimistic estimates to produce an overall estimate of the most likely duration and standard deviation (spread of likely durations) for a project activity." (Mike Clayton, "Brilliant Project Leader", 2012)

"An event-oriented network analysis technique used to estimate project duration when there is a high degree of uncertainty with individual activity duration estimates. PERT applies the critical path method to a weighted average duration estimate. It is considered a probabilistic method." (Peter Oakander et al, "CPM Scheduling for Construction: Best Practices and Guidelines", 2014)

05 October 2006

Steve C McConnell - Collected Quotes

"An algorithm gives you the instructions directly. A heuristic tells you how to discover the instructions for yourself, or at least where to look for them." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"At the software-architecture level, the complexity of a problem is reduced by dividing the system into subsystems. Humans have an easier time comprehending several simple pieces of information than one complicated piece." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"Because successful programming depends on minimizing complexity, a skilled programmer will build in as much flexibility as needed to meet the software's requirements but will not add flexibility - and related complexity - beyond what's required." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"By far the most common project risks in software development are poor requirements and poor project planning, thus preparation tends to focus on improving requirements and project plans."(Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"Complexity in all forms - complicated algorithms, large data sets, intricate communications protocols, and so on - is prone to errors. If an error does occur, it will be easier to find if it isn't spread through the code but is localized within a class. Changes arising from fixing the error won't affect other code because only one class will have to be fixed - other code won't be touched. If you find a better, simpler, or more reliable algorithm, it will be easier to replace the old algorithm if it has been isolated into a class. During development, it will be easier to try several designs and keep the one that works best." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"Design is sloppy because a good solution is often only subtly different from a poor one." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"Encapsulation says that, not only are you allowed to take a simpler view of a complex concept, you are not allowed to look at any of the details of the complex concept. What you see is what you get - it's all you get!" (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"From time to time, a complex algorithm will lead to a longer routine, and in those circumstances, the routine should be allowed to grow organically up to 100–200 lines. (A line is a noncomment, nonblank line of source code.) Decades of evidence say that routines of such length are no more error prone than shorter routines. Let issues such as the routine's cohesion, depth of nesting, number of variables, number of decision points, number of comments needed to explain the routine, and other complexity-related considerations dictate the length of the routine rather than imposing a length restriction per se." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"If you're passing a parameter among several routines, that might indicate a need to factor those routines into a class that share the parameter as object data. Streamlining parameter passing isn't a goal, per se, but passing lots of data around suggests that a different class organization might work better." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"In software, consultants sometimes tell you to buy into certain software-development methods to the exclusion of other methods. That’s unfortunate because if you buy into any single methodology 100 percent, you’ll see the whole world in terms of that methodology. In some instances, you’ll miss opportunities to use other methods better suited to your current problem." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"[...] inheritance is a powerful tool for reducing complexity because a programmer can focus on the generic attributes of an object without worrying about the details. If a programmer must be constantly thinking about semantic differences in subclass implementations, then inheritance is increasing complexity rather than reducing it." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"Inheritance is the idea that one class is a specialization of another class. The purpose of inheritance is to create simpler code by defining a base class that specifies common elements of two or more derived classes. The common elements can be routine interfaces, implementations, data members, or data types. Inheritance helps avoid the need to repeat code and data in multiple locations by centralizing it within a base class. When you decide to use inheritance, you have to make several decisions: For each member routine, will the routine be visible to derived classes? Will it have a default implementation? Will the default implementation be overridable? For each data member (including variables, named constants, enumerations, and so on), will the data member be visible to derived classes?" (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"Modularity's goal is to make each routine or class like a 'black box': You know what goes in, and you know what comes out, but you don't know what happens inside." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"The concept of modularity is related to information hiding, encapsulation, and other design heuristics. But sometimes thinking about how to assemble a system from a set of black boxes provides insights that information hiding and encapsulation don't, so the concept is worth having in your back pocket." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"The underlying message of all these rules is that inheritance tends to work against the primary technical imperative you have as a programmer, which is to manage complexity. For the sake of controlling complexity, you should maintain a heavy bias against inheritance." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"One of the main differences between programs you develop in school and those you develop as a professional is that the design problems solved by school programs are rarely, if ever, wicked." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"Programming assignments in school are devised to move you in a beeline from beginning to end. You'd probably want to tar and feather a teacher who gave you a programming assignment, then changed the assignment as soon as you finished the design, and then changed it again just as you were about to turn in the completed program. But that very process is an everyday reality in professional programming." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"Testing by itself does not improve software quality. Test results are an indicator of quality, but in and of themselves, they don't improve it. Trying to improve software quality by increasing the amount of testing is like trying to lose weight by weighing yourself more often. What you eat before you step onto the scale determines how much you will weigh, and the software development techniques you use determine how many errors testing will find. If you want to lose weight, don't buy a new scale; change your diet. If you want to improve your software, don't test more; develop better." (Steve C McConnell, "Code Complete: A Practical Handbook of Software Construction", 1993)

"The more independent the subsystems are, the more you make it safe to focus on one bit of complexity at a time. Carefully defined objects separate concerns so that you can focus on one thing at a time. Packages provide the same benefit at a higher level of aggregation." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"The most challenging part of programming is conceptualizing the problem, and many errors in programming are conceptual errors." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"The source code is often the only accurate description of the software. On many projects, the only documentation available to programmers is the code itself. Requirements specifications and design documents can go out of date, but the source code is always up to date. Consequently, it's imperative that the code be of the highest possible quality." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"The words available in a programming language for expressing your programming thoughts certainly determine how you express your thoughts and might even determine what thoughts you can express." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"Watch for coupling that's too tight. 'Coupling' refers to how tight the connection is between two classes. In general, the looser the connection, the better. Several general guidelines flow from this concept: Minimize accessibility of classes and members. Avoid friend classes, because they're tightly coupled. Make data private rather than protected in a base class to make derived classes less tightly coupled to the base class. Avoid exposing member data in a class's public interface. Be wary of semantic violations of encapsulation. Observe the 'Law of Demeter' [...]. Coupling goes hand in glove with abstraction and encapsulation. Tight coupling occurs when an abstraction is leaky, or when encapsulation is broken." (Steve C McConnell," Code Complete: A Practical Handbook of Software Construction", 1993)

"A typical software project can present more opportunities to learn from mistakes than some people get in a lifetime." (Steve McConnell, "Rapid Development", 1996)

"Even when you have skilled, motivated, hard-working people, the wrong team structure can undercut their efforts instead of catapulting them to success. A poor team structure can increase development time, reduce quality, damage morale, increase turnover, and ultimately lead to project cancellation." (Steve McConnell, "Rapid Development", 1996)

"Motivation is undoubtedly the single greatest influence on how well people perform. Most productivity studies have found that motivation has a stronger influence on productivity than any other factor."  (Steve McConnell, "Rapid Development", 1996)

"It's better to wait for a productive programmer to become available than it is to wait for the first available programmer to become productive." (Steve McConnell, "Software Project Survival Guide", 1997)

"Software projects fail for one of two general reasons: the project team lacks the knowledge to conduct a software project successfully, or the project team lacks the resolve to conduct a project effectively." (Steve McConnell, "Software Project Survival Guide", 1997)

"The default movement on a software project should be in the direction of taking elements of the software away to make it simpler rather than adding elements to make it more complex." (Steve McConnell, "Software Project Survival Guide", 1997)

"The job of the average manager requires a shift in focus every few minutes. The job of the average software developer requires that the developer not shift focus more often than every few hours." (Steve McConnell, "Software Project Survival Guide", 1997)

"Trying to apply formal methods to all software projects is just as bad as trying to apply code-and-fix development to all projects." (Steve McConnell, "After the Gold Rush: Creating a True Profession of Software Engineering", 1999)

"A brute force solution that works is better than an elegant solution that doesn't work." (Steve C McConnell, "Code Complete: A Practical Handbook of Software Construction" 2nd Ed., 2004)

"Building software implies various stages of planning, preparation and execution that vary in kind and degree depending on what's being built." (Steve C McConnell, "Code Complete: A Practical Handbook of Software Construction" 2nd Ed., 2004)

"Design patterns provide the cores of ready-made solutions that can be used to solve many of software’s most common problems. Some software problems require solutions that are derived from first principles. But most problems are similar to past problems, and those can be solved using similar solutions, or patterns." (Steve C McConnell, "Code Complete: A Practical Handbook of Software Construction" 2nd Ed., 2004)

"In addition to their complexity-management benefit, design patterns can accelerate design discussions by allowing designers to think and discuss at a larger level of granularity." (Steve C McConnell, "Code Complete: A Practical Handbook of Software Construction" 2nd Ed., 2004)

"In software, the chain isn't as strong as its weakest link; it's as weak as all the weak links multiplied together." (Steve C McConnell, "Code Complete: A Practical Handbook of Software Construction" 2nd Ed., 2004)

"Design is heuristic. Dogmatic adherence to any single methodology hurts creativity and hurts your programs." (Steve C McConnell, "Code Complete: A Practical Handbook of Software Construction" 2nd Ed., 2004)

"On small, informal projects, a lot of design is done while the programmer sits at the keyboard. 'Design' might be just writing a class interface in pseudocode before writing the details. It might be drawing diagrams of a few class relationships before coding them. It might be asking another programmer which design pattern seems like a better choice. Regardless of how it’s done, small projects benefit from careful design just as larger projects do, and recognizing design as an explicit activity maximizes the benefit you will receive from it." (Steve C McConnell, "Code Complete: A Practical Handbook of Software Construction" 2nd Ed., 2004)

"Simplicity is achieved in two general ways: minimizing the amount of essential complexity that anyone's brain has to deal with at any one time, and keeping accidental complexity from proliferating needlessly." (Steve C McConnell, "Code Complete: A Practical Handbook of Software Construction" 2nd Ed., 2004)

"A good estimate is an estimate that provides a clear enough view of the project reality to allow the project leadership to make good decisions about how to control the project to hit its targets." (Steve McConnell, "Software Estimation: Demystifying the Black Art", 2006)

"Be sure you understand whether you're presenting uncertainty in an estimate or uncertainty that affects your ability to meet a commitment." (Steve McConnell, "Software Estimation: Demystifying the Black Art", 2006)

"Don't expect better estimation practices alone to provide more accurate estimates for chaotic projects. You can't accurately estimate an out-of-control process." (Steve McConnell, "Software Estimation: Demystifying the Black Art", 2006)

"Don't intentionally underestimate. The penalty for underestimation is more severe than the penalty for overestimation. Address concerns about overestimation through planning and control, not by biasing your estimates." (Steve McConnell, "Software Estimation: Demystifying the Black Art", 2006)

"Not all estimation methods are equal. When looking for convergence or spread among estimates, give more weight to the techniques that tend to produce the most accurate results." (Steve McConnell, "Software Estimation: Demystifying the Black Art", 2006)

"Treat estimation discussions as problem solving, not negotiation. Recognize that all project stakeholders are on the same side of the table. Everyone wins, or everyone loses." (Steve McConnell, "Software Estimation: Demystifying the Black Art", 2006)

"The primary purpose of software estimation is not to predict a project's outcome; it is to determine whether a project's targets are realistic enough to allow the project to be controlled to meet them." (Steve McConnell, "Software Estimation: Demystifying the Black Art", 2006)
