SQL Troubles: pitfalls


17 April 2006

🖍️Gary Smith - Collected Quotes

"A computer makes calculations quickly and correctly, but doesn’t ask if the calculations are meaningful or sensible. A computer just does what it is told." (Gary Smith, "Standard Deviations", 2014)

"A study that leaves out data is waving a big red flag. A decision to include or exclude data sometimes makes all the difference in the world. This decision should be based on the relevance and quality of the data, not on whether the data support or undermine a conclusion that is expected or desired." (Gary Smith, "Standard Deviations", 2014)

"Another way to secure statistical significance is to use the data to discover a theory. Statistical tests assume that the researcher starts with a theory, collects data to test the theory, and reports the results - whether statistically significant or not. Many people work in the other direction, scrutinizing the data until they find a pattern and then making up a theory that fits the pattern." (Gary Smith, "Standard Deviations", 2014)

"Comparisons are the lifeblood of empirical studies. We can’t determine if a medicine, treatment, policy, or strategy is effective unless we compare it to some alternative. But watch out for superficial comparisons: comparisons of percentage changes in big numbers and small numbers, comparisons of things that have nothing in common except that they increase over time, comparisons of irrelevant data. All of these are like comparing apples to prunes." (Gary Smith, "Standard Deviations", 2014)

"Data clusters are everywhere, even in random data. Someone who looks for an explanation will inevitably find one, but a theory that fits a data cluster is not persuasive evidence. The found explanation needs to make sense and it needs to be tested with uncontaminated data." (Gary Smith, "Standard Deviations", 2014)

"Data without theory can fuel a speculative stock market bubble or create the illusion of a bubble where there is none. How do we tell the difference between a real bubble and a false alarm? You know the answer: we need a theory. Data are not enough. […] Data without theory is alluring, but misleading." (Gary Smith, "Standard Deviations", 2014)

"Don’t just do the calculations. Use common sense to see whether you are answering the correct question, the assumptions are reasonable, and the results are plausible. If a statistical argument doesn’t make sense, think about it carefully - you may discover that the argument is nonsense." (Gary Smith, "Standard Deviations", 2014)

"Graphs can help us interpret data and draw inferences. They can help us see tendencies, patterns, trends, and relationships. A picture can be worth not only a thousand words, but a thousand numbers. However, a graph is essentially descriptive - a picture meant to tell a story. As with any story, bumblers may mangle the punch line and the dishonest may lie." (Gary Smith, "Standard Deviations", 2014)

"Graphs should not be mere decoration, to amuse the easily bored. A useful graph displays data accurately and coherently, and helps us understand the data. Chartjunk, in contrast, distracts, confuses, and annoys. Chartjunk may be well-intentioned, but it is misguided. It may also be a deliberate attempt to mystify." (Gary Smith, "Standard Deviations", 2014)

"How can we tell the difference between a good theory and quackery? There are two effective antidotes: common sense and fresh data. If it is a ridiculous theory, we shouldn’t be persuaded by anything less than overwhelming evidence, and even then be skeptical. Extraordinary claims require extraordinary evidence. Unfortunately, common sense is an uncommon commodity these days, and many silly theories have been seriously promoted by honest researchers." (Gary Smith, "Standard Deviations", 2014)

"If somebody ransacks data to find a pattern, we still need a theory that makes sense. On the other hand, a theory is just a theory until it is tested with persuasive data." (Gary Smith, "Standard Deviations", 2014)

"[…] many gamblers believe in the fallacious law of averages because they are eager to find a profitable pattern in the chaos created by random chance." (Gary Smith, "Standard Deviations", 2014)

"Numbers are not inherently tedious. They can be illuminating, fascinating, even entertaining. The trouble starts when we decide that it is more important for a graph to be artistic than informative." (Gary Smith, "Standard Deviations", 2014)

"Provocative assertions are provocative precisely because they are counterintuitive - which is a very good reason for skepticism." (Gary Smith, "Standard Deviations", 2014)

"Remember that even random coin flips can yield striking, even stunning, patterns that mean nothing at all. When someone shows you a pattern, no matter how impressive the person’s credentials, consider the possibility that the pattern is just a coincidence. Ask why, not what. No matter what the pattern, the question is: Why should we expect to find this pattern?" (Gary Smith, "Standard Deviations", 2014)
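
The point about coin flips is easy to check numerically. The sketch below is an illustration added here, not something from the book: it flips a simulated fair coin 100 times, repeats that 2,000 times, and reports how often a sequence contains a streak of five or more identical outcomes. The sequence length, the number of repetitions, the streak threshold, and the seed are all arbitrary assumptions.

```python
import random

def longest_run(flips):
    """Length of the longest run of identical consecutive outcomes."""
    if not flips:
        return 0
    best = current = 1
    for prev, cur in zip(flips, flips[1:]):
        current = current + 1 if cur == prev else 1
        best = max(best, current)
    return best

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
N_FLIPS, N_SEQUENCES = 100, 2000  # illustrative sizes, not from the book

# Longest streak (heads or tails) in each simulated sequence of fair flips.
runs = [
    longest_run([rng.random() < 0.5 for _ in range(N_FLIPS)])
    for _ in range(N_SEQUENCES)
]
share_with_streak = sum(r >= 5 for r in runs) / N_SEQUENCES
print(f"Sequences containing a streak of 5 or more: {share_with_streak:.0%}")
```

Under these assumptions the large majority of purely random sequences contain such a streak, so a streak on its own says little about an underlying cause.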

"Self-selection bias occurs when people choose to be in the data - for example, when people choose to go to college, marry, or have children. […] Self-selection bias is pervasive in 'observational data', where we collect data by observing what people do. Because these people chose to do what they are doing, their choices may reflect who they are. This self-selection bias could be avoided with a controlled experiment in which people are randomly assigned to groups and told what to do." (Gary Smith, "Standard Deviations", 2014)

"The omission of zero magnifies the ups and downs in the data, allowing us to detect changes that might otherwise be ambiguous. However, once zero has been omitted, the graph is no longer an accurate guide to the magnitude of the changes. Instead, we need to look at the actual numbers." (Gary Smith, "Standard Deviations", 2014)

"These practices - selective reporting and data pillaging - are known as data grubbing. The discovery of statistical significance by data grubbing shows little other than the researcher’s endurance. We cannot tell whether a data grubbing marathon demonstrates the validity of a useful theory or the perseverance of a determined researcher until independent tests confirm or refute the finding. But more often than not, the tests stop there. After all, you won’t become a star by confirming other people’s research, so why not spend your time discovering new theories? The data-grubbed theory consequently sits out there, untested and unchallenged." (Gary Smith, "Standard Deviations", 2014)

"We are genetically predisposed to look for patterns and to believe that the patterns we observe are meaningful. […] Don’t be fooled into thinking that a pattern is proof. We need a logical, persuasive explanation and we need to test the explanation with fresh data." (Gary Smith, "Standard Deviations", 2014)

"We are hardwired to make sense of the world around us - to notice patterns and invent theories to explain these patterns. We underestimate how easily patterns can be created by inexplicable random events - by good luck and bad luck." (Gary Smith, "Standard Deviations", 2014)

"We are seduced by patterns and we want explanations for these patterns. When we see a string of successes, we think that a hot hand has made success more likely. If we see a string of failures, we think a cold hand has made failure more likely. It is easy to dismiss such theories when they involve coin flips, but it is not so easy with humans. We surely have emotions and ailments that can cause our abilities to go up and down. The question is whether these fluctuations are important or trivial." (Gary Smith, "Standard Deviations", 2014)

"We naturally draw conclusions from what we see […]. We should also think about what we do not see […]. The unseen data may be just as important, or even more important, than the seen data. To avoid survivor bias, start in the past and look forward." (Gary Smith, "Standard Deviations", 2014)

"We encounter regression in many contexts - pretty much whenever we see an imperfect measure of what we are trying to measure. Standardized tests are obviously an imperfect measure of ability." (Gary Smith, "Standard Deviations", 2014)

"With fast computers and plentiful data, finding statistical significance is trivial. If you look hard enough, it can even be found in tables of random numbers." (Gary Smith, "Standard Deviations", 2014)
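
A minimal sketch of that claim, added here as an illustration rather than taken from the book: one random "target" series is compared against 1,000 unrelated random series, and each correlation is checked against the conventional 5% cutoff. The sample size, the number of candidates, the seed, and the rounded cutoff of 0.444 (the approximate two-sided 5% critical value of |r| for 20 observations) are assumptions of this example.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed (arbitrary) for repeatability
N_OBS, N_CANDIDATES = 20, 1000  # illustrative sizes
R_CRITICAL = 0.444  # approx. two-sided 5% cutoff for |r| with 20 observations

# The target and every candidate are pure noise, so no relationship is real.
target = [rng.gauss(0, 1) for _ in range(N_OBS)]
significant = 0
for _ in range(N_CANDIDATES):
    candidate = [rng.gauss(0, 1) for _ in range(N_OBS)]
    if abs(pearson_r(candidate, target)) > R_CRITICAL:
        significant += 1

print(f"{significant} of {N_CANDIDATES} random series look 'significant'")
```

Roughly one candidate in twenty should clear the cutoff by chance alone, which is all that a 5% significance level promises.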

"[...] a mathematically elegant procedure can generate worthless predictions. Principal components regression is just the tip of the mathematical iceberg that can sink models used by well-intentioned data scientists. Good data scientists think about their tools before they use them." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"A neural-network algorithm is simply a statistical procedure for classifying inputs (such as numbers, words, pixels, or sound waves) so that these data can be mapped into outputs. The process of training a neural-network model is advertised as machine learning, suggesting that neural networks function like the human mind, but neural networks estimate coefficients like other data-mining algorithms, by finding the values for which the model’s predictions are closest to the observed values, with no consideration of what is being modeled or whether the coefficients are sensible." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Clowns fool themselves. Scientists don’t. Often, the easiest way to differentiate a data clown from a data scientist is to track the successes and failures of their predictions. Clowns avoid experimentation out of fear that they’re wrong, or wait until after seeing the data before stating what they expected to find. Scientists share their theories, question their assumptions, and seek opportunities to run experiments that will verify or contradict them. Most new theories are not correct and will not be supported by experiments. Scientists are comfortable with that reality and don’t try to ram a square peg in a round hole by torturing data or mangling theories. They know that science works, but only if it’s done right." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Data-mining tools, in general, tend to be mathematically sophisticated, yet often make implausible assumptions. Too often, the assumptions are hidden in the math and the people who use the tools are more impressed by the math than curious about the assumptions. Instead of being blinded by math, good data scientists use assumptions and models that make sense. Good data scientists use math, but do not worship it. They know that math is an invaluable tool, but it is not a substitute for common sense, wisdom, or expertise." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Deep neural networks have an input layer and an output layer. In between, are “hidden layers” that process the input data by adjusting various weights in order to make the output correspond closely to what is being predicted. [...] The mysterious part is not the fancy words, but that no one truly understands how the pattern recognition inside those hidden layers works. That’s why they’re called 'hidden'. They are an inscrutable black box - which is okay if you believe that computers are smarter than humans, but troubling otherwise." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Effective data scientists know that they are trying to convey accurate information in an easily understood way. We have never seen a pie chart that was an improvement over a simple table. Even worse, the creative addition of pictures, colors, shading, blots, and splotches may produce chartjunk that confuses the reader and strains the eyes." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Good data scientists are careful when they compare samples of different sizes. It is easier for small groups to be lucky. It’s also easier for small groups to be unlucky." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Good data scientists consider the reliability of the data, while data clowns don’t. It’s also important to know if there are unreported 'silent data'. If something is surprising about top-ranked groups, ask to see the bottom-ranked groups. Consider the possibility of survivorship bias and self-selection bias. Incomplete, inaccurate, or unreliable data can make fools out of anyone." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Good data scientists do not cherry pick data by excluding data that do not support their claims. One of the most bitter criticisms of statisticians is that, 'Figures don’t lie, but liars figure.' An unscrupulous statistician can prove most anything by carefully choosing favorable data and ignoring conflicting evidence." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Good data scientists know that, because of inevitable ups and downs in the data for almost any interesting question, they shouldn’t draw conclusions from small samples, where flukes might look like evidence." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Good data scientists know that some predictions are inherently difficult and we should not expect anything close to 100 percent accuracy. It is better to construct a reasonable model and acknowledge its uncertainty than to expect the impossible." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Good data scientists know that they need to get the assumptions right. It is not enough to have fancy math. Clever math with preposterous premises can be disastrous. [...] Good data scientists think about what they are modeling before making assumptions." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"In addition to overfitting the data by sifting through a kitchen sink of variables, data scientists can overfit the data by trying a wide variety of nonlinear models." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"It is certainly good data science practice to set aside data to test models. However, suppose that we data mine lots of useless models, and test them all on set-aside data. Just as some useless models are certain to fit the original data, some, by luck alone, are certain to fit the set-aside data too. Finding a model that fits both the original data and the set-aside data is just another form of data mining. Instead of discovering a model that fits half the data, we discover a model that fits all the data. That makes the problem less likely, but doesn’t solve it." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)
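
The argument about set-aside data can be sketched with a small simulation. This is an illustration added here under stated assumptions, not the authors' procedure: 10,000 useless "models" (random predictors) are screened on a training half, and the survivors are then checked on a set-aside half. The sample sizes, the seed, and the rounded cutoff of 0.632 (the approximate two-sided 5% critical value of |r| for 10 observations) are choices made for this example.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2)  # fixed seed (arbitrary) for repeatability
N_HALF, N_MODELS = 10, 10000  # illustrative sizes
R_CRITICAL = 0.632  # approx. two-sided 5% cutoff for |r| with 10 observations

def noise(n):
    """n independent standard-normal draws: data with no structure at all."""
    return [rng.gauss(0, 1) for _ in range(n)]

y_train, y_test = noise(N_HALF), noise(N_HALF)
fit_train = fit_both = 0
for _ in range(N_MODELS):
    x_train, x_test = noise(N_HALF), noise(N_HALF)
    r_train = pearson_r(x_train, y_train)
    if abs(r_train) > R_CRITICAL:
        fit_train += 1
        r_test = pearson_r(x_test, y_test)
        # "Confirmed" means significant on the set-aside half, with the same sign.
        if abs(r_test) > R_CRITICAL and r_test * r_train > 0:
            fit_both += 1

print(f"fit the training half: {fit_train}; also fit the set-aside half: {fit_both}")
```

The set-aside half filters out most of the useless models, but when enough of them are tried a few still pass both checks by luck, which is the sense in which the problem becomes less likely without being solved.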

"It is tempting to think that because computers can do some things extremely well, they must be highly intelligent, but being useful for specific tasks is very different from having a general intelligence that applies the lessons learned and the skills required for one task to more complex tasks, or completely different tasks." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Machines do not know which features to ignore and which to focus on, since that requires real knowledge of the real world. In the absence of such knowledge, computers focus on idiosyncrasies in the data that maximize their success with the training data, without considering whether these idiosyncrasies are useful for making predictions with fresh data. Because they don’t truly understand Real-World, computers cannot distinguish between the meaningful and the meaningless." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Mathematicians love math and many non-mathematicians are intimidated by math. This is a lethal combination that can lead to the creation of wildly unrealistic mathematical models. [...] A good mathematical model starts with plausible assumptions and then uses mathematics to derive the implications. A bad model focuses on the math and makes whatever assumptions are needed to facilitate the math." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Monte Carlo simulations handle uncertainty by using a computer’s random number generator to determine outcomes. Done over and over again, the simulations show the distribution of the possible outcomes. [...] The beauty of these Monte Carlo simulations is that they allow users to see the probabilistic consequences of their decisions, so that they can make informed choices. [...] Monte Carlo simulations are one of the most valuable applications of data science because they can be used to analyze virtually any uncertain situation where we are able to specify the nature of the uncertainty [...]" (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)
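
As a minimal sketch of the idea, assuming a made-up scenario that is not from the book: a starting amount of 100 grows for 20 years with annual returns drawn from a normal distribution with a 5% mean and a 15% standard deviation. Every parameter, including the choice of a normal distribution, is a hypothetical assumption; the point is only that repeating the random draw many times yields a distribution of outcomes rather than a single forecast.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed (arbitrary) for repeatability
N_SIMULATIONS = 10000
YEARS = 20
START = 100.0
MEAN_RETURN, SD_RETURN = 0.05, 0.15  # hypothetical annual return parameters

def simulate_final_value():
    """One simulated path: compound a random annual return for YEARS years."""
    value = START
    for _ in range(YEARS):
        value *= 1 + rng.gauss(MEAN_RETURN, SD_RETURN)
    return value

outcomes = [simulate_final_value() for _ in range(N_SIMULATIONS)]

# quantiles(n=20) returns 19 cut points at 5%, 10%, ..., 95%.
cuts = statistics.quantiles(outcomes, n=20)
p5, median, p95 = cuts[0], cuts[9], cuts[18]
print(f"5th percentile: {p5:.0f}, median: {median:.0f}, 95th percentile: {p95:.0f}")
```

Reading off the 5th and 95th percentiles alongside the median gives the "probabilistic consequences" the quote refers to, conditional on the assumed return distribution being reasonable.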

"Neural-network algorithms do not know what they are manipulating, do not understand their results, and have no way of knowing whether the patterns they uncover are meaningful or coincidental. Nor do the programmers who write the code know exactly how they work and whether the results should be trusted. Deep neural networks are also fragile, meaning that they are sensitive to small changes and can be fooled easily." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"One of the paradoxical things about computers is that they can excel at things that humans consider difficult (like calculating square roots) while failing at things that humans consider easy (like recognizing stop signs). They do not understand the world the way humans do. They have neither common sense nor wisdom. They are our tools, not our masters. Good data scientists know that data analysis still requires expert knowledge." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Outliers are sometimes clerical errors, measurement errors, or flukes that, if not corrected or omitted, will distort the data. At other times, they are the most important observations. Either way, good data scientists look at their data before analyzing them." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Regression toward the mean is NOT the fallacious law of averages, also known as the gambler’s fallacy. The fallacious law of averages says that things must balance out - that making a free throw makes a player more likely to miss the next shot; a coin flip that lands heads makes tails more likely on the next flip; and good luck now makes bad luck more likely in the future." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"Statistical correlations are a poor substitute for expertise. The best way to build models of the real world is to start with theories that are appealing and then test these models. Models that make sense can be used to make useful predictions." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"The binomial distribution applies to things like coin flips, where every flip has the same constant probability of occurring. Jay saw several problems. [...] the binomial distribution assumes that the outcomes are independent, the way that a coin flip doesn’t depend on previous flips. [...] The binomial distribution is elegant mathematics, but it should be used when its assumptions are true, not because the math is elegant." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"The label neural networks suggests that these algorithms replicate the neural networks in human brains that connect electrically excitable cells called neurons. They don’t. We have barely scratched the surface in trying to figure out how neurons receive, store, and process information, so we cannot conceivably mimic them with computers." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"The logic of regression is simple, but powerful. Our lives are filled with uncertainties. The difference between what we expect to happen and what actually does happen is, by definition, unexpected. We can call these unexpected surprises chance, luck, or some other convenient shorthand. The important point is that, no matter how reasonable or rational our expectations, things sometimes turn out to be higher or lower, larger or smaller, stronger or weaker than expected." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"The plausibility of the assumptions is more important than the accuracy of the math. There is a well-known saying about data analysis: 'Garbage in, garbage out.' No matter how impeccable the statistical analysis, bad data will yield useless output. The same is true of mathematical models that are used to make predictions. If the assumptions are wrong, the predictions are worthless." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"The principle behind regression toward the mean is that extraordinary performances exaggerate how far the underlying trait is from average. [...] Regression toward the mean also works for the worst performers. [...] Regression toward the mean is a purely statistical phenomenon that has nothing at all to do with ability improving or deteriorating over time." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)
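
Regression toward the mean can be reproduced with a simple simulation. The sketch below is an illustration under stated assumptions, not the authors' example: each simulated person has a fixed ability, and each of two test scores equals that ability plus independent noise. The population size, the mean of 100, the standard deviations of 10, and the seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(4)  # fixed seed (arbitrary) for repeatability
N_PEOPLE = 10000  # illustrative population size

# Ability never changes; each test score is ability plus independent luck.
people = []
for _ in range(N_PEOPLE):
    ability = rng.gauss(100, 10)
    test1 = ability + rng.gauss(0, 10)
    test2 = ability + rng.gauss(0, 10)
    people.append((ability, test1, test2))

people.sort(key=lambda p: p[1])  # rank everyone by the first test
tenth = N_PEOPLE // 10
bottom, top = people[:tenth], people[-tenth:]

def summarize(group):
    """Mean (ability, test1, test2) for a group of people."""
    return tuple(statistics.fmean(p[i] for p in group) for i in range(3))

top_ability, top_test1, top_test2 = summarize(top)
bottom_ability, bottom_test1, bottom_test2 = summarize(bottom)
print(f"top 10%:    test 1 {top_test1:.1f}, ability {top_ability:.1f}, test 2 {top_test2:.1f}")
print(f"bottom 10%: test 1 {bottom_test1:.1f}, ability {bottom_ability:.1f}, test 2 {bottom_test2:.1f}")
```

In this setup the top group's first-test average overstates its average ability, and its second-test average falls back toward 100 even though no one's ability changed. The bottom group moves up in the same way.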

"Useful data analysis requires good data. [...] Good data scientists also consider the reliability of their data. [...] If the data tell you something crazy, there’s a good chance you would be crazy to believe the data." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)
