08 April 2006

John H Johnson - Collected Quotes

"A correlation is simply a bivariate relationship - a fancy way of saying that there is a relationship between two ('bi') variables ('variate'). And a bivariate relationship doesn’t prove that one thing caused the other. Think of it this way: you can observe that two things appear to be related statistically, but that doesn’t tell you the answer to any of the questions you might really care about - why is there a relationship and what does it mean to us as a consumer of data?" (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"A good chart can tell a story about the data, helping you understand relationships among data so you can make better decisions. The wrong chart can make a royal mess out of even the best data set." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Although some people use them interchangeably, probability and odds are not the same and people often misuse the terms. Probability is the likelihood that an outcome will occur. The odds of something happening, statistically speaking, is the ratio of favorable outcomes to unfavorable outcomes." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"[…] average isn’t something that should be considered in isolation. Your average is only as good as the data that supports it. If your sample isn’t representative of the full population, if you cherry- picked the data, or if there are other issues with your data, your average may be misleading." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Big data is sexy. It makes the headlines. […] But, as you’ve seen already, it’s the little data - the small bits and bytes of data that you’re bombarded with in your everyday life - that often has a huge effect on your health, your wallet, your job, your relationships, and so much more, every single day. From food labels to weather forecasts, your bank account to your doctor’s office, everydata is all around you." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Confirmation bias can affect nearly every aspect of the way you look at data, from sampling and observation to forecasting - so it’s something  to keep in mind anytime you’re interpreting data. When it comes to correlation versus causation, confirmation bias is one reason that some people ignore omitted variables - because they’re making the jump from correlation to causation based on preconceptions, not the actual evidence." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Essentially, magnitude is the size of the effect. It’s a way to determine if the results are meaningful. Without magnitude, it’s hard to get a sense of how much something matters. […] the magnitude of an effect can change, depending on the relationship." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"First, you need to think about whether the universe of data that is being studied or collected is representative of the underlying population. […] Second, you need to consider what you are analyzing in the data that has been collected - are you analyzing all of the data, or only part of it? […] You always have to ask - can you accurately extend your findings from the sample to the general population? That’s called external validity - when you can extend the results from your sample to draw meaningful conclusions about the full population." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Forecasting is difficult because we don’t know everything about how the world works. There are unforeseen events. Unknown processes. Random occurrences. People are unpredictable, and things don’t always stay the same. The data you’re studying can change - as can your understanding of the underlying process." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Having a large sample size doesn’t guarantee better results if it’s the wrong large sample." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"If the underlying data isn’t sampled accurately, it’s like building a house on a foundation that’s missing a few chunks of concrete. Maybe it won’t matter. But if the missing concrete is in the wrong spot - or if there is too much concrete missing - the whole house can come falling down." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"If you’re looking at an average, you are - by definition - studying a specific sample set. If you’re comparing averages, and those averages come from different sample sets, the differences in the sample sets may well be manifested in the averages. Remember, an average is only as good as the underlying data." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"If your conclusions change dramatically by excluding a data point, then that data point is a strong candidate to be an outlier. In a good statistical model, you would expect that you can drop a data point without seeing a substantive difference in the results. It’s something to think about when looking for outliers." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"In the real world, statistical issues rarely exist in isolation. You’re going to come across cases where there’s more than one problem with the data. For example, just because you identify some sampling errors doesn’t mean there aren’t also issues with cherry picking and correlations and averages and forecasts - or simply more sampling issues, for that matter. Some cases may have no statistical issues, some may have dozens. But you need to keep your eyes open in order to spot them all." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Just as with aggregated data, an average is a summary statistic that can tell you something about the data - but it is only one metric, and oftentimes a deceiving one at that. By taking all of the data and boiling it down to one value, an average (and other summary statistics) may imply that all of the underlying data is the same, even when it’s not." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Keep in mind that a weighted average may be different than a simple (non- weighted) average because a weighted average - by definition - counts certain data points more heavily. When you’re thinking about an average, try to determine if it’s a simple average or a weighted average. If it’s weighted, ask yourself how it’s being weighted, and see which data points count more than others." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"[…] remember that, as with many statistical issues, sampling in and of itself is not a good or a bad thing. Sampling is a powerful tool that allows us to learn something, when looking at the full population is not feasible (or simply isn’t the preferred option). And you shouldn’t be misled to think that you always should use all the data. In fact, using a sample of data can be incredibly helpful." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Statistical significance is a concept used by scientists and researchers to set an objective standard that can be used to determine whether or not a particular relationship 'statistically' exists in the data. Scientists test for statistical significance to distinguish between whether an observed effect is present in the data (given a high degree of probability), or just due to chance. It is important to note that finding a statistically significant relationship tells us nothing about whether a relationship is a simple correlation or a causal one, and it also can’t tell us anything about whether some omitted factor is driving the result." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Statistical significance refers to the probability that something is true. It’s a measure of how probable it is that the effect we’re seeing is real (rather than due to chance occurrence), which is why it’s typically measured with a p-value. P, in this case, stands for probability. If you accept p-values as a measure of statistical significance, then the lower your p-value is, the less likely it is that the results you’re seeing are due to chance alone." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"This idea of looking for answers is related to confirmation bias, which is the tendency to interpret data in a way that reinforces your preconceptions. With confirmation bias, you aren’t just looking for an answer - you’re looking for a specific answer." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"The more uncertainty there is in your sample, the more uncertainty there will be in your forecast. A prediction is only as good as the information that goes into it, and in statistics, we call the basis for our forecasts a model. The model represents all the inputs - the factors you determine will predict the future outcomes, the underlying sample data you rely upon, and the relationship you apply mathematically. In other words, the model captures how you think various factors relate to one another." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"The process of making statistical conclusions about the data is called drawing an inference. In any statistical analysis, if you’re going to draw an inference, the goal is to make sure you have the right data to answer the question you are asking." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"The strength of an average is that it takes all the values in your data set and simplifies them down to a single number. This strength, however, is also the great danger of an average. If every data point is exactly the same (picture a row of identical bricks) then an average may, in fact, accurately reflect something about each one. But if your population isn’t similar along many key dimensions - and many data sets aren’t - then the average will likely obscure data points that are above or below the average, or parts of the data set that look different from the average. […] Another way that averages can mislead is that they typically only capture one aspect of the data." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"The tricky part is that there aren’t really any hard- and- fast rules when it comes to identifying outliers. Some economists say an outlier is anything that’s a certain distance away from the mean, but in practice it’s fairly subjective and open to interpretation. That’s why statisticians spend so much time looking at data on a case-by-case basis to determine what is - and isn’t - an outlier." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Using a sample to estimate results in the full population is common in data analysis. But you have to be careful, because even small mistakes can quickly become big ones, given that each observation represents many others. There are also many factors you need to consider if you want to make sure your inferences are accurate." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)
