17 November 2016

Gary Smith - Collected Quotes

"A computer makes calculations quickly and correctly, but doesn’t ask if the calculations are meaningful or sensible. A computer just does what it is told." (Gary Smith, "Standard Deviations", 2014)

"A study that leaves out data is waving a big red flag. A decision to include or exclude data sometimes makes all the difference in the world. This decision should be based on the relevance and quality of the data, not on whether the data support or undermine a conclusion that is expected or desired." (Gary Smith, "Standard Deviations", 2014)

"Another way to secure statistical significance is to use the data to discover a theory. Statistical tests assume that the researcher starts with a theory, collects data to test the theory, and reports the results - whether statistically significant or not. Many people work in the other direction, scrutinizing the data until they find a pattern and then making up a theory that fits the pattern." (Gary Smith, "Standard Deviations", 2014)

"Comparisons are the lifeblood of empirical studies. We can’t determine if a medicine, treatment, policy, or strategy is effective unless we compare it to some alternative. But watch out for superficial comparisons: comparisons of percentage changes in big numbers and small numbers, comparisons of things that have nothing in common except that they increase over time, comparisons of irrelevant data. All of these are like comparing apples to prunes." (Gary Smith, "Standard Deviations", 2014)

"Data clusters are everywhere, even in random data. Someone who looks for an explanation will inevitably find one, but a theory that fits a data cluster is not persuasive evidence. The found explanation needs to make sense and it needs to be tested with uncontaminated data." (Gary Smith, "Standard Deviations", 2014)

"Data without theory can fuel a speculative stock market bubble or create the illusion of a bubble where there is none. How do we tell the difference between a real bubble and a false alarm? You know the answer: we need a theory. Data are not enough. […] Data without theory is alluring, but misleading." (Gary Smith, "Standard Deviations", 2014)

"Don’t just do the calculations. Use common sense to see whether you are answering the correct question, the assumptions are reasonable, and the results are plausible. If a statistical argument doesn’t make sense, think about it carefully - you may discover that the argument is nonsense." (Gary Smith, "Standard Deviations", 2014)

"Graphs can help us interpret data and draw inferences. They can help us see tendencies, patterns, trends, and relationships. A picture can be worth not only a thousand words, but a thousand numbers. However, a graph is essentially descriptive - a picture meant to tell a story. As with any story, bumblers may mangle the punch line and the dishonest may lie." (Gary Smith, "Standard Deviations", 2014)

"Graphs should not be mere decoration, to amuse the easily bored. A useful graph displays data accurately and coherently, and helps us understand the data. Chartjunk, in contrast, distracts, confuses, and annoys. Chartjunk may be well-intentioned, but it is misguided. It may also be a deliberate attempt to mystify." (Gary Smith, "Standard Deviations", 2014)

"How can we tell the difference between a good theory and quackery? There are two effective antidotes: common sense and fresh data. If it is a ridiculous theory, we shouldn’t be persuaded by anything less than overwhelming evidence, and even then be skeptical. Extraordinary claims require extraordinary evidence. Unfortunately, common sense is an uncommon commodity these days, and many silly theories have been seriously promoted by honest researchers." (Gary Smith, "Standard Deviations", 2014)

"If somebody ransacks data to find a pattern, we still need a theory that makes sense. On the other hand, a theory is just a theory until it is tested with persuasive data." (Gary Smith, "Standard Deviations", 2014)

"[…] many gamblers believe in the fallacious law of averages because they are eager to find a profitable pattern in the chaos created by random chance." (Gary Smith, "Standard Deviations", 2014)

"Numbers are not inherently tedious. They can be illuminating, fascinating, even entertaining. The trouble starts when we decide that it is more important for a graph to be artistic than informative." (Gary Smith, "Standard Deviations", 2014)

"Provocative assertions are provocative precisely because they are counterintuitive - which is a very good reason for skepticism." (Gary Smith, "Standard Deviations", 2014)

"Remember that even random coin flips can yield striking, even stunning, patterns that mean nothing at all. When someone shows you a pattern, no matter how impressive the person’s credentials, consider the possibility that the pattern is just a coincidence. Ask why, not what. No matter what the pattern, the question is: Why should we expect to find this pattern?" (Gary Smith, "Standard Deviations", 2014)

"Self-selection bias occurs when people choose to be in the data - for example, when people choose to go to college, marry, or have children. […] Self-selection bias is pervasive in 'observational data', where we collect data by observing what people do. Because these people chose to do what they are doing, their choices may reflect who they are. This self-selection bias could be avoided with a controlled experiment in which people are randomly assigned to groups and told what to do." (Gary Smith, "Standard Deviations", 2014)

"The omission of zero magnifies the ups and downs in the data, allowing us to detect changes that might otherwise be ambiguous. However, once zero has been omitted, the graph is no longer an accurate guide to the magnitude of the changes. Instead, we need to look at the actual numbers." (Gary Smith, "Standard Deviations", 2014)

"These practices - selective reporting and data pillaging - are known as data grubbing. The discovery of statistical significance by data grubbing shows little other than the researcher’s endurance. We cannot tell whether a data grubbing marathon demonstrates the validity of a useful theory or the perseverance of a determined researcher until independent tests confirm or refute the finding. But more often than not, the tests stop there. After all, you won’t become a star by confirming other people’s research, so why not spend your time discovering new theories? The data-grubbed theory consequently sits out there, untested and unchallenged." (Gary Smith, "Standard Deviations", 2014)

"We are genetically predisposed to look for patterns and to believe that the patterns we observe are meaningful. […] Don’t be fooled into thinking that a pattern is proof. We need a logical, persuasive explanation and we need to test the explanation with fresh data." (Gary Smith, "Standard Deviations", 2014)

"We are hardwired to make sense of the world around us - to notice patterns and invent theories to explain these patterns. We underestimate how easily patterns can be created by inexplicable random events - by good luck and bad luck." (Gary Smith, "Standard Deviations", 2014)

"We are seduced by patterns and we want explanations for these patterns. When we see a string of successes, we think that a hot hand has made success more likely. If we see a string of failures, we think a cold hand has made failure more likely. It is easy to dismiss such theories when they involve coin flips, but it is not so easy with humans. We surely have emotions and ailments that can cause our abilities to go up and down. The question is whether these fluctuations are important or trivial." (Gary Smith, "Standard Deviations", 2014)

"We naturally draw conclusions from what we see […]. We should also think about what we do not see […]. The unseen data may be just as important, or even more important, than the seen data. To avoid survivor bias, start in the past and look forward." (Gary Smith, "Standard Deviations", 2014)

"We encounter regression in many contexts - pretty much whenever we see an imperfect measure of what we are trying to measure. Standardized tests are obviously an imperfect measure of ability." (Gary Smith, "Standard Deviations", 2014)

"With fast computers and plentiful data, finding statistical significance is trivial. If you look hard enough, it can even be found in tables of random numbers." (Gary Smith, "Standard Deviations", 2014)

06 November 2016

SQL Server Administration: System.OutOfMemoryException in SQL Server Management Studio and other 32-bit Drawbacks

    I was playing this week with a few datasets downloaded from the web on various topics, trying to torture the data until they confess something. A few of the datasets were prepared for loading into a MySQL database as individual INSERT INTO statements, and contained between 100000 and a few million records. While looking at the big but slim datasets in SSMS (SQL Server Management Studio) and reconciling the differences between MySQL and SQL Server, I got the System.OutOfMemoryException exception several times, with SSMS crashing once or twice. That would be understandable given the number of records, though I was surprised to get the same error message while executing the INSERT INTO statements for one of the smallest datasets, which had about 300000 records:

    "An error occurred while executing batch. Error message is: Exception of type 'System.OutOfMemoryException' was thrown"

    Kb 2874903 sheds some light on the topic: SSMS is still a 32-bit process and thus limited to 2 GB of memory. The Kb offers three methods to avoid this issue. The first two, outputting the query results to text or to a file, didn't work. The third method, based on the sqlcmd utility, worked smoothly with a syntax like the one below:
sqlcmd -i "<file_name.sql>" -d "<database name>"
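
When several script files have to be loaded, the call above could be wrapped in a small helper. The following is only a minimal Python sketch, not something taken from the Kb: the function names, the file name and the database name are hypothetical, and it assumes that sqlcmd is available on the PATH. Besides -i and -d it adds the -b switch (stop and return an error code when a batch fails), and optionally -S (server) and -o (output file), none of which were used in the original command.

```python
import subprocess
from typing import List, Optional


def build_sqlcmd_args(script_path: str, database: str,
                      server: Optional[str] = None,
                      output_path: Optional[str] = None) -> List[str]:
    """Build the argument list for running a .sql script with sqlcmd."""
    # -i: input script, -d: target database, -b: abort and report on error
    args = ["sqlcmd", "-i", script_path, "-d", database, "-b"]
    if server is not None:
        args += ["-S", server]        # connect to a specific server/instance
    if output_path is not None:
        args += ["-o", output_path]   # write the messages to a file
    return args


def run_script(script_path: str, database: str, **options) -> int:
    """Run the script via sqlcmd and return its exit code."""
    # Passing a list (no shell) avoids quoting problems with spaces in paths.
    completed = subprocess.run(build_sqlcmd_args(script_path, database, **options))
    return completed.returncode


if __name__ == "__main__":
    # Hypothetical file and database names, shown without executing anything.
    print(" ".join(build_sqlcmd_args("load_dataset.sql", "Test")))
```

Because the script is read by sqlcmd itself, SSMS and its 32-bit memory limit are not involved; a non-zero return code from run_script would indicate that sqlcmd reported a failure.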

    So it doesn’t matter that you have a supercomputer and that working with big datasets has become a necessity nowadays; this limitation can make data loading just a little bit more complicated. On one side, it’s true that when dealing with such datasets it’s probably recommended to use sqlcmd directly to execute the scripts. On the other side, independently of this type of problem, and even if understandable given the need to keep backward compatibility with 32-bit platforms/solutions, it’s hard to digest the fact that Microsoft keeps some of its products 32-bit based when SQL Server is targeting 64-bit platforms. One has the same problem when using BIDS (Business Intelligence Development Studio) to develop SSRS, SSIS or SSAS solutions under 32-bit and then maybe having to deploy the code as 64-bit (e.g. SQL Server Agent). From my point of view, most of the issues I had occurred when dealing with proprietary drivers like the ones for Oracle or even for MS Office. In addition, in SSIS there could be features that are only available in 32-bit versions, or that have limitations on 64-bit computers (see [5]). It seems that SQL Server Data Tools (SSDT) will have similar drawbacks as well…

   Anyway, the sqlcmd utility saved the day with a minimum of overhead. Unfortunately, it’s not always that easy to solve the compatibility issues between 32-bit and 64-bit software and platforms.

Update 20.06.2017:
   One can synchronize the runtime version between BIDS and SQL Server Agent pretty easily. In BIDS, under "Configuration Properties/Debug Option" at project level, there is the “Run64BitRuntime” property. Set to false, it will run your package on the 32-bit runtime. In a SQL Server Agent job, there is the “Use 32 bit runtime” checkbox under “Execution options” at step level. Checking this checkbox will run your package on the 32-bit runtime.

   A hint that the two values might be out of sync is the following error message raised when running the package:
"Attempt to load Oracle client libraries threw BadImageFormatException. This problem will occur when running in 64 bit mode with the 32 bit Oracle client components installed."

References:
[1] Microsoft Support (2013) Kb 2874903: "System.OutOfMemoryException" exception when you execute a query in SQL Server Management Studio https://support.microsoft.com/en-us/kb/2874903
[2] MSDN (2016) SQL Server 2016: sqlcmd Utility https://msdn.microsoft.com/en-us/library/ms162773.aspx
[3] MSDN (2016) SQL Server 2016: Use the sqlcmd Utility https://msdn.microsoft.com/en-us/library/ms180944.aspx
[4] MSDN (2012) Introducing Business Intelligence Development Studio https://msdn.microsoft.com/en-us/library/ms173767.aspx
[5] SQL Server 2008 R2: 64 bit Considerations for Integration Services https://technet.microsoft.com/en-us/library/ms141766(v=sql.105).aspx

