25 October 2018

🔭Data Science: Conclusions (Just the Quotes)

"Before anything can be reasoned upon to a conclusion, certain facts, principles, or data, to reason from, must be established, admitted, or denied." (Thomas Paine, "Rights of Man", 1791) 

"In order to supply the defects of experience, we will have recourse to the probable conjectures of analogy, conclusions which we will bequeath to our posterity to be ascertained by new observations, which, if we augur rightly, will serve to establish our theory and to carry it gradually nearer to absolute certainty." (Johann H Lambert, "The System of the World", 1800)

"Such is the tendency of the human mind to speculation, that on the least idea of an analogy between a few phenomena, it leaps forward, as it were, to a cause or law, to the temporary neglect of all the rest; so that, in fact, almost all our principal inductions must be regarded as a series of ascents and descents, and of conclusions from a few cases, verified by trial on many." (Sir John Herschel, "A Preliminary Discourse on the Study of Natural Philosophy" , 1830)

"Just as data gathered by an incompetent observer are worthless - or by a biased observer, unless the bias can be measured and eliminated from the result - so also conclusions obtained from even the best data by one unacquainted with the principles of statistics must be of doubtful value." (William F White, "A Scrap-Book of Elementary Mathematics: Notes, Recreations, Essays", 1908)

"Ordinarily, facts do not speak for themselves. When they do speak for themselves, the wrong conclusions are often drawn from them. Unless the facts are presented in a clear and interesting manner, they are about as effective as a phonograph record with the phonograph missing." (Willard C Brinton, "Graphic Methods for Presenting Facts", 1919)

"The preliminary examination of most data is facilitated by the use of diagrams. Diagrams prove nothing, but bring outstanding features readily to the eye; they are therefore no substitutes for such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in explaining the conclusions founded upon them." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925) 

"Observed facts must be built up, woven together, ordered, arranged, systematized into conclusions and theories by reflection and reason, if they are to have full bearing on life and the universe. Knowledge is the accumulation of facts. Wisdom is the establishment of relations. And just because the latter process is delicate and perilous, it is all the more delightful." (Gamaliel Bradford, "Darwin", 1926)

"The statistician’s job is to draw general conclusions from fragmentary data. Too often the data supplied to him for analysis are not only fragmentary but positively incoherent, so that he can do next to nothing with them. Even the most kindly statistician swears heartily under his breath whenever this happens". (Michael J Moroney, "Facts from Figures", 1927)

"All statistical analysis in business must aim at the control of action. The possible conclusions are: 1. Certain action must be taken. 2. No action is required. 3. Certain tendencies must be watched. 4. The analysis is not significant and either (a) certain further facts are required, or (b) there are no indications that further facts should be obtained." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1938)

"Starting from statistical observations, it is possible to arrive at conclusions which not less reliable or useful than those obtained in any other exact science. It is only necessary to apply a clear and precise concept of probability to such observations. " (Richard von Mises, "Probability, Statistics, and Truth", 1939)

"The characteristic which distinguishes the present-day professional statistician, is his interest and skill in the measurement of the fallibility of conclusions." (George W Snedecor, "On a Unique Feature of Statistics", [address] 1948)

"The enthusiastic use of statistics to prove one side of a case is not open to criticism providing the work is honestly and accurately done, and providing the conclusions are not broader than indicated by the data. This type of work must not be confused with the unfair and dishonest use of both accurate and inaccurate data, which too commonly occurs in business. Dishonest statistical work usually takes the form of: (1) deliberate misinterpretation of data; (2) intentional making of overestimates or underestimates; and (3) biasing results by using partial data, making biased surveys, or using wrong statistical methods." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1951)

"Another thing to watch out for is a conclusion in which a correlation has been inferred to continue beyond the data with which it has been demonstrated." (Darell Huff, "How to Lie with Statistics", 1954)

"The statistics themselves prove nothing; nor are they at any time a substitute for logical thinking. There are […] many simple but not always obvious snags in the data to contend with. Variations in even the simplest of figures may conceal a compound of influences which have to be taken into account before any conclusions are drawn from the data." (Alfred R Ilersic, "Statistics", 1959)

"Predictions, prophecies, and perhaps even guidance – those who suggested this title to me must have hoped for such-even though occasional indulgences in such actions by statisticians has undoubtedly contributed to the characterization of a statistician as a man who draws straight lines from insufficient data to foregone conclusions!" (John W Tukey, "Where do We Go From Here?", Journal of the American Statistical Association, Vol. 55, No. 289, 1960)

"Model-making, the imaginative and logical steps which precede the experiment, may be judged the most valuable part of scientific method because skill and insight in these matters are rare. Without them we do not know what experiment to do. But it is the experiment which provides the raw material for scientific theory. Scientific theory cannot be built directly from the conclusions of conceptual models." (Herbert G Andrewartha," Introduction to the Study of Animal Population", 1961)

"Almost all efforts at data analysis seek, at some point, to generalize the results and extend the reach of the conclusions beyond a particular set of data. The inferential leap may be from past experiences to future ones, from a sample of a population to the whole population, or from a narrow range of a variable to a wider range. The real difficulty is in deciding when the extrapolation beyond the range of the variables is warranted and when it is merely naive. As usual, it is largely a matter of substantive judgment - or, as it is sometimes more delicately put, a matter of 'a priori nonstatistical considerations'." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"A mature science, with respect to the matter of errors in variables, is not one that measures its variables without error, for this is impossible. It is, rather, a science which properly manages its errors, controlling their magnitudes and correctly calculating their implications for substantive conclusions." (Otis D Duncan, "Introduction to Structural Equation Models", 1975)

"Just like the spoken or written word, statistics and graphs can lie. They can lie by not telling the full story. They can lead to wrong conclusions by omitting some of the important facts. [...] Always look at statistics with a critical eye, and you will not be the victim of misleading information." (Dyno Lowenstein, "Graphs", 1976)

"Crude measurement usually yields misleading, even erroneous conclusions no matter how sophisticated a technique is used." (Henry T Reynolds, "Analysis of Nominal Data", 1977)

"The word ‘induction’ has two essentially different meanings. Scientific induction is a process by which scientists make observations of particular cases, such as noticing that some crows are black, then leap to the universal conclusion that all crows are black. The conclusion is never certain. There is always the possibility that at least one unobserved crow is not black." (Martin Gardner, "Aha! Insight", 1978)

"Being experimental, however, doesn't necessarily make a scientific study entirely credible. One weakness of experimental work is that it can be out of touch with reality when its controls are so rigid that conclusions are valid only in the experimental situation and don't carryover into the real world." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"In everyday life, 'estimation' means a rough and imprecise procedure leading to a rough and imprecise result. You 'estimate' when you cannot measure exactly. In statistics, on the other hand, 'estimation' is a technical term. It means a precise and accurate procedure, leading to a result which may be imprecise, but where at least the extent of the imprecision is known. It has nothing to do with approximation. You have some data, from which you want to draw conclusions and produce a 'best' value for some particular numerical quantity (or perhaps for several quantities), and you probably also want to know how reliable this value is, i.e. what the error is on your estimate." (Roger J Barlow, "Statistics: A guide to the use of statistical methods in the physical sciences", 1989)

"Statistical models for data are never true. The question whether a model is true is irrelevant. A more appropriate question is whether we obtain the correct scientific conclusion if we pretend that the process under study behaves according to a particular statistical model." (Scott Zeger, "Statistical reasoning in epidemiology", American Journal of Epidemiology, 1991)

"When looking at the end result of any statistical analysis, one must be very cautious not to over interpret the data. Care must be taken to know the size of the sample, and to be certain the method for gathering information is consistent with other samples gathered. […] No one should ever base conclusions without knowing the size of the sample and how random a sample it was. But all too often such data is not mentioned when the statistics are given - perhaps it is overlooked or even intentionally omitted." (Theoni Pappas, "More Joy of Mathematics: Exploring mathematical insights & concepts", 1991)

"Nature behaves in ways that look mathematical, but nature is not the same as mathematics. Every mathematical model makes simplifying assumptions; its conclusions are only as valid as those assumptions. The assumption of perfect symmetry is excellent as a technique for deducing the conditions under which symmetry-breaking is going to occur, the general form of the result, and the range of possible behaviour. To deduce exactly which effect is selected from this range in a practical situation, we have to know which imperfections are present." (Ian Stewart & Martin Golubitsky, "Fearful Symmetry", 1992)

"Visualization is an approach to data analysis that stresses a penetrating look at the structure of data. No other approach conveys as much information. […] Conclusions spring from data when this information is combined with the prior knowledge of the subject under investigation." (William S Cleveland, "Visualizing Data", 1993)

"Visualization is an effective framework for drawing inferences from data because its revelation of the structure of data can be readily combined with prior knowledge to draw conclusions. By contrast, because of the formalism of probablistic methods, it is typically impossible to incorporate into them the full body of prior information." (William S Cleveland, "Visualizing Data", 1993)

"The science of statistics may be described as exploring, analyzing and summarizing data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions." (Fergus Daly et al, "Elements of Statistics", 1995)

"'Garbage in, garbage out' is a sound warning for those in the computer field; it is every bit as sound in the use of statistics. Even if the “garbage” which comes out leads to a correct conclusion, this conclusion is still tainted, as it cannot be supported by logical reasoning. Therefore, it is a misuse of statistics. But obtaining a correct conclusion from faulty data is the exception, not the rule. Bad basic data (the 'garbage in') almost always leads to incorrect conclusions (the 'garbage out'). Unfortunately, incorrect conclusions can lead to bad policy or harmful actions." (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"Information needs representation. The idea that it is possible to communicate information in a 'pure' form is fiction. Successful risk communication requires intuitively clear representations. Playing with representations can help us not only to understand numbers (describe phenomena) but also to draw conclusions from numbers (make inferences). There is no single best representation, because what is needed always depends on the minds that are doing the communicating." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"Nonetheless, the basic principles regarding correlations between variables are not that difficult to understand. We must look for patterns that reveal potential relationships and for evidence that variables are actually related. But when we do spot those relationships, we should not jump to conclusions about causality. Instead, we need to weigh the strength of the relationship and the plausibility of our theory, and we must always try to discount the possibility of spuriousness." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Data, reason, and calculation can only produce conclusions; they do not inspire action. Good numbers are not the result of managing numbers." (Ronald J Baker, "Measure what Matters to Customers: Using Key Predictive Indicators", 2006)

"It is in the nature of human beings to bend information in the direction of desired conclusions." (John Naisbitt, "Mind Set!: Reset Your Thinking and See the Future", 2006) 

"Perception requires imagination because the data people encounter in their lives are never complete and always equivocal. [...] We also use our imagination and take shortcuts to fill gaps in patterns of nonvisual data. As with visual input, we draw conclusions and make judgments based on uncertain and incomplete information, and we conclude, when we are done analyzing the patterns, that out picture is clear and accurate. But is it?" (Leonard Mlodinow, "The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"Traditional statistics is strong in devising ways of describing data and inferring distributional parameters from sample. Causal inference requires two additional ingredients: a science-friendly language for articulating causal knowledge, and a mathematical machinery for processing that knowledge, combining it with data and drawing new causal conclusions about a phenomenon." (Judea Pearl, "Causal inference in statistics: An overview", Statistics Surveys 3, 2009)

"Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: 'there’s a lot of data, what can you make from it?'" (Mike Loukides, "What Is Data Science?", 2011)

"Any factor you don’t account for can become a confounding factor. A confounding factor is any variable that confuses the conclusions of your study, or makes them ambiguous. [...] Confounding factors can really screw up an otherwise perfectly good statistical analysis." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"Any time you collect data, you have uncertainty to deal with. This uncertainty comes from two places: (1) inherent variation in the values a random variable can take on and (2) the fact that for most studies, you can’t capture the entire population and so you must rely on a sample to make your conclusions." (Kristin H Jarman, "The Art of Data Analysis: How to answer almost any question using basic statistics", 2013)

"A study that leaves out data is waving a big red flag. A decision to include or exclude data sometimes makes all the difference in the world. This decision should be based on the relevance and quality of the data, not on whether the data support or undermine a conclusion that is expected or desired." (Gary Smith, "Standard Deviations", 2014)

"We naturally draw conclusions from what we see […]. We should also think about what we do not see […]. The unseen data may be just as important, or even more important, than the seen data. To avoid survivor bias, start in the past and look forward." (Gary Smith, "Standard Deviations", 2014)

"If your conclusions change dramatically by excluding a data point, then that data point is a strong candidate to be an outlier. In a good statistical model, you would expect that you can drop a data point without seeing a substantive difference in the results. It’s something to think about when looking for outliers." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"GIGO is a famous saying coined by early computer scientists: garbage in, garbage out. At the time, people would blindly put their trust into anything a computer output indicated because the output had the illusion of precision and certainty. If a statistic is composed of a series of poorly defined measures, guesses, misunderstandings, oversimplifications, mismeasurements, or flawed estimates, the resulting conclusion will be flawed." (Daniel J Levitin, "Weaponized Lies", 2017)

"In terms of characteristics, a data scientist has an inquisitive mind and is prepared to explore and ask questions, examine assumptions and analyse processes, test hypotheses and try out solutions and, based on evidence, communicate informed conclusions, recommendations and caveats to stakeholders and decision makers." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"Just because there’s a number on it, it doesn’t mean that the number was arrived at properly. […] There are a host of errors and biases that can enter into the collection process, and these can lead millions of people to draw the wrong conclusions. Although most of us won’t ever participate in the collection process, thinking about it, critically, is easy to learn and within the reach of all of us." (Daniel J Levitin, "Weaponized Lies", 2017)

"But [bootstrap-based] simulations are clumsy and time-consuming, especially with large data sets, and in more complex circumstances it is not straightforward to work out what should be simulated. In contrast, formulae derived from probability theory provide both insight and convenience, and always lead to the same answer since they don’t depend on a particular simulation. But the flip side is that this theory relies on assumptions, and we should be careful not to be deluded by the impressive algebra into accepting unjustified conclusions." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Good data scientists know that, because of inevitable ups and downs in the data for almost any interesting question, they shouldn’t draw conclusions from small samples, where flukes might look like evidence." (Gary Smith & Jay Cordes, "The 9 Pitfalls of Data Science", 2019)

"When we have all the data, it is straightforward to produce statistics that describe what has been measured. But when we want to use the data to draw broader conclusions about what is going on around us, then the quality of the data becomes paramount, and we need to be alert to the kind of systematic biases that can jeopardize the reliability of any claims." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"With the growing availability of massive data sets and user-friendly analysis software, it might be thought that there is less need for training in statistical methods. This would be naïve in the extreme. Far from freeing us from the need for statistical skills, bigger data and the rise in the number and complexity of scientific studies makes it even more difficult to draw appropriate conclusions. More data means that we need to be even more aware of what the evidence is actually worth." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Each decision about what data to gather and how to analyze them is akin to standing on a pathway as it forks left and right and deciding which way to go. What seems like a few simple choices can quickly multiply into a labyrinth of different possibilities. Make one combination of choices and you’ll reach one conclusion; make another, equally reasonable, and you might find a very different pattern in the data." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"If the data that go into the analysis are flawed, the specific technical details of the analysis don’t matter. One can obtain stupid results from bad data without any statistical trickery. And this is often how bullshit arguments are created, deliberately or otherwise. To catch this sort of bullshit, you don’t have to unpack the black box. All you have to do is think carefully about the data that went into the black box and the results that came out. Are the data unbiased, reasonable, and relevant to the problem at hand? Do the results pass basic plausibility checks? Do they support whatever conclusions are drawn?" (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Inference is to bring about a new thought, which in logic amounts to drawing a conclusion, and more generally involves using what we already know, and what we see or observe, to update prior beliefs. […] Inference is also a leap of sorts, deemed reasonable […] Inference is a basic cognitive act for intelligent minds. If a cognitive agent (a person, an AI system) is not intelligent, it will infer badly. But any system that infers at all must have some basic intelligence, because the very act of using what is known and what is observed to update beliefs is inescapably tied up with what we mean by intelligence. If an AI system is not inferring at all, it doesn’t really deserve to be called AI." (Erik J Larson, "The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do", 2021)

"Any time you run regression analysis on arbitrary real-world observational data, there’s a significant risk that there’s hidden confounding in your dataset and so causal conclusions from such analysis are likely to be (causally) biased." (Aleksander Molak, "Causal Inference and Discovery in Python", 2023)

23 October 2018

🔭Data Science: Simulations (Just the Quotes)

"The mathematical and computing techniques for making programmed decisions replace man but they do not generally simulate him." (Herbert A Simon, "Corporations 1985", 1960)

"The main object of cybernetics is to supply adaptive, hierarchical models, involving feedback and the like, to all aspects of our environment. Often such modelling implies simulation of a system where the simulation should achieve the object of copying both the method of achievement and the end result. Synthesis, as opposed to simulation, is concerned with achieving only the end result and is less concerned (or completely unconcerned) with the method by which the end result is achieved. In the case of behaviour, psychology is concerned with simulation, while cybernetics, although also interested in simulation, is primarily concerned with synthesis." (Frank H George, "Soviet Cybernetics, the militairy and Professor Lerner", New Scientist, 1973)

"Computer based simulation is now in wide spread use to analyse system models and evaluate theoretical solutions to observed problems. Since important decisions must rely on simulation, it is essential that its validity be tested, and that its advocates be able to describe the level of authentic representation which they achieved." (Richard Hamming, 1975)

"When a real situation involves chance we have to use probability mathematics to understand it quantitatively. Direct mathematical solutions sometimes exist […] but most real systems are too complicated for direct solutions. In these cases the computer, once taught to generate random numbers, can use simulation to get useful answers to otherwise impossible problems." (Robert Hooke, "How to Tell the Liars from the Statisticians", 1983)

"The real leverage in most management situations lies in understanding dynamic complexity, not detail complexity. […] Unfortunately, most 'systems analyses' focus on detail complexity not dynamic complexity. Simulations with thousands of variables and complex arrays of details can actually distract us from seeing patterns and major interrelationships. In fact, sadly, for most people 'systems thinking' means 'fighting complexity with complexity', devising increasingly 'complex' (we should really say 'detailed') solutions to increasingly 'complex' problems. In fact, this is the antithesis of real systems thinking." (Peter M Senge, "The Fifth Discipline: The Art and Practice of the Learning Organization", 1990)

"A model for simulating dynamic system behavior requires formal policy descriptions to specify how individual decisions are to be made. Flows of information are continuously converted into decisions and actions. No plea about the inadequacy of our understanding of the decision-making processes can excuse us from estimating decision-making criteria. To omit a decision point is to deny its presence - a mistake of far greater magnitude than any errors in our best estimate of the process." (Jay W Forrester, "Policies, decisions and information sources for modeling", 1994)

"A field of study that includes a methodology for constructing computer simulation models to achieve better under-standing of social and corporate systems. It draws on organizational studies, behavioral decision theory, and engineering to provide a theoretical and empirical base for structuring the relationships in complex systems." (Virginia Anderson & Lauren Johnson, "Systems Thinking Basics: From Concepts to Casual Loops", 1997)

"What it means for a mental model to be a structural analog is that it embodies a representation of the spatial and temporal relations among, and the causal structures connecting the events and entities depicted and whatever other information that is relevant to the problem-solving talks. […] The essential points are that a mental model can be nonlinguistic in form and the mental mechanisms are such that they can satisfy the model-building and simulative constraints necessary for the activity of mental modeling." (Nancy J Nersessian, "Model-based reasoning in conceptual change", 1999)

"A neural network is a particular kind of computer program, originally developed to try to mimic the way the human brain works. It is essentially a computer simulation of a complex circuit through which electric current flows." (Keith J Devlin & Gary Lorden, "The Numbers behind NUMB3RS: Solving crime with mathematics", 2007)

"[...] a model is a tool for taking decisions and any decision taken is the result of a process of reasoning that takes place within the limits of the human mind. So, models have eventually to be understood in such a way that at least some layer of the process of simulation is comprehensible by the human mind. Otherwise, we may find ourselves acting on the basis of models that we don’t understand, or no model at all.” (Ugo Bardi, “The Limits to Growth Revisited”, 2011)

"Not only the mathematical way of thinking, but also simulations assisted by mathematical methods, is quite effective in solving problems. The latter is utilized in various fields, including detection of causes of troubles, optimization of expected performances, and best possible adjustments of usage conditions. Conversely, without the aid of mathematical methods, our problem-solving effort will get stuck most probably [...]" (Shiro Hiruta, "Mathematics Contributing to Innovation of Management", [in "What Mathematics Can Do for You"] 2013)

"System dynamics [...] uses models and computer simulations to understand behavior of an entire system, and has been applied to the behavior of large and complex national issues. It portrays the relationships in systems as feedback loops, lags, and other descriptors to explain dynamics, that is, how a system behaves over time. Its quantitative methodology relies on what are called 'stock-and-flow diagrams' that reflect how levels of specific elements accumulate over time and the rate at which they change. Qualitative systems thinking constructs evolved from this quantitative discipline." (Karen L Higgins, "Economic Growth and Sustainability: Systems Thinking for a Complex World", 2015)

"Optimization is more than finding the best simulation results. It is itself a complex and evolving field that, subject to certain information constraints, allows data scientists, statisticians, engineers, and traders alike to perform reality checks on modeling results." (Chris Conlan, "Automated Trading with R: Quantitative Research and Platform Development", 2016)

"But [bootstrap-based] simulations are clumsy and time-consuming, especially with large data sets, and in more complex circumstances it is not straightforward to work out what should be simulated. In contrast, formulae derived from probability theory provide both insight and convenience, and always lead to the same answer since they don’t depend on a particular simulation. But the flip side is that this theory relies on assumptions, and we should be careful not to be deluded by the impressive algebra into accepting unjustified conclusions." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

22 October 2018

💠🛠️SQL Server Administration: Troubleshooting Root Blocking Sessions


Unless special monitoring tools like Redgate’s SQL Monitor are available, when it comes to troubleshooting blocking sessions one can use the sp_who2 and sp_whoisactive stored procedures or the further available DMVs, which can prove useful upon case. In the case one just needs to find the root session that blocks other sessions, as it happens in poorly designed and/or heavily used databases, one can follow the chain of blockings provided by the sp_who2, though it requires time. Therefore, based on sp_who2 logic I put together a script that with the help of common tables expressions walks through the hierarchical chain to identify the root blocking session(s):

 ;WITH CTERequests
 AS ( -- taking a snapshot of requests  
	 SELECT ER.session_id  
	 , ER.request_id
	 , ER.blocking_session_id  
	 , ER.database_id   	  
	 , ER.status      
	 , ER.sql_handle   
	 , ER.command 
	 , ER.start_time
	 , ER.total_elapsed_time/(1000*1000) total_elapsed_time
	 , ER.cpu_time/(1000*1000) cpu_time
	 , ER.wait_type 
	 , ER.wait_time/(1000*1000) wait_time
	 , ER.reads 
	 , ER.writes 
	 , ER.granted_query_memory/128.0 granted_query_memory_mb
	 FROM sys.dm_exec_requests ER with (nolock)
	 WHERE ER.session_id > 50user sessions 
 ),  
 CTE  
 AS ( -- building the tree  
	 SELECT ER.session_id  
	 , ER.request_id
	 , ER.blocking_session_id  
	 , ER.database_id 
	 , ER.status  
	 , ER.command   
	 , ER.sql_handle   
	 , Cast(NULL as varbinary(64)) blocking_sql_handle   
	 , 0 level  
	 , CAST(ER.session_id as nvarchar(max)) sessions_path 
	 , ER.start_time 
	 , ER.total_elapsed_time
	 , ER.cpu_time
	 , ER.wait_type 
	 , ER.wait_time
	 , ER.reads 
	 , ER.writes 
	 , ER.granted_query_memory_mb
	 FROM CTERequests ER
	 -- WHERE ER.blocking_session_id <>0 -- only blocked sessions 
	 UNION ALL  
	 SELECT ER.session_id  
	 , ER.request_id
	 , BR.blocking_session_id  
	 , ER.database_id  
	 , ER.status  
	 , ER.command   
	 , ER.sql_handle   
	 , BR.sql_handle blocking_sql_handle   
	 , ER.level + 1 level  
	 , ER.sessions_path + '/' + Cast(BR.session_id as nvarchar(10)) sessions_path 
	 , ER.start_time
	 , ER.total_elapsed_time
	 , ER.cpu_time
	 , ER.wait_type 
	 , ER.wait_time
	 , ER.reads 
	 , ER.writes 
	 , ER.granted_query_memory_mb
	 FROM CTERequests br  
		  JOIN CTE ER 
			ON BR.session_id = ER.blocking_session_id   
		   AND BR.session_id <> ER.session_id  
 )  
 -- the final query   
 SELECT CTE.session_id 
 , CTE.blocking_session_id   
 , CTE.request_id
 , DB_NAME(CTE.database_id) database_name  
 , CTE.status   
 , CTE.start_time
 , BES.last_request_end_time   
 , CTE.level   
 , CTE.sessions_path   
 , CTE.command   
 , CTE.sql_handle   
 , est.text sql_text  
 , CTE.blocking_sql_handle   
 , best.text blocking_sql_text 
 , CTE.total_elapsed_time
 , CTE.cpu_time
 , CTE.wait_type 
 , CTE.wait_time
 , CTE.reads 
 , CTE.writes 
 , CTE.granted_query_memory_mb
 , TSU.request_internal_objects_alloc_page_count/128.0 allocated_mb
 , TSU.request_internal_objects_dealloc_page_count/128.0 deallocated_mb 
 , ES.login_name   
 , ES.[host_name]  
 , BES.[host_name] blocking_host_name  
 FROM CTE 
      LEFT JOIN sys.dm_exec_sessions ES with (nolock)
        ON CTE.session_id = ES.session_id   
      LEFT JOIN sys.dm_exec_sessions BES with (nolock)
        ON CTE.blocking_session_id = BES.session_id        
      LEFT JOIN ( -- allocated space
	  SELECT session_id
	  , request_id
	  , SUM(internal_objects_alloc_page_count) AS request_internal_objects_alloc_page_count
	  , SUM(internal_objects_dealloc_page_count)AS request_internal_objects_dealloc_page_count 
	  FROM sys.dm_db_task_space_usage with (nolock)
	  WHERE session_id > 50 
	  GROUP BY session_id
	  , request_id
      ) TSU
        ON TSU.request_id = CTE.request_id
       AND TSU.session_id = CTE.session_id 
      OUTER APPLY sys.dm_exec_sql_text(CTE.sql_handle) as est  
      OUTER APPLY sys.dm_exec_sql_text(CTE.blocking_sql_handle) as best 
 WHERE CTE.level IN (SELECT MAX(B.level) -- longest path per session
                 FROM CTE B  
                 WHERE CTE.session_id = B.session_id  
                 GROUP by B.session_id)  
 ORDER BY CTE.session_id

Notes:
The script considers only the user sessions (session_id>50), however the constraint can be left out in order to get also the system sessions.
Over the years I discovered also 1-2 exceptions where the logic didn’t work as expected. Unfortunately, not having historical data made it difficult to find the cause.
Killing the root blocking session usually makes the other blockings disappear, however it’s not always the case.
There can multiple root blocking sessions as well.
For the meaning of the above attributes see the MSDN documentation: sys.dm_exec_requests, sys.dm_exec_sessions, sys.dm_db_task_space_usage, sys.dm_exec_sql_text.
Be careful when killing the sessions (see my other blog post)!

Happy coding!

20 October 2018

💫ERP Systems: AX 2009 vs. D365 FO - Product Prices with two Different Queries

   In Dynamics AX 2009 as well in Dynamics 365 for Finance and Operations (D365 FO) the Inventory, Purchase and Sales Prices are stored in InventTableModule table on separate rows, independently of the Products, stored in InventTable. Thus for each Product there are three rows that need to be joined. To get all the Prices one needs to write a query like this one:

SELECT ITM.DataAreaId 
, ITM.ItemId 
, ITM.ItemName 
, ILP.UnitId InventUnitId
, IPP.UnitId PurchUnitId
, ISP.UnitId SalesUnitId
, ILP.Price InventPrice
, IPP.Price PurchPrice
, ISP.Price SalesPrice
FROM dbo.InventTable ITM
     JOIN dbo.InventTableModule ILP
       ON ITM.ItemId = ILP.ItemId
      AND ITM.DATAAREAID = ILP.DATAAREAID 
      AND ILP.ModuleType = 0 -- Inventory
     JOIN dbo.InventTableModule IPP
       ON ITM.ItemId = IPP.ItemId
      AND ITM.DATAAREAID = IPP.DATAAREAID 
      AND IPP.ModuleType = 1 -- Purchasing
     JOIN dbo.InventTableModule ISP
       ON ITM.ItemId = ISP.ItemId
      AND ITM.DATAAREAID = ISP.DATAAREAID 
      AND ISP.ModuleType = 2 -- Sales 
WHERE ITM.DataAreaId = 'abc'

Looking at the query plan one can see that SQL Server uses three Merge Joins together with a Clustered Index Seek, and recommends using a Nonclustered Index to increase the performance:

image

   To avoid needing to perform three joins, one can rewrite the query with the help of a GROUP BY, thus reducing the number of Merge Joins from three to one:

SELECT ITD.DataAreaID 
, ITD.ItemId 
, ITM.ItemName  
, ITD.InventPrice
, ITD.InventPriceUnit
, ITD.InventUnitId
, ITD.PurchPrice
, ITD.PurchPriceUnit
, ITD.PurchUnitId
, ITD.SalesPrice
, ITD.SalesPriceUnit
, ITD.SalesUnitId
FROM dbo.InventTable ITM
     LEFT JOIN (—price details
     SELECT ITD.ITEMID
     , ITD.DATAAREAID 
     , Max(CASE ITD.ModuleType WHEN 0 THEN ITD.Price END) InventPrice
     , Max(CASE ITD.ModuleType WHEN 0 THEN ITD.PriceUnit END) InventPriceUnit
     , Max(CASE ITD.ModuleType WHEN 0 THEN ITD.UnitId END) InventUnitId
     , Max(CASE ITD.ModuleType WHEN 1 THEN ITD.Price END) PurchPrice
     , Max(CASE ITD.ModuleType WHEN 1 THEN ITD.PriceUnit END) PurchPriceUnit
     , Max(CASE ITD.ModuleType WHEN 1 THEN ITD.UnitId END) PurchUnitId
     , Max(CASE ITD.ModuleType WHEN 2 THEN ITD.Price END) SalesPrice
     , Max(CASE ITD.ModuleType WHEN 2 THEN ITD.PriceUnit END) SalesPriceUnit
     , Max(CASE ITD.ModuleType WHEN 2 THEN ITD.UnitId END) SalesUnitId
     FROM dbo.InventTableModule ITD
     GROUP BY ITD.ITEMID
     , ITD.DATAAREAID 
    ) ITD
       ON ITD.ITEMID = ITM.ITEMID
      AND ITD.DATAAREAID = ITM.DATAAREAID
WHERE ITD.DataAreaID = 'abc'
ORDER BY ITD.ItemId

And here’s the plan for it:

image

It seems that despite the overhead introduced by the GROUP BY, the second query performs better, at least when all or almost all records are performed.

In Dynamics AX there were seldom the occasions when I needed to write similar queries. Probably, most of the cases are related to the use of window functions (e.g. the first n Vendors or Vendor Prices). On the other side, a bug in previous versions of Oracle, which was limiting the number of JOINs that could be used, made me consider such queries also when was maybe not the case.  

What about you? Did you had the chance to write similar queries? What made you consider the second type of query?

Happy coding!

18 October 2018

💎🏭SQL Reloaded: String_Split Function

Today, as I was playing with a data model – although simplistic, the simplicity kicked back when I had to deal with fields in which several values were encoded within the same column. The “challenge” resided in the fact that the respective attributes were quintessential in analyzing and matching the data with other datasets. Therefore, was needed a mix between flexibility and performance. It was the point where the String_Split did its magic. Introduced with SQL Server 2016 and available only under compatibility level 130 and above, the function splits a character expression using specified separator.
Here’s a simplified example of the code I had to write:

-- cleaning up
-- DROP TABLE dbo.ItemList

-- test data 
SELECT A.*
INTO dbo.ItemList 
FROM (
VALUES (1, '1001:a:blue')
, (2, '1001:b:white')
, (3, '1002:a:blue')
, (4, '1002:b:white')
, (5, '1002:c:red')
, (6, '1003:b:white')
, (7, '1003:c:red')) A(Id, List)

-- checking the data
SELECT *
FROM dbo.ItemList 

-- prepared data
SELECT ITM.Id 
, ITM.List 
, DAT.ItemId
, DAT.Size
, DAT.Country
FROM dbo.ItemList ITM
   LEFT JOIN (-- transformed data 
 SELECT DAT.id
 , [1] AS ItemId
 , [2] AS Size
 , [3] AS Country
 FROM(
  SELECT ITM.id
  , TX.Value
  , ROW_NUMBER() OVER (PARTITION BY ITM.id ORDER BY ITM.id) Ranking
  FROM dbo.ItemList ITM
  CROSS APPLY STRING_SPLIT(ITM.List, ':') TX
  ) DAT
 PIVOT (MAX(DAT.Value) FOR DAT.Ranking IN ([1],[2],[3])) 
 DAT
  ) DAT
   ON ITM.Id = DAT.Id 

And, here’s the output:

image_thumb[2]

For those dealing with former versions of SQL Server the functionality provided by the String_Split can be implemented with the help of user-defined functions, either by using the old fashioned loops (see this), cursors (see this) or more modern common table expressions (see this). In fact, these implementations are similar to the String_Split function, the difference being made mainly by the performance.

Notes:
The queries work also in SQL databases in Microsoft Fabric (see file in GitHub repository). You might want to use another schema (e.g. Test), not to interfere with the existing code. 

Happy coding!

13 October 2018

🔭Data Science: Groups (Just the Quotes)

"The object of statistical science is to discover methods of condensing information concerning large groups of allied facts into brief and compendious expressions suitable for discussion. The possibility of doing this is based on the constancy and continuity with which objects of the same species are found to vary." (Sir Francis Galton, "Inquiries into Human Faculty and Its Development, Statistical Methods", 1883)

"Some of the common ways of producing a false statistical argument are to quote figures without their context, omitting the cautions as to their incompleteness, or to apply them to a group of phenomena quite different to that to which they in reality relate; to take these estimates referring to only part of a group as complete; to enumerate the events favorable to an argument, omitting the other side; and to argue hastily from effect to cause, this last error being the one most often fathered on to statistics. For all these elementary mistakes in logic, statistics is held responsible." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"Statistics may be defined as numerical statements of facts by means of which large aggregates are analyzed, the relations of individual units to their groups are ascertained, comparisons are made between groups, and continuous records are maintained for comparative purposes." (Melvin T Copeland. "Statistical Methods" [in: Harvard Business Studies, Vol. III, Ed. by Melvin T Copeland, 1917])

"'Correlation' is a term used to express the relation which exists between two series or groups of data where there is a causal connection. In order to have correlation it is not enough that the two sets of data should both increase or decrease simultaneously. For correlation it is necessary that one set of facts should have some definite causal dependence upon the other set [...]" (Willard C Brinton, "Graphic Methods for Presenting Facts", 1919) 

"The conception of statistics as the study of variation is the natural outcome of viewing the subject as the study of populations; for a population of individuals in all respects identical is completely described by a description of anyone individual, together with the number in the group. The populations which are the object of statistical study always display variations in one or more respects. To speak of statistics as the study of variation also serves to emphasise the contrast between the aims of modern statisticians and those of their predecessors." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"An average is a single value which is taken to represent a group of values. Such a representative value may be obtained in several ways, for there are several types of averages. […] Probably the most commonly used average is the arithmetic average, or arithmetic mean." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1938)

"The fact that index numbers attempt to measure changes of items gives rise to some knotty problems. The dispersion of a group of products increases with the passage of time, principally because some items have a long-run tendency to fall while others tend to rise. Basic changes in the demand is fundamentally responsible. The averages become less and less representative as the distance from the period increases." (Anna C Rogers, "Graphic Charts Handbook", 1961)

"Pencil and paper for construction of distributions, scatter diagrams, and run-charts to compare small groups and to detect trends are more efficient methods of estimation than statistical inference that depends on variances and standard errors, as the simple techniques preserve the information in the original data." (William E Deming, "On Probability as Basis for Action" American Statistician Vol. 29 (4), 1975)

"When the distributions of two or more groups of univariate data are skewed, it is common to have the spread increase monotonically with location. This behavior is monotone spread. Strictly speaking, monotone spread includes the case where the spread decreases monotonically with location, but such a decrease is much less common for raw data. Monotone spread, as with skewness, adds to the difficulty of data analysis. For example, it means that we cannot fit just location estimates to produce homogeneous residuals; we must fit spread estimates as well. Furthermore, the distributions cannot be compared by a number of standard methods of probabilistic inference that are based on an assumption of equal spreads; the standard t-test is one example. Fortunately, remedies for skewness can cure monotone spread as well." (William S Cleveland, "Visualizing Data", 1993)

"The central limit theorem […] states that regardless of the shape of the curve of the original population, if you repeatedly randomly sample a large segment of your group of interest and take the average result, the set of averages will follow a normal curve." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"Random events often come like the raisins in a box of cereal - in groups, streaks, and clusters. And although Fortune is fair in potentialities, she is not fair in outcomes." (Leonard Mlodinow, "The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"[...] statisticians are constantly looking out for missed nuances: a statistical average for all groups may well hide vital differences that exist between these groups. Ignoring group differences when they are present frequently portends inequitable treatment." (Kaiser Fung, "Numbers Rule the World", 2010)

"The issue of group differences is fundamental to statistical thinking. The heart of this matter concerns which groups should be aggregated and which shouldn’t." (Kaiser Fung, "Numbers Rule the World", 2010)

"Be careful not to confuse clustering and stratification. Even though both of these sampling strategies involve dividing the population into subgroups, both the way in which the subgroups are sampled and the optimal strategy for creating the subgroups are different. In stratified sampling, we sample from every stratum, whereas in cluster sampling, we include only selected whole clusters in the sample. Because of this difference, to increase the chance of obtaining a sample that is representative of the population, we want to create homogeneous groups for strata and heterogeneous (reflecting the variability in the population) groups for clusters." (Roxy Peck et al, "Introduction to Statistics and Data Analysis" 4th Ed., 2012)

"If the group is large enough, even very small differences can become statistically significant." (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

"Self-selection bias occurs when people choose to be in the data - for example, when people choose to go to college, marry, or have children. […] Self-selection bias is pervasive in 'observational data', where we collect data by observing what people do. Because these people chose to do what they are doing, their choices may reflect who they are. This self-selection bias could be avoided with a controlled experiment in which people are randomly assigned to groups and told what to do." (Gary Smith, "Standard Deviations", 2014)

"Often when people relate essentially the same variable in two different groups, or at two different times, they see this same phenomenon - the tendency of the response variable to be closer to the mean than the predicted value. Unfortunately, people try to interpret this by thinking that the performance of those far from the mean is deteriorating, but it’s just a mathematical fact about the correlation. So, today we try to be less judgmental about this phenomenon and we call it regression to the mean. We managed to get rid of the term 'mediocrity', but the name regression stuck as a name for the whole least squares fitting procedure - and that’s where we get the term regression line." (Richard D De Veaux et al, "Stats: Data and Models", 2016)

"Bias is error from incorrect assumptions built into the model, such as restricting an interpolating function to be linear instead of a higher-order curve. [...] Errors of bias produce underfit models. They do not fit the training data as tightly as possible, were they allowed the freedom to do so. In popular discourse, I associate the word 'bias' with prejudice, and the correspondence is fairly apt: an apriori assumption that one group is inferior to another will result in less accurate predictions than an unbiased one. Models that perform lousy on both training and testing data are underfit." (Steven S Skiena, "The Data Science Design Manual", 2017)

"To be any good, a sample has to be representative. A sample is representative if every person or thing in the group you’re studying has an equally likely chance of being chosen. If not, your sample is biased. […] The job of the statistician is to formulate an inventory of all those things that matter in order to obtain a representative sample. Researchers have to avoid the tendency to capture variables that are easy to identify or collect data on - sometimes the things that matter are not obvious or are difficult to measure." (Daniel J Levitin, "Weaponized Lies", 2017)

"If you study one group and assume that your results apply to other groups, this is extrapolation. If you think you are studying one group, but do not manage to obtain a representative sample of that group, this is a different problem. It is a problem so important in statistics that it has a special name: selection bias. Selection bias arises when the individuals that you sample for your study differ systematically from the population of individuals eligible for your study." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

06 October 2018

🔭Data Science: Users (Just the Quotes)

"Knowledge is a potential for a certain type of action, by which we mean that the action would occur if certain tests were run. For example, a library plus its user has knowledge if a certain type of response will be evoked under a given set of stipulations." (C West Churchman, "The Design of Inquiring Systems", 1971)

"To conceive of knowledge as a collection of information seems to rob the concept of all of its life. Knowledge resides in the user and not in the collection. It is how the user reacts to a collection of information that matters." (C West Churchman, "The Design of Inquiring Systems", 1971)

"Averages, ranges, and histograms all obscure the time-order for the data. If the time-order for the data shows some sort of definite pattern, then the obscuring of this pattern by the use of averages, ranges, or histograms can mislead the user. Since all data occur in time, virtually all data will have a time-order. In some cases this time-order is the essential context which must be preserved in the presentation." (Donald J Wheeler," Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Visualizations can be used to explore data, to confirm a hypothesis, or to manipulate a viewer. [...] In exploratory visualization the user does not necessarily know what he is looking for. This creates a dynamic scenario in which interaction is critical. [...] In a confirmatory visualization, the user has a hypothesis that needs to be tested. This scenario is more stable and predictable. System parameters are often predetermined." (Usama Fayyad et al, "Information Visualization in Data Mining and Knowledge Discovery", 2002)

"Dashboards and visualization are cognitive tools that improve your 'span of control' over a lot of business data. These tools help people visually identify trends, patterns and anomalies, reason about what they see and help guide them toward effective decisions. As such, these tools need to leverage people's visual capabilities. With the prevalence of scorecards, dashboards and other visualization tools now widely available for business users to review their data, the issue of visual information design is more important than ever." (Richard Brath & Michael Peters, "Dashboard Design: Why Design is Important," DM Direct, 2004)

"Data are raw facts and figures that by themselves may be useless. To be useful, data must be processed into finished information, that is, data converted into a meaningful and useful context for specific users. An increasing challenge for managers is being able to identify and access useful information." (Richard L Daft & Dorothy Marcic, "Understanding Management" 5th Ed., 2006)

"[...] construction of a data model is precisely the selective relevant depiction of the phenomena by the user of the theory required for the possibility of representation of the phenomenon."  (Bas C van Fraassen, "Scientific Representation: Paradoxes of Perspective", 2008)

"The thread that ties most of these applications together is that data collected from users provides added value. Whether that data is search terms, voice samples, or product reviews, the users are in a feedback loop in which they contribute to the products they use. That’s the beginning of data science." (Mike Loukides, "What Is Data Science?", 2011)

"What is really important is to remember that no matter how creative and innovative you wish to be in your graphics and visualizations, the first thing you must do, before you put a finger on the computer keyboard, is ask yourself what users are likely to try to do with your tool." (Alberto Cairo, "The Functional Art", 2011)

"As data scientists, we prefer to interact with the raw data. We know how to import it, transform it, mash it up with other data sources, and visualize it. Most of your customers can’t do that. One of the biggest challenges of developing a data product is figuring out how to give data back to the user. Giving back too much data in a way that’s overwhelming and paralyzing is 'data vomit'. It’s natural to build the product that you would want, but it’s very easy to overestimate the abilities of your users. The product you want may not be the product they want." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"By giving data back to the user, you can create both engagement and revenue. We’re far enough into the data game that most users have realized that they’re not the customer, they’re the product. Their role in the system is to generate data, either to assist in ad targeting or to be sold to the highest bidder, or both." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Generalizing beyond advertising, when building any data product in which the data is obfuscated (where there isn’t a clear relationship between the user and the result), you can compromise on precision, but not on recall. But when the data is exposed, focus on high precision." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Successful information design in movement systems gives the user the information he needs - and only the information he needs - at every decision point." (Joel Katz, "Designing Information: Human factors and common sense in information design", 2012) 

"You can give your data product a better chance of success by carefully setting the users’ expectations. [...] One under-appreciated facet of designing data products is how the user feels after using the product. Does he feel good? Empowered? Or disempowered and dejected?" (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Creating effective visualizations is hard. Not because a dataset requires an exotic and bespoke visual representation - for many problems, standard statistical charts will suffice. And not because creating a visualization requires coding expertise in an unfamiliar programming language [...]. Rather, creating effective visualizations is difficult because the problems that are best addressed by visualization are often complex and ill-formed. The task of figuring out what attributes of a dataset are important is often conflated with figuring out what type of visualization to use. Picking a chart type to represent specific attributes in a dataset is comparatively easy. Deciding on which data attributes will help answer a question, however, is a complex, poorly defined, and user-driven process that can require several rounds of visualization and exploration to resolve." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"Designing effective visualizations presents a paradox. On the one hand, visualizations are intended to help users learn about parts of their data that they don’t know about. On the other hand, the more we know about the users’ needs and the context of their data, the better we can design a visualization to serve them." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

04 October 2018

🔭Data Science: Data Products (Just the Quotes)

"Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: 'there’s a lot of data, what can you make from it?'" (Mike Loukides, "What Is Data Science?", 2011)

"Discovery is the key to building great data products, as opposed to products that are merely good." (Mike Loukides, "The Evolution of Data Products", 2011)

"New interfaces for data products are all about hiding the data itself, and getting to what the user wants." (Mike Loukides, "The Evolution of Data Products", 2011)

"[...] a good definition of a data product is a product that facilitates an end goal through the use of data. It’s tempting to think of a data product purely as a data problem. After all, there’s nothing more fun than throwing a lot of technical expertise and fancy algorithmic work at a difficult problem." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"As data scientists, we prefer to interact with the raw data. We know how to import it, transform it, mash it up with other data sources, and visualize it. Most of your customers can’t do that. One of the biggest challenges of developing a data product is figuring out how to give data back to the user. Giving back too much data in a way that’s overwhelming and paralyzing is 'data vomit'. It’s natural to build the product that you would want, but it’s very easy to overestimate the abilities of your users. The product you want may not be the product they want." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Generalizing beyond advertising, when building any data product in which the data is obfuscated (where there isn’t a clear relationship between the user and the result), you can compromise on precision, but not on recall. But when the data is exposed, focus on high precision." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"Ideas for data products tend to start simple and become complex; if they start complex, they become impossible." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"In an emergency, a data product that just produces more data is of little use. Data scientists now have the predictive tools to build products that increase the common good, but they need to be aware that building the models is not enough if they do not also produce optimized, implementable outcomes." (Jeremy Howard et al, "Designing Great Data Products", 2012)

"The best way to avoid data vomit is to focus on actionability of data. That is, what action do you want the user to take? If you want them to be impressed with the number of things that you can do with the data, then you’re likely producing data vomit. If you’re able to lead them to a clear set of actions, then you’ve built a product with a clear focus." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"The key aspect of making a data product is putting the 'product' first and 'data' second. Saying it another way, data is one mechanism by which you make the product user-focused. With all products, you should ask yourself the following three questions: (1) What do you want the user to take away from this product? (2) What action do you want the user to take because of the product? (3) How should the user feel during and after using your product?" (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"You can give your data product a better chance of success by carefully setting the users’ expectations. [...] One under-appreciated facet of designing data products is how the user feels after using the product. Does he feel good? Empowered? Or disempowered and dejected?" (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"To explain a data mesh in one sentence, a data mesh is a centrally managed network of decentralized data products. The data mesh breaks the central data lake into decentralized islands of data that are owned by the teams that generate the data. The data mesh architecture proposes that data be treated like a product, with each team producing its own data/output using its own choice of tools arranged in an architecture that works for them. This team completely owns the data/output they produce and exposes it for others to consume in a way they deem fit for their data." (Aniruddha Deswandikar,"Engineering Data Mesh in Azure Cloud", 2024)

"Data product usage is growing quickly, doubling every year. Obviously, since we made the investment, we'll work with our customer to find applications." (Ken Shelton)

01 October 2018

🔭Data Science: Summaries (Just the Quotes)

"The null hypothesis of no difference has been judged to be no longer a sound or fruitful basis for statistical investigation. […] Significance tests do not provide the information that scientists need, and, furthermore, they are not the most effective method for analyzing and summarizing data." (Cherry A Clark, "Hypothesis Testing in Relation to Statistical Methodology", Review of Educational Research Vol. 33, 1963)

"Comparable objectives in data analysis are (l) to achieve more specific description of what is loosely known or suspected; (2) to find unanticipated aspects in the data, and to suggest unthought-of-models for the data's summarization and exposure; (3) to employ the data to assess the (always incomplete) adequacy of a contemplated model; (4) to provide both incentives and guidance for further analysis of the data; and (5) to keep the investigator usefully stimulated while he absorbs the feeling of his data and considers what to do next." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Data analysis must be iterative to be effective. [...] The iterative and interactive interplay of summarizing by fit and exposing by residuals is vital to effective data analysis. Summarizing and exposing are complementary and pervasive." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Summarizing data is a process of constrained and partial a process that essentially and inevitably corresponds to description - some sort of fitting, though it need not necessarily involve formal criteria or well-defined computations." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"[…] fitting lines to relationships between variables is often a useful and powerful method of summarizing a set of data. Regression analysis fits naturally with the development of causal explanations, simply because the research worker must, at a minimum, know what he or she is seeking to explain." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Fitting lines to relationships between variables is the major tool of data analysis. Fitted lines often effectively summarize the data and, by doing so, help communicate the analytic results to others. Estimating a fitted line is also the first step in squeezing further information from the data." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Modern data graphics can do much more than simply substitute for small statistical tables. At their best, graphics are instruments for reasoning about quantitative information. Often the most effective way to describe, explore, and summarize a set of numbers even a very large set - is to look at pictures of those numbers. Furthermore, of all methods for analyzing and communicating statistical information, well-designed data graphics are usually the simplest and at the same time the most powerful." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"Probabilities are summaries of knowledge that is left behind when information is transferred to a higher level of abstraction." (Judea Pearl, "Probabilistic Reasoning in Intelligent Systems: Network of Plausible, Inference", 1988)

"A good description of the data summarizes the systematic variation and leaves residuals that look structureless. That is, the residuals exhibit no patterns and have no exceptionally large values, or outliers. Any structure present in the residuals indicates an inadequate fit. Looking at the residuals laid out in an overlay helps to spot patterns and outliers and to associate them with their source in the data." (Christopher H Schrnid, "Value Splitting: Taking the Data Apart", 1991)

"The science of statistics may be described as exploring, analyzing and summarizing data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions." (Fergus Daly et al, "Elements of Statistics", 1995)

"Ockham's Razor in statistical analysis is used implicitly when models are embedded in richer models -for example, when testing the adequacy of a linear model by incorporating a quadratic term. If the coefficient of the quadratic term is not significant, it is dropped and the linear model is assumed to summarize the data adequately." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Every number has its limitations; every number is a product of choices that inevitably involve compromise. Statistics are intended to help us summarize, to get an overview of part of the world’s complexity. But some information is always sacrificed in the process of choosing what will be counted and how. Something is, in short, always missing. In evaluating statistics, we should not forget what has been lost, if only because this helps us understand what we still have." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Data often arrive in raw form, as long lists of numbers. In this case your job is to summarize the data in a way that captures its essence and conveys its meaning. This can be done numerically, with measures such as the average and standard deviation, or graphically. At other times you find data already in summarized form; in this case you must understand what the summary is telling, and what it is not telling, and then interpret the information for your readers or viewers." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"Whereas regression is about attempting to specify the underlying relationship that summarises a set of paired data, correlation is about assessing the strength of that relationship. Where there is a very close match between the scatter of points and the regression line, correlation is said to be 'strong' or 'high' . Where the points are widely scattered, the correlation is said to be 'weak' or 'low'." (Alan Graham, "Developing Thinking in Statistics", 2006)

"Graphical displays are often constructed to place principal focus on the individual observations in a dataset, and this is particularly helpful in identifying both the typical positions of data points and unusual or influential cases. However, in many investigations, principal interest lies in identifying the nature of underlying trends and relationships between variables, and so it is often helpful to enhance graphical displays in ways which give deeper insight into these features. This can be very beneficial both for small datasets, where variation can obscure underlying patterns, and large datasets, where the volume of data is so large that effective representation inevitably involves suitable summaries." (Adrian W Bowman, "Smoothing Techniques for Visualisation" [in "Handbook of Data Visualization"], 2008)

"In order to be effective a descriptive statistic has to make sense - it has to distill some essential characteristic of the data into a value that is both appropriate and understandable. […] the justification for computing any given statistic must come from the nature of the data themselves - it cannot come from the arithmetic, nor can it come from the statistic. If the data are a meaningless collection of values, then the summary statistics will also be meaningless - no arithmetic operation can magically create meaning out of nonsense. Therefore, the meaning of any statistic has to come from the context for the data, while the appropriateness of any statistic will depend upon the use we intend to make of that statistic." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"In general, when building statistical models, we must not forget that the aim is to understand something about the real world. Or predict, choose an action, make a decision, summarize evidence, and so on, but always about the real world, not an abstract mathematical world: our models are not the reality - a point well made by George Box in his oft-cited remark that "all models are wrong, but some are useful". (David Hand, "Wonderful examples, but let's not close our eyes", Statistical Science 29, 2014)

"Just as with aggregated data, an average is a summary statistic that can tell you something about the data - but it is only one metric, and oftentimes a deceiving one at that. By taking all of the data and boiling it down to one value, an average (and other summary statistics) may imply that all of the underlying data is the same, even when it’s not." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Again, classical statistics only summarizes data, so it does not provide even a language for asking [a counterfactual] question. Causal inference provides a notation and, more importantly, offers a solution. As with predicting the effect of interventions [...], in many cases we can emulate human retrospective thinking with an algorithm that takes what we know about the observed world and produces an answer about the counterfactual world." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"[...] data often has some errors, outliers and other strange values, but these do not necessarily need to be individually identified and excluded. It also points to the benefits of using summary measures that are not unduly affected by odd observations [...] are known as robust measures, and include the median and the inter-quartile range." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"It is convenient to use a single number to summarize a steadily increasing or decreasing relationship between the pairs of numbers shown on a scatter-plot. This is generally chosen to be the Pearson correlation coefficient [...]. A Pearson correlation runs between −1 and 1, and expresses how close to a straight line the dots or data-points fall. A correlation of 1 occurs if all the points lie on a straight line going upwards, while a correlation of −1 occurs if all the points lie on a straight line going downwards. A correlation near 0 can come from a random scatter of points, or any other pattern in which there is no systematic trend upwards or downwards [...]." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"A data visualization, or dashboard, is great for summarizing or describing what has gone on in the past, but if people don’t know how to progress beyond looking just backwards on what has happened, then they cannot diagnose and find the ‘why’ behind it." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Visualisation is fundamentally limited by the number of pixels you can pump to a screen. If you have big data, you have way more data than pixels, so you have to summarise your data. Statistics gives you lots of really good tools for this." (Hadley Wickham)

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.