14 December 2018

Data Science: Datasets (Just the Quotes)

"Of course statistical graphics, just like statistical calculations, are only as good as what goes into them. An ill-specified or preposterous model or a puny data set cannot be rescued by a graphic (or by calculation), no matter how clever or fancy. A silly theory means a silly graphic." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"No matter what the data, and no matter how the values are arranged and presented, you must always use some method of analysis to come up with an interpretation of the data.
While every data set contains noise, some data sets may contain signals. Therefore, before you can detect a signal within any given data set, you must first filter out the noise." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Since the average is a measure of location, it is common to use averages to compare two data sets. The set with the greater average is thought to ‘exceed’ the other set. While such comparisons may be helpful, they must be used with caution. After all, for any given data set, most of the values will not be equal to the average." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)
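Wheeler's caution about comparing averages can be illustrated with a small sketch. The two data sets below are invented for illustration only (they do not come from the book), and the helper name `share_equal_to_mean` is hypothetical.

```python
from statistics import mean

def share_equal_to_mean(values):
    """Fraction of the values that are exactly equal to the average."""
    m = mean(values)
    return sum(1 for v in values if v == m) / len(values)

# Invented example data, chosen so the averages are easy to check by hand.
set_a = [2, 4, 4, 5, 10]   # average 5
set_b = [1, 3, 6, 7, 13]   # average 6

mean_a, mean_b = mean(set_a), mean(set_b)

# Set B "exceeds" set A on average ...
b_exceeds_on_average = mean_b > mean_a

# ... yet in many pairwise comparisons a value from A is the larger one.
a_wins = sum(1 for a in set_a for b in set_b if a > b)
total_pairs = len(set_a) * len(set_b)

print(f"average A = {mean_a}, average B = {mean_b}")
print(f"A value beats B value in {a_wins} of {total_pairs} pairs")
print(f"share of A equal to its average: {share_equal_to_mean(set_a)}")
```

In this toy case only one value in five equals the average of its set, and the set with the lower average still wins a large share of the head-to-head comparisons, which is the kind of situation the quote warns about.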

"Without meaningful data there can be no meaningful analysis. The interpretation of any data set must be based upon the context of those data." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Enabling insight into large and complex datasets is a prevalent theme in current visualization research for which different approaches are pursued. Topology-based methods are built on the idea of abstracting characteristic structures such as the topological skeleton from the data and to construct the visualization accordingly." (Helwig Hauser et al [Eds.], "Topology-based Methods in Visualization", 2007)

"Most mainstream data-mining techniques ignore the fact that real-world datasets are combinations of underlying data, and build single models from them. If such datasets can first be separated into the components that underlie them, we might expect that the quality of the models will improve significantly." (David Skillicorn, "Understanding Complex Datasets: Data Mining with Matrix Decompositions", 2007)

"For a given dataset there is not a great deal of advice which can be given on content and context. Those who know their own data should know best for their specific purposes. It is advisable to think hard about what should be shown and to check with others if the graphic makes the desired impression. Design should be left to designers, though some basic guidelines should be followed: consistency is important (sets of graphics should be in similar style and use equivalent scaling); proximity is helpful (place graphics on the same page, or on the facing page, of any text that refers to them); and layout should be checked (graphics should be neither too small nor too large and be attractively positioned relative to the whole page or display)." (Antony Unwin, "Good Graphics?" [in "Handbook of Data Visualization"], 2008)

"Graphical displays are often constructed to place principal focus on the individual observations in a dataset, and this is particularly helpful in identifying both the typical positions of data points and unusual or influential cases. However, in many investigations, principal interest lies in identifying the nature of underlying trends and relationships between variables, and so it is often helpful to enhance graphical displays in ways which give deeper insight into these features. This can be very beneficial both for small datasets, where variation can obscure underlying patterns, and large datasets, where the volume of data is so large that effective representation inevitably involves suitable summaries." (Adrian W Bowman, "Smoothing Techniques for Visualisation" [in "Handbook of Data Visualization"], 2008)

"The main goal of data visualization is its ability to visualize data, communicating information clearly and effectively. It doesn’t mean that data visualization needs to look boring to be functional or extremely sophisticated to look beautiful. To convey ideas effectively, both aesthetic form and functionality need to go hand in hand, providing insights into a rather sparse and complex dataset by communicating its key aspects in a more intuitive way. Yet designers often tend to discard the balance between design and function, creating gorgeous data visualizations which fail to serve its main purpose - communicate information." (Vitaly Friedman, "Data Visualization and Infographics", Smashing Magazine, 2008)

"There are two main reasons for using graphic displays of datasets: either to present or to explore data. Presenting data involves deciding what information you want to convey and drawing a display appropriate for the content and for the intended audience. [...] Exploring data is a much more individual matter, using graphics to find information and to generate ideas. Many displays may be drawn. They can be changed at will or discarded and new versions prepared, so generally no one plot is especially important, and they all have a short life span." (Antony Unwin, "Good Graphics?" [in "Handbook of Data Visualization"], 2008)

"To extract useful information from such large and structured data sets, a first step is to be able to visualize their structure, identifying interesting patterns, trends, and complex relationships between the items. The main idea of visual data exploration is to produce a representation of the data in such a way that the human eye can gain insight into their structure and patterns." (George Michailidis, "Data Visualization Through Their Graph Representations" [in "Handbook of Data Visualization"], 2008)

"[...] the form of a technological object must depend on the tasks it should help with. This is one of the most important principles to remember when dealing with infographics and visualizations: The form should be constrained by the functions of your presentation. There may be more than one form a data set can adopt so that readers can perform operations with it and extract meanings, but the data cannot adopt any form. Choosing visual shapes to encode information should not be based on aesthetics and personal tastes alone." (Alberto Cairo, "The Functional Art", 2011)

"If you look too hard at a set of data, you will find something - but it might not generalize beyond the data you’re looking at. This is referred to as overfitting a dataset. Data mining techniques can be very powerful, and the need to detect and avoid overfitting is one of the most important concepts to grasp when applying data mining to real problems. The concept of overfitting and its avoidance permeates data science processes, algorithms, and evaluation methods." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)
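The idea of overfitting in the quote above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a method from the book: the data are synthetic, the 20% label noise is an arbitrary choice, and the names `make_data`, `nearest_neighbour_predict`, `memoriser`, and `simple_rule` are hypothetical.

```python
import random

def make_data(n, rng, noise=0.2):
    """Points x in [0, 1); the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label  # simulated labelling error
        data.append((x, label))
    return data

def nearest_neighbour_predict(train, x):
    """Memorising model: copy the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def accuracy(predict, data):
    """Fraction of points whose predicted label matches the recorded label."""
    return sum(predict(x) == label for x, label in data) / len(data)

rng = random.Random(0)          # fixed seed so the run is repeatable
train = make_data(200, rng)
test = make_data(200, rng)      # fresh data the models have not seen

memoriser = lambda x: nearest_neighbour_predict(train, x)
simple_rule = lambda x: x > 0.5  # the underlying pattern, ignoring the noise

train_acc_memoriser = accuracy(memoriser, train)
test_acc_memoriser = accuracy(memoriser, test)
test_acc_simple = accuracy(simple_rule, test)

print(f"memoriser on training data: {train_acc_memoriser:.2f}")
print(f"memoriser on new data:      {test_acc_memoriser:.2f}")
print(f"simple rule on new data:    {test_acc_simple:.2f}")
```

The memorising model reproduces every training label, noise included, so it looks perfect on the data it was built from and noticeably worse on fresh data. The simple rule typically holds up better on new data because it does not chase the noise.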

"No subjective metric can escape strategic gaming [...] The possibility of mischief is bottomless. Fighting ratings is fruitless, as they satisfy a very human need. If one scheme is beaten down, another will take its place and wear its flaws. Big Data just deepens the danger. The more complex the rating formulas, the more numerous the opportunities there are to dress up the numbers. The larger the data sets, the harder it is to audit them." (Kaiser Fung, "Numbersense: How To Use Big Data To Your Advantage", 2013)

"Visualization can be appreciated purely from an aesthetic point of view, but it’s most interesting when it’s about data that’s worth looking at. That’s why you start with data, explore it, and then show results rather than start with a visual and try to squeeze a dataset into it. It’s like trying to use a hammer to bang in a bunch of screws. […] Aesthetics isn’t just a shiny veneer that you slap on at the last minute. It represents the thought you put into a visualization, which is tightly coupled with clarity and affects interpretation." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"Big Data allows us to meaningfully zoom in on small segments of a dataset to gain new insights on who we are." (Seth Stephens-Davidowitz, "Everybody Lies: What the Internet Can Tell Us About Who We Really Are", 2017)

"Effects without an understanding of the causes behind them, on the other hand, are just bunches of data points floating in the ether, offering nothing useful by themselves. Big Data is information, equivalent to the patterns of light that fall onto the eye. Big Data is like the history of stimuli that our eyes have responded to. And as we discussed earlier, stimuli are themselves meaningless because they could mean anything. The same is true for Big Data, unless something transformative is brought to all those data sets… understanding." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"One way to lie with statistics is to compare things - datasets, populations, types of products - that are different from one another, and pretend that they’re not. As the old idiom says, you can’t compare apples with oranges." (Daniel J Levitin, "Weaponized Lies", 2017)

"There are other problems with Big Data. In any large data set, there are bound to be inconsistencies, misclassifications, missing data - in other words, errors, blunders, and possibly lies. These problems with individual items occur in any data set, but they are often hidden in a large mass of numbers even when these numbers are generated out of computer interactions." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Creating effective visualizations is hard. Not because a dataset requires an exotic and bespoke visual representation - for many problems, standard statistical charts will suffice. And not because creating a visualization requires coding expertise in an unfamiliar programming language [...]. Rather, creating effective visualizations is difficult because the problems that are best addressed by visualization are often complex and ill-formed. The task of figuring out what attributes of a dataset are important is often conflated with figuring out what type of visualization to use. Picking a chart type to represent specific attributes in a dataset is comparatively easy. Deciding on which data attributes will help answer a question, however, is a complex, poorly defined, and user-driven process that can require several rounds of visualization and exploration to resolve." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. The data sets themselves are explicitly linked as a form of representation to an observational or otherwise empirical domain of interest. 'Structure' has long been understood as symmetry which can take many forms with respect to any transformation, including point, translational, rotational, and many others. Symmetries directly point to invariants, which pinpoint intrinsic properties of the data and of the background empirical domain of interest. As our data models change, so too do our perspectives on analysing data." (Fionn Murtagh, "Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics", 2018)

"Every dataset has subtleties; it can be far too easy to slip down rabbit holes of complications. Being systematic about the operationalization can help focus our conversations with experts, only introducing complications when needed." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"The goal of data science is to improve decision making by basing decisions on insights extracted from large data sets. As a field of activity, data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from large data sets. It is closely related to the fields of data mining and machine learning, but it is broader in scope." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Using data science, we can uncover the important patterns in a data set, and these patterns can reveal the important attributes in the domain. The reason why data science is used in so many domains is that it doesn’t matter what the problem domain is: if the right data are available and the problem can be clearly defined, then data science can help." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Big data is revolutionizing the world around us, and it is easy to feel alienated by tales of computers handing down decisions made in ways we don’t understand. I think we’re right to be concerned. Modern data analytics can produce some miraculous results, but big data is often less trustworthy than small data. Small data can typically be scrutinized; big data tends to be locked away in the vaults of Silicon Valley. The simple statistical tools used to analyze small datasets are usually easy to check; pattern-recognizing algorithms can all too easily be mysterious and commercially sensitive black boxes." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Each of us is sweating data, and those data are being mopped up and wrung out into oceans of information. Algorithms and large datasets are being used for everything from finding us love to deciding whether, if we are accused of a crime, we go to prison before the trial or are instead allowed to post bail. We all need to understand what these data are and how they can be exploited." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"It’d be nice to fondly imagine that high-quality statistics simply appear in a spreadsheet somewhere, divine providence from the numerical heavens. Yet any dataset begins with somebody deciding to collect the numbers. What numbers are and aren’t collected, what is and isn’t measured, and who is included or excluded are the result of all-too-human assumptions, preconceptions, and oversights." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Many people have strong intuitions about whether they would rather have a vital decision about them made by algorithms or humans. Some people are touchingly impressed by the capabilities of the algorithms; others have far too much faith in human judgment. The truth is that sometimes the algorithms will do better than the humans, and sometimes they won’t. If we want to avoid the problems and unlock the promise of big data, we’re going to need to assess the performance of the algorithms on a case-by-case basis. All too often, this is much harder than it should be. […] So the problem is not the algorithms, or the big datasets. The problem is a lack of scrutiny, transparency, and debate." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)
