Showing posts with label transformations. Show all posts
Showing posts with label transformations. Show all posts

21 July 2025

📊Graphical Representation: Sense-making in Data Visualizations (Part 1: An Introduction)

Graphical Representation Series
Graphical Representation Series

Introduction

Creating simple charts or more complex data visualizations may appear trivial for many, though their authors shouldn't forget that readers have different backgrounds, degrees of literacy, many of them not being maybe able to make sense of graphical displays, at least not without some help.

Beginners start with a limited experience and build upon it, then, on the road to mastery, they get acquainted with the many possibilities, a deeper sense is achieved and the choices become a few. Independently of one's experience, there are seldom 'yes' and 'no' answers for the various choices, but everything is a matter of degree that varies with one's experience, available time, audience's expectations, and many more aspects might be considered in time.  

The following questions are intended to expand, respectively narrow down our choices when dealing with data visualizations from a data professional's perspective. The questions are based mainly on [1] though they were extended to include a broader perspective. 

General Questions

Where does the data come from? Is the source reliable, representative (for the whole population in scope)? Is the data source certified? Are yhe data actual? 

Are there better (usable) sources? What's the effort to consider them? Does the data overlap? To what degree? Are there any benefits in merging the data? How much this changes the overall picture? Are the changes (in trends) explainable? 

Was the data collected? How, from where, and using what method? [1] What methodology/approach was used?

What's the dataset about? Can one recognize the data, the (data) entities, respectively the structures behind? How big is the fact table (in terms of rows and columns)? How many dimensions are in scope?

What transformations, calculations or modifications have been applied? What was left out and what's the overall impact?

Any significant assumptions were made? [1] Were the assumptions clearly stated? Are they entitled? Is it more to them? 

Were any transformation applied? Do the transformations change any data characteristics? Were they adequately documented/explained? Do they make sense? Was it something important left out? What's the overall impact?

What criteria were used to include/exclude data from the display? [1] Are the criteria adequately explained/documented? Do they make sense?

Are similar data publicly available? Is it (freely) accessible/usable? To what degree? How much do the datasets overlap? Is there any benefit to analyze/use the respective data? Are the characteristics comparable? To what degree?

Dataviz Questions

What's the title/subtitle of the chart? Is it meaningful for the readers? Does the title reflect the data, respectively the findings adequately? Can it be better formulated? Is it an eye-catcher? Does it meet the expectations? 

What data is shown? Of what type? At what level is the data aggregated? 

What chart (type) is being used? [1] Are the readers familiar with the chart type? Does it needs further introduction/clarifications? Are there better means to represent the data? Does the chart offer the appropriate perspective? Does it make sense to offer different (complementary) perspective(s)? To what degree other perspectives help?

What items of data do the marks represent? What value associations do the attributes represent? [1] Are the marks visible? Are the marks adequately presented (e.g. due to missing data)? 

What range of values are displayed? [1] What approximation the values support? To what degree can the values be rounded without losing meaning?

Is the data categorical, ordinal or continuous? 

Are the axes property chosen/displayed/labeled? Is the scale properly chosen (linear, semilogarithmic, logarithmic), respectively displayed? Do they emphasize, diminish, distort, simplify, or clutter the information? 

What features (shapes, patterns, differences or connections) are observable, interesting or vital for understanding the chart? [1] 

Where are the largest, mid-sized and smallest values? (aka ‘stepped magnitude’ judgements). [1] 

Where lie the most/least values? Where is the average or normal? (aka ‘global comparison’ judgements)” [1] How are the values distributed? Are there any outliers present? Are they explainable? 

What features are expected or unexpected? [1] To what degree are they unexpected?  

What features are important given the subject? [1] 

What shapes and patterns strike readers as being semantically aligned with the subject? [1] 

What is the overall feeling when looking at the final result? Is the chart overcrowded? Can anything be left out/included? 

What colors were used? [1] Are the colors adequately chosen, respectively meaningful? Do they follow the general recommendations?  

What colors, patterns, forms do readers see first? What impressions come next, respectively last longer?  

Are the various elements adequately/intuitively positioned/distinguishable? What's the degree of overlapping/proximity? Do the elements respect an intuitive hierarchy? Do they match readers' expectations, respectively the best practices in scope? Are the deviations entitled? 

Is the space properly used? To what degree? Are there major gaps? 

Know Your Audience

What audience targets the visualization? Which are its characteristics (level of experience with data visualizations; authors, experts or casual attendees)? Are there any accidental attendees? How likely is the audience to pay attention? 

What is audience’s relationship with the subject matter? What knowledge do they have or, conversely, lack about the subject? What assistance might they need to interpret the meaning of the subject? Do they have the capacity to comprehend what it means to them? [1]

Why do the audience wants/needs to understand the topic? Are they familiar, respectively actively interested or more passive? Is it able to grasp the intended meaning? [1] To what degree? What kind of challenges might be involved, of what nature?

What is their motivation? Do they have a direct, expressed need or are they more passive and indifferent? Is it needed a way to persuade them or even seduce them to engage? [1] Can this be done without distorting the data and its meaning(s)?

What are their visualization literacy skill set? Do they require assistance perceiving the chart(s)? Are they sufficiently comfortable with operating features of interactivity? Do they have any visual accessibility issues (e.g. red–green color blindness)? Do they need to be (re)factored into the design? [1]

Reflections

What has been learnt? Has it reinforced or challenged existing knowledge? [1] Was new knowledge gained? How valuable is this knowledge? Can it be reused? In which contexts? 

Do the findings meet one's expectations? To what degree? Were the expectations entitled? On what basis? What's missing? What's gaps' relevance? 

What feelings have been stirred? Has the experience had an impact emotionally? [1] To what degree? Is the impact positive/negative? Is the reaction entitled/explainable? Are there any factors that distorted the reactions? Are they explainable? Do they make sense? 

What does one do with this understanding? Is it just knowledge acquired or something to inspire action (e.g. making a decision or motivating a change in behavior)? [1] How relevant/valuable is the information for us? Can it be used/misused? To what degree? 

Are the data and its representation trustworthy? [1] To what degree?

References:
[1] Andy Kirk, "Data Visualisation: A Handbook for Data Driven Design" 2nd Ed., 2019

01 April 2024

📊R Language: Data Transformations (Part I: Temperatures' comparison between F° and C°)

The time series used for weather analysis use either Fahrenheit (F°) or Celsius (C°) for the temperature values. Looking at the A and B plots below that represent the values of the same dataset in F°, respectively C°, there seems to be no difference between the two plots independently on whether one works with F° or C°, however the scales are different. Once one uses the same scale for both values (see C) the plots are distorted according to the formula used for transformation.

Comments:
(1) Typically, it makes sense to adapt the temperature scale to the audience, though on the Web there will be always a mix of audiences (and that's why weather websites allow to choose one of the values). 
(2) Not starting from 0 might show in the end the same trend at same scale, though the behavior can change occasionally. As long as the Y-axis is correctly labeled, this shouldn't be a problem. Conversely, it's better to control the scale and provide the min-max values for the axis accordingly.
(3) When creating such plots, it's important to be aware of the distortion that might be introduced by transformations. For linear transformations of the type a*x+b, the value of the "a" coefficient tells how much the resulting values are stretched or contracted.

I used as exemplification the airquality dataset which contains data for 1973, the temperature being given in F°. Unfortunately, the dataset contains only the day and the month, so the date must be constructed and added to the dataset. For simplification, I've added the calculated temperature in C° as column as well:

#reviewing the data
help("airquality")

#preparing the data
head(airquality)
airquality$date <- with(airquality, as.Date(ISOdate(1973, Month, Day))) #adding the date
airquality$TempC <- with(airquality, (Temp - 32) * 5/9) #adding the temperature in C°
head(airquality)

And, here's the code used to generate the plots:

#Temperatures' comparison between F° and C°
par(mfrow = c(2,2)) #1x2 matrix display

plot(airquality$date, airquality$Temp, ylab="Temperature (F°)", xlab="date", type="l", col="blue", main="A")

plot(airquality$date, airquality$TempC, ylab="Temperature (C°)", xlab="date", type="l", col="brown", main="B")

plot(airquality$date, airquality$Temp, ylab="Temperature (F°) vs (C°)", xlab="date", ylim=c(0,100), type="l", col="blue", main="C")
lines(airquality$date, airquality$TempC, col="brown")

# using inline formula
plot(airquality$date, (airquality$Temp - 32) * 5/9, ylab="(Temp-32)*5/9", xlab="date", ylim=c(0,100), type="l", col="brown", main="D")

mtext("© sql-troubles@blogspot.com @sql_troubles, 2024", side = 1, line = 4, adj = 1, col = "dodgerblue4", cex = .7)
title("Temperatures' comparison between F° and C°", line = -1, outer = TRUE)

In the fourth plot I directly used the formula for transforming the values from F° and C°. If the values based on the formula need to be used repeatedly, it's probably better to add a column to the dataset.

Unfortunately, the standard library has its limitations when creating visualizations. While writing this post I tried to work also with the plotly library, which offers a richer set of tools and can be used to create wonderful visualizations (though it proves also more complex to use). 

install.packages("plotly")
library("plotly")

Here's the code used to plot the below graphic (the points have labels, much like in Power BI):

fig <- plot_ly(airquality, type = 'scatter', mode = 'lines+markers')%>%
  add_trace(x = ~date, y = ~Temp, name = 'Temp (F)')%>%
  add_trace(x = ~date, y = ~TempC, name = 'Temp (C)')%>%
  layout(showlegend = F, title="Temperatures' comparison between K° and C°")

fig
The temperatures via Plotly

Happy coding!

Previous Post <<||>> Next Post

15 November 2018

🔭Data Science: Transformations (Just the Quotes)

"Logging size transforms the original skewed distribution into a more symmetrical one by pulling in the long right tail of the distribution toward the mean. The short left tail is, in addition, stretched. The shift toward symmetrical distribution produced by the log transform is not, of course, merely for convenience. Symmetrical distributions, especially those that resemble the normal distribution, fulfill statistical assumptions that form the basis of statistical significance testing in the regression model." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Logging skewed variables also helps to reveal the patterns in the data. […] the rescaling of the variables by taking logarithms reduces the nonlinearity in the relationship and removes much of the clutter resulting from the skewed distributions on both variables; in short, the transformation helps clarify the relationship between the two variables. It also […] leads to a theoretically meaningful regression coefficient." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"The logarithmic transformation serves several purposes: (1) The resulting regression coefficients sometimes have a more useful theoretical interpretation compared to a regression based on unlogged variables. (2) Badly skewed distributions - in which many of the observations are clustered together combined with a few outlying values on the scale of measurement - are transformed by taking the logarithm of the measurements so that the clustered values are spread out and the large values pulled in more toward the middle of the distribution. (3) Some of the assumptions underlying the regression model and the associated significance tests are better met when the logarithm of the measured variables is taken." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"The logarithm is one of many transformations that we can apply to univariate measurements. The square root is another. Transformation is a critical tool for visualization or for any other mode of data analysis because it can substantially simplify the structure of a set of data. For example, transformation can remove skewness toward large values, and it can remove monotone increasing spread. And often, it is the logarithm that achieves this removal." (William S Cleveland, "Visualizing Data", 1993)

"Compound errors can begin with any of the standard sorts of bad statistics - a guess, a poor sample, an inadvertent transformation, perhaps confusion over the meaning of a complex statistic. People inevitably want to put statistics to use, to explore a number's implications. [...] The strengths and weaknesses of those original numbers should affect our confidence in the second-generation statistics." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"All forms of complex causation, and especially nonlinear transformations, admittedly stack the deck against prediction. Linear describes an outcome produced by one or more variables where the effect is additive. Any other interaction is nonlinear. This would include outcomes that involve step functions or phase transitions. The hard sciences routinely describe nonlinear phenomena. Making predictions about them becomes increasingly problematic when multiple variables are involved that have complex interactions. Some simple nonlinear systems can quickly become unpredictable when small variations in their inputs are introduced." (Richard N Lebow, "Forbidden Fruit: Counterfactuals and International Relations", 2010)

"Either a logarithmic or a square-root transformation of the data would produce a new series more amenable to fit a simple trigonometric model. It is often the case that periodic time series have rounded minima and sharp-peaked maxima. In these cases, the square root or logarithmic transformation seems to work well most of the time.(DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"Transformations of data alter statistics. For example, the mean of a data set can be found, but it is not easy to relate the mean of a data set to the mean of the logarithm of that data set. The median is far friendlier to transformations. If the median of a data set is found, then the logarithm of the data set is analyzed; the median of the log transformed data will be the log of the original median.(DeWayne R Derryberry, "Basic data analysis for time series with R", 2014) 

"Transforming data to measurements of a different kind can clarify and simplify hypotheses that have already been generated and can reveal patterns that would otherwise be hidden." (Forrest W Young et al, "Visual Statistics: Seeing data with dynamic interactive graphics", 2016)

"Feature generation (or engineering, as it is often called) is where the bulk of the time is spent in the machine learning process. As social science researchers or practitioners, you have spent a lot of time constructing features, using transformations, dummy variables, and interaction terms. All of that is still required and critical in the machine learning framework. One difference you will need to get comfortable with is that instead of carefully selecting a few predictors, machine learning systems tend to encourage the creation of lots of features and then empirically use holdout data to perform regularization and model selection. It is common to have models that are trained on thousands of features." (Rayid Ghani & Malte Schierholz, "Machine Learning", 2017)

"Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. The data sets themselves are explicitly linked as a form of representation to an observational or otherwise empirical domain of interest. 'Structure' has long been understood as symmetry which can take many forms with respect to any transformation, including point, translational, rotational, and many others. Symmetries directly point to invariants, which pinpoint intrinsic properties of the data and of the background empirical domain of interest. As our data models change, so too do our perspectives on analysing data." (Fionn Murtagh, "Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics", 2018)

"Many statistical procedures perform more effectively on data that are normally distributed, or at least are symmetric and not excessively kurtotic (fat-tailed), and where the mean and variance are approximately constant. Observed time series frequently require some form of transformation before they exhibit these distributional properties, for in their 'raw' form they are often asymmetric." (Terence C Mills, "Applied Time Series Analysis: A practical guide to modeling and forecasting", 2019)

24 December 2011

📉Graphical Representation: Transformations (Just the Quotes)

"Data should not be forced into an uncomfortable or improper mold. For example, data that is appropriate for line graphs is not usually appropriate for circle charts and in any case not without some arithmetic transformation. Only graphs that are designed to fit the data can be used profitably." (Cecil H Meyers, "Handbook of Basic Graphs: A modern approach", 1970)

"Logging size transforms the original skewed distribution into a more symmetrical one by pulling in the long right tail of the distribution toward the mean. The short left tail is, in addition, stretched. The shift toward symmetrical distribution produced by the log transform is not, of course, merely for convenience. Symmetrical distributions, especially those that resemble the normal distribution, fulfill statistical assumptions that form the basis of statistical significance testing in the regression model." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Logging skewed variables also helps to reveal the patterns in the data. […] the rescaling of the variables by taking logarithms reduces the nonlinearity in the relationship and removes much of the clutter resulting from the skewed distributions on both variables; in short, the transformation helps clarify the relationship between the two variables. It also […] leads to a theoretically meaningful regression coefficient." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"The logarithm is one of many transformations that we can apply to univariate measurements. The square root is another. Transformation is a critical tool for visualization or for any other mode of data analysis because it can substantially simplify the structure of a set of data. For example, transformation can remove skewness toward large values, and it can remove monotone increasing spread. And often, it is the logarithm that achieves this removal." (William S Cleveland, "Visualizing Data", 1993)

"The prevailing style of management must undergo transformation. A system cannot understand itself. The transformation requires a view from outside. The aim [...] is to provide an outside view - a lens - that I call a system of profound knowledge. It provides a map of theory by which to understand the organizations that we work in." (Dr. W. Edwards Deming, "The New Economics for Industry, Government, Education", 1994)

"The real value of dashboard products lies in their ability to replace hunt‐and‐peck data‐gathering techniques with a tireless, adaptable, information‐flow mechanism. Dashboards transform data repositories into consumable information." (Gregory L Hovis, "Stop Searching for Information Monitor it with Dashboard Technology," DM Direct, 2002)

"Data is transformed into graphics to understand. A map, a diagram are documents to be interrogated. But understanding means integrating all of the data. In order to do this it’s necessary to reduce it to a small number of elementary data. This is the objective of the 'data treatment' be it graphic or mathematic." (Jacques Bertin [interview], 2003)

"A grammar of graphics facilitates coordinated activity in a set of relatively autonomous components. This grammar enables us to develop a system in which adding a graphic to a frame (say, a surface) requires no adjustments or changes in definitions other than the simple message 'add this graphic'. Similarly, we can remove graphics, transform scales, permute attributes, and make other alterations without redefining the basic structure."(Leland Wilkinson, "The Grammar of Graphics" 2nd Ed., 2005)

"Any conclusion drawn from an analysis of a transformed variable must be retranslated into the original domain - which is usually not an easy task. A special handling of outliers, be it a complete removal, or just visual suppression such as hot-selection or shadowing, must have a cogent motivation. At any rate, transformations of data are usually part of a data preprocessing step that might precede a data analysis. Also it can be motivated by initial findings in a data analysis which revealed yet undiscovered problems in the dataset." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009) 

"Data captures actions and characteristics of the real world and transforms them into something that can be examined and explored after the fact." (Zach Gemignani et al, "Data Fluency", 2014)

"Transforming data to measurements of a different kind can clarify and simplify hypotheses that have already been generated and can reveal patterns that would otherwise be hidden." (Forrest W Young et al, "Visual Statistics: Seeing data with dynamic interactive graphics", 2016)

"Many statistical procedures perform more effectively on data that are normally distributed, or at least are symmetric and not excessively kurtotic" (fat-tailed), and where the mean and variance are approximately constant. Observed time series frequently require some form of transformation before they exhibit these distributional properties, for in their 'raw' form they are often asymmetric." (Terence C Mills, "Applied Time Series Analysis: A practical guide to modeling and forecasting", 2019)

"Knowing the semantics of your data helps with sensible data transformations." (Vidya Setlur & Bridget Cogley, "Functional Aesthetics for data visualization", 2022)

"Data becomes more useful once it’s transformed into a data visualization or used in a data story. Data storytelling is the ability to effectively communicate insights from a dataset using narratives and visualizations. It can be used to put data insights into context and inspire action from your audience. Color can be very helpful when you are trying to make information stand out within your data visualizations." (Kate Strachnyi, "ColorWise: A Data Storyteller’s Guide to the Intentional Use of Color", 2023)
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 25 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.