
29 May 2024

Graphical Representation: Graphics We Live By (Part VII: Reading a Conversion Rates Chart with ChatGPT and Copilot)

Graphical Representation Series

One of the areas where ChatGPT, Copilot and other similar AI-based chatbots can help is in summarizing a chart saved as an image. Ideally, the chatbots should also be able to approximate the data points behind the chart (an image is made of pixels, so the plotted areas should be easy to delimit). So, I was wondering how far the chatbots can be used for these purposes. I first used an image copied from the web, though I realized that not all visual elements could be read (e.g. Copilot had issues retrieving the values for some months) and I had no underlying data for comparison, so I couldn't tell how big the deviations were.

So, I created the chart below in Power BI, modelled on the original, for which I have the underlying data (see the 'Original data' columns in the tables that follow):

Conversion Rates Dual Axes Chart

Here's the output generated by ChatGPT-4o over several attempts:
Sorting | Month | Original data (Conv. / Rate %) | First attempt | Second attempt | Third attempt | Fourth attempt
1 | Jul | 8 / 4 | 10 / 1 | 10 / 1 | 8 / 4 | 8 / 4
2 | Aug | 280 / 16 | 275 / 15 | 275 / 15 | 275 / 18 | 275 / 18
3 | Sep | 100 / 13 | 225 / 12 | 225 / 10 | 225 / 12 | 225 / 12
4 | Oct | 280 / 14 | 275 / 12 | 275 / 11 | 275 / 11 | 275 / 11
5 | Nov | 90 / 4 | 75 / 5 | 75 / 6 | 75 / 6 | 75 / 6
6 | Dec | 85 / 3.5 | 100 / 5 | 100 / 5 | 100 / 5 | 100 / 5
7 | Jan | 70 / 4.5 | 50 / 3 | 50 / 3 | 50 / 4 | 50 / 4
8 | Feb | 30 / 1.5 | 50 / 3 | 25 / 2 | 50 / 2.5 | 50 / 2.5
9 | Mar | 70 / 4 | 25 / 1 | 50 / 2.5 | 25 / 1.5 | 25 / 1.5
10 | Apr | 185 / 11 | 200 / 10 | 200 / 10 | 200 / 10 | 200 / 10
11 | May | 25 / 3.5 | 50 / 4 | 50 / 4 | 50 / 3.5 | 50 / 3.5
12 | Jun | 195 / 4 | 225 / 10 | 225 / 10 | 225 / 11 | 195 / 4

Please note that some values were deliberately repeated in the original data to check whether the chatbot can recognize such resemblances correctly.

Based on this data, several tests were made with ChatGPT-4o, where the following prompts were given:

  1. generate the data behind the chart as an Excel list (see 'First attempt' in the table);
  2. generate the data behind the chart as an Excel list where blue is represented on the left axis and orange on the right axis (see 'Second attempt');
  3. reevaluate the values from the previous output considering that the values for Jul are 8 and 4%, respectively (see 'Third attempt');
  4. reevaluate the values from the previous output considering that the values for Jun are 195 and 4%, respectively (see 'Fourth attempt');
  5. generate the data behind the dual-axis chart as a list from the last conversation;
  6. summarize the chart by revealing trends and other statistics.
From attempt to attempt the values get closer to the original data as corrections are made, though some points in the final output still diverge considerably. ChatGPT does generate a table that can easily be copied into MS Excel, and it even gives some instructions on how to build a dual-axis chart.

It's great that one can reuse the data between sessions and perhaps improve the output. ChatGPT's summary is better than expected and it leverages the latest corrections. However, the estimations need to improve so that the inferences based on them can be correct.
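
Since the original data is now available in the table above, the chart itself can also be rebuilt outside Power BI. Here's a minimal base-R sketch of a dual-axis view (my own addition; the vector names are mine, and the values are the ones transcribed in the 'Original data' column):

months      <- c("Jul","Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr","May","Jun")
conversions <- c(8, 280, 100, 280, 90, 85, 70, 30, 70, 185, 25, 195)
conv_rate   <- c(4, 16, 13, 14, 4, 3.5, 4.5, 1.5, 4, 11, 3.5, 4)

par(mar = c(5, 4, 4, 4) + 0.1)	#leave room on the right for the second axis
plot(1:12, conversions, type = "o", col = "blue", xaxt = "n",
     xlab = "Month", ylab = "Conversions")	#blue series on the left axis
axis(1, at = 1:12, labels = months)
par(new = TRUE)	#overlay the rate series on the same plot region
plot(1:12, conv_rate, type = "o", col = "darkorange",
     axes = FALSE, xlab = "", ylab = "")	#orange series, drawn without its own axes
axis(side = 4)	#right-hand axis for the conversion rate
mtext("Conversion Rate (%)", side = 4, line = 3)
title("Conversion Rates Dual Axes Chart")

The layout may differ slightly from the Power BI original, but the two scales are kept separate in the same way.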

Here's the output based on the same prompts in Copilot:
Sorting | Month | Original data (Conv. / Rate %) | First attempt | Second attempt | Third attempt | Fourth attempt
1 | Jul | 8 / 4 | 300 / 18 | 300 / 18 | 8 / 4 | 8 / 4
2 | Aug | 280 / 16 | 250 / 16 | 250 / 16 | 250 / 16 | 250 / 16
3 | Sep | 100 / 13 | 200 / 14 | 200 / 14 | 200 / 14 | 200 / 14
4 | Oct | 280 / 14 | 150 / 12 | 150 / 12 | 150 / 12 | 150 / 12
5 | Nov | 90 / 4 | 100 / 10 | 100 / 10 | 100 / 10 | 100 / 10
6 | Dec | 85 / 3.5 | 50 / 8 | 50 / 8 | 50 / 8 | 50 / 8
7 | Jan | 70 / 4.5 | 300 / 18 | 300 / 18 | 300 / 18 | 300 / 18
8 | Feb | 30 / 1.5 | 250 / 16 | 250 / 16 | 250 / 16 | 250 / 16
9 | Mar | 70 / 4 | 200 / 14 | 200 / 14 | 200 / 14 | 200 / 14
10 | Apr | 185 / 11 | 150 / 12 | 150 / 12 | 150 / 12 | 150 / 12
11 | May | 25 / 3.5 | 100 / 10 | 100 / 10 | 100 / 10 | 100 / 10
12 | Jun | 195 / 4 | 50 / 8 | 50 / 8 | 50 / 8 | 195 / 4

Copilot's estimations are higher than ChatGPT's and deviate more from the original data. Between prompts, only the values explicitly pointed out (Jul and Jun) are reevaluated; the other values remain unchanged. The summary provides information that can be used to complement ChatGPT's output.

Overall, ChatGPT seems to perform better than Copilot, at least for this example (though we might be talking here about different "generations"). Unfortunately, given that the estimations provided by both chatbots deviate considerably from the expectation, the output needs to be revised and corrected, which decreases the usability of such chatbots. In fact, one can use them to generate an initial set of data and then correct the deviations.
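
To put "deviate considerably" into numbers, here's a small sketch (again my own, using the fourth-attempt values transcribed in the tables above) that computes the mean absolute deviation of each chatbot's final output from the original data:

original_conv <- c(8, 280, 100, 280, 90, 85, 70, 30, 70, 185, 25, 195)
original_rate <- c(4, 16, 13, 14, 4, 3.5, 4.5, 1.5, 4, 11, 3.5, 4)
chatgpt_conv  <- c(8, 275, 225, 275, 75, 100, 50, 50, 25, 200, 50, 195)	#ChatGPT, fourth attempt
chatgpt_rate  <- c(4, 18, 12, 11, 6, 5, 4, 2.5, 1.5, 10, 3.5, 4)
copilot_conv  <- c(8, 250, 200, 150, 100, 50, 300, 250, 200, 150, 100, 195)	#Copilot, fourth attempt
copilot_rate  <- c(4, 16, 14, 12, 10, 8, 18, 16, 14, 12, 10, 4)

mean(abs(chatgpt_conv - original_conv))	#mean absolute error, ChatGPT conversions
mean(abs(chatgpt_rate - original_rate))	#mean absolute error, ChatGPT conversion rate
mean(abs(copilot_conv - original_conv))	#mean absolute error, Copilot conversions
mean(abs(copilot_rate - original_rate))	#mean absolute error, Copilot conversion rate

On these transcribed values ChatGPT's mean absolute error comes out at roughly 24 conversions and 1.2 percentage points, versus roughly 83 conversions and 4.9 percentage points for Copilot, which is consistent with the impression above.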

The outputs of other chatbots like Google's Gemini or Claude-3-Haiku (via Poe) can't yet be compared with the ones from ChatGPT or Copilot. Claude-3-Haiku does provide estimated values (even with decimals), though they deviate considerably from the original data.

It would be interesting to test how other types of charts and plots are processed by chatbots, and whether the various visual elements (e.g. gridlines, ticks, markers) make a difference.


26 February 2024

R Language: Data Summaries without Using a DataFrame

Coming back to the R language after several years and trying to remember some basic functions proved to be a bit challenging, even if the syntax is quite simple. Therefore, I considered putting together a few calls as a refresher, based on the Youden-Beale data. To run the code below you'll need to install R and RStudio.

In case you don't have the package installed, run the next two lines:

install.packages("ACSWR") #install the Youden-Beale Experiment package
library(ACSWR)	#load the library
 
str(yb)		#display the dataset's structure

'data.frame':	8 obs. of  2 variables:
 $ Preparation_1: int  31 20 18 17 9 8 10 7
 $ Preparation_2: int  18 17 14 11 10 7 5 6

yb		#display the dataset

  Preparation_1 Preparation_2
1            31            18
2            20            17
3            18            14
4            17            11
5             9            10
6             8             7
7            10             5
8             7             6

summary(yb) 	#display the summary for whole dataset

 Preparation_1   Preparation_2
 Min.   : 7.00   Min.   : 5.00
 1st Qu.: 8.75   1st Qu.: 6.75
 Median :13.50   Median :10.50
 Mean   :15.00   Mean   :11.00
 3rd Qu.:18.50   3rd Qu.:14.75
 Max.   :31.00   Max.   :18.00

summary(yb$Preparation_1)	#display the summary for first column

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   7.00    8.75   13.50   15.00   18.50   31.00

summary(yb$Preparation_2)	#display the summary for second column

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   5.00    6.75   10.50   11.00   14.75   18.00

min(yb)	#display the minimum value for the whole dataset

[1] 5

min(yb$Preparation_1)	#display the minimum of first column

[1] 7

min(yb$Preparation_2)	#display the minimum of second column

[1] 5

sum(yb)	#display the sum of all values

[1] 208

sum(yb$Preparation_1)	#display the sum of first column

[1] 120

sum(yb$Preparation_2)	#display the sum of second column

[1] 88

#display the quartiles of the first column
quantile(yb$Preparation_1,seq(0,1,.25))

   0%   25%   50%   75%  100%
 7.00  8.75 13.50 18.50 31.00

#display the quartiles of the second column
quantile(yb$Preparation_2,seq(0,1,.25))

   0%   25%   50%   75%  100%
 5.00  6.75 10.50 14.75 18.00

#display the deciles of the first column
quantile(yb$Preparation_1,seq(0,1,.1))

  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
 7.0  7.7  8.4  9.1  9.8 13.5 17.2 17.9 19.2 23.3 31.0

#display the deciles of the second column
quantile(yb$Preparation_2,seq(0,1,.1))

  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
 5.0  5.7  6.4  7.3  9.4 10.5 11.6 13.7 15.8 17.3 18.0

length(yb) 	#display the number of columns (for a data frame, length() counts the columns)

[1] 2

ncol(yb) 	#display the number of columns

[1] 2

sort(yb$Preparation_1) #display the values sorted in ascending order

[1] 7 8 9 10 17 18 20 31

sort(yb$Preparation_1, decreasing = TRUE) #display the values sorted in descending order

[1] 31 20 18 17 10 9 8 7

#display a vertical boxplot
boxplot(yb, notch=FALSE)
title("A: Vertical Boxplot for Youden-Beale Data")

#display a horizontal boxplot
boxplot(yb, horizontal = TRUE)
title("B: Horizontal Boxplot for Youden-Beale Data")


 
plot(yb) #scatter diagram
title("Scatter diagram")

lsfit(yb$Preparation_1, yb$Preparation_2)$coefficients #least squares fit coefficients

Intercept         X 
2.8269231 0.5448718 
 
lsfit(yb$Preparation_1, yb$Preparation_2)$residuals #least squares fit residuals

[1] -1.7179487  3.2756410  1.3653846 -1.0897436  2.2692308 -0.1858974
[7] -3.2756410 -0.6410256
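
As a quick cross-check (my own addition, not part of the original refresher), the same least squares fit can be reproduced with lm(), which also makes it easy to overlay the fitted line on the scatter diagram:

fit <- lm(Preparation_2 ~ Preparation_1, data = yb)	#same model as lsfit() above
coef(fit)	#same intercept and slope as lsfit()
residuals(fit)	#same residuals as lsfit()
abline(fit)	#overlay the fitted line (run right after plot(yb))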

  Happy coding!

01 October 2018

Data Science: Summaries (Just the Quotes)

"The null hypothesis of no difference has been judged to be no longer a sound or fruitful basis for statistical investigation. […] Significance tests do not provide the information that scientists need, and, furthermore, they are not the most effective method for analyzing and summarizing data." (Cherry A Clark, "Hypothesis Testing in Relation to Statistical Methodology", Review of Educational Research Vol. 33, 1963)

"Comparable objectives in data analysis are (l) to achieve more specific description of what is loosely known or suspected; (2) to find unanticipated aspects in the data, and to suggest unthought-of-models for the data's summarization and exposure; (3) to employ the data to assess the (always incomplete) adequacy of a contemplated model; (4) to provide both incentives and guidance for further analysis of the data; and (5) to keep the investigator usefully stimulated while he absorbs the feeling of his data and considers what to do next." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Data analysis must be iterative to be effective. [...] The iterative and interactive interplay of summarizing by fit and exposing by residuals is vital to effective data analysis. Summarizing and exposing are complementary and pervasive." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Summarizing data is a process of constrained and partial a process that essentially and inevitably corresponds to description - some sort of fitting, though it need not necessarily involve formal criteria or well-defined computations." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"[…] fitting lines to relationships between variables is often a useful and powerful method of summarizing a set of data. Regression analysis fits naturally with the development of causal explanations, simply because the research worker must, at a minimum, know what he or she is seeking to explain." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Fitting lines to relationships between variables is the major tool of data analysis. Fitted lines often effectively summarize the data and, by doing so, help communicate the analytic results to others. Estimating a fitted line is also the first step in squeezing further information from the data." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Modern data graphics can do much more than simply substitute for small statistical tables. At their best, graphics are instruments for reasoning about quantitative information. Often the most effective way to describe, explore, and summarize a set of numbers even a very large set - is to look at pictures of those numbers. Furthermore, of all methods for analyzing and communicating statistical information, well-designed data graphics are usually the simplest and at the same time the most powerful." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"Probabilities are summaries of knowledge that is left behind when information is transferred to a higher level of abstraction." (Judea Pearl, "Probabilistic Reasoning in Intelligent Systems: Network of Plausible, Inference", 1988)

"A good description of the data summarizes the systematic variation and leaves residuals that look structureless. That is, the residuals exhibit no patterns and have no exceptionally large values, or outliers. Any structure present in the residuals indicates an inadequate fit. Looking at the residuals laid out in an overlay helps to spot patterns and outliers and to associate them with their source in the data." (Christopher H Schrnid, "Value Splitting: Taking the Data Apart", 1991)

"The science of statistics may be described as exploring, analyzing and summarizing data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions." (Fergus Daly et al, "Elements of Statistics", 1995)

"Ockham's Razor in statistical analysis is used implicitly when models are embedded in richer models -for example, when testing the adequacy of a linear model by incorporating a quadratic term. If the coefficient of the quadratic term is not significant, it is dropped and the linear model is assumed to summarize the data adequately." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Every number has its limitations; every number is a product of choices that inevitably involve compromise. Statistics are intended to help us summarize, to get an overview of part of the world’s complexity. But some information is always sacrificed in the process of choosing what will be counted and how. Something is, in short, always missing. In evaluating statistics, we should not forget what has been lost, if only because this helps us understand what we still have." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Data often arrive in raw form, as long lists of numbers. In this case your job is to summarize the data in a way that captures its essence and conveys its meaning. This can be done numerically, with measures such as the average and standard deviation, or graphically. At other times you find data already in summarized form; in this case you must understand what the summary is telling, and what it is not telling, and then interpret the information for your readers or viewers." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"Whereas regression is about attempting to specify the underlying relationship that summarises a set of paired data, correlation is about assessing the strength of that relationship. Where there is a very close match between the scatter of points and the regression line, correlation is said to be 'strong' or 'high' . Where the points are widely scattered, the correlation is said to be 'weak' or 'low'." (Alan Graham, "Developing Thinking in Statistics", 2006)

"Graphical displays are often constructed to place principal focus on the individual observations in a dataset, and this is particularly helpful in identifying both the typical positions of data points and unusual or influential cases. However, in many investigations, principal interest lies in identifying the nature of underlying trends and relationships between variables, and so it is often helpful to enhance graphical displays in ways which give deeper insight into these features. This can be very beneficial both for small datasets, where variation can obscure underlying patterns, and large datasets, where the volume of data is so large that effective representation inevitably involves suitable summaries." (Adrian W Bowman, "Smoothing Techniques for Visualisation" [in "Handbook of Data Visualization"], 2008)

"In order to be effective a descriptive statistic has to make sense - it has to distill some essential characteristic of the data into a value that is both appropriate and understandable. […] the justification for computing any given statistic must come from the nature of the data themselves - it cannot come from the arithmetic, nor can it come from the statistic. If the data are a meaningless collection of values, then the summary statistics will also be meaningless - no arithmetic operation can magically create meaning out of nonsense. Therefore, the meaning of any statistic has to come from the context for the data, while the appropriateness of any statistic will depend upon the use we intend to make of that statistic." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"In general, when building statistical models, we must not forget that the aim is to understand something about the real world. Or predict, choose an action, make a decision, summarize evidence, and so on, but always about the real world, not an abstract mathematical world: our models are not the reality - a point well made by George Box in his oft-cited remark that "all models are wrong, but some are useful". (David Hand, "Wonderful examples, but let's not close our eyes", Statistical Science 29, 2014)

"Just as with aggregated data, an average is a summary statistic that can tell you something about the data - but it is only one metric, and oftentimes a deceiving one at that. By taking all of the data and boiling it down to one value, an average (and other summary statistics) may imply that all of the underlying data is the same, even when it’s not." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"Again, classical statistics only summarizes data, so it does not provide even a language for asking [a counterfactual] question. Causal inference provides a notation and, more importantly, offers a solution. As with predicting the effect of interventions [...], in many cases we can emulate human retrospective thinking with an algorithm that takes what we know about the observed world and produces an answer about the counterfactual world." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"[...] data often has some errors, outliers and other strange values, but these do not necessarily need to be individually identified and excluded. It also points to the benefits of using summary measures that are not unduly affected by odd observations [...] are known as robust measures, and include the median and the inter-quartile range." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"It is convenient to use a single number to summarize a steadily increasing or decreasing relationship between the pairs of numbers shown on a scatter-plot. This is generally chosen to be the Pearson correlation coefficient [...]. A Pearson correlation runs between −1 and 1, and expresses how close to a straight line the dots or data-points fall. A correlation of 1 occurs if all the points lie on a straight line going upwards, while a correlation of −1 occurs if all the points lie on a straight line going downwards. A correlation near 0 can come from a random scatter of points, or any other pattern in which there is no systematic trend upwards or downwards [...]." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"A data visualization, or dashboard, is great for summarizing or describing what has gone on in the past, but if people don’t know how to progress beyond looking just backwards on what has happened, then they cannot diagnose and find the ‘why’ behind it." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Visualisation is fundamentally limited by the number of pixels you can pump to a screen. If you have big data, you have way more data than pixels, so you have to summarise your data. Statistics gives you lots of really good tools for this." (Hadley Wickham)

