
12 December 2024

🧭💹Business Intelligence: Perspectives (Part XIX: Data Visualization between Art, Pragmatism and Kitsch)

Business Intelligence Series

The data visualizations (aka dataviz) presented in the media, especially the ones coming from graphical artists, have the power to help us develop what is called graphical intelligence, graphical culture, graphical sense, etc. Without a tutor-like experience, though, the process is suboptimal, because it depends on our ability to identify what is important, to follow the steps needed for decoding and interpreting such work, and to integrate their messages into our overall understanding of the world.

When such a skillset is lacking, and without explicit annotations or other forms of support, the reader might misinterpret or fail to observe important visual cues even in simple visualizations, with all the implications deriving from this – a false understanding, and further consequences built upon it, which is probably the most important aspect to consider. Unfortunately, even the most elaborate work can fail if the reader doesn't have a basic understanding of all that's implied in the process.

The books of Willard Brinton, Anna Rogers, Jacques Bertin, William Cleveland, Leland Wilkinson, Stephen Few, Alberto Cairo, Scott Berinato and many others can help readers build a general understanding of the dataviz process and of how data visualizations or simple graphics can be used effectively or misused, though each reader must follow his/her own journey. It's also true that the basics can be easily learned, though the deeper one dives, the more interesting and nontrivial the journey becomes. Fortunately, the average reader can stick to the basics, and many visualizations are simple enough to be understood.

To grasp the full extent of the implications, one can compare with the domain of poetry, where the author uses basic constructs like metaphors, comparisons, rhythm and epithets to create, communicate and imprint in the reader's mind old and new meanings, images and feelings altogether. Artistic data visualizations tend to carry a charge similar to poetry's, even if the impact might not appeal as much to our artistic sensibility. From this perspective, dataviz is, or at least resembles, an art form.

Many people can write verses, though only a fraction can write good, meaningful poetry, of which a smaller fraction becomes poems, and even fewer get published as books. Conversely, not everything can be expressed in verses, unless one finds good metaphors and other constructs that can be leveraged in the process. The same can be said about good dataviz.

One can argue that in dataviz the author can explore and learn especially by failing fast (seeing what works and what doesn't). One can also innovate, though the creator probably has a limited set of tools and rules for communication. Enabling readers to see the obvious or the hidden in complex visualizations or contexts requires skill and some kind of mastery of the visual form.

Therefore, dataviz must be more pragmatic and show the facts. In art one has the freedom to distort or move things around to create new meanings, while in dataviz it’s important for the meaning to be rooted in 'truth', at least by definition. The more the creator of a dataviz innovates, the higher the chances of being misunderstood. Moreover, readers need to be educated in interpreting the new meanings and get used to their continuous use.

Kitsch is a term applied to art and design that is perceived as naïve imitation, to the degree that it becomes a waste of resources even if somebody pays the price tag. There's a trend in dataviz to add elements to visualizations that don't bring any intrinsic value – images, colors and other elements can be misused to the degree that the result resembles kitsch, and the overall value of the visualization is diminished considerably.

19 March 2024

📊R Language: Drawing Function Plots (Part II - Basic Curves & Inflection Points)

For a previous post on inflection points I needed a few examples, so I wrote the code in the R language. Here's the final output:

Examples of Inflection Points

And, here's the code used to generate the above graphic:

par(mfrow = c(2,2)) #2x2 matrix display

# Example A: Inflection point with bifurcation
curve(x^3+20, -3,3, col = "black", main="(A) Inflection Point with Bifurcation")
curve(-x^2+20, 0, 3, add=TRUE, col="blue")
text (2, 10, "f(x)=-x^2+20, [0,3]", pos=1, offset = 1) #label the second curve
points(0, 20, col = "red", pch = 19) #inflection point 
text (0, 20, "inflection point", pos=1, offset = 1) #label inflection point


# Example B: Inflection point with Up & Down Concavity
curve(x^3-3*x^2-9*x+1, -3,6, main="(B) Inflection point with Up & Down Concavity")
points(1, -10, col = "red", pch = 19) #inflection point 
text (1, -10, "inflection point", pos=4, offset = 1) #label inflection point
text (-1, -10, "concave down", pos=3, offset = 1) 
text (-1, -10, "f''(x)<0", pos=1, offset = 0) 
text (2, 5, "concave up", pos=3, offset = 1)
text (2, 5, "f''(x)>0", pos=1, offset = 0) 


# Example C: Inflection point for multiple curves
curve(x^3-3*x+2, -3,3, col ="black", ylab="x^n-3*x+2, n = 2..5", main="(C) Inflection Point for Multiple Curves")
text (-3, -10, "n=3", pos=1) #label curve
curve(x^2-3*x+2,-3,3, add=TRUE, col="blue")
text (-2, 10, "n=2", pos=1) #label curve
curve(x^4-3*x+2,-3,3, add=TRUE, col="brown")
text (-1, 10, "n=4", pos=1) #label curve
curve(x^5-3*x+2,-3,3, add=TRUE, col="green")
text (-2, -10, "n=5", pos=1) #label curve
points(0, 2, col = "red", pch = 19) #inflection point 
text (0, 2, "inflection point", pos=4, offset = 1) #label inflection point
title("", line = -3, outer = TRUE)


# Example D: Inflection point with slow vs. fast change
curve(x^5-3*x+2,-3,3, col="black", ylab="x^n-3*x+2, n = 5,7,9", main="(D) Inflection Point with Slow vs. Fast Change")
text (-3, -100, "n=5", pos=1) #label curve
curve(x^7-3*x+2, add=TRUE, col="green")
text (-2.25, -100, "n=7", pos=1) #label curve
curve(x^9-3*x+2, add=TRUE, col="brown")
text (-1.5, -100, "n=9", pos=1) #label curve
points(0, 2, col = "red", pch = 19) #inflection point 
text (0, 2, "inflection point", pos=3, offset = 1) #label inflection point

mtext("© sql-troubles@blogspot.com @sql_troubles, 2024", side = 1, line = 4, adj = 1, col = "dodgerblue4", cex = .7)
#title("Examples of Inflection Points", line = -1, outer = TRUE)

Mathematically, an inflection point is a point on a smooth (plane) curve at which the curvature changes sign and where the second derivative is 0 [1]. The curvature intuitively measures the amount by which a curve deviates from being a straight line.

In example A, the main function has an inflection point, while the second function, defined only on the interval [0,3], represents a descending curve (aka bifurcation) for which the same point is a maximum point.

In example B, the function was chosen to include both a concave down section (for which the second derivative is negative) and a concave up section (for which the second derivative is positive). So, what comes after an inflection point is not necessarily a monotonically increasing function.
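As a side note, the second derivative needn't be derived by hand; here's a minimal sketch (not part of the original post) that uses R's D() for symbolic differentiation and uniroot() to locate the sign change for the function from example B:

f2 <- D(D(expression(x^3 - 3*x^2 - 9*x + 1), "x"), "x") #symbolic second derivative: 6*x - 6
g <- function(x) eval(f2, list(x = x)) #second derivative as a function
x0 <- uniroot(g, c(-3, 6))$root #root of f'' in [-3, 6], i.e. x = 1
c(g(x0 - 0.1), g(x0 + 0.1)) #opposite signs, so the curvature changes sign

The sign change of the second derivative around x0 confirms the inflection point at (1, -10).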

Example C depicts several functions that differ only in the power of the first term and share the same inflection point. One could also have shown only the behavior of the functions after the inflection point, keeping a single function before it (see example A).

Example D depicts the same family of functions as example C, though with higher powers; the curve for n=5 is kept as a basis for comparison. Apparently, around the inflection point the change seems small and linear, which is not the case. Both graphics are nonetheless correct: the y-axis range is set by the first curve plotted, n=5 in example D vs. n=3 in example C, and the steeper basis scales the graphic further away from the inflection point. If one adds n=3 as the first function in example D, the new chart will resemble C (see example E below). Unfortunately, this behavior can be misused to suggest that a function is nearly linear around the inflection point, which is not the case.

# Example E: Inflection Point with slow vs. fast change extended
curve(x^3-3*x+2,-3,3, col="black", ylab="x^n-3*x+2, n = 3,5,7,9", main="(E) Inflection Point with Slow vs. Fast Change")
text (-3, -10, "n=3", pos=1) #label curve
curve(x^5-3*x+2,-3,3, add=TRUE, col="brown")
text (-2, -10, "n=5", pos=1) #label curve
curve(x^7-3*x+2, add=TRUE, col="green")
text (-1.5, -10, "n=7", pos=1) #label curve
curve(x^9-3*x+2, add=TRUE, col="orange")
text (-1, -5, "n=9", pos=1) #label curve
points(0, 2, col = "red", pch = 19) #inflection point 
text (0, 2, "inflection point", pos=3, offset = 1) #label inflection point

Comments:
(1) I cheated a bit by calculating the second derivative manually, which is an easy task for polynomials. There seem to be methods for calculating the inflection point programmatically (see the numerical sketch above), though the focus was on providing the examples.
(2) Examples C and D could have been implemented as a loop, though the labels for each curve still need to be added individually. Here's the modified code using a loop:

# Example F: Inflection Point with slow vs. fast change with loop
n <- c(5, 7, 9)
color <- c("brown", "green", "orange")

curve(x^3-3*x+2, -3, 3, col="black", ylab="x^n-3*x+2, n = 3,5,7,9", main="(F) Inflection Point with Slow vs. Fast Change")
for (i in seq_along(n)) {
  ind <- n[i]
  curve(x^ind-3*x+2, -3, 3, add=TRUE, col=color[i])
}

text (-3, -10, "n=3", pos=1) #label curve
text (-2, -10, "n=5", pos=1) #label curve
text (-1.5, -10, "n=7", pos=1) #label curve
text (-1, -5, "n=9", pos=1) #label curve

Happy coding!


References:
[1] Wikipedia (2023) Inflection point (link)

11 February 2023

💎SQL Reloaded: Misusing Views in Serverless SQL Pool

The lack of user-defined tables in serverless SQL pools is probably one of the biggest gaps a developer needs to overcome in Azure Synapse. On the other hand, given that views are "virtual tables", they can be misused to store small chunks of data that don't change often.

Two activities are involved in the process: preparing the data and creating the view. Some developers might prefer preparing the data in Excel though, when possible, I prepare the data directly from the source system. For example, the first four columns from the below query can be used for the first approach, while the last column prepares the data in the form needed by a VALUES clause.

-- Product subcategories denormalized
SELECT PCT.ProductSubcategoryID
, PCT.ProductCategoryID
, PSC.Name Category
, PCT.Name Subcategory 
, CONCAT(', (''', PCT.ProductSubcategoryID, ''', ''', PCT.ProductCategoryID, ''', ''', PSC.Name, ''', ''', PCT.Name, ''')') [Values] 
FROM Production.ProductSubcategory PCT
     JOIN Production.ProductCategory PSC
	   ON PCT.ProductCategoryID = PSC.ProductCategoryID

Independently of the method used for data preparation, the view can be built as follows:

-- creating the view
CREATE VIEW dbo.ProductCategories
AS
-- Product categories
SELECT Cast(DAT.ProductSubcategoryID as int) ProductSubcategoryID
, Cast(DAT.ProductCategoryID as int) ProductCategoryID
, Cast(DAT.Category as nvarchar(255)) Category
, Cast(DAT.Subcategory  as nvarchar(255)) Subcategory
, Dense_Rank() OVER (PARTITION BY DAT.Category, DAT.Subcategory 
         ORDER BY DAT.ProductSubcategoryID, DAT.ProductCategoryID) Ranking
FROM (-- prepared data
VALUES ('1', '1', 'Bikes', 'Mountain Bikes')
, ('2', '1', 'Bikes', 'Road Bikes')
, ('3', '1', 'Bikes', 'Touring Bikes')
, ('4', '2', 'Components', 'Handlebars')
, ('5', '2', 'Components', 'Bottom Brackets')
, ('6', '2', 'Components', 'Brakes')
) DAT(ProductSubcategoryID, ProductCategoryID, Category, Subcategory)

-- testing the view
SELECT *
FROM dbo.ProductCategories

Observe that a data type was defined explicitly for each column; even if the definition might change, this is still a better idea than relying on the database engine to infer the data types. Secondly, I prefer to include a "Ranking" column based on the attributes expected to be unique over the whole dataset, just in case a duplicate makes its way into the dataset. This is not a must, but something to consider in exceptional cases.
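If a Category/Subcategory combination appears more than once, the duplicates surface with a Ranking greater than 1. A quick check (not part of the original post):

-- checking for duplicates via the Ranking column
SELECT *
FROM dbo.ProductCategories
WHERE Ranking > 1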

The alternative would be to save the same data into a file and make it available via CETAS (CREATE EXTERNAL TABLE AS SELECT). For small datasets the overhead can be minimal, and it can be a better idea if users need to modify the data manually. On the other hand, storing the definition of a view in an SQL file and making corrections as needed can prove faster for a developer, at least until the dataset stabilizes and fewer changes are needed.
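For comparison, a minimal CETAS sketch might look as follows, assuming that the external data source and file format (here MyDataLake and ParquetFileFormat, both illustrative names) were already defined:

-- CETAS-based alternative (illustrative names)
CREATE EXTERNAL TABLE dbo.ProductCategoriesExt
WITH (
    LOCATION = 'ref/product_categories/' -- folder in the storage account
  , DATA_SOURCE = MyDataLake -- existing external data source
  , FILE_FORMAT = ParquetFileFormat -- existing external file format
)
AS
SELECT *
FROM dbo.ProductCategories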

It's important to understand that a feature is misused in this way and there are likely some penalties deriving from it. As mentioned in the beginning, the dataset must be small (e.g. a few hundred records at most) and change seldom. Above all, it's not advisable to build an entire solution based on this!

One scenario in which I had to use this option was creating a dataset based on recursive logic built inside a stored procedure, which would output the data in the form needed by the view, as done above. This was used as a workaround, as I couldn't save the data via a pipeline. (I had a déjà vu of the first attempts at exporting data in SSIS 2005!)

In theory the maintenance of such a view can be automated via a pipeline, if the effort makes sense, and the solution is stable enough. 

I blogged some years back on Misusing Views and Pseudo-Constants.

Happy coding!

26 November 2018

🔭Data Science: Lying with Statistics (Just the Quotes)

"Thus the alteration of the truth which is already manifesting itself in the progressive form of lying and perjury, offers us, in the superlative, the statistics." (François Magendie, 18th century) 

"An old jest runs to the effect that there are three degrees of comparison among liars. There are liars, there are outrageous liars, and there are scientific experts. This has lately been adapted to throw dirt upon statistics. There are three degrees of comparison, it is said, in lying. There are lies, there are outrageous lies, and there are statistics." (Robert Giffen, Economic Journal 2 (6), 1892)

"Professor [Joseph] Munro reminded him of an old saying which he rather reluctantly proposed, in that company, to repeat. It was to the effect that there were three gradations of inveracity - there were lies, there were d-d lies, and there were statistics." (Arthur J Balfour, [in Manchester Guardian] 1892)

"Columns of figures are hurled about in the papers, and demonstrate the justice of the witty claim that there are three kinds of untruth : fibs, lies, and statistics." (Herbert B Workman, "The principles of the Gothenburg system", Wesleyan-Methodist Magazine 118, 1895)

"After all, facts are facts, and although we may quote one to another with a chuckle the words of the Wise Statesman, 'Lies - damn lies - and statistics', still there are some easy figures the simplest must understand, and the astutest cannot wriggle out of." (Leonard H. Courtney, [speech] 1895)

"There are three kinds of lies - lies, damned lies and statistics." (Carroll D Wright, New York Times, 1896) 

"Figures often beguile me, particularly when I have the arranging of them myself; in which case the remark attributed to Disraeli would often apply with justice and force: “There are three kinds of lies: lies, damned lies, and statistics." (Mark Twain, [in "Mark Twain’s Autobiography" Vol I, 1904])

"Figures may not lie, but statistics compiled unscientifically and analyzed incompetently are almost sure to be misleading, and when this condition is unnecessarily chronic the so-called statisticians may be called liars." (Edwin B Wilson, "Bulletin of the American Mathematical Society", Vol 18, 1912)

"In earlier times they had no statistics and so they had to fall back on lies. Hence the huge exaggerations of primitive literature, giants, miracles, wonders! It's the size that counts. They did it with lies and we do it with statistics: but it's all the same." (Stephen Leacock, "Model memoirs and other sketches from simple to serious", 1939)

"It has long been recognized by public men of all kinds […] that statistics come under the head of lying, and that no lie is so false or inconclusive as that which is based on statistics." (Hilaire Belloc, "The Silence of the Sea", 1940)

"Many people use statistics as a drunkard uses a street lamp - for support rather than illumination. It is not enough to avoid outright falsehood; one must be on the alert to detect possible distortion of truth. One can hardly pick up a newspaper without seeing some sensational headline based on scanty or doubtful data." (Anna C Rogers, "Graphic Charts Handbook", 1961)

"Just like the spoken or written word, statistics and graphs can lie. They can lie by not telling the full story. They can lead to wrong conclusions by omitting some of the important facts. [...] Always look at statistics with a critical eye, and you will not be the victim of misleading information." (Dyno Lowenstein, "Graphs", 1976)

"For many people the first word that comes to mind when they think about statistical charts is 'lie'. No doubt some graphics do distort the underlying data, making it hard for the viewer to learn the truth. But data graphics are no different from words in this regard, for any means of communication can be used to deceive. There is no reason to believe that graphics are especially vulnerable to exploitation by liars; in fact, most of us have pretty good graphical lie detectors that help us see right through frauds." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"The conditions under which many data graphics are produced - the lack of substantive and quantitative skills of the illustrators, dislike of quantitative evidence, and contempt for the intelligence of the audience-guarantee graphic mediocrity. These conditions engender graphics that (1) lie; (2) employ only the simplest designs, often unstandardized time-series based on a small handful of data points; and (3) miss the real news actually in the data." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"Fairy tales lie just as much as statistics do, but sometimes you can find a grain of truth in them." (Sergei Lukyanenko, "The Night Watch", 1998)

"While some social problems statistics are deliberate deceptions, many - probably the great majority - of bad statistics are the result of confusion, incompetence, innumeracy, or selective, self-righteous efforts to produce numbers that reaffirm principles and interests that their advocates consider just and right. The best response to stat wars is not to try and guess who's lying or, worse, simply to assume that the people we disagree with are the ones telling lies. Rather, we need to watch for the standard causes of bad statistics - guessing, questionable definitions or methods, mutant numbers, and inappropriate comparisons." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Believe it or not, it’s easy to make statistics lie. It’s called massaging the facts, and people do it all the time. […] To avoid this, graphics reporters should develop a keen eye for spotting problems with statistics in order to avoid the embarrassment and possible liability of reporting incorrect information." (Jennifer George-Palilonis," A Practical Guide to Graphics Reporting: Information Graphics for Print, Web & Broadcast", 2006)

"Another way to obscure the truth is to hide it with relative numbers. […] Relative scales are always given as percentages or proportions. An increase or decrease of a given percentage only tells us part of the story, however. We are missing the anchoring of absolute values." (Brian Suda, "A Practical Guide to Designing with Data", 2010)

"One way a chart can lie is through overemphasis of the size and scale of items, particularly when the dimension of depth isnʼt considered." (Brian Suda, "A Practical Guide to Designing with Data", 2010)

"I believe that the backlash against statistics is due to four primary reasons. The first, and easiest for most people to relate to, is that even the most basic concepts of descriptive and inferential statistics can be difficult to grasp and even harder to explain. […] The second cause for vitriol is that even well-intentioned experts misapply the tools and techniques of statistics far too often, myself included. Statistical pitfalls are numerous and tough to avoid. When we can't trust the experts to get it right, there's a temptation to throw the baby out with the bathwater. The third reason behind all the hate is that those with an agenda can easily craft statistics to lie when they communicate with us  […] And finally, the fourth cause is that often statistics can be perceived as cold and detached, and they can fail to communicate the human element of an issue." (Ben Jones, "Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations", 2020)

"It is easy to lie with statistics. It is hard to tell the truth without it." (Andrejs Dunkels)

02 November 2018

Data Science: Torturing the Data in Statistics

Statistics, through its methods, techniques and models rooted in mathematical reasoning, allows exploring, analyzing and summarizing a given set of data, and is used to support decision-making, experiments, theories and, ultimately, to gain and communicate insights. When used adequately, statistics can prove to be a useful toolset; however, as soon as its use deviates from the mathematical rigor and principles on which it was built, it can easily be misused. Moreover, the results obtained with the help of statistics can easily be distorted in communication, even when the statistical results are valid.

The ease with which statistics can be misused is probably best reflected in sayings like 'if you torture the data long enough, it will confess'. The formulation is attributed by several sources to the economist Ronald H Coase, though according to Coase the reference he made in the 1960s was slightly different: 'if you torture the data enough, nature will always confess' (see [1]). The latter formulation is not necessarily negative if one considers the persistence needed by researchers to reveal nature's secrets; the former, by contrast, seems to stress only the negative aspect.

The word 'torture' seems to be used instead of 'abuse' because, metaphorically, it has more weight: it draws attention and sticks with the reader or audience. As the Quote Investigator remarks [1], 'torturing the data' was employed as a metaphor much earlier. For example, a 1933 article contains the following passage:

"The evidence submitted by the committee from its own questionnaire warrants no such conclusion. To torture the data given in Table I into evidence supporting a twelve-hour minimum of professional training is indeed a statistical feat, but one which the committee accomplishes to its own satisfaction." ("The Elementary School Journal" Vol. 33 (7), 1933)

More than a decade earlier, in a context similar to Coase's quote, John Dewey remarked:

"Active experimentation must force the apparent facts of nature into forms different to those in which they familiarly present themselves; and thus make them tell the truth about themselves, as torture may compel an unwilling witness to reveal what he has been concealing." (John Dewey, "Reconstruction in Philosophy", 1920)

Torture was used metaphorically as early as the 1600s, as the following quote from Sir Francis Bacon's 'Advancement of Learning' shows:

"Another diversity of Methods is according to the subject or matter which is handled; for there is a great difference in delivery of the Mathematics, which are the most abstracted of knowledges, and Policy, which is the most immersed […], yet we see how that opinion, besides the weakness of it, hath been of ill desert towards learning, as that which taketh the way to reduce learning to certain empty and barren generalities; being but the very husks and shells of sciences, all the kernel being forced out and expulsed with the torture and press of the method." (Sir Francis Bacon, Advancement of Learning, 1605)

However, a similar metaphor with a closer meaning can be found almost two centuries later:

"One very reprehensible mode of theory-making consists, after honest deductions from a few facts have been made, in torturing other facts to suit the end proposed, in omitting some, and in making use of any authority that may lend assistance to the object desired; while all those which militate against it are carefully put on one side or doubted." (Henry De la Beche, "Sections and Views, Illustrative of Geological Phaenomena", 1830)

The following quote from Goethe probably also deserves some attention:

"Someday someone will write a pathology of experimental physics and bring to light all those swindles which subvert our reason, beguile our judgement and, what is worse, stand in the way of any practical progress. The phenomena must be freed once and for all from their grim torture chamber of empiricism, mechanism, and dogmatism; they must be brought before the jury of man's common sense." (Johann Wolfgang von Goethe)

Alternatives to Coase’s formulation were used in several later sources, replacing 'data' with 'statistics' or 'numbers':

"Beware of the problem of testing too many hypotheses; the more you torture the data, the more likely they are to confess, but confessions obtained under duress may not be admissible in the court of scientific opinion." (Stephen M Stigler, "Neutral Models in Biology", 1987)

"Torture numbers, and they will confess to anything." (Gregg Easterbrook, New Republic, 1989)

"[…] an honest exploratory study should indicate how many comparisons were made […] most experts agree that large numbers of comparisons will produce apparently statistically significant findings that are actually due to chance. The data torturer will act as if every positive result confirmed a major hypothesis. The honest investigator will limit the study to focused questions, all of which make biologic sense. The cautious reader should look at the number of ‘significant’ results in the context of how many comparisons were made." (James L Mills, "Data torturing", New England Journal of Medicine, 1993)

"This is true only if you torture the statistics until they produce the confession you want." (Larry Schweikart, "Myths of the 1980s Distort Debate over Tax Cuts", 2001) [source

"Even properly done statistics can’t be trusted. The plethora of available statistical techniques and analyses grants researchers an enormous amount of freedom when analyzing their data, and it is trivially easy to ‘torture the data until it confesses’." (Alex Reinhart, "Statistics Done Wrong: The Woefully Complete Guide", 2015)

There is also a psychological component to the torturing of data or facts to fit reality, a tendency derived from the way the human mind works, and from the limits and fallacies associated with the mind's workings.

"What are the models? Well, the first rule is that you’ve got to have multiple models - because if you just have one or two that you’re using, the nature of human psychology is such that you’ll torture reality so that it fits your models, or at least you’ll think it does." (Charles Munger, 1994)

Independently of the formulation and context used, the fact remains: statistics (aka data, numbers) can be easily abused, and the reader/audience should be aware of it!

Previously published on quotablemath.blogspot.com.

24 February 2018

💎SQL Reloaded: Misusing Views and Pseudo-Constants

Views, as virtual tables, can be misused to replace tables in certain circumstances by storing values within one or multiple rows, as in the examples below:

-- parameters for a BI solution
CREATE VIEW dbo.vLoV_Parameters
AS
SELECT Cast('ABC' as nvarchar(20)) AS DataAreaId
 , Cast(GetDate() as Date) AS CurrentDate 
 , Cast(100 as int) AS BatchCount 

GO

SELECT *
FROM dbo.vLoV_Parameters

GO

-- values for a dropdown 
 CREATE VIEW dbo.vLoV_DataAreas
 AS
 SELECT Cast('ABC' as nvarchar(20)) AS DataAreaId
 , Cast('Company ABC' as nvarchar(50)) AS Description 
 UNION ALL
 SELECT 'XYZ' DataAreaId 
 , 'Company XYZ'

GO

SELECT *
FROM dbo.vLoV_DataAreas

GO

These solutions aren't elegant and typically aren't recommended, because they go against one of the principles of good database design, namely "data belong in tables", though they do the trick when needed. Personally, I used them only in a handful of cases, e.g. when it wasn't allowed to create tables, when I needed to test something for a short period of time, or when there was some overhead in creating a table for 2-3 values. Because of their scarce use, I hadn't given them much thought until I discovered Jared Ko's blog posting on pseudo-constants. He considers the values from the first view as pseudo-constants, and advocates for their use especially for easier dependency tracking, easier code refactoring, avoiding implicit data conversion and easier maintenance of values.

All these are good reasons to consider them, therefore I tried to take the idea further to see if it survives a reality check. For this I took Dynamics AX as a testing environment, as it makes extensive use of enumerations (aka base enums) to store lists of values needed all over the application. Behind each table there are one or more enumerations, the tables storing master data abounding in them. For exemplification, let's consider InventTrans, the table that stores the inventory transactions, whose receipt and issue transactions are governed by three enumerations: StatusIssue, StatusReceipt and Direction.

-- Status Issue Enumeration 
 CREATE VIEW dbo.vLoV_StatusIssue
 AS
 SELECT cast(0 as int) AS None
 , cast(1 as int) AS Sold
 , cast(2 as int) AS Deducted
 , cast(3 as int) AS Picked
 , cast(4 as int) AS ReservPhysical
 , cast(5 as int) AS ReservOrdered
 , cast(6 as int) AS OnOrder
 , cast(7 as int) AS QuotationIssue

GO

-- Status Receipt Enumeration 
 CREATE VIEW dbo.vLoV_StatusReceipt
 AS
SELECT cast(0 as int) AS None
 , cast(1 as int) AS Purchased
 , cast(2 as int) AS Received
 , cast(3 as int) AS Registered
 , cast(4 as int) AS Arrived
 , cast(5 as int) AS Ordered
 , cast(6 as int) AS QuotationReceipt

GO

-- Inventory Direction Enumeration 
 CREATE VIEW dbo.vLoV_InventDirection
 AS
 SELECT cast(0 as int) AS None
 , cast(1 as int) AS Receipt
 , cast(2 as int) AS Issue

To see these views at work, let's construct the InventTrans table on the fly:

-- creating an ad-hoc table  
 SELECT *
 INTO  dbo.InventTrans
 FROM (VALUES (1, 1, 0, 2, -1, 'A0001')
 , (2, 1, 0, 2, -10, 'A0002')
 , (3, 2, 0, 2, -6, 'A0001')
 , (4, 2, 0, 2, -3, 'A0002')
 , (5, 3, 0, 2, -2, 'A0001')
 , (6, 1, 0, 1, 1, 'A0001')
 , (7, 0, 1, 1, 50, 'A0001')
 , (8, 0, 2, 1, 100, 'A0002')
 , (9, 0, 3, 1, 30, 'A0003')
 , (10, 0, 3, 1, 20, 'A0004')
 , (11, 0, 1, 2, 10, 'A0001')
 ) A(TransId, StatusIssue, StatusReceipt, Direction, Qty, ItemId)


 Here are two sets of examples using literals vs. pseudo-constants:

--example issued with literals 
SELECT top 100 ITR.*
 FROM dbo.InventTrans ITR
 WHERE ITR.StatusIssue = 1 
   AND ITR.Direction = 2

GO
 --example issued with pseudo-constants
 SELECT top 100 ITR.*
 FROM dbo.InventTrans ITR
      JOIN dbo.vLoV_StatusIssue SI
        ON ITR.StatusIssue = SI.Sold
      JOIN dbo.vLoV_InventDirection ID
        ON ITR.Direction = ID.Issue

GO

--example receipt with literals 
 SELECT top 100 ITR.*
 FROM dbo.InventTrans ITR
 WHERE ITR.StatusReceipt= 1
   AND ITR.Direction = 1

GO

--example receipt with pseudo-constants
 SELECT top 100 ITR.*
 FROM dbo.InventTrans ITR
      JOIN dbo.vLoV_StatusReceipt SR
        ON ITR.StatusReceipt= SR.Purchased
      JOIN dbo.vLoV_InventDirection ID
        ON ITR.Direction = ID.Receipt

 
As can be seen, the queries using pseudo-constants make the code somewhat more readable, though the gain is only relative, as each enumeration implies an additional join. In addition, when further business tables are added (e.g. items, purchase or sales orders), the logic becomes more complicated, making it more difficult to separate the essential from the nonessential. Imagine a translation of the following query:

-- complex query 
  SELECT top 100 ITR.*
  FROM dbo.InventTrans ITR
              <several tables here>
  WHERE ((ITR.StatusReceipt<=3 AND ITR.Direction = 1)
    OR (ITR.StatusIssue<=3 AND ITR.Direction = 2))
    AND (<more constraints here>)


The more difficult the constraints in the WHERE clause, the more improbable a translation of the literals into pseudo-constants becomes. Considering that an average query contains 5-10 tables, each of them with 1-3 enumerations, the queries would become impracticable with pseudo-constants, and their execution plans quite difficult to troubleshoot.

The more I think about it, the more an enumeration data type as a global variable in SQL Server (like the ones available in VB) would be welcome, especially because the same values are used over and over again throughout the queries. Imagine, for example, the possibility of writing code as follows:

-- hypothetical query
SELECT top 100 ITR.*
FROM dbo.InventTrans ITR
WHERE ITR.StatusReceipt = @@StatusReceipt.Purchased
  AND ITR.Direction = @@InventDirection.Receipt

From my point of view this would make the code more readable and easier to maintain. Instead, in order to make the code more readable, one is usually forced to add comments in the code. This works as well, though the code can become full of comments.

-- query with commented literals
SELECT top 100 ITR.*
FROM dbo.InventTrans ITR
WHERE ITR.StatusReceipt <= 3 -- Purchased, Received, Registered
   AND ITR.Direction = 1 -- Receipt

In conclusion, pseudo-constants' usefulness is limited, and their usage goes against developers' common sense; however, a data type in SQL Server with similar functionality would make code more readable and easier to maintain.


PS: It is possible to simulate an enumeration data type in a table's definition by using a CHECK constraint.
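For example, the Direction enumeration could be enforced at table level as follows (a sketch; the table name is illustrative):

-- simulating an enumeration via a CHECK constraint
CREATE TABLE dbo.InventTransChecked (
    TransId int NOT NULL
  , Direction int NOT NULL
    CONSTRAINT CK_InventTransChecked_Direction CHECK (Direction IN (0, 1, 2)) -- None, Receipt, Issue
)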

02 April 2006

🖍️Herbert F Spirer - Collected Quotes

"Clearly, the mean is greatly influenced by extreme values, but it can be appropriate for many situations where extreme values do not arise. To avoid misuse, it is essential to know which summary measure best reflects the data and to use it carefully. Understanding the situation is necessary for making the right choice. Know the subject!" (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"'Garbage in, garbage out' is a sound warning for those in the computer field; it is every bit as sound in the use of statistics. Even if the “garbage” which comes out leads to a correct conclusion, this conclusion is still tainted, as it cannot be supported by logical reasoning. Therefore, it is a misuse of statistics. But obtaining a correct conclusion from faulty data is the exception, not the rule. Bad basic data (the 'garbage in') almost always leads to incorrect conclusions (the 'garbage out'). Unfortunately, incorrect conclusions can lead to bad policy or harmful actions." (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"Graphic misrepresentation is a frequent misuse in presentations to the nonprofessional. The granddaddy of all graphical offenses is to omit the zero on the vertical axis. As a consequence, the chart is often interpreted as if its bottom axis were zero, even though it may be far removed. This can lead to attention-getting headlines about 'a soar' or 'a dramatic rise (or fall)'. A modest, and possibly insignificant, change is amplified into a disastrous or inspirational trend." (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"If you want to show the growth of numbers which tend to grow by percentages, plot them on a logarithmic vertical scale. When plotted against a logarithmic vertical axis, equal percentage changes take up equal distances on the vertical axis. Thus, a constant annual percentage rate of change will plot as a straight line. The vertical scale on a logarithmic chart does not start at zero, as it shows the ratio of values (in this case, land values), and dividing by zero is impossible." (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"In analyzing data, more is not necessarily better. Unfortunately, it is not always possible to have one uniquely correct procedure for analyzing a given data set. An investigator may use several different methods of statistical analysis on a data set. Furthermore, different outcomes may result from the use of different analytical methods. If more than one conclusion results, then an investigator is committing a misuse of statistics unless the investigator shows and reconciles all the results. If the investigator shows only one conclusion or interpretation, ignoring the alternative procedure(s), the work is a misuse of statistics." (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"It is a consequence of the definition of the arithmetic mean that the mean will lie somewhere between the lowest and highest values. In the unrealistic and meaningless case that all values which make up the mean are the same, all values will be equal to the average. In an unlikely and impractical case, it is possible for only one of many values to be above or below the average. By the very definition of the average, it is impossible for all values to be above average in any case." (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"It is a major statistical sin to show a graph displaying a variable as a function of time with the vertical (left-hand) scale cut short so that it does not go down to zero, without drawing attention to this fact. This sin can create a seriously misleading impression, and, as they do with most sins, sinners commit it again and again." (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"It is a misuse of statistics to use whichever set of statistics suits the purpose at hand and ignore the conflicting sets and the implications of the conflicts." (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"Jargon and complex methodology have their place. But true professional jargon is merely a shorthand way of speaking. Distrust any jargon that cannot be translated into plain English. Sophisticated methods can bring unique insights, but they can also be used to cover inadequate data and thinking. Good analysts can explain their methods in simple, direct terms. Distrust anyone who can't make clear how they have treated the data." (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"Know the subject matter, learn it fast, or get a trustworthy expert. To identify the unknown, you must know the known. But don't be afraid to challenge experts on the basis of your logical reasoning. Sometimes a knowledge of the subject matter can blind the expert to the novel or unexpected." (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"Percentages seem to invite misuse, perhaps because they require such careful thinking." (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"There is no shortage of statistical methods. Elementary statistics textbooks list dozens, and statisticians constantly develop and report new ones. But if a researcher uses the wrong method, a clear misuse, to analyze a specific set of data, then the results may be incorrect." (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)

"When an analyst selects the wrong tool, this is a misuse which usually leads to invalid conclusions. Incorrect use of even a tool as simple as the mean can lead to serious misuses. […] But all statisticians know that more complex tools do not guarantee an analysis free of misuses. Vigilance is required on every statistical level."  (Herbert F Spirer et al, "Misused Statistics" 2nd Ed, 1998)
