SQL Troubles: comparison

Showing posts with label comparison. Show all posts

03 May 2025

📊Graphical Representation: Graphics We Live By (Part XI: Comparisons Between Data Series)

Graphical Representation Series

Over the past 10-20 years it became so easy to create data visualizations just by dropping some of the data available into a tool like Excel and providing a visual depiction of it with just a few clicks. In many cases, the first draft, typically provided by default in the tool used, doesn't even need further work as the objective was reached, while in others the creator must have a minimum skillset for making the visualization useful, appealing, or whatever quality is a final requirement for the work in scope. However, the audience might judge the visualization(s) from different perspectives, and there can be a broad audience with different skills in reading, evaluating and understanding the work.

There are many depictions on the web resembling the one below, taken from a LinkedIn post:

Example Chart - Boing vs. Airbus

Even if the visualization is not perfect, it does a fair job in representing the data. Improvements can be made in the areas of labels, the title and positioning of elements, and the color palette used. At least these were the improvements made in the original post. It must be differentiated also between the environment in which the charts are made available, the print format having different characteristics than the ones in business setups. Unfortunately, the requirements of the two are widely confused, probably also because of the overlapping of the mediums used.

Probably, it's a good idea to always start with the row data (or summaries of it) when the result consists of only a few data points that can be easily displayed in a table like the one below (the feature to round the decimals for integer values should be available soon in Power BI):

Summary Table

Of course, one can calculate more meaningful values like percentages from the total, standard deviations and other values that offer more perspectives into the data. Even if the values adequately reflect the reality, the reader can but wonder about the local and global minimal/maximal values, without talking much about the meaning of data points, which is easily identifiable in a chart. At least in the case of small data sets, using a table in combination with a chart can provide a more complete perspective and different ways of analyzing the data, especially when the navigation is interactive.

Column and bar charts do a fair job in comparing values over time, though they do use a lot of ink in the process (see D). While they make it easy to compare neighboring values, the rectangles used tend to occupy a lot of space when they are made too wide or too high to cover the empty space within the display (e.g. when just a few values are displayed, space being wasted in the process). As the main downside, it takes a lot of scanning until the reader identifies the overall trends, and the further away the bars are from each other, the more difficult it becomes to do comparisons.

In theory, line charts are more efficient in representing the above data points, because the marks are usually small and the line thin enough to provide a better data-ink ratio, while one can see a lot at a glance. In Power BI the creator can use different types of interpolation: linear (A), step (B) or smooth (C). In many cases, it might be a good idea to use a linear interpolation, though when there are no or minimal overlapping, it might be worthwhile to explore the other types if interpolation too (and further request feedback from the users):

Linear, Step and Smooth Line Charts

The nearness of values from different series can raise difficulties in identifying adequately the points, respectively delimiting the lines (see B).When the density of values allows it, it makes sense also to include the averages for each data series to reflect the distance between the two data sets. Unfortunately, the chart can get crowded if further data series or summaries are added to the cart(s).

If the column chart (E) is close to the redesigned chart provided in the original redesign, the other alternatives can provide upon case more value. Stacked column charts (D) allow also to compare the overall quantity by month, area charts (F) tend to use even more color than needed, while water charts (G) allow to compare the difference between data points per time unit. Tornado charts (H) are a variation of bar charts, allowing easier comparing of the size of the bars, while ribbon charts (I) show well the stacking values.

Alternatives to Line Charts

One should consider changing the subtitle(s) slightly to reflect the chart type when the patterns shown imply a shift in attention or meaning. Upon case, more that one of the above charts can be used within the same report when two or more perspectives are important. Using a complementary perspective can facilitate data's understanding or of identifying certain patterns that aren't easily identifiable otherwise.

In general, the graphics creators try to use various representational means of facilitating a data set's understanding, though seldom only two series or a small subset of dimensions provide a complete description. The value of data comes when multiple perspectives are combined. Frankly, the same can be said about the above data series. Yes, there are important differences between the two series, though how do the numbers compare when one looks at the bigger picture, especially when broken down on element types (e.g. airplane size). How about plan vs. actual values, how long does it take more for production or other processes? It's one of a visualization's goals to improve the questions posed, but how efficient are visualizations that barely scratch the surface?

In what concerns the code, the following scripts can be used to prepare the data:

-- Power Query script (Boeing vs Airbus)
= let
    Source = let
    Source = #table({"Sorting", "Month Name", "Serial Date", "Boeing Deliveries", "Airbus Deliveries"},
    {
        {1, "Oct", #date(2023, 10, 31), 30, 50},
        {2, "Nov", #date(2023, 11, 30), 40, 40},
        {3, "Dec", #date(2023, 12, 31), 40, 110},
        {4, "Jan", #date(2024, 1, 31), 20, 30},
        {5, "Feb", #date(2024, 2, 29), 30, 40},  // Leap year adjustment
        {6, "Mar", #date(2024, 3, 31), 30, 60},
        {7, "Apr", #date(2024, 4, 30), 40, 60},
        {8, "May", #date(2024, 5, 31), 40, 50},
        {9, "Jun", #date(2024, 6, 30), 50, 80},
        {10, "Jul", #date(2024, 7, 31), 40, 90},
        {11, "Aug", #date(2024, 8, 31), 40, 50},
        {12, "Sep", #date(2024, 9, 30), 30, 50}
    }
    ),
    #"Changed Types" = Table.TransformColumnTypes(Source, {{"Sorting", Int64.Type}, {"Serial Date", type date}, {"Boeing Deliveries", Int64.Type}, {"Airbus Deliveries", Int64.Type}})
in
    #"Changed Types"
in
    Source

It can be useful to create the labels for the charts dynamically:

-- DAX code for labels
MaxDate = Format(Max('Boeing vs Airbus'[Serial Date]),"MMM-YYYY")
MinDate = FORMAT (Min('Boeing vs Airbus'[Serial Date]),"MMM-YYYY")
MinMaxDate = [MinDate] & " to " & [MaxDate]
Title Boing Airbus = "Boing and Airbus Deliveries " & [MinMaxDate]

Happy coding!

Previous Post <<||>> Next Post

18 December 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part VII: Data Stores Comparison)

Business Intelligence Series

Microsoft made available a reference guide for the data stores supported for Microsoft Fabric workloads [1], including the new Fabric SQL database (see previous post). Here's the consolidated table followed by a few aspects to consider:

Area	Lakehouse	Warehouse	Eventhouse	Fabric SQL database	Power BI Datamart
Data volume	Unlimited	Unlimited	Unlimited	4 TB	Up to 100 GB
Type of data	Unstructured, semi-structured, structured	Structured, semi-structured (JSON)	Unstructured, semi-structured, structured	Structured, semi-structured, unstructured	Structured
Primary developer persona	Data engineer, data scientist	Data warehouse developer, data architect, data engineer, database developer	App developer, data scientist, data engineer	AI developer, App developer, database developer, DB admin	Data scientist, data analyst
Primary dev skill	Spark (Scala, PySpark, Spark SQL, R)	SQL	No code, KQL, SQL	SQL	No code, SQL
Data organized by	Folders and files, databases, and tables	Databases, schemas, and tables	Databases, schemas, and tables	Databases, schemas, tables	Database, tables, queries
Read operations	Spark, T-SQL	T-SQL, Spark*	KQL, T-SQL, Spark	T-SQL	Spark, T-SQL
Write operations	Spark (Scala, PySpark, Spark SQL, R)	T-SQL	KQL, Spark, connector ecosystem	T-SQL	Dataflows, T-SQL
Multi-table transactions	No	Yes	Yes, for multi-table ingestion	Yes, full ACID compliance	No
Primary development interface	Spark notebooks, Spark job definitions	SQL scripts	KQL Queryset, KQL Database	SQL scripts	Power BI
Security	RLS, CLS**, table level (T-SQL), none for Spark	Object level, RLS, CLS, DDL/DML, dynamic data masking	RLS	Object level, RLS, CLS, DDL/DML, dynamic data masking	Built-in RLS editor
Access data via shortcuts	Yes	Yes	Yes	Yes	No
Can be a source for shortcuts	Yes (files and tables)	Yes (tables)	Yes	Yes (tables)	No
Query across items	Yes	Yes	Yes	Yes	No
Advanced analytics	Interface for large-scale data processing, built-in data parallelism, and fault tolerance	Interface for large-scale data processing, built-in data parallelism, and fault tolerance	Time Series native elements, full geo-spatial and query capabilities	T-SQL analytical capabilities, data replicated to delta parquet in OneLake for analytics	Interface for data processing with automated performance tuning
Advanced formatting support	Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format	Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format	Full indexing for free text and semi-structured data like JSON	Table support for OLTP, JSON, vector, graph, XML, spatial, key-value	Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format
Ingestion latency	Available instantly for querying	Available instantly for querying	Queued ingestion, streaming ingestion has a couple of seconds latency	Available instantly for querying	Available instantly for querying

It can be used as a map for what is needed to know for using each feature, respectively to identify how one can use the previous experience, and here I'm referring to the many SQL developers. One must consider also the capabilities and limitations of each storage repository.

However, what I'm missing is some references regarding the performance for data access, especially compared with on-premise workloads. Moreover, the devil hides in details, therefore one must test thoroughly before committing to any of the above choices. For the newest overview please check the referenced documentation!

For lakehouses, the hardest limitation is the lack of multi-table transactions, though that's understandable given its scope. However, probably the most important aspect is whether it can scale with the volume of reads/writes as currently the SQL endpoint seems to lag.

The warehouse seems to be more versatile, though careful attention needs to be given to its design.

The Eventhouse opens the door to a wide range of time-based scenarios, though it will be interesting how developers cope with its lack of functionality in some areas.

Fabric SQL databases are a new addition, and hopefully they'll allow considering a wide range of OLTP scenarios. Starting with 28th of March 2025, SQL databases will be ON by default and tenant admins must manually turn them OFF before the respective date [3].

Power BI datamarts have been in preview for a couple of years.

Previous Post <<||>> Next Post

References:
[1] Microsoft Fabric (2024) Microsoft Fabric decision guide: choose a data store [link]

[2] Reitse's blog (2024) Testing Microsoft Fabric Capacity: Data Warehouse vs Lakehouse Performance [link]

[3] Microsoft Fabric Update Blog (2025) Extending flexibility: default checkbox changes on tenant settings for SQL database in Fabric [link]

[4] Microsoft Fabric Update Blog (2025) Enhancing SQL database in Fabric: share your feedback and shape the future [link]

[5] Microsoft Fabric Update Blog (2025) Why SQL database in Fabric is the best choice for low-code/no-code Developers [link]

19 November 2011

📉Graphical Representation: Comparison (Just the Quotes)

"Comparison between circles of different size should be absolutely avoided. It is inexcusable when we have available simple methods of charting so good and so convenient from every point of view as the horizontal bar." (Willard C Brinton, "Graphic Methods for Presenting Facts", 1919)

"Graphic comparisons, wherever possible, should be made in one dimension only." (Willard C Brinton, "Graphic Methods for Presenting Facts", 1919)

"Readers of statistical diagrams should not be required to compare magnitudes in more than one dimension. Visual comparisons of areas are particularly inaccurate and should not be necessary in reading any statistical graphical diagram." (William C Marshall, "Graphical methods for schools, colleges, statisticians, engineers and executives", 1921)

"[….] double-scale charts are likely to be misleading unless the two zero values coincide" (either on or off the chart). To insure an accurate comparison of growth the scale intervals should be so chosen that both curves meet at some point. This treatment produces the effect of percentage relatives or simple index numbers with the point of juncture serving as the base point. The principal advantage of this form of presentation is that it is a short-cut method of comparing the relative change of two or more series without computation. It is especially useful for bringing together series that either vary widely in magnitude or are measured in different units and hence cannot be compared conveniently on a chart having only one absolute-amount scale. In general, the double scale treatment should not be used for presenting growth comparisons to the general reader." (Kenneth W Haemer, "Double Scales Are Dangerous", The American Statistician Vol. 2" (3), 1948)

"An important rule in the drafting of curve charts is that the amount scale should begin at zero. In comparisons of size the omission of the zero base, unless clearly indicated, is likely to give a misleading impression of the relative values and trend." (Rufus R Lutz, "Graphic Presentation Simplified", 1949)

"Charts and graphs represent an extremely useful and flexible medium for explaining, interpreting, and analyzing numerical facts largely by means of points, lines, areas, and other geometric forms and symbols. They make possible the presentation of quantitative data in a simple, clear, and effective manner and facilitate comparison of values, trends, and relationships. Moreover, charts and graphs possess certain qualities and values lacking in textual and tabular forms of presentation." (Calvin F Schmid, "Handbook of Graphic Presentation", 1954)

"The common bar chart is particularly appropriate for comparing magnitude or size of coordinate items or parts of a total. It is one of the most useful, simple, and adaptable techniques in graphic presentation. The basis of comparison in the bar chart is linear or one-dimensional. The length of each bar or of its components is proportional to the quantity or amount of each category represented." (Anna C Rogers, "Graphic Charts Handbook", 1961)

"At a simpler level, some elementary but important suggestions for the clarity of graphs are as follows: (i) the axes should be clearly labelled with the names of the variables and the units of measurement; (ii) scale breaks should be used for false origins; (iii) comparison of related diagrams should be made easy, for example by using identical scales of measurement and placing diagrams side by side; (iv) scales should be arranged so that systematic and approximately linear relations are plotted at roughly 45° to the x-axis; (v) legends should make diagrams as nearly self-explanatory, i.e. independent of the text, as is feasible; (vi) interpretation should not be prejudiced by the technique of presentation, for example by superimposing thick smooth curves on scatter diagrams of points faintly reproduced." (David R Cox,"Some Remarks on the Role in Statistics of Graphical Methods", Applied Statistics 27 (1), 1978)

"A graphic is an illustration that, like a painting or drawing, depicts certain images on a flat surface. The graphic depends on the use of lines and shapes or symbols to represent numbers and ideas and show comparisons, trends, and relationships. The success of the graphic depends on the extent to which this representation is transmitted in a clear and interesting manner." (Robert Lefferts, "Elements of Graphics: How to prepare charts and graphs for effective reports", 1981)

"Understandability implies that the graph will mean something to the audience. If the presentation has little meaning to the audience, it has little value. Understandability is the difference between data and information. Data are facts. Information is facts that mean something and make a difference to whoever receives them. Graphic presentation enhances understanding in a number of ways. Many people find that the visual comparison and contrast of information permit relationships to be grasped more easily. Relationships that had been obscure become clear and provide new insights." (Anker V Andersen, "Graphing Financial Information: How accountants can use graphs to communicate", 1983)

"At the heart of quantitative reasoning is a single question: Compared to what? Small multiple designs, multivariate and data bountiful, answer directly by visually enforcing comparisons of changes, of the differences among objects, of the scope of alternatives. For a wide range of problems in data presentation, small multiples are the best design solution." (Edward R Tufte, "Envisioning Information", 1990)

"Changing measures are a particularly common problem with comparisons over time, but measures also can cause problems of their own. [...] We cannot talk about change without making comparisons over time. We cannot avoid such comparisons, nor should we want to. However, there are several basic problems that can affect statistics about change. It is important to consider the problems posed by changing - and sometimes unchanging - measures, and it is also important to recognize the limits of predictions. Claims about change deserve critical inspection; we need to ask ourselves whether apples are being compared to apples - or to very different objects." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Comparing series visually can be misleading […]. Local variation is hidden when scaling the trends. We first need to make the series stationary" (removing trend and/or seasonal components and/or differences in variability) and then compare changes over time. To do this, we log the series" (to equalize variability) and difference each of them by subtracting last year’s value from this year’s value." (Leland Wilkinson, "The Grammar of Graphics" 2nd Ed., 2005)

"[...] the First Principle for the analysis and presentation data: 'Show comparisons, contrasts, differences'. The fundamental analytical act in statistical reasoning is to answer the question Compared with what?". Whether we are evaluating changes over space or time, searching big data bases, adjusting and controlling for variables, designing experiments , specifying multiple regressions, or doing just about any kind of evidence-based reasoning, the essential point is to make intelligent and appropriate comparisons. Thus visual displays, if they are to assist thinking, should show comparisons." (Edward R Tufte, "Beautiful Evidence", 2006)

"What distinguishes data tables from graphics is explicit comparison and the data selection that this requires. While a data table obviously also selects information, this selection is less focused than a chart's on a particular comparison. To the extent that some figures in a table are visually emphasised. say in colour or size and style of print. the table is well on its way to becoming a chart. If you're making no comparisons - because you have no particular message and so need no selection" (in other words, if you are simply providing a database, number quarry or recycling facility) - tables are easier to use than charts." (Nicholas Strange, "Smoke and Mirrors: How to bend facts and figures to your advantage", 2007)

"Whereas charts generally focus on a trend or comparison, tables organize data for the reader to scan. Tables present data in an easy-read-format, or matrix. Tables arrange data in columns or rows so readers can make side-by-side comparisons. Tables work for many situations because they convey large amounts of data and have several variables for each item. Tables allow the reader to focus quickly on a specific item by scanning the matrix or to compare multiple items by scanning the rows or columns." (Dennis K Lieu & Sheryl Sorby, "Visualization, Modeling, and Graphics for Engineering Design", 2009)

"[...] the human brain is not good at calculating surface sizes. It is much better at comparing a single dimension such as length or height. [...] the brain is also a hopelessly lazy machine." (Alberto Cairo, "The Functional Art", 2011)

"Histograms are often mistaken for bar charts but there are important differences. Histograms show distribution through the frequency of quantitative values" (y axis) against defined intervals of quantitative values(x axis). By contrast, bar charts facilitate comparison of categorical values. One of the distinguishing features of a histogram is the lack of gaps between the bars [...]" (Andy Kirk, "Data Visualization: A successful design process", 2012)

"Good design is an important part of any visualization, while decoration (or chart-junk) is best omitted. Statisticians should also be careful about comparing themselves to artists and designers; our goals are so different that we will fare poorly in comparison." (Hadley Wickham, "Graphical Criticism: Some Historical Notes", Journal of Computational and Graphical Statistics Vol. 22(1), 2013)

"Comparisons are the lifeblood of empirical studies. We can’t determine if a medicine, treatment, policy, or strategy is effective unless we compare it to some alternative. But watch out for superficial comparisons: comparisons of percentage changes in big numbers and small numbers, comparisons of things that have nothing in common except that they increase over time, comparisons of irrelevant data. All of these are like comparing apples to prunes." (Gary Smith, "Standard Deviations", 2014)

"Further develop the situation or problem by covering relevant background. Incorporate external context or comparison points. Give examples that illustrate the issue. Include data that demonstrates the problem. Articulate what will happen if no action is taken or no change is made. Discuss potential options for addressing the problem. Illustrate the benefits of your recommended solution." (Cole N Knaflic, "Storytelling with Data: A Data Visualization Guide for Business Professionals", 2015)

"One way to lie with statistics is to compare things - datasets, populations, types of products - that are different from one another, and pretend that they’re not. As the old idiom says, you can’t compare apples with oranges." (Daniel J Levitin, "Weaponized Lies", 2017)

"The second rule of communication is to know what you want to achieve. Hopefully the aim is to encourage open debate, and informed decision-making. But there seems no harm in repeating yet again that numbers do not speak for themselves; the context, language and graphic design all contribute to the way the communication is received. We have to acknowledge we are telling a story, and it is inevitable that people will make comparisons and judgements, no matter how much we only want to inform and not persuade. All we can do is try to pre-empt inappropriate gut reactions by design or warning." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"For numbers to be transparent, they must be placed in an appropriate context. Numbers must presented in a way that allows for fair comparisons." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"So what does it mean to tell an honest story? Numbers should be presented in ways that allow meaningful comparisons." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"A good test of how effective your data visualizations are: can you remove all or most of the numbers and still understand the visualization and make comparisons?" (Steve Wexler, "The Big Picture: How to use data visualization to make better decisions - faster", 2021)

"Clutter is the main issue to keep in mind when assessing whether a paired bar chart is the right approach. With too many bars, and especially when there are more than two bars for each category, it can be difficult for the reader to see the patterns and determine whether the most important comparison is between or within the different categories." (Jonathan Schwabish, "Better Data Visualizations: A guide for scholars, researchers, and wonks", 2021)

"For a chart to be truly insightful, context is crucial because it provides us with the visual answer to an important question - 'compared with what'? No number on its own is inherently big or small – we need context to make that judgement. Common contextual comparisons in charts are provided by time" ('compared with last year...') and place" ('compared with the north...'). With ranking, context is provided by relative performance" ('compared with our rivals...')." (Alan Smith, "How Charts Work: Understand and explain data with confidence", 2022)

04 August 2011

🔏MS Office: Access vs. LightSwitch - About Starts and Ends of Software Products

Introduction

When an important software product or technology is released on the market, it brings with it dooming prophecies about the end/death of a competing or related product or technology. Even if maybe it catches the attention, the approach became a stereotype leading to other futile fights between adepts, some food for thought and a pile of words for search engines. As LightSwitch was released recently, people started already sketching dooming plans for competing tools like MS Access, Silverlight, WebMatrix, Visual Studio, etc. It’s actually interesting to study and understand how the entry on the software market impacts the overall landscape, the publishing of more or less pertinent thoughts on the future of a product are more than welcome, though from this to forecasting the end of a software product or technology, at least not without well-grounded reasons, it’s a long way.

In many cases it’s not even needed to go too deep into the features of the compared software products in order to dismiss such statements, this because there are a few common sense reasons for which the respective products will coexist, at least for the near future. Here are a few of them grouped into technology, products, people, partners and processes. Please note that by the terms old and new (software) products I’m referring here to a product existing on the market for a longer time, respectively a newly entered product.

Technology

In theory a new software product attempts to take advantage of the latest technological advances in the field, following the trends. Also an old product can take advantage of the latest technological developments, though a certain backward compatibility needs to be maintained, fact that could come with advantages and disadvantages altogether. Considering that nowadays such a product doesn’t exist “per se” but in a complex infrastructure with multiple layers of interconnectivity, a new product has to fit also in the overall picture.

A product in particular and a technology in general is doomed to extinction when it’s not more able to cope with the trends, when its characteristics don’t satisfy anymore users’ demands or the overhead of using it is greater than its benefits. As long two competing software products are trying to keep up with the trends and consolidate their market, the chances that they will parish are quite small. On the other side, each technology has sooner or later its own end.

Products

Software products having a few years on the market have reached in theory a certain maturity and stability. New software products typically go through an adoption phase that may last from months to years, and it will take time until they reach a certain maturity and stability, until their market develops, until vendors include them in their portfolio, until other products will develop interfaces to them, etc. First of all it will take some time until the two will come to have the same market share, and secondly it will take even more time until the market share of one of the products will deprecate. In addition, markets embrace diversity and the demands are so various that each product arrives to find his place.

When the products are coming from the same vendor and they are a part of greater packages and strategies, it’s hard to believe that a vendor would want to blow in the air his own business. Usually the two solutions target different markets, even if their markets intersect. Sure, there are also cases when a vendor might want to strengthen the position of a product in the detriment of another, especially when the benefits are higher.

People

Often different products demand different skill sets or an upgrade of skill set. For sure not all developers will move from one platform to the other, some will be reticent, while others are declared fans so there is no way to move to something new. Sure, in IT there are frequent the cases when developers have knowledge about 2-3 competing products, though this aspect doesn’t necessarily have a huge impact on the short term. Considering that software products are becoming more and more complex, it’s sometimes even needed a specialization covering only a part of a product.

Partners

Vendors and Customers, especially existing partners, will most probably approach and evaluate the new product, find a place in their portfolio/solution, conduct some pilot projects and eventually consider the product for further use. We can talk here about an adoption period, corroborated with the appearance of training material, best practices, books or any other material that facilitate the use of such a product. All this time requires time and effort, successful and unsuccessful projects, some years of experience.

Processes

Organizations have already in place solutions based on a product and integrated with other products. Some of them could be personal solutions, and maybe quite easy to replace, though the replacement of business/enterprise solutions come maybe with important expenses, changes in the infrastructure, and maybe the most important, process changes. And why change something that’s working just for the sake of change?! Sure, if there is the need for a second or third product, this doesn’t (always) mean that all the previous similar products will be replaced. For sure the two or more products can coexist, even if provide similar functionality, and the can maybe complete each other.

Conclusion

If one product or another will come to its end, for sure only time will tell. Usually when this happens, there are multiple factors that influenced the decay, factors that could be used maybe to foresee such an event. Though, without a detailed analysis or at least some well-supported ideas, dooming declarations about the rise or fall of software products are kind of futile, even if intended to catch readers’ attention. Enthusiastic or contradictory feelings about old or new products are natural, expressing opinions is free and welcomed when there is something to say, though are such declarations really necessary?!

07 January 2011

💎🏭SQL Reloaded: Pulling the Strings of SQL Server IV (Spaces, Trimming, Length and Comparisons)

In the previous post on concatenation, I was talking about the importance of spaces and other delimiters in making concatenations’ output more “readable”. Excepting their importance in natural language, the spaces have some further implication in the way strings are stored and processed. As remarked in the introductory post from this topic, there are two types of spaces that stand out in the crowds of spaces, namely the trailing spaces, the spaces found at the right extremity of a string, respectively the leading spaces, the spaces found at the left extremity of a string.

Are few the cases when the two trailing space are of any use, therefore databases like SQL Server usually ignore them. The philosophy about leading space is slightly different because there are cases in which they are used in order to align the text to the right, however there are tools which are cutting off the leading spaces. When no such tools are available or any of the two types of spaces are not cut off, then we’ll have do to it ourselves, and here we come to the first topic of this post, trimming.

Trimming

Trimming is the operation of removing the empty spaces found at the endings of a string. Unlike other programming languages which use only one function for this purpose (e.g. Trim function in VB or Oracle), SQL Server makes use of two functions used for this purpose, LTrim used to trim the spaces found at the left ending of the string, respectively RTrim, used to trim the spaces found at the right ending of the string.

-- trimming a string 
SELECT  LTrim(' this is a string ') Length1 -- left trimming 
, RTrim(' this is a string ') Length2 --right trimming 
, LTrim(RTrim(' this is a string ')) Length2 --left & right trimming

As can be seen it’s not so easy to identify the differences, maybe the next function will help to see that there is actually a difference.

Note:
1) If it looks like the two trimming functions are not working with strings having leading or trailing spaces, then maybe you are not dealing with an empty character but rather with other characters like CR, LF, CRLF or other similar characters, rendered sometimes like an empty character.

2) In SQL Server 2017 was introduced the Trim function which not only replaces the combined use of LTrim and RTrim functions, but it allows to replace other specified characters (including CR, LF, Tab) from the start or end of a string. (see post)

Length

Before approaching other operations with strings, it’s maybe useful (actually necessary as we will see) to get a glimpse of the way we can determine the length of a string value, in other words how many characters it has, this being possible by using the Len function:

-- length of a string 
SELECT Len('this is a string') Length1 -- simple string 
, Len('this is a string ') Length2 --ending in space 
, Len(' this is a string') Length3 --starting with a space 
, Len(' this is a string ') Length4 --starting &amp; ending with a space 
, Len(LTrim(' this is a string ')) Length5 --length & left trimming 
,Len(RTrim(' this is a string ')) Length5 --length & right trimming

,Len(LTrim(RTrim(' this is a string '))) Length5 --length, left & right trimming

In order to understand the above results, one observation is necessary: if a strings ends in with one or more empty characters, the Len function ignores them, though this doesn’t happen with the leading empty characters, they needing to be removed explicitly if needed.

Comparisons

The comparison operation points the differences or similarities existing between two data types, involving at minimum two expressions that reduce at runtime to a data type and the comparison operator. This means that each member of comparison could include any valid combinations of functions as long they are reduced to compatible data types. In what concerns the comparison of strings, things are relatively simple, the comparison being allowed independently on whether they have fix or varying length. Relatively simple because if we’d have to go into details, then we’d need to talk about character sets (also called character encoding or character maps) and other string goodies the ANSI SQL standard(s) are coming with, including a set of rules that dictate the behavior of comparisons. So, let’s keep things as simple as possible. As per above attempt of definition, a comparison implies typically an equality, respectively difference, based on equal (“=”), respectively not equal (“<>” or “!=”). Here are some simple examples:

-- sample comparisons 
SELECT CASE WHEN 'abc' != 'abc ' THEN 1 ELSE 0 END Example1 
, CASE WHEN ' abc' != 'abc' THEN 1 ELSE 0 END Example2 
, CASE WHEN ' ' != '' THEN 1 ELSE 0 END Example3 
-- error comparison , CASE WHEN 'abc' != NULL THEN 1 ELSE 0 END Example4 
, CASE WHEN 'abc' = NULL THEN 1 ELSE 0 END Example5 
-- adequate NULL comparison , CASE WHEN 'abc' IS NOT NULL THEN 1 ELSE 0 END Example6  
, CASE WHEN 'abc' IS NULL THEN 1 ELSE 0 END Example7

Output:

Example1	Example2	Example3	Example5	Example7
0	1	0	0	0

The first three examples are demonstrating again the behavior of leading, respectively trailing spaces. The next two examples, even if they seem quite logical in terms of natural language semantics, they are wrong from the point of view of SQL semantics, and this because the comparison of values in which one of them is NULL equates to a NULL, thus resulting the above behavior in which both expressions from the 4th and 5th example equate to false. The next two examples show how the NULLs should be handled in comparisons with the help of IS operator, respectively it’s negation – IS NOT.

Like in the case of numeric values, the comparison between two strings could be expressed by using the “less than” (“<;”) and “greater than” (“?”) operators, alone or in combination with the equality operator (“<=”, “>=”) or the negation operator (“!>”, “<!”) (see comparison operators in MDSN). Typically an SQL Server database is case insensitive, so there will be no difference between the following strings: “ABC”, “abc”, “Abc”, etc. Here are some examples:

-- sample comparisons (case sensitive) 
SELECT CASE WHEN 'abc' < 'ABC' THEN 1 ELSE 0 END Example1 
, CASE WHEN 'abc' > 'abc' THEN 1 ELSE 0 END Example2 
, CASE WHEN 'abc' >= 'abc ' THEN 1 ELSE 0 END Example3 
, CASE WHEN 'abc' <> 'ABC' THEN 1 ELSE 0 END Example4 
, CASE WHEN 'abc' > '' THEN 1 ELSE 0 END Example5 
, CASE WHEN ' ' > '' THEN 1 ELSE 0 END Example6

Output:

Example1	Example2	Example3	Example4	Example5	Example6
0	0	1	0	1	0

The case sensitivity could be changed at attribute, table or database level. As we don’t deal with a table and the don’t want to complicate too much the queries, let’s consider changing the sensitivity at database level. So if you are using a non-production database, try the following script in order to enable, respectively to disable the case sensitivity:

--enabling case sensitivity for a database 
ALTER DATABASE <database name>  
COLLATE Latin1_General_CS_AS  

--disabling case sensitivity for a database 
ALTER DATABASE <database name> 
COLLATE Latin1_General_CI_AS

In order to test the behavior of case sensitivity, enable first the sensitivity and then rerun the previous set of example (involving case sensitivity).
Output:

Example1	Example2	Example3	Example4	Example5	Example6
1	0	1	1	1	0

After that you could disable again the case sensitivity by running the last script. Please note that if your database has other collation, you’ll have to change the scripts accordingly in order to point to your database’s collation.

Notes:
The queries work also in SQL databases in Microsoft Fabric.

Happy coding!

05 October 2010

🔏MS Office: The Limitations of MS Access Database

In the previous post I was highlighting some general considerations on the use of MS Access and Excel as frameworks for building applications. I left many things out from the lack of time and space, therefore, as the title reveals, in this post I will focus simply on the limitations of MS Access considered as Database. I considered then that Access is a fairly good as database, recommending it for 10-20 concurrent users, fact that could equate, after case, maybe with a total of users that range between 1-100. Of course, this doesn’t mean that MS Access can’t do more, actually it supports 255 concurrent users and with a good design that limit could be reached.

Another important limitation regards the size of an Access database, set to 2GB, it used to be more than sufficient a few years back, though nowadays, it’s sometimes the equivalent of a month/year of transactions. I never tried to count how many records could store a MS Access, though if I remember correctly, a relatively small to average table of 1000000 (10^6) records occupies about 100MB, using this logic 2GB could equate with about 20000000 (2*10^7) records, the equivalent of a small to average database size. Anyway, the numbers are relative, the actual size depends also on the number of objects the database stores, the size of attributes stored, on the fact that even if Access is supposed to have a limitation of 2GB, I met cases in which a database of 1GB was crashing a lot, needing to be repaired or backed up regularly.

Sometimes it could be repaired, other times not, unfortunately the “recovery” built within a MS Access can’t be compared with the recovery available in a RDBMS. That’s ok in the end, even mature databases crash from time to time, though the logs and transaction isolation models allow them to provide high recoverability and reliability, to which adds up scalability, availability, security and manageability. If all these are not essential for your database solution, the MS Access is ok, though you’ll have to invest effort in each of these area when you have to raise your standards.

One of the most painful issues when dealing with concurrent data access is the transaction processing that needs to guarantee the consistency and recoverability of operations. As Access is not handling the transactions, the programmer has to do that using ADO or DAO transactions. As many applications still don’t need pessimistic concurrency, with some effort and a good row versioning also this issue could be solved. Also the security-related issues could be solved programmatically by designing a role-based permission framework, though it occasionally it could be breached when the user is aware of the few Access hacks and has direct access to the database.

Manageability resumes usually in controlling resources utilization, monitoring the progress of the actions running on the database. If Access is doing a relatively good job in what concerns the manageability of its objects, it has no reliable way to control their utilization, when a query is running for too long, the easiest way to solve this is to coldly kill the process belonging to Access. Not sure if it makes sense to philosophy about Access’ scalability and availability, at least can’t be comparable from this point of view with RDBMS for which failover clustering, mirroring, log shipping, online backup and in general online maintenance have an important impact on the two.

Excepting the above theoretical limitations, when MS Access is part of your solution, it’s always a good idea to know its maximal capacity specifications, this applying to all type of databases or technologies. Most probably you won’t want that in the middle of your project or even later you realize that you reach one of such limitations. I tried to put together a comparison between the maximal capacity specifications for 2000, 2007 and 2010 versions of MS Access and, for reference, the same specification for SQL Server (2000, 2005, 2008 R2). The respective information come mainly from Microsoft websites, with a few additions from [5] and [6].

	MS Access		SQL Server
Attribute	2000 [1]	2007/2010 [2]	2000 [7]	2005 [4]	2008 R2 [3]
SQL statements size	64kb	64kb	64kb	64kb	64kb
# characters in Memo field	65535	65535	-	2^30-1	2^31-1
# characters in Text field	255	255	8000	8000	8000
# characters in object name	64	64	128	128	128
# characters in record	4000	4000	8000	8000	8000
# concurrent users	255

255

32767

# databases per instance

32767

# fields in index

# fields in recordset

255

4096

# fields in table

255

1024

1024/30000

# files per database

32767

# forced relationships per table

253

# indexes per table

250 (1 clustered)

# instances

# joins in a query

# levels nested queries

# nested subqueries

32768

</>

2147483647

# open tables

2048

2147483647

# roles per database

n/a

16379

# tables in a query

256

# users per database

n/a

16379

database size

<2GB

1048516 TB

542272TB

file size (data)

2GB

32TB

16TB

file size (log)

n/a

32TB

2TB

For my surprise the maximal capacity specifications of Access are comparable with the ones of SQL Server for many of the above attributes. Sure, there is a huge difference in what concerns the number of databases, the database/file size and the number of supported objects, quite relevant in the architecture of applications. Several other differences, for example the number of indexes supported per table or relationships per table, are less important for the majority of solutions. Another fact that is not remarked in the above table is the fact that the number of records in a table are typically limited by storage. Please note that many important features not available in Access were left out, therefore, for a better overview is advisable to check directly the referenced sources.

There are two one more personal observations for this post. Even if MS Access is great for non-SQL developers giving its nice Designer, for SQL developers it lacks a rich editor, the initial formatting being lost, this doubled by the poor support for later versions of the ANSI standard make from Access a tool to avoid.

References:
[1] Microsoft. 2010. Microsoft Access database specifications. [Online] Available form:
http://office.microsoft.com/en-us/access-help/access-specifications-HP005186808.aspx (Accessed: 04.10.2010)

[2] Microsoft. 2010. Access 2010 specifications [Online] Available form: http://office.microsoft.com/en-us/access-help/access-2010-specifications-HA010341462.aspx (Accessed: 04.10.2010)
[3] MSDN. (2010). Maximum Capacity Specifications for SQL Server: SQL Server 2008 R2. [Online] Available form: http://msdn.microsoft.com/en-us/library/ms143432.aspx (Accessed: 04.10.2010)

[4] MSDN. (2010). Maximum Capacity Specifications for SQL Server: SQL Server 2005. [Online] Available form: http://msdn.microsoft.com/en-us/library/ms143432(SQL.90).aspx (Accessed: 04.10.2010)

[5] SQL Server Helper. (2005). SQL Server 2005: Maximum Capacity Specifications. [Online] Available form: http://www.sql-server-helper.com/sql-server-2005/maximum-capacity-specifications.aspx (Accessed: 04.10.2010)

[6] MSDN. (2008).SQL 2005 and SQL 2008 database volume capacity. [Online] Available form: http://social.msdn.microsoft.com/forums/en-US/sqlgetstarted/thread/4225734e-e480-4b21-8cd4-4228ca2abf55/ (Accessed: 04.10.2010)

[7] MSDN. (2010). Maximum Capacity Specifications for SQL Server: SQL Server 2000. [Online] Available form: http://technet.microsoft.com/en-us/library/aa274604(SQL.80).aspx (Accessed: 04.10.2010)

[8] MSDN. (2010). Comparison of Microsoft Access SQL and ANSI SQL. [Online] Available form: http://msdn.microsoft.com/en-us/library/bb208890.aspx (Accessed: 04.10.2010)

SQL Troubles

Pages

03 May 2025

📊Graphical Representation: Graphics We Live By (Part XI: Comparisons Between Data Series)

18 December 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part VII: Data Stores Comparison)

19 November 2011

📉Graphical Representation: Comparison (Just the Quotes)

04 August 2011

🔏MS Office: Access vs. LightSwitch - About Starts and Ends of Software Products

07 January 2011

💎🏭SQL Reloaded: Pulling the Strings of SQL Server IV (Spaces, Trimming, Length and Comparisons)

05 October 2010

🔏MS Office: The Limitations of MS Access Database

About Me