Showing posts with label comparison. Show all posts

18 December 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part VI: Data Stores Comparison)

Business Intelligence Series

Microsoft made available a reference guide for the data stores supported for Microsoft Fabric workloads [1], including the new Fabric SQL database (see previous post). Here's the consolidated table followed by a few aspects to consider: 

| Area | Lakehouse | Warehouse | Eventhouse | Fabric SQL database | Power BI Datamart |
|---|---|---|---|---|---|
| Data volume | Unlimited | Unlimited | Unlimited | 4 TB | Up to 100 GB |
| Type of data | Unstructured, semi-structured, structured | Structured, semi-structured (JSON) | Unstructured, semi-structured, structured | Structured, semi-structured, unstructured | Structured |
| Primary developer persona | Data engineer, data scientist | Data warehouse developer, data architect, data engineer, database developer | App developer, data scientist, data engineer | AI developer, App developer, database developer, DB admin | Data scientist, data analyst |
| Primary dev skill | Spark (Scala, PySpark, Spark SQL, R) | SQL | No code, KQL, SQL | SQL | No code, SQL |
| Data organized by | Folders and files, databases, and tables | Databases, schemas, and tables | Databases, schemas, and tables | Databases, schemas, tables | Database, tables, queries |
| Read operations | Spark, T-SQL | T-SQL, Spark* | KQL, T-SQL, Spark | T-SQL | Spark, T-SQL |
| Write operations | Spark (Scala, PySpark, Spark SQL, R) | T-SQL | KQL, Spark, connector ecosystem | T-SQL | Dataflows, T-SQL |
| Multi-table transactions | No | Yes | Yes, for multi-table ingestion | Yes, full ACID compliance | No |
| Primary development interface | Spark notebooks, Spark job definitions | SQL scripts | KQL Queryset, KQL Database | SQL scripts | Power BI |
| Security | RLS, CLS**, table level (T-SQL), none for Spark | Object level, RLS, CLS, DDL/DML, dynamic data masking | RLS | Object level, RLS, CLS, DDL/DML, dynamic data masking | Built-in RLS editor |
| Access data via shortcuts | Yes | Yes | Yes | Yes | No |
| Can be a source for shortcuts | Yes (files and tables) | Yes (tables) | Yes | Yes (tables) | No |
| Query across items | Yes | Yes | Yes | Yes | No |
| Advanced analytics | Interface for large-scale data processing, built-in data parallelism, and fault tolerance | Interface for large-scale data processing, built-in data parallelism, and fault tolerance | Time Series native elements, full geo-spatial and query capabilities | T-SQL analytical capabilities, data replicated to delta parquet in OneLake for analytics | Interface for data processing with automated performance tuning |
| Advanced formatting support | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format | Full indexing for free text and semi-structured data like JSON | Table support for OLTP, JSON, vector, graph, XML, spatial, key-value | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format |
| Ingestion latency | Available instantly for querying | Available instantly for querying | Queued ingestion, streaming ingestion has a couple of seconds latency | Available instantly for querying | Available instantly for querying |

The table can be used as a map of what one needs to know in order to use each data store, and of how previous experience can be reused, and here I'm referring to the many SQL developers. One must also consider the capabilities and limitations of each storage repository.

However, what I'm missing are some references regarding the performance of data access, especially compared with on-premises workloads. Moreover, the devil is in the details, therefore one must test thoroughly before committing to any of the above choices. For the newest overview please check the referenced documentation!

For lakehouses, the hardest limitation is the lack of multi-table transactions, though that's understandable given their scope. However, probably the most important aspect is whether they can scale with the volume of reads/writes, as currently the SQL endpoint seems to lag.
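For contrast, here is a minimal sketch of what a multi-table transaction looks like in a T-SQL store that supports it (e.g. SQL Server or a SQL database in Fabric). The temporary tables and their columns are hypothetical, used only for illustration, and the supported syntax may differ between data stores, so please check the documentation for the store you target:

```sql
-- hypothetical tables, for illustration only
CREATE TABLE #Orders (
  OrderId int NOT NULL
, CustomerName varchar(50) NOT NULL);

CREATE TABLE #OrderLines (
  OrderId int NOT NULL
, LineNumber int NOT NULL
, Quantity int NOT NULL);

BEGIN TRY
    BEGIN TRANSACTION;

    -- both inserts succeed or fail together
    INSERT INTO #Orders (OrderId, CustomerName)
    VALUES (1, 'Contoso');

    INSERT INTO #OrderLines (OrderId, LineNumber, Quantity)
    VALUES (1, 1, 10), (1, 2, 5);

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- undo the partial work and rethrow the error
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW;
END CATCH;

DROP TABLE #OrderLines;
DROP TABLE #Orders;
```

Without such a guarantee, the header and the lines would have to be written separately, and a failure in between would leave the data in an inconsistent state that needs to be handled by the calling process.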

The warehouse seems to be more versatile, though careful attention needs to be given to its design. 

The Eventhouse opens the door to a wide range of time-based scenarios, though it will be interesting to see how developers cope with its lack of functionality in some areas.

Fabric SQL databases are a new addition, and hopefully they'll allow considering a wide range of OLTP scenarios. 

Power BI datamarts have been in preview for a couple of years.

References:
[1] Microsoft Fabric (2024) Microsoft Fabric decision guide: choose a data store [link]
[2] Reitse's blog (2024) Testing Microsoft Fabric Capacity: Data Warehouse vs Lakehouse Performance [link]

19 November 2011

📉Graphical Representation: Comparison (Just the Quotes)

"Comparison between circles of different size should be absolutely avoided. It is inexcusable when we have available simple methods of charting so good and so convenient from every point of view as the horizontal bar." (Willard C Brinton, "Graphic Methods for Presenting Facts", 1919)

"Graphic comparisons, wherever possible, should be made in one dimension only." (Willard C Brinton, "Graphic Methods for Presenting Facts", 1919)

"Readers of statistical diagrams should not be required to compare magnitudes in more than one dimension. Visual comparisons of areas are particularly inaccurate and should not be necessary in reading any statistical graphical diagram." (William C Marshall, "Graphical methods for schools, colleges, statisticians, engineers and executives", 1921)

"[...] double-scale charts are likely to be misleading unless the two zero values coincide (either on or off the chart). To insure an accurate comparison of growth the scale intervals should be so chosen that both curves meet at some point. This treatment produces the effect of percentage relatives or simple index numbers with the point of juncture serving as the base point. The principal advantage of this form of presentation is that it is a short-cut method of comparing the relative change of two or more series without computation. It is especially useful for bringing together series that either vary widely in magnitude or are measured in different units and hence cannot be compared conveniently on a chart having only one absolute-amount scale. In general, the double scale treatment should not be used for presenting growth comparisons to the general reader." (Kenneth W Haemer, "Double Scales Are Dangerous", The American Statistician Vol. 2 (3), 1948)

"An important rule in the drafting of curve charts is that the amount scale should begin at zero. In comparisons of size the omission of the zero base, unless clearly indicated, is likely to give a misleading impression of the relative values and trend." (Rufus R Lutz, "Graphic Presentation Simplified", 1949)

"Charts and graphs represent an extremely useful and flexible medium for explaining, interpreting, and analyzing numerical facts largely by means of points, lines, areas, and other geometric forms and symbols. They make possible the presentation of quantitative data in a simple, clear, and effective manner and facilitate comparison of values, trends, and relationships. Moreover, charts and graphs possess certain qualities and values lacking in textual and tabular forms of presentation." (Calvin F Schmid, "Handbook of Graphic Presentation", 1954)

"The common bar chart is particularly appropriate for comparing magnitude or size of coordinate items or parts of a total. It is one of the most useful, simple, and adaptable techniques in graphic presentation. The basis of comparison in the bar chart is linear or one-dimensional. The length of each bar or of its components is proportional to the quantity or amount of each category represented." (Anna C Rogers, "Graphic Charts Handbook", 1961)

"A graphic is an illustration that, like a painting or drawing, depicts certain images on a flat surface. The graphic depends on the use of lines and shapes or symbols to represent numbers and ideas and show comparisons, trends, and relationships. The success of the graphic depends on the extent to which this representation is transmitted in a clear and interesting manner." (Robert Lefferts, "Elements of Graphics: How to prepare charts and graphs for effective reports", 1981)

"Understandability implies that the graph will mean something to the audience. If the presentation has little meaning to the audience, it has little value. Understandability is the difference between data and information. Data are facts. Information is facts that mean something and make a difference to whoever receives them. Graphic presentation enhances understanding in a number of ways. Many people find that the visual comparison and contrast of information permit relationships to be grasped more easily. Relationships that had been obscure become clear and provide new insights." (Anker V Andersen, "Graphing Financial Information: How accountants can use graphs to communicate", 1983)

"At the heart of quantitative reasoning is a single question: Compared to what? Small multiple designs, multivariate and data bountiful, answer directly by visually enforcing comparisons of changes, of the differences among objects, of the scope of alternatives. For a wide range of problems in data presentation, small multiples are the best design solution." (Edward R Tufte, "Envisioning Information", 1990)

"Changing measures are a particularly common problem with comparisons over time, but measures also can cause problems of their own. [...] We cannot talk about change without making comparisons over time. We cannot avoid such comparisons, nor should we want to. However, there are several basic problems that can affect statistics about change. It is important to consider the problems posed by changing - and sometimes unchanging - measures, and it is also important to recognize the limits of predictions. Claims about change deserve critical inspection; we need to ask ourselves whether apples are being compared to apples - or to very different objects." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Comparing series visually can be misleading […]. Local variation is hidden when scaling the trends. We first need to make the series stationary (removing trend and/or seasonal components and/or differences in variability) and then compare changes over time. To do this, we log the series (to equalize variability) and difference each of them by subtracting last year’s value from this year’s value." (Leland Wilkinson, "The Grammar of Graphics" 2nd Ed., 2005)

"[...] the First Principle for the analysis and presentation of data: 'Show comparisons, contrasts, differences'. The fundamental analytical act in statistical reasoning is to answer the question 'Compared with what?'. Whether we are evaluating changes over space or time, searching big data bases, adjusting and controlling for variables, designing experiments, specifying multiple regressions, or doing just about any kind of evidence-based reasoning, the essential point is to make intelligent and appropriate comparisons. Thus visual displays, if they are to assist thinking, should show comparisons." (Edward R Tufte, "Beautiful Evidence", 2006)

"What distinguishes data tables from graphics is explicit comparison and the data selection that this requires. While a data table obviously also selects information, this selection is less focused than a chart's on a particular comparison. To the extent that some figures in a table are visually emphasised, say in colour or size and style of print, the table is well on its way to becoming a chart. If you're making no comparisons - because you have no particular message and so need no selection (in other words, if you are simply providing a database, number quarry or recycling facility) - tables are easier to use than charts." (Nicholas Strange, "Smoke and Mirrors: How to bend facts and figures to your advantage", 2007)

"Whereas charts generally focus on a trend or comparison, tables organize data for the reader to scan. Tables present data in an easy-read-format, or matrix. Tables arrange data in columns or rows so readers can make side-by-side comparisons. Tables work for many situations because they convey large amounts of data and have several variables for each item. Tables allow the reader to focus quickly on a specific item by scanning the matrix or to compare multiple items by scanning the rows or columns."  (Dennis K Lieu & Sheryl Sorby, "Visualization, Modeling, and Graphics for Engineering Design", 2009)

"[...] the human brain is not good at calculating surface sizes. It is much better at comparing a single dimension such as length or height. [...] the brain is also a hopelessly lazy machine." (Alberto Cairo, "The Functional Art", 2011)

"Histograms are often mistaken for bar charts but there are important differences. Histograms show distribution through the frequency of quantitative values (y axis) against defined intervals of quantitative values (x axis). By contrast, bar charts facilitate comparison of categorical values. One of the distinguishing features of a histogram is the lack of gaps between the bars [...]" (Andy Kirk, "Data Visualization: A successful design process", 2012)

"Good design is an important part of any visualization, while decoration (or chart-junk) is best omitted. Statisticians should also be careful about comparing themselves to artists and designers; our goals are so different that we will fare poorly in comparison." (Hadley Wickham, "Graphical Criticism: Some Historical Notes", Journal of Computational and Graphical Statistics Vol. 22(1), 2013) 

"Comparisons are the lifeblood of empirical studies. We can’t determine if a medicine, treatment, policy, or strategy is effective unless we compare it to some alternative. But watch out for superficial comparisons: comparisons of percentage changes in big numbers and small numbers, comparisons of things that have nothing in common except that they increase over time, comparisons of irrelevant data. All of these are like comparing apples to prunes." (Gary Smith, "Standard Deviations", 2014)

"Further develop the situation or problem by covering relevant background. Incorporate external context or comparison points. Give examples that illustrate the issue. Include data that demonstrates the problem. Articulate what will happen if no action is taken or no change is made. Discuss potential options for addressing the problem. Illustrate the benefits of your recommended solution." (Cole N Knaflic, "Storytelling with Data: A Data Visualization Guide for Business Professionals", 2015)

"One way to lie with statistics is to compare things - datasets, populations, types of products - that are different from one another, and pretend that they’re not. As the old idiom says, you can’t compare apples with oranges." (Daniel J Levitin, "Weaponized Lies", 2017)

"The second rule of communication is to know what you want to achieve. Hopefully the aim is to encourage open debate, and informed decision-making. But there seems no harm in repeating yet again that numbers do not speak for themselves; the context, language and graphic design all contribute to the way the communication is received. We have to acknowledge we are telling a story, and it is inevitable that people will make comparisons and judgements, no matter how much we only want to inform and not persuade. All we can do is try to pre-empt inappropriate gut reactions by design or warning." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"For numbers to be transparent, they must be placed in an appropriate context. Numbers must be presented in a way that allows for fair comparisons." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"So what does it mean to tell an honest story? Numbers should be presented in ways that allow meaningful comparisons." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"A good test of how effective your data visualizations are: can you remove all or most of the numbers and still understand the visualization and make comparisons?" (Steve Wexler, "The Big Picture: How to use data visualization to make better decisions - faster", 2021)

"Clutter is the main issue to keep in mind when assessing whether a paired bar chart is the right approach. With too many bars, and especially when there are more than two bars for each category, it can be difficult for the reader to see the patterns and determine whether the most important comparison is between or within the different categories." (Jonathan Schwabish, "Better Data Visualizations: A guide for scholars, researchers, and wonks", 2021)

"For a chart to be truly insightful, context is crucial because it provides us with the visual answer to an important question - 'compared with what'? No number on its own is inherently big or small – we need context to make that judgement. Common contextual comparisons in charts are provided by time ('compared with last year...') and place ('compared with the north...'). With ranking, context is provided by relative performance ('compared with our rivals...')." (Alan Smith, "How Charts Work: Understand and explain data with confidence", 2022)

04 August 2011

🔏MS Office: Access vs. LightSwitch - About Starts and Ends of Software Products

Introduction

    When an important software product or technology is released on the market, it brings with it gloomy prophecies about the end or death of a competing or related product or technology. Even if it catches the attention, the approach has become a stereotype, leading to futile fights between adherents, some food for thought and a pile of words for search engines. As LightSwitch was released recently, people have already started sketching doom scenarios for competing tools like MS Access, Silverlight, WebMatrix, Visual Studio, etc. It’s actually interesting to study and understand how a new entry on the software market impacts the overall landscape, and more or less pertinent thoughts on the future of a product are more than welcome, though from this to forecasting the end of a software product or technology, at least without well-grounded reasons, it’s a long way.
    In many cases it’s not even necessary to go too deep into the features of the compared software products in order to dismiss such statements, because there are a few common-sense reasons for which the respective products will coexist, at least for the near future. Here are a few of them, grouped into technology, products, people, partners and processes. Please note that by the terms old and new (software) products I’m referring here to a product that has existed on the market for a longer time, respectively to a newly entered product.

Technology

    In theory a new software product attempts to take advantage of the latest technological advances in the field, following the trends. An old product can also take advantage of the latest technological developments, though a certain backward compatibility needs to be maintained, a fact that comes with both advantages and disadvantages. Considering that nowadays such a product doesn’t exist “per se” but within a complex infrastructure with multiple layers of interconnectivity, a new product also has to fit into the overall picture.
    A product in particular, and a technology in general, is doomed to extinction when it’s no longer able to cope with the trends, when its characteristics no longer satisfy users’ demands, or when the overhead of using it is greater than its benefits. As long as two competing software products try to keep up with the trends and consolidate their market, the chances that they will perish are quite small. On the other hand, each technology sooner or later reaches its own end.

Products

    Software products that have been on the market for a few years have reached, in theory, a certain maturity and stability. New software products typically go through an adoption phase that may last from months to years, and it will take time until they reach a certain maturity and stability, until their market develops, until vendors include them in their portfolio, until other products develop interfaces to them, etc. First of all, it will take some time until the two come to have the same market share, and secondly, it will take even more time until the market share of one of the products declines. In addition, markets embrace diversity, and the demands are so varied that each product eventually finds its place.
    When the products come from the same vendor and are part of greater packages and strategies, it’s hard to believe that a vendor would want to blow up its own business. Usually the two solutions target different markets, even if their markets intersect. Sure, there are also cases when a vendor might want to strengthen the position of one product to the detriment of another, especially when the benefits are higher.

People

    Often different products demand different skill sets or an upgrade of the existing skill set. For sure not all developers will move from one platform to the other: some will be reluctant, while others are declared fans of a product, so there is no way they would move to something new. Sure, in IT there are frequent cases in which developers have knowledge of 2-3 competing products, though this aspect doesn’t necessarily have a huge impact in the short term. Considering that software products are becoming more and more complex, sometimes a specialization covering only a part of a product is needed.

Partners

    Vendors and customers, especially existing partners, will most probably approach and evaluate the new product, find a place for it in their portfolio/solution, conduct some pilot projects and eventually consider the product for further use. We can talk here about an adoption period, corroborated with the appearance of training material, best practices, books or any other material that facilitates the use of such a product. All this requires time and effort, successful and unsuccessful projects, and some years of experience.

Processes

    Organizations already have in place solutions based on a product and integrated with other products. Some of them could be personal solutions, and maybe quite easy to replace, though the replacement of business/enterprise solutions may come with important expenses, changes in the infrastructure and, maybe the most important, process changes. And why change something that’s working just for the sake of change?! Sure, if there is the need for a second or third product, this doesn’t (always) mean that all the previous similar products will be replaced. The two or more products can surely coexist, even if they provide similar functionality, and they may even complement each other.

Conclusion

    Whether one product or another will come to its end, only time will tell. Usually when this happens, there are multiple factors that influenced the decay, factors that could maybe be used to foresee such an event. Still, without a detailed analysis or at least some well-supported ideas, gloomy declarations about the rise or fall of software products are rather futile, even if intended to catch readers’ attention. Enthusiastic or contradictory feelings about old or new products are natural, and expressing opinions is free and welcome when there is something to say, though are such declarations really necessary?!

07 January 2011

💎🏭SQL Reloaded: Pulling the Strings of SQL Server IV (Spaces, Trimming, Length and Comparisons)

In the previous post on concatenation, I was talking about the importance of spaces and other delimiters in making the output of concatenations more “readable”. Besides their importance in natural language, spaces have some further implications for the way strings are stored and processed. As remarked in the introductory post on this topic, there are two types of spaces that stand out in the crowd, namely the trailing spaces, the spaces found at the right end of a string, respectively the leading spaces, the spaces found at the left end of a string.

There are few cases in which trailing spaces are of any use, therefore databases like SQL Server usually ignore them. The philosophy about leading spaces is slightly different, because there are cases in which they are used in order to align the text to the right, though there are tools which cut off the leading spaces. When no such tools are available, or any of the two types of spaces are not cut off, then we’ll have to do it ourselves, and here we come to the first topic of this post, trimming.

Trimming

Trimming is the operation of removing the empty spaces found at the ends of a string. Unlike other programming languages, which use only one function for this purpose (e.g. the Trim function in VB or Oracle), SQL Server makes use of two functions: LTrim, used to trim the spaces found at the left end of the string, respectively RTrim, used to trim the spaces found at the right end of the string.

-- trimming a string 
SELECT LTrim(' this is a string ') Trimmed1 -- left trimming 
, RTrim(' this is a string ') Trimmed2 -- right trimming 
, LTrim(RTrim(' this is a string ')) Trimmed3 -- left & right trimming 

As can be seen, it’s not so easy to spot the differences in the output; maybe the next function will help to see that there actually is a difference.
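One simple way to make the effect of trimming visible is to enclose each result between delimiters. The query below is only a small sketch of the idea; the square brackets and the column aliases are arbitrary choices:

```sql
-- making the remaining spaces visible with delimiters
SELECT '[' + LTrim(' this is a string ') + ']' LeftTrimmed -- [this is a string ]
, '[' + RTrim(' this is a string ') + ']' RightTrimmed -- [ this is a string]
, '[' + LTrim(RTrim(' this is a string ')) + ']' BothTrimmed -- [this is a string]
```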

Note:
1) If it looks like the two trimming functions are not working with strings having leading or trailing spaces, then maybe you are not dealing with a space character but rather with other characters like CR, LF, CRLF or similar characters, which are sometimes rendered like a space.
2) SQL Server 2017 introduced the Trim function, which not only replaces the combined use of the LTrim and RTrim functions, but also allows removing other specified characters (including CR, LF, Tab) from the start or end of a string. (see post)
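The two notes can be illustrated with a short sketch, assuming SQL Server 2017 or later for the Trim function. The variable names are hypothetical, and the test string is built on purpose with a leading Tab and a trailing CRLF:

```sql
-- a string wrapped in Tab, spaces and CRLF
DECLARE @Text varchar(50) = CHAR(9) + ' this is a string ' + CHAR(13) + CHAR(10);
-- the characters to be removed: space, Tab, CR, LF
DECLARE @Chars varchar(10) = ' ' + CHAR(9) + CHAR(13) + CHAR(10);

SELECT Len(@Text) LengthBefore -- 21, nothing is ignored as the string doesn't end in a space
, Len(LTrim(RTrim(@Text))) LengthLegacyTrim -- 21, Tab and CRLF shield the spaces
, Len(Trim(@Chars FROM @Text)) LengthTrim -- 16, all the listed characters are removed
```

On older versions a similar effect can be obtained by first replacing the unwanted characters (e.g. with the Replace function) and only then applying LTrim and RTrim, though Replace also affects the occurrences found inside the string.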

Length

Before approaching other operations with strings, it’s useful (actually necessary, as we will see) to get a glimpse of the way we can determine the length of a string value, in other words how many characters it has, which is possible by using the Len function:

-- length of a string 
SELECT Len('this is a string') Length1 -- simple string 
, Len('this is a string ') Length2 -- ending in space 
, Len(' this is a string') Length3 -- starting with a space 
, Len(' this is a string ') Length4 -- starting & ending with a space 
, Len(LTrim(' this is a string ')) Length5 -- length & left trimming 
, Len(RTrim(' this is a string ')) Length6 -- length & right trimming 
, Len(LTrim(RTrim(' this is a string '))) Length7 -- length, left & right trimming 

In order to understand the above results, one observation is necessary: if a string ends with one or more spaces, the Len function ignores them, though this doesn’t happen with the leading spaces, which need to be removed explicitly if needed.
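When the trailing spaces do matter, there are a few ways to count them. The sketch below shows two of them: the DataLength function, which returns the number of bytes (so the result depends on the data type), and a sentinel character appended before calling Len. The aliases are arbitrary:

```sql
-- counting the trailing spaces too
SELECT Len('abc  ') LenValue -- 3, trailing spaces ignored
, DataLength('abc  ') BytesVarchar -- 5, one byte per character
, DataLength(N'abc  ') BytesNvarchar -- 10, two bytes per character
, Len('abc  ' + '.') - 1 LenWithSentinel -- 5, the sentinel protects the spaces
```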

Comparisons

The comparison operation points out the differences or similarities existing between two values, involving at minimum two expressions that reduce at runtime to a data type, and the comparison operator. This means that each member of the comparison could include any valid combination of functions, as long as they are reduced to compatible data types. As concerns the comparison of strings, things are relatively simple, the comparison being allowed independently of whether they have fixed or varying length. Relatively simple, because if we had to go into details, then we’d need to talk about character sets (also called character encodings or character maps) and other string goodies the ANSI SQL standard(s) come with, including a set of rules that dictate the behavior of comparisons. So, let’s keep things as simple as possible. As per the above attempt at a definition, a comparison typically implies an equality, respectively a difference, based on the equal (“=”), respectively not equal (“<>” or “!=”) operators. Here are some simple examples:

-- sample comparisons 
SELECT CASE WHEN 'abc' != 'abc ' THEN 1 ELSE 0 END Example1 
, CASE WHEN ' abc' != 'abc' THEN 1 ELSE 0 END Example2 
, CASE WHEN ' ' != '' THEN 1 ELSE 0 END Example3 
-- erroneous NULL comparisons 
, CASE WHEN 'abc' != NULL THEN 1 ELSE 0 END Example4 
, CASE WHEN 'abc' = NULL THEN 1 ELSE 0 END Example5 
-- adequate NULL comparisons 
, CASE WHEN 'abc' IS NOT NULL THEN 1 ELSE 0 END Example6 
, CASE WHEN 'abc' IS NULL THEN 1 ELSE 0 END Example7 
Output:
Example1 Example2 Example3 Example4 Example5 Example6 Example7
0 1 0 0 0 1 0

The first three examples demonstrate again the behavior of leading, respectively trailing spaces. The next two examples, even if they seem quite logical in terms of natural language semantics, are wrong from the point of view of SQL semantics, because a comparison in which one of the values is NULL evaluates to NULL (unknown) rather than to true or false, hence both expressions from the 4th and 5th examples fall through to the ELSE branch and return 0. The last two examples show how NULLs should be handled in comparisons with the help of the IS operator, respectively its negation, IS NOT.
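The same behavior appears when two nullable values are compared with each other. Here is a small sketch of how such comparisons are typically made NULL-safe; the derived table and its column names are hypothetical, used only to have some values to compare:

```sql
-- NULL-safe comparison of two nullable values
SELECT t.Value1
, t.Value2
, CASE WHEN t.Value1 = t.Value2 THEN 1 ELSE 0 END PlainEquality
, CASE WHEN t.Value1 = t.Value2
         OR (t.Value1 IS NULL AND t.Value2 IS NULL) THEN 1 ELSE 0 END NullSafeEquality
, CASE WHEN IsNull(t.Value1, '') = IsNull(t.Value2, '') THEN 1 ELSE 0 END EqualityWithDefault
FROM (VALUES ('abc', 'abc')
     , ('abc', NULL)
     , (NULL, NULL)) t(Value1, Value2);
```

For the last row the plain equality returns 0, while the other two expressions return 1. Please note that the IsNull variant treats a NULL and an empty string as being equal, which may or may not be the intended behavior.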

Like in the case of numeric values, the comparison between two strings can be expressed by using the “less than” (“<”) and “greater than” (“>”) operators, alone or in combination with the equality operator (“<=”, “>=”) or the negation operator (“!>”, “!<”) (see comparison operators in MSDN). Typically a SQL Server database is case insensitive, so there will be no difference between the following strings: “ABC”, “abc”, “Abc”, etc. Here are some examples:

-- sample comparisons (case sensitivity) 
SELECT CASE WHEN 'abc' < 'ABC' THEN 1 ELSE 0 END Example1 
, CASE WHEN 'abc' > 'abc' THEN 1 ELSE 0 END Example2 
, CASE WHEN 'abc' >= 'abc ' THEN 1 ELSE 0 END Example3 
, CASE WHEN 'abc' <> 'ABC' THEN 1 ELSE 0 END Example4 
, CASE WHEN 'abc' > '' THEN 1 ELSE 0 END Example5 
, CASE WHEN ' ' > '' THEN 1 ELSE 0 END Example6 
Output:
Example1 Example2 Example3 Example4 Example5 Example6
0 0 1 0 1 0


The case sensitivity can be changed at column, expression or database level. As we don’t deal with a table and we don’t want to complicate the queries too much, let’s consider changing the sensitivity at database level. So if you are using a non-production database, try the following script in order to enable, respectively to disable the case sensitivity:

--enabling case sensitivity for a database 
ALTER DATABASE <database name>  
COLLATE Latin1_General_CS_AS  

--disabling case sensitivity for a database 
ALTER DATABASE <database name> 
COLLATE Latin1_General_CI_AS 
 
In order to test the behavior of case sensitivity, first enable the sensitivity and then rerun the previous set of examples (involving case sensitivity).
Output:
Example1 Example2 Example3 Example4 Example5 Example6
1 0 1 1 1 0
After that you can disable the case sensitivity again by running the last script. Please note that if your database has a different collation, you’ll have to change the scripts accordingly in order to point to your database’s collation.
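If changing the collation of a database is not an option, the case sensitivity can also be tested at expression level with the help of the COLLATE clause. The following is a minimal sketch using the same two collations as above; the first column just shows the current collation of the database:

```sql
-- testing case sensitivity without altering the database
SELECT DatabasePropertyEx(DB_NAME(), 'Collation') DatabaseCollation
, CASE WHEN 'abc' = 'ABC' COLLATE Latin1_General_CI_AS THEN 1 ELSE 0 END CaseInsensitive -- 1
, CASE WHEN 'abc' = 'ABC' COLLATE Latin1_General_CS_AS THEN 1 ELSE 0 END CaseSensitive -- 0
```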

Notes:
The queries work also in SQL databases in Microsoft Fabric.

Happy coding!

05 October 2010

🔏MS Office: The Limitations of MS Access Database

In the previous post I highlighted some general considerations on the use of MS Access and Excel as frameworks for building applications. I left many things out for lack of time and space; therefore, as the title reveals, in this post I will focus simply on the limitations of MS Access considered as a database. I considered then that Access is a fairly good database, recommending it for 10-20 concurrent users, which could equate, depending on the case, to a total number of users ranging between 1 and 100. Of course, this doesn’t mean that MS Access can’t do more; it actually supports 255 concurrent users, and with a good design that limit could be reached.

Another important limitation concerns the size of an Access database, set to 2 GB. It used to be more than sufficient a few years back, though nowadays it’s sometimes the equivalent of a month or a year of transactions. I never tried to count how many records an Access database can store, though if I remember correctly, a relatively small to average table of 1,000,000 (10^6) records occupies about 100 MB; by this logic 2 GB could equate to about 20,000,000 (2*10^7) records, the equivalent of a small to average database size. Anyway, the numbers are relative: the actual size also depends on the number of objects the database stores and on the size of the attributes stored. Moreover, even if Access is supposed to have a limit of 2 GB, I met cases in which a database of 1 GB was crashing a lot, needing to be repaired or backed up regularly.
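The back-of-the-envelope estimate above can be reproduced with a short T-SQL query. This is only a sketch: the 100 MB per million records is the figure recalled above from memory, not a measured value, and the variable names are made up for illustration.

```sql
-- rough estimate of the number of records that fit into a 2 GB Access file
DECLARE @MaxSizeMB decimal(10, 2) = 2 * 1024    -- 2 GB expressed in MB
DECLARE @SizePerMillionMB decimal(10, 2) = 100  -- assumed size of 10^6 records, in MB

-- returns 20480000, i.e. roughly 2*10^7 records
SELECT CAST(@MaxSizeMB / @SizePerMillionMB * 1000000 AS bigint) EstimatedRecords
```

Changing the assumed size per million records shows how quickly the estimate shrinks for wider tables.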

Sometimes it could be repaired, other times not; unfortunately the “recovery” built into MS Access can’t be compared with the recovery available in an RDBMS. That’s OK in the end, as even mature databases crash from time to time, though the logs and transaction isolation models allow them to provide high recoverability and reliability, to which scalability, availability, security and manageability are added. If all these are not essential for your database solution, then MS Access is OK, though you’ll have to invest effort in each of these areas when you have to raise your standards.

One of the most painful issues when dealing with concurrent data access is transaction processing, which needs to guarantee the consistency and recoverability of operations. As Access doesn’t handle the transactions itself, the programmer has to do that by using ADO or DAO transactions. As many applications still don’t need pessimistic concurrency, this issue could also be solved with some effort and good row versioning. The security-related issues could likewise be solved programmatically by designing a role-based permission framework, though occasionally it could be breached when the user is aware of the few Access hacks and has direct access to the database.

Manageability usually comes down to controlling resource utilization and monitoring the progress of the actions running on the database. While Access does a relatively good job in what concerns the manageability of its objects, it has no reliable way to control their utilization: when a query runs for too long, the easiest way to solve this is to coldly kill the Access process. I’m not sure if it makes sense to philosophize about Access’ scalability and availability; at least from this point of view it can’t be compared with an RDBMS, for which failover clustering, mirroring, log shipping, online backup and online maintenance in general have an important impact on the two.

Besides the above theoretical limitations, when MS Access is part of your solution it’s always a good idea to know its maximum capacity specifications, this applying to all types of databases or technologies. Most probably you won’t want to realize in the middle of your project, or even later, that you have reached one of these limits. I tried to put together a comparison between the maximum capacity specifications of the 2000, 2007 and 2010 versions of MS Access and, for reference, the same specifications for SQL Server (2000, 2005, 2008 R2). The respective information comes mainly from Microsoft websites, with a few additions from [5] and [6].


| Attribute | MS Access 2000 [1] | MS Access 2007/2010 [2] | SQL Server 2000 [7] | SQL Server 2005 [4] | SQL Server 2008 R2 [3] |
|---|---|---|---|---|---|
| SQL statements size | 64 KB | 64 KB | 64 KB | 64 KB | 64 KB |
| # characters in Memo field | 65535 | 65535 | - | 2^30-1 | 2^31-1 |
| # characters in Text field | 255 | 255 | 8000 | 8000 | 8000 |
| # characters in object name | 64 | 64 | 128 | 128 | 128 |
| # characters in record | 4000 | 4000 | 8000 | 8000 | 8000 |
| # concurrent users | 255 | 255 | | | 32767 |
| # databases per instance | 1 | 1 | 32767 | 32767 | 32767 |
| # fields in index | 10 | 10 | 16 | 16 | 16 |
| # fields in recordset | 255 | 255 | 4096 | 4096 | 4096 |
| # fields in table | 255 | 255 | 1024 | 1024 | 1024/30000 |
| # files per database | 1 | 1 | 32767 | 32767 | 32767 |
| # forced relationships per table | 32 | 32 | 253 | 253 | 253 |
| # indexes per table | 32 | 32 | 250 (1 clustered) | 250 (1 clustered) | 250 (1 clustered) |
| # instances | | | 16 | 50 | 50 |
| # joins in a query | 16 | 16 | 32 | 32 | 32 |
| # levels nested queries | 50 | 50 | 32 | 32 | 32 |
| # nested subqueries | | | 32 | 32 | 32 |
| # objects | 32768 | 32768 | 2147483647 | 2147483647 | 2147483647 |
| # open tables | 2048 | 2048 | 2147483647 | 2147483647 | 2147483647 |
| # roles per database | n/a | n/a | 16379 | 16379 | 16379 |
| # tables in a query | 32 | 32 | 256 | 256 | 256 |
| # users per database | n/a | n/a | 16379 | 16379 | 16379 |
| database size | < 2 GB | < 2 GB | 1048516 TB | 524272 TB | 524272 TB |
| file size (data) | 2 GB | 2 GB | 32 TB | 16 TB | 16 TB |
| file size (log) | n/a | n/a | 32 TB | 2 TB | 2 TB |


To my surprise, the maximum capacity specifications of Access are comparable with those of SQL Server for many of the above attributes. Sure, there is a huge difference in what concerns the number of databases, the database/file size and the number of supported objects, which is quite relevant for the architecture of applications. Several other differences, for example the number of indexes or relationships supported per table, are less important for the majority of solutions. Another fact not captured in the above table is that the number of records in a table is typically limited only by storage. Please note that many important features not available in Access were left out; therefore, for a better overview it is advisable to check the referenced sources directly.

There is one more personal observation for this post. Even if MS Access is great for non-SQL developers given its nice Designer, for SQL developers it lacks a rich editor, the initial formatting of the queries being lost; this, coupled with the poor support for later versions of the ANSI standard, makes Access a tool to avoid.

References:
[1] Microsoft. (2010). Microsoft Access database specifications. [Online] Available from: http://office.microsoft.com/en-us/access-help/access-specifications-HP005186808.aspx (Accessed: 04.10.2010)
[2] Microsoft. (2010). Access 2010 specifications. [Online] Available from: http://office.microsoft.com/en-us/access-help/access-2010-specifications-HA010341462.aspx (Accessed: 04.10.2010)
[3] MSDN. (2010). Maximum Capacity Specifications for SQL Server: SQL Server 2008 R2. [Online] Available from: http://msdn.microsoft.com/en-us/library/ms143432.aspx (Accessed: 04.10.2010)
[4] MSDN. (2010). Maximum Capacity Specifications for SQL Server: SQL Server 2005. [Online] Available from: http://msdn.microsoft.com/en-us/library/ms143432(SQL.90).aspx (Accessed: 04.10.2010)
[5] SQL Server Helper. (2005). SQL Server 2005: Maximum Capacity Specifications. [Online] Available from: http://www.sql-server-helper.com/sql-server-2005/maximum-capacity-specifications.aspx (Accessed: 04.10.2010)
[6] MSDN. (2008). SQL 2005 and SQL 2008 database volume capacity. [Online] Available from: http://social.msdn.microsoft.com/forums/en-US/sqlgetstarted/thread/4225734e-e480-4b21-8cd4-4228ca2abf55/ (Accessed: 04.10.2010)
[7] MSDN. (2010). Maximum Capacity Specifications for SQL Server: SQL Server 2000. [Online] Available from: http://technet.microsoft.com/en-us/library/aa274604(SQL.80).aspx (Accessed: 04.10.2010)
[8] MSDN. (2010). Comparison of Microsoft Access SQL and ANSI SQL. [Online] Available from: http://msdn.microsoft.com/en-us/library/bb208890.aspx (Accessed: 04.10.2010)
