05 October 2010

MS Office: The Limitations of MS Access Database

In the previous post I highlighted some general considerations on the use of MS Access and Excel as frameworks for building applications. I left many things out for lack of time and space; therefore, as the title reveals, in this post I will focus only on the limitations of MS Access considered as a database. I stated there that Access is a fairly good database, recommending it for 10-20 concurrent users, which could equate, depending on the case, to a total of maybe 1-100 users. Of course, this doesn’t mean that MS Access can’t do more: it actually supports 255 concurrent users, and with a good design that limit could be reached.

Another important limitation concerns the size of an Access database, which is capped at 2GB. That used to be more than sufficient a few years back, though nowadays it’s sometimes the equivalent of a month or a year of transactions. I never tried to count how many records an MS Access database could store, though if I remember correctly a relatively small to average table of 1000000 (10^6) records occupies about 100MB; by this logic 2GB could equate to about 20000000 (2*10^7) records, the equivalent of a small to average database. Anyway, the numbers are relative: the actual size depends also on the number of objects the database stores and on the size of the attributes stored. Moreover, even if Access is supposed to have a limit of 2GB, I met cases in which a database of 1GB was crashing a lot, needing to be repaired or backed up regularly.
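The estimate above is simple proportional arithmetic, which can be sketched in a few lines of Python. The function name is made up for illustration, and the figure of 100MB per million records is only my recollection, not a measured value, so the result should be read as an order of magnitude.

```python
# Back-of-the-envelope estimate. The sample figure (10^6 records in about
# 100 MB) is a recollection, not a measurement, so treat the result as an
# order of magnitude only.
MB = 1024 ** 2
GB = 1024 ** 3

def estimated_max_records(size_limit_bytes, sample_records, sample_bytes):
    """Scale a known (records, size) sample up to a size limit."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return size_limit_bytes * sample_records // sample_bytes

records = estimated_max_records(2 * GB, 1_000_000, 100 * MB)
print(f"{records:,}")  # 20,480,000, i.e. roughly 2*10^7
```

In practice indexes, other database objects and wide attributes eat into the 2GB, so the real ceiling is lower than this figure.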

Sometimes the database could be repaired, other times not; unfortunately the “recovery” built into MS Access can’t be compared with the recovery available in a full RDBMS. That’s ok in the end, as even mature databases crash from time to time, though their logs and transaction isolation models allow them to provide high recoverability and reliability, to which add scalability, availability, security and manageability. If all these are not essential for your database solution, then MS Access is ok, though you’ll have to invest effort in each of these areas when you have to raise your standards.

One of the most painful issues when dealing with concurrent data access is transaction processing, which needs to guarantee the consistency and recoverability of operations. As Access doesn’t handle transactions on its own, the programmer has to manage them explicitly using ADO or DAO transactions. As many applications still don’t need pessimistic concurrency, with some effort and good row versioning this issue could be solved as well. The security-related issues could also be solved programmatically by designing a role-based permission framework, though occasionally it could be breached when the user is aware of the few Access hacks and has direct access to the database.
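To make the row versioning idea more concrete, here is a minimal sketch of optimistic concurrency. It uses Python's built-in sqlite3 module as a stand-in for an Access database accessed through ADO or DAO, and the table, column and function names are hypothetical; the same pattern would have to be translated to whatever data access layer the application uses.

```python
import sqlite3

def update_price(conn, product_id, new_price, expected_version):
    """Optimistic update: succeeds only if nobody changed the row meanwhile."""
    with conn:  # explicit transaction: commit on success, rollback on error
        cur = conn.execute(
            "UPDATE product SET price = ?, row_version = row_version + 1 "
            "WHERE id = ? AND row_version = ?",
            (new_price, product_id, expected_version),
        )
    # rowcount is 0 when the version no longer matches (a lost race)
    return cur.rowcount == 1

# Hypothetical table used only for this illustration
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product ("
    "id INTEGER PRIMARY KEY, price REAL, "
    "row_version INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO product (id, price) VALUES (1, 10.0)")
conn.commit()

# Two users read the same row (version 0), then both try to save.
first = update_price(conn, 1, 12.0, expected_version=0)
second = update_price(conn, 1, 15.0, expected_version=0)
print(first, second)  # True False
```

The second writer is not blocked; it is simply told that its copy of the row is stale and has to re-read and retry, which is usually acceptable when conflicts are rare.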

Manageability usually comes down to controlling resource utilization and monitoring the progress of the actions running on the database. While Access does a relatively good job in what concerns the manageability of its objects, it has no reliable way to control their utilization: when a query runs for too long, the easiest way to solve this is to coldly kill the process belonging to Access. I’m not sure it makes sense to philosophize about Access’ scalability and availability; at least from this point of view it can’t be compared with an RDBMS, for which failover clustering, mirroring, log shipping, online backup and, in general, online maintenance have an important impact on the two.

Besides the above theoretical limitations, when MS Access is part of your solution it’s always a good idea to know its maximum capacity specifications, which applies to all types of databases and technologies. Most probably you won’t want to realize in the middle of your project, or even later, that you have reached one of these limits. I tried to put together a comparison between the maximum capacity specifications for the 2000, 2007 and 2010 versions of MS Access and, for reference, the same specifications for SQL Server (2000, 2005, 2008 R2). The information comes mainly from Microsoft websites, with a few additions from [5] and [6].


| Attribute | MS Access 2000 [1] | MS Access 2007/2010 [2] | SQL Server 2000 [7] | SQL Server 2005 [4] | SQL Server 2008 R2 [3] |
|---|---|---|---|---|---|
| SQL statements size | 64KB | 64KB | 64KB | 64KB | 64KB |
| # characters in Memo field | 65535 | 65535 | - | 2^30-1 | 2^31-1 |
| # characters in Text field | 255 | 255 | 8000 | 8000 | 8000 |
| # characters in object name | 64 | 64 | 128 | 128 | 128 |
| # characters in record | 4000 | 4000 | 8000 | 8000 | 8000 |
| # concurrent users | 255 | 255 | | | 32767 |
| # databases per instance | 1 | 1 | 32767 | 32767 | 32767 |
| # fields in index | 10 | 10 | 16 | 16 | 16 |
| # fields in recordset | 255 | 255 | 4096 | 4096 | 4096 |
| # fields in table | 255 | 255 | 1024 | 1024 | 1024/30000 |
| # files per database | 1 | 1 | 32767 | 32767 | 32767 |
| # forced relationships per table | 32 | 32 | 253 | 253 | 253 |
| # indexes per table | 32 | 32 | 250 (1 clustered) | 250 (1 clustered) | 250 (1 clustered) |
| # instances | | | 16 | 50 | 50 |
| # joins in a query | 16 | 16 | 32 | 32 | 32 |
| # levels nested queries | 50 | 50 | 32 | 32 | 32 |
| # nested subqueries | | | 32 | 32 | 32 |
| # objects | 32768 | 32768 | 2147483647 | 2147483647 | 2147483647 |
| # open tables | 2048 | 2048 | 2147483647 | 2147483647 | 2147483647 |
| # roles per database | n/a | n/a | 16379 | 16379 | 16379 |
| # tables in a query | 32 | 32 | 256 | 256 | 256 |
| # users per database | n/a | n/a | 16379 | 16379 | 16379 |
| database size | <2GB | <2GB | 1048516TB | 542272TB | 542272TB |
| file size (data) | 2GB | 2GB | 32TB | 16TB | 16TB |
| file size (log) | n/a | n/a | 32TB | 2TB | 2TB |


To my surprise, the maximum capacity specifications of Access are comparable with those of SQL Server for many of the above attributes. Sure, there is a huge difference in what concerns the number of databases, the database/file size and the number of supported objects, which are quite relevant for the architecture of applications. Several other differences, for example the number of indexes or relationships supported per table, are less important for the majority of solutions. Another fact that doesn’t appear in the above table is that the number of records in a table is typically limited by storage. Please note that many important features not available in Access were left out; therefore, for a better overview it is advisable to check the referenced sources directly.

There is one more personal observation for this post. Even if MS Access is great for non-SQL developers given its nice Designer, for SQL developers it lacks a rich editor, the initial formatting of the queries being lost; this, doubled by the poor support for later versions of the ANSI standard, makes Access a tool to avoid.

References:
[1] Microsoft. (2010). Microsoft Access database specifications. [Online] Available from: http://office.microsoft.com/en-us/access-help/access-specifications-HP005186808.aspx (Accessed: 04.10.2010)
[2] Microsoft. (2010). Access 2010 specifications. [Online] Available from: http://office.microsoft.com/en-us/access-help/access-2010-specifications-HA010341462.aspx (Accessed: 04.10.2010)
[3] MSDN. (2010). Maximum Capacity Specifications for SQL Server: SQL Server 2008 R2. [Online] Available from: http://msdn.microsoft.com/en-us/library/ms143432.aspx (Accessed: 04.10.2010)
[4] MSDN. (2010). Maximum Capacity Specifications for SQL Server: SQL Server 2005. [Online] Available from: http://msdn.microsoft.com/en-us/library/ms143432(SQL.90).aspx (Accessed: 04.10.2010)
[5] SQL Server Helper. (2005). SQL Server 2005: Maximum Capacity Specifications. [Online] Available from: http://www.sql-server-helper.com/sql-server-2005/maximum-capacity-specifications.aspx (Accessed: 04.10.2010)
[6] MSDN. (2008). SQL 2005 and SQL 2008 database volume capacity. [Online] Available from: http://social.msdn.microsoft.com/forums/en-US/sqlgetstarted/thread/4225734e-e480-4b21-8cd4-4228ca2abf55/ (Accessed: 04.10.2010)
[7] MSDN. (2010). Maximum Capacity Specifications for SQL Server: SQL Server 2000. [Online] Available from: http://technet.microsoft.com/en-us/library/aa274604(SQL.80).aspx (Accessed: 04.10.2010)
[8] MSDN. (2010). Comparison of Microsoft Access SQL and ANSI SQL. [Online] Available from: http://msdn.microsoft.com/en-us/library/bb208890.aspx (Accessed: 04.10.2010)

02 October 2010

🧭Business Intelligence: Is MS Access or Excel the Answer to your Problems?


Introduction 

That’s one of the topics that has followed me for years, as I was quite often asked by customers to provide an MS Access or MS Excel solution as an answer to a business need. The beauty of this question is that there is no right answer and, as I have stressed on several occasions, there is not always a straightforward answer to such a question in IT, the feasibility of an IT solution relying on many variables formulated typically in terms of business and IT requirements.

When a customer requests an MS Access or Excel solution outside of the Office paradigm, I’m rather circumspect, and this not because they are not great tools, but because they are not adequate for all purposes. I even recommend the two for personal or small-scale solutions, though their applicability should stop right there.

A personal solution is an application developed for personal use, for example to store and maintain the data for a report, to process data automatically, or to automate some other tasks. By small-scale solutions I’m referring to the following types of applications:
- applications of basic to average complexity, which don’t require complex design and could be developed by a developer with average skills.
- applications that target a small number of users, usually a small group of at most 10-20 concurrent users; occasionally it may be a whole department, or it could be cross-departmental as long as the previously mentioned conditions are met.

A Short Review 
 
MS Excel is the perfect tool for storing non-relational tabular data, manipulating data manually or with the help of formulas, doing data analysis with pivoting and charting, or querying various data sources. Its extensibility, based on its DOM (Document Object Model), VBA (Visual Basic for Applications) and its IDE (Integrated Development Environment), Forms, add-ins, in-house or third-party developed libraries, and the template and wizard-based approach, makes Excel a powerful development environment. I would say that Excel’s weakness resides in its intrinsic design: in the DOM, which lacks a rich event model, and in the fact that Excel is mainly a tool for data entry, analysis and reporting, the other types of functionality coming second. Excepting a few new features built into Excel itself, the important new functionality comes as add-ons: the SQL Server-based data mining add-in, MS SharePoint Server-based Web Services features like multiuser collaboration, slicers and a few others.

The extensibility capabilities mentioned above are not only a particularity of Excel but apply to the whole Office family: Access, Word, Outlook, PowerPoint, and even Visio if the “extended family” is considered, each of them with its role. Access’ role is that of a flexible relational data storage, querying and reporting solution, its strength lying mainly in the ease of providing a simple UI (User Interface) for maintaining and navigating the data, and in the ease of pulling data from various sources for further analysis. As in the case of Excel, Access’ weakness resides in its DOM and in the fact that it’s not a full RDBMS (Relational Database Management System), with all the consequences deriving from it.

Programming for the Masses/Citizens
 
The great thing about VBA is that non-developers, too, could successfully venture into developing Office-based applications, the possibility of learning from the code generated by the “Record Macro” functionality keeping the learning curve small. Enabling “non-developers” to build applications makes Office a powerful and at the same time dangerous tool, because such applications could be easily misused. Misuse here refers to the fact that people often attempt to build in Excel or Access complex applications that sooner or later break apart under their complexity, and that organizations end up with a multitude of such applications with no control over their existence, maintenance, security, etc.

Unfortunately the downsides of such applications are discovered late in the process, when the intended functionality is not available, and one ends up reinventing the wheel and patching up functionality in a jumble. With some hard work you could achieve functionality similar to the one available in powerful frameworks like .Net, WPF, WCF or Silverlight, to mention the Microsoft technologies I’m somewhat acquainted with. VBA is great, but with time it became less powerful than VB, C# or C++ (the comparison between VBA and C++ is a little forced), to mention the most important programming languages for writing managed code in .Net. The barriers between the capabilities of the two types of programming languages are somewhat broken down by the possibility of developing add-ins and libraries for MS Office or of using the Office DOM in .Net applications, though few (non-)programmers venture on this path.

The Architectural Perspective 
 
There is another important architectural perspective: separating the data storage and eventually the data processing from presentation. When using Access or Excel the data storage could also be separated from presentation, though I’ve seen few solutions doing that, the three layers usually coexisting within the same tier. An Access solution could be split in two, one part for the database and the other for UI and processing, allowing more flexibility in what concerns the architecture, security, version management, etc.

Access is good for data presentation and rapid prototyping, though the concept and the data controls are quite old, having several limitations when compared with similar controls available, for example, in .Net. The advantage of using simple drag-and-drop or wizards in Access is long gone, the same functionality existing also in Visual Studio (Express), an environment in which applications could be built with drag-and-drop and wizards too, in addition taking advantage of further built-in features. The database layer could be replaced with a full RDBMS, just as the presentation layer could be replaced with a .Net UI. Isn’t it much easier then to build the architecture around a .Net UI and an RDBMS?!
 
Excel is considered by many as a (relational) database, but is it really so? It’s true that the data could be stored in tabular format, in which a sheet plays the role of a table, and queried through the various drivers available, though no primary key is available, there is less control over the data entered, and many other features available in an RDBMS need to be provided programmatically, again reinventing the wheel. Same as in the case of Access, Excel could be considered for data storage and presentation, its functionality being reduced when compared with that of Access.

Many people are used to the data entry mechanism available in Excel, especially in what concerns data manipulation, and want similar functionality in other tools. If this was Excel’s advantage some time ago, that’s no longer valid: several rich data grids offer similar data entry functionality which, with some effort, could simulate to an acceptable degree the functionality of Excel, and they could also provide richer validation functionality.

It’s all about Costs 
 
In the past MS Excel and Access were quite cheap as "development platforms" when compared with purchasing an existing IDE, especially when we consider their extensibility through VBA and the IDE’s availability, thus the favorable functionality vs. extensibility ratio. Since express (aka community) editions of powerful IDEs like Visual Studio were introduced, as well as open source IDEs and development frameworks that provide rich capabilities, the balance of forces has changed dramatically in favor of the latter.

Today you could put together a small-scale application with a minimum of investment, which sometimes makes the use of Office tools outside of Office solutions obsolete. The pool of software tools and technologies has changed considerably in the past years, but the mentality in what concerns IT infrastructure and software development has changed less. It’s true that sometimes organizations lack the resources who could architect and design such solutions and rely mainly on external resources, or find it much easier to rely on the programming skills of an employee who knows “exactly” what's needed, as in theory it would be much easier to attempt solving a problem directly than to write the requirements down.

In VBA’s favor comes also the fact that software solutions normally evolve and need to be changed in order to reflect business or philosophy changes, it being much easier to have such changes introduced directly by the employee who built the application than to start a whole project for this purpose. This aspect is rooted in another perspective: sometimes organizations ignore the software needs, so it falls to the employees to find cheap and fast ways of automating tasks in particular and solving work-related problems in general, Excel or Access being quite handy for this purpose. Sure, you can do almost anything in Excel/Access too, but at what cost?

The Strategic Context 
 
Several times I heard people talking about replacing a collection of Excel sheets with an Access solution. I know that in the absence of adequate solutions people end up storing various types of data in Excel sheets, duplicating data, losing control over versions and data quality, making data insecure, unavailable or unprocessable. Without a good data management and infrastructure strategy, the situation doesn’t change significantly by using an Access solution.

It’s true that the data could be more easily stored in a central place, some validation could result in better data quality, while security, availability and data maintainability could see some improvements too; however, the gain is insignificant when compared with the capabilities of a full-featured RDBMS. Even if a company doesn’t have the resources to invest in a mature RDBMS like Oracle or SQL Server, there are also the Express editions of the respective databases, and several other free solutions exist on the market, especially in the area of open source. On the other side, it’s true that MS Access, through its easy-to-use SQL Designer, allows people to build queries with simple drag-and-drop and limited SQL knowledge, though its value is relative.

Talking about data management strategy, it concerns mainly data quality as a function of its 6 main dimensions (accuracy, conformity, consistency, completeness, duplicates, referential integrity), to which add data actuality, accessibility, security, relevance, usability, and so on. The main problem with personal solutions is that they lead to data and logic duplication, and even when such solutions are consolidated in one form or another, their consolidation and integration is quite complex because you have to consider not only the various designs but also the overall requirements from a higher perspective. On the other side, it’s difficult to satisfy the needs of all the people in an organization, so in one form or another duplication of data is inevitable, with direct or indirect implications on data quality. Some effort and a good strategy are required in what concerns these aspects, finding the balance between the various requirements and the number of solutions to satisfy them.

Reformulating the Question

How can we determine which tool or set of tools is appropriate for our problem? Normally the answer to this question depends on the needed functionality. The hard road in answering this question is to identify all the requirements and the features available in the various tools, weigh both of them, and decide what works best. Unfortunately that’s not an easy task: one needs to consider not only current but also future requirements, the organization’s strategy, and whatever else might come around.

Reports, best practices, lessons learned or other types of succinct content might help as well in taking a decision without going too deep into analyzing features and requirements. Sometimes a gut feeling might work as well, especially when it comes from a person with experience in the field. Other times you don’t have too many options, with time, resources, knowledge, IT infrastructure, philosophy or politics reducing your room for maneuver and decision. In the end we learn by doing, by fighting with the constraints and problems we have; hopefully we learn also from our own or others’ mistakes…

PS: Even if I have accumulated several good years of developing solutions based on Excel and Access, I can’t pretend that I know their full potential, especially when judged from the perspective of the new features introduced with Excel 2007 or 2010, and even more so when considering their integration with SharePoint, SQL Server or other similar platforms. The various software tools and platforms existing on the market allow people to mix functionality in theoretically unlimited ways, the separation of functionality between layers, SaaS (software as a service) and data meshes changing the way we program and perceive software development.


30 September 2010

🚧Project Management: Project Management [PM] (Definitions)

"Project Management is an approach which has been developed and successfully employed for more than a decade to systematically plan and control efforts which have an identifiable end item, the production of which involves complexity, risk and some sort of a fixed deadline." (G F Blanchard & D L Cook, "Project Management and Educational Change", 1973)

"Project management is the planning, organizing, directing, and controlling of company resources for a relatively short-term objective that has been established to complete specific goals and objectives. Furthermore, project management utilises the systems approach to management by having functional personnel (the vertical hierarchy) assigned to a specific project." (Harold Kerzner, "Project Management for Executives", 1982)

"Project management is the application of knowledge, skills, tools, and techniques to a broad range of activities in order to meet the requirements of the particular project. A project is a temporary endeavor undertaken to achieve a particular aim." (Irene Chen, "Instructional Design Methodologies", 2008)

"The act of planning and managing a series of tasks and agreed-upon deliverables. It is based on standard methods and processes." (Steven Haines, "The Product Manager's Desk Reference", 2008)

"The application of knowledge, skills, tools, and techniques to project activities to meet the project requirements." (Cynthia Stackpole, "PMP® Certification All-in-One For Dummies®", 2011)

"The process of managing a project. Deploys tools, processes and attitudes that deal with the complexity and uncertainty inherent in a project." (Mike Clayton, "Brilliant Project Leader", 2012)

"A profession in the programmatic sciences that focuses on the design and completion of work plans to ensure the efficient delivery of specified work products on time, on budget, and to specifications. Project management is practiced by project managers who apply professional knowledge, skills, tools, and techniques to ensure the effective management of operational uncertainty and the resolution of operational complexity." (Richard J Heaslip, "Managing Complex Projects and Programs", 2014)

"The application of knowledge, skills, tools, and techniques to project activities to meet the project requirements." (PMI, "Implementing Organizational Project Management: A Practice Guide", 2014)

"The process of managing required project activities and tasks in a formal manner to complete a project; performed primarily by the project manager." (Robert F Smallwood, "Information Governance: Concepts, Strategies, and Best Practices", 2014)

"Project management is the application of methods, tools, techniques and competencies to a project. Project management includes the integration of the various phases of the project life cycle [...]" (ISO 25000)

29 September 2010

🛠️SQL Server Administration: Troubleshooting Who is Active

I checked ReplTalk’s post on “waiting” statements, realizing that I had had to write a similar query when troubleshooting the performance of running queries in general, and blocking issues in particular. The novelty resides in Adam Machanic’s comment on the respective post, the link leading me to the sp_whoisactive stored procedure, which provides a collection of wait stats, lock information, outer command/batch, etc. The procedure is impressive from the point of view of its complexity (and number of lines), the author having spent many hours on this piece of art. The parameters are well documented but the logic is not; some additional information could be gathered from the several posts on this topic, however that won’t make a reviewer’s work easier. After a first review I’m missing a few examples and scenarios in which to use the procedure; without them I’m kind of starting from 0.

I am a little reluctant to use other people’s code when the code is too complex and requires too much time to be understood, especially when the respective code is not documented, requires a whole manual and/or isn’t easy to use. These aspects don’t diminish the merit of the person who wrote it, but they decrease the overall utility of the respective code, because when people don’t understand the usability and/or range of applicability of a piece of code, the risk of misusing it is quite high. From my perspective, I find it more useful to have in place a set of queries and procedures that help you troubleshoot performance issues than to have a stored procedure that attempts to do everything, requires many parameters, and takes time to be debugged and understood.

SQL Server lacks documentation describing in detail how its internals actually work. The stored procedure could be used to understand how various situations are handled, one of the first problems residing in identifying the piece of code corresponding to each scenario, “isolating” it and making the most of it. Most probably a developer who’s trying to understand the respective code will need to break the code down into chunks, like I did, and use them in isolation in order to verify the code’s logic. Sure, the approach followed depends also on each person’s skills, level of understanding and exploration techniques. As too much philosophy is not good, especially when digressing from the subject, I will end by inviting the reader to look over the respective stored procedure and, why not, to improve it and eventually derive more knowledge out of it.

30 August 2010

📊Business Intelligence: Scorecards (Definitions)

"Online, real time reporting to monitor performance against targets." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"An approach to rating risk used in many areas of risk management. A scorecard is most commonly used for rating customer risk (often individuals) as the probability of default in financial risk management." (Annetta Cortez & Bob Yehling, "The Complete Idiot's Guide To Risk Management", 2010)

"An application that helps organizations measure and align the strategic and tactical aspects of their businesses, comparing organizational and individual performance to goals and targets." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"A collection of information - organized in a single view - that tracks an organization's progress toward a specific goal." (Ken Withee, "Microsoft® Business Intelligence For Dummies®", 2010)

"Performance management tools that help managers track performance against strategic goals" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"Scorecards are used by enterprises to measure the progress against the enterprise strategy. Scorecards represent performance trends over a period of time such as monthly/quarterly/yearly; whereas dashboards indicate the status of a performance metric at a given point in time. In contrast, dashboards are used to represent actual granular data, they contain data that is more recent than that of scorecards." (Saumya Chaki, "Enterprise Information Management in Practice", 2015)

"A graphical representation of valid values for a source column or output of a rule in profile results. Use scorecards to measure data quality progress." (Informatica)

"A representation of summarized performance measurements representing progress towards the implementation of long-term goals. A scorecard provides static measurements of performance over or at the end of a defined interval." (ISTQB)

"A scorecard is a visual display of the most important information needed to achieve one or more objectives, consolidated and arranged on a single screen so the information can be monitored at a glance. Unlike dashboards that display actual values of metrics, scorecards typically display the gap between actual and target values for a smaller number of key performance indicators." (Intrafocus)

17 August 2010

SQL Server: Model Database (Definitions)

"A template for new user databases. The installation process creates model when SQL Server is installed. Each time the create database command is issued, SQL Server makes a copy of model and extends it to the size requested, if necessary." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"An SQL Server–supplied database that provides a template for new user databases. Each time a database is created, SQL Server makes a copy of the model database, sizes it to the requested size, and files the user database with the system tables and objects currently defined in the model database." (Patrick Dalton, "Microsoft SQL Server Black Book", 1997)

"A database installed with SQL Server that provides the template for new user databases. Each time a database is created, SQL Server makes a copy of the model and then extends it to the size requested. A new database cannot be smaller than the model. The model database contains the system tables required for each user database. You can modify the model to add objects that you want in all newly created databases." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A database installed with SQL Server that provides the template for new user databases. SQL Server 2000 creates a new database by copying in the contents of the model database and then expanding it to the size requested." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"The template used when creating new databases. Any new database is created from a copy of the Model database and then modified from there." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"A database that is installed with Microsoft SQL Server and that provides the template for new user databases. SQL Server creates a database by copying in the contents of the model database and then expanding the new database to the size requested." (Microsoft, "SQL Server 2012 Glossary", 2012)

15 August 2010

Data Security: SQL Injection I - Introduction

Introduction

If you are working in IT, most probably you’ve already heard about SQL Injection; if not, then it might be a good idea to ask your colleagues and eventually your IT manager whether your company has any policies related to it. If you are working for a software vendor or a consultancy company, then SQL Injection countermeasure techniques might rank quite high in the list of best practices adopted by your company for the development of Web/Desktop Applications, Web Services or database-related logic. If you are working for a company other than the two mentioned above and have various software projects on the roll or already in house, then most probably you’ll have to ask whether the software vendors you are working with have taken the SQL Injection threats into consideration and proofed their solutions against them. On the contrary, if you have nothing to do with IT at all, it might still be a good idea to ask your IT department whether they have anything in place related to SQL Injection: a Security Policy, security best practices, etc.

Definition

  Wikipedia defines SQL Injection as “a code injection technique that exploits a security vulnerability occurring in the database layer of an application” [3], code injection being defined as “the exploitation of a computer bug that is caused by processing invalid data” [4]. For a programmer the definition is acceptable, though for other types of professionals it might not be so clear what it is about, especially when they are not familiar with IT terminology. I find clearer the definition provided by J. Clarke et al., who in their book SQL Injection Attacks and Defense define SQL Injection as “the vulnerability that results when you give an attacker the ability to influence the Structured Query Language (SQL) queries that an application passes to a back-end database” [2]. I will slightly modify the last definition and say that SQL Injection is a security vulnerability residing in the possibility to alter the intended behavior of the SQL queries passed to the database.
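
  To make the definition more concrete, here is a minimal T-SQL sketch of how the intended behavior of a query can be altered. The #Users table, its records and the @name value are hypothetical and serve only as illustration. The last statement shows one common countermeasure, passing the input as a parameter through sp_executesql; it is not meant as a complete defense.

```sql
-- hypothetical table, used only for illustration
CREATE TABLE #Users (
  id int NOT NULL
, name nvarchar(50) NOT NULL);

INSERT INTO #Users (id, name) VALUES (1, N'Alice');
INSERT INTO #Users (id, name) VALUES (2, N'Bob');

-- hypothetical input, as it might arrive from an application form
DECLARE @name nvarchar(100);
SET @name = N'x'' OR 1=1 --';

-- vulnerable: the input is concatenated into the statement's text
DECLARE @sql nvarchar(500);
SET @sql = N'SELECT id, name FROM #Users WHERE name = N''' + @name + N'''';
PRINT @sql;  -- shows how the WHERE clause was altered
EXEC (@sql); -- returns all the records, not only the matching ones

-- safer: the input is passed as a parameter and treated only as a value
EXEC sp_executesql
  N'SELECT id, name FROM #Users WHERE name = @name'
, N'@name nvarchar(100)'
, @name = @name; -- returns no records

-- the temporary table is cleaned up when the session ends
```

  In the first case the quote contained in the input closes the string literal, so the remainder of the input becomes part of the statement itself. In the second case the whole input is compared as a value against the name column.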

Some Background

    At the beginning of our century, with the increasing importance of Web Applications, whose availability over WAN/Internet networks brought new security issues, SQL Injection became a really hot topic given the damage such techniques could do to an application. With just a few tricks the “hacker” could get into the application and even into the machine hosting the database, thus gaining possession of sensitive information and, above all, the possibility of damaging the database. J. Clarke et al. remark that the first connection between web applications and SQL injection is widely accredited to Rain Forest Puppy, who in an article titled “NT Web Technology Vulnerabilities” (see the “ODBC and MS SQL server 6.5” section), written in 1998 for Phrack, an e-zine written by and for hackers [2], described the behavior specific to SQL Injection in relation to MS SQL Server 6.5.
 
    I remember when my boss broke the news to us that we had to urgently protect our applications against SQL Injection, which meant redesigning some of the database objects and components. I was then in my first or second year of professional experience, so the topic was new and quite intriguing not only for me but also for my colleagues, some of whom had a few more years of professional programming experience than I did and, I hope I’m not mistaken, none (or few) of whom had actually heard about it. It was interesting to see how such simple techniques could do so much damage. At that time there were few articles on SQL Injection and specific countermeasure techniques, not to mention best practices, so we were kind of groping in the dark in finding a countermeasure to the problem.

State of the Art

    Since then, the number of search engine hits on the topic has become quite impressive: many professionals approached the problem in their own way, vendors started to design their solutions and to make programmers aware of best practices that minimize this type of security threat, books were written on the topic, and awareness increased among developers and other types of IT professionals. Even if considerable effort has been made in this direction, and the topic appears often on blogs, there are still many web sites not designed to address SQL Injection concerns. In 2007, WhiteHat Security placed SQL Injection in 5th position in its top of vulnerabilities, estimating that 1 out of 5 web sites is vulnerable to SQL Injection [1]. In the 8th Website Security Statistics Report, based on 2009 data provided by WhiteHat Security [5], and as it seems also in the 9th report [6], SQL Injection remains in the same position; what’s interesting to remark is the split per scripting technology provided in [6]:
 
SQL Injection - Statistics WhiteHat

   In the Web Hacking Incident Database maintained by the Web Application Security Consortium, SQL Injection accounts for 17.97% of the total of 512 incidents reported under the top attack methods. Even if the number of reported attacks is insignificant in relation to the number of sites available on the web, the percentage of cases seems to be in agreement with the numbers provided in the WhiteHat Security reports.

Resources

   If the topic made you curious, you can find out more with just a simple search on the Web. There are many professionals who wrote on this topic, however it’s a good idea to start directly with the resources provided by the RDBMS vendors. For example, Microsoft’s Security Research & Defense blog has, in its SQL Injection Attacks post, an interesting list of resources on this topic; a nice document on ‘How to write SQL injection proof PL/SQL’ comes from Oracle; an interesting presentation on ‘SQL Injection Myths and Fallacies’ was given at the MySQL Conference & Expo, etc.

References:
[1] WhiteHat Security. (2007). Website Security Statistics Report. [Online] Available from: http://www.whitehatsec.com/home/assets/WPStatsreport_100107.pdf (Accessed: 15 August 2010)
[2] J. Clarke et al. (2009). SQL Injection Attacks and Defense. Elsevier. ISBN: 978-1-59749-424-3
[3] Wikipedia. (2010). SQL Injection. [Online] Available from: http://en.wikipedia.org/wiki/Sql_injection (Accessed: 15 August 2010)
[4] Wikipedia. (2010). Code Injection. [Online] Available from: http://en.wikipedia.org/wiki/Code_injection (Accessed: 15 August 2010)
[5] WhiteHat Security. (2009). Website Security Statistic Report, 8th Ed. [Online] Available from: http://www.whitehatsec.com/home/assets/WPstats_fall09_8th.pdf (Accessed: 15 August 2010)
[6] WhiteHat Security. (2010). Website Security Statistic Report, 9th Ed. [Online] Available from: http://www.slideshare.net/jeremiahgrossman/whitehat-security-9th-website-security-statistics-report-3995771 (Accessed: 15 August 2010)

13 August 2010

💎SQL Reloaded: Temporary Tables vs. Table Variables and TempDB

Yesterday, I started to read Ken Henderson’s book, SQL Server 2005 Practical Troubleshooting: The Database Engine, diving directly into the tempdb topic (Chapter 9, Tempdb issues). He mentions that metadata are created in system tables when a temporary table is created (see p.415). This means that when a temporary table is created, records must be created in tempdb’s sys.tables and sys.columns system tables, the respective records being deleted when the table is dropped. As I had never looked at what the metadata of a temporary table look like, I thought it was worth a try, and here’s the code created for this purpose:

-- creating the temporary table 
CREATE TABLE #temp ( 
   id int NOT NULL 
, value nvarchar(50) NOT NULL) 

-- retrieving the metadata 
SELECT t.name table_name 
, s.name column_name 
, t.type  
, t.type_desc  
, t.create_date  
FROM tempdb.sys.tables t 
    JOIN tempdb.sys.columns s 
      ON t.object_id = s.object_id 
WHERE t.name LIKE '%temp%' 

-- dropping the temporary table 
-- DROP TABLE #temp  -- see the 2nd note!

temp vs variable tables - temporary example 

Note:
By widening the table_name column in the output, it can be seen that the object name corresponding to the temporary table is a combination of the table’s name and, according to K. Henderson, the number designating the connection that owns the table.
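
As an alternative to the LIKE pattern, which may also match temporary tables created by other sessions, the table could be identified through its object id. A minimal sketch, assuming the #temp table created above still exists in the current session:

```sql
-- retrieving the metadata only for the current session's #temp table 
SELECT t.name table_name 
, LEN(t.name) name_length -- the stored name includes a system-generated suffix 
, t.object_id 
, t.create_date 
FROM tempdb.sys.tables t 
WHERE t.object_id = OBJECT_ID(N'tempdb..#temp') 
```

The same approach doesn’t work for a table variable, as its name can’t be resolved through OBJECT_ID.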

If the temporary table is stored in tempdb and metadata are stored about it, what happens with a table variable? Here’s the answer:

-- creating the table variable 
DECLARE @temp TABLE( 
  id int NOT NULL  
, value nvarchar(50) NOT NULL) 


-- retrieving the metadata 
SELECT t.name table_name 
, s.name column_name 
, t.type  
, t.type_desc  
, t.create_date  
FROM tempdb.sys.tables t 
     JOIN tempdb.sys.columns s 
      ON t.object_id = s.object_id 
WHERE t.name LIKE '%#%' 
   AND DateDiff(ss, t.create_date, GetDate()) BETWEEN -2 AND 2

temp vs variable tables - table variable example

As can be seen, I had to put in a little more effort in order to see a table variable’s metadata. As there is no name that could be used to identify the table, the object’s name being stored as a hex number, I had to restrict the list of tables by using the creation timestamp. Except for the name, the metadata stored about the two types of tables are identical for the same table definition. Actually, their metadata are similar to those of a “standard” table:

-- creating a "standard" table 
CREATE TABLE temp( 
  id int NOT NULL 
, value nvarchar(50) NOT NULL) 

-- retrieving the metadata 
SELECT t.name table_name 
, s.name column_name 
, t.type  
, t.type_desc  
, t.create_date  
FROM sys.tables t 
     JOIN sys.columns s 
      ON t.object_id = s.object_id 
WHERE t.name LIKE '%temp%' 

-- dropping the table 
-- DROP TABLE temp -- see the 2nd note!

temp vs variable tables - standard table example

Notes: 
(1) For exemplification I used a restricted list of attributes; when comparing the metadata of the various tables, a SELECT * statement could be used instead. The above examples also reflect the differences in declaring the three types of tables.
(2) Microsoft recommends not dropping temporary tables explicitly, but letting SQL Server handle the cleanup automatically, thus taking advantage of the Optimistic Latching Algorithm, which helps prevent contention on TempDB [1].

Last updated: Oct-2024

References:
[1] Haripriya SB (2024) Do NOT drop #temp tables (link)

10 August 2010

SQL Server New Features: Information Schema Views

    In the last post, Database Object Dependencies – An introduction, I talked about metadata, defined in common understanding as “data about data”, and about its meaning in the context of databases as “information (data) stored about data, the structures or objects related to data”. SQL92, SQL99, SQL2003, and I suppose also SQL2008 and the upcoming versions of the ISO standard, define an information schema called INFORMATION_SCHEMA, supposed to contain a minimum of metadata about a database’s underlying objects, namely view descriptors, one base table descriptor and several domain descriptors [1]. The information schema is independent of the databases’ internal system metadata implementations, a fact that makes it a valuable approach for retrieving common metadata; at least from this perspective, it is possible to write vendor-independent code as long as the vendor adheres to and implements the ISO/IEC 9075 standard(s).

    SQL Server 2008 makes the Information Schema Views available through the INFORMATION_SCHEMA schema, which exists in each database and stores information about all the database objects contained in the respective database. In order to retrieve the views available under the INFORMATION_SCHEMA schema and their definitions, the following query could be used:

-- retrieving objects' definition for INFORMATION_SCHEMA 
SELECT s.name [schema_name] 
, o.name [object_name] 
, sm.definition  
FROM sys.all_sql_modules sm 
    JOIN sys.all_objects o 
       ON sm.object_id = o.object_id  
   JOIN sys.schemas s 
      ON o.schema_id = s.schema_id  
WHERE s.name = 'INFORMATION_SCHEMA' 
ORDER BY o.name 
 
INFORMATION_SCHEMA definitions

    Unfortunately there are no descriptions stored together with the database objects, and even if most of the view names are self-describing, there are a few cases in which more information is required. Here is the list of the views together with their descriptions, as given in the corresponding MSDN sections, which provide more information on each view’s definition:

No.  View                     Description
1.   CHECK_CONSTRAINTS        Returns one row for each CHECK constraint
2.   COLUMN_DOMAIN_USAGE      Returns one row for each column that has an alias data type
3.   COLUMN_PRIVILEGES        Returns one row for each column that has a privilege that is either granted to or granted by the current user in the current database
4.   COLUMNS                  Returns one row for each column
5.   CONSTRAINT_COLUMN_USAGE  Returns one row for each column that has a constraint defined on it
6.   CONSTRAINT_TABLE_USAGE   Returns one row for each table that has a constraint defined on it
7.   DOMAIN_CONSTRAINTS       Returns one row for each alias data type that has a rule bound to it
8.   DOMAINS                  Returns one row for each alias data type
9.   KEY_COLUMN_USAGE         Returns one row for each column that is constrained as a key
10.  PARAMETERS               Returns one row for each parameter of a user-defined function or stored procedure
11.  REFERENTIAL_CONSTRAINTS  Returns one row for each FOREIGN KEY constraint
12.  ROUTINES                 Returns one row for each stored procedure and function
13.  ROUTINE_COLUMNS          Returns one row for each column returned by table-valued functions
14.  SCHEMATA                 Returns one row for each schema
15.  TABLE_CONSTRAINTS        Returns one row for each table constraint
16.  TABLE_PRIVILEGES         Returns one row for each table privilege that is granted to or granted by the current user in the current database
17.  TABLES                   Returns one row for each table
18.  VIEW_COLUMN_USAGE        Returns one row for each column that is used in a view definition
19.  VIEW_TABLE_USAGE         Returns one row for each table that is used in a view
20.  VIEWS                    Returns one row for each view

       As stated above, the information schema is available for each database, therefore the information schema views return only the records available in the current database to which the current user has access.
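
    As an illustration of how the views can be used, here is a minimal sketch that lists the columns of the base tables from the current database. It relies only on the COLUMNS and TABLES views listed above; the selection of attributes is just an example.

```sql
-- listing the columns of the base tables from the current database 
SELECT c.TABLE_SCHEMA 
, c.TABLE_NAME 
, c.COLUMN_NAME 
, c.DATA_TYPE 
, c.CHARACTER_MAXIMUM_LENGTH 
, c.IS_NULLABLE 
FROM INFORMATION_SCHEMA.COLUMNS c 
    JOIN INFORMATION_SCHEMA.TABLES t 
      ON c.TABLE_SCHEMA = t.TABLE_SCHEMA 
     AND c.TABLE_NAME = t.TABLE_NAME 
WHERE t.TABLE_TYPE = 'BASE TABLE' 
ORDER BY c.TABLE_SCHEMA 
, c.TABLE_NAME 
, c.ORDINAL_POSITION 
```

    The join with the TABLES view is needed because the COLUMNS view also returns the columns of views.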

Notes:
1. Beyond their informative character, the INFORMATION_SCHEMA definitions could be used to understand more about the system tables used as source for the respective views and the relations between them.
2. Microsoft recommends not using the INFORMATION_SCHEMA views in order to determine the schema of an object, and using the sys.objects system catalog view instead [2], a warning that appears on several pages of the information schema documentation.
3. As it seems, SQL Server 2008 doesn’t implement all the views defined in the information schema, for example ASSERTIONS, CHARACTER_SETS, COLLATIONS, TRANSLATIONS, SQL_LANGUAGES, etc. It’s also interesting to note that the respective information is available in .Net through the OleDbSchemaGuid class (see also this).
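
In relation to the second note, the schema of an object could be determined from the catalog views along the following lines. This is only a sketch, and the object name used in the filter is hypothetical:

```sql
-- retrieving the schema of an object from the catalog views 
SELECT SCHEMA_NAME(o.schema_id) [schema_name] 
, o.name [object_name] 
, o.type_desc 
FROM sys.objects o 
WHERE o.name = 'temp' -- hypothetical object name 
```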

References:
[1] ISO/IEC. (1992) ISO/IEC 9075:1992,  Database Language SQL. [Online] Available from: http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt (Accessed: 8 August 2010)
[2] MSDN. (2010). Querying the SQL Server System Catalog. [Online] Available from: http://msdn.microsoft.com/en-us/library/ms189082.aspx (Accessed: 9 August 2010)

07 August 2010

Database Design: Object Dependencies (Part I - An Introduction)

Data Management Series

Introduction

Around the various data islands existing out there and the models that support them, a whole range of database objects (views, stored procedures, user-defined functions) and other types of non-database objects (classes, strongly-typed datasets, reports, ad-hoc queries, etc.) are created. With each reference to a database object a database dependency, or simply dependency, is created between the database object and the objects that reference it; thus any change occurring in a database object could impact its various referents, resulting in broken links, invalid calls or any type of error that might break the calling applications or the isolated pieces of code (e.g. reports, ad-hoc queries, SQL scripts-based logic).

Tracking Database Dependencies

Many organizations document such dependencies in data dictionaries or other types of similar documentation, one of the reasons being the easier identification of the objects impacted by changes occurring in the database structure. One of the problems is that the documentation is often application-oriented, thus targeting the application using the data, and if there are multiple applications consuming the same data, then it’s not so easy to aggregate all the dependencies, especially when they are stored in Excel files, dispersed documents or repositories with (complicated) access permissions.

That’s one of the reasons for which an organization might consider storing in the source database as much as possible of the business logic directly related to data. Encapsulating queries and procedural logic in views, stored procedures, user-defined functions or other similar objects seems a good idea in order to reduce the maintenance of code, to hide the complexity of a database from its consumers (users, services, web/desktop applications, etc.), and for several other considerations.

The most important of these considerations is the fact that databases store not only the respective objects and statistics about them, but could also store the dependencies between them, making easier the impact analysis or any type of analysis based on the dependencies between objects. It’s at the discretion of developers, architects or any other professionals with decision power whether they want to take advantage of such functionality.

Foreign Key Constraints

The simplest and most natural dependency information to store are the primary-foreign key relations implemented in the form of constraints. The foreign key constraints, as they are called, identify and enforce the relationship between two tables: “identify” because they make the relation explicit, and “enforce” because they check the validity of the foreign-primary key value pairs when records are inserted, updated or deleted, thus enforcing the referential integrity of the database.

When a deletion is attempted on a record, the database engine checks whether there is any dependent record (in the same or another table) that references the respective record; if a foreign key constraint is defined and such a record exists, the deletion is aborted and an error message is raised. A check is also performed when a record is inserted or updated in the child table, the respective action being aborted if the foreign key reference is not valid.
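
The behavior described above can be observed with a minimal T-SQL sketch. The dbo.Parent and dbo.Child tables and the FK_Child_Parent constraint are hypothetical names used only for illustration, and the constraint uses the default behavior, without any cascading action defined:

```sql
-- hypothetical parent and child tables
CREATE TABLE dbo.Parent (
  id int NOT NULL PRIMARY KEY);

CREATE TABLE dbo.Child (
  id int NOT NULL PRIMARY KEY
, parent_id int NOT NULL
, CONSTRAINT FK_Child_Parent FOREIGN KEY (parent_id)
    REFERENCES dbo.Parent (id));

INSERT INTO dbo.Parent (id) VALUES (1);
INSERT INTO dbo.Child (id, parent_id) VALUES (10, 1); -- valid reference

-- inserting an invalid reference: there is no parent with id = 2
BEGIN TRY
  INSERT INTO dbo.Child (id, parent_id) VALUES (11, 2);
END TRY
BEGIN CATCH
  PRINT ERROR_MESSAGE(); -- the insert was aborted
END CATCH;

-- deleting a record that is still referenced by a child record
BEGIN TRY
  DELETE FROM dbo.Parent WHERE id = 1;
END TRY
BEGIN CATCH
  PRINT ERROR_MESSAGE(); -- the deletion was aborted
END CATCH;

-- cleaning up: the child table must be dropped first
DROP TABLE dbo.Child;
DROP TABLE dbo.Parent;
```

The order of the two DROP statements is itself a consequence of the dependency: the parent table can’t be dropped as long as a foreign key constraint references it.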

On the other hand, a foreign key constraint could bring additional complexity during migration tasks, though with a little effort and a good architecture the overhead can be minimized. In addition, foreign key constraints could be used by third-party tools to provide some degree of automation when joining tables or for other purposes.

Object Dependencies

More complex dependencies are the ones between database objects: views, stored procedures, user-defined functions or tables. In some cases it is enough to see that there is a dependency between two objects, though in more complex situations it would be useful to know which specific attributes of the referenced objects are used, especially when using the metadata for automation tasks. By creating the dependency tree, the tree of objects resulting from the dependencies between the various database objects, it’s possible to provide more accurate impact assessments.
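
As a sketch of how such information could be retrieved, starting with SQL Server 2008 the sys.sql_expression_dependencies catalog view records the references found in object definitions. The first query lists the direct dependencies, while the second walks the dependencies upwards from a starting object with a recursive query. The dbo.SomeTable name is hypothetical, and the queries are an illustration rather than a complete impact analysis (e.g. references made from outside the database are not covered):

```sql
-- direct dependencies recorded in the current database
SELECT OBJECT_SCHEMA_NAME(d.referencing_id) referencing_schema
, OBJECT_NAME(d.referencing_id) referencing_object
, d.referenced_schema_name
, d.referenced_entity_name
FROM sys.sql_expression_dependencies d
ORDER BY referencing_schema
, referencing_object;

-- objects that depend directly or indirectly on a given object
WITH deps (referencing_id, referenced_id, lvl) AS (
  SELECT d.referencing_id
  , d.referenced_id
  , 1
  FROM sys.sql_expression_dependencies d
  WHERE d.referenced_id = OBJECT_ID(N'dbo.SomeTable') -- hypothetical object
  UNION ALL
  SELECT d.referencing_id
  , d.referenced_id
  , p.lvl + 1
  FROM sys.sql_expression_dependencies d
      JOIN deps p
        ON d.referenced_id = p.referencing_id
  WHERE d.referencing_id <> d.referenced_id -- skipping self-references
)
SELECT lvl
, OBJECT_NAME(referencing_id) referencing_object
, OBJECT_NAME(referenced_id) referenced_object
FROM deps
ORDER BY lvl
, referencing_object;
```

If the objects reference each other circularly, the recursion stops with an error once the default limit of 100 levels is reached, therefore the second query should be treated as a starting point.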

Until now only the dependencies between database objects were considered, though, as highlighted above, there are many other objects stored outside of a database that reference database objects. It makes sense to have a global repository in which to store information about dependencies, preferably a relational database which could be easily interrogated using simple flat or hierarchical queries.

Metadata

The object definitions, statistics, dependencies and other types of information stored about data or the structures or objects related to data are encompassed under the denomination of metadata, which in common understanding is defined as “data about data”. The metadata could be used not only as input for impact analysis but also for automating business logic, functionality that opens new perspectives in development. Einstein’s belief that “problems cannot be solved by the same level of thinking that created them” is reflected in the world of databases by the fact that the metadata stored about database objects help to solve problems related to the objects and the data the databases contain. For example, during a data migration project the two database structures could be mapped at table and attribute level, thus making it possible to create validation rules in an automated manner.

Even if databases come with a predefined structure for storing metadata, the various solutions developed on top of such databases require additional metadata to be stored. In theory it would be great if the databases’ metadata structures could be extended for this purpose, though the risks involved in altering such structures lead to the existence of parallel metadata repositories, in which an important percentage of the database’s metadata is duplicated.

Beyond Database Dependencies

Talking about data mappings, integration projects and integration functionality/features rely heavily on them, and they involve a degree of automation too. The integration of data doesn’t necessarily occur only at application level; in the context of the web’s evolution, the tendency is to link and integrate the various data islands (see linkeddata.org), especially the ones with a public character, and thus provide cross-database functionality. Many of the problems such an approach implies are solved at metadata level, new metadata and dependency levels being required for this purpose.

Created: Aug-2010, Last Reviewed: Mar-2024

01 August 2010

🛠️SQL Server Administration: Resources for Troubleshooting I

    Last week the SQL Server troubleshooting-related topics landed on my desk again, and it’s actually quite an important topic. As I have disparate links, I thought it makes sense to bring them together in a post.

    When searching for information it’s always a good idea to start with the documentation or the support site. As concerns SQL Server troubleshooting, Microsoft has several valuable resources on performance issues, application performance, ad-hoc queries, blocking, stored procedure recompilation, clusters, etc. Even if outdated, the Improving .Net Application Performance and Scalability resource, with a chapter on SQL Server Performance and ADO.NET, could also be of interest. Other resources can be found on Technet, for example Troubleshooting Performance Problems in SQL Server 2005, a similar document being available for SQL Server 2008 on MSDN. As a way to avoid the need for troubleshooting, it makes sense to also check the SQL Server Best Practices.

    There are several good books on this topic that I heartily recommend; the first on my list is the book of C. Bolton et al., Professional SQL Server 2008 Internals and Troubleshooting. What’s interesting to note is that the book is available to browse and read on Scribd as part of an eBook deal with Wiley, a deal that includes several other programming and non-programming books (see Wiley’s profile on Scribd). The code for several chapters from C. Bolton’s book is available on Wiley’s website. Of special interest are the chapters on DMVs (dynamic management views), because performance troubleshooting often comes down to searching for hints in SQL Server’s internal tables. Another book, actually a booklet, on DMVs comes from Redgate, the SQL Server DMV Starter Pack, which is also available together with a list of 30 sample scripts downloadable as a zip file.

   Talking about scripts, there are many DMV scripts available in Technet’s Script Repository. If you want to get an overview of your SQL Server configuration and health, you could check the Diagnostic Information Queries available from Glenn Berry’s blog for SQL Server 2005, respectively SQL Server 2008 and 2008 R2.
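
   To give an idea of what such a DMV-based script looks like, here is a minimal sketch that lists the cached queries that consumed the most CPU time. It is not taken from any of the resources mentioned above, and running it requires the VIEW SERVER STATE permission:

```sql
-- top 10 cached queries by total CPU time 
SELECT TOP (10) qs.total_worker_time 
, qs.execution_count 
, qs.total_elapsed_time 
, qs.last_execution_time 
, st.text -- the text of the whole batch 
FROM sys.dm_exec_query_stats qs 
     CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st 
ORDER BY qs.total_worker_time DESC 
```

   The statistics cover only the plans currently found in the cache, therefore the results reflect the activity since the respective plans were cached.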

