SQL Troubles: database architecture

Showing posts with label database architecture. Show all posts

05 December 2009

Database Management: Databases (An Introduction)

So, you’ve heard the reporting guy or somebody else talking about getting some data or a report from the database, or you found out that you can’t use one of the fancy applications your company has in place just because the database is not available. Dam, that database must be something important!

Thus you may wonder what a database is, and, with a few clicks, you find out in Wikipedia what all is about, a database is “an integrated collection of logically related records or files consolidated into a common pool that provides data for one or more multiple uses” [1].

This definition doesn’t clear up things at all, isn’t it? Terms like “integrated collection of logical related records” or “files consolidated into a common pool” even if they seem semantically right seems to be hard to digest. Without pretending to give a better definition, I would define a database as a logical and physical structure used to contain data in a consistent form. For sure, this definition won’t revolutionize the world of databases, and most probably might be other similar definitions out there, better formulated and sustained.

Actually also this definition might need some clarifications, I’m talking about a logical structure because there is a logic on how the data are stored, physical structure because the data are stored on a physical device (e.g. computer memory, hard disk or any other type of storage device). I used contained and not stored, because containment imply certain control over the structure in comparison with the simple storage of data. Ok, this being said, somebody would question: hey, also a delimited text file can store data in a consistent form, what’s the difference between a database and a delimited text file?

At a first view, I would say the difference resides in context, in the fact that a database “exists” in the context of a database management system (DBMS), a (software) system that provides a mechanism for storage, retrieval, modification and management of data. There are DBMS that store their data as delimited flat files, known also as flat file database, however such simple structures hardly cope with the requirements of modern DBMS, that need to provide a scalable, reliable and secure storage system. Why is a delimited flat file a simple structure?! This affirmation needs indeed some further explanation…

A delimited text file is a text file in which the chunks of data are delimited by special characters and stored in a tabular format with rows and columns, much like in Excel if you want, though Excel offers a better visual structure that facilitates data visualization and manipulation. The delimited text file supposes the existence of at least two types of delimiters, one to delimit the columns, usually a colon, semicolon or pipe, and one to delimit the rows, usually a combination of carriage return and line feed. Unfortunately such a structure imposes one important problem: what happens with the chunks of text containing a character or a set of characters used already as delimiters?

Therefore the text is typically encompassed between two quotes, and in case a double quote exists already in a text, then the contained double quote it’s duplicated facilitating thus text’s interpretation. There could be even more problems, given the two systems of writing a number, comma versus dot, could be tricky to use a comma delimited file, semicolon or tab would be a better delimiter. An application that reads such a file would need thus to understand the delimiters used, and even more the format in which numeric and date values are stored, of whether the first row contains the column names, being required to store such metadata in a second file.

In contrast, a database stores its data in multiple tables designed around logical entities (e.g. Vendors, Customers, etc.) with attributes (e.g. Name, Address, City, Country, etc) translated into columns and the actual values forming a record in a table, resulting thus again a row/column structure for each table. In a flat file database, each table is stored in one file, while in a typical database the tables are stored together in one file, this implying the existence of another delimiter for tables themselves, not to mention that each table has its own structure with different column names, though the columns names, their data types and several types of metadata on columns and tables are stored in auxiliary tables.

And that’s not all, normally a simple search functionality might be enough in order to find a value in a text file, though for databases that’s something complicated to achieve because first of all a “search” is performed against one or more tables, and even against two or more databases, thus first of all, for efficiency, is needed a mechanism that reflects where a table begins, secondly it makes sense to have an explicit and/or implicit unique identifier (UID) for a record, that would allow identifying a record in a table in a unique manner.

Such unique identifiers might be a combination of one or more columns, the best candidates from a performance standpoint being (positive) integer values, though text and as types of values could qualify as such too. A table could have more than one such unique identifiers, for example an integer running number (sequence) used especially for fast retrieval of records, and another attribute (e.g. Material Number) or a combination of several attributes (e.g. Material Number, Vendor, Serial Number) from a entity standpoint. One of unique identifiers, typically single attributes, is a candidate for the primary key, used by other tables to reference a record from the respective table.

Why is needed such a mechanism when in theory one table can store all the data, much like in Excel fashion?! It’s mainly a question of design, storage and maintenance efficiency, being more feasible of storing redundant data in a table of their own and referencing the corresponding record from the main table, having thus a relationship between the two tables, the identifier that references the primary key of the table thus formed being called the foreign key.

Even if not always feasible, the foreign key could be based on multiple attributes, fact that increases relation’s complexity, because when the data are retrieve, the primary/foreign key columns are used to merge the results from the two tables into a common result set. Even if logical, the mechanism can be become complicated, the relationship needing to be stored together with the actual data in a structure of its own, even more a reference constrain can be enforced in order to assure that references are not invalidated by the deletion of a record.

Tables can become really big, ranging from a few thousand of records to millions and even milliards records, making data retrieval quite complex. Therefore it makes sense to have in place a mechanism that allows faster data retrieval at least for the most used attributes in searching, and that can be achieved with the help of indexes created based on one or more columns, they are stored in a dedicated structure and include a reference to the actual record. Indexes allow a DBMS to perform a search based on the columns used in an index on an optimized index structure rather than performing the search on the table itself, and this, from performance point of view, can make quite a difference for big tables.

Data are retrieved from a database with the help of a query, a well structured statement based on SQL (Structured Language Query) standard, specifying typically the columns to be retrieved, together with the tables they belong and the primary/foreign columns used to build the relation (join) between tables, and the columns on which the search will be performed on. A query can be more complex than that, it can include other statement modifiers, subqueries, views and functions.

A subquery is a query nested inside of another query, referred also as nested query, the nesting can go even several levels, making queries hard to “read” and maintain. Fortunately a database allows encapsulating a query in database objects like views, functions and stored procedures, enabling a better management of queries, facilitating also code reuse.

A database view can be seen as a virtual table based on a single statement query, storing no data on its own and being used to limit the number of columns or records retrieved. In contrast with views, user-defined functions and stored procedures can be based on multi-statement code, the main difference between the two residing in the fact that functions can be used in queries, while stored procedures are executed individually, this implying also some differences on the manner they are executed; there are many more other differences between these two types of database objects, mainly DBMS vendor related.

Another important topic in the world of databases is concurrency, handling “simultaneous” requests coming from multiple “users” and this involves access of the same piece of data by multiple users. A DBMS has to take care in the background of such scenarios, avoid when possible and address locks, restrict users the access to data they are not entitled to see or modify. The access to data and database objects is based on accounts and roles, and even if a user has accidental access to the server on which the database stored, above the binary encoding of data, a DBMS might even encrypt the data and database objects.

As can be seen, there is a whole arsenal of database objects associated with a database, typically stored independently of the actual database that holds the data, in tables spanning over one or more databases that belong to the internal kitchen of the DBMS. This arsenal makes the difference between a simple delimited text file and a database, DBMS vendors building in their solutions even more concepts mechanism in order to address issues like reliability, accessibility, scalability, performance, security, heterogeneity and so on.

Now, after this being said, the Wikipedia definition for a database seems to be relatively accurate, and that up to a point because relatively recently appeared the concept of in-memory database (IMDB) that primarily relies on the main memory for data storage. This doesn’t make the respective definition less valuable, though frankly I prefer my own definition, hopefully I haven’t left anything important out of it.

References:
[1] Wikipedia. 2009. Database. [Online] Available from: http://en.wikipedia.org/wiki/Database (Accessed: 5 December 2009)

31 October 2009

Database Management: Views I

Database Management Series

Typically the RDBMS vendors offer on top of their table-based storage 3 types of database objects used for data access: stored procedures, functions and views, each with their pluses and minuses. The views, same as the functions and stored procedures, can be seen as abstraction layers that stand between the physical database and users, allowing functionality reuse, logical structuring and better code maintenance. As Tony Regerson mentions, logically (and it only is logically) a view can be visualized as a virtual table, but from an optimization perspective a view is not a virtual table – "this is extremely important to remember when designing queries that you want to scale and perform well in a real environment" (T. Regerson, 2007). The view as virtual table is maybe the most realistic definition for a view, without reusing the term of view redundantly, as in “a view offers a view within a data set based on one or more tables”.

Over the years I saw several pros and cons against using views, even several fights on whether to use views or not in database-based development. I’m not an SQL guru, though I often dealt with various applications making heavily use of database side programming, and I’m using the old fashioned views on a daily basis because of one of the below benefits.

Structure the code in logical table-like query-able units that can cover many possibilities of querying the same data and which can be reused in other database objects, including views (nesting views). It can be thus created an abstract model on top of the existing table model, with multiple levels of details and perspectives.

Hide implementation complexity from users, being easier to use than the base tables. There are situations in which users want to query the data by themselves, instead of using an existing report, a simple view can reduce the complexity of a query by hiding the implementation details and it simplifies data manipulation by focusing on a restrained scope specific to the problem is supposed to model. Making use of a often met quip, the users don’t have to “reinvent the wheel”, a view reducing concomitantly the effort and potential mistakes which can occur in queries’ construction; as somebody was remarking sarcastically, “not everybody is a Bill Gates”, so you can’t expect from users to be good query developers and know in and outs of a database model.

When accessing data over MS Access or other similar tools, views might allow reducing the network traffic and increase query performance compared with the queries built on top of linked tables, in theory the fewer linked tables, the less network traffic and better the performance as the heavy processing is moved from the client to the server. Based on their use, the reduction of network traffic and the increased in performance can be relative.

Stored as database objects, views keep code duplication and maintenance at minimum, offering better code traceability and testability. Imagine creating a set of queries for each possible scenario the users might use to query a set of tables, the simplest such scenario being the one in which the only difference resides in the attributes the users need; it’s must easier to create a view with all the attributes, the users selecting then only the attributes they need. In some situations might be necessary to create several views for the same set of tables, when are targeted different levels of details or views. There is another important aspect, the more queries you use on the same topic, the more maintenance work is needed when changes occurred in their logic; not to mention that often queries are run directly by users, the synchronization and testing of the various versions of the same query can become a nightmare. Even more, working with a named entity facilitates the communication between users and developers, allowing to easier identifying the piece of code in discussion.

Views allow changing the order of attributes, adding formatting, calculations and scalar functions on top of the existing attributes, using table-valued functions and views.

Views allow enforcing security at object, vertical (column) and horizontal (rows) level, partitioning thus data access based on the required security levels.

Another “unorthodox” feature of views is that they allow developers to directly insert, update or delete data from tables they are based on, however this can be done under certain circumstances (e.g. such views need to reference directly and individually table’s columns, thus no aggregations and transformations can be performed).

There are several cases against using views, the most important of them stresses the lower views’ performance when compared with the one of queries or stored procedures, however the benefit of caching query-plans can be relative, and the balancing between the cost of flexibility and maintainability vs. the cost of performance can favour the views. I find somehow paradoxical that in applications programming everybody goes for OOP weighting reusability more than performance, often creating allegorical monsters deviating from their purpose, and when it comes to databases, everyone is against views though the performance decrease is not so big, while the structures built derive intrinsically from data. On the other side it’s true that databases have a higher workload and are more sensitive to it, but I think people should learn to make use of tools and knowledge more wisely than hiding behind philosophies.

Tony Regerson considers that views are more difficult to understand when nested, this making tuning and debugging more difficult. I can partially agree with him; from my point of view the code becomes easier to read, understand, maintain and test when logic is split in smaller fragments, and even if it can be more difficult to identify the source of a field from the final query, for this needing to traverse multiple views, the benefits of views can diminish the impact of this downside. It remains the problem of tuning, though actually this reflects the need for smarter database engines, while the other understanding related issues reflect the need for development tools that can identify the source tables and fields from nested objects. Documentation like ERDs and other types of mappings can improve the understanding of nested views.

Several applications I worked with, for example Oracle APPS, built the data access on top of views, each form being based on one or more views, which encapsulate the basic layer of business logic, further processing and customization being done in the form itself. This architecture facilitates to some degree developers’ work. The same views can be used also for the various reports, however such views often have the inconvenience they resume only on the attributes needed by a form, thus the adding of other attributes resume at extending the views, the creation of similar views that include the needing attributes, or at the recombination of views. The extension of standard views is not recommended because there is the risk the vendor might decline the responsibility for the errors occurring in a view, another important risk residing the fact that the views might be replaced with the installation of patches or during version upgrade. The combination of views might lead to the unnecessary reuse of the same table more than once, resulting in performance decrease which can be substantial. The remaining approach, often the best, is to create another view replicating the needed logic, though the developed views need to be synchronized with the standard views in case changes occur.

A similar scenario occurs when users recombine views in order to obtain the attributes needed in their analysis, again is recommended to provide a view which offers the same needed behaviour. The problem is that it’s difficult to identify such issues, because usually the developer has low visibility on how the user is using the tables. This could be theoretically avoided by better communication between users and developers, by moving the code, when possible, from personal databases to production databases.

Doing changes through (updatable) views diminishes the chances of trapping or raising customized errors from code, letting the application or DBMS to handle the issue and all the implications derived from it. This eventual issue could be partially solved by view triggers, however this functionality is provided only by a few vendors. In web or desktop applications, such direct data access increases also the risk for SQL injection attacks, stored procedure based access being more recommended for such architectures.

There are several other aspects which need to be considered with views, facts resulting mainly from the various types of views; I’ll try to summarize them in a second post.

References:
T. Regerson (2007) Views – they offer no optimisation benefits they are simply inline macros use sparingly (link)

SQL Troubles

Pages

05 December 2009

Database Management: Databases (An Introduction)

31 October 2009

Database Management: Views I

About Me