SQL Troubles: DBMS

Showing posts with label DBMS. Show all posts

09 February 2025

🏭🗒️Microsoft Fabric: Sharding [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 9-Feb-2024

[Microsoft Fabric] Data Partitioning (aka Sharding)

{definition} "a process where small chunks of the database are isolated and can be updated independently of other shards" [2]
allows a logical database to be partitioned across multiple physical servers [1]

each partition is referred to as a shard
the largest tables are partitioned across multiple database servers [1]

when operating on a record, the application must determine which shard will contain the data and then send the SQL to the appropriate server [1]

partitioning is based on a Key Value

e.g. such as a user ID

proven technique for achieving data processing on a massive scale [1]

solution used at the largest websites

e.g. Facebook, Twitter
usually associated with rapid growth

⇒ the approach needs to be dynamic [1]

the only way to scale a relational database to massive web use [1]

together with caching and replication [1]

{drawback} involves significant operational complexities and compromises [1]

the application must contain logic that understands the location of any particular piece of data and the logic to route requests to the correct shard [1]
requests that can only be satisfied by accessing more than one shard thus need complex coding as well, whereas on a nonsharded database a single SQL statement might suffice.

{drawback} high operational costs [1]
{drawback} application complexity

it’s up to the application code to route SQL requests to the correct shard [1]

⇒ a dynamic routing layer must be implemented

⇐ most massive websites are adding shards as they grow [1]
layer required to maintain Memcached object copies and to differentiate between the master database and read-only replicas [1]

{drawback} crippled SQL

[sharded database] it is not possible to issue a SQL statement that operates across shards [1]

⇒ usually SQL statements are limited to row-level access [1]
⇒ only programmers can query the database as a whole [1]
joins across shards cannot be implemented, nor can aggregate GROUP BY operations [1]

{drawback} loss of transactional integrity

ACID transactions against multiple shards are not possible and/or not practical [1]

⇐ {exception} there are database systems that support 2PC

involves considerable troubleshooting as conflicts and bottlenecks can occur [1]

{drawback} operational complexity.

load balancing across shards becomes extremely problematic

adding new shards requires a complex rebalancing of data [1]
changing the database schema requires a rolling operation across all the shards [1]

⇒ can lead to transitory inconsistencies in the schema [1]

a sharded database entails a huge amount of operational effort and administrator skill [1]

{concept} CAP (Consistency, Availability, and Partition) theorem

in a distributed database system, one can have at most only two of CAP tolerance [1]
consistency

every user of the database has an identical view of the data at any given instant [1]

availability

in the event of a failure, the database remains operational [1]

partition tolerance

the database can maintain operations in the event of the network’s failing between two segments of the distributed system [1]

{concept} partitioning

{def} core pattern of building scalable services by dividing state (data) and compute into smaller accessible units to improve scalability and performance [5]

⇐ determines that a particular service partition is responsible for a portion of the complete state of the service.

a partition is a set of replicas)

{type} [stateless services] a logical unit that contains one or more instances of a service [5]

partitioning a stateless service is a very rare scenario
scalability and availability are normally achieved by adding more instances
{subtype} externally persisted state

persists its state externally [5]

e.g. databases in Azure SQL Database

{subtype} computation-only services

service that do not manage any persistent state e.g. calculator or image thumbnailing [5]

{type} scalable stateful services

partition state (data)
a partition of a stateful service as a scale unit that is highly reliable through replicas that are distributed and balanced across the nodes in a cluster
the state must be accessed and stored

⇒ bound by

network bandwidth limits
system memory limits
disk storage limits

{scenario} run into resource constraints in a running cluster

{recommendation} scale out the cluster to accommodate the new requirements [4]

{concept}distributed systems platform used to build hyper-scalable, reliable and easily managed applications for the cloud [6]

⇐ addresses the significant challenges in developing and managing cloud applications
places the partitions on different nodes [5]

allows partitions to grow to a node's resource limit

⇐ partitions are rebalances across nodes [5]

{benefit} ensures the continued efficient use of hardware resources [5]

{default} makes sure that there is about the same number of primary and secondary replicas on each node

⇒ nodes that hold replicas can serve more traffic and others that serve less traffic [5]
hot and cold spots may appear in a cluster

⇐ it should be preferably avoided

{recommendation} partition the state so is evenly distributed across all partitions [5]
{recommendation} report load from each of the replicas for the service [5]

provides the capability to report load consumed by services [5]

e.g. amount of memory, number of records
detects which partitions server higher loads than others [5]

⇐ based on the metrics reported

rebalances the cluster by moving replicas to more suitable nodes, so that overall no node is overloaded [5]
⇐ it's not always possible to know how much data will be in a given partition

{recommendation} adopt a partitioning strategy that spreads the data evenly across the partitions [5]

{benefit} prevents situations described in the voting example [5]

{recommendation} report load

{benefit} helps smooth out temporary differences in access or load over time [5]

{recommendation} choose an optimal number of partitions to begin with

⇐ there's nothing that prevents from starting out with a higher number of partitions than anticipated [5]

⇐ assuming the maximum number of partitions is a valid approach [5]

⇒ one may end up needing more partitions than initially considered [5]

⇐ {constraint} the partition count can't be changed after the fact [5]

⇒ apply more advanced partition approaches

e.g. creating a new service instance of the same service type
e.g. implement client-side logic that routes the requests to the correct service instance

Previous Post <<||>> Next Post

References:

[1] Guy Harrison (2015) Next Generation Databases: NoSQL, NewSQL, and Big Data

[2] DAMA International (2017) "The DAMA Guide to the Data Management Body of Knowledge" 2nd Ed

[3] Microsoft Fabric (2024) External data sharing in Microsoft Fabric [link]
[4] Microsoft Fabric (2024) Data sharding policy [link]
[5] Microsoft Fabric (2024) Partition Service Fabric reliable services [link]
[6] MSDN (2015) Microsoft Azure - Azure Service Fabric and the Microservices Architecture [link]

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:

ACID - atomicity, consistency, isolation, durability

2PC - Two Phase Commit

CAP - Consistency, Availability, Partition

07 November 2020

⛁DBMS: Event Streaming Databases (More of a Kafka’s Story)

Event streaming architectures are architectures in which data are generated by different sources, and then processed, stored, analyzed, and acted upon in real-time by the different applications tapped into the data streams. An event streaming database is then a database that assures that its data are continuously up-to-date, providing specific functionality like management of connectors, materialized views and running queries on data-in-motion (rather than on static data).

Reading about this type of technologies one can easily start fantasizing about the Web as a database in which intelligent agents can process streams of data in real-time, in which knowledge is derived and propagated over the networks in an infinitely and ever-growing flow in which the limits are hardly perceptible, in which the agents act as a mind disconnected in the end from the human intent. One is stroke by the fusing elements of realism and the fantastic aspects, more like in a Kafka’s story in which the metamorphosis of the technologies and social aspects can easily lead to absurd implications.

The link to Kafka was somehow suggested by Apache Kafka, an open-source distributed event streaming platform, which seems to lead the trends within this new-developing market. Kafka provides database functionality and guarantees the ACID (atomicity, concurrency, isolation, durability) properties of transactions while tapping into data streams.

Data streaming is an appealing concept though it has some important challenges like data overload or over-flooding, the complexity derived from building specific (business) and integrity rules for processing the data, respectively for keeping data consistency and truth within the ever-growing and ever-changing flows.

Data overload or over-flooding occurs when applications are not able to keep the pace with the volume of data or events fired with each change. Imagine the raindrops falling on a wide surface in which each millimeter or micrometer has its own rules for handling the raindrops and this at large scale. The raindrops must infiltrate into the surface to be processed and find their way to the beneath water flows, aggregating up to streams that could nurture seas or even oceans. Same metaphor can be applied to the data events, in which the data pervade applications accumulating in streams processed by engines to derive value. However heavy rains can easily lead to floods in which water aggregates at the surface.

Business applications rely on predefined paths in which the flow of data is tidily linked to specific actions found themselves in processual sequences that reflect the material or cash flow. Any variation in the data flow from expectations will lead to inefficiencies and ultimately to chaos. Some benefit might be derived from data integrations between the business applications, however applications must be designed for this purpose and handle extreme behaviors like over-flooding.

Data streams are maybe ideal for social media networks in which one broadcasts data through the networks and any consumer that can tap to the network can further use the respective data. We can see however the problems of nowadays social media – data, better said information, flow through the networks being changed as fit for purposes that can easily diverge from the initial intent. Moreover, information gets denatured, misused, overused to the degree that it overloads the networks, being more and more difficult to distinguish between reliable and non-reliable information. If common sense helps in the process of handling such information, not the same can be said about machines or applications.

It will be challenging to deal with the vastness, vagueness, uncertainty, inconsistency, and deceit of the networks of the future, however data streaming more likely will have a future as long it can address such issues by design.

25 July 2019

🧱IT: Blockchain (Definitions)

"A block chain is a perfect place to store value, identities, agreements, property rights, credentials, etc. Once you put something like a Bit coin into it, it will stay there forever. It is decentralized, disinter mediated, cheap, and censorship-resistant." (Kirti R Bhatele et al, "The Role of Artificial Intelligence in Cyber Security", 2019)

"A system made-up of blocks that are used to record transactions in a peer-to-peer cryptocurrency network such as bitcoins." (Murad Al Shibli, "Hybrid Artificially Intelligent Multi-Layer Blockchain and Bitcoin Cryptology", 2020)

"A chain of blocks containing data that is bundled together. This database is shared across a network of computers (so-called distributed ledger network). Each data block links to the previous block in the blockchain through a cryptographic hash of the previous block, a timestamp, and transaction data. The blockchain only allows data to be written, and once that data has been accepted by the network, it cannot be changed." (Jurij Urbančič et al, "Expansion of Technology Utilization Through Tourism 4.0 in Slovenia", 2020)

"A system in which a record of transactions made in Bitcoin or another cryptocurrency is maintained across several computers that are linked in a peer-to-peer network. Amany M Alshawi, "Decentralized Cryptocurrency Security and Financial Implications: The Bitcoin Paradigm", 2020)

"An encrypted ledger that protects transaction data from modification." (David T A Wesley, "Regulating the Internet, Encyclopedia of Criminal Activities and the Deep Web", 2020)

"Blockchain is a decentralized, immutable, secure data repository or digital ledger where the data is chronologically recorded. The initial block named as Genesis. It is a chain of immutable data blocks what has anonymous individuals as nodes who can transact securely using cryptology. Blockchain technology is subset of distributed ledger technology." (Umit Cali & Claudio Lima, "Energy Informatics Using the Distributed Ledger Technology and Advanced Data Analytics", 2020)

"Blockchain is a meta-technology interconnected with other technologies and consists of several architectural layers: a database, a software application, a number of computers connected to each other, peoples’ access to the system and a software ecosystem that enables development. The blockchain runs on the existing stack of Internet protocols, adding an entire new tier to the Internet to ensure economic transactions, both instant digital currency payments and complicated financial contracts." (Aslı Taşbaşı et al, "An Analysis of Risk Transfer and Trust Nexus in International Trade With Reference to Turkish Data", 2020)

"Is a growing list of records, called blocks, which are linked using cryptography. Each block contains a cryptographic hash of the previous block a timestamp, and transaction data. (Vardan Mkrttchian, "Perspective Tools to Improve Machine Learning Applications for Cyber Security", 2020)

"This is viewed as a mechanism to provide further protection and enhance the security of data by using its properties of immutability, auditability and encryption whilst providing transparency amongst parties who may not know each other, so operating in a trustless environment." (Hamid Jahankhani & Ionuț O Popescu, "Millennials vs. Cyborgs and Blockchain Role in Trust and Privacy", 2020)

"A blockchain is a data structure that represents the record of each accounting move. Each account transaction is signed digitally to protect its authenticity, and no one can intervene in this transaction." (Ebru E Saygili & Tuncay Ercan, "An Overview of International Fintech Instruments Using Innovation Diffusion Theory Adoption Strategies", 2021)

"A system in which a record of transactions made in bitcoin or another cryptocurrency are maintained across several computers that are linked in a peer-to-peer network." (Silvije Orsag et al, "Finance in the World of Artificial Intelligence and Digitalization", 2021)

"It is a decentralized computation and information sharing platform that enables multiple authoritative domains, who don’t trust each other, to cooperate, coordinate and collaborate in a rational decision-making process." (Vinod Kumar & Gotam Singh Lalotra, "Blockchain-Enabled Secure Internet of Things", 2021)

"A concept consisting of the methods, technologies, and tool sets to support a distributed, tamper-evident, and reliable way to ensure transaction integrity, irrefutability, and non-repudiation. Blockchains are write-once, append-only data stores that include validation, consensus, storage, replication, and security for transactions or other records." (Forrester)

[hybrid blockchain:] "A network with a combination of characteristics of public and private blockchains where a blockchain may incorporate select privacy, security and auditability elements required by the implementation." (AICPA)

[private blockchain:] "A restricted access network controlled by an entity or group which is similar to a traditional centralized network." (AICPA)

"A technology that records a list of records, referred to as blocks, that are linked using cryptography. Each block contains a cryptographic hash of the previous block, a timestamp and transaction data." (AICPA)

[public blockchain:] "An open network where participants can view, read and write data, and no one participant has control (e.g., Bitcoin, Ethereum)." (AICPA)

01 May 2019

Database Management: Version Management (Part I: SQL Server Feature Bloat)

Database Management Series

In an old SSWUG editorial “SQL Server Feature Bloat” by Ben Taylor, the author raises the question on whether SQL Server features like the support for Unicode, the increase in page size for data storage to 8k or the storage of additional metadata and statistics create a feature bloat. He further asks on whether customers may consider other database solutions, and on whether this aspect is important for customers.

A software or feature bloat is the “process whereby successive versions of a computer program become perceptibly slower, use more memory, disk space or processing power, or have higher hardware requirements than the previous version - whilst making only dubious user-perceptible improvements or suffering from feature creep” (Wikipedia).

Taylor’s question seems to be entitled, especially when is considered the number of features added in the various releases of SQL Server. Independently on whether they attempt to improve performance, extend existing functionality or provide new functionality, many of these features target special usage and are hardly used by average applications that use SQL Server only for data storage. Often after upgrading to a new release, it may happen that the customers see no performance improvement in the features used or the performance even decays, while the new release needs more resources to perform the same tasks. This can make customers wonder on whether all these new features bring any benefit for them.

It’s easy to neglect the fact that the SQL Server is just used as storage layer in an architecture and more likely that some of the problems reside in the business or presentation layers. In addition, not always a solution is designed to take advantage of a database’s (latest) features. Besides, it may happen that the compatibility level is set to a lower value, so the latest functionality won’t be used at all.

Probably the customers hope that the magic will happen immediately after the upgrade. For some features like the ones regarding engine’s optimization are enabled by default and is expected a performance gain, however, to take advantage of the new features the existing applications need to be redesigned. With each new edition it’s important to look at the opportunities provided by the upgrades and analyze the performance benefit as there’s often a trade-off between benefit and effort on one side, respectively between technical advantages and disadvantages on the other.

The examples used by Taylor aren’t necessarily representative because they refer to changes made prior to SQL Server 2005 edition and there are good arguments for their implementation. The storage of additional metadata and statistics is neglectable in comparison with the size of the databases and the benefits, given that the database engine needs statistics so it can operate optimally. SQL Server moved from 2 KB pages to 8 KB pages between versions 6.5 and 7.0 probably because it offers a good performance with efficient use of space. The use of Unicode character set become a standard given that databases needed to support multiple languages.

Feature bloating is not a problem that concerns only SQL Server but also other database products like Oracle, DB2 or MySQL, and other types of software. Customers’ choice of using one vendor’s products over another is often a strategic decision in which the database is just a piece of a bigger architecture. In the TPC-H benchmarks SQL Server 2014 and 2016 scored during the last years higher than the competitors. It’s more likely that customers will move to SQL Server than vice-versa, when possible. Customers expect performance, stability and security and are willing to pay for them, as long the gain is visible.

18 April 2019

🪧Meta-Blogging: Mea Culpa (Part I: Changing the Status Quo)

During the past years I started multiple posts on various programming-related topics though I seldom managed to bring something close to a publishable form. The main reason seems to be the lack of time needed to put an idea into words, to look at it from different perspectives in form of a logical meaningful unit and, last but not the least, make it count. This is accentuated by the fact that each idea pulls another, and often there are so many things to say that it’s hard to find a delimitation between what to be included and what to be left out. In extremis one feels that something is missing.

Often, it's required a certain amount of research needed to validate or support the facts. The knowledge about SQL Server and other DBMS is relative – it can be only relative as long their internals are known only to a certain degree. The relativity is found also in the area of applicability, the usage of a solution over another lying in details. Readers want solid facts while all one can give is a dry “it depends”…

Unfortunately, for a blogger not found close to the source of knowledge, the content posted tends to be third or fourth-hand knowledge and, in one form or the other, just duplication of information. As long content isn’t copied and there’s some personal touch the duplication is not necessarily a bad thing. Duplication makes knowledge more likely to be found as the content is indexed by search engines, however it becomes more difficult to stand in the crowd. To bring something new one must to put existing knowledge into new contexts, to be creative, and this takes time as well.

Without access to a pool of readers and of knowledge for a lone blogger it’s hard to succeed, giving up being just a few posts or a few years away. Of course, life tends to take over. It’s also in human nature to be enthusiastic about an idea and renounce shortly with the first difficulties met. On the other side, often it’s hard to keep or to find the needed motivation, especially when there is little support coming from the blogging platforms, tools creators or content publishers. Not being able to monetize one’s effort makes blogging more of a hobby.

With small exceptions, the investments made in blogging tools are below expectations. It’s frustrating when the tools or the integration between them stopped working and there’s no simple way to overcome this. Some aspects changed with time, however blogging seems to lose in contrast with other forms of media content.

Despite the lack of time and other difficulties I want to write and share my thoughts, my experience, make the time invested in learning and solving problems count. Blogging is also a way to externalize the implicit knowledge, of sharing, of questioning some of the ideas and practices, and ultimately of getting feedback. In this resides the personal value of blogging!

In the fight with time and words, I found myself forced to limit the length of the posts on some random nontechnical topics to 600 words. This number is rooted in the university years, representing the proximate limit of a written assignment to include an acceptable quality and coverage, and involve a bearable amount of effort. 600 is not a perfect number as its leading digit though, for the time being will do.

The challenge is to find a context to express my thoughts and experience without being too boring, without skimming through ideas. Without carrying great expectations, it’s an attempt to change the status quo!

Previous Post <<||>>Next Post

04 April 2011

🔹SQL Server: Fixed Database Role (Definitions)

"A predefined role defined at the database level and existing in each database." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A predefined role that exists in each database. The scope of the role is limited to the database in which it is defined. You use fixed database roles to quickly define permissions within a particular database. Roles can be compared to groups in the Window NT/2000 operating systems." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"These are special groups in each SQL Server database that already have permissions applied and are used to limit the permissions that a user has inside the database." (Joseph L Jorden & Dandy Weyn, "MCTS Microsoft SQL Server 2005: Implementation and Maintenance Study Guide - Exam 70-431", 2006)

"A role that is created by SQL Server 2005 in each database with a set of permissions that are predetermined and cannot be altered. Users can be added to one or more fixed database roles." (Marilyn Miller-White et al, "MCITP Administrator: Microsoft® SQL Server™ 2005 Optimization and Maintenance 70-444", 2007)

"A predefined role that exists in each database. The scope of the role is limited to the database in which it is defined." (Microsoft, "SQL Server 2012 Glossary", 2012)

15 February 2010

🕋Data Warehousing: Aggregation (Definitions)

"A table or structure containing pre-calculated data for a cube. Aggregations support rapid and efficient querying of a multidimensional database. " (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Data that’s the result of applying a process to combine data elements. This is data that’s taken collectively or in summary form. Usually, this is a sum, count, or average of underlying detail data." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"A process where the data is presented in a summary form, such as average." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"Summarized values of a measure." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"A process where the data is presented in a summary form, such as average." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2007)

"Any of a variety of operations that reduces and summarizes large sets of data." (MongoDb, "Glossary", 2008)

"A special type of abstraction relationship that defines a higher-level entity that is an aggregate of several lower-level entities; a 'part-of' type relationship. For example, a bicycle entity would be an aggregate of wheel, handlebar, and seat entities." (Toby J Teorey, ", Database Modeling and Design 4th Ed", 2010)

"1.Generally, the process of gathering into a whole from parts. 2.In data management, a process that transforms atomic data into aggregate-level information by using an aggregation function such as count, sum, average, standard deviation, etc." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A table or structure containing pre-calculated data for an online analytical processing (OLAP) cube. Aggregations support the rapid and efficient querying of a multidimensional database." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A process by which variables are summed based on a classification or temporal hierarchy. Common examples include totaling all sales for a given time period or geographic region." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"The sum or collection of data. Often 'aggregation' is used to summarize sales at various levels of the merchandise, location, and time hierarchy." (Brittany Bullard, "Style and Statistics", 2016)

"The act of combining information from separate sources of a lower classification level that results in the creation of information of a higher classification level, which the subject does not have the necessary rights to access." (Shon Harris & Fernando Maymi, "CISSP All-in-One Exam Guide" 8th Ed., 2018)

18 January 2010

🗄️Data Management: Data Quality Dimensions (Part VI: Referential Integrity)

Data Management Series

Referential integrity, when considered as data quality dimension, refers to the degree to which the values of a key in one table (aka reference value) match the values of a key in a related table (aka the referenced value). Typically, that's assured by design in Database Management Systems (DBMS) using a feature called referential integrity that defines an explicit relationship between the two tables that makes sure that the values remain valid during database changes. Thus, when a record is inserted or updated and a value is provided for the reference value, the system makes sure that the referenced value is valid, otherwise it throws a referential integrity error. A similar error is thrown when one attempts to delete the record with the referenced value as long is referenced by a table on which the relationship was explicitly defined.

Using referential integrity is a recommended technique for assuring the overall integrity of the data in a database, though there are also exceptions when that's not enforced for all tables (e.g. data warehouses) or only for exceptions (e.g. interface tables where records are imported as they are, attribute whose values references data from multiple tables). Therefore, even if there are tables with the referential integrity enforced, don't make the assumption that it applies to all tables!

In relational DBMS there are three types of integrity mentioned – entity, referential and domain integrity. Entity integrity demands that all the tables must have a primary key that contains no Null values. The referential integrity demands that each non-null value of a foreign key must match the value of a primary key [1], while the domain integrity demands that the type of an attribute should be restricted to a certain data type, the format should be restricted by using constraints, rules or range of possible values [2].

Even if not mandatory, all three types of integrity are quintessential for reliable relational databases. When the referential integrity is not enforced at database level or at least in code, when a record from a table is deleted and a foreign key it’s still pointing to it, fact that could lead to unexpected disappearance of records from the system’s UI even if the records are still available.

During conversions or data migrations is important to assure that the various sets loaded match the referential and domain integrity of the database in which the data will be loaded, otherwise the records not respecting the mentioned type of integrity will be rejected. The rejection itself might not be a problem for several records, though when it happens at large scale, then the situations changes dramatically, especially when the system gives no adequate messages for the cause or rejection. A recommended approach is to assure that the scope is synchronized between the various data elements, and that the referential integrity of datasets is validated before the data are loaded in the destination database.

There are several sources (e.g. [3]) that consider Codd’s referential integrity constraint as a type of consistency, in the support of this idea could be mentioned the fact that referential integrity could be used to solve data consistency issues by bringing the various LOV in the systems. Referential integrity is mainly an architectural concept even if it involves the 'consistency' of foreign key/primary key pairs.

Note:
Expect the unforeseeable! It’s always a good idea to check whether the referential integrity is kept by a system – there are so many things that could go wrong! In data migration solutions, data warehouses and more general analytical solutions is a good idea to have in place mechanisms that check for this kind of issues.

Previous Post <<||>> Next Post

Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] Halpin. T. (2001) Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design. Morgan Kaufmann Publishers. ISBN 1-55860-672-6.
[2] MSDN. 2009. Data Integrity. [Online] Available from: http://msdn.microsoft.com/en-us/library/ms184276.aspx (Accessed: 18 January 2009)

[3] Lee Y.W., Pipino L.L., Funk J.D., Wang R.Y. (2006) "Journey to Data Quality", MIT Press. ISBN: 0-262-12287-1

31 December 2009

🛢DBMS: Database Design (Just the Quotes)

"The database is the heart of our information processing systems and not simply a by-product of processing. […] A database that has been properly structured and organized, and which actually represents and contains that data of an enterprise, will stabilize the information processing systems. [...] The use of data is reckoning, the generation of information which will in turn be used to manage the enterprise. The generation of information is the only reason for having a database." (John K Lyon, "What is a Data Base", Bulletin of ACM SIGFIDET 5(1) , 1973)

"A database has structure which exists in three forms. It has a perceived structure, which is the way that each of us looks at the database. […] The physical structure of the database relates to the manner in which data is physically represented within the database in terms of sort sequence, record and field sizes, and its existence on one or more master files. […] The logical structure describes the way things really exist within the environment. The entities, i.e., the things within the enterprise, really exist; they have properties which distinguish them from all the other things within the business; and the relationships which exist among the entities." (John K Lyon, "What is a Data Base", Bulletin of ACM SIGFIDET 5(1) , 1973)

"Two of the most difficult areas of data-base management are the design of an information structure and the reduction of that structure to a data structure which is compatible with and managed by the DBMS. […] Data-base management systems are tools to be applied by the users of these systems to build an accurate and useful model of their organization and its information needs. To accomplish this, the information structure must accurately define and characterize the items of data and the relations among them that are of interest to the users. This is no small task, for it demands a knowledge of the organization and the distribution of information among its various parts." (Robert W Taylor & Randall L Frank, "CODASYL Data-Base Management Systems", 1976)

"A variety of reasons make careful database design essential. These include data redundancy, application performance, data independence, data security, and ease of programming. All are important factors in the data processing environment, and all can be adversely affected by a poor database design." (Mark L Gillenson, "Database: Step by Step", 1985)

"Database design refers to the process of organizing the data fields needed by one or more applications into an organized structure. This structure must foster the required relationships among the fields while conforming to the physical constraints of the particular database management system in use. There are two parts to the process that people usually associate with the term 'database design'. One is 'logical database design' which has two components. The first involves organizing the data fields into nonredundant groupings based on the data relationships. The second involves an initial organizing of those logical groupings into structures based on the nature of the DBMS and of the applications that will use the data. […] The second part of database design is 'physical database design, which refers to refitting the previously described derived structures to conform to the performance and operational idiosyncrasies of the DBMS, again guided by the application's processing requirements." (Mark L Gillenson, "Database: Step by Step", 1985)

"The conceptual model is the focal point of the process of database design. All activities either converge upon or emanate from the conceptual model. All structures, through mappings either into or out of the conceptual model, must be to some extent compatible with it." (Andrew Pletch, "Conceptual Modeling in the Classroom", SIGMOD Record 18(1), 1989)

"Because one has to be an optimist to begin an ambitious project, it is not surprising that underestimation of completion time is the norm." (Fernando J Corbató, "On Building Systems That Will Fail", 1991)

"It is important to emphasize the value of simplicity and elegance, for complexity has a way of compounding difficulties and as we have seen, creating mistakes. My definition of elegance is the achievement of a given functionality with a minimum of mechanism and a maximum of clarity." (Fernando J Corbató, "On Building Systems That Will Fail", 1991)

"A database is basically a collection of related facts (e.g. a company's personnel records, or a bus timetable). Discovering the essential kinds of facts that underlie an application, and the conditions that apply to them, is both interesting and illuminating. The quality of the database design used for these facts and conditions is important. Just as a house built from a good architectural plan is more likely to be safe and convenient for living, a well-designed database simplifies the task of ensuring that its facts are correct and easy to get at." (Terry Halpin, "Conceptual Schema & Relational Database Design" 2nd Ed., 1995)

"If an application requires maintenance and retrieval of large amounts of data, a DBMS offers many advantages over manual record-keeping systems. Operations on data may often be performed faster. Data may be stored compactly on disk, and redundancy may be reduced by data sharing. Many data errors can be avoided by automatic integrity checking. With multi-user systems, access rights to data can be enforced by the system. People can spend more time on creative design rather than on routine tasks more suited to computers."(Terry Halpin, "Conceptual Schema & Relational Database Design" 2nd Ed., 1995)

"No matter how sophisticated the information system, if we give it the wrong picture of our UoD to start with, we can't expect to get much sense out of it. This is one aspect of the gigo (Garbage In Garbage Out) principle. Most of the problems with many database applications can be traced to bad database design." (Terry Halpin, "Conceptual Schema & Relational Database Design" 2nd Ed., 1995)

"When we design a database for a particular application, we create a model of it. Technically, the application area that we are modeling is called the universe of discourse (UoD), since it is the world (or universe) that we are interested in talking (or discoursing) about. Typically the UoD is a 'part' of the 'real world'. To build a good model requires a good understanding of the world we are modeling, and hence is a task ideally suited to people rather than machines. The main challenge is to describe the UoD clearly and precisely. Great care is required here, since errors introduced at this stage filter through to later stages in software development, and the later the errors are detected the more expensive they are to remove." (Terry Halpin, "Conceptual Schema & Relational Database Design" 2nd Ed., 1995)

"A bad logical design means that a good physical design cannot be performed. Good logical design is crucial to good database performance, and a bad logical design will result in a physical design that attempts to cover up the weaknesses in it. A bad logical design is hard to change, and once the system is implemented it will be almost impossible to do so." (Ken England, "Microsoft SQL Server 2000 Performance Optimization and Tuning Handbook", 2001)

"[…] physical database design is not a static, one-off process. Once the database has gone into production, the user requirements are likely to change. Even if they do not, the database data is likely to be volatile, and tables are likely to grow." (Ken England, "Microsoft SQL Server 2000 Performance Optimization and Tuning Handbook", 2001)

"The physical design process is a key phase in the overall design process. It is too often ignored until the last minute in the vain hope that performance will be satisfactory. Without a good physical design, performance is rarely satisfactory and throwing hardware at the problem is rarely completely effective. There is no substitute for a good physical design, and the time and effort spent in the physical design process will be rewarded with an efficient and well-tuned database, not to mention happy users!" (Ken England, "Microsoft SQL Server 2000 Performance Optimization and Tuning Handbook", 2001)

"There is no right way to design a database; there are a number of possible approaches and all these may be perfectly valid. It is sometimes said that performance tuning is an art, not a science. This may be true, but it is important to undertake performance tuning experiments with the same kind of rigorous, controlled conditions under which scientific experiments are performed." (Ken England, "Microsoft SQL Server 2000 Performance Optimization and Tuning Handbook", 2001)

"One problem area for refactoring is databases. Most business applications are tightly coupled to the database schema that supports them. That's one reason that the database is difficult to change. Another reason is data migration. Even if you have carefully layered your system to minimize the dependencies between the database schema and the object model, changing the database schema forces you to migrate the data, which can be a long and fraught task." (Martin Fowler et al, "Refactoring: Improving the Design of Existing Code", 2002)

29 December 2009

🛢DBMS: Database Management Systems (Just the Quotes)

Disclaimer: The following definitions are intended for tracing the long road of defining what Database Management Systems (DBMS) are about, therefore the value of the quotes is regarded from the historical perspective.

"The software that allows one or many persons to use and/or modify this data is a database management system (DBMS). A major role of the DBMS is to allow the user to deal with the data in abstract terms, rather than as the computer stores the data. In this sense, the DBMS acts as an interpreter for a (very) high-level language such as APL, ideally allowing the user to specify what must be done, with little or no attention on the user's part to the detailed algorithms or data representation used by the system. However, in the case of a DBMS, there may be even less relationship between the data as seen by the user and as stored in the computer, than between APL arrays and the representation of these arrays in memory." (Jeffrey D Ullman, "Principles of Database Systems", 1980)

"An integrated data management system, that is a true database system, is one in which data can be held nonredundantly while at the same time, a query which requires a mixture of different kinds of data (such as the query that we've been looking at) can be specified in a single command from the highest level programming interface. Any system that does not have this property really should not call itself a 'database' system; 'file management' system would be a more accurate term." (Mark L Gillenson, "Database: Step by Step", 1985)

"The DBMS is a program (or a set of programs) that allows stored data to be integrated, reduces data duplication, ensures data integrity, eliminates program dependency on file formats, and allows even complicated objects to be easily represented and retrieved. In short, a DBMS is the program that 'processes the database'." (David M Kroenke & Kathleen Dolan , Database Processing: Fundamentals, design, implementation” 3rd Ed., 1988)

"A database management system (DBMS) consists of a collection of interrelated data and a set of programs to access that data. The collection of data, usually referred to as the database, contains information about one particular enterprise. The primary goal of a DBMS is to provide an environment that is both convenient and efficient to use in retrieving and storing database information." (Henry F. Korth & Abraham Silberschatz, "Database System Concepts" 2nd Ed., 1991)

"A database management system is a collection of interrelated files and a set of programs that allow users to access and modify these files. A major purpose of a database system is to provide users with an abstract view of the data. That is, the system hides certain details of how the data is stored and maintained. However, in order for the system to be usable, data must be retrieved efficiently. This concern has lead to the design of complex data structures for the representation of data in the database. Since many database systems users are not computer-trained, the complexity is hidden from them through several levels of abstraction in order to simplify their interaction with the system." (Henry F. Korth & Abraham Silberschatz, "Database System Concepts" 2nd Ed., 1991)

"A DBMS is a large collection of computer processes that can be entirely software or firmware that operate in a generalized or specialized computer that provides languages or mechanisms to: Define the logical database component of a specific database either directly or through reference to the IRDS metadata, that is, the data record types, data record elements, and relationships. Define the physical component of a specific database, that is, the storage structure definition, and the techniques for access strategies, data loading and updates, and database backup. Specify interrogations that access data through one or more types of languages such as host languages like COBOL or FORTRAN, and/or natural languages like query-update, report writers, or procedure-oriented. Specify system control facilities, that is, audit trails, backup and recovery, concurrent operations, and the like." (Michael M Gorman, "Database Management Systems: Understanding and Applying Database Technology", 1991)

"Database packages, more correctly called Database Management Systems or DBMSs, such as Oracle and Microsoft Access provide us with all the facilities necessary to design and use an application database. A DBMS can be viewed as having three components: (1) The user interface is the software that talks' with the user in creating and accessing the database. […] (2) The database engine: this is the software that manages the storage and accessing of the physical data stored on disk. (3) The data dictionary, a repository of information about the application database." (Colin Ritchie, "Relational Database Principles" 2nd Ed., 2002)

"A Database Management System (DBMS) is an organised set of facilities for accessing and maintaining one or more databases. A DBMS is a shell which surrounds a database or series of databases and through which all interactions take place with the database." (Paul Beynon-Davies, "Database Systems" 3rd Ed. 2004)

"A software system that enables users to define, create, and maintain the database and also provides controlled access to this database." (Thomas M Connolly & Carolyn E Begg, "Database Solutions: A step-by-step guide to building databases", 2004)

"A product or tool that manages any kind of database, not just relational databases." (Rod Stephens, "Beginning Database Design Solutions", 2009)

🛢DBMS: Databases (Just the Quotes)

Disclaimer: The definitions below are intended for tracing the long road of defining what databases are about, therefore the value of the quotes should be regarded from the historical perspective.

"Historically, data base systems evolved as generalized access methods [...] As a result, most data base systems emphasize the question of how data may be stored or accessed, but they ignore the question of what the data means to the people who use it." (John F Sowa, "Conceptual Graphs for a Data Base Interface", 1976)

"A database is a set of data whose structure can be described by a connected graph, and whose instances can be accessed simultaneously." (Rob Gerritsen et al, "On Some Metrics for Databases or What is a Very Large Database?", 1977)

"Data […] that is stored more-or-less permanently in a computer we term a database." (Jeffrey D Ullman, "Principles of Database Systems", 1980)

"When we talk informally about a database, we refer to a collection of mutually related data, to the computer hardware that is used to store it, and to the programs used to manipulate it. […] A database is a collection of related data. The data storage for a database is accomplished by the use of one or more files. All files of one database are accessible from one computer or from any interconnected computer if the database is distributed over several computers." (Gio Wiederhold, "Database Design" 2nd Ed., 1983)

"[...] there is a class of methodologies - really more of an environment - for storing, accessing, and in general, managing data which can form the foundation for survival in a world which is going to become ever more data oriented as time goes on. That class of methodologies or environment has come to be known as database." (Mark L Gillenson, "Database: Step by Step", 1985)

"A database is a self-describing collection of integrated records. […] A database is self-describing in that it contains, in addition to application data, a description of its own structure. […] A database is more than a collection of files. A database includes not only files, but also a data dictionary and a description of the relationships among the records in the files. These relationship descriptions are stored and recalled during database. […] A database is a data model of an organization." (David M Kroenke & Kathleen Dolan, , "Database Processing: Fundamentals, design, implementation" 3rd Ed., 1988)

"A database is not a technology. Rather, it is an expression of organization, clarity, and precision. It may or may not be computerized. If it is, it may exist on a microcomputer, a minicomputer, or a large mainframe. Finally, a database may or may not be centralized. However a database is implemented and operated, success is impossible without the codification of and adherence to data semantics, which are the rules for meaning, validity, and usage." (Michael M Gorman, "Database Management Systems: Understanding and Applying Database Technology", 1991)

"What is a database? It's a file - that's an electronic version of information sheets and binders. It enables you to efficiently use your information. […] Just as a binder contains information sheets, a database holds records. […] First, a database enables you to store larger amounts of information than a binder. […] Second, a database enables you to retrieve your information more efficiently, more flexibly, and more creatively. […] Third, a database can manipulate the information it contains. […] Fourth, a database enables you to print your information on Information envelopes, labels, fax cover sheets, reports, and letters." (Guy Kawasaki, "Database 101", 1991)

"When a database is computerized, it represents the automation of the knowledge component of a business, which is manifest through the business's quality operation, planning, and management With a successful database, the managers of a business can research the past, organize the present, and plan for the future." (Michael M Gorman, "Database Management Systems: Understanding and Applying Database Technology", 1991)

"Conceptually, the database is a set of sentences expressing propositions asserted to be true of the UoD [Universe of Discourse]. Since sentences may be added to or deleted from the database, the database may undergo transitions from one state to another. However, at any particular time, the sentences populating the database must individually and collectively conform to the application specific grammar or design plan which is the conceptual schema." (Terry Halpin, "Conceptual Schema & Relational Database Design" 2nd Ed., 1995)

"In using a data base, first look at the metadata, then look at the data. [...] The old computer acronym GIGO (Garbage In, Garbage Out) applies to the use of large databases. The issue is whether the data from the database will answer the research question. In order to determine this, the investigator must have some idea about the nature of the data in the database - that is, the metadata." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"To be worthy of being called a database, a system must have two essential properties. 1. It holds data as an integrated system of records. 2. It contains self-describing information." (Colin Ritchie, "Relational Database Principles" 2nd Ed., 2002)

"A database can be considered as an organised collection of data which is meant to represent some UoD [Universe of Discourse]. […] A database is made up of two parts: an intensional part and an extensional part. The intension of a database is a set of definitions which describe the structure or organisation of a given database. The extension of a database is the total set of data in the database. The intension of a database is also referred to as its schema." (Paul Beynon-Davies, "Database Systems" 3rd Ed. 2004)

"A database in manual terms is analogous to a filing cabinet, or more accurately to a series of filing cabinets. A database is a structured repository for data. The overall purpose of such a repository is to maintain data for some set of organisational objectives. Traditionally, such objectives fall within the domain of administration. Most database systems are built to retain the data required for the running of the day-to-day activities of an organisation. […] Structure normally implies some logical division, usually hierarchical. Hence we speak of a database as being a collection of data structures. […] Each data structure in a database is in turn also a hierarchical collection of data. In manual terms, such data structures would be folders hung in a filing cabinet." (Paul Beynon-Davies, "Database Systems" 3rd Ed. 2004)

"Databases rarely begin their life empty. More often the starting point in their lifecycle is a data conversion from some previously exiting data source. And by a cruel twist of fate, it is usually a rather violent beginning. Data conversion usually takes the better half of new system implementation effort and almost never goes smoothly." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"[...] data conversion is the most difficult part of any system implementation. The error rate in a freshly populated new database is often an order of magnitude above that of the old system from which the data is converted. As a major source of the data problems, data conversion must be treated with the utmost respect it deserves." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"The corporate data universe consists of numerous databases linked by countless real-time and batch data feeds. The data continuously move about and change. The databases are endlessly redesigned and upgraded, as are the programs responsible for data exchange. The typical result of this dynamic is that information systems get better, while data deteriorates. This is very unfortunate since it is the data quality that determines the intrinsic value of the data to the business and consumers. Information technology serves only as a magnifier for this intrinsic value. Thus, high quality data combined with effective technology is a great asset, but poor quality data combined with effective technology is an equally great liability." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"As we know, every action produces an equal and opposite reaction. Adding security reduces performance. Sharding and partitioning the database affords greater performance and distribution but creates complexity that is difficult to manage. Adding robust monitoring can generate huge volumes of log data to be stored, rotated, secured, and cleansed. Keeping the design “simple” often defers the interests of flexibility until later, where it becomes very expensive." (Eben Hewitt, "Technology Strategy Patterns: Architecture as strategy" 2nd Ed., 2019)

22 December 2009

🛢DBMS: Databases (Unconventional Quotes)

"It is almost everywhere the case that soon after it is begotten the greater part of human wisdom is laid to rest in repositories." (Georg C Lichtenberg, Notebook K, 1789-1793)

"An observatory, like any other laboratory of research, may concentrate its attention upon either one of two widely different objects: the accumulation of great stores of data in existing departments of knowledge, or the opening up and exploration of new fields of investigation." (George E Hale, "Ten Years' Work of a Mountain Observatory", 1915)

"As information itself becomes the largest business in the world, data banks know more about individual people than the people do themselves. The more the data banks record about each one of us, the less we exist." (Marshall McLuhan & Wilfred Watson, "From Cliché To Archetype" , 1970)

"Cyberspace. A consensual hallucination experienced daily by billions of legitimate operators, in every nation, by children being taught mathematical concepts. [...] A graphic representation of data abstracted from banks of every computer in the human system. Unthinkable complexity. Lines of light ranged in the nonspace of the mind, clusters and constellations of data." (William Gibson, "Neuromancer", 1984)

"Information is data that has been given meaning by way of relational connection. This 'meaning' can be useful, but does not have to be. In computer parlance, a relational database makes information from the data stored within it. (Russell L Ackoff, "Towards a Systems Theory of Organization", 1985)

"I found myself answering the same questions asked frequently of me by different people. It would be so much easier if everyone could just read my database." (Tim Berners-Lee, "Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web", 1999)

"Perhaps there is supranatural: reason beyond the normal definitions of fact or data-based logic; something that only makes sense if you can see a bigger picture of reality. Maybe that is where faith fits in." (William P Young, "The Shack", 2007)

"Like the French, database programmers speak a slightly different language, and since we’re working with a database we need to learn some vocabulary." (Allen B Downey, "Think Stats: Probability and Statistics for Programmers", 2011)

"We’ve got one huge advantage - people believe what they see in databases. They’ve never learned the most important rule of cyberspace - computers don’t lie but liars can compute." (Terry Hayes, "I Am Pilgrim", 2013)

"Things get even messier when linked into networks, which can literally scatter one’s mind even at today’s rudimentary levels of connectivity. The 'transactive memory system' called Google is already rewiring the parts of our brains that used to remember facts locally; now those circuits store search protocols for remote access of a distributed database. And Google doesn’t come anywhere close to the connectivity of a real hive mind." (Peter Watts, "Echopraxia", 2014)

"Finally, most identity thefts occur when databases are not securely managed. So, my advice? Don’t ever end up in a database." (Joseph Fink & Jeffrey Cranor, "Welcome to Night Vale", 2015)

"Everything is 'smart' now. The library cataloging system is smart, classification and indexing information entered into a uniform online database. People wept and lamented the loss of the old cards, then forgot them. They pretty much forget everything they weep over and lament." (Christopher Sorrentino, "The Fugitives", 2017)

"I'm an oracle of the past. I can accurately predict up to 1 minute in the future, by thoroughly investigating the last 2 years of your life. Also, I look like an old database – flat and full of useless info." (Will Advise & Jarod Kintz, "Nothing is here…")

"Practically all large corporations insure their data bases against loss or damage or against their inability to gain access to them. Some day, on the corporate balance sheet, there will be an entry which reads, 'Information', for in most cases, the information is more valuable than the hardware which processes it." (Grace M Hopper)

"The body of truth has a much larger data base than the conscience of lies." (Anthony T Hincks)

05 December 2009

Database Management: Databases (An Introduction)

So, you’ve heard the reporting guy or somebody else talking about getting some data or a report from the database, or you found out that you can’t use one of the fancy applications your company has in place just because the database is not available. Dam, that database must be something important!

Thus you may wonder what a database is, and, with a few clicks, you find out in Wikipedia what all is about, a database is “an integrated collection of logically related records or files consolidated into a common pool that provides data for one or more multiple uses” [1].

This definition doesn’t clear up things at all, isn’t it? Terms like “integrated collection of logical related records” or “files consolidated into a common pool” even if they seem semantically right seems to be hard to digest. Without pretending to give a better definition, I would define a database as a logical and physical structure used to contain data in a consistent form. For sure, this definition won’t revolutionize the world of databases, and most probably might be other similar definitions out there, better formulated and sustained.

Actually also this definition might need some clarifications, I’m talking about a logical structure because there is a logic on how the data are stored, physical structure because the data are stored on a physical device (e.g. computer memory, hard disk or any other type of storage device). I used contained and not stored, because containment imply certain control over the structure in comparison with the simple storage of data. Ok, this being said, somebody would question: hey, also a delimited text file can store data in a consistent form, what’s the difference between a database and a delimited text file?

At a first view, I would say the difference resides in context, in the fact that a database “exists” in the context of a database management system (DBMS), a (software) system that provides a mechanism for storage, retrieval, modification and management of data. There are DBMS that store their data as delimited flat files, known also as flat file database, however such simple structures hardly cope with the requirements of modern DBMS, that need to provide a scalable, reliable and secure storage system. Why is a delimited flat file a simple structure?! This affirmation needs indeed some further explanation…

A delimited text file is a text file in which the chunks of data are delimited by special characters and stored in a tabular format with rows and columns, much like in Excel if you want, though Excel offers a better visual structure that facilitates data visualization and manipulation. The delimited text file supposes the existence of at least two types of delimiters, one to delimit the columns, usually a colon, semicolon or pipe, and one to delimit the rows, usually a combination of carriage return and line feed. Unfortunately such a structure imposes one important problem: what happens with the chunks of text containing a character or a set of characters used already as delimiters?

Therefore the text is typically encompassed between two quotes, and in case a double quote exists already in a text, then the contained double quote it’s duplicated facilitating thus text’s interpretation. There could be even more problems, given the two systems of writing a number, comma versus dot, could be tricky to use a comma delimited file, semicolon or tab would be a better delimiter. An application that reads such a file would need thus to understand the delimiters used, and even more the format in which numeric and date values are stored, of whether the first row contains the column names, being required to store such metadata in a second file.

In contrast, a database stores its data in multiple tables designed around logical entities (e.g. Vendors, Customers, etc.) with attributes (e.g. Name, Address, City, Country, etc) translated into columns and the actual values forming a record in a table, resulting thus again a row/column structure for each table. In a flat file database, each table is stored in one file, while in a typical database the tables are stored together in one file, this implying the existence of another delimiter for tables themselves, not to mention that each table has its own structure with different column names, though the columns names, their data types and several types of metadata on columns and tables are stored in auxiliary tables.

And that’s not all, normally a simple search functionality might be enough in order to find a value in a text file, though for databases that’s something complicated to achieve because first of all a “search” is performed against one or more tables, and even against two or more databases, thus first of all, for efficiency, is needed a mechanism that reflects where a table begins, secondly it makes sense to have an explicit and/or implicit unique identifier (UID) for a record, that would allow identifying a record in a table in a unique manner.

Such unique identifiers might be a combination of one or more columns, the best candidates from a performance standpoint being (positive) integer values, though text and as types of values could qualify as such too. A table could have more than one such unique identifiers, for example an integer running number (sequence) used especially for fast retrieval of records, and another attribute (e.g. Material Number) or a combination of several attributes (e.g. Material Number, Vendor, Serial Number) from a entity standpoint. One of unique identifiers, typically single attributes, is a candidate for the primary key, used by other tables to reference a record from the respective table.

Why is needed such a mechanism when in theory one table can store all the data, much like in Excel fashion?! It’s mainly a question of design, storage and maintenance efficiency, being more feasible of storing redundant data in a table of their own and referencing the corresponding record from the main table, having thus a relationship between the two tables, the identifier that references the primary key of the table thus formed being called the foreign key.

Even if not always feasible, the foreign key could be based on multiple attributes, fact that increases relation’s complexity, because when the data are retrieve, the primary/foreign key columns are used to merge the results from the two tables into a common result set. Even if logical, the mechanism can be become complicated, the relationship needing to be stored together with the actual data in a structure of its own, even more a reference constrain can be enforced in order to assure that references are not invalidated by the deletion of a record.

Tables can become really big, ranging from a few thousand of records to millions and even milliards records, making data retrieval quite complex. Therefore it makes sense to have in place a mechanism that allows faster data retrieval at least for the most used attributes in searching, and that can be achieved with the help of indexes created based on one or more columns, they are stored in a dedicated structure and include a reference to the actual record. Indexes allow a DBMS to perform a search based on the columns used in an index on an optimized index structure rather than performing the search on the table itself, and this, from performance point of view, can make quite a difference for big tables.

Data are retrieved from a database with the help of a query, a well structured statement based on SQL (Structured Language Query) standard, specifying typically the columns to be retrieved, together with the tables they belong and the primary/foreign columns used to build the relation (join) between tables, and the columns on which the search will be performed on. A query can be more complex than that, it can include other statement modifiers, subqueries, views and functions.

A subquery is a query nested inside of another query, referred also as nested query, the nesting can go even several levels, making queries hard to “read” and maintain. Fortunately a database allows encapsulating a query in database objects like views, functions and stored procedures, enabling a better management of queries, facilitating also code reuse.

A database view can be seen as a virtual table based on a single statement query, storing no data on its own and being used to limit the number of columns or records retrieved. In contrast with views, user-defined functions and stored procedures can be based on multi-statement code, the main difference between the two residing in the fact that functions can be used in queries, while stored procedures are executed individually, this implying also some differences on the manner they are executed; there are many more other differences between these two types of database objects, mainly DBMS vendor related.

Another important topic in the world of databases is concurrency, handling “simultaneous” requests coming from multiple “users” and this involves access of the same piece of data by multiple users. A DBMS has to take care in the background of such scenarios, avoid when possible and address locks, restrict users the access to data they are not entitled to see or modify. The access to data and database objects is based on accounts and roles, and even if a user has accidental access to the server on which the database stored, above the binary encoding of data, a DBMS might even encrypt the data and database objects.

As can be seen, there is a whole arsenal of database objects associated with a database, typically stored independently of the actual database that holds the data, in tables spanning over one or more databases that belong to the internal kitchen of the DBMS. This arsenal makes the difference between a simple delimited text file and a database, DBMS vendors building in their solutions even more concepts mechanism in order to address issues like reliability, accessibility, scalability, performance, security, heterogeneity and so on.

Now, after this being said, the Wikipedia definition for a database seems to be relatively accurate, and that up to a point because relatively recently appeared the concept of in-memory database (IMDB) that primarily relies on the main memory for data storage. This doesn’t make the respective definition less valuable, though frankly I prefer my own definition, hopefully I haven’t left anything important out of it.

References:
[1] Wikipedia. 2009. Database. [Online] Available from: http://en.wikipedia.org/wiki/Database (Accessed: 5 December 2009)

15 September 2009

🛢DBMS: Performance (Definitions)

"The speed with which SQL Server processes queries and returns results." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"Accomplishment of project expectations; ability to attain goals by using resources efficiently and effectively." (Timothy J Kloppenborg et al, "Project Leadership", 2003)

"A measure of the degree to which a system or component accomplishes designated functions within given constraints, such as accuracy, time, or resource usage." (Richard D Stutzke, "Estimating Software-Intensive Systems: Projects, Products, and Processes", 2005)

"Performance is a measure of how fast a database services applications, and ultimately end-users." (Gavin Powell, "Beginning Database Design", 2006)

"quantitative measures of the execution of some element or set of elements, such as average execution time or memory usage." (Bruce P Douglass, "Real-Time Agility", 2009)

"Key characteristic of a computer system upon which it is compared to other systems." (Max Domeika, "Software Development for Embedded Multi-core Systems", 2011)

"The quality of outputs, outcomes, benefits, and results achieved." (Paul C Dinsmore et al, "Enterprise Project Governance", 2012)

"The manner in which organizations achieve results (i.e., the way they behave and operate to effect actions), as well as the outputs and outcomes of these actions (i.e., the results they achieve)." (Sally-Anne Pitt, "Internal Audit Quality", 2014)

"1)Output/results (or measures of same) The relationship between targeted output and achieved output Output or results can, and must, be measured. 2)Work Performance. How well a person has done in producing the results (output) taking into consideration all relevant circumstances. How well has the person carried out the work of the role? Work performance cannot be measured; it must be judged by a manager based on how well the person has worked to achieve the assigned output, or result, in the situation in which the work was performed." (Catherine Burke et al, "Systems Leadership, 2nd Ed,", 2018)

"a measure of what is achieved or delivered" (ITIL)

"The degree to which a system or component accomplishes its designated functions within given constraints regarding processing time and throughput rate. (IEEE 610)

29 August 2009

🛢DBMS: Extensible Markup Language (Definitions)

"A standard for a markup language, similar to HTML, that allows tags to be defined to describe any kind of data you have, making it very popular as a format for data feeds." (Mike Moran & Bill Hunt , "Search Engine Marketing, Inc", 2005)

"Facilitates the assignment of meaningful structures and definitions of data and services for use by multiple systems. XML simplifies the ability to transmit and share data." (Jill Dyché & Evan Levy, "Customer Data Integration: Reaching a Single Version of the Truth", 2006)

"Simple and flexible text format used to represent data. XML was designed by the World Wide Web Consortium (W3C)." (Sara Morganand & Tobias Thernstrom , "MCITP Self-Paced Training Kit : Designing and Optimizing Data Access by Using Microsoft SQL Server 2005 - Exam 70-442", 2007)

"separates content from format, thus letting the browser decide how and where content gets displayed. XML is not a language, but a system for defining other languages so that they understand their vocabulary." (Craig F Smith & H Peter Alesso, "Thinking on the Web: Berners-Lee, Gödel and Turing", 2008)

"A platform-independent markup language for specifying the structure of data in a text document used for both data storage and the transfer of data." (Jan L Harrington, "Relational Database Design and Implementation" 3rd Ed., 2009)

"A way of representing data and data relationships in text files, typically for data exchange between software of different types." (Jan L Harrington, "SQL Clearly Explained" 3rd Ed. , 2010)

"A metalanguage used to represent and manipulate data elements. Unlike other markup languages, XML permits the manipulation of a document’s data elements. XML is designed to facilitate the exchange of structured documents such as orders and invoices over the Internet." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"A specification for creating text files that contain hierarchical data." (Rod Stephens, "Start Here!™ Fundamentals of Microsoft® .NET Programming", 2011)

"Has been created to overcome some difficulties proper to HTML (Hypertext Markup Language) that – developed as a means for instructing the Web browsers how to display a given Web page – is a ‘presentation-oriented’ markup tool. XML is called ‘extensible’ because, at the difference of HTML, is not characterized by a fixed format, but it lets the user design its own customized markup languages (using, e.g., a specific DTD, Document Type Description) for limitless different types of documents; XML is then a ‘content-oriented’ markup tool." (Gian P Zarri, "RDF and OWL for Knowledge Management", 2011)

"A set of rules for encoding documents electronically. XML was chosen as the standard message format because of its widespread use and open source development efforts." (Mike Harwood, "Internet Security: How to Defend Against Attackers on the Web" 2nd Ed., 2015)

"A standard metalanguage for defining markup languages that is based on Standard Generalized Markup Language (SGML)." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"Extensible markup language (XML) is a simple, very flexible text format derived from SGML (standard generalized markup language). While XML was originally designed to meet the challenges of large-scale electronic publishing, it plays an increasingly significant role in the exchange of a wide variety of data on the web." (Kamalendu Pal, "Integrating Heterogeneous Enterprise Data Using Ontology in Supply Chain Management", 2019)

"A universal markup language for text and data, using nested tags to add structure and meta-information to the content." (Daniel Leuck et al, "Learning Java" 5th Ed., 2020)

"A 'best practices' subset of SGML that has been designed by the Worldwide Web Consortium (W3C) for use on the Internet." (Microfocus)

"A notation in which you describe the structure of information in a text document by enclosing information in user-defined tags that define the syntactic elements. A flexible way to create common information formats and share both the format and the data on the World Wide Web, intranets, and elsewhere. J2EE deployment descriptors are expressed in XML." (Microfocus)

22 August 2009

🛢DBMS: Concurrency Control (Definitions)

"The control of concurrent access. SQL Server uses locking to allow multiple users to access and change shared data at the same time without conflicting with one another." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Concurrency control is the control on the database and transactions which are executed concurrently to ensure that each transaction completed healthy." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"Concurrency control ensures that database operations can be executed concurrently without compromising correctness. Pessimistic concurrency control, such as used in systems with locks, will block any potentially conflicting operations even if they may not turn out to actually conflict. Optimistic concurrency control, the approach used by WiredTiger, will delay checking until after a conflict may have occurred, aborting and retrying one of the operations involved in any write conflict that arises." (MongoDb, "Glossary", 2008)

"Mechanisms to ensure that a database remains consistent and accurate during concurrent use." (Jan L Harrington, "Relational Database Design and Implementation" 3rd Ed., 2009)

"A DBMS feature that is used to coordinate the simultaneous execution of transactions in a multiprocessing database system while preserving data integrity." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"The control of process contention for resources within multi-process systems." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

20 August 2009

🛢DBMS: Database-as-a-Service (Definitions)

"Database-as-a-service (DBaaS) is a cloud computing service model that provides users with some form of access to a database without the need for setting up physical hardware, installing software or configuring for performance." (Kashif Munir, "Security model for cloud database as a service", 2015)

"A database hosted in the cloud on a pay per use basis, for example Amazon Web Services." (Analytics Insight)

"An on-demand, secure, scalable, and self-service database that automates the database provisioning and administration to support new and existing business applications and operational systems." (Forrester)

"Database as a service (DBaaS) is a cloud computing managed service offering that provides access to a database without requiring the setup of physical hardware, the installation of software or the need to configure the database. Most database administration and maintenance tasks are handled by the service provider, enabling users to quickly benefit from the database service." (TechTarget) [source]

"Database as a service (DBaaS) is a cloud database offering that provides customers with access to a database without having to deploy and manage the underlying infrastructure. DBaaS is delivered as a managed database service, which means that the provider takes care of patching, upgrading, and backing up the database." (Oracle) [source]

"DBaaS (also known as managed database service) is a cloud computing service that lets users access and use a cloud database system without purchasing and setting up their own hardware, installing their own database software, or managing the database themselves (not to mention hiring the high-priced talent required to do so). The cloud provider takes care of everything from periodic upgrades to backups to ensuring that the database system remains available and secure 24/7." (IBM) [source]

🛢DBMS: CRUD (Definitions)

"Create, Read, Update, and Delete. Refers to the major functions that need to be implemented in a database solution." (Victor Isakov et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance (70-444) Study Guide", 2007)

"Acronym for the four main database operations: Create, Read, Update, Delete. These operations correspond to the SQL statements INSERT, SELECT, UPDATE, and DELETE." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"An acronym for the fundamental operations of a database: Create, Read, Update, and Delete. See MongoDB CRUD Operations." (MongoDb, "Glossary", 2008)

"A list of the only functions of data in persistent storage, in a convenient acronym form." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"This acronym is commonly used to describe basic database functions." (Martin Oberhofer et al, "The Art of Enterprise Information Architecture", 2010)

"A current-state analysis exercise that inventories the systems, applications, and business processes that can create, read, update, and delete critical enterprise data and attributes. It's used to identify the processes and data access points that are affecting data quality or allowing improper usage of the data without suitable validation and governance." (Forrester)

"CRUD is an acronym that comes from the world of computer programming and refers to the four functions that are considered necessary to implement a persistent storage application: create, read, update and delete." (Sumo Logic) [source]

"CRUD (acronym for Create, Read, Update, Delete) are considered to be basic functions regarding stored data. In computer programming, possible actions are often mapped to these standard CRUD functions in order to clarify the actions. For example, standard HTTP(S) actions GET and POST refer to Read and Create functions regarding stored data." (Innopay)

SQL Troubles

Pages