SQL Troubles: NoSQL


27 December 2020

🧊☯Data Warehousing: Data Vault 2.0 (The Good, the Bad and the Ugly)

Data Warehousing Series

One of the interesting concepts that seems to be gaining adherents in Data Warehousing is the Data Vault, a methodology, architecture and implementation for Data Warehouses (DWH) developed by Dan Linstedt between 1990 and 2000, which evolved into an open standard with version 2.0.

According to its creator, the Data Vault is a detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more business functional areas [2]. To hold data at the lowest grain of detail from the source system(s) and track the changes that occurred in the data, it splits the fact and dimension tables into hubs (business keys), links (the relationships between business keys), satellites (descriptions of the business keys) and reference tables (dropdown values) [3], while adopting a hybrid approach between 3rd normal form and star schemas. In addition, it provides a two- or three-layered data integration architecture and a series of standards, methods and best practices meant to facilitate its use.
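
As a rough illustration of the hub, link and satellite split described above, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names, the sample data, the `hash_key` helper and the use of MD5 over the business key are all assumptions made for this example and are not taken from the specification; the point is only to show that a change in the source becomes a new satellite row rather than an update.

```python
import hashlib
import sqlite3


def hash_key(*business_keys: str) -> str:
    """Hypothetical helper: derive a deterministic key from business keys."""
    joined = "|".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (        -- hub: one row per business key
    customer_hk   TEXT PRIMARY KEY,
    customer_no   TEXT NOT NULL UNIQUE,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_product (
    product_hk    TEXT PRIMARY KEY,
    product_no    TEXT NOT NULL UNIQUE,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE link_purchase (       -- link: relationship between two hubs
    purchase_hk   TEXT PRIMARY KEY,
    customer_hk   TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    product_hk    TEXT NOT NULL REFERENCES hub_product (product_hk),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE sat_customer (        -- satellite: descriptive attributes over time
    customer_hk   TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    load_date     TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_date)
);
""")

c_hk = hash_key("C-100")
p_hk = hash_key("P-7")
src = "erp"  # assumed source system name

conn.execute("INSERT INTO hub_customer VALUES (?, ?, ?, ?)",
             (c_hk, "C-100", "2020-01-01", src))
conn.execute("INSERT INTO hub_product VALUES (?, ?, ?, ?)",
             (p_hk, "P-7", "2020-01-01", src))
conn.execute("INSERT INTO link_purchase VALUES (?, ?, ?, ?, ?)",
             (hash_key("C-100", "P-7"), c_hk, p_hk, "2020-01-02", src))

# A changed attribute in the source is stored as an additional satellite row
conn.executemany("INSERT INTO sat_customer VALUES (?, ?, ?, ?, ?)", [
    (c_hk, "2020-01-01", "Ada", "Berlin", src),
    (c_hk, "2020-06-01", "Ada", "Munich", src),
])

history = conn.execute(
    "SELECT load_date, city FROM sat_customer "
    "WHERE customer_hk = ? ORDER BY load_date",
    (c_hk,),
).fetchall()
current_city = history[-1][1]  # latest satellite row holds the current value
```

Here `history` keeps both versions of the customer's city, while the hub row for the business key is written only once.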

It integrates several other methodologies that allow bridging the gap between the technical, logistic and execution parts of the DWH life-cycle – the PMI methodology is used for the various levels of planning and execution, while the Scrum methodology is used for coordinating the day-to-day project tasks. Six Sigma is used together with Total Quality Management for the design and continuous improvement of DWH and data-related processes. In addition, it follows the CMMI maturity model for providing a clear baseline for benchmarking an organization’s DWH capabilities in development, acquisition and service areas.

The Good: The decomposition of the source data models into hub, link and satellite tables provides traceability and auditability at raw data level, thus making it possible to address the compliance requirements of Sarbanes-Oxley, HIPAA and Basel II by design.

The considered standards, methods, principles and best practices are leveraged from Software Engineering [1], establishing common ground and a standardized approach to DWH design, implementation and testing. It also narrows down the learning and implementation paths, while allowing an incremental approach to the various phases.

Data Vault 2.0 offers support for real-time, near-real-time and unstructured data, while new technologies like MapReduce and NoSQL can be integrated within its architecture, though the same can be said about other approaches as long as there’s compatibility between the considered technologies. In fact, except for the decomposition of business entities, many of the notions used are common to DWH design.

The Bad: Further decomposing the fact and dimension tables can impact the performance of the queries run against the tables, as more joins are required to gather the data from the various tables. The further decomposition of tables can also lead to higher data storage needs, though this can be negligible compared with the volume of additional objects that need to be created in the DWH. For an ERP system with a few hundred meaningful tables the complexity can become overwhelming.
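
The join overhead mentioned above can be made concrete with a small, self-contained sketch, again using Python's sqlite3 module. The schemas, names and data below are invented for illustration and are deliberately simplified (the load-date and record-source columns are omitted on most tables); the sketch answers the same question, "which city and amount belong to each sale", once against a star-style model and once against a vault-style model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star-style version: one fact table and one dimension
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE fact_sales   (customer_id INTEGER, amount REAL);
INSERT INTO dim_customer VALUES (1, 'Ada', 'Munich');
INSERT INTO fact_sales   VALUES (1, 120.0);

-- Vault-style version of the same data
CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_no TEXT);
CREATE TABLE sat_customer (customer_hk TEXT, load_date TEXT, name TEXT, city TEXT);
CREATE TABLE hub_order    (order_hk TEXT PRIMARY KEY, order_no TEXT);
CREATE TABLE sat_order    (order_hk TEXT, load_date TEXT, amount REAL);
CREATE TABLE link_customer_order (customer_hk TEXT, order_hk TEXT);
INSERT INTO hub_customer VALUES ('c1', 'C-100');
INSERT INTO sat_customer VALUES ('c1', '2020-01-01', 'Ada', 'Berlin'),
                                ('c1', '2020-06-01', 'Ada', 'Munich');
INSERT INTO hub_order    VALUES ('o1', 'SO-1');
INSERT INTO sat_order    VALUES ('o1', '2020-06-02', 120.0);
INSERT INTO link_customer_order VALUES ('c1', 'o1');
""")

STAR_QUERY = """
SELECT d.city, f.amount
FROM fact_sales f
JOIN dim_customer d ON d.customer_id = f.customer_id
"""

# The correlated subquery picks the most recent satellite row per customer
VAULT_QUERY = """
SELECT sc.city, so.amount
FROM link_customer_order l
JOIN hub_customer hc ON hc.customer_hk = l.customer_hk
JOIN hub_order    ho ON ho.order_hk    = l.order_hk
JOIN sat_order    so ON so.order_hk    = ho.order_hk
JOIN sat_customer sc ON sc.customer_hk = hc.customer_hk
 AND sc.load_date = (SELECT MAX(load_date) FROM sat_customer
                     WHERE customer_hk = hc.customer_hk)
"""

star_rows = conn.execute(STAR_QUERY).fetchall()
vault_rows = conn.execute(VAULT_QUERY).fetchall()

# Crude complexity measure: number of JOIN clauses in each statement
star_joins = STAR_QUERY.count("JOIN ")
vault_joins = VAULT_QUERY.count("JOIN ")
```

Both queries return the same row, but the vault-style query needs four joins plus a subquery to select the current satellite row, against a single join in the star-style model. In this toy case the hub joins could be skipped because the satellites share the hub keys, though they are typically needed once the business key itself has to be returned.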

Unless one uses a COTS tool which automates some part of the design and creation process, building everything from scratch can be time-consuming, thus increasing the time-to-market for solutions. However, the COTS tools can introduce restrictions of their own, which can negatively impact the overall experience with the methodology.

The incorporation of non-technical methodologies can have a positive impact, though unless one has experience with the respective methodologies, the disadvantages can easily overshadow the (theoretical) advantages.

The Ugly: The dangers of using Data Vault are, as usual, tied to a poor understanding of the methodology, an inadequate skillset, or the attempt to implement the methodology without allowing some flexibility when required. Unless one knows what one is doing, bringing more complexity into a field which is already complex can easily have a negative impact on projects’ outcomes.


References:
[1] Dan Linstedt & Michael Olschimke (2015) Building a Scalable Data Warehouse with Data Vault 2.0
[2] Dan Linstedt (?) Data Vault Basics [source]
[3] Dan Linstedt (2018) Data Vault: Data Modeling Specification v 2.0.2 [source]

09 August 2009

🛢DBMS: NoSQL (Definitions)

"An umbrella term for non-relational data stores, hence the name. These stores sacrifice ACID transactions for greater scalability and availability." (Dean Wampler, "Functional Programming for Java Developers", 2011)

"A set of technologies that created a broad array of database management systems that are distinct from relational database systems. One major difference is that SQL is not used as the primary query language. These database management systems are also designed for distributed data stores." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A class of database management systems that consist of non-relational, distributed data stores. These systems are optimized for supporting the storage and retrieval requirements of massive-scale data-intensive applications." (IBM, "Informix Servers 12.1", 2014)

"A database that doesn’t adhere to relational database structures. Used to organize and query unstructured data." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"Any of a class of database management systems that reject the limitations and drawbacks dictated by, or associated with, the relational model. NoSQL products tend to specialize in a single or limited number of areas, such as high-performance processing, big data (giga-record systems), diverse data types (video, pictures, mathematical models), documents, and so on. Their specialized focus often requires deemphasizing other areas such as data consistency and backup and recovery." (George Tillmann, "Usage-Driven Database Design: From Logical Data Modeling through Physical Schema Definition", 2017)

"In general, NoSQL databases provide a mechanism for storage and retrieval of data modeled in means other than the tabular relations used in relational databases." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"NoSQL means 'not only SQL' or 'no SQL at all'. Being a new type of non-relational databases, NoSQL databases are developed for efficient and scalable management of big data." (Zongmin Ma & Li Yan, "Towards Massive RDF Storage in NoSQL Databases: A Survey", 2019)

"A broad term for a set of data access technologies that do not use the SQL language as their primary mechanism for reading and writing data. Some NoSQL technologies act as key-value stores, only accepting single-value reads and writes; some relax the restrictions of the ACID methodology; still others do not require a pre-planned schema." (MySQL, "MySQL 8.0 Reference Manual Glossary")

"A NoSQL database is distinguished mainly by what it is not - it is not a structured relational database format that links multiple separate tables. NoSQL stands for 'not only SQL', meaning that SQL, or structured query language is not needed to extract and organize information. NoSQL databases tend to be more diverse and flatter than relational databases (in a flat database, all data is contained in the same, large table)." (Statistics.com)

"NoSQL is a database management system built for the complexities of working with Big Data. Unlike SQL, NoSQL does not store data in a relational format." (Xplenty) [source]

"No-SQL (aka not only SQL) database systems are distributed, non-relational databases designed for large-scale data storage and for massively-parallel data processing across a large number of commodity servers." (IBM)

"NoSQL is short for 'not only SQL'. NoSQL databases include mechanisms for storage and retrieval of data based on means other than the tabular relations used in relational databases." (Idera) [source]

"sometimes referred to as ‘Not only SQL’ as it is a database that doesn’t adhere to traditional relational database structures. It is more consistent and can achieve higher availability and horizontal scaling." (Analytics Insight)

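Two traits that recur in the definitions above, key-value access and the absence of a pre-planned schema, can be illustrated with a toy in-memory store. This is only a sketch under assumptions: the `ToyKeyValueStore` class, its `put`/`get` methods and the sample records are hypothetical and do not correspond to the API of any actual NoSQL product.

```python
import json
from typing import Any, Dict, Optional


class ToyKeyValueStore:
    """In-memory stand-in for a key-value store: one key, one opaque value."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: Any) -> None:
        # The value is serialized whole; the store knows nothing about its fields
        self._data[key] = json.dumps(value)

    def get(self, key: str) -> Optional[Any]:
        # Reads are by key only; there is no query language and no joins
        raw = self._data.get(key)
        return None if raw is None else json.loads(raw)


store = ToyKeyValueStore()

# No schema is declared up front, so records may differ in shape
store.put("customer:1", {"name": "Ada", "city": "Munich"})
store.put("customer:2", {"name": "Bob", "tags": ["vip"], "age": 41})

first = store.get("customer:1")
second = store.get("customer:2")
missing = store.get("customer:3")
```

The two stored records have different fields, which a relational table would only accept after a schema change or with nullable columns; the trade-off is that any filtering by field has to happen in the application.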