SQL Troubles: scalability

Showing posts with label scalability. Show all posts

07 December 2024

🏭 💠Data Warehousing: Microsoft Fabric (Part VI: SQL Databases for OLTP scenarios) [new feature]

Data Warehousing Series

One interesting announcements at Ignite is the availability in public preview of SQL databases in Microsoft Fabric, "a versatile and developer-friendly transactional database built on the foundation of Azure SQL database". With this Fabric can address besides OLAP also OLTP scenarios, evolving thus from analytics to a data platform [1]. According to the announcement, besides the AI-optimized architectural aspects, the feature makes the SQL Azure simple, autonomous and secure by design [1], and these latest aspects are considered in this post.

Simplicity revolves around the deployment and configuration of databases, the creation of a new database requiring giving a name and the database is created in seconds [1]. It’s a considerable improvement compared with the relatively complex setup needed for on-premise configurations, though sometimes more flexibility in configuration is needed upfront or over database’s lifetime. To get a database ready for testing one can import a sample database or get specific data via data flows and/or pipelines [1]. As development tools one can use Visual Studio Code or SSMS [1], and probably more tools will be available in time.

The integration with both GitHub and Azure DevOps allows to configure each database under source control, which is needed for many scenarios especially when multiple resources make changes to the database objects [1]. Frankly, that’s mainly important during the development phase, respectively in scenarios in which multiple people make in parallel changes to the logic. It will be interesting to see how much overhead or challenges the feature adds to development and how smoothly everything works together!

The most important aspect for many solutions is the replication of data in near-real time to the (open-source) delta parquet format in OneLake and thus making the data available for analytics almost immediately [1]. Probably, from this aspect many cloud-based applications can benefit, even if the performance might not be as good as in other well-established architectures. However, there are many other scenarios in which one needs to maintain and use data for OLTP/OLAP purposes. This invites adequate testing and a good weighting of the advantages and disadvantages involved.

A SQL database is a native item in Fabric, and therefore it utilizes Fabric capacity units like other Fabric workloads [1]. One can use the Fabric SKU estimator (still in private preview) to estimate the costs [2], though it will be interesting to see how cost-effective the solutions are. Probably, especially when the infrastructure is already available outside of Fabric, it will be easier and cost-effective to use the mirroring functionality. One should test and have a better estimator before moving blindly from the existing infrastructure to Fabric.

SQL databases in Fabric are autonomous by design, while allowing to get the best performance and availability by default [1]. High availability is reached through zone redundancy, while performance is achieved by scaling automatically the storage and compute to accommodate the workloads [1]. The auto-optimization capability is achieved with the help of the latest Intelligent Query Processing (IQP) enhancements, respectively the creation of missing indexes to improve query performance [1]. It will be interesting to see how the whole process works, given that the maintenance of indexes usually involves some challenges (e.g. identifying covering indexes, indexes needed only for temporary workloads, duplicated indexes).

SQL databases in Fabric are automatically configured for high availability with zone redundancy, while storage and compute scale automatically to accommodate the user workload [1]. The database is auto-optimized through the latest IQP enhancements while the system creates any missing indexes to improve query performance. All data is replicated to OneLake by default [1]. Finally, the database always receives the latest security updates with auto-patching, while automatic backups help in disaster recovery scenarios [1], which can be of real help for database administrators.

References:
[1] Microsoft Fabric Updates Blog (2024) Announcing SQL database in Microsoft Fabric Public Preview [link]
[2] Microsoft Fabric Updates Blog (2024) Announcing New Recruitment for the Private Preview of Microsoft Fabric SKU Estimator [link]

03 February 2021

📦Data Migrations (DM): Conceptualization (Part III: Heuristics)

Probably one of the most difficult things to learn as a technical person is using the right technology for a given purpose, this mainly because one’s inclined using the tools one knows best. Moreover, technologies’ overlapping makes the task more and more challenging, the difference between competing technologies often residing in the details. Thus, identifying the gaps resumes in understanding the details of the problem(s) or need(s), respectively the advantages or disadvantages of a technology over the other. This is true especially about competing technologies, including the ones that replace other technologies.

There are simple heuristics, that can allow approaching such challenges. For example, heavy data processing belongs usually in databases, while import/export functionality belongs in an ETL tool. Therefore, one can start looking at the problems from these two perspectives. Would the solution benefit from these two approaches or are there more appropriate technologies (e.g. data streaming, ELT, non-relational databases)? How much effort would involve building the solution?

Commercial Off-The-Shelf (COTS) tools provided by third-party vendors usually offer specialized functionality in each area. Gartner and Forrester provide regular analyses of the main players in the important areas, analyses which can be used in theory as basis for further research. Even if COTS tend to be more expensive and can have some important functionality gaps, as long they are extensible, they can prove a good starting point for developing a solution.

Sometimes it helps researching on the web what other people or organizations did, how they approached the same aspects, what technologies, techniques and best practices they used to overcome the challenges. One doesn’t need to reinvent the wheel even if it’s sometimes fun to do so. Moreover, a few hours of research can give one a basis of useful information and a better understanding over the work ahead.

On the other side sometimes it’s advisable to use the tools one knows best, however this can lead also to unusable and less performant solutions. For example, MS Excel and Access have been for years the tools of choice for building personal solutions that later grew into maintenance nightmares for the IT team. Ideally, they can still be used for data entry or data cleaning, though building solutions exclusively based on (one of) them can prove to be far than optimal.

When one doesn’t know whether a technology or mix of technologies can be used to provide a solution, it’s recommended to start a proof-of-concept (PoC) that would allow addressing most important aspects of the needed solution. One can start small by focusing on the minimal functionality needed to check the main aspects and evolve the PoC during several iterations as needed.

For example, in the case of a Data Migration (DM) this would involve building the data extraction layer for an entity, implement several data transformations based on the defined mappings, consider building a few integrity rules for validation, respectively attempt importing the data into the target system. Once this accomplished, one can start increasing the volume of data to check how the solution behaves under stress. The volume of data can be increased incrementally or by considering all the data available.

As soon the skeleton was built one can consider all the mappings, respectively add several entities to build the dependencies existing between them and other functionality. The prototype might not address all the requirements from the beginning, therefore consider the problems as they arise. For example, if the volume of data seems to cause problems then attempt splitting the data during processing in batches or considering specific optimization techniques like indexing or scaling techniques like increasing computing resources.

Previous Post <<||>> Next Post

01 February 2021

📦Data Migrations (DM): Quality Assurance (Part IV: Quality Acceptance Criteria IV)

Data Migrations Series

Reliability

Reliability is the degree to which a solution performs its intended functions under stated conditions without failure. In other words, a DM is reliable if it performs what was intended by design. The data should be migrated only when migration’s reliability was confirmed by the users as part of the sign-off process. The dry-runs as well the final iteration for the UAT have the objective of confirming solution’s reliability.

Reversibility

Reversibility is the degree to which a solution can return to a previous state without starting the process from the beginning. For example, it should be possible to reverse the changes made to a table by returning to the previous state. This can involve having a copy of the data stored respectively deleting and reloading the data when necessary.

Considering that the sequence in which the various activities is fix, in theory it’s possible to address reversibility by design, e.g. by allowing to repeat individual steps or by creating rollback points. Rollback points are especially important when loading the data into the target system.

Robustness

Robustness is the degree to which the solution can accommodate invalid input or environmental conditions that might affect data’s processing or other requirements (e,g. performance). If the logic can be stabilized over the various iterations, the variance in data quality can have an important impact on a solutions robustness. One can accommodate erroneous input by relaxing schema’s rules and adding further quality checks.

Security

Security is the degree to which the DM solution protects the data so that only authorized people have access to the respective data to the defined level of authorization as data are moved through the solution. The security provided by a solution needs to be considered against the standards and further requirements defined within the organization. In case no such standards are available, one can in theory consider the industry best practices.

Scalability

Scalability is the degree to which the solution is able to respond to an increased workload. Given that the number of data considered during the various iterations vary in volume, a solution’s scalability needs to be considered in respect to the volume of data to be migrated.

Standardization

Standardization is the degree to which technical standards were implemented for a solution to guarantee certain level of performance or other aspects considered as import. There can be standards for data storage, processing, access, transportation, or other aspects associated with the migration processes. Moreover, especially when multiple DMs are in scope, organizations can define a set of standards and guidelines that should be further considered.

Testability

Testability is the degree to which a solution can be tested in the respect to the set of functional and data-related requirements. Even if for the success of a migration are important the data in their final form, to achieve that is needed to validate the logic and test thoroughly the transformations performed on the data. As the data go trough the data pipelines, they need to be tested in the critical points – points where the data suffer important transformations. Moreover, one can consider record counters for the records processed in each such critical point, to assure that no record was lost in the process.

Traceability

Traceability is the degree to which the changes performed on the data can be traced from the target to the source systems as record, respectively at entity level. In theory, it’s enough to document the changes at attribute level, though upon case it might needed to document also the changes performed on individual values.

Mappings at attribute level allow tracing the data flow, while mappings at value level allow tracing the changes occurrent within values.

Previous Post <> Next Post

09 August 2009

🛢DBMS: NoSQL (Definitions)

"An umbrella term for non-relational data stores, hence the name. These stores sacrifice ACID transactions for greater scalability and availability." (Dean Wampler, "Functional Programming for Java Developers", 2011)

"A set of technologies that created a broad array of database management systems that are distinct from relational database systems. One major difference is that SQL is not used as the primary query language. These database management systems are also designed for distributed data stores." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A class of database management systems that consist of non-relational, distributed data stores. These systems are optimized for supporting the storage and retrieval requirements of massive-scale data-intensive applications." (IBM, "Informix Servers 12.1", 2014)

"A database that doesn’t adhere to relational database structures. Used to organize and query unstructured data." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"Any of a class of database management systems that reject the limitations and drawbacks dictated by, or associated with, the relational model. NoSQL products tend to specialize in a single or limited number of areas, such as high-performance processing, big data (giga-record systems), diverse data types (video, pictures, mathematical models), documents, and so on. Their specialized focus often requires deemphasizing other areas such as data consistency and backup and recovery." (George Tillmann, "Usage-Driven Database Design: From Logical Data Modeling through Physical Schmea Definition", 2017)

"In general, NoSQL databases provide a mechanism for storage and retrieval of data modeled in means other than the tabular relations used in relational databases." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"NoSQL means 'not only SQL' or 'no SQL at all'. Being a new type of non-relational databases, NoSQL databases are developed for efficient and scalable management of big data." (Zongmin Ma & Li Yan, "Towards Massive RDF Storage in NoSQL Databases: A Survey", 2019)

"A broad term for a set of data access technologies that do not use the SQL language as their primary mechanism for reading and writing data. Some NoSQL technologies act as key-value stores, only accepting single-value reads and writes; some relax the restrictions of the ACID methodology; still others do not require a pre-planned schema." (MySQL, "MySQL 8.0 Reference Manual Glossary")

"A NoSQL database is distinguished mainly by what it is not - it is not a structured relational database format that links multiple separate tables. NoSQL stands for 'not only SQL', meaning that SQL, or structured query language is not needed to extract and organize information. NoSQL databases tend to be more diverse and flatter than relational databases (in a flat database, all data is contained in the same, large table)." (Statistics.com)

"NoSQL is a database management system built for the complexities of working with Big Data. Unlike SQL, NoSQL does not store data in a relational format." (Xplenty) [source]

"No-SQL (aka not only SQL) database systems are distributed, non-relational databases designed for large-scale data storage and for massively-parallel data processing across a large number of commodity servers." (IBM)

"NoSQL is short for 'not only SQL'. NoSQL databases include mechanisms for storage and retrieval of data based on means other than the tabular relations used in relational databases." (Idera) [source]

"sometimes referred to as ‘Not only SQL’ as it is a database that doesn’t adhere to traditional relational database structures. It is more consistent and can achieve higher availability and horizontal scaling." (Analytics Insight)

SQL Troubles

Pages