
03 February 2021

Data Migrations (DM): Conceptualization III (Heuristics)

Data Migration

Probably one of the most difficult things to learn as a technical person is using the right technology for a given purpose, mainly because one is inclined to use the tools one knows best. Moreover, the overlap between technologies makes the task increasingly challenging, as the difference between competing technologies often resides in the details. Thus, identifying the gaps comes down to understanding the details of the problem(s) or need(s), respectively the advantages and disadvantages of one technology over another. This is especially true for competing technologies, including the ones that replace other technologies.

There are simple heuristics that can help in approaching such challenges. For example, heavy data processing usually belongs in databases, while import/export functionality belongs in an ETL tool. Therefore, one can start looking at the problems from these two perspectives. Would the solution benefit from these two approaches, or are there more appropriate technologies (e.g. data streaming, ELT, non-relational databases)? How much effort would building the solution involve?

Commercial Off-The-Shelf (COTS) tools provided by third-party vendors usually offer specialized functionality in each area. Gartner and Forrester provide regular analyses of the main players in the important areas, analyses which can in theory be used as a basis for further research. Even if COTS tools tend to be more expensive and can have some important functionality gaps, as long as they are extensible they can prove a good starting point for developing a solution.

Sometimes it helps to research on the web what other people or organizations did, how they approached the same aspects, and what technologies, techniques and best practices they used to overcome the challenges. One doesn't need to reinvent the wheel, even if it's sometimes fun to do so. Moreover, a few hours of research can give one a basis of useful information and a better understanding of the work ahead.

On the other hand, sometimes it's advisable to use the tools one knows best, though this can also lead to unusable and less performant solutions. For example, MS Excel and Access have been for years the tools of choice for building personal solutions that later grew into maintenance nightmares for the IT team. Ideally, they can still be used for data entry or data cleaning, though building solutions exclusively based on (one of) them can prove far from optimal.

When one doesn’t know whether a technology or mix of technologies can be used to provide a solution, it’s recommended to start a proof-of-concept (PoC) that would allow addressing most important aspects of the needed solution. One can start small by focusing on the minimal functionality needed to check the main aspects and evolve the PoC during several iterations as needed.

For example, in the case of a Data Migration (DM) this would involve building the data extraction layer for an entity, implementing several data transformations based on the defined mappings, building a few integrity rules for validation, respectively attempting to import the data into the target system (see the sketch below). Once this is accomplished, one can start increasing the volume of data to check how the solution behaves under stress. The volume of data can be increased incrementally or by considering all the data available.
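A minimal sketch of such a PoC in Python, assuming a single hypothetical entity (customers) with hand-coded mappings, a couple of validation rules and a stubbed load step; the attribute names and rules are illustrative, not a prescription:

# Minimal DM proof-of-concept sketch: extract -> transform -> validate -> load.

def extract():
    # In a real PoC this would read from the legacy system; here a stub list.
    return [
        {"CUST_NO": "1001", "NAME": " Alpha GmbH ", "COUNTRY": "de"},
        {"CUST_NO": "1002", "NAME": "Beta AG", "COUNTRY": ""},
    ]

def transform(record):
    # Apply the defined mappings: rename attributes, trim and standardize values.
    return {
        "CustomerId": record["CUST_NO"],
        "CustomerName": record["NAME"].strip(),
        "CountryCode": record["COUNTRY"].upper() or None,
    }

def validate(record):
    # A few integrity rules; return the list of violations for the record.
    errors = []
    if not record["CustomerId"]:
        errors.append("missing CustomerId")
    if record["CountryCode"] is None:
        errors.append("missing CountryCode")
    return errors

def load(records):
    # Stub for the import into the target system (API call, staging table, etc.).
    for r in records:
        print("loading", r)

valid, rejected = [], []
for rec in map(transform, extract()):
    issues = validate(rec)
    (valid if not issues else rejected).append((rec, issues))

load([r for r, _ in valid])
print(f"{len(valid)} loaded, {len(rejected)} rejected")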

As soon as the skeleton is built, one can consider all the mappings, respectively add several entities to build the dependencies existing between them and other functionality. The prototype might not address all the requirements from the beginning, therefore one can consider the problems as they arise. For example, if the volume of data causes problems, then one can attempt splitting the data into batches during processing (see the sketch below), or consider specific optimization techniques like indexing, or scaling techniques like increasing the computing resources.
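A sketch of splitting the processing into batches, assuming the records can be processed independently; the batch size and the process_batch stub are illustrative:

# Process a large dataset in fixed-size batches to keep resource usage bounded.

def batches(records, size):
    # Yield consecutive slices of at most `size` records.
    for start in range(0, len(records), size):
        yield records[start:start + size]

def process_batch(batch):
    # Placeholder for the transform/validate/load steps applied to one batch.
    print(f"processing {len(batch)} records")

records = [{"id": i} for i in range(10_000)]  # stand-in for the full dataset
for batch in batches(records, size=1_000):
    process_batch(batch)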


01 February 2021

Data Migrations (DM): Quality Acceptance Criteria IV

Data Migration
Data Migrations Series

Reliability

Reliability is the degree to which a solution performs its intended functions under stated conditions without failure. In other words, a DM is reliable if it performs what was intended by design. The data should be migrated only when the migration's reliability was confirmed by the users as part of the sign-off process. The dry runs as well as the final iteration for the UAT have the objective of confirming the solution's reliability.

Reversibility

Reversibility is the degree to which a solution can return to a previous state without starting the process from the beginning. For example, it should be possible to reverse the changes made to a table by returning to the previous state. This can involve keeping a copy of the data, respectively deleting and reloading the data when necessary.

Considering that the sequence in which the various activities are performed is fixed, in theory it's possible to address reversibility by design, e.g. by allowing individual steps to be repeated or by creating rollback points, as sketched below. Rollback points are especially important when loading the data into the target system.
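A minimal sketch of a rollback point around a table load, using SQLite purely for illustration; the table, the backup naming and the simulated failure are assumptions:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (Id TEXT, Name TEXT)")
conn.execute("INSERT INTO Customers VALUES ('1001', 'Alpha GmbH')")
conn.commit()

# Rollback point: keep a copy of the table before loading new data.
conn.execute("CREATE TABLE Customers_bak AS SELECT * FROM Customers")

try:
    conn.execute("INSERT INTO Customers VALUES ('1002', 'Beta AG')")
    raise RuntimeError("simulated load failure")  # force the reversal
except RuntimeError:
    # Return to the previous state: delete and reload from the copy.
    conn.execute("DELETE FROM Customers")
    conn.execute("INSERT INTO Customers SELECT * FROM Customers_bak")
    conn.commit()

print(conn.execute("SELECT COUNT(*) FROM Customers").fetchone()[0])  # -> 1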

Robustness

Robustness is the degree to which the solution can accommodate invalid input or environmental conditions that might affect data's processing or other requirements (e.g. performance). Even if the logic can be stabilized over the various iterations, the variance in data quality can have an important impact on a solution's robustness. One can accommodate erroneous input by relaxing the schema's rules and adding further quality checks, as sketched below.
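One way to sketch this, under the assumption that invalid records should be quarantined rather than abort the run, is to accept everything into a relaxed staging shape and apply the quality checks afterwards; the attribute names and rules are illustrative:

# Accept every record into a relaxed staging shape, then quarantine the
# records that fail the quality checks instead of aborting the whole run.

def quality_checks(record):
    issues = []
    if not record.get("CustomerId"):
        issues.append("missing CustomerId")
    if record.get("CountryCode") not in {"DE", "US", "FR", None}:
        issues.append(f"unknown CountryCode {record.get('CountryCode')!r}")
    return issues

staging = [
    {"CustomerId": "1001", "CountryCode": "DE"},
    {"CustomerId": "", "CountryCode": "XX"},  # invalid input survives staging
]

clean = [r for r in staging if not quality_checks(r)]
quarantined = [(r, quality_checks(r)) for r in staging if quality_checks(r)]
print(len(clean), "clean,", len(quarantined), "quarantined")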

Security 

Security is the degree to which the DM solution protects the data so that only authorized people have access to the respective data, to the defined level of authorization, as the data are moved through the solution. The security provided by a solution needs to be considered against the standards and further requirements defined within the organization. In case no such standards are available, one can in theory consider the industry best practices.

Scalability

Scalability is the degree to which the solution is able to respond to an increased workload. Given that the datasets considered during the various iterations vary in volume, a solution's scalability needs to be considered with respect to the volume of data to be migrated.

Standardization

Standardization is the degree to which technical standards were implemented for a solution to guarantee a certain level of performance or other aspects considered as important. There can be standards for data storage, processing, access, transportation, or other aspects associated with the migration processes. Moreover, especially when multiple DMs are in scope, organizations can define a set of standards and guidelines that should be further considered.

Testability

Testability is the degree to which a solution can be tested with respect to the set of functional and data-related requirements. Even if what matters for the success of a migration is the data in their final form, achieving that requires validating the logic and testing thoroughly the transformations performed on the data. As the data go through the data pipelines, they need to be tested at the critical points – points where the data suffer important transformations. Moreover, one can consider record counters for the records processed at each such critical point, to assure that no record was lost in the process (see the sketch below).
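A sketch of such record counters, comparing the counts between consecutive critical points; the stage names and the filter that drops a record are illustrative:

# Record counters at the critical points of the pipeline: any drop between
# consecutive stages must be explainable (e.g. by the rejected records).

counts = {}

def count_stage(name, records):
    counts[name] = len(records)
    return records

extracted = count_stage("extracted", [{"id": i} for i in range(100)])
transformed = count_stage("transformed", [r for r in extracted])
loaded = count_stage("loaded", [r for r in transformed if r["id"] != 42])

for (a, m), (b, n) in zip(counts.items(), list(counts.items())[1:]):
    assert n <= m, f"{b} has more records than {a}"
    if n < m:
        print(f"{m - n} record(s) lost between {a} and {b} - investigate")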

Traceability

Traceability is the degree to which the changes performed on the data can be traced from the target to the source systems at record, respectively at entity level. In theory, it's enough to document the changes at attribute level, though in some cases it might be needed to document also the changes performed on individual values.

Mappings at attribute level allow tracing the data flow, while mappings at value level allow tracing the changes occurring within values (see the sketch below).
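A sketch of documenting both levels as simple lookup tables; the entity, attribute and value names are illustrative:

# Attribute-level mappings document the data flow between source and target;
# value-level mappings document the changes performed on individual values.

attribute_mappings = {
    # target attribute -> (source entity, source attribute)
    "CustomerName": ("LegacyCustomer", "NAME1"),
    "CountryCode":  ("LegacyCustomer", "LAND1"),
}

value_mappings = {
    # (target attribute, source value) -> target value
    ("CountryCode", "D"):  "DE",
    ("CountryCode", "GB"): "UK",
}

def trace(target_attr, source_value):
    entity, attr = attribute_mappings[target_attr]
    new_value = value_mappings.get((target_attr, source_value), source_value)
    return f"{entity}.{attr} = {source_value!r} -> {target_attr} = {new_value!r}"

print(trace("CountryCode", "D"))  # LegacyCustomer.LAND1 = 'D' -> CountryCode = 'DE'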

09 August 2009

DBMS: NoSQL (Definitions)

"An umbrella term for non-relational data stores, hence the name. These stores sacrifice ACID transactions for greater scalability and availability." (Dean Wampler, "Functional Programming for Java Developers", 2011)

"A set of technologies that created a broad array of database management systems that are distinct from relational database systems. One major difference is that SQL is not used as the primary query language. These database management systems are also designed for distributed data stores." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A class of database management systems that consist of non-relational, distributed data stores. These systems are optimized for supporting the storage and retrieval requirements of massive-scale data-intensive applications." (IBM, "Informix Servers 12.1", 2014)

"A database that doesn’t adhere to relational database structures. Used to organize and query unstructured data." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"Any of a class of database management systems that reject the limitations and drawbacks dictated by, or associated with, the relational model. NoSQL products tend to specialize in a single or limited number of areas, such as high-performance processing, big data (giga-record systems), diverse data types (video, pictures, mathematical models), documents, and so on. Their specialized focus often requires deemphasizing other areas such as data consistency and backup and recovery." (George Tillmann, "Usage-Driven Database Design: From Logical Data Modeling through Physical Schmea Definition", 2017)

"In general, NoSQL databases provide a mechanism for storage and retrieval of data modeled in means other than the tabular relations used in relational databases." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"NoSQL means 'not only SQL' or 'no SQL at all'. Being a new type of non-relational databases, NoSQL databases are developed for efficient and scalable management of big data." (Zongmin Ma & Li Yan, "Towards Massive RDF Storage in NoSQL Databases: A Survey", 2019)

"A broad term for a set of data access technologies that do not use the SQL language as their primary mechanism for reading and writing data. Some NoSQL technologies act as key-value stores, only accepting single-value reads and writes; some relax the restrictions of the ACID methodology; still others do not require a pre-planned schema." (MySQL, "MySQL 8.0 Reference Manual Glossary")

"A NoSQL database is distinguished mainly by what it is not - it is not a structured relational database format that links multiple separate tables. NoSQL stands for 'not only SQL', meaning that SQL, or structured query language is not needed to extract and organize information. NoSQL databases tend to be more diverse and flatter than relational databases (in a flat database, all data is contained in the same, large table)." (Statistics.com)

"NoSQL is a database management system built for the complexities of working with Big Data. Unlike SQL, NoSQL does not store data in a relational format." (Xplenty) [source]

"No-SQL (aka not only SQL) database systems are distributed, non-relational databases designed for large-scale data storage and for massively-parallel data processing across a large number of commodity servers." (IBM) 

"NoSQL is short for 'not only SQL'. NoSQL databases include mechanisms for storage and retrieval of data based on means other than the tabular relations used in relational databases." (Idera) [source]

"sometimes referred to as ‘Not only SQL’ as it is a database that doesn’t adhere to traditional relational database structures. It is more consistent and can achieve higher availability and horizontal scaling." (Analytics Insight)

