
31 October 2020

Data Warehousing: Data Lakes & other Puddles

Data Warehousing

One can consider a data lake as a repository of all of an organization’s data in raw form. However, this constraint might be too strict, as data at different levels of processing can be imported as well; for example, the results of data mining or other Data Science techniques/methods can be treated as raw data for further processing.

In the initial definition provided by James Dixon, the difference between a data lake and a data mart/warehouse was expressed metaphorically as the transition from bottled water to a lake fed (artificially) from various sources. The metaphor thus contrasts the purpose-driven, limited and single-purposed role of the data mart/warehouse with the natural flow of data that can be tapped and harnessed as desired. These are, though, metaphors intended to appeal to the buyer. Personally, I like to think of the data lake as an extension of the data infrastructure, of which the data mart or warehouse is an integral part. Imposing further constraints seems to have no benefit.

Probably the most important characteristic of a data lake is that it makes an organization’s data discoverable and consumable, though from there to insight and other benefits is a long road that requires specific knowledge about the techniques used, as well as about the organization’s processes and data. Without this knowledge, data lake-based solutions can lead to erroneous results, just as mixing ingredients without knowing how to use them leads to cooking experiments far removed from the art of cooking.

A characteristic of data is that they change continuously and differ in timeliness and in degree of quality with respect to the data quality dimensions implied and the sources considered. Data need to reflect reality at the level of detail and quality required by the processing application(s); this applies to data warehouses/marts as well as to data lake-based solutions.

Data in raw form don’t necessarily represent the truth, nor do they acquire good quality no matter how much they are processed. Solutions need to be resilient with respect to the data they handle through their layers, independently of data quality and transmission problems. Whether one talks about ETL, data migration or other types of data processing, keeping data integrity across the various levels and layers is perhaps the most important demand placed upon solutions.

Snapshots, as moment-in-time recordings of tables, entities, sets of entities, datasets or whole databases, often prove to be the best mechanism for keeping data integrity when this aspect is essential to the processing (e.g. data migrations, high-accuracy measurements). Unfortunately, the more systems are involved in the process and the broader the span of the solution over the sources, the more difficult it becomes to take such snapshots.
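To make the idea more concrete, here is a minimal T-SQL sketch of taking a moment-in-time copy of a table before a data migration; the table and column names are hypothetical and serve only for illustration:

```sql
-- Minimal sketch (hypothetical names): take a moment-in-time copy of a source table
-- before migration, recording when the snapshot was taken.
SELECT *,
       SYSUTCDATETIME() AS SnapshotTakenAt
INTO   dbo.Customers_Snapshot_20201031
FROM   dbo.Customers;

-- Later the snapshot can be compared against the migrated data, e.g. by row counts.
SELECT (SELECT COUNT(*) FROM dbo.Customers_Snapshot_20201031) AS SourceRows,
       (SELECT COUNT(*) FROM dbo.Customers_Migrated)          AS TargetRows;
```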

A SQL query’s output represents a snapshot of the data; therefore SQL-based solutions are usually appropriate for most business scenarios in which the characteristics of the data (typically volume, velocity and/or variety) keep their processing manageable. However, when the data are extracted by other means, integrity is harder to obtain, especially when there’s no timestamp that would allow partitioning the data on a time scale, and handling data integrity then becomes, in the extreme, a programmer’s task. In addition, taking snapshots of the data while they are being changed can be a costly and futile exercise.
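As a rough illustration of such timestamp-based partitioning, the following T-SQL sketch extracts only the rows changed between two points in time, assuming the source table exposes a ModifiedDate column; all names are hypothetical:

```sql
-- Minimal sketch (hypothetical names): delta extraction bounded by two timestamps,
-- so that repeated runs of the same extract remain reproducible.
DECLARE @LastSnapshot    datetime2 = '2020-10-30T00:00:00';
DECLARE @CurrentSnapshot datetime2 = SYSUTCDATETIME();

SELECT OrderID, CustomerID, Amount, ModifiedDate
FROM   dbo.Orders
WHERE  ModifiedDate >= @LastSnapshot
  AND  ModifiedDate <  @CurrentSnapshot;  -- upper bound excludes rows still changing
```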

Furthermore, maintaining data integrity can prove to be a matter of design, not only with respect to the processing of data, but also with respect to the source applications and the business processes they implement. Mastery of the underlying principles, techniques, patterns and methodologies helps in designing the right solutions.

Note:
Written as an answer to a Medium post on data lakes and batch processing in data warehouses.

10 June 2015

Business Intelligence: Report Snapshot (Definitions)

"A SQL Server Reporting Services report that contains data that was queried at a particular point in time and has been stored on the Report Server." (Victor Isakov et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance (70-444) Study Guide", 2007)

"A report that contains data captured at a specific point in time. Since report snapshots hold datasets instead of queries, report snapshots can be used to limit processing costs by running the snapshot during off-peak times." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"A report that contains data captured at a specific point in time. A report snapshot is stored in an intermediate format containing retrieved data rather than a query and rendering definitions." (Jim Joseph et al, "Microsoft® SQL Server™ 2008 Reporting Services Unleashed", 2009)

"A static report that contains data captured at a specific point in time." (Microsoft, "SQL Server 2012 Glossary", 2012)

16 April 2009

DBMS: Snapshot Replication (Definitions)

"A type of replication that takes a snapshot of current data in a publication at a Publisher and replaces the entire replica at a Subscriber on a periodic basis, in contrast to publishing changes when they occur." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A type of replication that distributes data exactly as it appears at a specific moment in time and does not monitor for modifications made to the data." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"A type of replication wherein data and database objects are distributed by copying published items via the Distributor and on to the Subscriber exactly as they appear at a specific moment in time. Snapshot replication provides the distribution of both data and structure (tables, indexes, and so on) on a scheduled basis. It may be thought of as a 'whole table refresh'. No updates to the source table are replicated until the next scheduled snapshot." (Thomas Moore, "EXAM CRAM™ 2: Designing and Implementing Databases with SQL Server 2000 Enterprise Edition", 2005)

"Replication type that relies on a snapshot of the entire article (table) to be automatically sent from a published database to the subscriber database(s). Distributes data exactly as it appears at a given time." (Marilyn Miller-White et al, "MCITP Administrator: Microsoft® SQL Server™ 2005 Optimization and Maintenance 70-444", 2007)

"Replication type that relies on a snapshot of the entire article (table) to be automatically sent from a published database to the subscriber database(s). Distributes data exactly as it appears at a given time." (Victor Isakov et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance (70-444) Study Guide", 2007)

"Replication of data taken at a moment of time. With snapshot replication, the entire data set is replicated at the same time." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"A replication in which data is distributed exactly as it appears at a specific moment in time and does not monitor for updates to the data." (Microsoft, 2012)

"Snapshot replication distributes data exactly as it appears at a specific moment in time and does not monitor for updates to the data." (Microsoft Technet)

28 February 2009

DBMS: Snapshot (Definitions)

"A snapshot is a view of information at a particular point in time." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"A database dump or the archiving of data out of a database as of some moment in time." (William H Inmon, "Building the Data Warehouse", 2005)

"A database snapshot is a moment-in-time recording of a database and keeps track of every change made to a database from the moment the snapshot was taken. This method often prevents user mistakes as an add-on to fault-tolerance." (Joseph L Jorden & Dandy Weyn, "MCTS Microsoft SQL Server 2005: Implementation and Maintenance Study Guide - Exam 70-431", 2006)

"A feature of SQL Server 2005 that enables you to indefinitely store the state of the database at a particular point in time." (Marilyn Miller-White et al, "MCITP Administrator: Microsoft® SQL Server™ 2005 Optimization and Maintenance 70-444", 2007)

"A new feature of SQL Server 2005 where a snapshot of a database can be created at any given time to preserve the state of the database. The snapshot can be queried if desired and/or the entire database can be restored from the snapshot." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"The state of an object, a system, or a collection of attributes regarding a state at a particular point in time." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A read-only, static view of a database at the moment of snapshot creation." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A copy of a data structure at some point in time" (Nell Dale et al, "Object-Oriented Data Structures Using Java" 4th Ed., 2016)

"A record of the current state of the database environment." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

16 February 2009

DBMS: Data Synchronization (Definitions)

"In replication, the process that ensures the publication and destination tables contain the same schema and data. This process must occur before a subscription server can receive replicated transactions from an article or a publication." (Patrick Dalton, "Microsoft SQL Server Black Book", 1997)

"Refers to the process in which the article or articles subscribed to on a subscription server are initially synchronized with the original article or articles on the publication server." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

[automatic synchronization:] "Synchronization that is accomplished automatically by SQL Server when a server initially subscribes to a publication. A snapshot of the table data and schema are written to files for transfer to the Subscriber. The table schema and data are transferred by the distribution agent. No operator intervention is required." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"The process of maintaining the same schema and data in a publication at a Publisher and in the replica of a publication at a Subscriber. See also initial snapshot." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"The process of ensuring that the publication and destination tables contain the same schema and data. This process must occur before a new Subscriber can receive replicated transactions from a publication. It is also called initial synchronization." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"Synchronization is the process in replication of maintaining the same schema and data at a Publisher and at a Subscriber." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"Integrating, matching, or linking data from disparate sources." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

"The continuous harmonization of data attribute values between two or more different systems, with the end result being the data attribute values are the same in all of the systems." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[initial synchronization:] "The first synchronization for a subscription, during which system tables and other objects that are required by replication, and the schema and data for each article, are copied to the Subscriber." (Microsoft, "SQL Server 2012 Glossary", 2012)

"The process by which a satellite downloads and runs the same DB2 database commands, operating system commands, and SQL statements from the satellite control server as the other members of its group download and then reports the results to the satellite control server." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"A form of embedded middleware that allows applications to update data on two systems so that the data sets are identical. These services can run via a variety of different transports but typically require some application-specific knowledge of the context and notion of the data being synchronized." (Gartner)

"Data synchronization is the effort to ensure that, once data leaves a system or storage entity, it does not fall out of harmony with its source, thereby creating inconsistency in the data record." (Information) [source

 "1. In replication, the process of data and schema changes being propagated between the Publisher and Subscribers after the initial snapshot has been applied at the Subscriber. 2. In database mirroring, when a mirroring session starts or resumes, the process in which log records of the principal database that have accumulated on the principal server are sent to the mirror server, which writes these log records to disk as quickly as possible to catch up with the principal server." (Microsoft Technet)

"The process of keeping selected data in multiple data sources in agreement." (Microsoft Technet)

"The term, Synchronization, refers to the process of replicating the changes made to documents on one database to the same documents in a second instance of that database." (Couchbase)

