24 December 2010

Data Warehousing: Data Lake (Definitions)

"If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples." (James Dixon, "Pentaho, Hadoop, and Data Lakes", 2010) [source] [first known usage]

"At its core, it is a data storage and processing repository in which all of the data in an organization can be placed so that every internal and external systems', partners', and collaborators' data flows into it and insights spring out. [...] Data Lake is a huge repository that holds every kind of data in its raw format until it is needed by anyone in the organization to analyze." (Beulah S Purra & Pradeep Pasupuleti, "Data Lake Development with Big Data", 2015) 

"Data lakes are repositories of raw source data in their native format that are stored for extended periods." (Saumya Chaki, "Enterprise Information Management in Practice", 2015)

"A repository of data used to manage disparate formats and types of data for a variety of uses." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"A storage system designed to hold vast amounts of raw data in its native (ingested) format, usually in a flat or semi-structured format. Extract, transform, and load (ETL) operations are usually applied to data lakes to extract local data marts for downstream computation." (Benjamin Bengfort & Jenny Kim, "Data Analytics with Hadoop", 2016)
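The ETL step mentioned in the definition above can be illustrated with a minimal Python sketch. It assumes a made-up raw CSV extract (`RAW_ORDERS_CSV`) and a hypothetical helper `build_sales_mart`; neither comes from the quoted book, and a real lake would typically sit on a distributed file system rather than an in-memory string.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw extract, kept in the lake exactly as ingested
# (note the missing amount on order 3).
RAW_ORDERS_CSV = """order_id,region,amount
1,north,10.50
2,south,4.00
3,north,
4,south,6.00
"""

def build_sales_mart(raw_csv):
    """Extract rows from the raw CSV, drop incomplete ones, and
    load per-region totals into a small in-memory 'data mart'."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(raw_csv)):  # extract
        if not row["amount"]:                         # transform: skip rows without an amount
            continue
        totals[row["region"]] += float(row["amount"])  # transform: cast and aggregate
    return dict(totals)                               # load: the mart is a plain dict here

sales_mart = build_sales_mart(RAW_ORDERS_CSV)
print(sales_mart)
```

The raw extract is left untouched; only the derived mart is cleansed and aggregated, which is the separation the definition describes.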

"Data Lake is an analytics system that supports the storing and processing of all types of data." (Maritta Heisel et al, "Software Architecture for Big Data and the Cloud", 2017)

"A data lake is a central repository that allows you to store all your data - structured and unstructured - in volume […]" (Holden Ackerman & Jon King, "Operationalizing the Data Lake", 2019)

"A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning." (Piethein Strengholt, "Data Management at Scale", 2020)

"A repository for storing unstructured and structured data that is downloaded in its raw form and stored by a highly scalable, distributed files system known as open source." (Marcin Flotyński et al, "Non-Technological and Technological (SupTech) Innovations in Strengthening the Financial Supervision", 2021)

"Data lakes are massive repositories for original, raw and unstructured data which is collected from various sources across a smart city. The data from data lakes can be cleansed and transformed for further analytics and modeling." (Vijayaraghavan Varadharajan & Akanksha Rajendra Singh, "Building Intelligent Cities: Concepts, Principles, and Technologies", 2021)

"A data lake is a central location, that holds a large amount of data in its native, raw format, as well as a way to organize large volumes of highly diverse data." (Databricks) [source]

"A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale." (Amazon) [source]

"A data lake is a concept consisting of a collection of storage instances of various data assets. These assets are stored in a near-exact, or even exact, copy of the source format and are in addition to the originating data stores." (Gartner)

"A data lake is a collection of long-term data containers that capture, refine, and explore any form of raw data at scale. It is enabled by low-cost technologies that multiple downstream facilities can draw upon, including data marts, data warehouses, and recommendation engines." (Teradata) [source]

"A data lake is a large and diverse reservoir of corporate data stored across a cluster of commodity servers running software, most often the Hadoop platform, for efficient, distributed data processing." (Qlik) [source]

"A data lake is a place to store your structured and unstructured data, as well as a method for organizing large volumes of highly diverse data from diverse sources." (Oracle)

"A Data Lake is a service which provides a protective ring around the data stored in a cloud object store, including authentication, authorization, and governance support." (Cloudera) [source]

"A data lake is a type of data repository that stores large and varied sets of raw data in its native format." (Red Hat) [source]

"A data lake is an unstructured data repository that contains information available for analysis. A data lake ingests data in its raw, original state, straight from data sources, without any cleansing, standardization, remodeling, or transformation. It enables ad hoc queries, data exploration, and discovery-oriented analytics because data management and structure can be applied on the fly at runtime, unlike traditional structured data storage which requires a schema on write." (TDWI)
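The contrast with schema on write in the definition above can be made concrete with a small Python sketch of schema on read. The records in `RAW_LAKE` and the helper `read_with_schema` are hypothetical illustrations, not part of the TDWI definition: the data is stored as it arrived, and each consumer applies its own structure only when reading.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no cleansing).
RAW_LAKE = [
    '{"id": "1", "amount": "19.90", "country": "DE"}',
    '{"id": "2", "amount": "5", "note": "gift"}',
    '{"id": "3", "amount": "n/a", "country": "FR"}',
]

def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select the requested fields and cast
    them, using None where a value is missing or cannot be cast."""
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            try:
                row[field] = cast(record[field])
            except (KeyError, ValueError):
                row[field] = None
        yield row

# Two consumers read the same raw data with different schemas.
sales_view = list(read_with_schema(RAW_LAKE, {"id": int, "amount": float}))
geo_view = list(read_with_schema(RAW_LAKE, {"id": int, "country": str}))
print(sales_view)
print(geo_view)
```

A schema-on-write store would have rejected or reshaped the second and third records at load time; here they stay in the lake unchanged, and each view decides how to treat the gaps.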

"A storage repository that holds a large amount of raw data in its native format until it is needed." (Solutions Review)

