One can consider a data lake as a repository of all of an organization's data in raw form. However, this constraint may be too strict, as data at different levels of processing can be imported as well; for example, the results of data mining or other Data Science techniques can be treated as raw data for further processing.
In the initial definition provided by James Dixon, the difference between a data lake and a data mart/warehouse was expressed metaphorically as the contrast between bottled water and a lake fed (artificially) from various sources. The metaphor thus contrasts the purpose-bound, limited, single-use role of the data mart/warehouse with the natural flow of data that can be tapped and harnessed as desired. These are, though, metaphors intended to appeal to the buyer. Personally, I like to think of the data lake as an extension of the data infrastructure, of which the data mart or warehouse is an integral part. Imposing further constraints seems to have no benefit.
Probably the most important characteristic of a data lake is that it makes an organization's data discoverable and consumable, though the road from there to insight and other benefits is long and requires specific knowledge of the techniques used, as well as of the organization's processes and data. Without this, data lake-based solutions can lead to erroneous results, just as mixing several ingredients without knowing how to use them can lead to cooking experiments far removed from the art of cooking.
A characteristic of data is that they undergo continuous change and have varying timeliness and, accordingly, varying degrees of quality with respect to the data quality dimensions implied and the sources considered. Data need to reflect reality at the level of detail and quality required by the processing application(s), and this applies to data warehouses/marts as well as to data lake-based solutions.
Data in raw form don't necessarily represent the truth, and they don't necessarily acquire good quality no matter how much they are processed. Solutions need to be resilient with respect to the data they handle across their layers, independently of data quality and transmission problems. Whether one talks about ETL, data migration or other types of data processing, maintaining data integrity at the various levels and layers is perhaps the most important demand placed upon solutions.
Snapshots, as moment-in-time recordings of tables, entities, sets of entities, datasets or whole databases, often prove to be the best mechanism for preserving data integrity when this aspect is essential to the processing (e.g. data migrations, high-accuracy measurements). Unfortunately, the more systems are involved in the process and the broader the span of the solution over the sources, the more difficult it becomes to take such snapshots.
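As a minimal sketch, and assuming a hypothetical Customers table in a relational source (the CREATE TABLE ... AS syntax varies by database), such a snapshot can be taken by copying the data as of a given moment into a dedicated table, together with the capture timestamp:

    -- Minimal sketch: persist a moment-in-time copy of the data together with the
    -- capture timestamp, so downstream processing can be validated against a stable
    -- reference point (table and column names are hypothetical).
    CREATE TABLE Customers_Snapshot_20240101 AS
    SELECT CustomerId
         , CustomerName
         , Country
         , CURRENT_TIMESTAMP AS SnapshotTakenAt   -- when the snapshot was captured
      FROM Customers;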
A SQL query's output represents a snapshot of the data, and therefore SQL-based solutions are usually appropriate for most business scenarios in which the characteristics of the data (typically volume, velocity and/or variety) make their processing manageable. However, when the data are extracted by other means, integrity is harder to achieve, especially when there's no timestamp that allows partitioning the data on a time scale; handling data integrity then becomes, in extremis, a programmer's task. In addition, taking snapshots of the data as they change can be a costly and futile undertaking.
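For illustration, and assuming a hypothetical Orders table with a ModifiedDate column, such a timestamp lets the extraction be partitioned on a time scale, so that each run works against a well-defined slice of the data:

    -- Minimal sketch: extract only the records changed within a given time window,
    -- so each run processes a clearly bounded, reproducible slice of the data
    -- (table and column names are hypothetical; date literal syntax varies by database).
    SELECT OrderId
         , CustomerId
         , OrderTotal
         , ModifiedDate
      FROM Orders
     WHERE ModifiedDate >= '2024-01-01'   -- start of the extraction window
       AND ModifiedDate <  '2024-02-01';  -- end of the window (exclusive)

Without such a column, the boundaries of each extraction have to be reconstructed in code, which is exactly the situation described above.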
Further on, maintaining data integrity can prove to be a matter of design, not only with respect to the processing of data, but also with respect to the source applications and the business processes they implement. Mastery of the underlying principles, techniques, patterns and methodologies helps in designing the right solutions.
Note: Written as an answer to a Medium post on data lakes and batch processing in data warehouses.