
22 March 2024

🧭Business Intelligence: Monolithic vs. Distributed Architecture (Part III: Architectural Applications)

 


Now considering the 500 houses and the skyscraper model introduced in the previous post, which do you think will be built first? A skyscraper takes 2-10 years to build, depending on the city in which it is built and its architectural characteristics. A house may take 6-12 months depending on similar factors, but one needs to build 500 houses. For sure the process can be optimized when the houses look the same, though there are many constraints one needs to consider - the number of workers, the tools and construction material available at a given time, the volume of planning, etc.

As a rough estimate, it can take 2-5 years for either architecture to be built, considering that on average the advantages and disadvantages from the various areas can balance each other out. Historical data are in general needed for estimating the actual development time. One can start with a rough estimate and reevaluate it up and down as more information is gathered. This usually happens in Software Engineering as well.

Monolith vs. Distributed Architecture - 500 families

There are multiple ways in which the work can be assigned to the contractors. When the houses are split between domains, each domain can have its own contractor(s), the contractors can be specialized by knowledge areas, or a combination of the two. Contractors' performance should be the same, though in practice no two contractors are alike, and the chances are higher for some contractors to deliver at the expected quality than for others. It would be useful to have worked with the contractors before and to have a partnership spanning years. There are risks on both sides, even if the risks might favor one architecture over the other, and this depends also on the quality of the contractors, designs, and planning.

The planning must be good if not perfect to assure smooth development, as each day can cost money when contractors are involved. The first plan must be made for the whole project and then split individually for each contractor and/or group of buildings. A back-and-forth check between the various plans is needed. Managing by exception can work, though it can also go terribly wrong.

A lot of communication must occur between domains to make sure that everything fits together. Especially at the beginning, all the parties must plan together and make sure that the rules of the game (best practices, policies, procedures, processes, methodologies) are agreed upon. Oversight (governance) needs to happen at a small scale as well as in aggregate to make sure that the rules of the game are followed.

Now, which of the architectures do you think will fit a data warehouse (DWH)? Probably multiple voices will opt for the skyscraper; at least this is how a DWH looks from the outside. However, when one evaluates the architecture behind it, it can resemble a residential complex in which parts are bound together, but there are also parts that can be distributed if needed. For example, in a DWH the HR department has its own area that's isolated from the other areas, as it has higher security demands. There can be 2-3 other areas that don't share objects, and they can be distributed as well. The reasons why all the infrastructure is on one machine are the costs associated with the licenses and the fact that the reporting tools point to only one address.

In data mart-based DWHs, there are multiple buildings within the architecture, and thus the data marts can be distributed across a wider infrastructure, with each domain responsible for its own data mart(s). The data marts are by definition domain-dependent, and this is one of the downsides imputed to this architecture.


17 March 2024

🧭Business Intelligence: Data Products (Part I: A Lego Exercise)


One can define a data product as the smallest unit of data-driven architecture that can be independently deployed and managed (aka product quantum) [1]. In other terms, one can think of a data product as a box (or Lego piece) that takes data as input and performs several transformations on it, from which result several output data (or even data visualizations, or a hybrid of data, visualizations and other content).

At a high level, each Data Analytics solution can be regarded as a set of inputs, a set of outputs, and the transformations that must be performed on the inputs to generate the outputs. The inputs are the data from the operational systems, while the outputs are analytics data that can be anything from plain data to KPIs and other metrics. A data mart, data warehouse, lakehouse or data mesh can be abstracted in this way, though at different scales.

For creating data products within a data mesh, given a set of inputs, outputs and transformations, the challenge is to find horizontal and vertical partitions within these areas to create something that looks like a Lego structure, in which each Lego piece represents a data product, while its color represents its membership to a business domain. Each such piece is self-contained and encapsulates a set of transformations, respectively intermediary inputs and outputs. Multiple such pieces can be combined in a linear or hierarchical fashion to transform the initial inputs into the final outputs.
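
To make the metaphor more tangible, here is a minimal sketch in Python (all names are hypothetical) of a Lego piece as a self-contained unit with declared inputs, outputs and a transformation, combinable linearly into a pipeline:

from dataclasses import dataclass
from typing import Callable, Dict, List

Data = Dict[str, list]  # named datasets, here simply columnar lists

@dataclass
class DataProduct:
    """A Lego piece: declared inputs/outputs plus a transformation."""
    name: str
    domain: str            # the 'color' - membership to a business domain
    inputs: List[str]      # names of the datasets consumed
    outputs: List[str]     # names of the datasets produced
    transform: Callable[[Data], Data]

def run_pipeline(products: List[DataProduct], data: Data) -> Data:
    # Combine pieces linearly: each product reads what it needs from the
    # shared pool and contributes its outputs back to it.
    for product in products:
        subset = {name: data[name] for name in product.inputs}
        data.update(product.transform(subset))
    return data

# Hypothetical usage: a sales-domain piece feeding a finance-domain piece.
sales = DataProduct(
    "sales_orders", "sales", ["raw_orders"], ["clean_orders"],
    lambda d: {"clean_orders": [o for o in d["raw_orders"] if o > 0]})
finance = DataProduct(
    "revenue", "finance", ["clean_orders"], ["total_revenue"],
    lambda d: {"total_revenue": [sum(d["clean_orders"])]})

result = run_pipeline([sales, finance], {"raw_orders": [100, -5, 250]})
print(result["total_revenue"])  # [350]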

Data Products with a Data Mesh

Finding such a partition is possible, though it involves considerable effort, especially in designing the whole thing - identifying each Lego piece uniquely. When each department is on its own and develops its own Lego pieces, there's no guarantee that the pieces from the various domains will fit together to build something cohesive, performant, secure or well-structured. It's like building a house from modules - the pieces must fit together. That would be the role of governance (federated computational governance): to align and coordinate the effort.

Conversely, there are transformations that need to be replicated for obtaining autonomous data products, and the volume of such overlapping can be considerably high. Consider for example the logic available in reports and how often it needs to be replicated. Alternatively, one can create intermediary data products, when that's feasible.

It's challenging to define the inputs and outputs for a single Lego piece. Now imagine doing the same for a whole set of such pieces depending on each other! This might work for small pieces of data and for entities quite stable over their lifetime (e.g. playlists, artists, songs), but with complex information systems the effort can increase by a few factors. Moreover, the complexity of the structure increases as soon as the Lego pieces expand beyond their initial design. It's as if the real Lego pieces would grow within the available space while still keeping the initial structure - strange constructs may result, which, even if they work, shift the edifice's center of gravity in other directions. There will thus be limits to growth that can easily lead to duplication of functionality to overcome such challenges.

Each new output or change in the initial input for these magic boxes involves a change to all the intermediary Lego pieces from input to output. Just recall the last experience of defining the inputs and outputs for an important complex report - how many iterations and how much effort were involved. That might have been an extreme case, though how realistic is the assumption that with data products everything will go smoother? No matter the effort involved in design, there will always be changes and further iterations involved.


References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)

14 March 2024

🧭Business Intelligence: Architecture (Part I: Monolithic vs. Distributed and Zhamak Dehghani's Data Mesh - Debunked)


In [1] the author categorizes data warehouses (DWHs) and lakes as monolithic architectures, as opposed to the data mesh's distributed architecture, which makes me circumspect about the term's use. There are two general definitions of what monolithic means: (1) formed of a single large block; (2) large, indivisible, and slow to change.

In software architecture one can differentiate between monolithic applications, where the whole application is one block of code; multi-tier applications, where the logic is split over several components with different functions that may reside on the same machine or be split non-redundantly between multiple machines; and distributed applications, where the application or its components run on multiple machines in parallel.

Distributed multi-tier applications are a natural evolution of the two types of applications, allowing components to be distributed redundantly across multiple machines. Much later came the cloud, where components are mostly entirely distributed within the same or across distinct geo-locations, respectively cloud providers.

Data Warehouse vs. Data Lake vs. Lakehouse [2]

For licensing and maintenance convenience, a DWH typically resides on one powerful machine with many cores, though components can be moved to other machines and even distributed, the ETL functionality probably being the best candidate for this. In what concerns the overall schema, there can be two or more data stores with different purposes (operational/transactional data stores, data marts), each of them with its own schema. Each such data store could be moved to its own machine, though that's not feasible in practice.

DWHs tend to be large because they need to accommodate a considerable number of tables where data is extracted, transformed, and maybe dumped for the various needs. With the proper design, DWHs can also be partitioned into domains (e.g. one schema per domain) and model domain-based perspectives, at least from a data consumer's perspective. The advantage a DWH offers is that one can create general dimensions and fact tables and build the domain-based perspectives on top of them, thus minimizing code redundancy and reducing the costs.
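
As a rough sketch of this idea, using Python's sqlite3 and hypothetical table names, the general dimension and fact tables are built once, while the domain-based perspectives are just views layered on top of them:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- general dimension and fact table, built once
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (product_id INTEGER, qty INTEGER, amount REAL);

-- domain-based perspectives reuse the same objects, minimizing redundancy
CREATE VIEW sales_perspective AS
  SELECT p.name, SUM(f.amount) AS revenue
  FROM fact_sales f JOIN dim_product p USING (product_id)
  GROUP BY p.name;
CREATE VIEW logistics_perspective AS
  SELECT p.category, SUM(f.qty) AS units_moved
  FROM fact_sales f JOIN dim_product p USING (product_id)
  GROUP BY p.category;
""")
con.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 10, 100.0), (2, 5, 250.0)])
print(con.execute("SELECT * FROM logistics_perspective").fetchall())
# [('Hardware', 15)]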

With this type of design, the DWH can be changed when needed; however, there are several aspects to consider. First, it takes time until the development team can process the request, and this depends on the workload and the priorities set. Second, implementing the changes should take a fair amount of time no matter the overall architecture used, given that the transformations that need to be done on the data are largely the same. Therefore, one should not confuse the speed with which a team can start working on a change with the actual implementation of the change. Third, the possibility of reusing existing objects can speed up the implementation of changes.

Data lakes are distributed data repositories in which structured, unstructured and semi-structured data are dumped in raw form in standard file formats from the various sources and further prepared for consumption in other data files via data pipelines, notebooks and similar means. One can use the medallion architecture with a folder structure and adequate permissions for domains, and build reports and other data artefacts on top.
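
A minimal sketch of the medallion idea, assuming a local folder structure stands in for the lake and the domain names are hypothetical: raw data lands as-is in a bronze layer and is prepared for consumption in a silver layer via a simple pipeline step:

from pathlib import Path
import csv

# Hypothetical layout: one folder per layer, partitioned by domain.
lake = Path("datalake")
for layer in ("bronze", "silver", "gold"):
    for domain in ("sales", "finance"):
        (lake / layer / domain).mkdir(parents=True, exist_ok=True)

# A raw file dumped as-is into bronze...
raw = lake / "bronze" / "sales" / "orders.csv"
raw.write_text("order_id,amount\n1,100\n2,\n3,250\n")

# ...then prepared for consumption in silver
# (here: simply dropping rows with missing amounts).
with raw.open() as src, (lake / "silver" / "sales" / "orders.csv").open("w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row["amount"]:
            writer.writerow(row)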

A data lake's value increases when it is combined with the capabilities of a DWH (see dedicated SQL server pool) and/or of an analytics engine (see serverless SQL pool) that allow(s) building an enterprise semantic model on top of the data lake. The result is a data lakehouse that, from a data consumer's perspective and in the other aspects mentioned above, is not much different from the DWH. The resulting architecture is distributed too.

Especially in the context of cloud computing, referring to today's applications metaphorically (for advocative purposes) as monolithic or distributed is at most a matter of degree and not of distinction. Therefore, the reader should be careful!


References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
[2] Databricks (2022) Data Lakehouse (link)

22 October 2022

🛢DBMS: Data Warehouse/Mart (Just the Quotes)

"Unfortunately, just collecting the data in one place and making it easily available isn’t enough. When operational data from transactions is loaded into the data warehouse, it often contains missing or inaccurate data. How good or bad the data is a function of the amount of input checking done in the application that generates the transaction. Unfortunately, many deployed applications are less than stellar when it comes to validating the inputs. To overcome this problem, the operational data must go through a 'cleansing' process, which takes care of missing or out-of-range values. If this cleansing step is not done before the data is loaded into the data warehouse, it will have to be performed repeatedly whenever that data is used in a data mining operation." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"There are four levels of data in the architected environment - the operational level, the atomic (or the data warehouse) level, the departmental (or the data mart) level, and the individual level. These different levels of data are the basis of a larger architecture called the corporate information factory (CIF). The operational level of data holds application-oriented primitive data only and primarily serves the high-performance transaction-processing community. The data-warehouse level of data holds integrated, historical primitive data that cannot be updated. In addition, some derived data is found there. The departmental or data mart level of data contains derived data almost exclusively. The departmental or data mart level of data is shaped by end-user requirements into a form specifically suited to the needs of the department. And the individual level of data is where much heuristic analysis is done." (William H Inmon, "Building the Data Warehouse" 4th Ed., 2005)

"Having a purposeless or poorly performing dashboard is more common than not. This happens when the underlying architecture is not designed properly to support the needs of dashboard interaction. There is an obvious disconnect between the design of the data warehouse and the design of the dashboards. The people who design the data warehouse do not know what the dashboard will do; and the people who design the dashboards do not know how the data warehouse was designed, resulting in a lack of cohesion between the two. A similar disconnect can also exist between the dashboard designer and the business analyst, resulting in a dashboard that may look beautiful and dazzling but brings very little business value." (Nils H Rasmussen et al, "Business Dashboards: A visual catalog for design and deployment", 2009)

"Having multiple data lakes replicates the same problems that were created with multiple data warehouses - disparate data siloes and data fiefdoms that don't facilitate sharing of the corporate data assets across the organization. Organizations need to have a single data lake from which they can source the data for their BI/data warehousing and analytic needs. The data lake may never become the 'single version of the truth' for the organization, but then again, neither will the data warehouse. Instead, the data lake becomes the 'single or central repository for all the organization's data' from which all the organization's reporting and analytic needs are sourced." (Billl Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"There are, however, many problems with independent data marts. Independent data marts: (1) Do not have data that can be reconciled with other data marts (2) Require their own independent integration of raw data (3) Do not provide a foundation that can be built on whenever there are future analytical needs." (William H Inmon & Daniel Linstedt, "Data Architecture: A Primer for the Data Scientist: Big Data, Data Warehouse and Data Vault", 2015)

"Unfortunately, some organizations are replicating the bad data warehouse practice by creating special-purpose data lakes - data lakes to address a specific business need. Resist that urge! Instead, source the data that is needed for that specific business need into an 'analytic sandbox' where the data scientists and the business users can collaborate to find those data variables and analytic models that are better predictors of the business performance. Within the 'analytic sandbox', the organization can bring together (ingest and integrate) the data that it wants to test, build the analytic models, test the model's goodness of fit, acquire new data, refine the analytic models, and retest the goodness of fit." (Billl Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"Data quality in warehousing and BI is typically defined in terms of the 4 C’s - is the data clean, correct, consistent, and complete? When it comes to big data, there are two schools of thought that have different views and expectations of data quality. The first school believes that the gold standard of the 4 C’s must apply to all data (big and little) used for clinical care and performance metrics. The second school believes that in big data environments, a stringent data quality standard is impossible, too costly, or not required. While diametrically opposite opinions may play well in panel discussions, they do little to reconcile the realities of healthcare data quality." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017) 

"Data warehousing has always been difficult, because leaders within an organization want to approach warehousing and analytics as just another technology or application buy. Viewed in this light, they fail to understand the complexity and interdependent nature of building an enterprise reporting environment." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"A data lake is a storage repository that holds a very large amount of data, often from diverse sources, in native format until needed. In some respects, a data lake can be compared to a staging area of a data warehouse, but there are key differences. Just like a staging area, a data lake is a conglomeration point for raw data from diverse sources. However, a staging area only stores new data needed for addition to the data warehouse and is a transient data store. In contrast, a data lake typically stores all possible data that might be needed for an undefined amount of analysis and reporting, allowing analysts to explore new data relationships. In addition, a data lake is usually built on commodity hardware and software such as Hadoop, whereas traditional staging areas typically reside in structured databases that require specialized servers." (Mike Fleckenstein & Lorraine Fellows, "Modern Data Strategy", 2018)

"A data warehouse follows a pre-built static structure to model source data. Any changes at the structural and configuration level must go through a stringent business review process and impact analysis. Data lakes are very agile. Consumption or analytical layer can be modified to fit in the model requirements. Consumers of a data lake are not constant; therefore, schema and modeling lies at the liberty of analysts and scientists." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data warehousing, as we are aware, is the traditional approach of consolidating data from multiple source systems and combining into one store that would serve as the source for analytical and business intelligence reporting. The concept of data warehousing resolved the problems of data heterogeneity and low-level integration. In terms of objectives, a data lake is no different from a data warehouse. Both are primary advocates of terms like 'single source of truth' and 'central data repository'." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"A defining characteristic of the data lakehouse architecture is allowing direct access to data as files while retaining the valuable properties of a data warehouse. Just do both!" (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"The data lakehouse architecture presents an opportunity comparable to the one seen during the early years of the data warehouse market. The unique ability of the lakehouse to manage data in an open environment, blend all varieties of data from all parts of the enterprise, and combine the data science focus of the data lake with the end user analytics of the data warehouse will unlock incredible value for organizations. [...] "The lakehouse architecture equally makes it natural to manage and apply models where the data lives." (Bill Inmon et al, "Building the Data Lakehouse", 2021)

31 October 2020

🧊Data Warehousing: Architecture (Part III: Data Lakes & other Puddles)


One can consider a data lake as a repository of all of an organization's data found in raw form; however, this constraint might be too harsh, as data found at different levels of processing can be imported as well - for example, the results of data mining or of other Data Science techniques/methods can be considered raw data for further processing.

In the initial definition provided by James Dixon, the difference between a data lake and a data mart/warehouse was expressed metaphorically as the transition from bottled water to lakes streamed (artificially) from various sources. It thus contrasts the objective-oriented, limited and single-purposed role of the data mart/warehouse with the flow of data in nature that can be tapped and harnessed as desired. These are, though, metaphors intended to sensitize the buyer. Personally, I like to think of the data lake as an extension of the data infrastructure, of which the data mart or warehouse is an integral part. Imposing further constraints seems to have no benefit.

Probably the most important characteristic of a data lake is that it makes the data of an organization discoverable and consumable, though from there to insight and other benefits is a long road that requires specific knowledge about the techniques used, as well as about the organization's processes and data. Without this, data lake-based solutions can lead to erroneous results, the same way that mixing several ingredients without knowledge of their usage can lead to cooking experiments far removed from the art of cooking.

A characteristic of data is that they go through continuous change and have different degrees of timeliness, respectively of quality, in respect to the data quality dimensions implied and the sources considered. Data need to reflect the reality at the appropriate level of detail and quality required by the processing application(s), and this applies to data warehouses/marts as well as to data lake-based solutions.

Data found in raw form don't necessarily represent the truth and don't necessarily acquire good quality no matter how much they are processed. Solutions need to be resilient in respect to the data they handle through their layers, independently of the data quality and transmission problems. Whether one talks about ETL, data migration or other types of data processing, keeping data integrity across the various levels and layers is maybe the most important demand upon solutions.

Snapshots, as moment-in-time recordings of tables, entities, sets of entities, datasets or whole databases, often prove to be the best mechanism for keeping data integrity when this aspect is essential to the processing (e.g. data migrations, high-accuracy measurements). Unfortunately, the more systems are involved in the process and the broader the span of the solutions over the sources, the more difficult it becomes to take such snapshots.

A SQL query's output represents a snapshot of the data; therefore, SQL-based solutions are usually appropriate for most business scenarios in which the characteristics of the data (typically volume, velocity and/or variety) make their processing manageable. However, when the data are extracted by other means, integrity is harder to obtain, especially when there's no timestamp to allow data partitioning on a time scale, the handling of data integrity thus becoming, in extremis, a programmer's task. In addition, getting snapshots of the data as they are changed can be a costly and futile task.
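
A minimal sketch of the mechanism, using Python's sqlite3 with a hypothetical table: the snapshot copies the state at extraction time, stamped with a timestamp, so that subsequent changes no longer affect it:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "open"), (2, "shipped")])

# Moment-in-time recording: copy the table's current state, stamped with
# the extraction time, so later changes don't affect the snapshot.
con.execute("""CREATE TABLE orders_snapshot AS
               SELECT *, datetime('now') AS snapshot_at FROM orders""")

con.execute("UPDATE orders SET status = 'cancelled' WHERE id = 1")
print(con.execute("SELECT status FROM orders_snapshot WHERE id = 1").fetchone())
# ('open',) - the snapshot preserved the state at extraction time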

Further on, maintaining data integrity can prove to be a matter of design, in respect not only to the processing of data, but also to the source applications and the business processes they implement. Mastery of the underlying principles, techniques, patterns and methodologies helps in designing the right solutions.

Note:
Written as an answer to a Medium post on data lakes and batch processing in data warehouses.

10 May 2019

🧊💫Data Warehousing: Architecture (Part II: Data Warehousing and Microsoft Dynamics 365)


With Dynamics 365 (D365) Online, Microsoft made an important strategic move on the ERP market; however, in what concerns the BI & Data Warehousing (BI/DW) area, Microsoft changed the rules of the game by allowing no direct SQL access to the production environment. This primarily means that it will become challenging for organizations to use the existing DW infrastructure to access the D365 data, and for Vendors and Service Providers to provide BI/DW solutions integrated within the D365 platform.

D365 includes its own data warehouse (actually a data mart) designed for financial reporting; however, as of now it can't be extended to support other business areas. The solution favored by Microsoft for DW seems to be the use of an Azure SQL Database, aka BYOD (Bring Your Own Database), to which entity-based data can be exported incrementally (aka incremental push) or fully (aka full push) via the Data Management Framework (DMF) packages.

Because many of the D365 tables (e.g. Inventory Transactions, Products, Customers, Vendors) were overnormalized over the years and other tables were added as part of new functionality, Microsoft introduced, to hide this complexity, a new layer of abstraction formed from data entities organized within an entity store. Data entities are view-like encapsulations of the underlying D365 table schema, the data import/export from and to D365 being performed extensively over these data entities via the DMF, which extends the Data Import/Export Framework (DIXF).
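
A hedged sketch of the principle (not the actual D365 schema - the table and entity names below are hypothetical), using Python's sqlite3: the 'entity' is a view that hides the joins over the normalized tables behind one flat object:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- hypothetical normalized tables standing in for the underlying schema
CREATE TABLE customer (account_num TEXT PRIMARY KEY, party_id INTEGER);
CREATE TABLE party (party_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE postal_address (party_id INTEGER, city TEXT);

-- the 'data entity': a view hiding the joins behind one flat object
CREATE VIEW customer_entity AS
  SELECT c.account_num, p.name, a.city
  FROM customer c
  JOIN party p ON p.party_id = c.party_id
  JOIN postal_address a ON a.party_id = c.party_id;
""")
con.execute("INSERT INTO customer VALUES ('C001', 1)")
con.execute("INSERT INTO party VALUES (1, 'Contoso')")
con.execute("INSERT INTO postal_address VALUES (1, 'Koeln')")
print(con.execute("SELECT * FROM customer_entity").fetchone())
# ('C001', 'Contoso', 'Koeln')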

One can thus use a BYOD as a direct source for other reporting tools as long as they support a connection to Azure; otherwise the data can be further loaded into a database in the cloud, which seems to be the best option so far, as long as the organization has other data that need to be consolidated for reporting. From here on, one deals with the traditional way of reporting, and the available infrastructure can be extended to use an additional data source.

The BYOD solution comes with several restrictions: a package needs to be created for each business unit, no composite data entities can be exported, data entities that don't have a unique key can't be exported via an incremental push, data entities can change over time (new versions becoming available), and during synchronization no active locks should be on the database. In addition, organizations that followed this path also report some bugs that needed to be addressed via Microsoft support. Even if the roughly 1700 available data entities facilitate data consumption to some degree, they seem more appropriate for data migrations and data integrations than for DW workloads.

In the absence of direct SQL connectivity, in theory organizations can still use SSIS or similar integration tools to connect to D365 production databases and consume data entities via the Open Data Protocol (OData), a standard that defines a set of best practices for building and consuming RESTful APIs. Besides some architectural challenges, loading big tables with transactional data is reportedly slow and impracticable for loading a data warehouse. Therefore, the usability of such an architecture becomes limited in time.
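
A sketch of what such consumption could look like, following the standard OData v4 paging conventions; the endpoint, entity and token below are hypothetical, and authentication is assumed to have already happened:

import requests

url = "https://myorg.operations.dynamics.com/data/Customers"  # hypothetical
headers = {"Authorization": "Bearer <token>", "Accept": "application/json"}
params = {"$select": "CustomerAccount,OrganizationName", "$top": 1000}

rows = []
while url:
    response = requests.get(url, headers=headers, params=params, timeout=60)
    response.raise_for_status()
    payload = response.json()
    rows.extend(payload["value"])
    # OData returns a continuation link until the feed is exhausted; with
    # large transactional tables this loop is where the slowness shows.
    url = payload.get("@odata.nextLink")
    params = None  # the next link already carries the query options

print(len(rows))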

Microsoft imposed a hard limitation upon its D365 architecture by making its production database inaccessible. Of course, there's still time for Microsoft to do some magic and pull new solutions from the technology-stack hat. Unfortunately, the constraints imposed on the production environments limit organizations' choices for building a modern and flexible data warehouse. For the future it would be great if the DMF could be used directly with standard SQL Server databases, thus avoiding the need for the intermediary Azure database, or if a real-time operational solution could be provided out of the box. We'll see what the future brings...

25 March 2010

🧊Data Warehousing: Mea Culpa (Part I: A Personal Journey)



Any discussion on data warehousing topics, even an unconventional one, can't avoid mentioning the two most widely adopted concepts in data warehousing: B. Inmon's vs. R. Kimball's methodologies. A lot of ink has already been consumed on this topic and it is difficult to come up with something new; however, I can insert in between my experience and personal views on the topic. From the beginning I have to state that I can't take either of the two sides because, from a philosophical viewpoint, I am an adept of "the middle way" and, in addition, when choosing a methodology one has to consider the business' requirements and objectives, the infrastructure, the experience of the resources, and many other factors. I don't believe one method fits all purposes, therefore some flexibility is needed in this concern even from the most virulent advocates. After all, what counts in the end is the degree to which the final solution fits the purpose, and no matter how complex and perfect a methodology is, no matter the precautions taken, given the complexity of software development projects there is always a risk of failure.

  

B. Inmon defines the data warehouse as a “subject-oriented, integrated, non-volatile and time-variant collection of data in support of the management’s decisions” [3] - subject-oriented because it is focused on an organization’s strategic subject areas, integrated because the data come from multiple legacy systems in order to provide a single overview, time-variant because the data warehouse’s content is time dependent, and non-volatile because in theory the data warehouse’s content is not updated but refreshed.


Within my small library and the internet articles I have read on this topic, especially the ones from the Kimball University cycle, I can’t say I found a similarly direct definition of the data warehouse given by R. Kimball; the closest I could get to something in this direction is the data warehouse as a union of data marts, where in his definition a data mart is “a process-oriented subset of the overall organization’s data based on a foundation of atomic data, and that depends only on the physics of the data-measurement events, not on the anticipated user’s questions” [2]. This reflects also an important difference between the two approaches: in Inmon’s philosophy the data marts are updated through the data warehouse, the data in the warehouse being stored in 3rd normal form, while the data marts are multidimensional and thus denormalized.


Even if it’s a nice conceptual tool intended to simplify data manipulation, I can’t say I’m a big fan of dimensional modeling, mainly because it can be easily misused to create awful (inflexible) monster models that can be barely used, sometimes being impossible to go around them without redesigning them. Also the relational models could be easily misused though they are less complex as physical design, easier to model and they offer greater flexibility even if in theory data’s normalization could add further complexity, however there is always a trade between flexibility, complexity, performance, scalability, usability and reusability, to mention just a few of the dimensions associated with data in general and data quality in particular.

  

In order to overcome dimensional modeling’s issues, R. Kimball recommends a four-step approach: first identifying the business processes corresponding to a business measurement or event, secondly declaring the grain (level of detail), and only after that defining the dimensions and facts [1]. I have to admit that starting from the business process adds a plus to this framework because in theory it allows better visibility over the processes, supporting process-based data analysis, though given the fact that a process can span multiple data elements or that multiple processes can partition the same data elements, this increases the complexity of such models. I find that a model based directly on the data elements allows more flexibility at the expense of the work needed to bring the data together, though it should also cover the processes in scope.
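
As an illustration only (the retail example below is hypothetical, not taken from [1]), the outcome of the four steps can be captured as a short design record:

# The design is driven by the measurement event, not by anticipated questions.
design = {
    "business_process": "retail sales",                       # 1. pick the process
    "grain": "one row per product per POS transaction line",  # 2. declare the grain
    "dimensions": ["date", "product", "store", "promotion"],  # 3. choose dimensions
    "facts": ["sales_quantity", "sales_amount"],              # 4. numeric facts at that grain
}
for step, choice in design.items():
    print(f"{step}: {choice}")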

  

Building a data warehouse is quite a complex task, especially if we take into consideration the huge percentage of software project failures, which holds in the data warehousing area as well. On the other side, I’m not sure how much such statistics about software project failures can be taken ad litteram, because different project methodologies and data collection methods are used and detailed information is not always given about the particularities of each project; it would however be interesting to know the failure rate per methodology. Occasionally some numbers are advanced that sustain the benefit of using one or another methodology, but ignoring the subjective approach of such justifications, they often lack adequate details to support them.


My first contact with building a data warehouse was almost 8 years ago when, as part of the Asset Management System I was supposed to work on, the project included also the creation of a data warehouse. Frankly, few things are more frightening than seeing two IT professionals fighting over what approach to use in order to design a data warehouse, and it is needless to say that the fight lasted for several days - calls with the customer, nerves, management involved, a whole arsenal of negotiations that looked like a never-ending story.


Such fights are sometimes part of the landscape, and they should be avoided, the simplest alternative being to put together the advantages and disadvantages of the most important approaches and balance between them; unfortunately there are still professionals who don’t know how, or are not willing, to do that. The main problem in such cases is the time which, instead of being used constructively, is wasted on futile fights. When a lot of time is wasted and a tight schedule applies, one is forced to do the whole work in less time, leading maybe to sloppy solutions.

  

A few years back I had the occasion to develop a data warehouse around the two ERP systems and the other smaller systems one of the customers I worked for had in place, SQL Server 2000 and its DTS (Data Transformation Services) functionality being of great help for this purpose. Even if I had some basic knowledge of the two data warehousing approaches, I had to build the initial data warehouse from scratch, evolving the initial solution over several years.


The design was quite simple: the DTS packages extracted the data from the legacy systems and dumped them into staging tables in normalized or denormalized form, and after several simple transformations the data were loaded into the production tables, the role of the multidimensional data marts being played successfully by views that scaled pretty well to the existing demands. Maybe many data warehouse developers would disregard such a solution, though it was quite a useful exercise and helped me later to more easily understand the literature on this topic and the issues related to it. In addition, while working on the data conversion for two ERP implementations, I had to perform more complex ETL (Extract Transform Load) tasks than the ones considered in the data warehouse itself.
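
A minimal sketch of that design, with Python's sqlite3 standing in for SQL Server 2000 and hypothetical table names: a staging table, a simple transformation into a production table, and a view playing the role of the data mart:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- staging: data dumped as extracted from the legacy systems
CREATE TABLE stg_orders (order_id TEXT, amount TEXT);

-- production: cleaned and typed data after simple transformations
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);

-- a view plays the role of the data mart
CREATE VIEW mart_revenue AS
  SELECT COUNT(*) AS order_count, SUM(amount) AS revenue FROM orders;
""")
con.executemany("INSERT INTO stg_orders VALUES (?, ?)",
                [("1", "100.0"), ("2", "bad"), ("3", "250.0")])
# simple transformation: cast and keep only the valid rows
con.execute("""INSERT INTO orders
               SELECT CAST(order_id AS INTEGER), CAST(amount AS REAL)
               FROM stg_orders WHERE amount GLOB '[0-9]*'""")
print(con.execute("SELECT * FROM mart_revenue").fetchone())  # (2, 350.0)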


In what concerns software development, I am an adept of rapid evolutionary prototyping because it allows getting customers’ feedback in early stages, making it possible to identify issues earlier as per customers’ perceptions, and in addition allowing customers to get a feeling of what’s possible and of what the application looks like. The prototyping method proved to be useful most of the time - I would actually say all the time - and it was often interesting to see how customers’ conceptualization of what they need changed with time, with changes that looked simple leading to partial redesigns of the application. In other development approaches with long releases (e.g. waterfall), the customer gets a glimpse of the application late in the process, when it is often impossible to redesign the application, so the customer has to live with what he got. Call me “old-fashioned”, but I am an adept of rapid evolutionary prototyping also in what concerns the creation of data warehouses, and even if people might argue that a data warehousing project is totally different from a typical development project, it should not be forgotten that almost all software development projects share many particularities, from planning to deployment and further to maintenance.


Even if B. Inmon also embraces the evolutional/iterative approach to building a data warehouse, from a philosophical standpoint rapid evolutionary prototyping applied to data warehouses feels closer to R. Kimball’s methodology, amounting to choosing a functional key area and its essential business processes, building a data mart, and from there building other data marts for the other functional key areas, eventually integrating and aligning them in a common solution - the data warehouse. On the other side, when designing a software component or a module of an application you have also to consider the final goal, as the respective component or module will be part of a broader system, even if in some cases it could exist in isolation. The same can be said about data marts’ creation: even if sometimes a data mart is rooted in the needs of a department, you have to look also at the final goal and address the requirements from that perspective, or at least be aware of them.




References:

[1] M. Ross & R. Kimball (2004) Fables and Facts: Do you know the difference between dimensional modeling truth and fiction? [Online] Available from: http://intelligent-enterprise.informationweek.com/info_centers/data_warehousing/showArticle.jhtml;jsessionid=530A0V30XJXTDQE1GHPSKH4ATMY32JVN?articleID=49400912 (Accessed: 18 March 2010)

[2] R. Kimball & J. Caserta (2004) The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Wiley Publishing Inc. ISBN: 0-7645-7923-1

[3] W. H. Inmon (2005) Building the Data Warehouse, 4th Ed. Wiley Publishing. ISBN: 978-0-7645-9944-6

24 March 2010

🕋Data Warehousing: Data Lake (Definitions)

"If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples." (James Dixon, "Pentaho, Hadoop, and Data Lakes", 2010) [sorce] [first known usage]

"At its core, it is a data storage and processing repository in which all of the data in an organization can be placed so that every internal and external systems', partners', and collaborators' data flows into it and insights spring out. [...] Data Lake is a huge repository that holds every kind of data in its raw format until it is needed by anyone in the organization to analyze." (Beulah S Purra & Pradeep Pasupuleti, "Data Lake Development with Big Data", 2015) 

"Data lakes are repositories of raw source data in their native format that are stored for extended periods." (Saumya Chaki, "Enterprise Information Management in Practice", 2015)

"A repository of data used to manage disparate formats and types of data for a variety of uses." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"A storage system designed to hold vast amounts of raw data in its native (ingested) format, usually in a flat or semi-structured format. Extract, transform, and load (ETL) operations are usually applied to data lakes to extract local data marts for downstream computation." (Benjamin Bengfort & Jenny Kim, "Data Analytics with Hadoop", 2016)

"Data Lake is an analytics system that supports the storing and processing of all types of data." (Maritta Heisel et al, "Software Architecture for Big Data and the Cloud", 2017)

"A data lake is a central repository that allows you to store all your data—structured and unstructured - in volume […]" (Holden Ackerman & Jon King, "Operationalizing the Data Lake", 2019)

"A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning." (Piethein Strengholt, "Data Management at Scale", 2020)

"A repository for storing unstructured and structured data that is downloaded in its raw form and stored by a highly scalable, distributed files system known as open source." (Marcin Flotyński et al, "Non-Technological and Technological (SupTech) Innovations in Strengthening the Financial Supervision", 2021)

"Data lakes are massive repositories for original, raw and unstructured data which is collected from various sources across a smart city. The data from data lakes can be cleansed and transformed for further analytics and modeling." (Vijayaraghavan Varadharajan & Akanksha Rajendra Singh, "Building Intelligent Cities: Concepts, Principles, and Technologies", 2021)

"A data lake is a central location, that holds a large amount of data in its native, raw format, as well as a way to organize large volumes of highly diverse data." (databricks) [source]

"A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale." (Amazon) [source]

"A data lake is a concept consisting of a collection of storage instances of various data assets. These assets are stored in a near-exact, or even exact, copy of the source format and are in addition to the originating data stores." (Gartner)

"A data lake is a collection of long-term data containers that capture, refine, and explore any form of raw data at scale. It is enabled by low-cost technologies that multiple downstream facilities can draw upon, including data marts, data warehouses, and recommendation engines." (Teradata) [source]

"A data lake is a large and diverse reservoir of corporate data stored across a cluster of commodity servers running software, most often the Hadoop platform, for efficient, distributed data processing." (Qlik) [source]

"A data lake is a place to store your structured and unstructured data, as well as a method for organizing large volumes of highly diverse data from diverse sources." (Oracle)

"A Data Lake is a service which provides a protective ring around the data stored in a cloud object store, including authentication, authorization, and governance support." (Cloudera) [source]

"A data lake is a type of data repository that stores large and varied sets of raw data in its native format." (Red Hat) [source]

"A data lake is an unstructured data repository that contains information available for analysis. A data lake ingests data in its raw, original state, straight from data sources, without any cleansing, standardization, remodeling, or transformation. It enables ad hoc queries, data exploration, and discovery-oriented analytics because data management and structure can be applied on the fly at runtime, unlike traditional structured data storage which requires a schema on write." (TDWI)

"A storage repository that holds a large amount of raw data in its native format until it is needed." (Solutions Review)

27 February 2010

🕋Data Warehousing: Conformed Dimension (Definitions)

"A shared dimension that applies to two subject areas or data marts. By utilizing conformed dimensions, comparisons across data marts are meaningful." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"Dimensions are conformed when they are either exactly the same (including the keys) or one is a perfect subset of the other." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit 2nd Ed ", 2002)

"A conformed dimension is one that is built for use by multiple data marts. Conformed dimensions promote consistency by enabling multiple data marts to share the same reference and hierarchy information." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"A dimension whose data is reused by more than one dimensional design. When modeling multiple data marts, standards across marts with respect to dimensions are useful. Warehouse users may be confused when a dimension has the similar meaning but different names, structures, levels, or characteristics among multiple marts. Using standard dimensions throughout the warehousing environment can be referred to as 'conformed' dimensions." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"Dimension tables that are the same, or where one dimension table contains a perfect subset of the attributes of another. Conformance requires that data values be identical, and that the same combination of attribute values is present in each table." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A dimension that is shared between two or more fact tables. It enables the integration of data from different fact tables at query time. This is a foundational principle that enables the longevity of a data warehousing environment. By using conformed dimensions, facts can be used together, aligned along these common dimensions. The beauty of using conformed dimensions is that facts that were designed independently of each other, perhaps over a number of years, can be integrated. The use of conformed dimensions is the central technique for building an enterprise data warehouse from a set of data marts." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"Two sets of business dimensions represented in dimension tables are said to be conformed if both sets are identical in their attributes or if one set is an exact subset of the other. Conformed dimensions are fundamental in the bus architecture for a family of STARS." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A dimension that means and represents the same thing when linked to different fact tables." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

03 February 2010

🕋Data Warehousing: Data Mart [DM] (Definitions)

"A subset of the contents of a data warehouse, stored within its database. A data mart tends to contain data focused at the department level, or on a specific business area. It is frequently implemented to manage the volume and scope of data." (Microsoft Corporation, "Microsoft SQL Server 7.0 System Administration Training Kit", 1999)

"A data warehouse, or repository, whose scope is limited to a single subject area." (William A Giovinazzo, "Internet-Enabled Business Intelligence", 2002)

"A type of data warehouse with data specifically designed for a defined set of functions." (Margaret Y Chu, "Blissful Data ", 2004)

[dependent data mart:] "A data mart that obtains its source data from an Enterprise Data Warehouse." (Margaret Y Chu, "Blissful Data ", 2004)

[independent data mart:] "A data mart that obtains its source data from operational systems or other external media." (Margaret Y Chu, "Blissful Data ", 2004)

"A database that contains a copy of operational data, organized to support analysis of a business process. A data mart may be a subject area within an enterprise data warehouse, or an analytic database that is departmentally focused. When not planned as part of an enterprise data warehouse, a data mart may become a stovepipe. When deployed as an adjunct to a normalized data warehouse, a data mart may contain aggregated data. When built around a conformance bus, the data mart is neither a stovepipe nor an aggregation." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

[stovepipe data mart:] "A departmentally focused data warehouse implementation that does not interoperate with other subject areas. Stovepipes are avoided through the design of a data warehouse bus - a set of conformed dimensions used consistently across subject areas." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A platform that maintains data for analysis by a single organization or user group for a specific set of business purposes." (Jill Dyché & Evan Levy, "Customer Data Integration", 2006)

"A departmentalized structure of data feeding from the data warehouse where data is denormalized based on the department’s need for information." (William H Inmon & Anthony Nesavich, "Tapping into Unstructured Data", 2007)

"A data mart is a specialized database containing a subset of data from a data warehouse that is needed for a particular business purpose. A data mart is used for reporting and analysis of business data." (Allen Dreibelbis et al, "Enterprise Master Data Management", 2008)

"A smaller data warehouse that holds data of interest to a particular group." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"A specialized type of data warehouse that works with a specific set of data to answer a specific need. A data mart is designed to provide quick, easy access to crucial data." (Stuart Mudie et al, "BusinessObjects™ XI Release 2 for Dummies", 2008) 

"A collection of related data from internal and external sources, transformed, integrated, and stored for the purpose of providing strategic information to a specific set of users in an enterprise." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A database in a data warehouse configuration that holds a subset of data specifically organized for a particular kind of reporting." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"A database structured for specific analysis and historical reporting needs." (David Lyle & John G Schmidt, "Lean Integration", 2010)

"A smaller, more specialized version of a data warehouse that includes data from a specific functional area or department." (Ken Withee, "Microsoft Business Intelligence For Dummies", 2010)

"A decision support database supporting Business Intelligence in a limited subject area, using a dimensional data model design." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Small data warehouse designed to support a department or SBU." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed, 2011)

"A subset of a data warehouse that is designed to focus on a specific set of business information." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A subset of a data warehouse that’s usually oriented to a business group or team." (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

[dependent data mart:] "A data mart whose sole source of data is the data warehouse; a dependent data mart is a component of the corporate information factory" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

[independent data mart:] "A data mart whose source data comes directly from legacy systems, rather than being sourced by a data warehouse" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"Data organized to support specific needs of a user community." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"An analytical database built for and used by a business unit or department to slice and dice for analytical reporting and analysis." (Andrew Pham et al, "From Business Strategy to Information Technology Roadmap", 2016)

"A focused collection of operational data that is usually confined to a specific aspect or subject of a business, such as customers, products, or suppliers. It is a more focused decision support data store than a data warehouse." (Daniel J Power & Ciara Heavin, "Decision Support, Analytics, and Business Intelligence" 3rd Ed., 2017)

"A subset of a data warehouse that allows data to be accessed and customized by specific business functions." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"An analytical database built for and used by a business unit or department to slice and dice for analytical reporting and analysis." (Tiffany Pham et al, "From Business Strategy to Information Technology Roadmap", 2018)

"A subset of a data warehouse that contains data that is tailored and optimized for the specific reporting needs of a department or team. A data mart can be a subset of a warehouse for an entire organization, such as data that is contained in online analytical processing (OLAP) tools." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"a subset of the data warehouse used for a specific purpose. Data marts are then department-specific or related to a single line of business (LoB)."  (Francesco Corea, "An Introduction to Data: Everything You Need to Know About AI, Big Data and Data Science", 2019)

"A data structure that is optimized for access. It is designed to facilitate end-user analysis of data. It typically supports a single, analytic application used by a distinct set of workers." (The Data Warehousing Institute)

"A database, usually smaller than a data warehouse, designed to help managers make strategic decisions about their business by focusing on a specific subject or department." (Microstrategy)

"A subset of the contents of a data warehouse that tends to contain data focused at the department level, or on a specific business area." (Microsoft)

"A simple data repository that houses data of a specific discipline." (Solutions Review)

"A data mart is a curated subset of data often generated for analytics and business intelligence users. Data marts are often created as a repository of pertinent information for a subgroup of workers or a particular use case." (snowflake) [source]

"A data mart is a subject-oriented database that is often a partitioned segment of an enterprise data warehouse. The subset of data held in a data mart typically aligns with a particular business unit like sales, finance, or marketing." (Talend) [source]

"A data mart is a subset of data from an enterprise data warehouse in which the relevance is limited to a specific business unit or group of users." (Informatica) [source]

"A data mart is a subset of data stored within the overall data warehouse, for the needs of a specific team, section or department within the business enterprise. […] Data marts make it much easier for individual departments to access key data insights more quickly and helps prevent departments within the business organization from interfering with each other’s data." (Sisense) [source]

"A data mart is the access layer of a data warehouse that is used to provide users with data. Data marts are often seen as small slices of the data warehouse. Data warehouses typically house enterprise-wide data, and information stored in a data mart usually belongs to a specific department or team." (Logi Analytics) [source]

"A data mart serves the same role as a data warehouse, but it is intentionally limited in scope. It may serve one particular department or line of business." (Oracle)

"The data mart is a subject-oriented slice of the data warehouse logical model serving a narrow group of users. Many data marts only need a subset of data from the full tables in the data warehouse." (Teradata) [source]

01 December 2005

IT: Data Models (Definitions)

"A system data model is a collection of the information being addressed by a specific system or function such as a billing system, data warehouse, or data mart. The system model is an electronic representation of the information needed by that system. It is independent of any specific technology or DBMS environment." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"The business data model, sometimes known as the logical data model, describes the major things ('entities') of interest to the company and the relationships between pairs of these entities. It is an abstraction or representation of the data in a given business environment, and it provides the benefits cited for any model. It helps people envision how the information in the business relates to other information in the business ('how the parts fit together')." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"The technology data model is a collection of the specific information being addressed by a particular system and implemented on a specific platform." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

[enterprise data model:] "A high-level, enterprise-wide framework that describes the subject areas, sources, business dimensions, business rules, and semantics of an enterprise." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling 2nd Ed.", 2005)

[canonical data model:] "The definition of a standard organization view of a particular information subject. To be practical, canonical data models include a mapping back to each application view of the same subject." (David Lyle & John G Schmidt, "Lean Integration", 2010)

[hierarchical data model:] "A data model that represents data in a tree-like structure of only one-to-many relationships, where each entity may have a ‘many’ side when related to a parent, and a ‘one’ side when related to a child." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[Enterprise Data Model (EDM):] "A conceptual data model or logical data model providing a common consistent view of shared data across the enterprise, however that is defined, at a point in time. It is common to use the term to mean a high-level, simplified data model, but that is a question of abstraction for presentation." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[network data model:] "A representation of objects and their participation in one or more owner-member sets. In such a model, a both owners and members may participate in multiple sets, affecting a network of objects and relationships." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[enterprise data model:] "A single data model that comprehensively describes the data needs of the entire organization." (Craig S Mullins, "Database Administration", 2012)

[generic data model:] "A data model of an industry, rather than of a specific company; a generic data model can be used as a template that can be customized for a given company within the industry that has been modeled." (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

[corporate data model:] "A data model at the corporate level for the core business objects and their relationship with each other, which is based on the core business object model." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

[Enterprise data model:] "The enterprise data model in many literatures and viewpoints is considered to be a single, standalone unified artifact that describes all data entities and data attributes and their relationships across the enterprise. In most cases this model is combined with the ambition to consolidate all data in an enterprise data warehouse (EDW)." (Piethein Strengholt, "Data Management at Scale", 2020)

