SQL Troubles: 🪙Business Intelligence: Data Warehouse (Just the Quotes)

22 October 2015

"Unfortunately, just collecting the data in one place and making it easily available isn’t enough. When operational data from transactions is loaded into the data warehouse, it often contains missing or inaccurate data. How good or bad the data is a function of the amount of input checking done in the application that generates the transaction. Unfortunately, many deployed applications are less than stellar when it comes to validating the inputs. To overcome this problem, the operational data must go through a 'cleansing' process, which takes care of missing or out-of-range values. If this cleansing step is not done before the data is loaded into the data warehouse, it will have to be performed repeatedly whenever that data is used in a data mining operation." (Joseph P Bigus, "Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)
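
As a rough illustration of the 'cleansing' step described in the quote above, here is a minimal Python sketch that fills in missing values and clamps out-of-range ones before loading. The field name `qty`, the valid range, and the default are hypothetical choices made for this example; they do not come from the quoted book.

```python
from typing import Optional

# Hypothetical business rules, assumed for illustration only.
MIN_QTY, MAX_QTY = 1, 1000   # assumed valid range for an order quantity
DEFAULT_QTY = 1              # assumed fallback when the value is missing


def cleanse_quantity(raw: Optional[int]) -> int:
    """Replace a missing quantity with the default and clamp it to the valid range."""
    if raw is None:
        return DEFAULT_QTY
    return max(MIN_QTY, min(MAX_QTY, raw))


def cleanse(records):
    """Return new records with a cleansed 'qty' field; other fields are untouched."""
    return [{**r, "qty": cleanse_quantity(r.get("qty"))} for r in records]


# Sample operational rows: one valid, one missing, two out of range.
rows = [
    {"id": 1, "qty": 5},
    {"id": 2, "qty": None},
    {"id": 3, "qty": -4},
    {"id": 4, "qty": 50000},
]
cleaned = cleanse(rows)
```

Doing this once, before the load, is what spares every later data mining run from repeating the same checks. Whether to clamp, default, or reject a bad value is a business decision; the sketch only shows one possible policy.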

"Having a purposeless or poorly performing dashboard is more common than not. This happens when the underlying architecture is not designed properly to support the needs of dashboard interaction. There is an obvious disconnect between the design of the data warehouse and the design of the dashboards. The people who design the data warehouse do not know what the dashboard will do; and the people who design the dashboards do not know how the data warehouse was designed, resulting in a lack of cohesion between the two. A similar disconnect can also exist between the dashboard designer and the business analyst, resulting in a dashboard that may look beautiful and dazzling but brings very little business value." (Nils H Rasmussen et al, "Business Dashboards: A visual catalog for design and deployment", 2009)

"Having multiple data lakes replicates the same problems that were created with multiple data warehouses - disparate data siloes and data fiefdoms that don't facilitate sharing of the corporate data assets across the organization. Organizations need to have a single data lake from which they can source the data for their BI/data warehousing and analytic needs. The data lake may never become the 'single version of the truth' for the organization, but then again, neither will the data warehouse. Instead, the data lake becomes the 'single or central repository for all the organization's data' from which all the organization's reporting and analytic needs are sourced." (Bill Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"Unfortunately, some organizations are replicating the bad data warehouse practice by creating special-purpose data lakes - data lakes to address a specific business need. Resist that urge! Instead, source the data that is needed for that specific business need into an 'analytic sandbox' where the data scientists and the business users can collaborate to find those data variables and analytic models that are better predictors of the business performance. Within the 'analytic sandbox', the organization can bring together (ingest and integrate) the data that it wants to test, build the analytic models, test the model's goodness of fit, acquire new data, refine the analytic models, and retest the goodness of fit." (Bill Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"Data quality in warehousing and BI is typically defined in terms of the 4 C’s - is the data clean, correct, consistent, and complete? When it comes to big data, there are two schools of thought that have different views and expectations of data quality. The first school believes that the gold standard of the 4 C’s must apply to all data (big and little) used for clinical care and performance metrics. The second school believes that in big data environments, a stringent data quality standard is impossible, too costly, or not required. While diametrically opposite opinions may play well in panel discussions, they do little to reconcile the realities of healthcare data quality." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"Data warehousing has always been difficult, because leaders within an organization want to approach warehousing and analytics as just another technology or application buy. Viewed in this light, they fail to understand the complexity and interdependent nature of building an enterprise reporting environment." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"A data lake is a storage repository that holds a very large amount of data, often from diverse sources, in native format until needed. In some respects, a data lake can be compared to a staging area of a data warehouse, but there are key differences. Just like a staging area, a data lake is a conglomeration point for raw data from diverse sources. However, a staging area only stores new data needed for addition to the data warehouse and is a transient data store. In contrast, a data lake typically stores all possible data that might be needed for an undefined amount of analysis and reporting, allowing analysts to explore new data relationships. In addition, a data lake is usually built on commodity hardware and software such as Hadoop, whereas traditional staging areas typically reside in structured databases that require specialized servers." (Mike Fleckenstein & Lorraine Fellows, "Modern Data Strategy", 2018)

"A data warehouse follows a pre-built static structure to model source data. Any changes at the structural and configuration level must go through a stringent business review process and impact analysis. Data lakes are very agile. Consumption or analytical layer can be modified to fit in the model requirements. Consumers of a data lake are not constant; therefore, schema and modeling lies at the liberty of analysts and scientists." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data warehousing, as we are aware, is the traditional approach of consolidating data from multiple source systems and combining into one store that would serve as the source for analytical and business intelligence reporting. The concept of data warehousing resolved the problems of data heterogeneity and low-level integration. In terms of objectives, a data lake is no different from a data warehouse. Both are primary advocates of terms like 'single source of truth' and 'central data repository'." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"A defining characteristic of the data lakehouse architecture is allowing direct access to data as files while retaining the valuable properties of a data warehouse. Just do both!" (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"The data lakehouse architecture presents an opportunity comparable to the one seen during the early years of the data warehouse market. The unique ability of the lakehouse to manage data in an open environment, blend all varieties of data from all parts of the enterprise, and combine the data science focus of the data lake with the end user analytics of the data warehouse will unlock incredible value for organizations. [...] The lakehouse architecture equally makes it natural to manage and apply models where the data lives." (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"A data warehouse service provides cleansed and transformed data that can be used for multiple purposes. First, it serves as a layer for reporting and BI. Second, it is a platform to query data for business or data analysis. Third, it serves as a repository to store historical data that needs to be online and available. Finally, it also acts as a source of transformed data for other downstream data marts that may cater to specific departmental requirements." (Pradeep Menon, "Data Lakehouse in Action", 2022)

"Historically, for their analytics needs, enterprises relied upon a set of tightly coupled tools, typically provided by a single vendor. Nowadays, nearly all of the components of a traditional data warehouse are independent and interchangeable. Those independent tools can be flexibly combined to provide a modern data stack. It is common for current enterprises to have separate tools for data ingestion, data pipelines, data storage and querying, data visualization and business intelligence, and data quality. Furthermore, data can flow in the opposite direction out of the data warehouse in what is referred to as reverse extract, transform, and load (ETL)." (Fadi Maali & Jason Lim, "Implementing a Modern Data Catalog to Power Data Intelligence: Make Trustworthy Data Central to Your Organization", 2022)

"Lakehouse is a new architecture and data storage paradigm that combines the characteristics of both data warehouses and data lakes to create a unified basis for all types of use cases to be built on top of it. There is no need to move data around. Data is curated and remains in an open format and serves as the single source of truth (SSOT) for all the consumption layers. A modern data platform has needs that span traditional data warehouses, data lakes, machine learning systems, and streaming systems and there is some overlap among these systems. A Lakehouse offers features that span all four systems [...]" (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)

"Simply put, 'lakehouse' refers to an open data architecture that combines the best of data lakes and data warehouses on a single platform. At this point, it would be fair to say that a lakehouse is closer to a data lake than a data warehouse. In fact, it is an extension of your data lake to support all use cases, from BI to AI. All data science and ML personas who were shunted into downstream applications because the tools of their trade were so vastly different and can now share the same stage and have access to the same data as other data personas. This eliminates the need to stitch fragile systems together and leads to better data quality and end-to-end latencies since there is no need to copy data across disparate architectures." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)

"Traditional data lakes provide the necessary scalability, but not the real-time concurrency and latency needed for BI use cases. Delta comes to the rescue once again by providing performance at scale with a host of optimization techniques, such as caching, data compaction, and indexing. Previously, a subset of the curated data would be pushed to a warehouse to satisfy the latency and concurrency requirements of known queries. What this meant was that if a consumer needed a different access pattern or a slightly older dataset that was not available, they would have to request that their IT or data team get involved. This took data democratization a step backward. Ideally, we should allow people to access any data that they have privileges to. Delta Lake goes a step forward and allows BI tools to access data directly from the lake instead of accessing a sliver of the data in their expensive warehouses." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)

"A data warehouse is a centralized repository of structured, cleaned, and verified data that has been extracted, transformed, and loaded from various sources. These steps are commonly called ETL, which stands for Extract, Transform, Load. This data processing methodology involves extracting data from multiple sources, transforming it to meet business needs, and loading it into a destination for analysis and consultation." (Christopher Maneu et al, "The Definitive Guide to Microsoft Fabric: From discovery to building a unified, secure, and scalable data platform", 2025)
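
The three ETL steps named in the quote above can be sketched in a few lines of Python. This is only an illustrative toy, assuming two hypothetical in-memory sources and an in-memory SQLite database standing in for the warehouse; the table name `sales` and the field names are made up for the example.

```python
import sqlite3


def extract():
    """Extract: gather raw rows from two hypothetical sources with different conventions."""
    crm = [{"customer": " Alice ", "amount": "10.50"}]
    shop = [{"customer": "BOB", "amount": "3"}]
    return crm + shop


def transform(rows):
    """Transform: normalize customer names and convert amounts from text to numbers."""
    return [(r["customer"].strip().title(), float(r["amount"])) for r in rows]


def load(conn, rows):
    """Load: write the transformed rows into the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


# In-memory SQLite stands in for the warehouse in this sketch.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
result = conn.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
```

A real pipeline would read from operational systems and write to a warehouse engine, but the shape is the same: the destination only ever sees data that has already been made consistent.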

"A lake based on the medallion architecture combines the best of lakes and data warehouses. By breaking down silos and eliminating data duplication, it becomes a standard for building data platform architecture." (Christopher Maneu et al, "The Definitive Guide to Microsoft Fabric: From discovery to building a unified, secure, and scalable data platform", 2025)

"A lakehouse is a data storage space that hosts and manages all types of data in one place (structured, semi-structured, and unstructured), allowing different tools to normalize and examine this data according to organizational requirements and/or individual choices. A lakehouse thus combines the best aspects of a data lake and a data warehouse by eliminating data duplication and friction related to ingestion, transformation, and sharing of data within the organization, all in the open format, Delta Lake." (Christopher Maneu et al, "The Definitive Guide to Microsoft Fabric: From discovery to building a unified, secure, and scalable data platform", 2025)

"Considered by many companies as the next generation of data architecture, the data mesh represents the natural evolution of traditional data lakes and data warehouses. While the latter are often limited by their centralized and monolithic structure, the data mesh aims to enable companies to deploy a more flexible, responsive, and massively scalable data strategy." (Christopher Maneu et al, "The Definitive Guide to Microsoft Fabric: From discovery to building a unified, secure, and scalable data platform", 2025)
