SQL Troubles

30 December 2015

🪙Business Intelligence: Data Pipelines (Just the Quotes)

"Data Lake is a single window snapshot of all enterprise data in its raw format, be it structured, semi-structured, or unstructured. Starting from curating the data ingestion pipeline to the transformation layer for analytical consumption, every aspect of data gets addressed in a data lake ecosystem. It is supposed to hold enormous volumes of data of varied structures." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018

"The quality of data that flows within a data pipeline is as important as the functionality of the pipeline. If the data that flows within the pipeline is not a valid representation of the source data set(s), the pipeline doesn’t serve any real purpose. It’s very important to incorporate data quality checks within different phases of the pipeline. These checks should verify the correctness of data at every phase of the pipeline. There should be clear isolation between checks at different parts of the pipeline. The checks include checks like row count, structure, and data type validation." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"For advanced analytics, a well-designed data pipeline is a prerequisite, so a large part of your focus should be on automation. This is also the most difficult work. To be successful, you need to stitch everything together." (Piethein Strengholt, "Data Management at Scale: Best Practices for Enterprise Architecture", 2020)

"A data pipeline is a series of transformation steps (functions) executed as the data flows from one step to another. Data mesh refrains from using pipelines as a top-level architectural paradigm and in between data products. The challenge with pipelines as currently used is that they don’t create clear interfaces, contracts, and abstractions that can be maintained easily as the pipeline complexity complexity grows. Due to lack of abstractions, single failure in the pipeline causes cascading failures." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Data lake architecture suffers from complexity and deterioration. It creates complex and unwieldy pipelines of batch or streaming jobs operated by a central team of hyper-specialized data engineers. It deteriorates over time. Its unmanaged datasets, which are often untrusted and inaccessible, provide little value. The data lineage and dependencies are obscured and hard to track." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Data mesh [...] reduces points of centralization that act as coordination bottlenecks. It finds a new way of decomposing the data architecture without slowing the organization down with synchronizations. It removes the gap between where the data originates and where it gets used and removes the accidental complexities - aka pipelines - that happen in between the two planes of data. Data mesh departs from data myths such as a single source of truth, or one tightly controlled canonical data model." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"A data pipeline is an artifact of a data engineering process. It transforms raw data into data ready for analytics. These in turn help solve problems, aid support decisions, and make our lives more convenient. In some ways, it can be thought of as the stitch between the OLTP and OLAP systems. Data pipelines are sometimes referred to as ETL, which stands for extract, transform, load, and it has a variation called extract, load, transform (ELT). The main difference between the two is whether the incoming data is first saved to disk and then transformed (data wrangling) or vice versa. The processing is loosely referred to as ETL. Although, it is fair to say ELT is relevant in the context of Data Lakes and unstructured data, whereas ETL is used for Data Warehouses." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)

"Historically, for their analytics needs, enterprises relied upon a set of tightly coupled tools, typically provided by a single vendor. Nowadays, nearly all of the components of a traditional data warehouse are independent and interchangeable. Those independent tools can be flexibly combined to provide a modern data stack. It is common for current enterprises to have separate tools for data ingestion, data pipelines, data storage and querying, data visualization and business intelligence, and data quality. Furthermore, data can flow in the opposite direction out of the data warehouse in what is referred to as reverse extract, transform, and load (ETL)." (Fadi Maali & Jason Lim, "Implementing a Modern Data Catalog to Power Data Intelligence: Make Trustworthy Data Central to Your Organization", 2022)

"Data has historically been treated as a second-class citizen, as a form of exhaust or by-product emitted by business applications. This application-first thinking remains the major source of problems in today’s computing environments, leading to ad hoc data pipelines, cobbled together data access mechanisms, and inconsistent sources of similar-yet-different truths. Data mesh addresses these shortcomings head-on, by fundamentally altering the relationships we have with our data. Instead of a secondary by-product, data, and the access to it, is promoted to a first-class citizen on par with any other business service." (Adam Bellemare,"Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures", 2023)

"Gaining more insight into data, simplifying data access, enabling shopping-for-data, augmenting traditional data governance, generating active metadata, and accelerating development of products and services are enabled by infusing AI into the Data Fabric architecture. An AI-infused Data Fabric is not only leveraging AI but also likewise an architecture to manage and deal with AI artefacts, including AI models, pipelines, etc." (Eberhard Hechler et al, "Data Fabric and Data Mesh Approaches with AI", 2023)

"While a Data Fabric is an architecture that facilitates the end-to-end integration of various data and AI pipelines across hybrid cloud environments through the use of intelligent and automated systems and applications, a Data Mesh should be seen as a solution, which is geared toward delivering data-as-a-product in an organizational federated approach." (Eberhard Hechler et al, "Data Fabric and Data Mesh Approaches with AI", 2023)

"Fabric Pipelines provide reliable and efficient end-to-end orchestration of data flows, managing ingestion, transformation, and loading through a sequence of steps that can leverage various data processing engines. They allow centralizing and orchestrating data movements from various sources, thanks to advanced connectivity features, and with great scalability. Built-in monitoring tools enable real-time tracking of data flow status and quick detection of anomalies or errors." (Christopher Maneu et al, "The Definitive Guide to Microsoft Fabric From discovery to building a unified, secure, and scalable data platform", 2025)

"It should be noted that, unlike Dataflow Gen2, in pipelines, it is not mandatory to enable staging to load data into a warehouse. Indeed, pipelines are designed for more general orchestration scenarios where you can combine various activities such as transformations, API calls, and so on to create complex workflows. They are not specifically focused on data preparation but rather on end-to-end process automation. Pipelines are more flexible and used for a variety of orchestration tasks, whereas Dataflow Gen2 is specifically designed for data preparation and transformation, hence the requirement for staging in that case." (Christopher Maneu et al, "The Definitive Guide to Microsoft Fabric From discovery to building a unified, secure, and scalable data platform", 2025)

29 December 2015

🪙Business Intelligence: Storytelling (Just the Quotes)

"Storytelling reveals meaning without committing the error of defining it." (Hannah Arendt, "Men in Dark Times", 1968)

"Scientific practice may be considered a kind of storytelling practice [...]" (Donna Haraway, "Primate Visions", 1989)

"Storytelling is the art of unfolding knowledge in a way that makes each piece contribute to a larger truth." (Philip Gerard, "Writing a Book That Makes a Difference", 2000)

"The human mind is a wanton storyteller and even more, a profligate seeker after pattern. We see faces in clouds and tortillas, fortunes in tea leaves and planetary movements. It is quite difficult to prove a real pattern as distinct from a superficial illusion." (Richard Dawkins, "A Devil's Chaplain", 2003)

"The world of a story is not merely the sum of all the words we put on a page, or on many pages. When we talk about entering the world of a story as a reader we refer to things we picture, or imagine, and responses we form - to characters, events - all of which are prompted by, but not entirely encompassed by, the words on the page." (Peter Turchi, "Maps of the Imagination: The writer as cartographer", 2004)

"We have, as human beings, a storytelling problem. We're a bit too quick to come up with explanations for things we don't really have an explanation for." (Malcolm Gladwell, "Blink: The Power of Thinking Without Thinking", 2005)

"There is an extraordinary power in storytelling that stirs the imagination and makes an indelible impression on the mind." (Brennan Manning, "The Ragamuffin Gospel: Good News for the Bedraggled, Beat-Up, and Burnt Out", 2008)

"Mostly we rely on stories to put our ideas into context and give them meaning. It should be no surprise, then, that the human capacity for storytelling plays an important role in the intrinsically human-centered approach to problem solving, design thinking." (Tim Brown, "Change by Design: How Design Thinking Transforms Organizations and Inspires Innovation", 2009)

"The purpose of a storyteller is not to tell you how to think, but to give you questions to think upon." (Brandon Sanderson, "The Way of Kings", 2010)

"Visualizations act as a campfire around which we gather to tell stories." (Al Shalloway, 2011)

"The storytelling mind is allergic to uncertainty, randomness, and coincidence. It is addicted to meaning. If the storytelling mind cannot find meaningful patterns in the world, it will try to impose them. In short, the storytelling mind is a factory that churns out true stories when it can, but will manufacture lies when it can't." (Jonathan Gottschall, "The Storytelling Animal: How Stories Make Us Human", 2012)

"We are, as a species, addicted to story. Even when the body goes to sleep, the mind stays up all night, telling itself stories." (Jonathan Gottschall, "The Storytelling Animal", 2012)

"The fact of storytelling hints at a fundamental human unease, hints at human imperfection. Where there is perfection there is no story to tell." (Ben Okri, "A Way of Being Free", 2014)

"There is no such thing as a fact. There is only how you saw the fact, in a given moment. How you reported the fact. How your brain processed that fact. There is no extrication of the storyteller from the story." (Jodi Picoult, "Small Great Things", 2016)

"Stories can begin with a question or line of inquiry." (Kristen Sosulski, "Data Visualization Made Simple: Insights into Becoming Visual", 2018)

"Data storytelling provides a bridge between the worlds of logic and emotion. A data story offers a safe passage for your insights to travel around emotional pitfalls and through analytical resistance that typically impede facts." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

27 December 2015

🪙Business Intelligence: Insight (Just the Quotes)

"Knowledge workers and BI experts must continually evaluate the reports, dashboards, alerts, and other mechanisms for disseminating factual information to ensure the design facilitates insight." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"In fact, the analogy to storytelling is limited when applied to communicating with data. Data visualization has fundamental characteristics missing from traditional storytelling. For example, interactive data visualizations let audiences explore information to find insights that resonate with them. Visualizations take shape based to a large extent on the underlying data. And as this data changes, the emphasis and message of the visualization is likely to change." (Zach Gemignani et al, "Data Fluency", 2014)

"[…] the better insights are communicated, the more likely it is that data leads to positive action (in this case, better business decisions)." (Bernard Marr, "Data Strategy", 2017)

"Data Lake induces accessibility and catalyzes availability. It warrants data discovery platforms to soak the data trends at a horizontal scale and produce visual insights. It largely cuts down the time that goes into data preparation and exhaustive data analysis." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Bad data are expensive: my best estimate is that it costs a typical company 20% of revenue. Worse, they dilute trust - who would trust an exciting new insight if it is based on poor data! And worse still, sometimes bad data are simply dangerous; look at the damage brought on by the financial crisis, which had its roots in bad data." (Rupa Mahanti, "Data Quality: Dimensions, Measurement, Strategy, Management, and Governance", 2019)

"First, from an ethos perspective, the success of your data story will be shaped by your own credibility and the trustworthiness of your data. Second, because your data story is based on facts and figures, the logos appeal will be integral to your message. Third, as you weave the data into a convincing narrative, the pathos or emotional appeal makes your message more engaging. Fourth, having a visualized insight at the core of your message adds the telos appeal, as it sharpens the focus and purpose of your communication. Fifth, when you share a relevant data story with the right audience at the right time (kairos), your message can be a powerful catalyst for change." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"Ensure you build into your data literacy strategy learning on data quality. If the individuals who are using and working with data do not understand the purpose and need for data quality, we are not sitting in a strong position for great and powerful insight. What good will the insight be, if the data has no quality within the model?" (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"In the same vein, data strategy is often a misnomer for a much wider scope of coverage, but the lack of coherence in how we use the language has led to data strategy being perceived to cover data management activities all the way through to exploitation of data in the broadest sense. The occasional use of information strategy, intelligence strategy or even data exploitation strategy may differentiate, but the lack of a common definition on what we mean tends to lead to data strategy being used as a catch-all for the more widespread coverage such a document would typically include. Much of this is due to the generic use of the term ‘data’ to cover everything from its capture, management, governance through to reporting, analytics and insight." (Ian Wallis, "Data Strategy: From definition to execution", 2021)

"Current decision-making in business suffers from insight gaps. Organizations invest in data and analytics, hoping that will provide them with insights that they can use to make decisions, but in reality, there are many challenges and obstacles that get in the way of that process. One of the biggest challenges is that these organizations tend to focus on technology and hard skills only. They are definitely important, but you will not automatically get insights and better decisions with hard skills alone. Using data to make better data-informed decisions requires not only hard skills but also soft skills as well as mindsets." (Angelika Klidas & Kevin Hanegan, "Data Literacy in Practice", 2022)

"Data engineering is the process of converting raw data into analytics-ready data that is more accessible, usable, and consumable than its raw format. Modern companies are increasingly becoming data-driven, which means they use data to make business decisions to give them better insights into their customers and business operations. They can use these to improve profitability, reduce costs, and give them a competitive edge in the market. Behind the scenes, a series of tasks and processes are performed by a host of data personas who build reliable pipelines to source, transform, and analyze data so that it is a repeatable and mostly automated process." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)

"Decision-makers are constantly provided data in the form of numbers or insights, or similar. The challenge is that we tend to believe every number or piece of data we hear, especially when it comes from a trusted source. However, even if the source is trusted and the data is correct, insights from the data are created when we put it in context and apply meaning to it. This means that we may have put incorrect meaning to the data and then made decisions based on that, which is not ideal. This is why anyone involved in the process needs to have the skills to think critically about the data, to try to understand the context, and to understand the complexity of the situation where the answer is not limited to just one specific thing. Critical thinking allows individuals to assess limitations of what was presented, as well as mitigate any cognitive bias that they may have." (Angelika Klidas & Kevin Hanegan, "Data Literacy in Practice", 2022)

"Understanding modern data architectures and sound data engineering principles and practices are crucial to ensure that your AI and BI strategies are reliable and defensible. Generated insights are going to be as good as the quality of the underlying data, so the upfront effort put into understanding the data, modeling it, and transforming it per the business needs goes a long way to foster innovation, productivity, and agility in your data teams." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)

"We are at the interesting conjunction of big data, the cloud, and artificial intelligence (AI), all of which are fueling tremendous innovation in every conceivable industry vertical and generating data exponentially. Data engineering is increasingly important as data drives business use cases in every industry vertical. You may argue that data scientists and machine learning practitioners are the unicorns of the industry, and they can work their magic for business. That is certainly a stretch of the imagination. Simple algorithms and a lot of good reliable data produce better insights than complicated algorithms with inadequate data." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)

"Establishing a comprehensive observability architecture necessitates a systematic approach that spans the entirety of the data pipeline, from initial telemetry collection to actionable insights accessible by diverse stakeholders. The core objective is to unify distributed data sources - metrics, logs, traces, and quality signals - into a coherent framework that enables rapid diagnosis, continuous monitoring, and strategic decision-making." (William Smith, "Soda Core for Modern Data Quality and Observability: The Complete Guide for Developers and Engineers", 2025)

"The lakehouse combines the best elements of data lakes and data warehouses for OLAP workloads. It merges the scalability and flexibility of data lakes with the management features and performance optimization of data warehouses. [...] A lakehouse eliminates the need for disjointed systems and provides a single, coherent platform for all forms of data analysis. Lakehouses enhance the performance of data queries and simplify data management, making it easier for organizations to derive insights from their data." (Denny Lee et al, "Delta Lake: The Definitive Guide", 2025)

"Viewing the dendrograms in high dimensions provides insight into how the algorithm has joined points to clusters. For example, single linkage often has edges leading to a single focal point, which might not yield a useful clustering but might help to

identify outliers. If the edges point to multiple focal points, with long edges bridging gaps in the data, the result is more likely yielding a useful clustering." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

26 December 2015

🪙Business Intelligence: Measurement (Just the Quotes)

"There is no inquiry which is not finally reducible to a question of Numbers; for there is none which may not be conceived of as consisting in the determination of quantities by each other, according to certain relations." (Auguste Comte, “The Positive Philosophy”, 1830)

"When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of science.” (Lord Kelvin, "Electrical Units of Measurement", 1883)

"Of itself an arithmetic average is more likely to conceal than to disclose important facts; it is the nature of an abbreviation, and is often an excuse for laziness." (Arthur Lyon Bowley, "The Nature and Purpose of the Measurement of Social Phenomena", 1915)

"Science depends upon measurement, and things not measurable are therefore excluded, or tend to be excluded, from its attention." (Arthur J Balfour, "Address", 1917)

"It is important to realize that it is not the one measurement, alone, but its relation to the rest of the sequence that is of interest." (William E Deming, "Statistical Adjustment of Data", 1943)

"The purpose of computing is insight, not numbers […] sometimes […] the purpose of computing numbers is not yet in sight." (Richard Hamming, "Numerical Methods for Scientists and Engineers", 1962)

"A quantity like time, or any other physical measurement, does not exist in a completely abstract way. We find no sense in talking about something unless we specify how we measure it. It is the definition by the method of measuring a quantity that is the one sure way of avoiding talking nonsense..." (Hermann Bondi, "Relativity and Common Sense", 1964)

"Measurement, we have seen, always has an element of error in it. The most exact description or prediction that a scientist can make is still only approximate." (Abraham Kaplan, "The Conduct of Inquiry: Methodology for Behavioral Science", 1964)

"A mature science, with respect to the matter of errors in variables, is not one that measures its variables without error, for this is impossible. It is, rather, a science which properly manages its errors, controlling their magnitudes and correctly calculating their implications for substantive conclusions." (Otis D Duncan, "Introduction to Structural Equation Models", 1975)

"Data in isolation are meaningless, a collection of numbers. Only in context of a theory do they assume significance […]" (George Greenstein, "Frozen Star", 1983)

"Changing measures are a particularly common problem with comparisons over time, but measures also can cause problems of their own. [...] We cannot talk about change without making comparisons over time. We cannot avoid such comparisons, nor should we want to. However, there are several basic problems that can affect statistics about change. It is important to consider the problems posed by changing - and sometimes unchanging - measures, and it is also important to recognize the limits of predictions. Claims about change deserve critical inspection; we need to ask ourselves whether apples are being compared to apples - or to very different objects." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Measurement is often associated with the objectivity and neatness of numbers, and performance measurement efforts are typically accompanied by hope, great expectations and promises of change; however, these are then often followed by disbelief, frustration and what appears to be sheer madness." (Dina Gray et al, "Measurement Madness: Recognizing and avoiding the pitfalls of performance measurement", 2015)

"Measuring anything subjective always prompts perverse behavior. [...] All measurement systems are subject to abuse." (Kaiser Fung, "Numbersense: How To Use Big Data To Your Advantage", 2013)

"The value of having numbers - data - is that they aren't subject to someone else's interpretation. They are just the numbers. You can decide what they mean for you." (Emily Oster, "Expecting Better", 2013)

"Until a new metric generates a body of data, we cannot test its usefulness. Lots of novel measures hold promise only on paper." (Kaiser Fung, "Numbersense: How To Use Big Data To Your Advantage", 2013)

"Usually, it is impossible to restate past data. As a result, all history must be whitewashed and measurement starts from scratch." (Kaiser Fung, "Numbersense: How To Use Big Data To Your Advantage", 2013)

25 December 2015

🪙Business Intelligence: Data Mesh (Just the Quotes)

"Another myth is that we shall have a single source of truth for each concept or entity. […] This is a wonderful idea, and is placed to prevent multiple copies of out-of-date and untrustworthy data. But in reality it’s proved costly, an impediment to scale and speed, or simply unachievable. Data Mesh does not enforce the idea of one source of truth. However, it places multiple practices in place that reduces the likelihood of multiple copies of out-of-date data." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Data Mesh attempts to strike a balance between team autonomy and inter-term interoperability and collaboration, with a few complementary techniques. It gives domain teams autonomy to have control of their local decision making, such as choosing the best data model for their data products. While it uses the computational governance policies to impose a consistent experience across all data products; for example, standardizing on the data modeling language that all domains utilize." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Data mesh is a solution for organizations that experience scale and complexity, where existing data warehouse or lake solutions have become blockers in their ability to get value from data at scale and across many functions of their business, in a timely fashion and with less friction." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Data Mesh must allow for data models to change continuously without fatal impact to downstream data consumers, or slowing down access to data as a result of synchronizing change of a shared global canonical model. Data Mesh achieves this by localizing change to domains by providing autonomy to domains to model their data based on their most intimate understanding of the business without the need for central coordinations of change to a single shared canonical model." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"A core premise of data mesh is federating data ownership among domain data owners who are responsible for their data as a product. Offering the data as a product requires the data to be discoverable and to have explicitly stated quality characteristics and a clearly defined access method. Such requirements are at the core of what data catalogs support. With support for data labeling, curation, and crowdsourced feedback, data catalogs are well positioned to offer data as a product. Furthermore, data catalogs support the enforcement of compliant data usage, which becomes more important when data ownership is not managed centrally." (Fadi Maali & Jason Lim, "Implementing a Modern Data Catalog to Power Data Intelligence: Make Trustworthy Data Central to Your Organization", 2022)

"Data mesh relies on a distributed architecture that consists of domains. Each domain is an independent unit of data and its associated storage and compute components. When an organization contains various product units, each with its own data needs, each product team owns a domain that is operated and governed independently by the product team. […] Data mesh has a unique value proposition, not just offering scale of infrastructure and scenarios but also helping shift the organization’s culture around data," (Rukmani Gopalan, "The Cloud Data Lake: A Guide to Building Robust Cloud Data Architecture", 2022)

"Each domain data lakehouse may opt to have its data catalog. However, the critical component in this architecture is the data mesh catalog. The data mesh catalog is the master catalog used to discover the data elements available in different nodes. Each domain-oriented node will donate its metadata to the data mesh catalog. This donation of metadata determines the effectiveness of the data mesh architecture. Once the metadata is contributed, other nodes can browse through the data mesh catalog. They can select the data of interest and mutually share data between the nodes through a governed data sharing process. The critical point to note here is that, unlike the hub-spoke architecture, the data mesh architecture enables data sharing between the 'spoke nodes'. There is no hub node in a data mesh architecture." (Pradeep Menon, "Data Lakehouse in Action", 2022)

"The data mesh pattern doesn't feature a central node and is loosely coupled compared to a hub-spoke architecture. It has different data lakehouse nodes that are independent of each other. The node data lakehouses are domain-driven. A domain can be oriented in multiple ways. The original idea of data mesh alludes to a source-oriented domain aligning to business processes. However, a more practical approach would be to define a domain based on the organizational setup and practicality; for example, a domain can be a product group, it can be separate organizational entities, and it can also be a specific business process, such as marketing. Each domain has its own data lakehouse that is managed and maintained by that domain." (Pradeep Menon, "Data Lakehouse in Action", 2022)

"A data fabric is an architectural approach to provide data access across multiple technologies and platforms, and is based on a technology solution. One key contrast is that a data mesh is much more than just technology: it is a pattern that involves people and processes. Instead of taking ownership of an entire data platform, as in a data fabric, the data mesh allows data producers to focus on data production, allows data consumers to focus on consumption, and allows hybrid teams to consume other data products, blend other data to create even more interesting data products, and publish these data products - with some data governance considerations in place." (Hubert Dulay & Stephen Mooney, "Streaming Data Mesh", 2023)

"A Data Mesh views data primarily as organized around domain owners who create business-focused data products, which can be aggregated and consumed across distributed consumers, organizations, and Line of Business (LoBs) in a self-service and shopping-for-data fashion. Transforming data from disparate data sources to be consumed as data-as-a-product is an essential paradigm of any Data Mesh." (Eberhard Hechler et al, "Data Fabric and Data Mesh Approaches with AI", 2023)

"Data mesh architectures are inherently decentralized, and significant responsibility is delegated to the data product owners. A data mesh also benefits from a degree of centralization in the form of data product compatibility and common self-service tooling. Differing opinions, preferences, business requirements, legal constraints, technologies, and technical debt are just a few of the many factors that influence how we work together." (Adam Bellemare, "Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures", 2023)

"In a data mesh, data is decentralized, while in a data fabric, centralization of data is allowed. And with data centralization like data lakes, you get the monolithic problems that come with it. Data mesh tries to apply a microservices approach to data by decomposing data domains into smaller and more agile groups." (Hubert Dulay & Stephen Mooney, "Streaming Data Mesh", 2023)

"The data mesh is an exciting new methodology for managing data at large. The concept foresees an architecture in which data is highly distributed and a future in which scalability is achieved by federating responsibilities. It puts an emphasis on the human factor and addressing the challenges of managing the increasing complexity of data architectures." (Piethein Strengholt, "Data Management at Scale: Modern Data Architecture with Data Mesh and Data Fabric" 2nd Ed., 2023)

"The Data Fabric architecture can help enterprises address the challenges of data and AI governance effectively, including the orchestration and exchange of metadata across organizational implementations. First, Data Fabric pulls data from disparate data sources and orchestrates metadata exchange across organizational systems, thus providing a holistic view of data and AI at the enterprise level, which lays a solid technology foundation for a consistent and unified enterprise-level data and AI governance. Likewise, a Data Fabric architecture serves as a foundation for a Data Mesh solution, which is supporting organizational or departmental data and AI governance initiatives. Second, the advanced automation and AI technologies employed by a Data Fabric architecture can greatly simplify the implementation of data and AI governance at the enterprise or organizational level, enabling organizational federated Data Mesh initiatives, where orchestration and exchange of metadata across organizations need to be implemented as well." (Eberhard Hechler et al, "Data Fabric and Data Mesh Approaches with AI", 2023)

"The terms Data Fabric and Data Mesh are often viewed as different, conflicting, or at the best overlapping data architectures or frameworks, data management concepts, or approaches to discover, explore, govern, and consume data. However, these concepts are related to each other, where each concept emphasizes specific imperatives or objectives."(Eberhard Hechler et al, "Data Fabric and Data Mesh Approaches with AI", 2023)

"When building a data mesh, it is necessary to enable existing engineers in a domain to perform the tasks required. Domains have to capture data from their operational stores, transform (join or enrich, aggregate, balance) that data, and publish their data products to the data mesh. Self-service services are the “easy buttons” necessary to make data mesh easy to adopt with high usability. In summary, the self-services enable the domain engineers to take on many of the tasks the data engineer was responsible for across all lines of the business. A data mesh not only breaks up the monolithic data lake, but also breaks up the monolithic role of the data engineer into simple tasks the domain engineers can perform." (Hubert Dulay & Stephen Mooney, "Streaming Data Mesh", 2023)

"While a data mesh seeks to solve many of the same problems that a data fabric addresses - namely, the ability to address data in a single, composite data environment—the approach is different. While a data fabric enables users to create a single, virtual layer on top of distributed data, a data mesh further empowers distributed groups of data producers to manage and publish data as they see fit. Data fabrics allow for a low-to-no-code data virtualization experience by applying data integration within APIs that reside within the data fabric. The data mesh, however, allows for data engineers to write code for APIs with which to interface further. Without clearly defined boundaries, domains appear to be too interconnected, and ownership becomes either political or subject to interpretation. For instance, a large retailer most likely has multiple domains. [...]" (Hubert Dulay & Stephen Mooney, "Streaming Data Mesh", 2023)

"A data mesh is a decentralized data architecture with four specific characteristics. First, it requires independent teams within designated domains to own their analytical data. Second, in a data mesh, data is treated and served as a product to help the data consumer to discover, trust, and utilize it for whatever purpose they like. Third, it relies on automated infrastructure provisioning. And fourth, it uses governance to ensure that all the independent data products are secure and follow global rules."(James Serra, "Deciphering Data Architectures", 2024)

"A data mesh splits the boundaries of the exchange of data into multiple data products. This provides a unique opportunity to partially distribute the responsibility of data security. Each data product team can be made responsible for how their data should be accessed and what privacy policies should be applied." (Aniruddha Deswandikar,"Engineering Data Mesh in Azure Cloud", 2024)

"At its core, a data fabric is an architectural framework, designed to be employed within one or more domains inside a data mesh. The data mesh, however, is a holistic concept, encompassing technology, strategies, and methodologies." (James Serra, "Deciphering Data Architectures", 2024)

"Data Mesh emphasizes ensuring reliable, consistent, and interoperable data products. When data is treated as a product, quality is non-negotiable. High-quality data must meet the expectations and requirements of its users, both internally and externally. Additionally, data products must be designed with other products in mind, adhering to principles like loose coupling for easy interchangeability and high cohesion for strong functional relatedness. This feature enables the integration of different data products, ensuring seamless interoperability and greater usability. Data products should be reliable, complete, accurate, and accurate. They should also be integrated, compatible, and consistent rather than isolated, incompatible, or conflicting." (Pradeep Menon, "Data Mesh Principles, patterns, architecture, and strategies for data-driven decision making", 2024)

"It is very important to understand that data mesh is a concept, not a technology. It is all about an organizational and cultural shift within companies. The technology used to build a data mesh could follow the modern data warehouse, data fabric, or data lakehouse architecture - or domains could even follow different architectures." (James Serra, "Deciphering Data Architectures", 2024)

"To explain a data mesh in one sentence, a data mesh is a centrally managed network of decentralized data products. The data mesh breaks the central data lake into decentralized islands of data that are owned by the teams that generate the data. The data mesh architecture proposes that data be treated like a product, with each team producing its own data/output using its own choice of tools arranged in an architecture that works for them. This team completely owns the data/output they produce and exposes it for others to consume in a way they deem fit for their data." (Aniruddha Deswandikar,"Engineering Data Mesh in Azure Cloud", 2024)

"With all the hype, you would think building a data mesh is the answer to all of these 'problems' with data warehousing. The truth is that while data warehouse projects do fail, it is rarely because they can’t scale enough to handle big data or because the architecture or the technology isn’t capable. Failure is almost always because of problems with the people and/or the process, or that the organization chose the completely wrong technology." (James Serra, "Deciphering Data Architectures", 2024)

"Data mesh fundamentally reframes data governance and validation by distributing accountability to domain-oriented teams who act as custodians and producers of their respective data products. These teams possess intimate domain knowledge, which is essential for nuanced validation criteria that adapt to the semantics, context, and evolution of their datasets. By treating datasets as first-class products with clear ownership, interfaces, and service-level objectives, data mesh encourages autonomous validation workflows embedded directly within the domains where data originates and is consumed." (William Smith, "Great Expectations for Modern Data Quality: The Complete Guide for Developers and Engineers", 2025)

"Modern complex organizations increasingly confront the challenge of ensuring data quality at scale without centralizing validation activities into a single bottlenecked team. The data mesh paradigm and federated controls emerge as pivotal architectural styles and organizational patterns that enable decentralized, self-serve data quality validation while preserving coherence and reliability across diverse data products." (William Smith, "Great Expectations for Modern Data Quality: The Complete Guide for Developers and Engineers", 2025)

24 December 2015

🪙Business Intelligence: Data Marts (Just the Quotes)

"There are four levels of data in the architected environment - the operational level, the atomic (or the data warehouse) level, the departmental (or the data mart) level, and the individual level. These different levels of data are the basis of a larger architecture called the corporate information factory (CIF). The operational level of data holds application-oriented primitive data only and primarily serves the high-performance transaction-processing community. The data-warehouse level of data holds integrated, historical primitive data that cannot be updated. In addition, some derived data is found there. The departmental or data mart level of data contains derived data almost exclusively. The departmental or data mart level of data is shaped by end-user requirements into a form specifically suited to the needs of the department. And the individual level of data is where much heuristic analysis is done." (William H Inmon, "Building the Data Warehouse" 4th Ed., 2005)

"If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples." (James Dixon 2010)

"Many organizations need to create data warehouses - massive data stores of timeseries data for decision support. Data are imported from various external and internal resources and are cleansed and organized in a manner consistent with the organization’s needs. After the data are populated in the data warehouse, data marts can be loaded for a specific area or department. Alternatively, data marts can be created first, as needed, and then integrated into an EDW." (Ramesh Sharda et al, "Business Intelligence: A Managerial Perspective on Analytics" 3rd Ed., 2014)

"Whereas a data warehouse combines databases across an entire enterprise, a data mart is usually smaller and focuses on a particular subject or department. A data mart is a subset of a data warehouse, typically consisting of a single subject area (e.g., marketing, operations). A data mart can be either dependent or independent. A dependent data mart is a subset that is created directly from the data warehouse. It has the advantages of using a consistent data model and providing quality data. [...] An independent data mart is a small warehouse designed for a strategic business unit (SBU) or a department, but its source is not an EDW." (Ramesh Sharda et al, "Business Intelligence: A Managerial Perspective on Analytics" 3rd Ed., 2014)

"Data mart: A subset of a data warehouse that’s usually oriented to a business group or process rather than enterprise-wide views. They have value as part of the overall enterprise data architecture, but can cause problems when they sprout uncontrolled as data silos with their own data definitions, creating data shadow systems." (Rick Sherman, "Business Intelligence Guidebook: From Data Integration to Analytics, 2015)

"Data marts promised to be quicker and cheaper to build, and provided many more benefits - including the benefit of actually being able to finish building them! The data mart was primarily a backlash to the big, cumbersome CDW projects, with the key difference being that its scope was limited to a single business group rather than the entire enterprise. Of course, that shortcut did speed things up, but at the expense of obtaining agreement on consistent data definitions, thereby guaranteeing data silos." (Rick Sherman, "Business Intelligence Guidebook: From Data Integration to Analytics, 2015)

"There are, however, many problems with independent data marts. Independent data marts: (1) Do not have data that can be reconciled with other data marts (2) Require their own independent integration of raw data (3) Do not provide a foundation that can be built on whenever there are future analytical needs." (William H Inmon & Daniel Linstedt, "Data Architecture: A Primer for the Data Scientist: Big Data, Data Warehouse and Data Vault", 2015)

"The data warehouse approach is also referred to as data marts with the usual distinction that a data mart serves a single department in an organization, while a data warehouse serves the larger organization integrating across multiple departments. Regardless of their scope, from the architectural modeling perspective they both have similar characteristics." (Zhamak Dehghani,"Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Data marts are subject-oriented databases typically aligned with a particular business unit like sales, finance, or marketing. These are some-times called 'functional data marts' since they support specific business functions. Data marts accelerate business processes by allowing access to relevant information in a more timely nature since they are not aggregating the volume and variety (many data sources) that an EDW does. However, they are more transformed or normalized than an ODS." (Scott Burk et al, It’s All Analytics - Part II: Designing an Integrated AI, Analytics, and Data Science Architecture for Your Organization, 2022)

"Some sources try to distinguish the differences of data marts and EDWs by size. Size is a consequence and not a determinant. While EDWs are normally much larger, they are larger due to the fact they are pulling data from many sources and across business functions." (Scott Burk et al, It’s All Analytics - Part II: Designing an Integrated AI, Analytics, and Data Science Architecture for Your Organization, 2022)

"Traditional data stores used for analytics, such as data marts and data ware-houses, followed an ETL process, extract first by making a copy from the source, then transformations are made upon this copy, and then the data is loaded into the target system. ETL tools require processing engines for running transformations prior to loading data into a destination. Running these engines performing transformations before the load phase results in a more complex data replication process." (Scott Burk et al, It’s All Analytics - Part II: Designing an Integrated AI, Analytics, and Data Science Architecture for Your Organization, 2022)

"Data marts are used in most data warehouse approaches and focus on the data of a specific area, department, or domain of the business. A data warehouse focuses on building an SSOT for all data. A data warehouse can be made up of data marts (bottom-up approach), or data marts can be created from the data warehouse (top-down approach). In either case, data marts are smaller and less complicated than the data warehouse. As they only contain a subset of the data, data marts are often easier and quicker to establish. They are often managed by the departments the data mart focuses on." (Olivier Mertens & Breght Van Baelen, "Azure Data and AI Architect Handbook", 2023)

22 December 2015

🪙Business Intelligence: Lakehouses (Just the Quotes)

"A data lakehouse is an amalgamation of the best components from both data lakes and data warehouses. A data lakehouse implements data structure and data management features from data warehouses into a cost-effective storage like a data lake. It tries to combine the best from both worlds - data lake - based Big Data analytics and a data warehouse." (Bhadresh Shiyal, "Beginning Azure Synapse Analytics: Transition from Data Warehouse to Data Lakehouse", 2021)

"A defining characteristic of the data lakehouse architecture is allowing direct access to data as files while retaining the valuable properties of a data warehouse. Just do both!" (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"Once you combine the data lake along with analytical infrastructure, the entire infrastructure can be called a data lakehouse. [...] The data lake without the analytical infrastructure simply becomes a data swamp. And a data swamp does no one any good." (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"The data lakehouse architecture presents an opportunity comparable to the one seen during the early years of the data warehouse market. The unique ability of the lakehouse to manage data in an open environment, blend all varieties of data from all parts of the enterprise, and combine the data science focus of the data lake with the end user analytics of the data warehouse will unlock incredible value for organizations. [...] The lakehouse architecture equally makes it natural to manage and apply models where the data lives." (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"With the data lakehouse, it is possible to achieve a level of analytics and machine learning that is not feasible or possible any other way. But like all architectural structures, the data lakehouse requires an understanding of architecture and an ability to plan and create a blueprint." (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"A data lakehouse stores a lot of data. It stores data in the data lake layer and the serving layer in structured and unstructured formats. The data needs to be processed with different types of compute engines. It can be a batch-based compute or a stream-based compute. A tightly coupled compute and storage layer strips off the flexibility required in a data lakehouse. Decoupling compute and storage also has a cost implication - storage is cheap and persistent but compute is expensive and ephemeral. It gives you the flexibility to spin up compute services on-demand and scale them as required, and also gives better cost control and cost predictability." (Pradeep Menon, "Data Lakehouse in Action", 2022)

"Lakehouse is a new architecture and data storage paradigm that combines the characteristics of both data warehouses and data lakes to create a unified basis for all types of use cases to be built on top of it. There is no need to move data around. Data is curated and remains in an open format and serves as the single source of truth (SSOT) for all the consumption layers. A modern data platform has needs that span traditional data warehouses, data lakes, machine learning systems, and streaming systems and there is some overlap among these systems. A Lakehouse offers features that span all four systems [...]" (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)

"Simply put, 'lakehouse' refers to an open data architecture that combines the best of data lakes and data warehouses on a single platform. At this point, it would be fair to say that a lakehouse is closer to a data lake than a data warehouse. In fact, it is an extension of your data lake to support all use cases, from BI to AI. All data science and ML personas who were shunted into downstream applications because the tools of their trade were so vastly different and can now share the same stage and have access to the same data as other data personas. This eliminates the need to stitch fragile systems together and leads to better data quality and end-to-end latencies since there is no need to copy data across disparate architectures." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)

"The lakehouse provides a key advantage over the modern data warehouse by eliminating the need to have two places to store the same data." (Rukmani Gopalan, "The Cloud Data Lake: A Guide to Building Robust Cloud Data Architecture", 2022)

"Traditional data lakes provide the necessary scalability, but not the real-time concurrency and latency needed for BI use cases. Delta comes to the rescue once again by providing performance at scale with a host of optimization techniques, such as caching, data compaction, and indexing. Previously, a subset of the curated data would be pushed to a warehouse to satisfy the latency and concurrency requirements of known queries. What this meant was that if a consumer needed a different access pattern or a slightly older dataset that was not available, they would have to request that their IT or data team get involved. This took data democratization a step backward. Ideally, we should allow people to access any data that they have privileges to. Delta Lake goes a step forward and allows BI tools to access data directly from the lake instead of accessing a sliver of the data in their expensive warehouses." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)

"Like data lakes, the lakehouse architecture leverages low-cost cloud storage systems with the inherent flexibility and horizontal scalability of those systems. The goal of a lakehouse is to use existing high-performance data formats, such as Parquet, while also enabling ACID transactions (and other features). To add these capabilities, lakehouses use an open-table format, which adds features like ACID transactions, record-level operations, indexing, and key metadata to those existing data formats. This enables data assets stored on low-cost storage systems to have the same reliability that used to be exclusive to the domain of an RDBMS. Delta Lake is an example of an open-table format that supports these types of capabilities." (Bennie Haelen & Dan Davis, "Delta Lake: Up and Running - Modern Data Lakehouse Architectures with Delta Lake", 2023)

"Delta Lake is one solution for building data lakehouses, an open data architecture combining the best of data warehouses and data lakes." (Bennie Haelen & Dan Davis, "Delta Lake: Up and Running Modern Data Lakehouse Architectures with Delta Lake", 2024)

"It is very important to understand that data mesh is a concept, not a technology. It is all about an organizational and cultural shift within companies. The technology used to build a data mesh could follow the modern data warehouse, data fabric, or data lakehouse architecture - or domains could even follow different architectures." (James Serra, "Deciphering Data Architectures", 2024)

"The term data lakehouse is a portmanteau (blend) of data lake and data warehouse. [...] The concept of a lakehouse is to get rid of the relational data warehouse and use just one repository, a data lake, in your data architecture." (James Serra, "Deciphering Data Architectures", 2024)

🪙Business Intelligence: Data Lake (Just the Quotes)

"If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. [...] The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples." (James Dixon, "Pentaho, Hadoop, and Data Lakes", 2010) [sorce] [first known usage]

"A data lake represents an environment that collects and stores large volumes of structured and unstructured datasets, typically in their original, unaltered forms. More than a data depository, the data lake architecture enables the various users and data science teams to conduct data exploration and related analytical activities." (EMC Education Services, "Data Science & Big Data Analytics", 2015)

"A data lake strategy supports the introduction of a separate analytics environment that off-loads the analytics being done today on your overly expensive data warehouse. This separate analytics environment provides the data science team an on-demand, fail-fast environment for quickly ingesting and analyzing a wide variety of data sources in an attempt to address immediate business opportunities independent of the data warehouse's production schedule and service level agreement (SLA) rules." (Billl Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"At its core, it is a data storage and processing repository in which all of the data in an organization can be placed so that every internal and external systems', partners', and collaborators' data flows into it and insights spring out. [...] Data Lake is a huge repository that holds every kind of data in its raw format until it is needed by anyone in the organization to analyze." (Beulah S Purra & Pradeep Pasupuleti, "Data Lake Development with Big Data", 2015)

"Having multiple data lakes replicates the same problems that were created with multiple data warehouses - disparate data siloes and data fiefdoms that don't facilitate sharing of the corporate data assets across the organization. Organizations need to have a single data lake from which they can source the data for their BI/data warehousing and analytic needs. The data lake may never become the 'single version of the truth' for the organization, but then again, neither will the data warehouse. Instead, the data lake becomes the 'single or central repository for all the organization's data' from which all the organization's reporting and analytic needs are sourced." (Billl Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"[...] the real power of the data lake is to enable advanced analytics or data science on the detailed and complete history of data in an attempt to uncover new variables and metrics that are better predictors of business performance." (Billl Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"The data lake is not an incremental enhancement to the data warehouse, and it is NOT data warehouse 2.0. The data lake enables entirely new capabilities that allow your organization to address data and analytic challenges that the data warehouse could not address." (Billl Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"Unfortunately, some organizations are replicating the bad data warehouse practice by creating special-purpose data lakes - data lakes to address a specific business need. Resist that urge! Instead, source the data that is needed for that specific business need into an 'analytic sandbox' where the data scientists and the business users can collaborate to find those data variables and analytic models that are better predictors of the business performance. Within the 'analytic sandbox', the organization can bring together (ingest and integrate) the data that it wants to test, build the analytic models, test the model's goodness of fit, acquire new data, refine the analytic models, and retest the goodness of fit." (Billl Schmarzo, "Driving Business Strategies with Data Science: Big Data MBA" 1st Ed., 2015)

"A data lake is a storage repository that holds a very large amount of data, often from diverse sources, in native format until needed. In some respects, a data lake can be compared to a staging area of a data warehouse, but there are key differences. Just like a staging area, a data lake is a conglomeration point for raw data from diverse sources. However, a staging area only stores new data needed for addition to the data warehouse and is a transient data store. In contrast, a data lake typically stores all possible data that might be needed for an undefined amount of analysis and reporting, allowing analysts to explore new data relationships. In addition, a data lake is usually built on commodity hardware and software such as Hadoop, whereas traditional staging areas typically reside in structured databases that require specialized servers." (Mike Fleckenstein & Lorraine Fellows, "Modern Data Strategy", 2018)

"A data warehouse follows a pre-built static structure to model source data. Any changes at the structural and configuration level must go through a stringent business review process and impact analysis. Data lakes are very agile. Consumption or analytical layer can be modified to fit in the model requirements. Consumers of a data lake are not constant; therefore, schema and modeling lies at the liberty of analysts and scientists." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data in the data lake should never get disposed. Data driven strategy must define steps to version the data and handle deletes and updates from the source systems." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data governance policies must not enforce constraints on data - Data governance intends to control the level of democracy within the data lake. Its sole purpose of existence is to maintain the quality level through audits, compliance, and timely checks. Data flow, either by its size or quality, must not be constrained through governance norms. [...] Effective data governance elevates confidence in data lake quality and stability, which is a critical factor to data lake success story. Data compliance, data sharing, risk and privacy evaluation, access management, and data security are all factors that impact regulation." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data swamp, on the other hand, presents the devil side of a lake. A data lake in a state of anarchy is nothing but turns into a data swamp. It lacks stable data governance practices, lacks metadata management, and plays weak on ingestion framework. Uncontrolled and untracked access to source data may produce duplicate copies of data and impose pressure on storage systems." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Data warehousing, as we are aware, is the traditional approach of consolidating data from multiple source systems and combining into one store that would serve as the source for analytical and business intelligence reporting. The concept of data warehousing resolved the problems of data heterogeneity and low-level integration. In terms of objectives, a data lake is no different from a data warehouse. Both are primary advocates of terms like 'single source of truth' and 'central data repository'." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"At first, we threw all of this data into a pit called the 'data lake'. But we soon discovered that merely throwing data into a pit was a pointless exercise. To be useful - to be analyzed - data needed to (1) be related to each other and (2) have its analytical infrastructure carefully arranged and made available to the end user. Unless we meet these two conditions, the data lake turns into a swamp, and swamps start to smell after a while. [...] In a data swamp, data just sits there are no one uses it. In the data swamp, data just rots over time." (Bill Inmon et al, "Building the Data Lakehouse", 2021)

"Data lakes have been in existence for a while now, so their need is no longer questioned. What is more relevant is the specifics of the solution's implementation. Consolidating all the siloed data by itself does not constitute a data lake. However, it is a starting point. Layering in governance makes the data consumable and is a step toward a curated data lake. Big data systems provide scale out of the box but force us to make some accommodations for data quality. Age-old aspects of transactional integrity were compromised on a distributed system because it was very hard to maintain ACID compliance. Due to this, BASE properties were favored. All of this was moving the needle in the wrong direction and from pristine data lakes we were moving toward data swamps, where the data could not be trusted and hence insights that were generated on the data could not be trusted either." (Anindita Mahapatra, "Simplifying Data Engineering and Analytics with Delta", 2022)

"When it comes to data lakes, some things usually stay constant: the storage and processing patterns. Change could come in any of the following ways: Adding new components and processing or consumption patterns to respond to new requirements. […] Optimizing existing architecture for better cost or performance" (Rukmani Gopalan, "The Cloud Data Lake: A Guide to Building Robust Cloud Data Architecture", 2022)

"Unstructured and semi-structured data are often critical for AI and machine learning use cases, whereas structured and semi-structured data are critical for BI use cases. Because it natively supports all three types of data classifications, you can create a unified system that supports these diverse workloads in a data lake. These workloads can complement each other in a well-designed processing architecture, which you will learn about further on in this chapter. A data lake helps solve many of the challenges related to data volumes, types, and cost, and while Delta Lake runs on top of a data lake, it is optimized to run best on a cloud data lake." (Bennie Haelen & Dan Davis, "Delta Lake: Up and Running - Modern Data Lakehouse Architectures with Delta Lake", 2023)

"A Data Lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is without having first to structure the data and run different types of analytics - from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions." (Pradeep Menon, "Data Mesh Principles, patterns, architecture, and strategies for data-driven decision making", 2024)

"Delta Lake is a transactional storage software layer that runs on top of an existing data lake and adds RDW-like features that improve the lake’s reliability, security, and performance. Delta Lake itself is not storage. In most cases, it’s easy to turn a data lake into a Delta Lake; all you need to do is specify, when you are storing data to your data lake, that you want to save it in Delta Lake format (as opposed to other formats, like CSV or JSON)." (James Serra, "Deciphering Data Architectures", 2024)

"The allure of Data Lakes was their ability to store vast amounts of raw data. However, this advantage can become counterproductive without stringent governance and management protocols. In their zeal to harness the power of Big Data, some organizations indiscriminately dump data into their lakes. Without proper classification, curation, and quality checks, these lakes can become swamps - murky repositories filled with valuable data, redundant information, and outdated datasets. Navigating these data swamps becomes a significant challenge, leading to prolonged data retrieval times, increased chances of using obsolete or incorrect data, and a decline in the agility and efficiency of data-driven decision-making processes rather than facilitating quick and insightful analytics." (Pradeep Menon, "Data Mesh Principles, patterns, architecture, and strategies for data-driven decision making", 2024)

"A lake based on the medallion architecture combines the best of lakes and data warehouses. By breaking down silos and eliminating data duplication, it becomes a standard for building data platform architecture." (Christopher Maneu et al, "The Definitive Guide to Microsoft Fabric From discovery to building a unified, secure, and scalable data platform", 2025)

"A data lake is a distributed repository of raw and unprocessed data stored in its original format, with-out a predefined schema or structure. A data lake is designed to support a wide range of data types, sources, and use cases, such as exploration, discovery, and data experimentation. A data lake follows a 'schema on read' approach. Data is structured and processed only when it is accessed or consumed by a user or application (Extract, Load, Transform (ELT)). A data lake also enables data democratiza-tion, meaning data is accessible and available to anyone who needs it, without barriers or restrictions." (Christopher Maneu et al, "The Definitive Guide to Microsoft Fabric From discovery to building a unified, secure, and scalable data platform", 2025)

"A lakehouse is a data storage space that hosts and manages all types of data in one place (structured, semi-struc-tured, and unstructured), allowing different tools to normalize and examine this data according to organizational requirements and/or individual choices. A lakehouse thus combines the best aspects of a data lake and a data warehouse by eliminating data duplication and friction related to ingestion, transformation, and sharing of data within the organization, all in the open format, Delta Lake." (Christopher Maneu et al, "The Definitive Guide to Microsoft Fabric From discovery to building a unified, secure, and scalable data platform", 2025)

"Considered by many companies as the next generation of data architecture, the data mesh represents the natural evolution of traditional data lakes and data warehouses. While the latter are often limited by their centralized and monolithic structure, the data mesh aims to enable companies to deploy a more flexible, responsive, and massively scalable data strategy." (Christopher Maneu et al, "The Definitive Guide to Microsoft Fabric From discovery to building a unified, secure, and scalable data platform", 2025)

"Data Lakes embrace a schema-on-read approach, storing vast volumes of raw or lightly processed data in native formats with minimal upfront constraints. This design significantly enhances ingestion velocity and accommodates diverse, unstructured, or semi-structured datasets. However, enforcing data quality at scale becomes more complex, as traditional static constraints are absent." (William Smith, "Great Expectations for Modern Data Quality: The Complete Guide for Developers and Engineers", 2025)

"The problem with data lakes is that they have several drawbacks preventing them from being the perfect or ideal solution. The first drawback is an organizational problem: (•) How to organize data in the lake (•) How to classify, catalog, secure, document, and find it (•) How to avoid the lake turning into a swamp where data is mixed, duplicated, obsolete, or inaccessible (•) How to manage quality, governance, and traceability in the lake."(Christopher Maneu et al, "The Definitive Guide to Microsoft Fabric From discovery to building a unified, secure, and scalable data platform", 2025)

"This transition to OneDrive highlights the importance of governance adapted to new methods of collaborative work and data sharing. The idea of OneLake is, therefore, based on this same concept: rather than subscribing to a data lake technology that must be maintained, why not simply subscribe to a storage service that offers a layer of abstraction over the complexities of these data storage infrastructures? As a result, the data lake becomes a controlled or governed environment, but still accessible to users who can view it as a simple and intuitive way to securely share data with their colleagues and IT teams."(Christopher Maneu et al, "The Definitive Guide to Microsoft Fabric From discovery to building a unified, secure, and scalable data platform", 2025)

17 December 2015

🪙Business Intelligence: Decision-Making (Just the Quotes)

"Charts and graphs are a method of organizing information for a unique purpose. The purpose may be to inform, to persuade, to obtain a clear understanding of certain facts, or to focus information and attention on a particular problem. The information contained in charts and graphs must, obviously, be relevant to the purpose. For decision-making purposes, information must be focused clearly on the issue or issues requiring attention. The need is not simply for 'information', but for structured information, clearly presented and narrowed to fit a distinctive decision-making context. An advantage of having a 'formula' or 'model' appropriate to a given situation is that the formula indicates what kind of information is needed to obtain a solution or answer to a specific problem." (Cecil H Meyers, "Handbook of Basic Graphs: A modern approach", 1970)

"The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor." (Donald T Campbell, "Assessing the impact of planned social change", 1976)

"The greater the uncertainty, the greater the amount of decision making and information processing. It is hypothesized that organizations have limited capacities to process information and adopt different organizing modes to deal with task uncertainty. Therefore, variations in organizing modes are actually variations in the capacity of organizations to process information and make decisions about events which cannot be anticipated in advance." (John K Galbraith, "Organization Design", 1977)

"So we pour in data from the past to fuel the decision-making mechanisms created by our models, be they linear or nonlinear. But therein lies the logician's trap: past data from real life constitute a sequence of events rather than a set of independent observations, which is what the laws of probability demand. [...] It is in those outliers and imperfections that the wildness lurks." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"Delay time, the time between causes and their impacts, can highly influence systems. Yet the concept of delayed effect is often missed in our impatient society, and when it is recognized, it’s almost always underestimated. Such oversight and devaluation can lead to poor decision making as well as poor problem solving, for decisions often have consequences that don’t show up until years later. Fortunately, mind mapping, fishbone diagrams, and creativity/brainstorming tools can be quite useful here." (Stephen G Haines, "The Manager's Pocket Guide to Strategic and Business Planning", 1998)

"Blissful data consist of information that is accurate, meaningful, useful, and easily accessible to many people in an organization. These data are used by the organization’s employees to analyze information and support their decision-making processes to strategic action. It is easy to see that organizations that have reached their goal of maximum productivity with blissful data can triumph over their competition. Thus, blissful data provide a competitive advantage.". (Margaret Y Chu, "Blissful Data", 2004)

"Dashboards and visualization are cognitive tools that improve your 'span of control' over a lot of business data. These tools help people visually identify trends, patterns and anomalies, reason about what they see and help guide them toward effective decisions. As such, these tools need to leverage people's visual capabilities. With the prevalence of scorecards, dashboards and other visualization tools now widely available for business users to review their data, the issue of visual information design is more important than ever." (Richard Brath & Michael Peters, "Dashboard Design: Why Design is Important," DM Direct, 2004)

"Decision-makers process priors incorrectly in several ways. First, people tend to assess probability from the representativeness of an outcome rather than from its frequency. When supporting information is added to make an outcome more coherent and congruent with a representative mental image, people tend to judge the outcome more probable, even though the added qualifications and constraints by definition make it less probable. […] Second, humans often judge relative probability of outcomes by assessing similarity rather than frequency. […] Third, when given worthless evidence in a Bayesian framework, people tend to ignore prior probabilities and use the worthless evidence." (Leland Wilkinson, "The Grammar of Graphics" 2nd Ed., 2005)

"Human decision-making in the face of uncertainty is not only prone to error, it is also biased against Bayesian principles. We are not randomly suboptimal in our decisions. We are systematically suboptimal. (Leland Wilkinson, "The Grammar of Graphics" 2nd Ed., 2005)

"If you simply present data, it’s easy for your audience to say, Oh, that’s interesting, and move on to the next thing. But if you ask for action, your audience has to make a decision whether to comply or not. This elicits a more productive reaction from your audience, which can lead to a more productive conversation - one that might never have been started if you hadn’t recommended the action in the first place." (Cole N Knaflic, "Storytelling with Data: A Data Visualization Guide for Business Professionals", 2015)

"All human storytellers bring their subjectivity to their narratives. All have bias, and possibly error. Acknowledging and defusing that bias is a vital part of successfully using data stories. By debating a data story collaboratively and subjecting it to critical thinking, organizations can get much higher levels of engagement with data and analytics and impact their decision making much more than with reports and dashboards alone." (James Richardson, 2017)

"An actionable task means that it is possible to act on its result. That action might be to present a useful result to a decision maker or to proceed to a next step in a different result. An answer is actionable when it no longer needs further work to make sense of it." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"Business intelligence tools can only present the facts. Removing biases and other errors in decision making are dynamics of company culture that affect how well business intelligence is used." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"The problem is when biases and inaccurate data also get filtered into the gut. In this case, the gut-feel decision making should be supported with objective data, or errors in decision making may occur." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Most discussions of decision making assume that only senior executives make decisions or that only senior executives' decisions matter. This is a dangerous mistake. Decisions are made at every level of the organization, beginning with individual professional contributors and frontline supervisors. These apparently low-level decisions are extremely important in a knowledge-based organization." (Zach Gemignani et al, "Data Fluency", 2014)

"Apart from the secondary benefits of digital data, which are many, such as faster and cheaper information collection and distribution, the primary benefit is better decision making based on evidence. Despite our intellectual powers, when we allow our minds to become disconnected from reliable information about the world, we tend to screw up and make bad decisions." (Stephen Few, "Signal: Understanding What Matters in a World of Noise", 2015)

"Probabilities allow us to quantify future events and are an important aid to rational decision making. Without them, we can become seduced by anecdotes and stories." (Daniel J Levitin, "Weaponized Lies", 2017)

"The goal of data science is to improve decision making by basing decisions on insights extracted from large data sets. As a field of activity, data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from large data sets. It is closely related to the fields of data mining and machine learning, but it is broader in scope." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"People aren’t as good at making decisions as they think. We like to think of ourselves as rational actors, but our informational-processing limitations, emotions, and biases get in our way. The world is complex and humans have developed ways to help simplify it. So-called cognitive biases are ways our brains help us take shortcuts to deal with four primary problems: information overload, lack of meaning, the need to act fast, and knowing what needs to be remembered for later." (Shonna D Watters et al, "The Practical Guide for HR Analytics: Using data to inform, transform, and empower HR decisions", 2019)

"The second rule of communication is to know what you want to achieve. Hopefully the aim is to encourage open debate, and informed decision-making. But there seems no harm in repeating yet again that numbers do not speak for themselves; the context, language and graphic design all contribute to the way the communication is received. We have to acknowledge we are telling a story, and it is inevitable that people will make comparisons and judgements, no matter how much we only want to inform and not persuade. All we can do is try to pre-empt inappropriate gut reactions by design or warning." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Information relevance refers to the extent to which information is appropriate for the decision-making situation facing the manager. Extraneous or extra information distracts the decision-maker from the assigned task and information overload frustrates the decision-maker and impairs the decision-making process. Relevant information must pertain to the problems, decisions and responsibilities of the recipient." (C S V Murthy, "Data and Businesss Analytics", 2020)

"Information that is complete means information that covers key issues and is sufficient to support the decision-making situation at hand without critical omissions. The more complete a body of information, is obviously, the more expensive it is to develop and maintain. Care must also be taken not to provide extra information than needed, due to its expense, and not to provide so much information that the recipient will suffer from information overload (information indigestion)." (C S V Murthy, "Data and Businesss Analytics", 2020)

"The concept of programmed decisions is important because the ultimate (and unachievable) goal of information systems is to provide purely programmed decisions. Because this is not possible, we seek to provide the optimum type of information to the human decision-maker, who then makes non-programmable decisions. Decisions lend themselves to programming techniques if they are repetitive and routine, and if a procedurs can be worked out for handling them so that each is neither an ad hoc decision nor one to be treated as a new situation each time it arises." (C S V Murthy, "Data and Businesss Analytics", 2020)

"Our machines are helpers, not decision makers. Their insights are not the final word in the discussion, merely the work of our most nimble observers who can ramp up time spent on analysis by factors that our counterparts even a generation ago would have a hard time believing." (Kate Strachnyi, "ColorWise: A Data Storyteller’s Guide to the Intentional Use of Color", 2023)

"The goal of using data visualization to make better and faster decisions may lead people to think that any data visualization that is not immediately understood is a failure. Yes, a good visualization should allow you to see things that you might have missed, and to glean insights faster, but you still have to think." (Steve Wexler, "The Big Picture: How to use data visualization to make better decisions - faster", 2021)

"Data literacy is something that affects everyone and every organization. The more people who can debate, analyze, work with, and use data in their daily roles, the better data-informed decision-making will be." (Angelika Klidas & Kevin Hanegan, "Data Literacy in Practice", 2022)

"Data may indeed be the new oil. But just like crude oil, data needs refining. It must be transformed into information. This is why we clean, combine, model, and visualize data. The output of all this work - whether you do it on your own, get some help, or use a (semi-)automatic process - includes reports and dashboards that provide insights into various aspects of the organization’s dealings, which decision-makers can then consume to make critical business decisions." (Jeroen ter Heerdt et al, "Microsoft Power BI Visual Calculations: Simplifying DAX", 2026)

🪙Business Intelligence: Business Intelligence (Just the Quotes)

"A key sign of successful business intelligence is the degree to which it impacts business performance." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Successful business intelligence is influenced by both technical aspects and organizational aspects. In general, companies rate organizational aspects (such as executive level sponsorship) as having a higher impact on success than technical aspects. And yet, even if you do everything right from an organizational perspective, if you don’t have high quality, relevant data, your BI initiative will fail." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"The data architecture is the most important technical aspect of your business intelligence initiative. Fail to build an information architecture that is flexible, with consistent, timely, quality data, and your BI initiative will fail. Business users will not trust the information, no matter how powerful and pretty the BI tools. However, sometimes it takes displaying that messy data to get business users to understand the importance of data quality and to take ownership of a problem that extends beyond business intelligence, to the source systems and to the organizational structures that govern a company’s data." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"There is one crucial aspect of extending the reach of business intelligence that has nothing to do with technology and that is Relevance. Understanding what information someone needs to do a job or to complete a task is what makes business intelligence relevant to that person. Much of business intelligence thus far has been relevant to power users and senior managers but not to front/line workers, customers, and suppliers." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Data migration is not just about moving data from one place to another; it should be focused on: realizing all the benefits promised by the new system when you entertained the concept of new software in the first place; creating the improved enterprise performance that was the driver for the project; importing the best, the most appropriate and the cleanest data you can so that you enhance business intelligence; maintaining all your regulatory, legal and governance compliance criteria; staying securely in control of the project." (John Morris, "Practical Data Migration", 2009)

"Dashboards are collections of several linked visualizations all in one place. The idea is very popular as part of business intelligence: having current data on activity summarized and presented all in one place. One danger of cramming a lot of disparate information into one place is that you will quickly hit information overload. Interactivity and small multiples are definitely worth considering as ways of simplifying the information a reader has to digest in a dashboard. As with so many other visualizations, layering the detail for different readers is valuable." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"The way we explore data today, we often aren't constrained by rigid hypothesis testing or statistical rigor that can slow down the process to a crawl. But we need to be careful with this rapid pace of exploration, too. Modern business intelligence and analytics tools allow us to do so much with data so quickly that it can be easy to fall into a pitfall by creating a chart that misleads us in the early stages of the process." (Ben Jones, "Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations", 2020)

"Self-service BI initiatives help organizations become more data-driven and democratize access to data. But data can’t be used if it can’t be found. Search and discovery of trustworthy data is a core value of enterprise data catalogs, and the value extends well beyond business users." (Fadi Maali & Jason Lim, "Implementing a Modern Data Catalog to Power Data Intelligence: Make Trustworthy Data Central to Your Organization", 2022)

See also: the [definitions] and the [index] of similar posts.