Showing posts with label myths.

22 August 2023

Book Review: Laurent Bossavit's The Leprechauns of Software Engineering (2015)




Software Engineering should be the "establishment and use of sound engineering principles to obtain economically software that is reliable and works on real machines efficiently" [2]. Having worked for more than 20 years in the field, I sometimes feel that its foundation is a strange mix of sound and questionable ideas that take the form of methodologies, principles, standards, myths, folklore, statistics and other similar concepts forming its backbone.

I tend to look with a critical eye at the headline numbers advanced in research and pseudo-scientific papers, especially when they relate to my job, because I know that statistics are seldom what they appear to be - there are accidental and sometimes even intentional errors made to support the claims. Unfortunately, the missing raw data, and often the missing information about the methodologies used in collecting and processing that data, make the numbers and/or graphics harder to interpret, not to mention the considerable amount of effort and time needed to uncover the evidence trail.
Fortunately, there are other professionals who went further down the path of bibliographical references and shared their findings in blogs, papers, books and other media. This is also the case of Laurent Bossavit, who in his book "The Leprechauns of Software Engineering" (2015) looks behind some of the numbers that over time have become part of the leprechaunish folklore of IT professionals, puts them into their historical context and provides in an appendix the evidence trails for the reader to validate his findings. Over several chapters the author focuses mainly on the cost of defects, Boehm’s cone of uncertainty, the differences in productivity among individual programmers (aka the 10x claim), and the relation between poor requirements and defects.

His most important finding is that the references used in most of the researched sources advancing the above numbers were secondary, while the actual sources provide no direct information about the empirical data or the methodology used to collect it. The way the numbers are advanced and used makes one question the validity of the measurements performed, as well as the nature of the mistakes the authors made. Many of the cited papers hardly match the academic standards of other scientific fields, being a mix of false claims, improperly conducted research and questionable citations.

Secondly, he argues that the small sample sizes used as the basis for the experiments, the small populations formed usually of students, and the way the numbers were combined without any reliable scientific grounding make him (and the reader as well) question even more how the experiments described in the respective papers were performed. With this, it becomes likely that the larger body of research built on these sources should raise further concerns. The reader can thus ask himself/herself how deep the domino effect goes inside the Software Engineering field.

In the author’s opinion, Software Engineering as a social process "needs to be studied with tools that borrow as much from the social and cognitive sciences as they do from the mathematical theories of computation". How far the theories and models of those fields can be extended is an open topic. The bottom line is that the field of Software Engineering needs better, scientifically sound empirical experiments based on commonly agreed definitions, data collection and processing techniques, as well as higher standards for research publications. Without this, we’ll continue to compare apples with peaches and mix them in calculations so we can get stories that support our leprechaunish theories.

Overall, the book is a good read for software engineers as well as for other IT professionals. Even if it barely scratches the surface of software myths and folklore, there’s enough material for readers who want to dive deeper.


References:
[1] Laurent Bossavit (2015) "The Leprechauns of Software Engineering"
[2] Friedrich Bauer (1972) "Software Engineering", Information Processing

03 November 2018

Data Science: Myths (Just the Quotes)

"[myth:] Accuracy is more important than precision. For single best estimates, be it a mean value or a single data value, this question does not arise because in that case there is no difference between accuracy and precision. (Think of a single shot aimed at a target.) Generally, it is good practice to balance precision and accuracy. The actual requirements will differ from case to case." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Counting can be done without error. Usually, the counted number is an integer and therefore without (rounding) error. However, the best estimate of a scientifically relevant value obtained by counting will always have an error. These errors can be very small in cases of consecutive counting, in particular of regular events, e.g., when measuring frequencies." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Random errors can always be determined by repeating measurements under identical conditions. […] this statement is true only for time-related random errors." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Systematic errors can be determined inductively. It should be quite obvious that it is not possible to determine the scale error from the pattern of data values." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

[myth] " It has been said that process behavior charts work because of the central limit theorem."(Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "It has been said that the data must be normally distributed before they can be placed on a process behavior chart."(Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth]  "It has been said that the observations must be independent - data with autocorrelation are inappropriate for process behavior charts." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] " It has been said that the process must be operating in control before you can place the data on a process behavior chart."(Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

[myth] "The standard deviation statistic is more efficient than the range and therefore we should use the standard deviation statistic when computing limits for a process behavior chart."(Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"The search for better numbers, like the quest for new technologies to improve our lives, is certainly worthwhile. But the belief that a few simple numbers, a few basic averages, can capture the multifaceted nature of national and global economic systems is a myth. Rather than seeking new simple numbers to replace our old simple numbers, we need to tap into both the power of our information age and our ability to construct our own maps of the world to answer the questions we need answering." (Zachary Karabell, "The Leading Indicators: A short history of the numbers that rule our world", 2014)

"An oft-repeated rule of thumb in any sort of statistical model fitting is 'you can't fit a model with more parameters than data points'. This idea appears to be as wide-spread as it is incorrect. On the contrary, if you construct your models carefully, you can fit models with more parameters than datapoints [...]. A model with more parameters than datapoints is known as an under-determined system, and it's a common misperception that such a model cannot be solved in any circumstance. [...] this misconception, which I like to call the 'model complexity myth' [...] is not true in general, it is true in the specific case of simple linear models, which perhaps explains why the myth is so pervasive." (Jake Vanderplas, "The Model Complexity Myth", 2015) [source]

"Hollywood loves the myth of a lone scientist working late nights in a dark laboratory on a mysterious island, but the truth is far less melodramatic. Real science is almost always a team sport. Groups of people, collaborating with other groups of people, are the norm in science - and data science is no exception to the rule. When large groups of people work together for extended periods of time, a culture begins to emerge. " (Mike Barlow, "Learning to Love Data Science", 2015) 

"One of the biggest truths about the real–time analytics is that nothing is actually real–time; it's a myth. In reality, it's close to real–time. Depending upon the performance and ability of a solution and the reduction of operational latencies, the analytics could be close to real–time, but, while day-by-day we are bridging the gap between real–time and near–real–time, it's practically impossible to eliminate the gap due to computational, operational, and network latencies." (Shilpi Saxena & Saurabh Gupta, "Practical Real-time Data Processing and Analytics", 2017)

"The field of big-data analytics is still littered with a few myths and evidence-free lore. The reasons for these myths are simple: the emerging nature of technologies, the lack of common definitions, and the non-availability of validated best practices. Whatever the reasons, these myths must be debunked, as allowing them to persist usually has a negative impact on success factors and Return on Investment (RoI). On a positive note, debunking the myths allows us to set the right expectations, allocate appropriate resources, redefine business processes, and achieve individual/organizational buy-in." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017) 

"The first myth is that prediction is always based on time-series extrapolation into the future (also known as forecasting). This is not the case: predictive analytics can be applied to generate any type of unknown data, including past and present. In addition, prediction can be applied to non-temporal (time-based) use cases such as disease progression modeling, human relationship modeling, and sentiment analysis for medication adherence, etc. The second myth is that predictive analytics is a guarantor of what will happen in the future. This also is not the case: predictive analytics, due to the nature of the insights they create, are probabilistic and not deterministic. As a result, predictive analytics will not be able to ensure certainty of outcomes." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017) 

"One of the biggest myths is the belief that data science is an autonomous process that we can let loose on our data to find the answers to our problems. In reality, data science requires skilled human oversight throughout the different stages of the process. [...] The second big myth of data science is that every data science project needs big data and needs to use deep learning. In general, having more data helps, but having the right data is the more important requirement. [...] A third data science myth is that modern data science software is easy to use, and so data science is easy to do. [...] The last myth about data science [...] is the belief that data science pays for itself quickly. The truth of this belief depends on the context of the organization. Adopting data science can require significant investment in terms of developing data infrastructure and hiring staff with data science expertise. Furthermore, data science will not give positive results on every project." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"In the world of data and analytics, people get enamored by the nice, shiny object. We are pulled around by the wind of the latest technology, but in so doing we are pulled away from the sound and intelligent path that can lead us to data and analytical success. The data and analytical world is full of examples of overhyped technology or processes, thinking this thing will solve all of the data and analytical needs for an individual or organization. Such topics include big data or data science. These two were pushed into our minds and down our throats so incessantly over the past decade that they are somewhat of a myth, or people finally saw the light. In reality, both have a place and do matter, but they are not the only solution to your data and analytical needs. Unfortunately, though, organizations bit into them, thinking they would solve everything, and were left at the alter, if you will, when it came time for the marriage of data and analytical success with tools." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"[...] the focus on Big Data AI seems to be an excuse to put forth a number of vague and hand-waving theories, where the actual details and the ultimate success of neuroscience is handed over to quasi- mythological claims about the powers of large datasets and inductive computation. Where humans fail to illuminate a complicated domain with testable theory, machine learning and big data supposedly can step in and render traditional concerns about finding robust theories. This seems to be the logic of Data Brain efforts today. (Erik J Larson, "The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do", 2021)

"The myth of replacing domain experts comes from people putting too much faith in the power of ML to find patterns in the data. [...] ML looks for patterns that are generally pretty crude - the power comes from the sheer scale at which they can operate. If the important patterns in the data are not sufficiently crude then ML will not be able to ferret them out. The most powerful classes of models, like deep learning, can sometimes learn good-enough proxies for the real patterns, but that requires more training data than is usually available and yields complicated models that are hard to understand and impossible to debug. It’s much easier to just ask somebody who knows the domain!" (Field Cady, "Data Science: The Executive Summary: A Technical Book for Non-Technical Professionals", 2021)

28 February 2017

Data Warehousing: Data Load Optimization – A Success Story

Data Warehousing
Data Warehousing Series

Introduction

This topic has been waiting in the queue for almost two years already - since I finished optimizing an already existing relational data warehouse within a SQL Server 2012 Enterprise Edition environment. Through various simple techniques I managed then to reduce the running time for the load process by more than 65%, from 9 to 3 hours. It’s a considerable performance gain, considering that I didn’t have to refactor any business logic implemented in queries.

The ETL (Extract, Transform, Load) solution made use of SSIS (SQL Server Integration Services) packages to load data sequentially from several sources into staging tables, and from staging further into the base tables. Each package was responsible for deleting the data from the staging tables via TRUNCATE, extracting the data 1:1 from the source into the staging tables, then loading the data 1:1 from the staging tables into the base tables. It’s the simplest and a relatively effective ETL design, one I have also used, with small alterations, for other data warehouse solutions. For months the data load worked smoothly, until data growth and eventually other problems increased the loading time from 5 to 9 hours.

Using TABLOCK Hint

Using SSIS to bulk load data into SQL Server provides a good combination of performance and flexibility. Within a Data Flow, when the “Table Lock” property on the destination is checked, the inserted records can be minimally logged, speeding up the load by a factor of two. The TABLOCK hint can also be used for insert operations performed outside of SSIS packages; in this case the movement of data from staging into the base tables was done in plain T-SQL, outside of the SSIS packages, and further data processing benefited from the change as well. This optimization step alone provided a 30-40% performance gain.
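
A minimal T-SQL sketch of the pattern, with hypothetical table and column names (minimal logging also assumes the database uses the simple or bulk-logged recovery model):

-- refresh the staging table (hypothetical names)
TRUNCATE TABLE dbo.StagingSales;

-- the SSIS Data Flow then loads dbo.StagingSales 1:1 from the source ("Table Lock" checked)

-- move the data from staging into the base table under a table lock,
-- so that the insert can be minimally logged
INSERT INTO dbo.BaseSales WITH (TABLOCK)
       (SalesId, CustomerId, Amount, CreatedDate)
SELECT SalesId, CustomerId, Amount, CreatedDate
FROM dbo.StagingSales;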

Dropping and Recreating the Indexes on Big Tables

As the base tables had several indexes each, it proved beneficial to drop the indexes on the big tables (e.g. those with more than 1,000,000 records) before loading the data into them, and to recreate the indexes afterwards. This was done within SSIS and provided an additional 20-30% performance gain over the previous step.
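
A hedged sketch of the drop/recreate step, again with hypothetical object names:

-- drop the nonclustered index on the big table before the load
DROP INDEX IX_BaseSales_CustomerId ON dbo.BaseSales;

-- ... load the data into dbo.BaseSales (e.g. via the SSIS Data Flow) ...

-- recreate the index once the load has finished
CREATE NONCLUSTERED INDEX IX_BaseSales_CustomerId
    ON dbo.BaseSales (CustomerId)
    INCLUDE (Amount);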

Consolidating the Indexes

Adding missing indexes and removing or consolidating (overlapping) indexes are typical index maintenance tasks, yet they are apparently ignored now and then. They don’t always bring as much performance as the previous methods, though dropping and consolidating some indexes proved beneficial, as less data had to be maintained. The data processing logic benefited from the creation of new indexes as well.
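
One way to spot candidates for consolidation is to list the key columns of each nonclustered index per table and look for indexes whose keys are prefixes of other indexes’ keys. A sketch query against the system catalog, not a complete analysis:

-- list the key columns of each nonclustered index, to spot overlapping indexes
SELECT t.name AS table_name
     , i.name AS index_name
     , STUFF((SELECT ', ' + c.name
              FROM sys.index_columns ic
                   JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
              WHERE ic.object_id = i.object_id
                AND ic.index_id = i.index_id
                AND ic.is_included_column = 0
              ORDER BY ic.key_ordinal
              FOR XML PATH('')), 1, 2, '') AS key_columns
FROM sys.indexes i
     JOIN sys.tables t ON i.object_id = t.object_id
WHERE i.type_desc = 'NONCLUSTERED'
ORDER BY t.name, key_columns;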

Running Packages in Parallel

As the packages were run sequentially (one package at a time), the data load hardly took advantage of the processing power available on the server. Even if the queries could use parallelism, the benefit was minimal. Enabling the packages to run in parallel added a further performance gain, however it reduced the processing resources available for other tasks. When the data load is performed overnight this causes minimal overhead, though it should be avoided when the data are loaded during business hours.

Using Clustered Indexes

In my analysis I found that many tables, especially the ones storing prepared data, lacked a clustered index, even if further indexes were built on them. I remember that years back there was a (false) myth that fact and/or dimension tables don’t need clustered indexes in SQL Server. Of course clustered indexes have downsides (e.g. fragmentation, excessive key lookups), though their benefits far exceed the downsides. Besides the missing clustered indexes, there were cases in which tables would have benefited from a narrow clustered index instead of a wide multicolumn clustered index. Where appropriate, such cases were addressed as well.
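
As an illustration only (the object names are hypothetical), replacing a wide multicolumn clustered index with a narrow one could look like this:

-- drop the wide clustered index and replace it with a narrow one
DROP INDEX IX_FactSales_Wide ON dbo.FactSales;

CREATE CLUSTERED INDEX IX_FactSales_SalesId
    ON dbo.FactSales (SalesId);

-- the previously clustered columns can still be covered by a nonclustered index, if needed
CREATE NONCLUSTERED INDEX IX_FactSales_CustomerDate
    ON dbo.FactSales (CustomerId, CreatedDate);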

Removing the Staging Tables

Given that the source and target systems are in the same virtual environment, and the data are loaded 1:1 between the various layers, without further transformations or conversions, one could load the data directly into the base tables. After some tests I came to the conclusion that the load from the source tables into the staging tables, and the load from the staging tables into the base tables (with the TABLOCK hint), took almost the same amount of time. This means that the base tables would be unavailable for roughly the same amount of time if the data were loaded from the sources directly into them. Therefore one could in theory remove the staging tables from the architecture. Frankly, one should think twice before making such a change, as there can be further implications over time. Even if today the data are imported 1:1, in the future this could change.

Reducing the Data Volume

Reducing the data volume was identified as a possible further technique to reduce the amount of time needed for data loading. A data warehouse is built based on a set of requirements and presumptions that change over time. It can happen for example that even if the reports need only 1-2 years’ worth of data, the data load considers a much bigger timeframe. Some systems can have up to 5-10 years’ worth of data. Loading all data without a specific requirement leads to waste of resources and bigger load times. Limiting the transactional data to a given timeframe can make a considerable difference. Additionally, there are historical data that have the potential to be archived.
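
As a simple sketch with hypothetical names, limiting the extracted transactional data to the timeframe the reports actually need can be as easy as adding a date filter to the source query:

-- extract only the last two years of transactional data
SELECT SalesId, CustomerId, Amount, CreatedDate
FROM dbo.SourceSales
WHERE CreatedDate >= DATEADD(year, -2, GETDATE());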

There are also tables for which a weekly or monthly refresh would suffice. Some tables or even data sources can become obsolete, yet they continue to be loaded into the data warehouse. Such cases occur seldom, though they do occur. Some unused or redundant columns could also have been removed from the packages.

Further Thoughts

There are further techniques to optimize the data load within a data warehouse, like partitioning large tables, using columnstore indexes or optimizing the storage, however my target was to provide a sufficient performance gain with a minimum of effort and design changes. Therefore I stopped when I considered that the effort required was considerably higher than the performance gain it would bring.

Further Reading:
[1] TechNet (2009) The Data Loading Performance Guide, by Thomas Kejser, Peter Carlin & Stuart Ozer (link)
[2] MSDN (2010) Best Practices for Data Warehousing with SQL Server 2008 R2, by Mark Whitehorn, Keith Burns & Eric N Hanson (link)
[3] MSDN (2012) Whitepaper: Fast Track Data Warehouse Reference Guide for SQL Server 2012, by Eric Kraemer, Mike Bassett, Eric Lemoine & Dave Withers (link)
[4] MSDN (2008) Best Practices for Data Warehousing with SQL Server 2008, by Mark Whitehorn & Keith Burns (link)
[5] TechNet (2005) Strategies for Partitioning Relational Data Warehouses in Microsoft SQL Server, by Gandhi Swaminathan (link)
[6] SQL Server Customer Advisory Team (2013) Top 10 Best Practices for Building a Large Scale Relational Data Warehouse (link)

27 February 2016

Business Intelligence: Myths - Business Intelligence is Complex

Business Intelligence

Introduction

While looking over the “Business Intelligence Concepts and Platform Capabilities” Coursera MOOC resources for Module 2, I ran into two similar articles from Solutions Review, respectively Information Age. What caught my attention was the ease with which both columns treat the complexity of BI as a “myth”.

According to the two sources, the capabilities of today’s BI tools “enabled business users to easily identify and present trends in an impactful way” [1] and “do not require an expert at the helm” [2]. It thus became simpler for users to independently query data and create interactive reports and presentations [2]. In both columns one can read between the lines that the simplicity of using BI tools is equivalent to negating the complexity of BI, which from my point of view is false. What both columns actually have in mind are the self-service BI tools, currently in trend, which allow users to easily perform ad-hoc analyses with minimal involvement from IT. Self-service BI is only a subset of what BI means for an organization, and just one of the many BI capabilities an organization needs in theory, even if some organizations might use it extensively.

Beyond the Surface

A BI tool is not a BI solution per se, even if many generic BI solutions for different systems are available out of the box. This is one of the biggest confusions managers, users and, unfortunately, also BI professionals make. A BI tool offers the technological basis for creating a BI infrastructure, though it comes with no guarantees. It takes a well-defined IT and business strategy, one or more successful projects, and skillful developers and users in order to harness the BI investment.

On the other side, it’s also true that organizations can obtain results from less, though BI doesn’t equate with any ad-hoc analysis performed by users, even if they use BI tools for the purpose. BI is not only about tools, reporting and revealing trends in the data. BI often implies a holistic knowledge about the business and a certain data awareness, without which users will start aggregating and comparing apples with pears and wondering why they taste and look different.

If everything were so simple, then why do so many BI projects fail to deliver what’s expected? Why do so many managers complain that they don’t have the data they need, when they need it? Sure, maybe the problem lies in over-complicating the whole BI landscape and treating everything from a high level, though that’s more likely not it.

It’s a Teamwork Knowledge Game

BI is, or needs to be, oriented toward monitoring and problem solving. This requires a deep understanding of the processes and the business. There are business users and also BI professionals who don’t have the knowledge needed to approach a business problem. One can see that from the premises they start from, the questions they raise, the data they consider, the models they build, and the results.

From a BI professional’s perspective, even if one has broad knowledge about various businesses, one often lacks insight into a given business. BI professionals can seldom provide adequate BI solutions without input and feedback from the business. Some BI professionals rely too much on their own knowledge, just as the business sometimes expects a maximum of output from BI professionals while providing a minimum of input.

As for the business users, quite often their focus and knowledge cover only the data boundaries of their department, while many problems extend beyond those boundaries. They know facts that are not necessarily reflected in the data. Even if they are closer to the data than other parties, they still lack some of the data awareness (including statistical awareness) needed to approach problems.

Somebody once said ironically, when talking about users’ data and problem-solving skills, that “not everybody is a Bill Gates or Steve Jobs”. Continuing the idea, one can’t expect users to act as such. For sure there are many business users who are better problem solvers than BI consultants, though on the other side one can’t expect the average business user to have the same skillset as an experienced BI consultant. This is in fact one of the problems of self-service BI. Probably with time and effort organizations will develop such resources, though some help from BI professionals will still be needed. Without good cooperation between the business and BI professionals, an organization might not obtain the hoped-for results when investing in BI.

More on Complexity

The complexity arises when one tries to do more with the data, especially data found in raw form. Usually the complexity of raw data can be addressed by building a logical or physical model that allows easier consumption of the data. Here is the point where users find themselves overwhelmed, because this requires a good knowledge of the physical data model and its semantics, the technical knowledge to build models, and the skills to reengineer the logic available in the source systems. These are the areas BI professionals are supposed to excel in. Speaking of models, they are the most difficult part to build because they reflect various segments of the business - a breakdown of its complexity. It’s also the point where many BI projects fail, as the models built don’t reflect the reality or aren’t capable of answering business questions.

Coming back to the two columns, I have to point out that the complexity of a subject or domain can’t be judged based on how easy it is to approach basic tasks. The complexity typically lies beyond the basics, when one dives into the details. In the case of BI, the complexity starts when one attempts to mix various technologies and knowledge domains to model and solve daily business problems in an integrated, holistic, aligned, consistent and cost-effective manner. The more technologies, knowledge domains and constraints one has to consider, the more complex the BI landscape and its solutions become.

On the other side, this doesn’t mean that the BI infrastructure can’t be simplified, or that BI can’t rely heavily or even exclusively on self-service BI solutions. However, each strategy has its advantages and disadvantages, and one more likely has to consider both sides of the coin in the process. Self-service BI has its own trade-offs - weaknesses that can be transformed into strengths over time.

Conclusion

When one considers the capabilities of today’s BI tools, ad-hoc analyses are relatively easy to perform and can lead to results, though such analyses don’t equate with BI, and the simplicity with which they are performed doesn’t necessarily imply that BI is simple as a whole. When one considers the complexity of today’s businesses, the more one dives into the various problems a business has, the more complex the BI landscape appears. In the end it’s within each organization’s power to simplify and harmonize its BI infrastructure to a degree where its business goals aren’t negatively affected.



Resources
[1] Information Age (2015) 5 Myths about Intelligence, by Ben Rossi, [Online] Available from: http://www.information-age.com/technology/information-management/123460271/5-myths-about-business-intelligence 
[2] SolutionsReview (2015) Top 5 Business Intelligence Myths Revealed, by Timothy King, [Online] Available from: http://solutionsreview.com/business-intelligence/top-5-business-intelligence-myths-revealed
[3] Gartner (2016) Magic Quadrant for Business Intelligence and Analytics Platforms, by Josh Parenteau, Rita L. Sallam, Cindi Howson, Joao Tapadinhas, Kurt Schlegel, Thomas W. Oestreich [Online] Available from: https://www.gartner.com/doc/reprints?id=1-2XXET8P&ct=160204&st=sb 
[4] Coursera (2016) Business Intelligence Concepts, Tools, and Applications MOOC, led by Jahangir Karimi, University of Colorado, [Online] Available from: https://www.coursera.org/learn/business-intelligence-tools

13 December 2015

Data Analytics: Myths (Just the Quotes)

"[myth:] Accuracy is more important than precision. For single best estimates, be it a mean value or a single data value, this question does not arise because in that case there is no difference between accuracy and precision. (Think of a single shot aimed at a target.) Generally, it is good practice to balance precision and accuracy. The actual requirements will differ from case to case." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Counting can be done without error. Usually, the counted number is an integer and therefore without (rounding) error. However, the best estimate of a scientifically relevant value obtained by counting will always have an error. These errors can be very small in cases of consecutive counting, in particular of regular events, e.g., when measuring frequencies." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"The simplicity of the process behavior chart can be deceptive. This is because the simplicity of the charts is based on a completely different concept of data analysis than that which is used for the analysis of experimental data.  When someone does not understand the conceptual basis for process behavior charts they are likely to view the simplicity of the charts as something that needs to be fixed.  Out of these urges to fix the charts all kinds of myths have sprung up resulting in various levels of complexity and obstacles to the use of one of the most powerful analysis techniques ever invented." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"The search for better numbers, like the quest for new technologies to improve our lives, is certainly worthwhile. But the belief that a few simple numbers, a few basic averages, can capture the multifaceted nature of national and global economic systems is a myth. Rather than seeking new simple numbers to replace our old simple numbers, we need to tap into both the power of our information age and our ability to construct our own maps of the world to answer the questions we need answering." (Zachary Karabell, "The Leading Indicators: A short history of the numbers that rule our world", 2014)

"The field of big-data analytics is still littered with a few myths and evidence-free lore. The reasons for these myths are simple: the emerging nature of technologies, the lack of common definitions, and the non-availability of validated best practices. Whatever the reasons, these myths must be debunked, as allowing them to persist usually has a negative impact on success factors and Return on Investment (RoI). On a positive note, debunking the myths allows us to set the right expectations, allocate appropriate resources, redefine business processes, and achieve individual/organizational buy-in." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017) 

"The first myth is that prediction is always based on time-series extrapolation into the future (also known as forecasting). This is not the case: predictive analytics can be applied to generate any type of unknown data, including past and present. In addition, prediction can be applied to non-temporal (time-based) use cases such as disease progression modeling, human relationship modeling, and sentiment analysis for medication adherence, etc. The second myth is that predictive analytics is a guarantor of what will happen in the future. This also is not the case: predictive analytics, due to the nature of the insights they create, are probabilistic and not deterministic. As a result, predictive analytics will not be able to ensure certainty of outcomes." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"Another myth is that we shall have a single source of truth for each concept or entity. […] This is a wonderful idea, and is placed to prevent multiple copies of out-of-date and untrustworthy data. But in reality it’s proved costly, an impediment to scale and speed, or simply unachievable. Data Mesh does not enforce the idea of one source of truth. However, it places multiple practices in place that reduces the likelihood of multiple copies of out-of-date data." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Data mesh [...] reduces points of centralization that act as coordination bottlenecks. It finds a new way of decomposing the data architecture without slowing the organization down with synchronizations. It removes the gap between where the data originates and where it gets used and removes the accidental complexities - aka pipelines - that happen in between the two planes of data. Data mesh departs from data myths such as a single source of truth, or one tightly controlled canonical data model." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"I think sometimes organizations are looking at tools or the mythical and elusive data driven culture to be the strategy. Let me emphasize now: culture and tools are not strategies; they are enabling pieces." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"In the world of data and analytics, people get enamored by the nice, shiny object. We are pulled around by the wind of the latest technology, but in so doing we are pulled away from the sound and intelligent path that can lead us to data and analytical success. The data and analytical world is full of examples of overhyped technology or processes, thinking this thing will solve all of the data and analytical needs for an individual or organization. Such topics include big data or data science. These two were pushed into our minds and down our throats so incessantly over the past decade that they are somewhat of a myth, or people finally saw the light. In reality, both have a place and do matter, but they are not the only solution to your data and analytical needs. Unfortunately, though, organizations bit into them, thinking they would solve everything, and were left at the alter, if you will, when it came time for the marriage of data and analytical success with tools." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Unlike other analytical data management paradigms, data mesh does not embrace the concept of the mythical single source of truth. Every data product provides a truthful portion of the reality - for a particular domain - to the best of its ability, a single slice of truth." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

21 October 2006

Laurent Bossavit - Collected Quotes

"A lot of research in software engineering strikes me as hopelessly naive in one of two ways. Most of it fails entirely to account for the social and belief aspects altogether. It looks at its object of inquiry as if it was entirely material and inert; as if 'software' was some kind of naturally occurring substance, the properties of which can be revealed in the equivalent of a test tube." (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"As practitioners, it is both in our interest and within our responsibility to pay attention to research. This includes not just the findings of such research, but also its processes and its institutions. Read research papers; find out what’s happening in that world and why it’s not more relevant to your work; weigh in; make your voice heard." (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"As software professionals, we should be interested in knowing at least the basics of our own history, for just the same reasons that as citizens we are expected to know about our national history and about world history: so that we will be able to make informed decisions and know who to trust, who to listen to; so that we are not deceived by lies. Untrue histories generally have an agenda - 'someone trying to sell you something', as the saying goes." (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"But 'average cost to fix one defect' is a stupid metric […] It makes bad projects look good, and good projects look bad. How? By failing to divide the costs of fixing into two categories: fixed costs of detecting and fixing defects - costs which are the same no matter how buggy or how good the product is - and variable costs, those which you pay for each defect." (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"But if you want to count defects, you first have to decide what (literally) counts as one. We have a hard time even agreeing among a single team on what counts as a defect - in fact I’m pretty sure my own thinking has changed over the years, so I’m not even agreeing with myself." (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"But we have now reached the most pressing problem in software engineering: low standards for research publications. Most of what passes for 'research' in the discipline is ridiculously careless with respect to examining the 'terms of inquiry'." (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"Creating mockups to communicate is not intrinsically a bad idea. But, as we are subject to confirmation bias, there’s always a risk that we will stop at our first design attempt and become reluctant to ask if there are better ways to achieve the same goals. Making these first ideas very detailed; putting them into a document; and especially blessing that document with the label 'requirements' are all moves which make further revision less likely, and put us more at risk from confirmation bias." (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"Debugging is known as an open-ended sort of activity, and even seasoned programmers expect variable completion times when  faced with this type of task."  (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"I do not dispute that there are large variations in reported measurements of programmer productivity; however, close examination of the evidence suggests that this observed variability originates in vague definitions of the term, in unreliable instruments of measurement, or in uncontrolled environmental factors, much more than it does in the intrinsic capabilities of programmers at comparable levels of training and experience." (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"If a picture is worth a thousand words, a false picture is a thousand times more serious than one careless word." (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"Often, the hard work in research isn’t in performing the experiment and collecting data (that tends to be grunt work, in fact, in most disciplines). The hard work consists of designing the experiment that will rule out the most alternative explanations for what you see happening (or think you see happening)." (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"Science journalism is a fine and important thing, but it has a well-known failure mode: sensationalism, where the lure of an attention-grabbing headline causes writers to toss caution to the wind and radically misrepresent a claim." (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"Software development […] needs to be studied with tools that borrow as much from the social and cognitive sciences as they do from the mathematical theories of computation." (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"Software engineering is a social process, not a naturally occurring one- it therefore has the property that what we believe about software engineering has causal impacts on what is real about software engineering." (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"The problem is that we cannot infer variations in individual productivity from data collected at the team level: we do not have an adequate theory of how a team’s productivity results from the aggregation of individual abilities, and in particular we cannot assume that a team’s output is a linear sum of individual 'productivities'." (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"The problem, as we know, is that projects are very different from each other: there are big projects and large projects, there are projects with lots of defects and projects with… even greater numbers of defects. How do we make these comparable with each other?" (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"The wonderful thing about over-precise statistics is that they are the easiest to expose as being Leprechauns, as you can trace their spread from one dubious source to another." (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"We cannot leave this enterprise to academia: that system’s inertia is too great. I believe that many of us will have to become scientists, and study software development even as we practice it." (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)

"When looking at a graph or chart, purported to be backed by empirical data, remember to ask: what specific measurement does each of the points I’m looking at represent? If the picture is a curve, and it has many more data points than were actually measured, ask: was the method for interpolating the missing points actually valid?" (Laurent Bossavit, "The Leprechauns of Software Engineering", 2015)


About Me

IT professional with more than 24 years’ experience in IT, covering the full life cycle of Web/Desktop/Database application development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.