Showing posts with label data collection. Show all posts

18 April 2023

📊Graphical Representation: Graphics We Live By I (The Analytics Marathon)

Graphical Representation Series

In a diagram adapted from an older article [1], Brent Dykes, the author of "Effective Data Storytelling" [2], draws a parallel between Data Analytics and marathon running: an organization must pass through the depicted milestones, and the percentages represent how many organizations reach each milestone.

It's a nice visualization, and the metaphor makes sense: running a marathon requires a long-term strategy to close the gaps between the current and the targeted physical and mental form and skillset, both for a series of marathons and for each course individually. Similarly, implementing a Data Analytics initiative requires a Data Strategy that addresses the gaps between the current and the targeted state, as well as the many projects run to reach the organization's goals.

It makes sense, doesn't it? On the other hand, the devil lies in the details, and frankly the diagram raises several questions when it is compared with the practices and processes existing in organizations. This doesn't mean that the diagram is wrong, just that it doesn't seem to reflect reality entirely.

The percentages represent the author's perception of how many organizations reach the respective milestones, probably in a repeatable manner (as there are several projects). Thus, only 10% have a data strategy, 100% collect data and 80% of them prepare the data, while at the opposite end only 15% communicate insight and only 5% act on information.

Considering only the milestones, the diagram looks like a funnel and like a capability maturity model (CMM). Typically, CMMs are more complex than this and evolve with the technologies' capabilities. Each of the mentioned milestones has a set of capabilities that increase in complexity and that usually help differentiate organizations' maturity. Therefore, the model seems too simple for an actual categorization.
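To make the funnel reading concrete, here is a minimal sketch in Python that uses only the percentages quoted above. It assumes the milestones are nested, i.e. that every organization reaching a later milestone also reached the earlier ones, which the diagram does not state explicitly. The 10% for data strategy is left out on purpose: it comes before the 100% for data collection, so it cannot be part of a strictly narrowing funnel. The milestones for which no percentage is quoted here are omitted as well.

```python
# Percentages quoted in the post; milestones without a quoted value are omitted.
# Assumption: the stages are nested (later stages are subsets of earlier ones).
milestones = [
    ("collect data", 100),
    ("prepare data", 80),
    ("communicate insight", 15),
    ("act on information", 5),
]


def step_retention(stages):
    """Return (from_stage, to_stage, share) for each pair of consecutive stages,
    where share is the percentage of the later stage divided by the earlier one."""
    return [
        (name_a, name_b, pct_b / pct_a)
        for (name_a, pct_a), (name_b, pct_b) in zip(stages, stages[1:])
    ]


retention = step_retention(milestones)
for name_a, name_b, share in retention:
    print(f"{name_a} -> {name_b}: {share:.0%} retained")
```

Under the nesting assumption, about a third of the organizations that communicate insight would also act on it (5 out of 15), and the biggest drop lies somewhere between preparing the data and communicating insight.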

Typically, data collection has a specific scope limited to surveys, interviews and/or research. However, the definition can be extended to the storage of data within organizations. Thus, data collection as the gathering of raw data is mainly done as part of an organization's value-supporting processes, and given the degree of digitization, one can suppose that most organizations gather data for different purposes, even if maybe only a small part of it is digitized.

Even if many organizations build data warehouses, marts, lakehouses, meshes or whatever architecture might be en vogue these days, an important percentage of the reporting needs is covered by standard reports or reporting tools that access the source systems directly, without data preparation or even data visualization. The first important question is what is understood by data analytics. Is it only the use of machine learning and statistical analysis? Is it limited to finding patterns and insights, or does it also include what is typically considered under the Business Intelligence umbrella?

Pragmatically thinking, Data Analytics should consider BI capabilities as well, as it's an extension of the current infrastructure with analytic capabilities. On the other hand, Data Warehousing and BI are considered together by DAMA as part of its Data Management methodology. Moreover, organizations may have a Data Strategy and a BI strategy, or a separate Data Analytics strategy, as these might have different goals, challenges and bodies to support them. To make it even more complicated, an organization might consider all these important topics as part of Data or even Information Governance, or consider BI or Analytics without Data Management.

So, a Data Strategy might or might not address Data Analytics at all. It's a matter of management philosophy, organizational structure, politics and other factors. Probably, having any strategy related to data should count. Even if a written and communicated data-related strategy is recommended for all medium to big organizations, only a small percentage of them have one, while small organizations might ignore the topic completely.

At least in the past, data analysis and its various subcomponents were performed before preparing and visualizing the data, or at least in parallel with data visualization. Frankly, it's a strange succession of steps. Or does it refer to exploratory data analysis (EDA) from a statistical perspective, which requires statistical experience to model and interpret the facts? Moreover, data exploration and discovery usually happen in the early stages.

The most puzzling step is the last one - what did the author intend with it? Ideally, data should be actionable, at least that's what one says about KPIs, OKRs and other metrics. Does it make sense to extend Data Analytics into the decision-making process? Where do a data professional's responsibilities end, and what are the boundaries? Or does it refer to the actions that need to be performed by data professionals?

The natural step after communicating insight is for the management to take action and provide feedback. Furthermore, the decisions taken have an impact on the artifacts built, and a reevaluation of the business problem, assumptions and further components is needed. The many steps of analytics projects are iterative, some iterations affecting the Data Strategy as well. The diagram shows the process as linear, which is not the case.

For sure there's an interface between Data Analytics and Decision-Making and the processes associated with them; however, there should be clear boundaries. E.g., it's a data professional's responsibility to make sure that the data/information is actionable and eventually to advise upon it, though whether the entitled people act on it is a management topic. Not acting upon information is also a decision. Overstepping boundaries can put data professionals into a strange situation in which they become responsible and eventually accountable for an action not taken, which is unrealistic.

The final question - is the last mile representative of the analytical process? The challenge is not the analysis and communication of data but making sure that the feedback processes work and the changes are addressed correspondingly, that value is created continuously from the data analytics infrastructure, and that data-related risks and opportunities are addressed as soon as they are recognized.

Like any model, a diagram doesn't need to be correct to be useful, and it might not even be wrong in the right context and argumentation. A data analytics CMM might allow better estimates and comparisons between organizations, though it can easily become more complex to use. A better solution for modeling the data analytics process probably lies between the two models.

[1] Brent Dykes (2022) "Data Analytics Marathon: Why Your Organization Must Focus On The Finish", Forbes
[2] Brent Dykes (2019) "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals"

16 December 2018

🔭Data Science: Data Collection (Just the Quotes)

"There are two aspects of statistics that are continually mixed, the method and the science. Statistics are used as a method, whenever we measure something, for example, the size of a district, the number of inhabitants of a country, the quantity or price of certain commodities, etc. […] There is, moreover, a science of statistics. It consists of knowing how to gather numbers, combine them and calculate them, in the best way to lead to certain results. But this is, strictly speaking, a branch of mathematics." (Alphonse P de Candolle, "Considerations on Crime Statistics", 1833)

"Just as data gathered by an incompetent observer are worthless - or by a biased observer, unless the bias can be measured and eliminated from the result - so also conclusions obtained from even the best data by one unacquainted with the principles of statistics must be of doubtful value." (William F White, "A Scrap-Book of Elementary Mathematics: Notes, Recreations, Essays", 1908)

"[...] scientists are not a select few intelligent enough to think in terms of 'broad sweeping theoretical laws and principles'. Instead, scientists are people specifically trained to build models that incorporate theoretical assumptions and empirical evidence. Working with models is essential to the performance of their daily work; it allows them to construct arguments and to collect data." (Peter Imhof, Science Vol. 287, 1935–1936)

"Statistics is a scientific discipline concerned with collection, analysis, and interpretation of data obtained from observation or experiment. The subject has a coherent structure based on the theory of Probability and includes many different procedures which contribute to research and development throughout the whole of Science and Technology." (Egon Pearson, 1936)

"Scientific data are not taken for museum purposes; they are taken as a basis for doing something. If nothing is to be done with the data, then there is no use in collecting any. The ultimate purpose of taking data is to provide a basis for action or a recommendation for action. The step intermediate between the collection of data and the action is prediction." (William E Deming, "On a Classification of the Problems of Statistical Inference", Journal of the American Statistical Association Vol. 37 (218), 1942)

"Data should be collected with a clear purpose in mind. Not only a clear purpose, but a clear idea as to the precise way in which they will be analysed so as to yield the desired information." (Michael J Moroney, "Facts from Figures", 1951)

"The technical analysis of any large collection of data is a task for a highly trained and expensive man who knows the mathematical theory of statistics inside and out. Otherwise the outcome is likely to be a collection of drawings - quartered pies, cute little battleships, and tapering rows of sturdy soldiers in diversified uniforms - interesting enough in the colored Sunday supplement, but hardly the sort of thing from which to draw reliable inferences." (Eric T Bell, "Mathematics: Queen and Servant of Science", 1951)

"Null hypotheses of no difference are usually known to be false before the data are collected [...] when they are, their rejection or acceptance simply reflects the size of the sample and the power of the test, and is not a contribution to science." (I Richard Savage, "Nonparametric statistics", Journal of the American Statistical Association 52, 1957)

"Philosophers of science have repeatedly demonstrated that more than one theoretical construction can always be placed upon a given collection of data." (Thomas Kuhn, "The Structure of Scientific Revolutions", 1962) 

"It has been said that data collection is like garbage collection: before you collect it you should have in mind what you are going to do with it." (Russell Fox et al, "The Science of Science", 1964)

"Typically, data analysis is messy, and little details clutter it. Not only confounding factors, but also deviant cases, minor problems in measurement, and ambiguous results lead to frustration and discouragement, so that more data are collected than analyzed. Neglecting or hiding the messy details of the data reduces the researcher's chances of discovering something new." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"If we gather more and more data and establish more and more associations, however, we will not finally find that we know something. We will simply end up having more and more data and larger sets of correlations." (Kenneth N Waltz, "Theory of International Politics Source: Theory of International Politics", 1979)

"The systematic collection of data about people has affected not only the ways in which we conceive of a society, but also the ways in which we describe our neighbour. It has profoundly transformed what we choose to do, who we try to be, and what we think of ourselves." (Ian Hacking, "The Taming of Chance", 1990)

"When looking at the end result of any statistical analysis, one must be very cautious not to over interpret the data. Care must be taken to know the size of the sample, and to be certain the method for gathering information is consistent with other samples gathered. […] No one should ever base conclusions without knowing the size of the sample and how random a sample it was. But all too often such data is not mentioned when the statistics are given - perhaps it is overlooked or even intentionally omitted." (Theoni Pappas, "More Joy of Mathematics: Exploring mathematical insights & concepts", 1991)

"We have found that some of the hardest errors to detect by traditional methods are unsuspected gaps in the data collection (we usually discovered them serendipitously in the course of graphical checking)." (Peter Huber, "Huge data sets", Compstat '94: Proceedings, 1994)

"We do not realize how deeply our starting assumptions affect the way we go about looking for and interpreting the data we collect." (Roger A Lewin, "Kanzi: The Ape at the Brink of the Human Mind", 1994)

"The science of statistics may be described as exploring, analyzing and summarizing data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions." (Fergus Daly et al, "Elements of Statistics", 1995)

"Unfortunately, just collecting the data in one place and making it easily available isn’t enough. When operational data from transactions is loaded into the data warehouse, it often contains missing or inaccurate data. How good or bad the data is a function of the amount of input checking done in the application that generates the transaction. Unfortunately, many deployed applications are less than stellar when it comes to validating the inputs. To overcome this problem, the operational data must go through a 'cleansing' process, which takes care of missing or out-of-range values. If this cleansing step is not done before the data is loaded into the data warehouse, it will have to be performed repeatedly whenever that data is used in a data mining operation." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Consideration needs to be given to the most appropriate data to be collected. Often the temptation is to collect too much data and not give appropriate attention to the most important. Filing cabinets and computer files world-wide are filled with data that have been collected because they may be of interest to someone in future. Most is never of interest to anyone and if it is, its existence is unknown to those seeking the information, who will set out to collect the data again, probably in a trial better designed for the purpose. In general, it is best to collect only the data required to answer the questions posed, when setting up the trial, and plan another trial for other data in the future, if necessary." (P Portmann & H Ketata, "Statistical Methods for Plant Variety Evaluation", 1997)

"Data are collected as a basis for action. Yet before anyone can use data as a basis for action the data have to be interpreted. The proper interpretation of data will require that the data be presented in context, and that the analysis technique used will filter out the noise."  (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Data are generally collected as a basis for action. However, unless potential signals are separated from probable noise, the actions taken may be totally inconsistent with the data. Thus, the proper use of data requires that you have simple and effective methods of analysis which will properly separate potential signals from probable noise." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Just as dynamics arise from feedback, so too all learning depends on feedback. We make decisions that alter the real world; we gather information feedback about the real world, and using the new information we revise our understanding of the world and the decisions we make to bring our perception of the state of the system closer to our goals." (John D Sterman, "Business dynamics: Systems thinking and modeling for a complex world", 2000) 

"Without meaningful data there can be no meaningful analysis. The interpretation of any data set must be based upon the context of those data. Unfortunately, much of the data reported to executives today are aggregated and summed over so many different operating units and processes that they cannot be said to have any context except a historical one - they were all collected during the same time period. While this may be rational with monetary figures, it can be devastating to other types of data." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Data is a fact of life. As time goes by, we collect more and more data, making our original reason for collecting the data harder to accomplish. We don't collect data just to waste time or keep busy; we collect data so that we can gain knowledge, which can be used to improve the efficiency of our organization, improve profit margins, and on and on. The problem is that as we collect more data, it becomes harder for us to use the data to derive this knowledge. We are being suffocated by this raw data, yet we need to find a way to use it." (Seth Paul et al. "Preparing and Mining Data with Microsoft SQL Server 2000 and Analysis", 2002)

"Statistics depend on collecting information. If questions go unasked, or if they are asked in ways that limit responses, or if measures count some cases but exclude others, information goes ungathered, and missing numbers result. Nevertheless, choices regarding which data to collect and how to go about collecting the information are inevitable." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Put simply, statistics is a range of procedures for gathering, organizing, analyzing and presenting quantitative data. […] Essentially […], statistics is a scientific approach to analyzing numerical data in order to enable us to maximize our interpretation, understanding and use. This means that statistics helps us turn data into information; that is, data that have been interpreted, understood and are useful to the recipient. Put formally, for your project, statistics is the systematic collection and analysis of numerical data, in order to investigate or discover relationships among phenomena so as to explain, predict and control their occurrence." (Reva B Brown & Mark Saunders, "Dealing with Statistics: What You Need to Know", 2008)

"Statistics is the art of learning from data. It is concerned with the collection of data, their subsequent description, and their analysis, which often leads to the drawing of conclusions." (Sheldon M Ross, "Introductory Statistics" 3rd Ed., 2009)

"Statistics is the science of collecting, organizing, analyzing, and interpreting data in order to make decisions." (Ron Larson & Betsy Farber, "Elementary Statistics: Picturing the World" 5th Ed., 2011)

"The discrepancy between our mental models and the real world may be a major problem of our times; especially in view of the difficulty of collecting, analyzing, and making sense of the unbelievable amount of data to which we have access today." (Ugo Bardi, "The Limits to Growth Revisited", 2011)

"In order to be effective a descriptive statistic has to make sense - it has to distill some essential characteristic of the data into a value that is both appropriate and understandable. […] the justification for computing any given statistic must come from the nature of the data themselves - it cannot come from the arithmetic, nor can it come from the statistic. If the data are a meaningless collection of values, then the summary statistics will also be meaningless - no arithmetic operation can magically create meaning out of nonsense. Therefore, the meaning of any statistic has to come from the context for the data, while the appropriateness of any statistic will depend upon the use we intend to make of that statistic." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Each systems archetype embodies a particular theory about dynamic behavior that can serve as a starting point for selecting and formulating raw data into a coherent set of interrelationships. Once those relationships are made explicit and precise, the 'theory' of the archetype can then further guide us in our data-gathering process to test the causal relationships through direct observation, data analysis, or group deliberation." (Daniel H Kim, "Systems Archetypes as Dynamic Theories", The Systems Thinker Vol. 24 (1), 2013)

"Statistics is an integral part of the quantitative approach to knowledge. The field of statistics is concerned with the scientific study of collecting, organizing, analyzing, and drawing conclusions from data." (Kandethody M Ramachandran & Chris P Tsokos, "Mathematical Statistics with Applications in R" 2nd Ed., 2015)

"The term data, unlike the related terms facts and evidence, does not connote truth. Data is descriptive, but data can be erroneous. We tend to distinguish data from information. Data is a primitive or atomic state (as in ‘raw data’). It becomes information only when it is presented in context, in a way that informs. This progression from data to information is not the only direction in which the relationship flows, however; information can also be broken down into pieces, stripped of context, and stored as data. This is the case with most of the data that’s stored in computer systems. Data that’s collected and stored directly by machines, such as sensors, becomes information only when it’s reconnected to its context."  (Stephen Few, "Signal: Understanding What Matters in a World of Noise", 2015)

"Big data is, in a nutshell, large amounts of data that can be gathered up and analyzed to determine whether any patterns emerge and to make better decisions." (Daniel Covington, Analytics: Data Science, Data Analysis and Predictive Analytics for Business, 2016)

"Statistics can be defined as a collection of techniques used when planning a data collection, and when subsequently analyzing and presenting data." (Birger S Madsen, "Statistics for Non-Statisticians", 2016)

"Statistics is the science of collecting, organizing, and interpreting numerical facts, which we call data. […] Statistics is the science of learning from data." (Moore McCabe & Alwan Craig, "The Practice of Statistics for Business and Economics" 4th Ed., 2016)

"Collecting data through sampling therefore becomes a never-ending battle to avoid sources of bias. [...] While trying to obtain a random sample, researchers sometimes make errors in judgment about whether every person or thing is equally likely to be sampled." (Daniel J Levitin, "Weaponized Lies", 2017)

"Just because there’s a number on it, it doesn’t mean that the number was arrived at properly. […] There are a host of errors and biases that can enter into the collection process, and these can lead millions of people to draw the wrong conclusions. Although most of us won’t ever participate in the collection process, thinking about it, critically, is easy to learn and within the reach of all of us." (Daniel J Levitin, "Weaponized Lies", 2017)

"Measurements must be standardized. There must be clear, replicable, and precise procedures for collecting data so that each person who collects it does it in the same way." (Daniel J Levitin, "Weaponized Lies", 2017)

"To be any good, a sample has to be representative. A sample is representative if every person or thing in the group you’re studying has an equally likely chance of being chosen. If not, your sample is biased. […] The job of the statistician is to formulate an inventory of all those things that matter in order to obtain a representative sample. Researchers have to avoid the tendency to capture variables that are easy to identify or collect data on - sometimes the things that matter are not obvious or are difficult to measure." (Daniel J Levitin, "Weaponized Lies", 2017)

"The desire to collect as much data as possible must be balanced with an approximation of which data sources are useful to address a business issue. It is worth mentioning that often the value of internal data is high. Most internal data has been cleansed and transformed to suit the mission. It should not be overlooked simply because of the excitement of so much other available data." (Mike Fleckenstein & Lorraine Fellows, "Modern Data Strategy", 2018)

"A random collection of interesting but disconnected facts will lack the unifying theme to become a data story - it may be informative, but it won’t be insightful." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"Are your insights based on data that is accurate and reliable? Trustworthy data is correct or valid, free from significant defects and gaps. The trustworthiness of your data begins with the proper collection, processing, and maintenance of the data at its source. However, the reliability of your numbers can also be influenced by how they are handled during the analysis process. Clean data can inadvertently lose its integrity and true meaning depending on how it is analyzed and interpreted." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019

"Each decision about what data to gather and how to analyze them is akin to standing on a pathway as it forks left and right and deciding which way to go. What seems like a few simple choices can quickly multiply into a labyrinth of different possibilities. Make one combination of choices and you’ll reach one conclusion; make another, equally reasonable, and you might find a very different pattern in the data." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"It’d be nice to fondly imagine that high-quality statistics simply appear in a spreadsheet somewhere, divine providence from the numerical heavens. Yet any dataset begins with somebody deciding to collect the numbers. What numbers are and aren’t collected, what is and isn’t measured, and who is included or excluded are the result of all-too-human assumptions, preconceptions, and oversights." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Unless we’re collecting data ourselves, there’s a limit to how much we can do to combat the problem of missing data. But we can and should remember to ask who or what might be missing from the data we’re being told about. Some missing numbers are obvious […]. Other omissions show up only when we take a close look at the claim in question." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"What is the purpose of collecting data? People gather and store data for at least three different reasons that I can discern. One reason is that they want to build an arsenal of evidence with which to prove a point or defend an agenda that they already had to begin with. This path is problematic for obvious reasons, and yet we all find ourselves traveling on it from time to time. Another reason people collect data is that they want to feed it into an artificial intelligence algorithm to automate some process or carry out some task. […] A third reason is that they might be collecting data in order to compile information to help them better understand their situation, to answer questions they have in their mind, and to unearth new questions that they didn't think to ask." (Ben Jones, "Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations", 2020)

[Murphy’s Laws of Analysis:] "(1) In any collection of data, the figures that are obviously correct contain errors. (2) It is customary for a decimal to be misplaced. (3) An error that can creep into a calculation, will. Also, it will always be in the direction that will cause the most damage to the calculation." (G C Deakly)

"[…] numerous samples collected without a clear idea of what is to be done with the data are commonly less useful than a moderate number of samples collected in accordance with a specific design." (William C Krumbein)


16 March 2017

⛏️Data Management: Missing Data (Definitions)

"Noise in a bivalent testing input pattern in which one or more components have been changed from the correct value to a value midway between the correct and the incorrect value, i.e. a + 1, or a -1, has been changed to a O." (Laurene V Fausett, "Fundamentals of Neural Networks: Architectures, Algorithms, and Applications", 1994)

"Many databases have cases where not all the attribute values are known. These can be due to structural reasons (e.g., parity for males), due to changes or variations in data collection methodology, or due to nonresponses. In the latter case, it is important to distinguish between ignorable and nonignorable nonresponse. The former must be addressed even though the latter can (usually) be treated as random." (William J Raynor Jr., "The International Dictionary of Artificial Intelligence", 1999)

"Observations where one or more variables contain no value." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"data are said to be missing when there is no information for one or more pattern on one or more features in a research study." (Pedro J García-Laencina et al, "Classification with Incomplete Data", 2010)

"Missing data, also known as lost data, is the data that is lost in an inner join when rows of the tables being joined do not match with any other rows. Missing data can also occur with one-sided joins on the side that is not being preserved. This definition ignores all the other reasons for missing data." (Michael M David & Lee Fesperman, "Advanced SQL Dynamic Data Modeling and Hierarchical Processing", 2013)

"It refers that no data value is stored for the variable in the observation." (Liang-Ting Tsai et al, "Weighting Imputation for Categorical Data", 2014)

"Observations which were planned and are missing." (OECD)

"In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data." (Wikipedia)

11 February 2017

⛏️Data Management: Data Collection (Definitions)

"The gathering of information through focus groups, interviews, surveys, and research as required to develop a strategic plan." (Teri Lund & Susan Barksdale, "10 Steps to Successful Strategic Planning", 2006)

"The process of gathering raw or primary specific data from a single source or from multiple sources." (Adrian Stoica et al, "Field Evaluation of Collaborative Mobile Applications", 2008) 

"A combination of human activities and computer processes that get data from sources into files. It gets the file data using empirical methods such as questionnaire, interview, observation, or experiment." (Jens Mende, "Data Flow Diagram Use to Plan Empirical Research Projects", 2009)

"A systematic process of gathering and measuring information about the phenomena of interest." (Kaisa Malinen et al, "Mobile Diary Methods in Studying Daily Family Life", 2015)

"The process of capturing events in a computer system. The result of a data collection operation is a log record. The term logging is often used as a synonym for data collection." (Ulf Larson et al, "Guidance for Selecting Data Collection Mechanisms for Intrusion Detection", 2015)

"This refers to the various approaches used to collect information." (Ken Sylvester, "Negotiating in the Leadership Zone", 2015)

"Set of techniques that allow gathering and measuring information on certain variables of interest." (Sara Eloy et al, "Digital Technologies in Architecture and Engineering: Exploring an Engaged Interaction within Curricula", 2016)

"with respect to research, data collection is the recording of data for the purposes of a study. Data collection for a study may or may not be the original recording of the data." (Meredith Zozus, "The Data Book: Collection and Management of Research Data", 2017)

"The process of retrieving data from different sources and storing them in a unique location for further use." (Deborah Agostino et al, "Social Media Data Into Performance Measurement Systems: Methodologies, Opportunities, and Risks", 2018)

"It is the process of gathering data from a variety of relevant sources in an established systematic fashion for analysis purposes." (Yassine Maleh et al, "Strategic IT Governance and Performance Frameworks in Large Organizations", 2019)

"A process of storing and managing data." (Neha Garg & Kamlesh Sharma, "Machine Learning in Text Analysis", 2020)

"The process and techniques for collecting the information for a research project." (Tiffany J Cresswell-Yeager & Raymond J Bandlow, "Transformation of the Dissertation: From an End-of-Program Destination to a Program-Embedded Process", 2020)

"The method of collecting and evaluating data on selected variables, which helps in analyzing and answering relevant questions is known as data collection." (Hari K Kondaveeti et al, "Deep Learning Applications in Agriculture: The Role of Deep Learning in Smart Agriculture", 2021)

"Datasets are created by collecting data in different ways: from manual or automatic measurements (e.g. weather data), surveys (census data), records of decisions (budget data) or ongoing transactions (spending data), aggregation of many records (crime data), mathematical modelling (population projections), etc." (Open Data Handbook)

04 February 2017

💠🛠️SQL Server: Administration (Killing Sessions - Killing ‘em Softly and other Snake Stories)


There are many posts on the web advising succinctly how to resolve a blocking situation by terminating a session via the kill command, though few of them warn about its use or mention the important aspects that need to be considered. The command is powerful and, to use an old adage, “with great power comes great responsibility”, a responsibility that is rarely felt when reading between the lines. The ease with which people treat the topic can be seen in questions like “is it possible to automate terminating sessions?” or in explicit recommendations to terminate sessions when dealing with blockings.

A session is created when a client connects to an RDBMS (Relational Database Management System) like SQL Server, being nothing but an internal logical representation of the connection. It is then used to perform work against the database(s) via (batches of) SQL statements. Throughout its lifetime, a session is uniquely identified by an SPID (Server Process ID) and processes one SQL statement at a time. Therefore, when a problem with a session occurs, it can be traced back to a query, which is where the actual troubleshooting needs to be performed.

Even if each session has a defined scope and memory space, and cannot interact with other sessions, sessions can block each other when attempting to use the same data resources. Thus, a blocking occurs when one session holds a lock on a specific resource and a second session attempts to acquire a conflicting lock type on the same resource. In other words, the first session blocks the second session from acquiring a resource. It’s like a drive-in at a fast-food restaurant in which cars must line up in a queue to place an order. The second car can’t place an order until the first one has its order; until then, it is blocked from placing an order. The third car must wait for the second, and so on. Similarly, sessions wait in line for a resource, which leads to a blocking chain, with a head (the head or lead blocker) and a tail (the sessions that follow). It’s a FIFO (first in, first out) queue and, using a little imagination, one can compare it metaphorically with a snake. Even if imperfect, the metaphor is appropriate for highlighting some important aspects that can be summed up as follows:

  • Snakes have their roles in the ecosystem
  • Not all snakes are dangerous
  • Grab the snake by its head
  • Killing ‘em Softly
  • Search for a snake’s nest
  • Snakes can kill you in sleep
  • Snake taming

Warning: snakes, as well as blockings, need to be handled by a specialist, so don’t do it by yourself unless you know what you are doing!

Snakes have their roles in the ecosystem

Snakes as middle-order predators have an important role in natural ecosystems, as they feed on prey species, whose numbers would increase exponentially if not kept under control. Fortunately, natural ecosystems have such mechanisms, through which they tend to auto-regulate themselves. Artificially built ecosystems need such auto-regulation mechanisms as well. As a series of dynamic mechanisms and components that work together toward a purpose, SQL Server is an (artificial) ecosystem that tends to auto-regulate itself. When its environment is adequately sized to handle the volume of information or data it must process, the system will behave smoothly. As soon as it starts processing more data than it can handle, it starts misbehaving, to the point that one of its resources gets exhausted.

Just because a blocking occurs doesn’t mean that it is a bad thing and needs to be terminated. Temporary blockings occur all the time, as an unavoidable characteristic of any RDBMS with lock-based concurrency like SQL Server. They are however easier to observe in systems with heavy workload and concurrent access. The more users in the system touch the same data, the higher the chances for a blocking to occur. A well-designed database and application architecture typically minimizes the occurrence and duration of blockings, making them almost unobservable. At the opposite extreme, poor database design combined with poor application design can turn blockings into a DBA’s nightmare. Persistent blockings can be a sign of poor database or application design, or a sign that one of the environment’s limits was reached. It’s a sign that something must be done. Restarting SQL Server, terminating sessions or adding more resources have only a temporary effect. The opportunity lies typically in addressing poor database and application design issues, though this can become costlier with time.

Not all snakes are dangerous

A snake’s size is the easiest characteristic for identifying whether a snake is dangerous or not. Big snakes inspire fear in any mortal. Similarly, “big” blockings (blockings consuming a significant percentage of the available resources) are dangerous and have the potential of bringing the whole server down, eating its memory resources slowly until its life comes to a stop. It can be a slow as well as a fast death.

Independently of their size, poisonous snakes are a danger for any living creature. By studying snakes’ characteristics like pupils’ shape and skin color patterns, people devised simple general rules (with local applicability) for identifying whether snakes are poisonous or not. Thus, snakes with diamond-shaped pupils or having color patterns in which red touches yellow are likely/believed to be poisonous. Similarly, by observing the behavior of blockings and learning about SQL Server’s internals, one can with time understand the impact of each blocking on the server’s performance.

Grab the snake by its head

Restraining a snake’s head assures that the snake is not able to bite, though it can be dangerous, as the snake might believe it is dealing with a predator that is trying to hurt it, and react accordingly. Likewise, troubleshooting blockings must start with the head, the blocking session, as it’s the one which created the blocking problem in the first place.

In SQL Server, sp_who and its alternative sp_who2 provide a list of all sessions, with their status, SPID and a reference to the SPID of the session blocking them, if any. They thus display all the blocking pairs. When one deals with a few blockings, one can easily see whether the sessions form a blocking chain. Especially in environments under heavy load, one can deal with so many blockings that it becomes difficult to identify all the blocking chains formed. Identifying blocking chains is necessary because terminating the head blocker directly will often make the whole blocking chain disappear. The other sessions in the chain will then perform their work undisturbed.

Terminating each blocking session in the pairs displayed by sp_who is not recommended, as one terminates more sessions than needed, which could have unexpected repercussions. As a rule, one should restore the system’s health while causing minimal damage.

In many cases terminating the head session will make the blocking chain disperse; however, there are cases in which the head session is replaced by another session (e.g. when the sessions involve the same or similar queries). One will need to repeat the needed steps until all blocking chains dissolve.
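To make the idea of a blocking chain and its head more tangible, here is a minimal Python sketch. It assumes the blocking pairs were already copied from the output of sp_who into (spid, blocked_by) tuples, with 0 meaning "not blocked"; the tuple layout and the function name blocking_chains are illustrative assumptions, not something SQL Server provides.

```python
from collections import defaultdict


def blocking_chains(pairs):
    """Group (spid, blocked_by) pairs into chains keyed by head blocker.

    blocked_by is 0 (or None) when the session is not blocked.
    Returns {head_spid: [spids waiting behind it, nearest first]}.
    """
    # Keep only the sessions that are actually blocked.
    blocked_by = {spid: blocker for spid, blocker in pairs if blocker}

    # Invert the relation: blocker -> sessions waiting on it.
    waiters = defaultdict(list)
    for spid, blocker in blocked_by.items():
        waiters[blocker].append(spid)

    # A head blocker blocks somebody but is not blocked itself.
    heads = sorted(spid for spid in waiters if spid not in blocked_by)

    chains = {}
    for head in heads:
        followers, queue = [], [head]
        while queue:  # breadth-first walk from the head towards the tail
            current = queue.pop(0)
            for waiter in sorted(waiters.get(current, [])):
                followers.append(waiter)
                queue.append(waiter)
        chains[head] = followers
    return chains


if __name__ == "__main__":
    # Hypothetical snapshot: two independent chains and one idle session.
    snapshot = [(51, 0), (52, 51), (53, 52), (54, 0), (55, 54), (56, 0)]
    print(blocking_chains(snapshot))
```

In this made-up snapshot only sessions 51 and 54 are heads, so they are the only candidates worth investigating. After terminating a head one would take a fresh snapshot and run the grouping again, since another session may have taken its place.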

Killing ‘em Softly 

Killing a snake, no matter how blamable the act, is sometimes necessary. Therefore, it should be used as a last resort, when there is no other alternative and when needed to save one’s own or others’ lives. Similarly, killing a session should be done only in extremis, when necessary, for example when the server’s performance has degraded considerably, affecting other users, or when the session hangs indefinitely.

The kill command is powerful, having the power of a hammer. The problem is that when you have a hammer, every session looks like a nail. Despite all the aplomb one has when using a tool like a hammer, one needs to be careful in dealing with blockings. A blocking that is not addressed properly can strike back, and in special cases the bite can be deadly, for the system as well as for one’s job. Killing the beast is the easiest approach. Kill one beast and another one will take its territory. It’s one of the laws of nature that also applies to database environments. The difference is that if one doesn’t address the primary cause that led to a blocking, the same type of snake will most likely appear repeatedly.

Unfortunately, the kill command is not a bulletproof way of terminating a session; it may only sever the snake. As the documentation warns, there can be cases in which the method won’t have any effect on the blocking, the blocking continuing to roam around. So, it might be a good idea to check whether the session disappeared and keep an eye on it until it is completely gone. Especially when dealing with a blocking chain, it can happen that the head session is replaced by another session, which was probably waiting for the same resources as the previous head session. It may also happen that one deals with two or more blocking chains that are independent of each other. Such cases are rare but possible.
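The "keep an eye on it" advice can be sketched as a small polling loop in Python. The callable list_spids is a hypothetical helper that returns the SPIDs currently present on the server (for example by wrapping a call to sp_who); it and the function name wait_until_gone are assumptions made for illustration only.

```python
import time


def wait_until_gone(list_spids, spid, attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Return True once spid no longer shows up in list_spids(), else False.

    list_spids is a hypothetical callable returning the current session ids.
    The sleep function is injectable so the loop can be tried without waiting.
    """
    for attempt in range(attempts):
        if spid not in list_spids():
            return True
        if attempt < attempts - 1:  # no need to wait after the last check
            sleep(delay_seconds)
    return False


if __name__ == "__main__":
    # Simulated snapshots: session 51 lingers for two checks, then is gone.
    snapshots = iter([{51, 52}, {51, 52}, {52}])
    gone = wait_until_gone(lambda: next(snapshots), 51, delay_seconds=0)
    print("session 51 gone:", gone)
```

A result of False only says that the session was still there after the last check; what to do next remains a decision for the person troubleshooting.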

Killing the head session of a blocking without gathering some data provides fewer opportunities for learning, for understanding what’s happening in your system, and for identifying what caused the blocking to occur. Therefore, before jumping to kill a session, collect the data you need for further troubleshooting.

Search for a snake’s nest 

With the warning that, unless one deals with small snakes, searching for a snake’s nest might not be advisable, the idea behind this heuristic is that where a snake appears there is most likely also a nest not far away, where several other snakes might hide. Similarly, a query that causes persistent blockings might be an indication of code that generates a range of misbehaving queries. It can be the same code or different pieces of code. One can attempt to improve the performance of a query that leads to blockings by adding more resources on the server or by optimizing SQL Server’s internals, though one can’t compensate for poor programming. When possible, one needs to tackle the problem at the source, otherwise performance improvements are only temporary.

Snakes can kill you in sleep 

When wandering into the wild, as well as when having snakes as pets, one must take all measures to assure that nobody’s health is endangered. The same principle should apply to databases as well, and the first line of defense resides in actively monitoring the blockings and addressing them promptly as they occur. Being too confident that nothing will happen and not taking the necessary precautions can prove to be a bad strategy when a problem occurs. In some situations, the damage might be acceptable in comparison with the effort and costs needed to build the monitoring infrastructure, though for critical systems it can come with significant costs.

Snake taming

Having snakes as pets doesn’t seem like a good idea, and there are so many reasons why one shouldn’t do it (see PETA’s reasons)! On the other side, there are also people with uncommon hobbies, who don’t limit themselves to keeping a snake as a pet, but try to tame it, to have it behave like a pet. There are people who breed snakes to harness their venom for various purposes, an occupation that requires handling snakes closely. There are also people who brought their relation with snakes to the level of art, snake charming having been a tradition since ancient Egypt in countries from Southeast Asia, the Middle East, and North Africa. Even if not all snakes are tameable, taming and charming snakes is possible. In the process the tamer must deprogram or control the snakes’ behavior, following a specific methodology in a safe environment.

No matter how much one tries to avoid persistent blockings, one can learn from troubleshooting blockings about their sources and behavior, as well as about one’s own limitations. A complex blocking can be a good example with which one can test one’s knowledge about SQL Server internals as well as about the applications’ architecture. Each blocking provides a scenario in which one can learn something.

When fighting with a blocking, it’s wise to do it within a safe environment, typically a test or development environment. Fighting with it in a production environment can cause unnecessary stress and damage. So, if you don’t have a safe environment in which to carry out the fight, then build one and try to keep the same essential characteristics as in the production environment!

There will also be situations in which one must fight with a blocking in the production environment. Then, be careful not to damage the data or the environment, and take all the needed precautions!


The comparison between snakes and blockings might not be perfect, though hopefully it will imprint in the reader’s mind the dangers of handling blockings inappropriately and raise awareness of related topics.

15 February 2015

📊Business Intelligence: Reporting (Definitions)

"An automated business process or related functionality that provides a detailed, formal account of relevant or requested information." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[enterprise reporting:] "1.The process of producing reports using unified views of enterprise data. 2.A category of software tools used to produce reports; a term for what were simply known as reporting tools." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[ad hoc reporting:] "A reporting system that enables end users to run queries and create custom reports without having to know the technicalities of the underlying database schema and query syntax." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A process by which insight is presented in a visually appealing and informative manner." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"The practice of reporting what has happened, analyzing contributing data to determine why it happened, and monitoring new data to determine what is happening now. Also known as descriptive analytics and business intelligence." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"The process of collecting data from various sources and presenting it to business people in an understandable way." (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"A common interaction with an organizing system." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed., 2016)

"The function or activity for generating documents that contain information organized in a narrative, graphic, or tabular form, often in a repeatable and regular fashion." (Jonathan Ferrar et al., 2017)

"Business intelligence reporting, or BI reporting, is the process of gathering data by utilizing different software and tools to extract relevant insights. Ultimately, it provides suggestions and observations about business trends, empowering decision-makers to act." (Data Pine)

"When we talk about reporting in business intelligence (BI), we are talking about two things. One is reporting strictly defined. The other is 'reporting' taken in a more general meaning. In the first case, reporting is the art of collecting data from various data sources and presenting it to end-users in a way that is understandable and ready to be analyzed. In the second sense, reporting means presenting data and information, so it also includes analysis–in other words, allowing end-users to both see and understand the data, as well as act on it." (Logi Analytics)

About Me

Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.