SQL Troubles: quality

Showing posts with label quality. Show all posts

03 May 2025

🧭Business Intelligence: Perspectives (Part 31: More on Data Visualization)

Business Intelligence Series

There are many reasons why the data visualizations available in the different mediums can be considerate as having poor quality and unfortunately there is often more than one issue that can be corroborated with this - the complexity of the data or of the models behind them, the lack of identifying the right data, respectively aspects that should be visualized, poor data visualization software or the lack of skills to use its capabilities, improper choice of visual displays, misleading choice of scales, axes and other elements, the lack of clear outlines for telling a story respectively of pushing a story too far, not adapting visualizations to changing requirements or different perspectives, to name just the most important causes.

The complexity of the data increases with the dimensions associated typically with what we call currently big data - velocity, volume, value, variety, veracity, variability and whatever V might be in scope. If it's relatively easy to work with a small dataset, understanding its shapes and challenges, our understanding power decreases with the Vs added into the picture. Of course, we can always treat the data alike, though the broader the timeframe, the higher the chances are for the data to have important changing characteristics that can impact the outcomes. It can be simple definition changes or more importantly, the model itself. Data, processes and perspectives change fluidly with the many requirements, and quite often the further implications for reporting, visualizations and other aspects are not considered.

Quite often there's a gap between what one wants to achieve with a data visualization and the data or knowledge available. It might be a matter of missing values or whole attributes that would help to delimit clearly the different perspectives or of modelling adequately the processes behind. It can be the intrinsic data quality issues that can be challenging to correct after the fact. It can also be our understanding about the processes themselves as reflected in the data, or more important, on what's missing to provide better perspectives. Therefore, many are forced to work with what they have or what they know.

Many of the data visualizations inadvertently reflect their creators' understanding about the data, procedures, processes, and any other aspects related to them. Unfortunately, also business users or other participants have only limited views and thus their knowledge must be elicited accordingly. Even then, it might be pieces of data that are not reflected in any knowledge available.

If one tortures enough data, one or more stories worthy of telling can probably be identified. However, much of the data is dull to the degree that some creators feel forced to add elements. Earlier, one could have blamed the software for it, though modern software provides nice graphics and plenty of features that can help graphics creators in the process. Even data with high quality can reveal some challenges difficult to overcome. One needs to compromise and there can be compromises in many places to the degree that one can but wonder whether the end result still reflects reality. Unfortunately, it's difficult to evaluate the impact of such gaps, however progress can be made occasionally by continuously evaluating the gaps and finding the appropriate methods to address them.

Not all stories must have complex visualizations in which multiple variables are used to provide the many perspectives. Some simple visualizations can be enough for establishing common ground on which something more complex (or simple) can be built upon. Data visualization is a continuous process of exploration, extrapolation, evaluation, testing assumptions and ideas, where one's experience can be a useful mediator between the various forces.

Previous Post <<||>> Next Post

08 March 2025

#️⃣Software Engineering: Programming (Part XVI: The Software Quality Perspective and AI)

Software Engineering Series

Organizations tend to complain about poor software quality developed in-house, by consultancy companies or third parties, without doing much in this direction. Unfortunately, this agrees with the bigger picture reflected by the quality standards adopted by organizations - people talk and complain about them, though they aren’t that eager to include them in the various strategies, or even if they are considered, they are seldom enforced adequately!

Moreover, even if quality standards are adopted, and a lot of effort may be spent in this direction (as everybody has strong opinions and there are many exceptions), as projects progress, all the good intentions come to an end, the rules fading on the way either because are too strict, too general, aren’t adequately prioritized or communicated, or there’s no time to implement (all of) them. This applies in general to programming and to the domains that revolve around data – Business Intelligence, Data Analytics or Data Science.

The volume of good quality code and deliverables is not only a reflection of an organization’s maturity in dealing with best practices but also of its maturity in handling technical debt, Project Management, software and data quality challenges. All these aspects are strongly related to each other and therefore require a systemic approach rather than focusing on the issues locally. The systemic approach allows organizations to bridge the gaps between business areas, teams, projects and any other areas of focus.

There are many questionable studies on the effect of methodologies on software quality and data issues, proclaiming that one methodology is better than the other in addressing the multifold aspects of software quality. Besides methodologies, some studies attempt to correlate quality with organizations’ size, management or programmers’ experience, the size of software, or whatever characteristic might seem to affect quality.

Bad code is written independently of companies’ size or programmer's experience, management or organization’s maturity. Bad code doesn’t necessarily happen all at once, but it can depend on circumstances, repetitive team, requirements and code changes. There are decisions and actions that sooner or later can affect the overall outcome negatively.

Rewriting the code from scratch might look like an approachable measure though it’s seldom the cost-effective solution. Allocating resources for refactoring is usually a better approach, though this tends to increase considerably the cost of projects, and organizations might be tempted to face the risks, whatever they might be. Independently of the approaches used, sooner or later the complexity of projects, requirements or code tends to kick back.

There are many voices arguing that AI will help in addressing the problems of software development, quality assurance and probably other areas. It’s questionable how much AI will help to address the gaps, non-concordances and other mistakes in requirements, and how it will develop quality code when it has basic "understanding" issues. Even if step by step all current issues revolving around AI will be fixed, it will take time and multiple iterations until meaningful progress will be made.

At least for now, AI tools like Copilot or ChatGPT can be used for learning a programming language or framework through predefined or ad-hoc prompts. Probably, it can be used also to identify deviations from best practices or other norms in scope. This doesn’t mean that AI will replace for now code reviews, testing and other practices used in assuring the quality of software, but it can be used as an additional method to check for what was eventually missed in the other methods.

AI may also have hidden gems that when discovered, polished and sized, may have a qualitative impact on software development and software. Only time will tell what’s possible and achievable.

Previous Post <<||>> Next Post

15 February 2025

🧭Business Intelligence: Perspectives (Part 27: A Tale of Two Cities II)

Business Intelligence Series

There’s a saying that applies to many contexts ranging from software engineering to data analysis and visualization related solutions: "fools rush in where angels fear to tread" [1]. Much earlier, an adage attributed to Confucius provides a similar perspective: "do not try to rush things; ignore matters of minor advantage". Ignoring these advices, there's the drive in rapid prototyping to jump in with both feet forward without checking first how solid the ground is, often even without having adequate experience in the field. That’s understandable to some degree – people want to see progress and value fast, without building a foundation or getting an understanding of what’s happening, respectively possible, often ignoring the full extent of the problems.

A prototype helps to bring the requirements closer to what’s intended to achieve, though, as the practice often shows, the gap between the initial steps and the final solutions require many iterations, sometimes even too many for making a solution cost-effective. There’s almost always a tradeoff between costs and quality, respectively time and scope. Sooner or later, one must compromise somewhere in between even if the solution is not optimal. The fuzzier the requirements and what’s achievable with a set of data, the harder it gets to find the sweet spot.

Even if people understand the steps, constraints and further aspects of a process relatively easily, making sense of the data generated by it, respectively using the respective data to optimize the process can take a considerable effort. There’s a chain of tradeoffs and constraints that apply to a certain situation in each context, that makes it challenging to always find optimal solutions. Moreover, optimal local solutions don’t necessarily provide the optimum effect when one looks at the broader context of the problems. Further on, even if one brought a process under control, it doesn’t necessarily mean that the process works efficiently.

This is the broader context in which data analysis and visualization topics need to be placed to build useful solutions, to make a sensible difference in one’s job. Especially when the data and processes look numb, one needs to find the perspectives that lead to useful information, respectively knowledge. It’s not realistic to expect to find new insight in any set of data. As experience often proves, insight is rarer than finding gold nuggets. Probably, the most important aspect in gold mining is to know where to look, though it also requires luck, research, the proper use of tools, effort, and probably much more.

One of the problems in working with data is that usually data is analyzed and visualized in aggregates at different levels, often without identifying and depicting the factors that determine why data take certain shapes. Even if a well-suited set of dimensions is defined for data analysis, data are usually still considered in aggregate. Having the possibility to change between aggregates and details is quintessential for data’s understanding, or at least for getting an understanding of what's happening in the various processes.

There is one aspect of data modeling, respectively analysis and visualization that’s typically ignored in BI initiatives – process-wise there is usually data which is not available and approximating the respective values to some degree is often far from the optimal solution. Of course, there’s often a tradeoff between effort and value, though the actual value can be quantified only when gathering enough data for a thorough first analysis. It may also happen that the only benefit is getting a deeper understanding of certain aspects of the processes, respectively business. Occasionally, this price may look high, though searching for cost-effective solutions is part of the job!

Previous Post <<||>> Next Post

References:
[1] Alexander Pope (cca. 1711) An Essay on Criticism

26 January 2025

🧭Business Intelligence: Perspectives (Part 25: Grounding the Roots)

Business Intelligence Series

When building something that is supposed to last, one needs a solid foundation on which the artifact can be built upon. That’s valid for castles, houses, IT architectures, and probably most important, for BI infrastructures. There are so many tools out there that allow building a dashboard, report or other types of BI artifacts with a few drag-and-drops, moving things around, adding formatting and shiny things. In many cases all these steps are followed to create a prototype for a set of ideas or more formalized requirements keeping the overall process to a minimum.

Rapid prototyping, the process of building a proof-of-concept by focusing at high level on the most important design and functional aspects, is helpful and sometimes a mandatory step in eliciting and addressing the requirements properly. It provides a fast road from an idea to the actual concept, however the prototype, still in its early stages, can rapidly become the actual solution that unfortunately continues to haunt the dreams of its creator(s).

Especially in the BI area, there are many solutions that started as a prototype and gained mass until they start to disturb many things around them with implications for security, performance, data quality, and many other aspects. Moreover, the mass becomes in time critical, to the degree that it pulled more attention and effort than intended, with positive and negative impact altogether. It’s like building an artificial sun that suddenly becomes a danger for the nearby planet(s) and other celestial bodies.

When building such artifacts, it’s important to define what goals the end-result must or would be nice to have, differentiating clearly between them, respectively when is the time to stop and properly address the aspects mandatory in transitioning from the prototype to an actual solution that addresses the best practices in scope. It’s also the point when one should decide upon solution’s feasibility, needed quality acceptance criteria, and broader aspects like supporting processes, human resources, data, and the various aspects that have impact. Unfortunately, many solutions gain inertia without the proper foundation and in extremis succumb under the various forces.

Developing software artifacts of any type is a balancing act between all these aspects, often under suboptimal circumstances. Therefore, one must be able to set priorities right, react and change direction (and gear) according to the changing context. Many wish all this to be a straight sequential road, when in reality it looks more like mountain climbing, with many peaks, valleys and change of scenery. The more exploration is needed, the slower the progress.

All these aspects require additional time, effort, resources and planning, which can easily increase the overall complexity of projects to the degree that it leads to (exponential) effort and more important - waste. Moreover, the complexity pushes back, leading to more effort, and with it to higher costs. On top of this one has the iteration character of BI topics, multiple iterations being needed from the initial concept to the final solution(s), sometimes many steps being discarded in the process, corners are cut, with all the further implications following from this.

Somewhere in the middle, between minimum and the broad overextending complexity, is the sweet spot that drives the most impact with a minimum of effort. For some organizations, respectively professionals, reaching and remaining in the zone will be quite a challenge, though that’s not impossible. It’s important to be aware of all the aspects that drive and sustain the quality of artefacts, data and processes. There’s a lot to learn from successful as well from failed endeavors, and the various aspects should be reflected in the lessons learned.

Previous Post <<||>> Next Post

22 March 2024

🧭Business Intelligence: Perspectives (Part 9: Dashboards Are Dead & Other Crap)

Business Intelligence Series

I find annoying the posts that declare that a technology is dead, as they seem to seek the sensational and, in the end, don't offer enough arguments for the positions taken; all is just surfing though a few random ideas. Almost each time I klick on such a link I find myself disappointed. Maybe it's just me - having too great expectations from ad-hoc experts who haven't understood the role of technologies and their lifecycle.

At least until now dashboards are the only visual tool that allows displaying related metrics in a consistent manner, reflecting business objectives, health, or other important perspective into an organization's performance. More recently notebooks seem to be getting closer given their capabilities of presenting data visualizations and some intermediary steps used to obtain the data, though they are still far away from offering similar capabilities. So, from where could come any justification against dashboard's utility? Even if I heard one or two expert voices saying that they don't need KPIs for managing an organization, organizations still need metrics to understand how the organization is doing as a whole and taken on parts.

Many argue that the design of dashboards is poor, that they don't reflect data visualization best practices, or that they are too difficult to navigate. There are so many books on dashboard and/or graphic design that is almost impossible not to find such a book in any big library if one wants to learn more about design. There are many resources online as well, though it's tough to fight with a mind's stubbornness in showing no interest in what concerns the topic. Conversely, there's also lot of crap on the social networks that qualify after the mainstream as best practices.

Frankly, design is important, though as long as the dashboards show the right data and the organization can guide itself on the respective numbers, the perfectionists can say whatever they want, even if they are right! Unfortunately, the numbers shown in dashboards raise entitled questions and the reasons are multiple. Do dashboards show the right numbers? Do they focus on the objectives or important issues? Can the number be trusted? Do they reflect reality? Can we use them in decision-making?

There are so many things that can go wrong when building a dashboard - there are so many transformations that need to be performed, that the chances of failure are high. It's enough to have several blunders in the code or data visualizations for people to stop trusting the data shown.

Trust and quality are complex concepts and there’s no standard path to address them because they are a matter of perception, which can vary and change dynamically based on the situation. There are, however, approaches that allow to minimize this. One can start for example by providing transparency. For each dashboard provide also detailed reports that through drilldown (or also by running the reports separately if that’s not possible) allow to validate the numbers from the report. If users don’t trust the data or the report, then they should pinpoint what’s wrong. Of course, the two sources must be in synch, otherwise the validation will become more complex.

There are also issues related to the approach - the way a reporting tool was introduced, the way dashboards flooded the space, how people reacted, etc. Introducing a reporting tool for dashboards is also a matter of strategy, tactics and operations and the various aspects related to them must be addressed. Few organizations address this properly. Many organizations work after the principle "build it and they will come" even if they build the wrong thing!

Previous Post <<||>> Next Post

17 February 2024

🧭Business Intelligence: A Software Engineer's Perspective I (Houston, we have a Problem!)

Business Intelligence Series

One of the critics addressed to the BI/Data Analytics, Data Engineering and even Data Science fields is their resistance to applying Software Engineering (SE) methods in practice. SE can be regarded as the application of sound methods, methodologies, techniques, principles, and practices to obtain high quality economic software in a reproducible manner. At minimum, should be applied SE techniques and practices proven to work, for example the use of best practices, reference technologies, standardized processes for requirements gathering and management, etc. This doesn't mean that one should apply the full extent of SE but consider a minimum that makes sense to adopt.

Unfortunately, the creation of data artifacts (queries, reports, data models, data pipelines, data visualizations, etc.) as process seem to be done after the principle of least action, though least action means here the minimum interaction to push pieces on a board rather than getting the things done. At high level, the process is as follows: get the requirements, build something, present results, get more requirements, do changes, present the results, and the process is repeated ad infinitum.

Given that data artifact's creation finds itself at the intersection of two or more knowledge areas in which knowledge is exchanged in several iterations between the parties involved until a common ground is achieved, this process is totally inefficient from multiple perspectives. First of all, it takes considerably more time than planned to reach a solution, resources being wasted in the process, multiple forms of waste being involved. Secondly, the exchange and retention of knowledge resulting from the process is minimal, mainly on a need by basis. This might look as an efficient approach on the short term, but is inefficient overall.

BI reflects the general issues from SE - most of the issues can be traced back to requirements - if the requirements are incorrect and there's no magic involved in between, then one can't expect for the solution to be correct. The bigger the difference between the initial and final requirements elicited in the process, the more resources are wasted. The more time passes between the start of the development phase and the time a solution is presented to the customer, the longer it takes to build the final solution. Same impact have the time it takes to establish a common ground and other critical factors for success involved in the process.

One can address these issues through better requirements elicitation, rapid prototyping, the use of agile methodologies and similar approaches, though the general feeling is that even if they bring improvements, they don't address the root causes - lack of data literacy skills, lack of knowledge about the business, lack of maturity in planning and executing tasks, the inexistence of well-designed processes and procedures, respectively the lack of an engineering mindset.

These inefficiencies have low impact when building a report occasionally, though they accumulate and tend to create systemic issues in what concerns the overall BI effort. They are addressed locally by experts and in general through a strategic approach like the elaboration of a BI strategy, though organizations seldom pay attention to them. Some organizations consider that they are automatically addressed as part of the data culture, though data culture focuses in general on data literacy and not on the whole set of assumptions mentioned above.

An experienced data professional sees more likely the inefficiencies, tries to address them locally in his interactions with the various stakeholders, he/she can build a business case for addressing them, though it depends on organizations to recognize that they have a problem, respective address the inefficiencies in a strategic and systemic manner!

Previous Post <<||>> Next Post

03 October 2023

🧮ERP: Implementations (Part VIII: It’s a Matter of Complexity)

ERP Implementations Series

There are many factors to blame for implementation process’ inefficiency, however many of the factors can be associated with the complexity of the project itself, respectively of the application(s) involved. The problem of complexity can be addressed by either answering to complexity with complexity, building a complex team to handle the tasks, which is seldom feasible even if many organizations do it, respectively by simplifying the implementation process and/or the application.

In what concerns the project, the complexity starts with requirement’s elicitation, the iterative transformations they suffer until the final functional requirements document is finalized, their evaluation and mapping to features, respectively gap’s identification. It’s a complex task because it involves understanding the business as well the functionality available in the target system(s). Then comes the effort estimation, which, as the name suggests, is just a guess based on available historical numbers and/or experts’ opinion. High-level requirements are easier to manage than low-level requirements, however they allow for more gaps in understanding. The more detailed the specifications, the more they should help in the estimation process, though that’s the theory. A considerable number of factors can impact the process.

Even if there are standard activities in the implementation process, the number of resources involved from the customer as well from the partner(s) side makes the whole planning process a nightmare for any Project Manager, no matter how experienced he/she is.

Ideally, each member of the team should behave like a trooper, knowing by instinct when and what needs to be done, which are the expectations, etc. This might be close to expectation on the partner side as the resources more likely participated in similar projects, though there’s always a mix between levels of expertise, resources migrating between projects. Unfortunately, that’s seldom (never) the case on the customer side as the gap between reality and expectation is considerable.

Each team member requires a minimum of information/knowledge so he/she can perform the activities assigned. Moreover, the volume of coordination and cooperation is considerably higher than in other projects, complexity that increases with organization’s size and is inverse proportional with organization’s maturity in managing projects and implementation-related activities. There’s thus a minimum of initial communication needed, and furthermore communication needs to occur between the parties involved. Moreover, the higher the lack of cohesion between the parties, the higher the need for communication and this applies especially when multiple organizations are involved in the project.

The triple constraint of Project Management between scope, cost, and time, respectively on quality has an important impact on the project. Resources need to be available when the project needs them and, especially on the partner side, only when they are needed. The implementation project to be feasible for the partner, its resources must work on several projects in parallel or the timing must be perfect, that no waiting times are involved, respectively the effort is concentrated only when needed. Such precision is possible maybe at project’s beginning, though the further the project evolves, the more challenging becomes the coordination of resources. Similar considerations apply to the customer as well.

Thus, a more realistic expectation is to have resources available only at certain points in time, and the resources should be capable of juggling between projects, respectively between project and other activities. Prioritizing is a must, and sometimes the operations or other projects have higher priority. When the time is not available, resources need to compromise by reducing the level of quality.

On the other side, it would be great if most of the effort could be concentrated at the beginning of the project, the later interactions being minimal.

Previous <<||>> Next

19 October 2022

🌡Performance Management: Mastery (Part II: First Time Right - The Aim toward Operational Excellence)

Performance Management Series

Rooted in Six Sigma methodology as a step toward operational excellence, First Time Right (FTR) implies that any procedure is performed in the right manner the first time and every time. It equates to minimizing the waste in its various forms (inventory, motion, overprocessing, overproduction, waiting, transportation, defects). Like many quality concepts from the manufacturing industry, the concept was transported in the software development process as principle, process, goal and/or metric. Thus, it became part of Software Engineering, Project Management, Data Science, and any other similar endeavors whose outcome results in software products.

Besides the quality aspect, FTR is rooted also in the economic imperative – the need to achieve something in the minimum amount of time with the minimum of effort. It’s about being efficient in delivering a product or achieving a given target. It can be associated with continuous improvement, learning and mastery, the aim being to encompass FTR as part of organization’s culture.

Even if not explicitly declared, FTR lurks in each task planned. It seems that it became common practice to plan with the FTR in mind, however between this theoretical aim and practice there’s as usual an important gap. Unfortunately, planners, managers and even tasks' performers often forget that mistakes are made, that several iterations are needed to get the job done. It starts with the communication between people in clarifying the requirements and ends with the formal sign off. All the deviations from the FTR add up in the deviations between expected and actual effort, though probably more important are the deviations from the plan and all the consequences deriving from it. Especially in complex projects this adds up into a spiral of issues that can easily reinforce themselves.

Many of the jobs that imply creativity, innovation, research or exploration require at least several iterations to get the job done and this is independent of participants’ professionalism and experience. Moreover, the more quality one needs, the higher the effort, the 80/20 being sometimes a good approximation of the effort needed. In extremis, aiming for perfection instead of excellence can make certain tasks a never-ending story.

Achieving FTR requires practice - the more novelty, the higher the complexity, the communication or the synchronization needs, the more practice is needed. It starts with the individual to master the individual tasks and ends with the team, where communication, synchronization and other aspects need to be considered. The practice is usually achieved on hands-on work as part of the daily duties, project work, and so on. Unfortunately, it’s based primarily on individual experience, and seldom groomed in advance, as preparation for future tasks. That’s why sometimes when efficiency is needed in performing critical complex tasks, one also needs to consider the learning curve in achieving the required quality.

Of course, many organizations demand from job applicants experience and, when possible, they hire people with experience, however the diversity, complexity and changing nature of tasks require further practice. This aspect is somehow recognized in the implementation in organizations of the various forms of DevOps, though how many organizations adopt it and enforce it on a regular basis? Moreover, a major requirement of nowadays businesses is to be agile, and besides the mere application of methodologies, being agile means to have also a FTR mindset.

FTR starts with the wish for mastery at individual and team level and, with the right management attention, by allocating time for learning, self-development in the important areas, providing relevant feedback and building an infrastructure for knowledge sharing and harnessing, FTR can become part of organization’s culture. It’s up to each of us to do it!

01 February 2021

📦Data Migrations (DM): Quality Assurance (Part III: Quality Acceptance Criteria III)

Data Migrations Series

Repeatability

Repeatability is the degree with which a DM can be repeated and obtain consistent results between repetitions. Even if a DM is supposed to be a one-time activity for a project, to guarantee a certain level of quality it’s important to consider several iterations in which the data requirements are refined and made sure that the data can be imported as needed into the target system(s). Considered as a process, as long the data and the rules haven’t changed, the results should be the same or have the expected level of deviation from expectations.

This requirement is important especially for the data migrated during UAT and Go-Live, time during which the input data and rules need to remain frozen (even if small changes in the data can still occur). In fact, that’s the role of UAT – to assure that the data have the expected quality and when compared to the previous dry-run, that it attains the expected level of consistency.

Reusability

Reusability is the degree to which the whole solution, parts of the logic or data can be reused for multiple purposes. Master data and the logic associated with them have high reusability potential as they tend to be referenced by multiple entities.

Modularity

Modularity is the degree to which a solution is composed of discrete components such that a change to one component has minimal impact on other components. It applies to the solution itself but also to the degree to which the logic for the various entities is partitioned so to assure a minimal impact.

Partitionability

Partitionability is the degree to which data or logic can be partitioned to address the various requirements. Despite the assurance that the data will be migrated only once, in practice this assumption can be easily invalidated. It’s enough to increase the system freeze by a few days and/or to have transaction data that suddenly requires master data not considered. Even if the deltas can be migrated in system manually, it’s probably recommended to migrate them using the same logic. Moreover, the performing of incremental loads can be a project requirement.

Data might need to be partitioned into batches to improve processing’s performance. Partitioning the logic based on certain parameters (e.g. business unit, categorical values) allows more flexibility in handling other requirements (e.g. reversibility, performance, testability, reusability).

Performance

Performance refers to the degree a piece of software can process data into an amount of time considered as acceptable for the business. It can vary with the architecture and methods used, respectively data volume, veracity, variance, variability, or quality.

Performance is a critical requirement for a DM, especially when considering the amount of time spent on executing the logic during development, tests and troubleshooting, as well for other activities. Performance is important during dry-runs but more important during Go-Live, as it equates with a period during which the system(s) are not available for the users. Upon case, a few hours of delays can have an important impact on the business. In extremis, the delays can sum up to days.

Predictability

Predictability is the degree to which the results and behavior of a solution, respectively the processes involve are predictable based on the design, implementation or other factors considered (e.g. best practices, methodology used, experience, procedures and processes). Highly predictable solutions are desirable, though reaching the required level of performance and quality can be challenging.

The results from the dry-runs can offer an indication on whether the data migrated during UAT and Go-Live provide a certain level of assurance that the DM will be a success. Otherwise, an additional dry-run should be planned during UAT, if the schedule allows it.

Previous Post <> Nest Post

20 May 2020

💼Project Management: Project Planning (Part V: Some Thoughts on Planning II)

A project’s dependency on resources’ (average) utilization time (UT) and quality expectations expressed as a quality factor (QF) doesn’t come as a surprise, as hopefully one is acquainted with project’s triangle which reflects the dependency between scope, cost and time in respect to quality. Even if this dependency is intuitive, it’s difficult to express it in numbers and study the way it affects the project. That was the purpose of the model built previously.

From the respective model there are a few things to ponder. First, it’s a utopia to plan with 90% UT, unless one is really sure that the resources have enough work to bring the idle time close to zero. A single person can achieve maybe a 90% UT if he works alone on the project, though even then there are phases in which the input or feedback from other people is necessary. The more people involved into the project and the higher the dependency between their activities, the higher the chances that the (average) UT will decrease considerably.

When in addition there’s also a geographical or organizational boundary between team members, the UT will decrease even more. In consequence, in big projects like ERP implementations the team members from customer and vendor side are allocated fully to the project; when this is not possible, then on the vendor side the consultants need to be involved in at least two projects to cover the idle time. Of course, with good planning, communication, and awareness of the work ahead one can try minimizing the idle time, though that’s less likely to happen.

Probably, a better idea would be planning with 75% or even 60% UT though the values depend on team's experience in handling similar projects. If the team members are involved also in operational activities or other projects, then a 50% UT is more realistic.

Secondly, in the previous post was considered in respect to quality the 80%-20% rule which applies to the various deliverables, though the rule has a punctual character. Taken on the average the rule is somehow attenuated. Therefore, in the model was considered a sprung between factors of 1 to 2 with a step of 0,25 for each 5% quality increase. It's needed to prove whether the values are realistic and how much they depend on project's characteristics.

On the other side, quality is difficult to quantify, and 100% quality is hypothetical. One discusses in theory about 3 sigma (the equivalent of 93,3 accuracy) or 4 sigma (99,4 accuracy) in respect to the number of errors found in the code, though from there on everything is fuzzy. In software projects each decision has the potential of leading to an error, and there’s lot of interpretability as long there’s no fix basis against to compare the deviations. One needs precise and correct specification for that.

I think that one should target in a first phase 80% quality (on average) and further build from there, try to improve the quality iteratively as the project goes on and as lessons are learned. In other words, a project plan, a concept, a design document doesn’t need to be perfect from the beginning but should be good enough to allow working with it. One can detail them as progress is made into the project, and hopefully their quality should converge to a value that is acceptable for the business.

Thirdly, in case a planning tool was used, one can use the model backwards to roughly prove timeline’s feasibility, dividing the planned effort by the estimated effort and the number of resources involved to identify the implied utilization time.

19 May 2020

💼Project Management: Project Planning (Part IV: Some Thoughts on Planning I)

One of the issues in Project Management (PM) planning is that the planner idealizes a resource and activities performed by it much like a machine. Unlike machines whose uptime can approach 100%, a human resource can work at most 90% of the available time (aka utilization time), the remaining 10% being typically associated with interruptions – internal emails and meetings, casual communications, pauses, etc. For resources split between projects or operations the utilization time can be at most 70%, however a realistic value is in general between 40% and 60% on average. What does it mean this for a project?

So, if a resource has a volume of work W, the amount of time needed to complete the work would be at best W/UT, where UT is the utilization time of the respective resource. “At best” because in each project there are additional idle time resulted from waste related activities – waiting for sign off, for information, for other resource to complete the time, etc.

The utilization time is not the only factor to consider. Upon case, the delivered work can reach maybe on average 80% of the expected quality. This applies to documentation and concepts as well for written code, bug testing and other project activities. To reach in the range of 100% one more likely will need 4 times of the effort associated with reaching 80% of the expected quality, however this value is dependent also on people’s professionalism and the degree with which the requirements were understood and possibly achievable. Therefore, the values vehiculated can be regarded as “boundary” values.

Let’s consider a quality factor (QF) which has a value of 1 for 80%, with an increase of 0,25 for each 5% of quality increase. Thus, with an initial effort estimation of 100 days, this is how the resulted effort modifies for various UT and QF values:

Considering that a project can target between 60% and 95% UT, and between 80% and 95% quality, for an initial estimation of 100 days the actual project duration can range between 117 and 292 days, where the lowest, respectively the right bound values are more realistic.

The model is simplistic as it doesn’t reflect the nonlinear aspect of the factors involved and the dependencies existing between them. It also doesn’t reflect the maturity of an organization to handle the projects and the tasks involved. However, it can be used to increase the awareness in how the utilization time and expected quality can affect a project’s timeline, and to check on whether one’s planning is realistic.

For example, at project’s start one can target an UT of 70% and a quality of 85%, which for 100 days of estimated effort will result in about 178 days of actual effort. Now diving the value by the number of resources involved, e.g. 4, it results that the project could be finished in about 44,5 days. This value can be compared then with the actual plan in which the activities are listed.

During the project it would be useful to look on how the UT changed and by how much, to understand the impact the change has on the project. For example, a decrease of 5% in utilization time can delay the project with 2,5 days which is not much, though for a project of 1000 days with talk already about one month. Same, it will be helpful to check how much the quality deviated from the expectation, because a decrease in quality by 5% can result in an additional effort of extra 8 days, which for 1000 days would mean almost 4 months of delay.

16 July 2019

🧱IT: Quality of Service [QoS] (Definitions)

"The guaranteed performance of a network connection." (Tom Petrocelli, "Data Protection and Information Lifecycle Management", 2005)

"QoS (Quality of Service) is a metric for quantifying desired or delivered degree of service reliability, priority, and other measures of interest for its quality." (Bo Leuf, "The Semantic Web: Crafting infrastructure for agency", 2006)

"a criterion of performance of a service or element, such as the worst-case execution time for an operation." (Bruce P Douglass, "Real-Time Agility: The Harmony/ESW Method for Real-Time and Embedded Systems Development", 2009)

"The QoS describes the non-functional aspects of a service such as performance." (Martin Oberhofer et al, "The Art of Enterprise Information Architecture", 2010)

"QoS (Quality of Service) Networking technology that enables network administrators to manage bandwidth and give priority to desired types of application traffic as it traverses the network." (Mark Rhodes-Ousley, "Information Security: The Complete Reference" 2nd Ed., 2013)

"A negotiated contract between a user and a network provider that renders some degree of reliable capacity in the shared network." (Gartner)

"Quality of service (QoS) is the description or measurement of the overall performance of a service, especially in terms of the user’s experience. Typically it is used in reference to telephony or computer networks, or to online and cloud-hosted services." (Barracuda) [source]

"The measurable end-to-end performance properties of a network service, which can be guaranteed in advance by a Service Level Agreement between a user and a service provider, so as to satisfy specific customer application requirements. Note: These properties may include throughput (bandwidth), transit delay (latency), error rates, priority, security, packet loss, packet jitter, etc." (CNSSI 4009-2015)

13 July 2019

🧱IT: Service Level Agreement [SLA] (Definitions)

"A signed agreement of system service requirements between two parties (such as your company and an ASP or between your department and end users) that defines the guidelines, response times, actions, and so on, that will be adhered to for the life of the agreement." (Allan Hirt et al, "Microsoft SQL Server 2000 High Availability", 2004)

"A contract with a service provider, be it an internal IT organization, application service provider, or outsourcer, that specifies discrete reliability and availability requirements for an outsourced system. An SLA might also include other requirements such as support of certain technology standards or data volumes. An outsourcer’s failure to adhere to the terms laid out in an SLA could result in financial penalties." (Evan Levy & Jill Dyché, "Customer Data Integration", 2006)

"A formal negotiated agreement between two parties. It is a contract that exists between customers and their service provider, or between service providers. It records the common understanding about services, priorities, responsibilities, guarantees, and so on, with the main purpose to agree on the level of service." (Tilak Mitra et al, "SOA Governance", 2008)

"An agreement between a customer and a product or service provider that defines conditions under which the provider will offer support or additional services to the customer, and what level of services will be offered under each of those conditions." (Steven Haines, "The Product Manager's Desk Reference", 2008)

"An agreement between a service provider and a service recipient that formally defines the levels of service that are to be provided." (David G Hill, "Data Protection: Governance, Risk Management, and Compliance", 2009)

"A formal negotiated agreement between two parties that usually records the common understanding about priorities, responsibilities, and warranties, with the main purpose of agreeing on the quality of the service. For example, an SLA may specify the levels of availability, serviceability, performance, operation, or other attributes of the service (such as billing and even penalties in the case of violations of the SLA)." (David Lyle & John G Schmidt, "Lean Integration", 2010)

"A written legal contract between a service provider and client wherein the service provider guarantees a minimum level of service." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

"A contracted guarantee of service delivery for a program, transaction, service, or workload." (Craig S Mullins, "Database Administration", 2012)

"The part of a contract between two parties that outlines the delivery of services within defined timeframes." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A statement to customers or the user community about the service the IT department will provide. It can refer to a variety of metrics, such as performance, up-time, resolution time, and so on." (Bill Holtsnider & Brian D Jaffe, "IT Manager's Handbook" 3rd Ed., 2012)

"An agreement between an IT service provider and a customer to provide a specific level of reliability for a service. It stipulates performance expectations such as minimum uptime and maximum downtime levels. Many SLAs include monetary penalties if the IT service provider does not provide the service as promised." (Darril Gibson, "Effective Help Desk Specialist Skills", 2014)

"The service or maintenance contract that states the explicit levels of support, response time windows or ranges, escalation procedures in the event of a persistent problem, and possible penalties for nonconformance in the event the vendor does not meet its contractual obligations." (Robert F Smallwood, "Information Governance: Concepts, Strategies, and Best Practices", 2014)

"A contract for formally defined services. Particular aspects of the service (scope, quality, responsibilities) are agreed between the service provider and the service user. A common feature of an SLA is a contracted delivery time of the service or performance." (Thomas C Wilson, "Value and Capital Management", 2015)

"A portion of a service contract that promises specific levels of service." (Weiss, "Auditing IT Infrastructures for Compliance" 2nd Ed, 2015)

"A contract between a service provider (either internal or external) and the end user that defines the level of service expected from the service provider." (Project Management Institute, "A Guide to the Project Management Body of Knowledge (PMBOK® Guide)", 2017)

20 November 2018

🔭Data Science: Qualitative vs Quantitative (Just the Quotes)

"To us […] the only acceptable point of view appears to be the one that recognizes both sides of reality - the quantitative and the qualitative, the physical and the psychical - as compatible with each other, and can embrace them simultaneously […] It would be most satisfactory of all if physis and psyche (i.e., matter and mind) could be seen as complementary aspects of the same reality." (Wolfgang Pauli', "The Influence of Archetypal Ideas on the Scientific Theories of Kepler", [Lecture at the Psychological Club of Zurich], 1948)

"A model is a qualitative or quantitative representation of a process or endeavor that shows the effects of those factors which are significant for the purposes being considered. A model may be pictorial, descriptive, qualitative, or generally approximate in nature; or it may be mathematical and quantitative in nature and reasonably precise. It is important that effective means for modeling be understood such as analog, stochastic, procedural, scheduling, flow chart, schematic, and block diagrams." (Harold Chestnut, "Systems Engineering Tools", 1965)

"As is used in connection with systems engineering, a model is a qualitative or quantitative representation of a process or endeavor that shows the effects of those factors which are significant for the purposes being considered. Modeling is the process of making a model. Although the model may not represent the actual phenomenon in all respects, it does describe the essential inputs, outputs, and internal characteristics, as well as provide an indication of environmental conditions similar to those of actual equipment." (Harold Chestnut, "Systems Engineering Tools", 1965)

"In the long run, qualitative changes always outweigh quantitative ones. Quantitative predictions of economic and social trends are made obsolete by qualitative changes in the rules of the game. Quantitative predictions of technological progress are made obsolete by unpredictable new inventions. I am interested in the long run, the remote future, where quantitative predictions are meaningless. The only certainty in that remote future is that radically new things will be happening." (Freeman J Dyson, "Disturbing the Universe", 1979)

"[…] the meaning of the word 'solve' has undergone a series of major changes. First that word meant 'find a formula'. Then its meaning changed to 'find approximate numbers'. Finally, it has in effect become 'tell me what the solutions look like'. In place of quantitative answers, we seek qualitative ones." (Ian Stewart, "Nature's Numbers: The unreal reality of mathematics", 1995)

"Quantify. If whatever it is you’re explaining has some measure, some numerical quantity attached to it, you’ll be much better able to discriminate among competing hypotheses. What is vague and qualitative is open to many explanations." (Carl Sagan, "The Demon-Haunted World: Science as a Candle in the Dark", 1995)

"Quantitative knowing is dependent on qualitative knowledge [...] In quantitative data analysis, numbers map onto aspects of reality. Numbers themselves are meaningless unless the data analyst understands the mapping process and the nexus of theory and categorization in which objects under study are conceptualized." John T Behrens, "Principles and Procedures of Exploratory Data Analysis", 1997)

"Modeling, in a general sense, refers to the establishment of a description of a system (a plant, a process, etc.) in mathematical terms, which characterizes the input-output behavior of the underlying system. To describe a physical system […] we have to use a mathematical formula or equation that can represent the system both qualitatively and quantitatively. Such a formulation is a mathematical representation, called a mathematical model, of the physical system." (Guanrong Chen & Trung Tat Pham, "Introduction to Fuzzy Sets, Fuzzy Logic, and Fuzzy Control Systems", 2001)

"Reductionism argues that from scientific theories which explain phenomena on one level, explanations for a higher level can be deduced. Reality and our experience can be reduced to a number of indivisible basic elements. Also qualitative properties are possible to reduce to quantitative ones." (Lars Skyttner, "General Systems Theory: Ideas and Applications", 2001)

"As every bookie knows instinctively, a number such as reliability - a qualitative rather than a quantitative measure - is needed to make the valuation of information practically useful." (Hans Christian von Baeyer, "Information, The New Language of Science", 2003)

"In order to understand how mathematics is applied to understanding of the real world it is convenient to subdivide it into the following three modes of functioning: model, theory, metaphor. A mathematical model describes a certain range of phenomena qualitatively or quantitatively. […] A (mathematical) metaphor, when it aspires to be a cognitive tool, postulates that some complex range of phenomena might be compared to a mathematical construction." (Yuri I Manin," Mathematics as Metaphor: Selected Essays of Yuri I. Manin", 2007)

"Our culture, obsessed with numbers, has given us the idea that what we can measure is more important than what we can't measure. Think about that for a minute. It means that we make quantity more important than quality." (Donella Meadows, "Thinking in Systems: A Primer", 2008)

"A commonly accepted principle of systems dynamics is that a quantitative change, beyond a critical point, results in a qualitative change. Accordingly, a difference in degree may become a difference in kind. This doesn't mean that an increased quantity of a given variable will bring a qualitative change in the variable itself. However, when the state of a system depends on a set of variables, a quantitative change in one variable beyond the inflection point will result in a change of phase in the state of the system. This change is a qualitative one, representing a whole new set of relationships among the variables involved." (Jamshid Gharajedaghi, "Systems Thinking: Managing Chaos and Complexity A Platform for Designing Business Architecture" 3rd Ed., 2011)

"Whether information comes in a quantitative or qualitative flavor is not as important as how you use it. [...] The key to making a good forecast […] is not in limiting yourself to quantitative information. Rather, it’s having a good process for weighing the information appropriately. […] collect as much information as possible, but then be as rigorous and disciplined as possible when analyzing it. [...] Many times, in fact, it is possible to translate qualitative information into quantitative information." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"For although it is certainly true that quantitative measurements are of great importance, it is a grave error to suppose that the whole of experimental physics can be brought under this heading. We can start measuring only when we know what to measure: qualitative observation has to precede quantitative measurement, and by making experimental arrangements for quantitative measurements we may even eliminate the possibility of new phenomena appearing." (Heinrich B G Casimir)

27 December 2017

🗃️Data Management: Data Quality (Just the Quotes)

"[...] it is a function of statistical method to emphasize that precise conclusions cannot be drawn from inadequate data." (Egon S Pearson & H Q Hartley, "Biometrika Tables for Statisticians" Vol. 1, 1914)

"Not even the most subtle and skilled analysis can overcome completely the unreliability of basic data." (Roy D G Allen, "Statistics for Economists", 1951)

"The enthusiastic use of statistics to prove one side of a case is not open to criticism providing the work is honestly and accurately done, and providing the conclusions are not broader than indicated by the data. This type of work must not be confused with the unfair and dishonest use of both accurate and inaccurate data, which too commonly occurs in business. Dishonest statistical work usually takes the form of: (1) deliberate misinterpretation of data; (2) intentional making of overestimates or underestimates; and (3) biasing results by using partial data, making biased surveys, or using wrong statistical methods." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1951)

"Data are of high quality if they are fit for their intended use in operations, decision-making, and planning." (Joseph M Juran, 1964)

"There is no substitute for honest, thorough, scientific effort to get correct data (no matter how much it clashes with preconceived ideas). There is no substitute for actually reaching a correct chain of reasoning. Poor data and good reasoning give poor results. Good data and poor reasoning give poor results. Poor data and poor reasoning give rotten results." (Edmund C Berkeley, "Computers and Automation", 1969)

"Detailed study of the quality of data sources is an essential part of applied work. [...] Data analysts need to understand more about the measurement processes through which their data come. To know the name by which a column of figures is headed is far from being enough." (John W Tukey, "An Overview of Techniques of Data Analysis, Emphasizing Its Exploratory Aspects", 1982)

"We have found that some of the hardest errors to detect by traditional methods are unsuspected gaps in the data collection (we usually discovered them serendipitously in the course of graphical checking)." (Peter Huber, "Huge data sets", Compstat '94: Proceedings, 1994)

"Data obtained without any external disturbance or corruption are called clean; noisy data mean that a small random ingredient is added to the clean data." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"Probability theory is a serious instrument for forecasting, but the devil, as they say, is in the details - in the quality of information that forms the basis of probability estimates." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"Unfortunately, just collecting the data in one place and making it easily available isn’t enough. When operational data from transactions is loaded into the data warehouse, it often contains missing or inaccurate data. How good or bad the data is a function of the amount of input checking done in the application that generates the transaction. Unfortunately, many deployed applications are less than stellar when it comes to validating the inputs. To overcome this problem, the operational data must go through a 'cleansing' process, which takes care of missing or out-of-range values. If this cleansing step is not done before the data is loaded into the data warehouse, it will have to be performed repeatedly whenever that data is used in a data mining operation." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"If the data is usually bad, and you find that you have to gather some data, what can you do to do a better job? First, recognize what I have repeatedly said to you, the human animal was not designed to be reliable; it cannot count accurately, it can do little or nothing repetitive with great accuracy. [...] Second, you cannot gather a really large amount of data accurately. It is a known fact which is constantly ignored. It is always a matter of limited resources and limited time. [...] Third, much social data is obtained via questionnaires. But it a well documented fact the way the questions are phrased, the way they are ordered in sequence, the people who ask them or come along and wait for them to be filled out, all have serious effects on the answers." (Richard Hamming, "The Art of Doing Science and Engineering: Learning to Learn", 1997)

"Blissful data consist of information that is accurate, meaningful, useful, and easily accessible to many people in an organization. These data are used by the organization’s employees to analyze information and support their decision-making processes to strategic action. It is easy to see that organizations that have reached their goal of maximum productivity with blissful data can triumph over their competition. Thus, blissful data provide a competitive advantage." (Margaret Y Chu, "Blissful Data", 2004)

"Let’s define dirty data as: ‘… data that are incomplete, invalid, or inaccurate’. In other words, dirty data are simply data that are wrong. […] Incomplete or inaccurate data can result in bad decisions being made. Thus, dirty data are the opposite of blissful data. Problems caused by dirty data are significant; be wary of their pitfalls." (Margaret Y Chu, "Blissful Data", 2004)

"Processes must be implemented to prevent bad data from entering the system as well as propagating to other systems. That is, dirty data must be intercepted at its source. The operational systems are often the source of informational data; thus dirty data must be fixed at the operational data level. Implementing the right processes to cleanse data is, however, not easy." (Margaret Y Chu, "Blissful Data", 2004)

"Equally critical is to include data quality definition and acceptable quality benchmarks into the conversion specifications. No product design skips quality specifications. including quality metrics and benchmarks. Yet rare data conversion follows suit. As a result, nobody knows how successful the conversion project was until data errors get exposed in the subsequent months and years. The solution is to perform comprehensive data quality assessment of the target data upon conversion and compare the results with pre-defined benchmarks." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"Much data in databases has a long history. It might have come from old 'legacy' systems or have been changed several times in the past. The usage of data fields and value codes changes over time. The same value in the same field will mean totally different thing in different records. Knowledge or these facts allows experts to use the data properly. Without this knowledge, the data may bc used literally and with sad consequences. The same is about data quality. Data users in the trenches usually know good data from bad and can still use it efficiently. They know where to look and what to check. Without these experts, incorrect data quality assumptions are often made and poor data quality becomes exposed." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"The big part of the challenge is that data quality does not improve by itself or as a result of general IT advancements. Over the years, the onus of data quality improvement was placed on modern database technologies and better information systems. [...] In reality, most IT processes affect data quality negatively, Thus, if we do nothing, data quality will continuously deteriorate to the point where the data will become a huge liability." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"While we might attempt to identify and correct most data errors, as well as try to prevent others from entering the database, the data quality will never be perfect. Perfection is practically unattainable in data quality as with the quality of most other products. In truth, it is also unnecessary since at some point improving data quality becomes more expensive than leaving it alone. The more efficient our data quality program, the higher level of quality we will achieve- but never will it reach 100%. However, accepting imperfection is not the same as ignoring it. Knowledge of the data limitations and imperfections can help use the data wisely and thus save time and money, The challenge, of course, is making this knowledge organized and easily accessible to the target users. The solution is a comprehensive integrated data quality meta data warehouse." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"Achieving a high level of data quality is hard and is affected significantly by organizational and ownership issues. In the short term, bandaging problems rather than addressing the root causes is often the path of least resistance." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Communicate loudly and widely where there are data quality problems and the associated risks with deploying BI tools on top of bad data. Also advise the different stakeholders on what can be done to address data quality problems - systematically and organizationally. Complaining without providing recommendations fixes nothing." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Data quality is such an important issue, and yet one that is not well understood or that excites business users. It’s often perceived as being a problem for IT to handle when it’s not: it’s for the business to own and correct." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Depending on the extent of the data quality issues, be careful about where you deploy BI. Without a reasonable degree of confidence in the data quality, BI should be kept in the hands of knowledge workers and not extended to frontline workers and certainly not to customers and suppliers. Deploy BI in this limited fashion as data quality issues are gradually exposed, understood, and ultimately, addressed. Don’t wait for every last data quality issue to be resolved; if you do, you will never deliver any BI capabilities, business users will never see the problem, and quality will never improve." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"The data architecture is the most important technical aspect of your business intelligence initiative. Fail to build an information architecture that is flexible, with consistent, timely, quality data, and your BI initiative will fail. Business users will not trust the information, no matter how powerful and pretty the BI tools. However, sometimes it takes displaying that messy data to get business users to understand the importance of data quality and to take ownership of a problem that extends beyond business intelligence, to the source systems and to the organizational structures that govern a company’s data." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Many new data scientists tend to rush past it to get their data into a minimally acceptable state, only to discover that the data has major quality issues after they apply their (potentially computationally intensive) algorithm and get a nonsense answer as output. (Sandy Ryza, "Advanced Analytics with Spark: Patterns for Learning from Data at Scale", 2009)

"Access to more information isn’t enough - the information needs to be correct, timely, and presented in a manner that enables the reader to learn from it. The current network is full of inaccurate, misleading, and biased information that often crowds out the valid information. People have not learned that 'popular' or 'available' information is not necessarily valid." (Gene Spafford, 2010)

"Are data quality and data governance the same thing? They share the same goal, essentially striving for the same outcome of optimizing data and information results for business purposes. Data governance plays a very important role in achieving high data quality. It deals primarily with orchestrating the efforts of people, processes, objectives, technologies, and lines of business in order to optimize outcomes around enterprise data assets. This includes, among other things, the broader cross-functional oversight of standards, architecture, business processes, business integration, and risk and compliance. Data governance is an organizational structure that oversees the compliance and standards of enterprise data." (Neera Bhansali, "Data Governance: Creating Value from Information Assets", 2014)

"Data governance is about putting people in charge of fixing and preventing data issues and using technology to help aid the process. Any time data is synchronized, merged, and exchanged, there have to be ground rules guiding this. Data governance serves as the method to organize the people, processes, and technologies for data-driven programs like data quality; they are a necessary part of any data quality effort." (Neera Bhansali, "Data Governance: Creating Value from Information Assets", 2014)

"Having data quality as a focus is a business philosophy that aligns strategy, business culture, company information, and technology in order to manage data to the benefit of the enterprise. Data quality is an elusive subject that can defy measurement and yet be critical enough to derail a single IT project, strategic initiative, or even an entire company." (Neera Bhansali, "Data Governance: Creating Value from Information Assets", 2014)

"Accuracy and coherence are related concepts pertaining to data quality. Accuracy refers to the comprehensiveness or extent of missing data, performance of error edits, and other quality assurance strategies. Coherence is the degree to which data - item value and meaning are consistent over time and are comparable to similar variables from other routinely used data sources." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"How good the data quality is can be looked at both subjectively and objectively. The subjective component is based on the experience and needs of the stakeholders and can differ by who is being asked to judge it. For example, the data managers may see the data quality as excellent, but consumers may disagree. One way to assess it is to construct a survey for stakeholders and ask them about their perception of the data via a questionnaire. The other component of data quality is objective. Measuring the percentage of missing data elements, the degree of consistency between records, how quickly data can be retrieved on request, and the percentage of incorrect matches on identifiers (same identifier, different social security number, gender, date of birth) are some examples." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"When we find data quality issues due to valid data during data exploration, we should note these issues in a data quality plan for potential handling later in the project. The most common issues in this regard are missing values and outliers, which are both examples of noise in the data." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"A popular misconception holds that the era of Big Data means the end of a need for sampling. In fact, the proliferation of data of varying quality and relevance reinforces the need for sampling as a tool to work efficiently with a variety of data, and minimize bias. Even in a Big Data project, predictive models are typically developed and piloted with samples." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Metadata is the key to effective data governance. Metadata in this context is the data that defines the structure and attributes of data. This could mean data types, data privacy attributes, scale, and precision. In general, quality of data is directly proportional to the amount and depth of metadata provided. Without metadata, consumers will have to depend on other sources and mechanisms." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"The quality of data that flows within a data pipeline is as important as the functionality of the pipeline. If the data that flows within the pipeline is not a valid representation of the source data set(s), the pipeline doesn’t serve any real purpose. It’s very important to incorporate data quality checks within different phases of the pipeline. These checks should verify the correctness of data at every phase of the pipeline. There should be clear isolation between checks at different parts of the pipeline. The checks include checks like row count, structure, and data type validation." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Are your insights based on data that is accurate and reliable? Trustworthy data is correct or valid, free from significant defects and gaps. The trustworthiness of your data begins with the proper collection, processing, and maintenance of the data at its source. However, the reliability of your numbers can also be influenced by how they are handled during the analysis process. Clean data can inadvertently lose its integrity and true meaning depending on how it is analyzed and interpreted." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"First, from an ethos perspective, the success of your data story will be shaped by your own credibility and the trustworthiness of your data. Second, because your data story is based on facts and figures, the logos appeal will be integral to your message. Third, as you weave the data into a convincing narrative, the pathos or emotional appeal makes your message more engaging. Fourth, having a visualized insight at the core of your message adds the telos appeal, as it sharpens the focus and purpose of your communication. Fifth, when you share a relevant data story with the right audience at the right time (kairos), your message can be a powerful catalyst for change." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"The one unique characteristic that separates a data story from other types of stories is its fundamental basis in data. [...] The building blocks of every data story are quantitative or qualitative data, which are frequently the results of an analysis or insightful observation. Because each data story is formed from a collection of facts, each one represents a work of nonfiction. While some creativity may be used in how the story is structured and delivered, a true data story won’t stray too far from its factual underpinnings. In addition, the quality and trustworthiness of the data will determine how credible and powerful the data story is." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"Data is dirty. Let's just get that out there. How is it dirty? In all sorts of ways. Misspelled text values, date format problems, mismatching units, missing values, null values, incompatible geospatial coordinate formats, the list goes on and on." (Ben Jones, "Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations", 2020)

"Bad data makes bad models. Bad models instruct people to make ineffective or harmful interventions. Those bad interventions produce more bad data, which is fed into more bad models." (Cory Doctorow, "Machine Learning’s Crumbling Foundations", 2021)

"[...] data mesh introduces a fundamental shift that the owners of the data products must communicate and guarantee an acceptable level of quality and trustworthiness - specific to their domain - as an intrinsic characteristic of their data product. This means cleansing and running automated data integrity tests at the point of the creation of a data product." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Ensure you build into your data literacy strategy learning on data quality. If the individuals who are using and working with data do not understand the purpose and need for data quality, we are not sitting in a strong position for great and powerful insight. What good will the insight be, if the data has no quality within the model?" (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"[...] the governance function is accountable to define what constitutes data quality and how each data product communicates that in a standard way. It’s no longer accountable for the quality of each data product. The platform team is accountable to build capabilities to validate the quality of the data and communicate its quality metrics, and each domain (data product owner) is accountable to adhere to the quality standards and provide quality data products." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Bad data is costly to fix, and it’s more costly the more widespread it is. Everyone who has accessed, used, copied, or processed the data may be affected and may require mitigating action on their part. The complexity is further increased by the fact that not every consumer will “fix” it in the same way. This can lead to divergent results that are divergent with others and can be a nightmare to detect, track down, and rectify." (Adam Bellemare, "Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures", 2023)

"Data has historically been treated as a second-class citizen, as a form of exhaust or by-product emitted by business applications. This application-first thinking remains the major source of problems in today’s computing environments, leading to ad hoc data pipelines, cobbled together data access mechanisms, and inconsistent sources of similar-yet-different truths. Data mesh addresses these shortcomings head-on, by fundamentally altering the relationships we have with our data. Instead of a secondary by-product, data, and the access to it, is promoted to a first-class citizen on par with any other business service." (Adam Bellemare, "Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures", 2023)

"In truth, no one knows how much bad data quality costs a company – even companies with mature data quality initiatives in place, who are measuring hundreds of data points for their quality struggle to accurately measure quantitative impact. This is often a deal-breaker for senior leaders when trying to get approval for a budget for data quality work. Data quality initiatives often seek substantial budgets and are up against projects with more tangible benefits." (Robert Hawker, "Practical Data Quality", 2023)

"The biggest mistake that can be made in a data quality initiative is focusing on the wrong data. If you fix data that does not impact a critical business process or drive important decisions, your initiative simply will not make the difference that you want it to." (Robert Hawker, "Practical Data Quality", 2023)

"The data should be monitored in the source, it should be corrected in the source, and it should then feed the secondary source(s) with high-quality data that can be used without workarounds. The reduction in workarounds will make the data engineers, scientists, and data visualization specialists much more productive." (Robert Hawker, "Practical Data Quality", 2023)

"The problem of bad data has existed for a very long time. Data copies diverge as their original source changes. Copies get stale. Errors detected in one data set are not fixed in duplicate ones. Domain knowledge related to interpreting and understanding data remains incomplete, as does support from the owners of the original data." (Adam Bellemare, "Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures", 2023)

"Errors using inadequate data are much less than those using no data at all." (Charles Babbage)

SQL Troubles

Pages

03 May 2025

🧭Business Intelligence: Perspectives (Part 31: More on Data Visualization)

08 March 2025

#️⃣Software Engineering: Programming (Part XVI: The Software Quality Perspective and AI)

15 February 2025

🧭Business Intelligence: Perspectives (Part 27: A Tale of Two Cities II)

26 January 2025

🧭Business Intelligence: Perspectives (Part 25: Grounding the Roots)

22 March 2024

🧭Business Intelligence: Perspectives (Part 9: Dashboards Are Dead & Other Crap)

17 February 2024

🧭Business Intelligence: A Software Engineer's Perspective I (Houston, we have a Problem!)

03 October 2023

🧮ERP: Implementations (Part VIII: It’s a Matter of Complexity)

19 October 2022

🌡Performance Management: Mastery (Part II: First Time Right - The Aim toward Operational Excellence)

01 February 2021

📦Data Migrations (DM): Quality Assurance (Part III: Quality Acceptance Criteria III)

20 May 2020

💼Project Management: Project Planning (Part V: Some Thoughts on Planning II)

19 May 2020

💼Project Management: Project Planning (Part IV: Some Thoughts on Planning I)

16 July 2019

🧱IT: Quality of Service [QoS] (Definitions)

13 July 2019

🧱IT: Service Level Agreement [SLA] (Definitions)

20 November 2018

🔭Data Science: Qualitative vs Quantitative (Just the Quotes)

27 December 2017

🗃️Data Management: Data Quality (Just the Quotes)

About Me