
10 November 2012

📦Data Migrations (DM): Data Quality’s Perspective I (A Bird’s-Eye View)

Data Migration
Data Migrations Series

Imagine you just finished a Data Migration (DM) project. Everything went smoothly, the data were loaded into the new system with a minimal number of issues, inherent sometimes to such complex projects, the users started to use the new system, everybody seemed to be satisfied, and a few weeks later rumors propagate within the company at the speed of light – “the migrated data are wrong”, “the new system can’t be used”, “IT did a bad job”, “we have to go back to the previous system”, and so on. The panic spreads, a few heads fall, the business tries to revert to the old system, but there’s a lot of new data available only in the new system and it’s not so trivial to move the data back; in the meantime other rumors appear, and… It’s just a scenario, but it could happen to any company if the appropriate measures are not taken at the right time. What could help a company when something like this happens? A good Plan B, aka a good Migration Fallback Plan/Policy, though that’s something nobody would like to resort to except in extreme situations.

A common approach to any type of project, a DM project included, is to identify and mitigate the risks before or during the project. That’s something I started to do a few days ago: preparing a list of the risks associated with DM projects. For this exercise I tried to remember what went wrong in similar projects I worked on and to figure out what else could go wrong. Some online resources helped me refresh my memory too, and I think I also found two or three things I hadn’t really thought about. My attempt was primarily focused on the type of problem mentioned above – minimizing the risk of not having the right data when the new system goes live. Before jumping into the topic, I would like to sketch the bigger picture, as I perceive it.

Having the right data when a system goes live primarily means having good Data Quality (DQ) in the target system after the data were migrated! As a DM is the best exemplification of the GIGO (Garbage-In, Garbage-Out) principle, in order to have good DQ in the target it is important to address DQ at the latest during the DM project. That’s essential and common sense – you can’t expect to have good data in the new system when there’s a lot of garbage in the old one. So, a DM and the Risk Management for such a project should be built around this. In fact, not having a DQ initiative or project within a DM project is one of the most important risks a company can take. Maybe in a small DM a DQ initiative isn’t necessary, though when the data are important for your company, DQ is a must! In addition, DQ assessments have to be performed in alignment with the new system, not the old one. Even if the data have good quality in the old system, the quality of your data after the DM will be judged in relation to the new system. This is a requirement that can be easily overlooked and whose implications are easily misunderstood!

Many think that DQ is a one-time activity: we do it for a DM project, we’ll have quality data, and we’ll never have to care about their quality again. Totally false! DQ has to be part of a broader strategy, call it Data Governance, Master Data Management, Data Management or any other initiative in which data play an important role. DQ is an ongoing, iterative and consolidated effort; it doesn’t end after the DM but continues for the whole data life cycle, as long as the data have value for the organization. It doesn’t help if the data have high quality when they are migrated and a few weeks or months later the overall quality of, and trust in, the data decrease considerably.

Keeping an acceptable level of DQ must be an organization’s strategy, and a culture toward DQ must be built. People need to be aware of the importance of having good quality data, and especially of the consequences of having bad quality data. DQ doesn’t concern only the owners or stewards of data, or the people working with data; it concerns the whole organization, because decisions are made based on those data, processes are changed and improved, and an organization’s performance is often judged based on data. The quality of data is a matter of perception – of how users see the quality of data in relation to their needs, and those needs change over time. Being aware of what good DQ means and what an organization’s needs are in respect to data is primarily a way of minimizing the negative perception of data, of gaining trust in data and a solid basis on which decisions can be made. Secondarily, these organizational data needs have to be addressed in a DM; they are the success factors upon which the success of a DM project is judged.

For sure, considerable costs are associated with DQ initiatives and with everything related to data, which don’t always represent a direct cost component in the products or services handled by an organization. Considering that not all data have the same importance for an organization, it makes sense to prioritize the DQ effort as a whole and the data cleaning needs in particular: the focus should be on the data with the highest impact, tackling data with lower and lower impact as time allows. An equilibrium must be found between the DQ costs and the value of data. Most probably it is more valuable to spend resources on raising people’s awareness in respect to DQ early than on cleaning data retroactively later. It also makes sense to invest in tools that help clean data using automated or semi-automated methods, though some manual/visual control needs to be in place too.

DQ and the way the problems associated with it are tackled depend largely on an organization’s internal kitchen – people, partners, organization, strategy, maturity, culture, geography, infrastructure, methodologies used, etc. What matters is how these various aspects of an organization are aligned in order to take advantage of one of the most important assets an organization has – its data! For this it is important to adopt methodologies that support DQ, aligning and tweaking them as needed, in order to make the most of your data! But before or while doing that, remember that a DM is an organization’s opportunity to change the quality of its data and its data strategy!


13 January 2010

🗄️Data Management: Data Quality (Part II: An Introduction)

Data Management

Introduction

From time to time I find myself in meetings with people blaming the poor quality of the data for the strange numbers appearing in a report, and often they are right. Sometimes you can see that without going too deep into details; other times you have to spend a considerable amount of time and effort looking over the raw data to identify the records that cause the unexpected variations. We know we have a problem with data quality, so why is nobody doing anything to improve it? Who’s responsible, in the end, for data quality? How do data quality issues appear, how can data quality be improved, and by what means? When and where must data quality be considered? Why do we have to consider data quality, and, ultimately, what is data quality? What are data quality’s characteristics or dimensions?

Without pretending to be an expert on data quality, but drawing on my own and others’ experience and understanding in dealing with it, I will try to approach the above questions in several posts. And as usual, before answering the more complex questions, we inevitably have to define first what data quality is.

What is Data Quality?

In several consulted sources, mainly books and blogs, data quality’s definition revolves around the 'fitness for use' phrase (e.g. [1], [3], [4]). The definition makes data quality highly dependent on the context in which it’s considered, because each request comes with different data quality expectations/requirements and data quality levels. Moreover, the data could have acceptable quality in the legacy system(s), the system(s) used to store the data, yet not be appropriate for use in their current form. From this observation two aspects of data quality can be delimited – intrinsic data quality, concerning the creation and storage of data, and contextual data quality, referring to the degree to which the data meet users’ needs.

Besides these two aspects, [2] considers two others: representation data quality, referring to the "degree of care taken in the presentation and organization of information for users", and accessibility data quality, the "degree of freedom that users have to use data, define or refine the manner in which information is input, processed or presented to them" [2]. These four aspects provide a good framework for categorizing data quality and, as can be seen, they target four coordinates of data – storage/creation, presentation, use and accessibility – however they ignore the fact that data could also be consumed by machines, which need to parse, understand and process the data. Therefore, a fifth aspect could be added – processing data quality, referring to the data’s readiness for consumption by machines, this aspect including the existence of metadata (e.g. data models, data mappings, etc.).


Why Data Quality?

Considering the growing dynamics and complexity of the economic landscape, nowadays an organization’s capacity to stay in business is highly dependent on the accuracy of its decision-making process, which in its turn depends on the quality of the reports used as support for decisions, and further on the quality of the data the reports are based on. If you have wrong data, you can’t expect the reports to show the correct numbers, just as you can’t expect a decision to be correct without having the actual situation correctly reflected, even if occasionally the person taking the decision has a sound understanding of the business above the data level, or a good gut feeling. It is essential for an organization to have adequate data and reports that reflect the current state of affairs at any time and at the requested level of detail.

In this context one talks about poor data quality, good data quality, high data quality, or even adequate/inadequate data quality. Paraphrasing data quality’s definition, it could be said that data are of high quality if they are 'fit for use' [1], and of poor quality otherwise, though in real life things are not black or white – there are many nuances in between, dependent on users’ perception and needs. For example, a single value out of one million records could be enough to make it impossible for a data set to be processed by a software package, and even if the data quality equates to a Six Sigma level, it’s not enough for that case. At the opposite end, a data set with a huge number of defects could be processed without problems and the outcome could be close to what is expected; it depends on the nature of the defects, which attributes are affected and how they affect the result.
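To make the point above more tangible, here is a minimal sketch (the data, the date formats and the loading routine are hypothetical, chosen only for illustration) of how a single non-conforming value among a million records can abort a strict load, even though the defect rate is far below a Six Sigma level:

```python
# Minimal sketch (hypothetical data): one malformed date among 1,000,000 values.
from datetime import datetime

dates = ["2012-11-10"] * 1_000_000
dates[421_337] = "10.11.2012"   # a single value in a non-conforming format

def load_strict(values):
    # a strict consumer rejects the whole batch on the first bad value
    return [datetime.strptime(v, "%Y-%m-%d") for v in values]

try:
    load_strict(dates)
except ValueError as e:
    print(f"Load aborted: {e}")   # one defect out of 1,000,000 stops the processing
```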

Poor data quality reduces considerably the value of data, preventing organizations from exploiting their data to the full potential, a fact that limits their strategic competitive advantage. Poor data quality has a domino effect, often involving rework, delays in production, operations and decision making, a decrease in customer satisfaction reflected in revenues, lost opportunities, and so on. Unfortunately, the costs of poor data quality are difficult to quantify; they include the cost of finding and fixing the errors, the cost of wasted effort, lost production and lost opportunities, plus the costs involved with wrong decision making. The same is true of the costs involved in achieving high data quality levels, the easiest to quantify being the costs directly tied to the data quality initiatives. It’s not in scope to detail these aspects, though it is worth mentioning them to highlight why many organizations ignore the quality of their data.

How do Data Quality issues appear?

No system is bulletproof; no matter how well it was designed, there will always be a chance for errors to appear, whether in the software itself, the people, the processes, the standards, or the procedures. A primary source of data issues is the weak validation implemented in the systems, either in the UI or in any other data entry points, along with the errors that escaped the vigilant eyes of testers and the (lack of) flexibility/robustness of the applications, of the data models in general and of the data types in particular. On a secondary but not less important plane lies the existence of, and adherence to, processes, standards and procedures – whether they are known and respected by users, and whether they cover all the scenarios occurring in the daily business activity.
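As an illustration of the kind of validation that is often missing at the data entry points, here is a minimal sketch in Python; the field names, reference values and rules are hypothetical, not taken from any particular system:

```python
import re

# Hypothetical reference list used by the sketch
KNOWN_COUNTRIES = {"DE", "US", "GB"}

def validate_customer(record: dict) -> list:
    """Return the list of validation errors; an empty list means the record is acceptable."""
    errors = []
    if not record.get("name", "").strip():
        errors.append("name is mandatory")
    if record.get("country") not in KNOWN_COUNTRIES:
        errors.append(f"unknown country code: {record.get('country')!r}")
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", record.get("email", "")):
        errors.append("email has an invalid format")
    return errors

# A record with an incomplete email address is rejected before it enters the system
print(validate_customer({"name": "ACME GmbH", "country": "DE", "email": "office@acme"}))
```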

Who’s responsible for Data Quality?

As everybody in an organization is responsible for quality, the same applies to data quality: from data entry to decision making, everybody works with data directly or indirectly, and everybody is, at least in theory, responsible for it. From a data life-cycle perspective there are the people who enter the data into the various systems, the ones who prepare the data for further usage, the ones using the data, and the ones managing the data quality initiatives (e.g. quality leaders, black belts, champions). The first category needs to take ownership of the data, maintain them adequately and stick to the processes, standards and procedures defined, and, together with the other users, is responsible for identifying the issues in the data, mitigating them and finding solutions.

The data users are maybe the broadest category, ranging from simple employees using the data for daily activities to executives. Because the data they work with are often aggregated at various levels, unless they have a good overview of the business or flexible reports that also give them access to the raw data, it will be almost impossible for them to assess the actual data quality. Often the people preparing the data are the ones discovering the issues in the data, though this usually happens by chance, when observing abnormalities in the data.

The people managing the data quality initiatives hardly work with the data themselves, though together with the executives they are responsible for shaping the organization and creating a (data) quality culture.

How can Data Quality be improved and by what means?

It takes time, resources, consolidated effort, adequate policies, and management’s involvement to reach an acceptable level of data quality. As already stressed, data quality can be improved by creating robust processes, standards and procedures and sticking to them, and by creating a culture for quality in general and for data quality in particular, though these are rather the outcome of data quality initiatives.

The first step is recognizing that data quality is a problem in the organization; the second is identifying the nature of the issues, eventually by conducting a data quality assessment. This involves defining the data quality (business) rules against which the quality of the data will be checked, and establishing an assessment, cleaning and reporting framework for this purpose. Because improvement is the word of the day, Six Sigma projects could be conducted to identify the issues’ root causes and the solutions to address them, and to monitor issue resolution and whether the solutions found fulfill the intended goals. The use of the (Lean) Six Sigma methodology allows introducing quality concepts into the organization, making people aware of the premises and implications of data quality, and in addition it offers a recipe for success.
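For illustration, such data quality (business) rules can be expressed as simple named checks run against the data to produce a basic assessment report. The sketch below is a minimal example in Python; the rule names, fields and sample records are hypothetical:

```python
# Hypothetical data quality rules expressed as named checks (predicates)
rules = {
    "customer number is filled": lambda r: bool(r.get("customer_no")),
    "price is non-negative":     lambda r: r.get("price", 0) >= 0,
    "currency is known":         lambda r: r.get("currency") in {"EUR", "USD"},
}

# Hypothetical sample records
data = [
    {"customer_no": "C001", "price": 100.0, "currency": "EUR"},
    {"customer_no": "",     "price": -5.0,  "currency": "EUR"},
    {"customer_no": "C003", "price": 20.0,  "currency": "XXX"},
]

# A simple assessment report: how many records fail each rule
for rule_name, check in rules.items():
    failed = sum(1 for r in data if not check(r))
    print(f"{rule_name}: {failed} of {len(data)} records failed")
```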

Depending on the volume of data and the number of issues found, data quality improvement might occur over several iterations, and when the volume of issues is more than the organization can handle, it makes sense to prioritize and fix first the issues that impact the organization the most, addressing the remaining issues over time. There will also be situations in which it is necessary to redesign the existing processes, modify the legacy systems to limit the data entry issues, or even start new projects in order to fill the gaps.

When must Data Quality be considered?

Data quality comes into discussion especially when the data-related issues impact the business visibly. Whether something is done to correct them depends on whether the organization has in place a data quality vision, a policy and adequate mechanisms for assessing and addressing data quality on a regular basis, or on whether the people or departments impacted by the issues take the problem into their own hands and do something about it.

Another context in which the data quality topic appears is during the migration, conversion and integration between systems, when the system(s) into which the data are supposed to be loaded, the destination system(s), come(s) with a set of rules/constraints requiring the data to be in a certain format, to match specific data types and to respect other data storage related constraints. Even if the data from the legacy system(s) are considered of high quality, in the mentioned activities the data quality is judged against the constraints imposed by the destination system.
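As an illustration, checking legacy data against the constraints imposed by a destination system can be sketched as follows; the field names, data types and length limits are hypothetical, merely standing in for the rules a real target system would impose:

```python
# Hypothetical constraints imposed by the destination system
destination_constraints = {
    "material_no": {"type": str,   "max_len": 18},
    "quantity":    {"type": float},
    "plant":       {"type": str,   "max_len": 4},
}

def check_row(row: dict) -> list:
    """Return the findings for one legacy record checked against the destination constraints."""
    findings = []
    for field, rule in destination_constraints.items():
        value = row.get(field)
        if not isinstance(value, rule["type"]):
            findings.append(f"{field}: expected {rule['type'].__name__}, got {type(value).__name__}")
        elif "max_len" in rule and len(value) > rule["max_len"]:
            findings.append(f"{field}: value '{value}' exceeds {rule['max_len']} characters")
    return findings

# A legacy record with a key that is too long and a quantity stored as text
legacy_row = {"material_no": "A-MATERIAL-NUMBER-THAT-IS-TOO-LONG", "quantity": "10", "plant": "W001"}
for finding in check_row(legacy_row):
    print(finding)
```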

What needs to be stressed is that data quality is not a one-time endeavor but a continuous and consolidated effort that can span the whole life cycle of the systems storing the data. How often must data quality assessments be conducted? It depends on data growth and perceived quality, on the organization’s data quality policy/vision and the degree to which it is respected, on the organization’s overall maturity and culture, on the availability of resources and on the different usages of the data.

Where is Data Quality considered?

Data quality is generally considered against the legacy systems, the data being cleaned directly in them, though when data quality is judged against external requests, the data could be cleaned in Excel files or in any other environment designed for this kind of task, each scenario having its pluses and minuses, and sometimes a hybrid solution between the two alternatives being preferred. Most of the time it makes sense to clean the data in the legacy system because this generally minimizes the overall effort, the data remain current and are reflected in the other systems using them.

Excel or other similar tools could prove to be better environments for cleaning the data for one-time requests, when fast mass updates are needed or when it is not intended/possible to modify the legacy data. However, the probability of making mistakes is quite high because the validation implemented in the legacy systems is no longer available or must be re-enforced, multiple copies of the same data could exist when the data need to be cleaned by multiple users, while the synchronization between legacy data and cleaned data could be problematic.

There are situations in which the data are not available in any of the systems in place and there is no adequate possibility to store the new data in them, in which case one or more systems eventually have to be extended (aka data enrichment), a solution that could be quite expensive; an alternative is to store the data in ad hoc solutions (e.g. Excel, MS Access) and merge the existing data islands (aka data integration) before loading them into the destination system(s). When dealing with a relatively small number of records, the data could be entered manually in the destination system itself, though there are all kinds of constraints that need to be considered, therefore it is safer and more appropriate to load the data as contiguous sets.


Written: Dec-2009, Last Reviewed: Mar-2024

References:
[1] Joseph M Juran & A Blanton Godfrey (1999) Juran’s Quality Handbook, 5th Ed.
[2] Coral Calero (2008) Handbook of Research on Web Information Systems Quality
[3] US Census Bureau (2006) Definition of Data Quality: Census Bureau Principle, version 1.3. [Online] Available from: http://www.census.gov/quality/P01-0_v1.3_Definition_of_Quality.pdf (Accessed: 24 December 2009).
[4] OCDQ (Obsessive-Compulsive Data Quality) Blog (2009) The Data-Information Continuum. [Online] Available from: http://www.ocdqblog.com/home/the-data-information-continuum.html (Accessed: 24 December 2009)
