"Data cleansing is dangerous mainly because data quality problems are usually complex and interrelated. Fixing one problem may create many others in the same or other related data elements." (Arkady Maydanchik, "Data Quality Assessment", 2007)
"Data quality program is a collection of initiatives with the common objective of maximizing data quality and minimizing negative impact of the bad data. [...] objective of any data quality program is to ensure that data quality docs not deteriorate during conversion and consolidation projects, Ideally, we would like to do even more and use the opportunity to improve data quality since data cleansing is much easier to perform before conversion than afterwards." (Arkady Maydanchik, "Data Quality Assessment", 2007)
"Databases rarely begin their life empty. More often the starting point in their lifecycle is a data conversion from some previously exiting data source. And by a cruel twist of fate, it is usually a rather violent beginning. Data conversion usually takes the better half of new system implementation effort and almost never goes smoothly." (Arkady Maydanchik, "Data Quality Assessment", 2007)
"[...] data conversion is the most difficult part of any system implementation. The error rate in a freshly populated new database is often an order of magnitude above that of the old system from which the data is converted. As a major source of the data problems, data conversion must be treated with the utmost respect it deserves." (Arkady Maydanchik, "Data Quality Assessment", 2007)
"Equally critical is to include data quality definition and acceptable quality benchmarks into the conversion specifications. No product design skips quality specifications. including quality metrics and benchmarks. Yet rare data conversion follows suit. As a result, nobody knows how successful the conversion project was until data errors get exposed in the subsequent months and years. The solution is to perform comprehensive data quality assessment of the target data upon conversion and compare the results with pre-defined benchmarks." (Arkady Maydanchik, "Data Quality Assessment", 2007)
"More and more data is exchanged between the systems through real-time (or near real-time) interfaces. As soon as the data enters one database, it triggers procedures necessary to send transactions to Other downstream databases. The advantage is immediate propagation of data to all relevant databases. Data is less likely to be out-of-sync. [...] The basic problem is that data is propagated too fast. There is little time to verify that the data is accurate. At best, the validity of individual attributes is usually checked. Even if a data problem can be identified. there is often nobody at the other end of the line to react. The transaction must be either accepted or rejectcd (whatever the consequences). If data is rejected, it may be lost forever!" (Arkady Maydanchik, "Data Quality Assessment", 2007)
"Much data in databases has a long history. It might have come from old 'legacy' systems or have been changed several times in the past. The usage of data fields and value codes changes over time. The same value in the same field will mean totally different thing in different records. Knowledge or these facts allows experts to use the data properly. Without this knowledge, the data may bc used literally and with sad consequences. The same is about data quality. Data users in the trenches usually know good data from bad and can still use it efficiently. They know where to look and what to check. Without these experts, incorrect data quality assumptions are often made and poor data quality becomes exposed." (Arkady Maydanchik, "Data Quality Assessment", 2007)
"The big part of the challenge is that data quality does not improve by itself or as a result of general IT advancements. Over the years, the onus of data quality improvement was placed on modern database technologies and better information systems. [...] In reality, most IT processes affect data quality negatively, Thus, if we do nothing, data quality will continuously deteriorate to the point where the data will become a huge liability." (Arkady Maydanchik, "Data Quality Assessment", 2007)
"The corporate data universe consists of numerous databases linked by countless real-time and batch data feeds. The data continuously move about and change. The databases are endlessly redesigned and upgraded, as are the programs responsible for data exchange. The typical result of this dynamic is that information systems get better, while data deteriorates. This is very unfortunate since it is the data quality that determines the intrinsic value of the data to the business and consumers. Information technology serves only as a magnifier for this intrinsic value. Thus, high quality data combined with effective technology is a great asset, but poor quality data combined with effective technology is an equally great liability." (Arkady Maydanchik, "Data Quality Assessment", 2007)
"The greatest challenge in data conversion is that actual content and structure of the source data is rarely understood. More often data transformation algorithms rely on the theoretical data definitions and data models, Since this information is usually incomplete, outdated, and incorrect, the converted data look nothing like what is expected. Thus, data quality plummets. The solution is to precede conversion with extensive data profiling and analysis. In fact, data quality after conversion is in direct (or even exponential) relation with the amount of knowledge about actual data you possess. Lack of in-depth analysis will guarantee significant loss of data quality." (Arkady Maydanchik, "Data Quality Assessment", 2007)
"The main tool of a data quality assessment professional is a data quality rule - a constraint that validates a data element or a relationship between several data elements and can be implemented in a computer program. [...] The solution relies on the design and implementation of hundreds and thousands of such data quality rules, and then using them to identify all data inconsistencies. Miraculously, a well-designed and fine-tuned collection of rules will identify a majority Of data errors in a fraction or time compared with manual validation. In fact, it never takes more than a few months to design and implement the rules and produce comprehensive error reports, What is even better, the same setup can be reused over and over again to reassess data quality periodically with minimal effort." (Arkady Maydanchik, "Data Quality Assessment", 2007)
"Using data quality rules brings comprehensive data quality assessment from fantasy world to reality. However, it is by no means simple, and it takes a skillful skipper to navigate through the powerful currents and maelstroms along the way. Considering the volume and structural complexity of a typical database, designing a comprehensive set of data quality rules is a daunting task. The number Of rules will often reach hundreds or even thousands. When some rules are missing, the results of the data quality assessment can be completely jeopardized, Thus the first challenge is to design all rules and make sure that they indeed identify all or most errors." (Arkady Maydanchik, "Data Quality Assessment", 2007)
"While we might attempt to identify and correct most data errors, as well as try to prevent others from entering the database, the data quality will never be perfect. Perfection is practically unattainable in data quality as with the quality of most other products. In truth, it is also unnecessary since at some point improving data quality becomes more expensive than leaving it alone. The more efficient our data quality program, the higher level of quality we will achieve- but never will it reach 100%. However, accepting imperfection is not the same as ignoring it. Knowledge of the data limitations and imperfections can help use the data wisely and thus save time and money, The challenge, of course, is making this knowledge organized and easily accessible to the target users. The solution is a comprehensive integrated data quality meta data warehouse." (Arkady Maydanchik, "Data Quality Assessment", 2007)