17 January 2010

Data Quality Dimensions - Accuracy

    Accuracy refers to the extent to which data are correct, that is, the extent to which they match reality with an acceptable level of approximation. Correctness, the quality of being correct, of being the same as reality, is a vague term; in many cases it is a question of philosophy or perception, with a high degree of interpretability. Where data are concerned, however, values are typically the result of measurement, so a measure of accuracy relates to the degree to which the data deviate from physical laws, logic or defined rules. Even this context is a swampy field because, to use a well-known syntagm, everything is relative. From a scientific point of view we try to model reality with mathematical models that offer various levels of approximation; the more we learn about our world, the more flaws we discover in the existing models, in a continuous quest for models that better approximate reality. Things don't have to be so complicated, though: for basic measurements there are many tools that offer acceptable results for most requirements, while, as requirements change, better approximations might be needed over time.

    Another concept related to accuracy and measurement systems is precision, the degree to which repeated measurements under unchanged conditions lead to the same results; further concepts associated with it are repeatability and reproducibility. Even if accuracy and precision are often confounded, a measurement system can be accurate but not precise, or precise but not accurate (see the target analogy); a valid measurement system therefore targets both aspects. This being said, accuracy and precision can be considered dimensions of correctness.
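    The distinction can be sketched with a few lines of code: given repeated measurements of a quantity whose true value is known, the bias of the mean reflects accuracy while the spread of the readings reflects precision. The sample readings below are illustrative, not from any real instrument.

```python
import statistics

def assess_measurements(readings, true_value):
    """Summarize repeated measurements of the same quantity.

    Accuracy is captured by the bias (mean deviation from the true value),
    precision by the spread (standard deviation) of the readings.
    """
    mean = statistics.mean(readings)
    bias = mean - true_value             # systematic error -> accuracy
    spread = statistics.stdev(readings)  # random error -> precision
    return bias, spread

# A scale that is precise but not accurate: tightly grouped, but offset.
bias, spread = assess_measurements([10.48, 10.52, 10.50, 10.49, 10.51],
                                   true_value=10.0)
```

    Here the readings cluster within a few hundredths of each other (high precision) yet sit half a unit away from the true value (low accuracy), the second quadrant of the target analogy.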

    Coming back to accuracy and its use in determining data quality: accuracy is typically strongly related to the measurement tools used, so assessing it requires repeating the measurements for all or a sample of the dataset and checking whether the requested level of accuracy is met, an approach that can involve considerable effort. Accuracy also depends on whether the systems used to store the data are designed to store them at the requested level of accuracy, a fact reflected in the characteristics of the data types used (e.g. precision, length).
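    As a minimal sketch of the storage side, the snippet below mimics writing a value into a column declared with a fixed scale (e.g. DECIMAL(p,2)): any decimals beyond those the data type can hold are lost at write time, capping the accuracy the system can ever report. The column definition and values are assumptions for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

def store_as_decimal(value, scale=2):
    """Mimic storing a value in a fixed-scale column (e.g. DECIMAL(p,2)):
    anything beyond the declared number of decimals is lost on write."""
    return Decimal(str(value)).quantize(Decimal(1).scaleb(-scale),
                                        rounding=ROUND_HALF_UP)

measured = 12.3456                   # value measured to four decimals
stored = store_as_decimal(measured)  # only two decimals survive storage
```

    No downstream check can recover the truncated decimals, which is why the required level of accuracy should inform the choice of data types up front.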

    Given that a system stores related data (e.g. weight, height, width, length) that should satisfy physical, business or common-sense rules, such rules can be used to check whether the data satisfy them within the desired level of approximation. For example, knowing the height, width, length and composition of a material (e.g. a metal bar), an approximate weight can be derived and compared with the entered weight; if the difference falls outside a certain interval, then most probably one of the values was entered incorrectly. Even simpler rules might apply, for example that physical dimensions have to be positive real values, or, in a generalized formulation, that maximal or minimal limits lead to the identification of outliers. In fact, most of the time determining data accuracy comes down to defining intervals of possible values, though there will also be cases in which complex models and specific techniques are built for this purpose.
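    The metal-bar example above can be sketched as a plausibility rule: derive the expected weight from dimensions and density, then flag the record when the entered weight deviates by more than a tolerated fraction. The steel density of 7850 kg/m³, the 5% tolerance and the sample dimensions are illustrative assumptions, not values from the text.

```python
def expected_weight_kg(height_m, width_m, length_m, density_kg_m3):
    """Approximate weight of a rectangular bar from dimensions and density."""
    for d in (height_m, width_m, length_m):
        if d <= 0:  # simple rule: physical dimensions must be positive
            raise ValueError("physical dimensions must be positive")
    return height_m * width_m * length_m * density_kg_m3

def weight_is_plausible(entered_kg, expected_kg, tolerance=0.05):
    """Accept the entered weight only if it lies within the tolerated
    fraction of the weight derived via the rule."""
    return abs(entered_kg - expected_kg) <= tolerance * expected_kg

# Steel bar 0.05 x 0.05 x 2.0 m, assumed density ~7850 kg/m3
expected = expected_weight_kg(0.05, 0.05, 2.0, 7850)  # ~39.25 kg
ok = weight_is_plausible(39.0, expected)   # close to the rule -> plausible
bad = weight_is_plausible(3.9, expected)   # order of magnitude off -> flag
```

    The same pattern generalizes to any interval check: the rule supplies the expected value or the admissible range, and records falling outside it become candidates for review rather than automatic corrections.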

    There is another important aspect related to accuracy: the time dependency of data, whether or not the data change with time. Data currency or actuality refers to the extent to which data are up to date. Given the above definition of accuracy, currency can be considered a special type of accuracy, because data that are not current no longer reflect reality. If currency is treated as a standalone data quality dimension, then accuracy refers only to data that are not time dependent.
