24 January 2010

Data Structuredness

    B. Boehm defined structuredness as “the degree to which a system or component possesses a definite pattern of organization of its interdependent parts” [1]. Transposed to data, it refers to the “pattern of organization” that can be observed in data, mainly the format in which the data are stored at macro level (a file or any other type of digital container) or micro level (tags, groupings, sentences, paragraphs, tables, etc.), so that several levels of structure of different types emerge. The various sources in which data are stored (databases, Excel files and other types of data sheets, text files, emails, documentation, meeting minutes, charts, images, intranet or extranet web sites) can contain multiple structures coexisting in the same document, some of them quite difficult to perceive. From the point of view of structuredness, data can be categorized as structured, semi-structured and unstructured.

    In general, at least as I perceive it, the term structured data refers to structures that can be easily perceived or known and that raise no doubt about the structure’s boundaries. Unstructured data refers to textual data and media content (video, sound, images), in which structural patterns, even if they exist, are hard to discover or not predefined, while semi-structured data refers to islands of structured data floating among unstructured data. From this perspective, according to [3], databases, file systems and data exchange formats are examples of semi-structured data, though from a programmer’s perspective databases are highly structured, and the same holds for XML files. As also remarked by [2], the terms structured data and unstructured data are often used ambiguously by different interest groups in different contexts: web searching, data mining, semantics, etc.
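
    To make the idea of “islands of structured data” more concrete, here is a minimal sketch in Python that uses an email message as the example: the headers follow a predictable name/value pattern, while the body is free text. The message content and addresses are invented for illustration only.

```python
from email import message_from_string

# Hypothetical message: structured headers followed by an unstructured body.
raw = """From: alice@example.com
To: bob@example.com
Subject: Meeting minutes

Hi Bob, here are the notes from Monday. We agreed to move the release to March.
"""

msg = message_from_string(raw)

# The structured island: header fields with known names and one value each.
headers = dict(msg.items())

# The unstructured part: free text whose structure, if any, must be discovered.
body = msg.get_payload()

print(headers["Subject"])
print(body)
```

    The headers can be queried directly by name, whereas anything of interest in the body (dates, decisions, names) would have to be extracted with text mining techniques.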

    A distinction must actually be made between the syntactic and semantic aspects of structuredness: syntactic structuredness refers to “the rules and patterns formed through the combination of signs in constructs” [4] in data, while semantic structuredness refers to the patterns of meaning, the above definitions applying to both aspects. If we talk about syntax and semantics, then it most probably makes sense to talk also about pragmatic structuredness, the third dimension of semiotics. Another source of confusion is the interchangeable use of terms like data, information or knowledge for the same purpose in the same context; see the confusion between data management and information management or knowledge management.

    Data structuredness is especially important when data are processed with the help of machines, the correct parsing of data being highly dependent on knowledge about the data structure, either defined beforehand or deduced. The more structured the data, and the more evident and standardized the structure, the easier it should be to process the data. Merrill Lynch estimates that 85% of the data in an organization are in unstructured form, this number most probably including semi-structured data too. Making such data available in a structured format requires a significant volume of manual work, possibly combined with reliable data/text mining techniques, a fact that considerably reduces the value of such data.

    Text, relational, multidimensional, object, graph or XML-based DBMS are in theory the easiest to process, map and integrate, though that might not be as simple as it looks, given the different architectures vendors come with and the fact that structures evolve over time. In order to bridge the structural and architectural differences, many vendors make it possible to access data over standard interfaces (e.g. ODBC), though there are also systems that provide only proprietary interfaces, making data difficult to obtain in an automated manner. There are also other technical issues, related mainly to the different data types and data formats, though such issues can be easily overcome.

    In the context of Data Quality, the structuredness dimension refers to the degree to which the structure in which the data are stored matches the expectations, that is, the syntactic set of rules defining it. In theory even a minor deviation in the structure could lead to processing errors and unexpected behavior. The best example is data stored in delimited text files: if any of the characters used to delimit the structure of the file also appears in the data itself, there is a high chance that the file will be parsed incorrectly, or that the parsing will fail, unless the issues are corrected.
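
    The delimited-file problem can be sketched in a few lines of Python. This is only an illustration under assumed data: the sample rows and the helper malformed_rows are hypothetical, and the check shown (comparing each row's field count with the expected one) is just one simple way of testing whether a file matches its expected structure.

```python
import csv
import io

def malformed_rows(text, expected_fields, delimiter=","):
    """Return the 1-based numbers of rows whose field count differs
    from the expected one (hypothetical helper for illustration)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [n for n, fields in enumerate(reader, start=1)
            if len(fields) != expected_fields]

# Invented sample data: the name in the second row contains the delimiter.
rows = [
    ["id", "name", "city"],
    ["1001", "Smith, John", "Berlin"],
    ["1002", "Doe", "Paris"],
]

# Naive export: fields are joined without quoting, so the embedded comma
# is indistinguishable from a field separator.
naive_text = "\n".join(",".join(row) for row in rows) + "\n"

# Export through the csv module: fields containing the delimiter are quoted.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
quoted_text = buffer.getvalue()

print(malformed_rows(naive_text, expected_fields=3))   # [2]
print(malformed_rows(quoted_text, expected_fields=3))  # []
```

    In the naive export the second row is split into four fields instead of three, while the quoted export round-trips correctly. A field-count check catches only this class of deviation; it does not detect a row that happens to have the right number of fields but the wrong content.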

[1] Boehm B.W., Brown J.R., Kaspar H., Lipow M., Macleod G.J., Merritt M.J. (1978). Characteristics of software quality. North-Holland Publishing Company.
[2] The Register. (2006). Structured data is boring and useless, by D. Norfolk. [Online] Available from: http://www.theregister.co.uk/2006/06/23/unstructured_data/ (Accessed: 24 January 2010)
[3] Wood P. (????). Semi-structured Data. [Online] Available from: http://www.dcs.bbk.ac.uk/~ptw/teaching/ssd/toc.html (Accessed: 24 January 2010)
[4] Nastase A. (2009). Utilizing Mind Maps as a Structure for Mining the Semantic Web, Dissertation. University of Liverpool. [Online] Available from: http://www.scribd.com/doc/16612282/Dissertation-paper-Utilizing-Mind-Maps-as-a-Structure-for-Mining-the-Semantic-Web (Accessed: 24 January 2010)
