Showing posts with label structuredness. Show all posts
Showing posts with label structuredness. Show all posts

29 March 2024

🗄️🗒️Data Management: Data [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources. 
Last updated: 29-Mar-2024

Data

  • {definition} raw, unrelated numbers or entries that represent facts, concepts,  events,  and/or associations 
  • categorized by
    • domain
      • {type} transactional data
      • {type} master data
      • {type} configuration data
        • {subtype}hierarchical data
        • {subtype} reference data
        • {subtype} setup data
        • {subtype} policy
      • {type} analytical data
        • {subtype} measurements
        • {subtype} metrics
        • {subtype} 
    • structuredness
      • {type} structured data
      • {type} semi-structured data
      • {type} unstructured data
    • statistical usage as variable
      • {type} categorical data (aka qualitative data)
        • {subtype} nominal data
        • {subtype} ordinal data
        • {subtype} binary data
      • {type} numerical data (aka quantitative data)
        • {subtype} discrete data
        • {subtype} continuous data
    • size
      • {type} small data
      • {type} big data
  • {concept} transactional data
    • {definition} data that describe business transactions and/or events
    • supports the daily operations of an organization
    • commonly refers to data created and updated within operational systems
    • support applications that automated key business processes 
    • usually stored in normalized tables
  • {concept} master data
    • {definition}"data that provides the context for business activity data in the form of common and abstract concepts that relate to the activity" [2]
      • the key business entities on which transaction are executed
    • the dimensions around on which analysis is conducted
      • used to categorize, evaluate and aggregate transactional data
    • can be shared across more than one transactional applications
    • there are master data similar to most organizations, but also master data specific to certain industries 
    • often appear in more than one area within the business
    • represent one version of the truth
    • can be further divided into specialized subsets
    • {concept} master data entity
      • core business entity used in different applications across the organization, together with their associated metadata, attributes, definitions, roles, connections and taxonomies 
      • may be classified within a hierarchy
        • the way they describe, characterize and classify business concepts may actually cross multiple hierarchies in different ways
          • e.g. a party can be an individual, customer, employee, while a customer might be an individual, party or organization
    • do not change as frequent like transactional data
      • less volatile than transactional data 
      • there are master data that don’t change at all 
        • e.g. geographic locations
    • strategic asset of the business 
    • needs to be managed with the same diligence as other strategic assets
  • {concept} metadata 
    • {definition} "data that defines and describes the characteristics of other data, used to improve both business and technical understanding of data and data-related processes" [2]
      • data about data
    • refers to 
      • database schemas for OLAP & OLTP systems
      • XML document schemas
      • report definitions
      • additional database table and column descriptions stored with extended properties or custom tables provided by SQL Server
      • application configuration data
  • {concept} analytical data
    • {definition} data that supports analytical activities 
      • e.g. decision making, reporting queries and analysis 
    • comprises
      • numerical values
      • metrics
      • measurements
    • stored in OLAP repositories 
      • optimized for decision support 
      • enterprise data warehouses
      • departmental data marts
      • within table structures designed to support aggregation, queries and data mining 
  • {concept} hierarchical data 
    • {definition} data that reflects a hierarchy 
      • relationships between data are represented in hierarchies
    • typically appears in analytical applications
    • {concept} hierarchy
      • "a classification structure arranged in levels of detail from the broadest to the most detailed level" [2]
      • {concept} natural hierarchy
        • stem from domain-based attributes
        • represent an intrinsic structure of the dat
          • they are natural for the data
            • e.g. product taxonomy (categories/subcategories)
        • useful for drilling down from a general to a detailed level in order to find reasons, patterns, and problems
          • common way of analyzing data in OLAP applications
          • common way of filtering data in OLTP applications
      • {concept} explicit hierarchy
        • organize data according to business needs
        • entity members can be organized in any way
        • can be ragged
          • the hierarchy can end at different levels
      • {concept} derived hierarchy
        • domain-based attributes form natural hierarchies 
        • relationships between entities must already exist in a model
        • can be recursive
  • {concept} structured data
    • {definition} "data that has a strict metadata defined"
  • {concept} unstructured data 
    • {definition} data that doesn't follow predefined metadata
    • involves all kinds of documents 
    • can appear in a database, in a file, or even in printed material
  • {concept} semi-structured data 
    • {definition} structured data stored within unstructured data,
    • data typically in XML form
      • XML is widely used for data exchange
    • can appear in stand-alone files or as part of a database (as a column in a table)
    • useful when metadata (the schema) changes frequently, or there’s no need for a detailed relational schema
References:
[1] The Art of Service (2017) Master Data Management Course 
[2] DAMA International (2011) "The DAMA Dictionary of Data Management", 

25 January 2010

🗄️Data Management: Data Quality Dimensions (Part VII: Structuredness)

Data Management
Data Management Series

Barry Boehm defines structuredness as 'the degree to which a system or component possesses a definite pattern of organization of its interdependent parts' [1], which transposed to data refers to the 'pattern of organization' that can be observed in data, mainly the format in which the data are stored at macro-level (file or any other type of digital containment) or micro-level (tags, groupings, sentences, paragraphs, tables, etc.), emerging thus several levels of structure of different type. 

From the various sources in which data are stored - databases, Excel files and other types of data sheets, text files, emails, documentation, meeting minutes, charts, images, intranet or extranet web sites, can be derived multiple structures coexisting in the same document, some of them quite difficult to perceive. From the structuredness point of view data can be categorized as structured, semi-structured and unstructured.

In general, the term structured data refers to structures that can be easily perceived or known, that raises no doubt on structure’s delimitations. Unstructured data refers to textual data and media content (video, sound, images), in which the structural patterns even if exist they are hard to discover or not predefined, while semi-structured data refers to islands of structured data stored with unstructured data, or vice versa. 

From this perspective, according to [3], database and file systems, data exchange formats are example of semi-structured data, though from a programmers’ perspective the databases are highly structured, and same for XML files. As also remarked by [2] the terms of structured data and unstructured data are often used ambiguously by different interest groups, in different contexts – web searching, data mining, semantics, etc.

Data structuredness is important especially when is considered the processing of data with the help of machines, the correct parsing of data being highly dependent on the knowledge about the data structure, either defined beforehand or deducted. The more structured the data and the more evident and standardized the structure, the easier should be to process the data. Merrill Lynch estimates that 85% of the data in an organization are in unstructured form, most probably this number referring to semi-structured data too. To make such data available in a structured format is required an important volume of manual work combined eventually with reliable data/text mining techniques, a fact that reduces considerably the value of such data.

Text, relational, multidimensional, object, graph or XML-based DBMS are in theory the most easily to process, map and integrate though that might not be so simple as it looks given the different architectures vendors come with, the fact that the structures evolve over time. To bridge the structure and architectural differences, many vendors make it possible to access data over standard interfaces (e.g. ODBC), though there are also systems that provide only proprietary interfaces, making data difficult to obtain in an automated manner. There are also other types of technical issues related mainly to the different data types and data formats, though such issues can be easily overcome.

In the context of Data Quality, the structuredness dimension refers to the degree the structure in which the data are stored matches the expectations, the syntactic set of rules defining it, being considered across the whole set of records. Even a minor inadvertence in the structure of a record could lead to processing errors and unexpected behavior. The simplest example is a delimited text file - if any of the character sets used to delimit the structure of the file is available in the data itself, then there are high chances that the file will be parsed incorrectly, or the parsing will fail unless the issues are corrected.


Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] Barry W Boehm et al (1978) "Characteristics of software quality"
[2] The Register (2006) "Structured data is boring and useless", by D. Nortfolk (link)
[3] P Wood (?) "Semi-structured Data"
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.