SQL Troubles: structuredness


29 March 2024

🗄️🗒️Data Management: Data [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 29-Mar-2024

[Data Management] Data

{def} raw, unrelated numbers or entries that represent facts, concepts, events, and/or associations
categorized by

domain

{type} transactional data
{type} master data
{type} configuration data

{subtype} hierarchical data
{subtype} reference data
{subtype} setup data
{subtype} policy

{type} analytical data

{subtype} measurements
{subtype} metrics
{subtype}

structuredness

{type} structured data
{type} semi-structured data
{type} unstructured data

statistical usage as variable

{type} categorical data (aka qualitative data)

{subtype} nominal data
{subtype} ordinal data
{subtype} binary data

{type} numerical data (aka quantitative data)

{subtype} discrete data
{subtype} continuous data

size

{type} small data
{type} big data

{concept} transactional data

{def} data that describe business transactions and/or events
supports the daily operations of an organization
commonly refers to data created and updated within operational systems
supports applications that automate key business processes
usually stored in normalized tables

{concept} master data

{def} "data that provides the context for business activity data in the form of common and abstract concepts that relate to the activity" [2]

the key business entities on which transactions are executed

the dimensions around which analysis is conducted

used to categorize, evaluate and aggregate transactional data

can be shared across more than one transactional application
some master data are common to most organizations, while others are specific to certain industries
often appear in more than one area within the business
represent one version of the truth
can be further divided into specialized subsets
{concept} master data entity

core business entity used in different applications across the organization, together with their associated metadata, attributes, definitions, roles, connections and taxonomies
may be classified within a hierarchy

the way they describe, characterize and classify business concepts may actually cross multiple hierarchies in different ways

e.g. a party can be an individual, customer, employee, while a customer might be an individual, party or organization

do not change as frequently as transactional data

less volatile than transactional data
there are master data that don’t change at all

e.g. geographic locations

strategic asset of the business
needs to be managed with the same diligence as other strategic assets
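
The note that master data is "used to categorize, evaluate and aggregate transactional data" can be illustrated with a minimal sketch. The customer and order records below, as well as the helper `total_by_dimension`, are hypothetical and exist only for illustration; a real system would typically do this with a join between normalized tables.

```python
from collections import defaultdict

# Hypothetical master data: customer id -> attributes usable as analysis dimensions
customers = {
    "C1": {"name": "Acme Ltd", "country": "DE"},
    "C2": {"name": "Globex", "country": "US"},
    "C3": {"name": "Initech", "country": "US"},
}

# Hypothetical transactional data: one record per sales order
orders = [
    {"order_id": 1, "customer_id": "C1", "amount": 100.0},
    {"order_id": 2, "customer_id": "C2", "amount": 250.0},
    {"order_id": 3, "customer_id": "C3", "amount": 50.0},
    {"order_id": 4, "customer_id": "C1", "amount": 25.0},
]


def total_by_dimension(orders, customers, dimension):
    """Sum order amounts grouped by a master data attribute (the dimension)."""
    totals = defaultdict(float)
    for order in orders:
        # The transaction only carries a key; the master data supplies the context
        key = customers[order["customer_id"]][dimension]
        totals[key] += order["amount"]
    return dict(totals)


totals_by_country = total_by_dimension(orders, customers, "country")
print(totals_by_country)  # {'DE': 125.0, 'US': 300.0}
```

The transactions reference the master data only through `customer_id`, so the same customer records could serve several transactional applications.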

{concept} metadata

{definition} "data that defines and describes the characteristics of other data, used to improve both business and technical understanding of data and data-related processes" [2]

data about data

refers to

database schemas for OLAP & OLTP systems
XML document schemas
report definitions
additional database table and column descriptions stored with extended properties or custom tables provided by SQL Server
application configuration data

{concept} analytical data

{definition} data that supports analytical activities

e.g. decision making, reporting queries and analysis

comprises

numerical values
metrics
measurements

stored in OLAP repositories

optimized for decision support
enterprise data warehouses
departmental data marts
within table structures designed to support aggregation, queries and data mining

{concept} hierarchical data

{definition} data that reflects a hierarchy
typically appears in analytical applications
{concept} hierarchy

{concept} structured data

{definition} "data that has a strict metadata defined"

{concept} unstructured data

{definition} data that doesn't follow predefined metadata
involves all kinds of documents
can appear in a database, in a file, or even in printed material

{concept} semi-structured data

{definition} structured data stored within unstructured data
typically data in XML form

XML is widely used for data exchange

can appear in stand-alone files or as part of a database (as a column in a table)
useful when metadata (the schema) changes frequently, or there’s no need for a detailed relational schema
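
As a rough illustration of why semi-structured data helps when the schema changes frequently, the sketch below parses an XML fragment in which each product carries a different set of child elements. The document content and the helper `product_attributes` are made up for this example; only Python's standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML, e.g. the content of an XML column or a stand-alone file.
# The two products do not share the same set of attributes.
xml_doc = """
<products>
  <product id="1"><name>Laptop</name><ram>16</ram></product>
  <product id="2"><name>Desk</name><material>oak</material><width>140</width></product>
</products>
"""


def product_attributes(xml_text):
    """Return {product id: {tag: text}} without assuming a fixed set of tags."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for product in root.findall("product"):
        # Each product describes itself through whatever child elements it has
        result[product.get("id")] = {child.tag: child.text for child in product}
    return result


parsed = product_attributes(xml_doc)
for product_id, attributes in parsed.items():
    print(product_id, attributes)
```

A new attribute can be added to a product without altering any table definition, at the cost of the consumer having to handle attributes that may be missing.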


References:
[1] The Art of Service (2017) Master Data Management Course

[2] DAMA International (2011) "The DAMA Dictionary of Data Management"

25 January 2010

🗄️Data Management: Data Quality Dimensions (Part VII: Structuredness)

Data Management Series

Barry Boehm defines structuredness as 'the degree to which a system or component possesses a definite pattern of organization of its interdependent parts' [1]. Transposed to data, it refers to the 'pattern of organization' that can be observed in data, mainly the format in which the data are stored at macro-level (file or any other type of digital container) or micro-level (tags, groupings, sentences, paragraphs, tables, etc.), which results in several levels of structure of different types.

The various sources in which data are stored (databases, Excel files and other types of data sheets, text files, emails, documentation, meeting minutes, charts, images, intranet or extranet web sites) can contain multiple structures coexisting in the same document, some of them quite difficult to perceive. From the point of view of structuredness, data can be categorized as structured, semi-structured and unstructured.

In general, the term structured data refers to structures that can be easily perceived or known, and that raise no doubt about where the structure's boundaries lie. Unstructured data refers to textual data and media content (video, sound, images), in which structural patterns, even if they exist, are hard to discover or are not predefined, while semi-structured data refers to islands of structured data stored within unstructured data, or vice versa.

From this perspective, according to [3], database and file systems, as well as data exchange formats, are examples of semi-structured data, though from a programmer's perspective databases are highly structured, and the same holds for XML files. As also remarked by [2], the terms structured data and unstructured data are often used ambiguously by different interest groups in different contexts (web searching, data mining, semantics, etc.).

Data structuredness is especially important when data are processed by machines, as the correct parsing of data depends heavily on knowledge of the data structure, either defined beforehand or deduced. The more structured the data, and the more evident and standardized the structure, the easier the data should be to process. Merrill Lynch estimates that 85% of the data in an organization are in unstructured form, a number that most probably includes semi-structured data too. Making such data available in a structured format requires a significant volume of manual work, possibly combined with reliable data/text mining techniques, a fact that considerably reduces the value of such data.

Text, relational, multidimensional, object, graph or XML-based DBMS are in theory the easiest to process, map and integrate, though this might not be as simple as it looks, given the different architectures vendors come with and the fact that structures evolve over time. To bridge the structural and architectural differences, many vendors make it possible to access data over standard interfaces (e.g. ODBC), though there are also systems that provide only proprietary interfaces, making data difficult to obtain in an automated manner. There are also other technical issues, related mainly to differing data types and data formats, though such issues can be easily overcome.

In the context of Data Quality, the structuredness dimension refers to the degree to which the structure in which the data are stored matches the expectations, i.e. the syntactic set of rules defining it, considered across the whole set of records. Even a minor inconsistency in the structure of a record could lead to processing errors and unexpected behavior. The simplest example is a delimited text file: if any of the characters used to delimit the structure of the file also appears in the data itself, then there is a high chance that the file will be parsed incorrectly, or that parsing will fail, unless the issues are corrected.
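
The delimited text file case can be sketched in a few lines of Python. The sample content and the `structuredness` helper below are assumptions made for illustration (the ratio of rows with the expected field count is just one possible way to quantify the dimension, not an established metric); `csv.reader` is the standard library parser that honors quoted fields.

```python
import csv
import io

# Hypothetical comma-delimited content; the second row contains the delimiter
# inside a (quoted) data value
raw = 'id,name,city\n1,"Smith, John",Berlin\n2,Jane Doe,Paris\n'

# Naive parsing: split on every delimiter, ignoring the quoting rules
naive_rows = [line.split(",") for line in raw.splitlines()]

# Parsing that respects the quoting rules of the format
csv_rows = list(csv.reader(io.StringIO(raw)))


def structuredness(rows, expected_fields):
    """Share of rows whose field count matches the expected structure."""
    if not rows:
        return 0.0
    matching = sum(1 for row in rows if len(row) == expected_fields)
    return matching / len(rows)


print(naive_rows[1])  # ['1', '"Smith', ' John"', 'Berlin']
print(csv_rows[1])    # ['1', 'Smith, John', 'Berlin']
print(structuredness(naive_rows, 3), structuredness(csv_rows, 3))
```

With the naive split the second row yields four fields instead of three, so one of three rows violates the expected structure; the quoting-aware parser recovers all three rows. If the value had not been quoted in the source file, neither approach could recover the intended structure without correcting the data.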


Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] Barry W Boehm et al (1978) "Characteristics of software quality"
[2] The Register (2006) "Structured data is boring and useless", by D. Nortfolk
[3] P Wood (?) "Semi-structured Data"
