
19 January 2018

Data Science: Structured Data (Definitions)

"Data that has a strict metadata defined, such as a SQL Server table’s column." (Victor Isakov et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance (70-444) Study Guide", 2007)

"Data that has enforced composition to specified datatypes and relationships and is managed by technology that allows for querying and reporting." (Keith Gordon, "Principles of Data Management", 2007)

"Database data, such as OLTP (Online Transaction Processing System) data, which can be sorted." (David G Hill, "Data Protection: Governance, Risk Management, and Compliance", 2009)

"A collection of records or data that is stored in a computer; records maintained in a database or application." (Robert F Smallwood, "Managing Electronic Records: Methods, Best Practices, and Technologies", 2013)

"Data that has a defined length and format. Examples of structured data include numbers, dates, and groups of words and numbers called strings (for example, a customer’s name, address, and so on)." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"Data that fits cleanly into a predefined structure." (Evan Stubbs, "Big Data, Big Innovation", 2014)

"Data that is described by a data model, for example, business data in a relational database" (Hasso Plattner, "A Course in In-Memory Data Management: The Inner Mechanics of In-Memory Databases" 2nd Ed., 2014)

"Data that is managed by a database management system" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"In statistics and data mining, any type of data whose values have clearly defined meaning, such as numbers and categories." (Meta S Brown, "Data Mining For Dummies", 2014)

"Data that adheres to a strict definition." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"Data that has a defined length and format. Examples of structured data include numbers, dates, and groups of words and numbers called strings (for example, for a customer’s name, address, and so on)." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"Data that resides in a fixed field within a file or individual record, such as a row & column database." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"Information that sits in a database, file, or spreadsheet. It is generally organized and formatted. In retail, this data can be point-of-sale data, inventory, product hierarchies, or others." (Brittany Bullard, "Style and Statistics", 2016)

"A data field of a definable data type, usually of a specified size or range, that can be easily processed by a computer." (George Tillmann, "Usage-Driven Database Design: From Logical Data Modeling through Physical Schmea Definition", 2017)

"Data that can be stored in a table. Every instance in the table has the same set of attributes. Contrast with unstructured data." (John D Kelleher & Brendan Tierney, "Data science", 2018)

"Data that is identifiable as it is organized in structure like rows and columns. The data resides in fixed fields within a record or file or the data is tagged correctly and can be accurately identified." (Analytics Insight)

"Refers to information with a high degree of organization, meaning that it can be seamlessly included in a relational database and quickly searched by straightforward search engine algorithms and/or other search operations. Structured data examples include dates, numbers, and groups of words and number 'strings'. Machine-generated structured data is on the increase and includes sensor data and financial data." (Accenture)

30 November 2010

Business Intelligence: How Many Reporting Systems Do You Need?

Business Intelligence
Business Intelligence Series

I hope nobody questions the fact that an organization needs at least one reporting system in order to take advantage of the huge amount of data it has collected over time, to get an overview of what’s happening in the organization, to take better decisions supported by data, and so on. One way or another, many organizations end up with two or more reporting systems in place, together with all the consequences deriving from that. From experience, I would expect that many professionals have asked themselves the above question at least once, often having to live with their decisions, and not always being happy about them.

On one side we have the demand, the needs of the various departments and groups existing in an organization, while on the other side of the balance we have the various types of data and the technologies existing on the market (which could be regarded as the supply), plus the various constraints an organization might have: resources (financial, human, processing power, time), politics and philosophies, geography, complexity, etc. If these don’t seem like many, consider that each constraint can be further split into a long chain of causes; the books on Project Management, Architecture, Methodologies and other related topics treat them in detail, so there is no need to go there.

For practical reasons, I’d like to reformulate the above question as follows:
1. Is one reporting system enough, or do we need more than one?
2. What’s the optimum number of reporting systems for an organization?
 
A blunt answer to the first question would be that an organization might need as many reporting systems as it takes to satisfy the existing reporting needs, in other words the demand. A straightforward answer, but not really acceptable for a Manager in particular, or an IT Professional in general. Another blunt answer, this time to the second question, would be that an organization needs only “one and good” reporting system, though that contradicts the various constraints existing in the IT world. So, what should it be?

Typically there are two views: the bottom-up view, in this case starting with the data and building up to the reporting demands, and the top-down view, starting from the reporting requirements and working down to the data. In this setup the technologies in general, and the reporting systems in particular, play the role of the middle point, the two views applying to data vs. reporting technologies and to reporting technologies vs. reporting requirements, which increases the complexity of the picture. Sometimes it’s enough to focus only on the reporting requirements, so why consider the data in this scenario as well? Remember, reporting is all about harnessing the value of your data, not only about finding the right reporting solution.

Often, in order to understand the requirements of an organization and the value of its data, it’s easier to look at the various types of data it stores, the bottom-up approach. Typically we can distinguish between master vs. transactional data, raw vs. aggregated data, structured vs. semi-structured vs. unstructured data, cleaned vs. uncleaned data, historical vs. live data, and mined, distributed or heterogeneous data, to mention just a few of the important ways of categorizing data.

The respective types of data are important because they require special techniques or tools to process, report on and harness them. Master and transactional data are at the heart of business applications, the “fluid” that flows through and nourishes them, with an incredible potential when harnessed at its real value. Such unaltered data are typically structured at the macro level, though they can also contain semi-structured or unstructured chunks, have different levels of quality, be distributed or heterogeneous, etc. The data repositories range from text or document systems to specialized databases, any mixture being possible in theory.

So we have a huge volume of data collected over time; how can we harness it in order to reveal its real value? Excepting operational use, the usefulness of raw data, data found in its unaltered form, usually stops at the point where its complexity exceeds people’s ability to understand it. Thus we have come to aggregate, recombine or extract chunks from the data, bringing it to a more comprehensible form. We have even come to extract more information out of the data using specific techniques known as data mining methods, algorithms that allow us to identify associations, cluster or interpolate the data, the complexity of such algorithms having evolved considerably over the past two decades.
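
As a small illustration of the clustering techniques mentioned above, the sketch below groups a handful of made-up customer records with k-means; scikit-learn and the toy figures are my own choices for the example, not something prescribed here.

from sklearn.cluster import KMeans  # third-party: pip install scikit-learn

# Toy data: (number of orders, average order value) per customer.
data = [
    [2, 30], [3, 25], [2, 40],      # occasional, low-value buyers
    [40, 35], [38, 30], [45, 28],   # frequent, low-value buyers
    [5, 400], [4, 380], [6, 420],   # rare, high-value buyers
]

# Clustering turns hard-to-grasp raw rows into a few segments
# that a report can name and track.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print(model.labels_)           # cluster assignment per customer
print(model.cluster_centers_)  # the "profile" of each segment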
 
There are few software applications which provide the whole range of data processing types; most of them limit themselves to recombining and presenting the data. Data presentation also offers a considerable range of formats (e.g. tabular, text, XML, Excel, charts and other diagrams), with complex navigational functionality (slice-and-dice, drill-down, drill-through, click-through), different ways of making the data available for consumption (direct access, cached, web services), different security levels, etc.

When it comes to managers and users, they want the data at the level of detail and in the form that allows the easiest understanding, and they want it “yesterday”, to use a word that reflects the urgency of the requirements. In theory this could be done by coming up with a whole set of reports covering all possible requirements, though that’s not so efficient as an investment, and not always achievable at the click of a button, as we would probably all like.

For those familiar with Manufacturing, it’s actually a disguised push vs. pull scenario: pushing the reports to the users without waiting for their demands vs. waiting for the users to forward their reporting requirements, from which to eventually derive new reports or changes to existing ones. In Manufacturing it’s all about finding the right balance between push and pull, and even if that field has found some (best) practices that lead to a middle way, in IT we still have to dig for it. The road seems to lead nowhere… but it helps in getting a rough understanding of what a report involves, at least concerning the important non-programming aspects.

In the top-down approach, as the data remain relatively constant, it’s easier to focus on the set of requirements and the reporting tool(s) used. It makes sense to choose the reporting tool that covers at least the most important requirements, while for the others you have the choice of using workarounds or addressing them with a second (or third) reporting solution, attempting again to cover the most important of the remaining requirements, and so on. The problem with having more than one reporting solution is that data, logic and reports often end up being replicated, the whole reporting infrastructure becoming more difficult to manage. On the other side, as long as there is a clear (partitioned) delimitation between the reporting frameworks, with logic and data replicated only minimally, having two or more reporting solutions seems an acceptable trade-off. Examples of such delimitations are the OLTP vs. OLAP solutions, to which the data mining reporting solution(s) can be added.

There are also attempts to extend an existing reporting framework beyond its initial functionality, though in doing so people have often ended up reinventing the wheel or creating little monsters they then have to live with. This too is an acceptable approach, as long as the reporting framework allows it. Some attempts might succeed in providing what they were designed for, others not. There is also some good news from this perspective: the appearance of mash-ups and semantic technologies could in the future allow the integration of reporting systems, making possible things that are quite challenging nowadays. The future is open to unlimited possibilities, but better to stick to the present! For that, stay tuned for a second post!

Previous Post <<||>> Next Post

25 January 2010

Data Management: Data Quality Dimensions (Part VII: Structuredness)

Data Management
Data Management Series

Barry Boehm defines structuredness as 'the degree to which a system or component possesses a definite pattern of organization of its interdependent parts' [1]. Transposed to data, this refers to the 'pattern of organization' that can be observed in data, mainly the format in which the data are stored at the macro level (file or any other type of digital container) or at the micro level (tags, groupings, sentences, paragraphs, tables, etc.), several levels of structure of different types thus emerging.

From the various sources in which data are stored (databases, Excel files and other types of data sheets, text files, emails, documentation, meeting minutes, charts, images, intranet or extranet web sites), multiple structures can be derived, sometimes coexisting in the same document, some of them quite difficult to perceive. From the structuredness point of view, data can be categorized as structured, semi-structured and unstructured.

In general, the term structured data refers to structures that can be easily perceived or known, leaving no doubt about the structure’s delimitations. Unstructured data refers to textual data and media content (video, sound, images), in which the structural patterns, even if they exist, are hard to discover or are not predefined, while semi-structured data refers to islands of structured data stored together with unstructured data, or vice versa.

From this perspective, according to [3], database and file systems as well as data exchange formats are examples of semi-structured data, though from a programmer’s perspective databases are highly structured, and the same holds for XML files. As also remarked by [2], the terms structured data and unstructured data are often used ambiguously by different interest groups and in different contexts: web searching, data mining, semantics, etc.

Data structuredness is important especially when considering the processing of data with the help of machines, the correct parsing of data being highly dependent on the knowledge about the data structure, either defined beforehand or deduced. The more structured the data, and the more evident and standardized the structure, the easier it should be to process the data. Merrill Lynch estimates that 85% of the data in an organization are in unstructured form, most probably with this number covering semi-structured data too. Making such data available in a structured format requires an important volume of manual work, eventually combined with reliable data/text mining techniques, a fact that considerably reduces the value of such data.
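
As a sketch of why such work is laborious, the snippet below extracts a few structured fields from one invented sentence with a regular expression; the invoice text and field names are hypothetical, and the pattern breaks as soon as the wording changes, which is precisely the problem with unstructured data.

import re

# Free text can hold the same facts as a table row, but a machine
# must first discover the structure. A fragile regex-based extraction:
note = "Invoice 4711 was paid on 2010-01-25 by ACME Corp, amount 1250.50 EUR."

pattern = re.compile(
    r"Invoice (?P<invoice>\d+) was paid on (?P<date>\d{4}-\d{2}-\d{2}) "
    r"by (?P<payer>.+?), amount (?P<amount>[\d.]+) (?P<currency>[A-Z]{3})"
)
match = pattern.search(note)
if match:
    print(match.groupdict())
    # {'invoice': '4711', 'date': '2010-01-25', 'payer': 'ACME Corp',
    #  'amount': '1250.50', 'currency': 'EUR'}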

Text, relational, multidimensional, object, graph or XML-based DBMSs are in theory the easiest to process, map and integrate, though that might not be as simple as it looks, given the different architectures vendors come up with and the fact that the structures evolve over time. To bridge the structural and architectural differences, many vendors make it possible to access data over standard interfaces (e.g. ODBC), though there are also systems that provide only proprietary interfaces, making the data difficult to obtain in an automated manner. There are also other types of technical issues, related mainly to the different data types and data formats, though such issues can usually be overcome easily.
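
As a sketch of accessing data over such a standard interface, the snippet below reads a few rows through ODBC using the pyodbc library; the driver name, server, database and table are placeholders to adapt to the actual environment.

import pyodbc  # third-party: pip install pyodbc

# Placeholder connection string for a SQL Server ODBC driver;
# only this string changes when targeting another ODBC-capable DBMS.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=SalesDB;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Hypothetical table; the point is that the access code stays generic.
cursor.execute("SELECT TOP 10 OrderId, OrderDate, Amount FROM dbo.SalesOrder")
for row in cursor.fetchall():
    print(row.OrderId, row.OrderDate, row.Amount)

conn.close()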

In the context of Data Quality, the structuredness dimension refers to the degree to which the structure in which the data are stored matches the expectations, i.e. the syntactic set of rules defining it, considered across the whole set of records. Even a minor deviation in the structure of a record can lead to processing errors and unexpected behavior. The simplest example is a delimited text file: if any of the characters used to delimit the structure of the file also appears in the data itself, then there are high chances that the file will be parsed incorrectly, or the parsing will fail unless the issues are corrected.
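
The delimiter problem is easy to reproduce; the sketch below uses Python's csv module on a made-up semicolon-delimited record to show how a stray delimiter inside a value shifts the fields, and how quoting restores the expected structure.

import csv, io

# A delimiter that also occurs inside a value breaks naive parsing:
raw = "Id;Name;City\n1;Doe; Jane;Berlin\n"   # the name itself contains ';'
for record in csv.reader(io.StringIO(raw), delimiter=";"):
    print(record)   # the second row yields 4 fields instead of 3

# Quoting the affected value restores a well-defined structure:
fixed = 'Id;Name;City\n1;"Doe; Jane";Berlin\n'
for record in csv.reader(io.StringIO(fixed), delimiter=";"):
    print(record)   # ['1', 'Doe; Jane', 'Berlin'] - 3 fields again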


Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] Barry W Boehm et al (1978) "Characteristics of Software Quality"
[2] D Norfolk (2006) "Structured data is boring and useless", The Register (link)
[3] P Wood (?) "Semi-structured Data"
