SQL Troubles: semi-structured data


01 February 2018

🔬Data Science: MapReduce (Definitions)

"A data processing and aggregation paradigm consisting of a 'map' phase that selects data and a 'reduce' phase that transforms the data. In MongoDB, you can run arbitrary aggregations over data using map-reduce." (MongoDb, "Glossary", 2008)

"A divide-and-conquer strategy for processing large data sets in parallel. In the 'map' phase, the data sets are subdivided. The desired computation is performed on each subset. The 'reduce' phase combines the results of the subset calculations into a final result. MapReduce frameworks handle the details of managing the operations and the nodes they run on, including restarting operations that fail for some reason. The user of the framework only has to write the algorithms for mapping and reducing the data sets and computing with the subsets." (Dean Wampler & Alex Payne, "Programming Scala", 2009)

"A method by which computationally intensive problems can be processed on multiple computers in parallel. The method can be divided into a mapping step and a reducing step. In the mapping step, a master computer divides a problem into smaller problems that are distributed to other computers. In the reducing step, the master computer collects the output from the other computers. Although MapReduce is intended for Big Data resources, holding petabytes of data, most Big Data problems do not require MapReduce." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"An early Big Data (before this term became popular) programming solution originally developed by Google for parallel processing using very large data sets distributed across a number of computing and storage systems. A Hadoop implementation of MapReduce is now available." (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)

"Designed by Google as a way of efficiently executing a set of functions against a large amount of data in batch mode. The 'map' component distributes the programming problem or tasks across a large number of systems and handles the placement of the tasks in a way that balances the load and manages recovery from failures. After the distributed computation is completed, another function called 'reduce' aggregates all the elements back together to provide a result." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A programming model consisting of two logical steps - Map and Reduce - for processing massively parallelizable problems across extremely large datasets using a large cluster of commodity computers." (Haoliang Wang et al, "Accessing Big Data in the Cloud Using Mobile Devices", Handbook of Research on Cloud Infrastructures for Big Data Analytics, 2014)

"Algorithm that is used to split massive data sets among many commodity hardware pieces in an effort to reduce computing time." (Billie Anderson & J Michael Hardin, "Harnessing the Power of Big Data Analytics", Encyclopedia of Business Analytics and Optimization, 2014)

"MapReduce is a parallel programming model proposed by Google and is used to distribute computing on clusters of computers for processing large data sets." (Jyotsna T Wassan, "Emergence of NoSQL Platforms for Big Data Needs", Encyclopedia of Business Analytics and Optimization, 2014)

"A concept which is an abstraction of the primitives ‘map’ and ‘reduce’. Most of the computations are carried by applying a ‘map’ operation to each global record in order to generate key/value pairs and then apply the reduce operation in order to combine the derived data appropriately." (P S Shivalkar & B K Tripathy, "Rough Set Based Green Cloud Computing in Emerging Markets", Encyclopedia of Information Science and Technology 3rd Ed., 2015)

"A programming model that uses a divide and conquer method to speed-up processing large datasets, with a special focus on semi-structured data." (Alfredo Cuzzocrea & Mohamed M Gaber, "Data Science and Distributed Intelligence", Encyclopedia of Information Science and Technology 3rd Ed., 2015)

"MapReduce is a programming model for general-purpose parallelization of data-intensive processing. MapReduce divides the processing into two phases: a mapping phase, in which data is broken up into chunks that can be processed by separate threads - potentially running on separate machines; and a reduce phase, which combines the output from the mappers into the final result." (Guy Harrison, "Next Generation Databases: NoSQL, NewSQL, and Big Data", 2015)

"MapReduce is a technological framework for processing parallelize-able problems across huge data sets using a large number of computers (nodes). […] MapReduce consists of two major steps: 'Map' and 'Reduce'. They are similar to the original Fork and Join operations in distributed systems, but they can consider a large number of computers that can be constructed based on the Internet cloud. In the Map-step, the master computer (a node) first divides the input into smaller sub-problems and then distributes them to worker computers (worker nodes). A worker node may also be a sub-master node to distribute the sub-problem into even smaller problems that will form a multi-level structure of a task tree. The worker node can solve the sub-problem and report the results back to its upper level master node. In the Reduce-step, the master node will collect the results from the worker nodes and then combine the answers in an output (solution) of the original problem." (Li M Chen et al, "Mathematical Problems in Data Science: Theoretical and Practical Methods", 2015)

"A programming model which process massive amounts of unstructured data in parallel and distributed cluster of processors." (Fatma Mohamed et al, "Data Streams Processing Techniques Data Streams Processing Techniques", Handbook of Research on Machine Learning Innovations and Trends, 2017)

"A data processing framework of Hadoop which provides data intensive computation of large data sets by dividing tasks across several machines and finally combining the result." (Rupali Ahuja, "Hadoop Framework for Handling Big Data Needs", Handbook of Research on Big Data Storage and Visualization Techniques, 2018)

"A high-level programming model, which uses the “map” and “reduce” functions, for processing high volumes of data." (Carson K.-S. Leung, "Big Data Analysis and Mining", Encyclopedia of Information Science and Technology 4th Ed., 2018)

"Is a computational paradigm for processing massive datasets in parallel if the computation fits a three-step pattern: map, shard and reduce. The map process is a parallel one. Each process executes on a different part of data and produces (key, value) pairs. The shard process collects the generated pairs, sorts and partitions them. Each partition is assigned to a different reduce process which produces a single result." (Venkat Gudivada et al, "Database Systems for Big Data Storage and Retrieval", Handbook of Research on Big Data Storage and Visualization Techniques, 2018)

"Is a programming model or algorithm for the processing of data using a parallel programming implementation and was originally used for academic purposes associated with parallel programming techniques. (Soraya Sedkaoui, "Understanding Data Analytics Is Good but Knowing How to Use It Is Better!", Big Data Analytics for Entrepreneurial Success, 2019)

"MapReduce is a style of programming based on functional programming that was the basis of Hadoop." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"Is a specific programming model, which as such represents a new approach to solving the problem of processing large amounts of differently structured data. It consists of two functions - Map (sorting and filtering data) and Reduce (summarizing intermediate results), and it is executed in parallel and distributed." (Savo Stupar et al, "Importance of Applying Big Data Concept in Marketing Decision Making", Handbook of Research on Applied AI for International Business and Marketing Applications, 2021)

"A software framework for processing vast amounts of data." (Analytics Insight)

15 January 2018

🔬Data Science: Semi-Structured Data (Definitions)

"Data that has flexible metadata, such as XML." (Marilyn Miller-White et al, "MCITP Administrator: Microsoft® SQL Server™ 2005 Optimization and Maintenance 70-444", 2007)

"'Text' documents, such as e-mail, word processing, presentations, and spreadsheets, whose content can be searched." (David G Hill, "Data Protection: Governance, Risk Management, and Compliance", 2009)

"Data that, although unstructured, still has some degree of structure. A good example is e-mail: Even though it is predominantly text, it has logical blocks with different purposes." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"Data that have already been processed to some extent." (Carlos Coronel & Steven Morris, "Database Systems: Design, Implementation, & Management" 11th Ed., 2014)

"A structured data type that does not have a formal definition, like a document. It has tags or other markers to enforce a hierarchy of records within a particular object, but may be different from another object." (Jason Williamson, Getting a Big Data Job For Dummies, 2015)

"Semi-structured data has some structures that are often manifested in images and data from sensors." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"a form a structured data that does not have a formal structure like structured data. It does however have tags or other markers to enforce hierarchy of records." (Analytics Insight)

25 January 2010

🗄️Data Management: Data Quality Dimensions (Part VII: Structuredness)

Data Management Series

Barry Boehm defines structuredness as 'the degree to which a system or component possesses a definite pattern of organization of its interdependent parts' [1]. Transposed to data, it refers to the 'pattern of organization' that can be observed in data, mainly the format in which the data are stored at macro-level (file or any other type of digital container) or micro-level (tags, groupings, sentences, paragraphs, tables, etc.), so that several levels of structure of different types emerge.

Data are stored in various sources: databases, Excel files and other types of data sheets, text files, emails, documentation, meeting minutes, charts, images, intranet or extranet web sites. From these, multiple structures can be derived that coexist in the same document, some of them quite difficult to perceive. From the structuredness point of view, data can be categorized as structured, semi-structured and unstructured.

In general, the term structured data refers to structures that can be easily perceived or known and that raise no doubt about the structure's delimitations. Unstructured data refers to textual data and media content (video, sound, images), in which structural patterns, even if they exist, are hard to discover or are not predefined, while semi-structured data refers to islands of structured data stored with unstructured data, or vice versa.

From this perspective, according to [3], database and file systems and data exchange formats are examples of semi-structured data, though from a programmer's perspective databases are highly structured, and the same holds for XML files. As also remarked by [2], the terms structured data and unstructured data are often used ambiguously by different interest groups, in different contexts: web searching, data mining, semantics, etc.

Data structuredness is important especially when the data are processed by machines, since the correct parsing of data depends heavily on knowledge of the data structure, either defined beforehand or deduced. The more structured the data and the more evident and standardized the structure, the easier it should be to process the data. Merrill Lynch estimates that 85% of the data in an organization are in unstructured form, a number that most probably includes semi-structured data too. Making such data available in a structured format requires a considerable volume of manual work, possibly combined with reliable data/text mining techniques, a fact that considerably reduces the value of such data.

Text, relational, multidimensional, object, graph or XML-based DBMS are in theory the easiest to process, map and integrate, though that might not be as simple as it looks, given the different architectures vendors come with and the fact that the structures evolve over time. To bridge the structural and architectural differences, many vendors make it possible to access data over standard interfaces (e.g. ODBC), though there are also systems that provide only proprietary interfaces, making data difficult to obtain in an automated manner. There are also other types of technical issues related mainly to the different data types and data formats, though such issues can be easily overcome.

In the context of Data Quality, the structuredness dimension refers to the degree to which the structure in which the data are stored matches the expectations, that is, the syntactic set of rules defining it, considered across the whole set of records. Even a minor inconsistency in the structure of a record could lead to processing errors and unexpected behavior. The simplest example is a delimited text file: if any of the characters used to delimit the structure of the file appears in the data itself, then chances are high that the file will be parsed incorrectly, or that parsing will fail until the issues are corrected.
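
An e-mail message is a convenient illustration of such "islands": the headers follow a known pattern, while the body is free text. The sketch below uses Python's standard `email` package; the message itself is made up for the example:

```python
from email import message_from_string

# Hypothetical message: structured headers, then a blank line, then free text.
raw = (
    "From: ana@example.com\n"
    "To: ben@example.com\n"
    "Subject: Quarterly figures\n"
    "\n"
    "Hi Ben, the numbers look fine, but let's talk on Monday.\n"
)

msg = message_from_string(raw)

# The structured island: name/value pairs that can be read without guessing.
headers = dict(msg.items())

# The unstructured part: plain text whose meaning requires interpretation.
body = msg.get_payload()

print(headers)
print(body)
```

The headers can be queried like fields of a record, whereas extracting anything from the body (who should meet whom, and when) would need text mining.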
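
The delimited-file problem can be reproduced in a few lines. The following sketch, which assumes a comma-delimited file and uses made-up sample rows, compares a naive split with Python's standard `csv` module and adds a simple check that every record has the same number of fields as the header (the helper name `has_consistent_structure` is illustrative):

```python
import csv
import io

# Hypothetical data: the second record contains the delimiter inside a field.
rows = [["id", "name", "city"], ["1", "Smith, John", "Berlin"]]

# Written naively, the embedded comma is indistinguishable from a delimiter.
naive_text = "\n".join(",".join(row) for row in rows)
naive_parsed = [line.split(",") for line in naive_text.splitlines()]

# csv.writer quotes such fields, so csv.reader can recover the structure.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
quoted_parsed = list(csv.reader(io.StringIO(buffer.getvalue())))

def has_consistent_structure(records):
    """Check that every record has as many fields as the header."""
    return all(len(record) == len(records[0]) for record in records)

print(has_consistent_structure(naive_parsed))
print(has_consistent_structure(quoted_parsed))
```

A field-count check of this kind is only a first line of defense: it catches a record that gained or lost fields, but not one where two errors cancel each other out.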


Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] Barry W Boehm et al (1978) "Characteristics of software quality"
[2] The Register (2006) "Structured data is boring and useless", by D. Nortfolk
[3] P Wood (?) "Semi-structured Data"
