SQL Troubles: unstructured data

05 May 2018

🔬Data Science: Clustering (Definitions)

"Grouping of similar patterns together. In this text the term 'clustering' is used only for unsupervised learning problems in which the desired groupings are not known in advance." (Laurene V Fausett, "Fundamentals of Neural Networks: Architectures, Algorithms, and Applications", 1994)

"The process of grouping similar input patterns together using an unsupervised training algorithm." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

"Clustering attempts to identify groups of observations with similar characteristics." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"The process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects, which are 'similar' between them and are 'dissimilar' to the objects belonging to other clusters." (Juan R González et al, "Nature-Inspired Cooperative Strategies for Optimization", 2008)

"Grouping the nodes of an ad hoc network such that each group is a self-organized entity having a cluster-head which is responsible for formation and management of its cluster." (Prayag Narula, "Evolutionary Computing Approach for Ad-Hoc Networks", 2009)

"The process of assigning individual data items into groups (called clusters) so that items from the same cluster are more similar to each other than items from different clusters. Often similarity is assessed according to a distance measure." (Alfredo Vellido & Iván Olie, "Clustering and Visualization of Multivariate Time Series", 2010)

"Verb. To output a smaller data set based on grouping criteria of common attributes." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The process of partitioning the data attributes of an entity or table into subsets or clusters of similar attributes, based on subject matter or characteristic (domain)." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A data mining technique that analyzes data to group records together according to their location within the multidimensional attribute space." (SQL Server 2012 Glossary, "Microsoft", 2012)

"Clustering aims to partition data into groups called clusters. Clustering is usually unsupervised in the sense that the training data is not labeled. Some clustering algorithms require a guess for the number of clusters, while other algorithms don't." (Ivan Idris, "Python Data Analysis", 2014)

"Form of data analysis that groups observations to clusters. Similar observations are grouped in the same cluster, whereas dissimilar observations are grouped in different clusters. As opposed to classification, there is not a class attribute and no predefined classes exist." (Efstathios Kirkos, "Composite Classifiers for Bankruptcy Prediction", 2014)

"Organization of data in some semantically meaningful way such that each cluster contains related data while the unrelated data are assigned to different clusters. The clusters may not be predefined." (Sanjiv K Bhatia & Jitender S Deogun, "Data Mining Tools: Association Rules", 2014)

"Techniques for organizing data into groups of similar cases." (Meta S Brown, "Data Mining For Dummies", 2014)

[cluster analysis:] "A technique that identifies homogenous subgroups or clusters of subjects or study objects." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"Clustering is a classification technique where similar kinds of objects are grouped together. The similarity between the objects maybe determined in different ways depending upon the use case. Therefore, clustering in measurement space may be an indicator of similarity of image regions, and may be used for segmentation purposes." (Shiwangi Chhawchharia, "Improved Lymphocyte Image Segmentation Using Near Sets for ALL Detection", 2016)

"Clustering techniques share the goal of creating meaningful categories from a collection of items whose properties are hard to directly perceive and evaluate, which implies that category membership cannot easily be reduced to specific property tests and instead must be based on similarity. The end result of clustering is a statistically optimal set of categories in which the similarity of all the items within a category is larger than the similarity of items that belong to different categories." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed., 2016)

[cluster analysis:] "A statistical technique for finding natural groupings in data; it can also be used to assign new cases to groupings or categories." (Jonathan Ferrar et al, "The Power of People", 2017)

"Clustering or cluster analysis is a set of techniques of multivariate data analysis aimed at selecting and grouping homogeneous elements in a data set. Clustering techniques are based on measures relating to the similarity between the elements. In many approaches this similarity, or better, dissimilarity, is designed in terms of distance in a multidimensional space. Clustering algorithms group items on the basis of their mutual distance, and then the belonging to a set or not depends on how the element under consideration is distant from the collection itself." (Crescenzio Gallo, "Building Gene Networks by Analyzing Gene Expression Profiles", 2018)

"Unsupervised learning or clustering is a way of discovering hidden structure in unlabeled data. Clustering algorithms aim to discover latent patterns in unlabeled data using features to organize instances into meaningfully dissimilar groups." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"The term clustering refers to the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters." (Satyadhyan Chickerur et al, "Forecasting the Demand of Agricultural Crops/Commodity Using Business Intelligence Framework", 2019)

"In the machine learning context, clustering is the task of grouping examples into related groups. This is generally an unsupervised task, that is, the algorithm does not use preexisting labels, though there do exist some supervised clustering algorithms." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"A cluster is a group of data objects which have similarities among them. It's a group of the same or similar elements gathered or occurring closely together." (Hari K Kondaveeti et al, "Deep Learning Applications in Agriculture: The Role of Deep Learning in Smart Agriculture", 2021)

"Clustering describes an unsupervised machine learning technique for identifying structures among unstructured data. Clustering algorithms group sets of similar objects into clusters, and are widely used in areas including image analysis, information retrieval, and bioinformatics." (Accenture)

"Describes an unsupervised machine learning technique for identifying structures among unstructured data. Clustering algorithms group sets of similar objects into clusters, and are widely used in areas including image analysis, information retrieval, and bioinformatics." (Accenture)

"The process of identifying objects that are similar to each other and cluster them in order to understand the differences as well as the similarities within the data." (Analytics Insight)

15 January 2018

🔬Data Science: Big Data (Definitions)

"Big Data: when the size and performance requirements for data management become significant design and decision factors for implementing a data management and analysis system. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration." (Jimmy Guterman, 2009)

"A buzzword for the challenges of and approaches to working with data sets that are too big to manage with traditional tools, such as relational databases. So called NoSQL databases, clustered data processing tools like MapReduce, and other tools are used to gather, store, and analyze such data sets." (Dean Wampler, "Functional Programming for Java Developers", 2011)

"Big data: techniques and technologies that make handling data at extreme scale economical." (Brian Hopkins, "Big Data, Brewer, And A Couple Of Webinars", 2011) [source]

"Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value." (McKinsey & Co., "Big Data: The Next Frontier for Innovation, Competition, and Productivity", 2011)

"Data volumes that are exceptionally large, normally greater than 100 Terabyte and more commonly refer to the Petabyte and Exabyte range. Big data has begun to be used when discussing Data Warehousing and analytic solutions where the volume of data poses specific challenges that are unique to very large volumes of data including: data loading, modeling, cleansing, and analytics, and are often solved using massively parallel processing, or parallel processing and distributed data solutions." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it." (Edd Wilder-James, "What is big data?", 2012) [source]

"A collection of data whose very size, rate of accumulation, or increased complexity makes it difficult to analyze and comprehend in a timely and accurate manner." (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)

"A colloquial term referring to exceedingly large datasets that are otherwise unwieldy to deal with in a reasonable amount of time in the absence of specialized tools. They are different from normal data in terms of volume, velocity, and variety and typically require unique approaches for capture, processing, analysis, search, and visualization." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"Big data is the term increasingly used to describe the process of applying serious computing power – the latest in machine learning and artificial intelligence – to seriously massive and often highly complex sets of information." (Microsoft, 2013) [source]

"Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away." (Tim O’Reilly, [email correspondence, 2013)

"The capability to manage a huge volume of disparate data, at the right speed and within the right time frame, to allow real-time analysis and reaction. Big data is typically broken down by three characteristics, including volume (how much data), velocity (how fast that data is processed), and variety (the various types of data)." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A colloquial term referring to datasets that are otherwise unwieldy to deal with in a reasonable amount of time in the absence of specialized tools. Common characteristics include large amounts of data (volume), different types of data (variety), and ever-increasing speed of generation (velocity). They typically require unique approaches for capture, processing, analysis, search, and visualization." (Evan Stubbs, "Big Data, Big Innovation", 2014)

"An extremely large database which generally defies standard methods of analysis." (Owen P. Hall Jr., "Teaching and Using Analytics in Management Education", 2014)

"Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze." (Xiuli He et al, Supply Chain Analytics: Challenges and Opportunities, 2014)

"More data than can be processed by today's database systems, or acutely high volume, velocity, and variety of information assets that demand IG to manage and leverage for decision-making insights and cost management." (Robert F Smallwood, "Information Governance: Concepts, Strategies, and Best Practices", 2014)

"The term that refers to data that has one or more of the following dimensions, known as the four Vs: Volume, Variety, Velocity, and Veracity." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"A collection of models, techniques and algorithms that aim at representing, managing, querying and mining large-scale amounts of data (mainly semi-structured data) in distributed environments (e.g., Clouds)." (Alfredo Cuzzocrea & Mohamed M Gaber, "Data Science and Distributed Intelligence", 2015)

"A process to deliver decision-making insights. The process uses people and technology to quickly analyze large amounts of data of different types (traditional table structured data and unstructured data, such as pictures, video, email, and Tweets) from a variety of sources to produce a stream of actionable knowledge." (James R Kalyvas & Michael R Overly, "Big Data: A Businessand Legal Guide", 2015)

"A relative term referring to data that is difficult to process with conventional technology due to extreme values in one or more of three attributes: volume (how much data must be processed), variety (the complexity of the data to be processed) and velocity (the speed at which data is produced or at which it arrives for processing). As data management technologies improve, the threshold for what is considered big data rises. For example, a terabyte of slow-moving simple data was once considered big data, but today that is easily managed. In the future, a yottabyte data set may be manipulated on desktop, but for now it would be considered big data as it requires extraordinary measures to process." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"Big data is a discipline that deals with processing, storing, and analyzing heterogeneous (structured/semistructured/unstructured) large data sets that cannot be handled by traditional information management technologies that have been used to process structured data. Gartner defined big data based on the three Vs: volume, velocity, and variety." (Saumya Chaki, "Enterprise Information Management in Practice", 2015)

"Records that are so large (terabytes and exabytes) and diverse (from sensors to social media data) that they require new, powerful technologies for storage, management, analysis and visualization." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"Term used to describe the exponential growth, variety, and availability of data, both structured and unstructured." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"A broad term for large and complex data sets that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set." (Suren Behari, "Data Science and Big Data Analytics in Financial Services: A Case Study", 2016)

"A combination of facts and artifacts drawn from a myriad of sources and stored without regard to rational or normalized disciplines or structures." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"A term that describes a large dataset that grows in size over time. It refers to the size of dataset that exceeds the capturing, storage, management, and analysis of traditional databases. The term refers to the dataset that has large, more varied, and complex structure, accompanies by difficulties of data storage, analysis, and visualization. Big Data are characterized with their high-volume, -velocity and –variety information assets." (Kenneth C C Yang & Yowei Kang, "Real-Time Bidding Advertising: Challenges and Opportunities for Advertising Curriculum, Research, and Practice", 2016)

"Big data is a blanket term for any collection of data sets so large or complex that it becomes difficult to process them using traditional data management techniques such as, for example, the RDBMS (relational database management systems)." (Davy Cielen et al, "Introducing Data Science", 2016)

"For digital resources, inexpensive storage and high bandwidth have largely eliminated capacity as a constraint for organizing systems, with an exception for big data, which is defined as a collection of data that is too big to be managed by typical database software and hardware architectures." (Robert J Glushko, "The Discipline of Organizing: Professional Edition, 4th Ed", 2016)

"Large sets of data that are leveraged to make better business decisions. Retail data can be sales, product inventory, e-mail offers, customer information, competitor pricing, product descriptions, social media, and much more." (Brittany Bullard, "Style and Statistics", 2016)

"A term used to describe large sets of structured and unstructured data. Data sets are continually increasing in size and may grow too large for traditional storage and retrieval. Data may be captured and analyzed as it is created and then stored in files." (Daniel J Power & Ciara Heavin, "Decision Support, Analytics, and Business Intelligence" 3rd Ed., 2017)

"Datasets of structured and unstructured information that are so large and complex that they cannot be adequately processed and analyzed with traditional data tools and applications. |" (Jonathan Ferrar et al, "The Power of People", 2017)

"Big data are often defined in terms of the three Vs: the extreme volume of data, the variety of the data types, and the velocity at which the data must be processed." (John D Kelleher & Brendan Tierney, "Data science", 2018)

"Very large data volumes that are complex and varied, and often collected and must be analyzed in real time." (Daniel J. Power & Ciara Heavin, "Data-Based Decision Making and Digital Transformation", 2018)

"A generic term that designates the massive volume of data that is generated by the increasing use of digital tools and information systems. The term big data is used when the amount of data that an organization has to manage reaches a critical volume that requires new technological approaches in terms of storage, processing, and usage. Volume, velocity, and variety are usually the three criteria used to qualify a database as 'big data'." (Soraya Sedkaoui, "Big Data Analytics for Entrepreneurial Success", 2019)

"Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation." (Thomas Ochs & Ute A Riemann, "IT Strategy Follows Digitalization", 2019)

"The capability to manage a huge volume of disparate data, at the right speed and within the right time frame, to allow real time analysis and reaction." (K Hariharanath, "BIG Data: An Enabler in Developing Business Models in Cloud Computing Environments", 2019)

"A term used to refer to the massive datasets generated in the digital age. Both the volume and speed at which data are generated is far greater than in the past and requires powerful computing technologies." (Osman Kandara & Eugene Kennedy, "Educational Data Mining: A Guide for Educational Researchers", 2020)

"Refers to data sets that are so voluminous and complex that traditional data processing application software is inadequate to deal with them." (James O Odia & Osaheni T Akpata, "Role of Data Science and Data Analytics in Forensic Accounting and Fraud Detection", 2021)

"The evolving term that describes a large volume of structured, semi-structured and unstructured data that has the potential to be mined for information and used in machine learning projects and other advanced analytics applications." (Nenad Stefanovic, "Big Data Analytics in Supply Chain Management", 2021)

"The term 'big data' is related to gathering and storing extra-large volume of structured, semi-structured and unstructured data with high Velocity and Variability to be used in advanced analytics applications." (Ahmad M Kabil, Integrating Big Data Technology Into Organizational Decision Support Systems, 2021)

"A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." (Board International)

"A collection of data so large that it cannot be stored, transmitted or processed by traditional means." (Open Data Handbook)

"an accumulation of data that is too large and complex for processing by traditional database management tools" (Merriam-Webster)

"Extremely large data sets that may be analyzed to reveal patterns and trends and that are typically too complex to be dealt with using traditional processing techniques." (Solutions Review)

"is a term for very large and complex datasets that exceed the ability of traditional data processing applications to deal with them. Big data technologies include data virtualization, data integration tools, and search and knowledge discovery tools." (Accenture)

"The practices and technology that close the gap between the data available and the ability to turn that data into business insight." (Forrester)

"Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage and process the data with low latency. Big data has one or more of the following characteristics: high volume, high velocity or high variety." (IBM) [source]

"Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves." (SAS) [source]

"Big data is a combination of structured, semistructured and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modeling and other advanced analytics applications." (Techtarget)

"Big data is a term used for large data sets that include structured, semi-structured, and unstructured data." (Xplenty) [source]

"Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation." (Gartner)

"Big data is the catch-all term used to describe gathering, analyzing, and storing massive amounts of digital information to improve operations." (Talend) [source]

"Big data refers to the 21st-century phenomenon of exponential growth of business data, and the challenges that come with it, including holistic collection, storage, management, and analysis of all the data that a business owns or uses." (Informatica) [source]

14 January 2018

🔬Data Science: Unstructured Data (Definitions)

"Data that does not neatly fit into a tabular structure with well-defined and bounded definitions. Examples of unstructured data are e-mail messages and video streams. Many customer databases contain comment fields where customer service reps put in additional notes about customers." (Jill Dyché & Evan Levy, "Customer Data Integration: Reaching a Single Version of the Truth", 2006)

"Computerised information which does not have a data structure that is easily readable by a machine, including audio, video and unstructured text such as the body of a word-processed document - effectively this is the same as multimedia data." (Keith Gordon, "Principles of Data Management", 2007)

"Data that has no metadata, such as text files." (Victor Isakov et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance (70-444) Study Guide", 2007)

"Natively bitmapped data, such as video, audio, pictures, and MRI scans, that can be sensed either visually, audibly, or both." (David G Hill, "Data Protection: Governance, Risk Management, and Compliance", 2009)

"Data that does not fit into a structured data model or does not fit well into relational tables. Common examples include binary information such as video or audio and free-text information." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"Data that does not follow a specified data format. Unstructured data can be text, video, images, and so on." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"Unstructured data has no real structure, such as the data in an email and a memo. Interestingly, estimates have 85% of all business information as unstructured data. There are now many products coming on the market that can put some structure into unstructured data so that it can be categorized or organized hierarchically." (Michael M David & Lee Fesperman, "Advanced SQL Dynamic Data Modeling and Hierarchical Processing", 2013)

"Data that exist in their original (raw) state; that is in the format in which they were collected." (Carlos Coronel & Steven Morris, "Database Systems: Design, Implementation, & Management Ed. 11", 2014)

"Data whose logical organization is not apparent to the computer" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"Information (typically stored digitally) that either does not have a predefined data model or is not organized in a predefined manner. Most unstructured data is created by humans and includes email, documents, text messages, tweets, blogs, and more." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"Text, audio, video, and other types of complex data that won’t easily fit into a conventional relational database. Unstructured data isn’t as simple as the numbers and short strings that most data analysts use." (Meta S Brown, "Data Mining For Dummies", 2014)

"Data that cannot fit cleanly into a predefined structure." (Evan Stubbs, "Big Data, Big Innovation", 2014)

"Data without data model or that a computer program cannot easily use (in the sense of understanding its content). Examples are word processing documents or electronic mail" (Hasso Plattner, "A Course in In-Memory Data Management: The Inner Mechanics of In-Memory Databases" 2nd Ed., 2014)

"Data (generally text-based) which is not presented in a structured form such as a database, ontology, table, etc. Newspaper articles, government reports, blogs, and e-mails are all examples of unstructured data." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"Data that doesn’t fit into a fixed and strict definition. Things like sound files, images, text, and web pages can be considered unstructured data." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"Information that does not follow a specified data format. Unstructured data can be text, video, images, and such." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"Data that does not have a specific format. It can be customer reviews, tweets, pictures, or even hashtags." (Brittany Bullard, "Style and Statistics", 2016)

"A type of data where each instance in the data set may have its own internal structure; that is, the structure is not necessarily the same in every instance. For example, text data are often unstructured and require a sequence of operations to be applied to them in order to extract a structured representation for each instance." (John D Kelleher & Brendan Tierney, "Data science", 2018)

25 January 2010

🗄️Data Management: Data Quality Dimensions (Part VII: Structuredness)

Data Management Series

Barry Boehm defines structuredness as 'the degree to which a system or component possesses a definite pattern of organization of its interdependent parts' [1]. Transposed to data, it refers to the 'pattern of organization' that can be observed in the data, mainly the format in which the data are stored at macro level (a file or any other type of digital container) or at micro level (tags, groupings, sentences, paragraphs, tables, etc.), so that several levels of structure of different types emerge.

Data are stored in various sources: databases, Excel files and other types of data sheets, text files, emails, documentation, meeting minutes, charts, images, intranet or extranet web sites. Multiple structures can coexist in the same document, some of them quite difficult to perceive. From the structuredness point of view, data can be categorized as structured, semi-structured and unstructured.

In general, the term structured data refers to structures that can be easily perceived or known and that raise no doubt about the structure’s delimitations. Unstructured data refers to textual data and media content (video, sound, images), in which structural patterns, even if they exist, are hard to discover or are not predefined. Semi-structured data refers to islands of structured data stored with unstructured data, or vice versa.
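To make the three categories more tangible, here is a small sketch that holds the same fact in each form and shows what it takes to retrieve it. The sample records, field names, and the regular expression are hypothetical and chosen only for illustration.

```python
import csv
import io
import json
import re

# Structured: the columns are known in advance, so the value is
# addressed directly by its field name.
structured = "id,name,city\n1,Ann,Oslo\n"
rows = list(csv.DictReader(io.StringIO(structured)))
city_structured = rows[0]["city"]

# Semi-structured: named fields (an island of structure) sit next to
# free text; "name" is addressable, the city is buried in "note".
semi_structured = '{"id": 1, "name": "Ann", "note": "Moved to Oslo in 2009."}'
doc = json.loads(semi_structured)
name_semi = doc["name"]

# Unstructured: no predefined pattern, so the value has to be guessed
# with a heuristic that may easily miss other phrasings.
unstructured = "Ann called to say she moved to Oslo in 2009."
match = re.search(r"moved to (\w+)", unstructured)
city_unstructured = match.group(1) if match else None

print(city_structured, name_semi, city_unstructured)
```

The effort and the fragility grow from the first case to the last: the regular expression works only for sentences phrased exactly this way.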

From this perspective, according to [3], database systems, file systems and data exchange formats are examples of semi-structured data, though from a programmer’s perspective databases are highly structured, and the same holds for XML files. As also remarked by [2], the terms structured data and unstructured data are often used ambiguously by different interest groups and in different contexts (web searching, data mining, semantics, etc.).

Data structuredness is especially important when data are processed with the help of machines, since the correct parsing of data depends heavily on knowledge of the data structure, whether defined beforehand or deduced. The more structured the data, and the more evident and standardized the structure, the easier the data should be to process. Merrill Lynch estimates that 85% of the data in an organization are in unstructured form; most probably this number covers semi-structured data too. Making such data available in a structured format requires a significant volume of manual work, possibly combined with reliable data/text mining techniques, a fact that considerably reduces the value of such data.

Data stored in text, relational, multidimensional, object, graph or XML-based DBMS are in theory the easiest to process, map and integrate, though that might not be as simple as it looks, given the different architectures vendors come with and the fact that structures evolve over time. To bridge the structural and architectural differences, many vendors make it possible to access data over standard interfaces (e.g. ODBC), though there are also systems that provide only proprietary interfaces, which makes data difficult to obtain in an automated manner. There are also other technical issues, related mainly to differing data types and data formats, though such issues can be easily overcome.

In the context of Data Quality, the structuredness dimension refers to the degree to which the structure in which the data are stored matches the expectations, that is, the syntactic set of rules defining it, considered across the whole set of records. Even a minor inconsistency in the structure of a record could lead to processing errors and unexpected behavior. The simplest example is a delimited text file: if any of the characters used to delimit the structure of the file also appears in the data itself, chances are high that the file will be parsed incorrectly, or that the parsing will fail unless the issues are corrected.
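The delimited-file problem can be reproduced in a few lines. The sketch below uses a hypothetical three-field record whose name contains the delimiter, compares naive joining and splitting with Python's standard `csv` module (which quotes such fields), and adds a simple check that every row has the expected number of fields. The record values and the helper name `rows_with_wrong_width` are assumptions made for the example.

```python
import csv
import io

# Hypothetical record: the second field contains the delimiter itself.
record = ["1001", "Smith, John", "Berlin"]

# Naive approach: join and split on the comma without any quoting.
naive_line = ",".join(record)
naive_fields = naive_line.split(",")  # the name is split into two fields

# The csv module quotes fields that contain the delimiter when writing
# and honors the quotes when reading.
buffer = io.StringIO()
csv.writer(buffer).writerow(record)
quoted_line = buffer.getvalue()
parsed_fields = next(csv.reader(io.StringIO(quoted_line)))


def rows_with_wrong_width(rows, expected):
    """Return the positions of rows whose field count differs from expected."""
    return [i for i, row in enumerate(rows) if len(row) != expected]


bad_rows = rows_with_wrong_width([naive_fields, parsed_fields], expected=3)

print(len(naive_fields), len(parsed_fields), bad_rows)
```

A field-count check of this kind catches only the mismatches that change the number of fields; a stray delimiter combined with a missing value elsewhere in the same row would still pass unnoticed.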


Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] Barry W Boehm et al (1978) "Characteristics of software quality"
[2] The Register (2006) "Structured data is boring and useless", by D. Norfolk
[3] P Wood (?) "Semi-structured Data"

09 August 2009

🛢DBMS: NoSQL (Definitions)

"An umbrella term for non-relational data stores, hence the name. These stores sacrifice ACID transactions for greater scalability and availability." (Dean Wampler, "Functional Programming for Java Developers", 2011)

"A set of technologies that created a broad array of database management systems that are distinct from relational database systems. One major difference is that SQL is not used as the primary query language. These database management systems are also designed for distributed data stores." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A class of database management systems that consist of non-relational, distributed data stores. These systems are optimized for supporting the storage and retrieval requirements of massive-scale data-intensive applications." (IBM, "Informix Servers 12.1", 2014)

"A database that doesn’t adhere to relational database structures. Used to organize and query unstructured data." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"Any of a class of database management systems that reject the limitations and drawbacks dictated by, or associated with, the relational model. NoSQL products tend to specialize in a single or limited number of areas, such as high-performance processing, big data (giga-record systems), diverse data types (video, pictures, mathematical models), documents, and so on. Their specialized focus often requires deemphasizing other areas such as data consistency and backup and recovery." (George Tillmann, "Usage-Driven Database Design: From Logical Data Modeling through Physical Schema Definition", 2017)

"In general, NoSQL databases provide a mechanism for storage and retrieval of data modeled in means other than the tabular relations used in relational databases." (Prashant Natarajan et al, "Demystifying Big Data and Machine Learning for Healthcare", 2017)

"NoSQL means 'not only SQL' or 'no SQL at all'. Being a new type of non-relational databases, NoSQL databases are developed for efficient and scalable management of big data." (Zongmin Ma & Li Yan, "Towards Massive RDF Storage in NoSQL Databases: A Survey", 2019)

"A broad term for a set of data access technologies that do not use the SQL language as their primary mechanism for reading and writing data. Some NoSQL technologies act as key-value stores, only accepting single-value reads and writes; some relax the restrictions of the ACID methodology; still others do not require a pre-planned schema." (MySQL, "MySQL 8.0 Reference Manual Glossary")

"A NoSQL database is distinguished mainly by what it is not - it is not a structured relational database format that links multiple separate tables. NoSQL stands for 'not only SQL', meaning that SQL, or structured query language is not needed to extract and organize information. NoSQL databases tend to be more diverse and flatter than relational databases (in a flat database, all data is contained in the same, large table)." (Statistics.com)

"NoSQL is a database management system built for the complexities of working with Big Data. Unlike SQL, NoSQL does not store data in a relational format." (Xplenty)

"No-SQL (aka not only SQL) database systems are distributed, non-relational databases designed for large-scale data storage and for massively-parallel data processing across a large number of commodity servers." (IBM)

"NoSQL is short for 'not only SQL'. NoSQL databases include mechanisms for storage and retrieval of data based on means other than the tabular relations used in relational databases." (Idera)

"sometimes referred to as ‘Not only SQL’ as it is a database that doesn’t adhere to traditional relational database structures. It is more consistent and can achieve higher availability and horizontal scaling." (Analytics Insight)
