A Software Engineer and data professional's blog on SQL, data, databases, data architectures, data management, programming, Software Engineering, Project Management, ERP implementation and other IT related topics.
Pages
- 🏠Home
- 🗃️Posts
- 🗃️Definitions
- 🏭Fabric
- ⚡Power BI
- 🔢SQL Server
- 📚Data
- 📚Engineering
- 📚Management
- 📚SQL Server
- 📚Systems Thinking
- ✂...Quotes
- 🧾D365: GL
- 💸D365: AP
- 💰D365: AR
- 👥D365: HR
- ⛓️D365: SCM
- 🔤Acronyms
- 🪢Experts
- 🗃️Quotes
- 🔠Dataviz
- 🔠D365
- 🔠Fabric
- 🔠Engineering
- 🔠Management
- 🔡Glossary
- 🌐Resources
- 🏺Dataviz
- 🗺️Social
- 📅Events
- ℹ️ About
15 February 2018
🔬Data Science: Data Visualization (Definitions)
🔬Data Science: Optimization (Definitions)
"Term used to describe analytics that calculate and determine the most ideal scenario to meet a specific target. Optimization procedures analyze each scenario and supply a score. An optimization analytic can run through hundreds, even thousands, of scenarios and rank each one based on a target that is being achieved." (Brittany Bullard, "Style and Statistics", 2016)
"Optimization is the process of finding the most efficient algorithm for a given task." (Edward T Chen, "Deep Learning and Sustainable Telemedicine", 2020)
🔬Data Science: Speech Recognition (Definitions)
"Automatic decoding of a sound pattern into phonemes or words." (Guido Deboeck & Teuvo Kohonen (Eds), "Visual Explorations in Finance with Self-Organizing Maps" 2nd Ed., 2000)
"Speech recognition is a process through which machines convert words or phrases spoken into a machine-readable format." (Hari K Kondaveeti et al, "Deep Learning Applications in Agriculture: The Role of Deep Learning in Smart Agriculture", 2021)
13 February 2018
🔬Data Science: Data Model (Definitions)
"A model that describes in an abstract way how data is represented in an information system. A data model can be a part of ontology, which is a description of how data is represented in an entire domain" (Mark Olive, "SHARE: A European Healthgrid Roadmap", 2009)
"Description of the node structure that defines its entities, fields and relationships." (Roberto Barbera et al, "gLibrary/DRI: A Grid-Based Platform to Host Muliple Repositories for Digital Content", 2009)
"An abstract model that describes how data are presented, organized and related to." (Ali M Tanyer, "Design and Evaluation of an Integrated Design Practice Course in the Curriculum of Architecture", 2010)
"The first of a series of data models that more closely represented the real world, modeling both data and their relationships in a single structure known as an object. The SDM, published in 1981, was developed by M. Hammer and D. McLeod." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management 9th Ed", 2011)
"The way of organizing and representing data is data model." (Uma V & Jayanthi G, "Spatio-Temporal Hot Spot Analysis of Epidemic Diseases Using Geographic Information System for Improved Healthcare", 2019)
12 February 2018
🔬Data Science: Correlation (Definitions)
[correlation coefficient:] "A measure to determine how closely a scatterplot of two continuous variables falls on a straight line." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)
"A metric that measures the linear relationship between two process variables. Correlation describes the X and Y relationship with a single number (the Pearson’s Correlation Coefficient (r)), whereas regression summarizes the relationship with a line - the regression line." (Lynne Hambleton, "Treasure Chest of Six Sigma Growth Methods, Tools, and Best Practices", 2007)
[correlation coefficient:] "A measure of the degree of correlation between the two variables. The range of values it takes is between −1 and +1. A negative value of r indicates an inverse relationship. A positive value of r indicates a direct relationship. A zero value of r indicates that the two variables are independent of each other. The closer r is to +1 and −1, the stronger the relationship between the two variables." (Jae K Shim & Joel G Siegel, "Budgeting Basics and Beyond", 2008)
"The degree of relationship between business and economic variables such as cost and volume. Correlation analysis evaluates cause/effect relationships. It looks consistently at how the value of one variable changes when the value of the other is changed. A prediction can be made based on the relationship uncovered. An example is the effect of advertising on sales. A degree of correlation is measured statistically by the coefficient of determination (R-squared)." (Jae K Shim & Joel G Siegel, "Budgeting Basics and Beyond", 2008)
"A figure quantifying the correlation between risk events. This number is between negative one and positive one." (Annetta Cortez & Bob Yehling, "The Complete Idiot's Guide® To Risk Management", 2010)
"A mechanism used to associate messages with the correct workflow service instance. Correlation is also used to associate multiple messaging activities with each other within a workflow." (Bruce Bukovics, "Pro WF: Windows Workflow in .NET 4", 2010)
"Correlation is sometimes used informally to mean a statistical association between two variables, or perhaps the strength of such an association. Technically, the correlation can be interpreted as the degree to which a linear relationship between the variables exists (i.e., each variable is a linear function of the other) as measured by the correlation coefficient." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)
"The degree of relationship between two variables; in risk management, specifically the degree of relationship between potential risks." (Annetta Cortez & Bob Yehling, "The Complete Idiot's Guide® To Risk Management", 2010)
"A predictive relationship between two factors, such that when one factor changes, you can predict the nature, direction and/or amount of change in the other factor. Not necessarily a cause-and-effect relationship." (DAMA International, "The DAMA Dictionary of Data Management", 2011)
"Organizing and recognizing one related event threat out of several reported, but previously distinct, events." (Mark Rhodes-Ousley, "Information Security: The Complete Reference" 2nd Ed., 2013)
"Association in the values of two or more variables." (Meta S Brown, "Data Mining For Dummies", 2014)
[correlation coefficient:] "A statistic that quantifies the degree of association between two or more variables. There are many kinds of correlation coefficients, depending on the type of data and relationship predicted." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)
"The degree of association between two or more variables." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)
"A statistical measure that indicates the extent to which two variables are related. A positive correlation indicates that, as one variable increases, the other increases as well. For a negative correlation, as one variable increases, the other decreases." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)
11 February 2018
🔬Data Science: Parametric Estimating (Definitions)
[parametric:] "A statistical procedure that makes assumptions concerning the frequency distributions." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)
"A simplified mathematical description of a system or process, used to assist calculations and predictions. Generally speaking, parametric models calculate the dependent variables of cost and duration on the basis of one or more variables." (Project Management Institute, "Practice Standard for Project Estimating", 2010)
"An estimating technique that uses a statistical relationship between historical data and other variables (e.g., square footage in construction, lines of code in software development) to calculate an estimate for activity parameters, such as scope, cost, budget, and duration. An example for the cost parameter is multiplying the planned quantity of work to be performed by the historical cost per unit to obtain the estimated cost." (Project Management Institute, "Practice Standard for Project Estimating", 2010)
"A branch of statistics that assumes the data being examined comes from a variety of known probability distributions. In general, the tests sacrifice generalizability for speed of computation and precision, providing the requisite assumptions are met." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)
"An estimating technique in which an algorithm is used to calculate cost or duration based on historical data and project parameters." (For Dummies, "PMP Certification All-in-One For Dummies" 2nd Ed., 2013)
"Inferential statistical procedures that rely on sample statistics to draw inferences about population parameters, such as mean and variance." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)
🔬Data Science: Non-Parametric Tests (Definitions)
[nonparametric:] "A statistical procedure that does not require a normal distribution of the data." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)
"A branch of statistics that makes no assumptions on the underlying distributions of the data being examined. In general, the tests are far more generalizable but sacrifice precision and power." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)
"Inferential statistical procedures that do not rely on estimating population parameters such as the mean and variance." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)
"A family of methods which makes no assumptions about the population distribution. Non-parametric methods most commonly work by ignoring the actual values, and, instead, analyzing only their ranks. This approach ensures that the test is not affected much by outliers, and does not assume any particular distribution. The clear advantage of non-parametric tests is that they do not require the assumption of sampling from a Gaussian population. When the assumption of Gaussian distribution does not hold, non-parametric tests have more power than parametric tests to detect differences." (Soheila Nasiri & Bijan Raahemi, "Non-Parametric Statistical Analysis of Rare Events in Healthcare", 2017)
🔬Data Science: Gaussian Distribution (Definitions)
"Represents a conventional scale for a normally distributed bell-shaped curve that has a central tendency of zero and a standard deviation of one unit, wherein the units are called sigma (σ)." (Lynne Hambleton, "Treasure Chest of Six Sigma Growth Methods, Tools, and Best Practices", 2007)
"Also called the standard normal distribution, is the normal distribution with mean zero and variance one." (Dimitrios G Tsalikakis et al, "Segmentation of Cardiac Magnetic Resonance Images", 2009)
"A normal distribution with the parameters μ = 0 and σ = 1. The random variable for this distribution is denoted by Z. The z-tables (values of the random variable Z and the corresponding probabilities) are widely used for normal distributions." (Peter Oakander et al, "CPM Scheduling for Construction: Best Practices and Guidelines", 2014)
🔬Data Science: K-nearest neighbors (Definitions)
"A modeling technique that assigns values to points based on the values of the k nearby points, such as average value, or most common value." (DAMA International, "The DAMA Dictionary of Data Management", 2011)
"A simple and popular classifier algorithm that assigns a class (in a preexisting classification) to an object whose class is unknown. [...] From a collection of data objects whose class is known, the algorithm computes the distances from the object of unknown class to k (a number chosen by the user) objects of known class. The most common class (i.e., the class that is assigned most often to the nearest k objects) is assigned to the object of unknown class." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)
"A method used for classification and regression. Cases are analyzed, and class membership is assigned based on similarity to other cases, where cases that are similar (or 'near' in characteristics) are known as neighbors." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)
"A prediction method, which uses a function of the k most similar observations from the training set to generate a prediction, such as the mean." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)
"K-Nearest Neighbors classification is an instance-based supervised learning method that works well with distance-sensitive data." (Matthew Kirk, "Thoughtful Machine Learning", 2015)
"An algorithm that estimates an unknown data item as being like the majority of the k-closest neighbors to that item." (David Natingga, "Data Science Algorithms in a Week" 2nd Ed., 2018)
"K-nearest neighbourhood is a algorithm which stores all available cases and classifies new cases based on a similarity measure. It is used in statistical estimation and pattern recognition." (Aman Tyagi, "Healthcare-Internet of Things and Its Components: Technologies, Benefits, Algorithms, Security, and Challenges", 2021)
10 February 2018
🔬Data Science: Data Mining (Definitions)
09 February 2018
🔬Data Science: Normalization (Definitions)
"Mathematical transformations to generate a new set of values that map onto a different range." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)
[Min–max normalization:] "Normalizing a variable value to a predetermine range." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)
[function point normalization:] "Dividing a metric by the project’s function points to allow you to compare projects of different sizes and complexities." (Rod Stephens, "Beginning Software Engineering", 2015)
"For metrics, performing some calculation on a metric to account for possible differences in project size or complexity. Two general approaches are size normalization and function point normalization." (Rod Stephens, "Beginning Software Engineering", 2015)
[size normalization:] "For metrics, dividing a metric by an indicator of size such as lines of code or days of work. For example, bugs/KLOC tells you how buggy the code is normalized for the size of the project." (Rod Stephens, "Beginning Software Engineering", 2015)
07 February 2018
🔬Data Science: Hadoop (Definitions)
"An Apache-managed software framework derived from MapReduce and Bigtable. Hadoop allows applications based on MapReduce to run on large clusters of commodity hardware. Hadoop is designed to parallelize data processing across computing nodes to speed computations and hide latency. Two major components of Hadoop exist: a massively scalable distributed file system that can support petabytes of data and a massively scalable MapReduce engine that computes results in batch." (Marcia Kaufman et al, "Big Data For Dummies", 2013)
"An open-source software platform developed by Apache Software Foundation for data-intensive applications where the data are often widely distributed across different hardware systems and geographical locations." (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)
"Technology designed to house Big Data; a framework for managing data" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)
"an Apache-managed software framework derived from MapReduce. Big Table Hadoop enables applications based on MapReduce to run on large clusters of commodity hardware. Hadoop is designed to parallelize data processing across computing nodes to speed up computations and hide latency. The two major components of Hadoop are a massively scalable distributed file system that can support petabytes of data and a massively scalable MapReduce engine that computes results in batch." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)
"An open-source framework that is built to process and store huge amounts of data across a distributed file system." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)
"Open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)
"A batch processing infrastructure that stores fi les and distributes work across a group of servers. The infrastructure is composed of HDFS and MapReduce components. Hadoop is an open source software platform designed to store and process quantities of data that are too large for just one particular device or server. Hadoop’s strength lies in its ability to scale across thousands of commodity servers that don’t share memory or disk space." (Benoy Antony et al, "Professional Hadoop®", 2016)
"Apache Hadoop is an open-source framework for processing large volume of data in a clustered environment. It uses simple MapReduce programming model for reliable, scalable and distributed computing. The storage and computation both are distributed in this framework." (Kaushik Pal, 2016)
"A framework that allow for the distributed processing for large datasets." (Neha Garg & Kamlesh Sharma, "Machine Learning in Text Analysis", 2020)
"Hadoop is an open source implementation of the MapReduce paper. Initially, Hadoop required that the map, reduce, and any custom format readers be implemented and deployed to the cluster. Eventually, higher level abstractions were developed, like Apache Hive and Apache Pig." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)
"A batch processing infrastructure that stores files and distributes work across a group of servers." (Oracle)
"an open-source framework that is built to enable the process and storage of big data across a distributed file system." (Analytics Insight)
"Apache Hadoop is an open-source, Java-based software platform that manages data processing and storage for big data applications. Hadoop works by distributing large data sets and analytics jobs across nodes in a computing cluster, breaking them down into smaller workloads that can be run in parallel. Hadoop can process both structured and unstructured data, and scale up reliably from a single server to thousands of machines." (Databricks) [source]
"Hadoop is an open source software framework for storing and processing large volumes of distributed data. It provides a set of instructions that organizes and processes data on many servers rather than from a centralized management nexus." (Informatica) [source]
🔬Data Science: Semantics (Definitions)
"The meaning of a model that is well-formed according to the syntax of a language." (Anneke Kleppe et al, "MDA Explained: The Model Driven Architecture: Practice and Promise", 2003)
"The part of language concerned with meaning. For example, the phrases 'my mother’s brother' and 'my uncle' are two ways of saying the same thing and, therefore, have the same semantic value." (Craig F Smith & H Peter Alesso, "Thinking on the Web: Berners-Lee, Gödel and Turing", 2008)
"The study of meaning (often the meaning of words). In business systems we are concerned with making the meaning of data explicit (structuring unstructured data), as well as making it explicit enough that an agent could reason about it." (Danette McGilvray, "Executing Data Quality Projects", 2008)
"The branch of philosophy concerned with describing meaning." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)
"Having to do with meaning, usually of words and/or symbols (the syntax). Part of semiotic theory." (DAMA International, "The DAMA Dictionary of Data Management", 2011)
"The study of the meaning behind the syntax (signs and symbols) of a language or graphical expression of something. The semantics can only be understood through the syntax. The syntax is like the encoded representation of the semantics." (DAMA International, "The DAMA Dictionary of Data Management", 2011)
"The study of meaning. In the context of Big Data, semantics is the technique of creating meaningful assertions about data objects. A meaningful assertion, as used here, is a triple consisting of an identified data object, a data value, and a descriptor for the data value. In practical terms, semantics involves making assertions about data objects (i.e., making triples), combining assertions about data objects (i.e., merging triples), and assigning data objects to classes; hence relating triples to other triples. As a word of warning, few informaticians would define semantics in these terms, but I would suggest that most definitions for semantics would be functionally equivalent to the definition offered here." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)
"Set of mappings forming a representation in order to define the meaningful information of the data." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)
"Semantics is a branch of linguistics focused on the meaning communicated by language." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)
06 February 2018
🔬Data Science: Data Profiling (Definitions)
🔬Data Science: Pig (Definitions)
"A programming interface for programmers to create MapReduce jobs within Hadoop." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)
"A programming language designed to handle any type of data. Pig helps users to focus more on analyzing large datasets and less time writing map programs and reduce programs. Like Hive and Impala, Pig is a high-level platform used for creating MapReduce programs more easily. The programming language Pig uses is called Pig Latin, and it allows you to extract, transform, and load (ETL) data at a very high level. This greatly reduces the effort if this was written in JAVA code; PIG is only a fraction of that." (Benoy Antony et al, "Professional Hadoop®", 2016)
"An open-source platform for analyzing large data sets that consists of the following: (1) Pig Latin scripting language; (2) Pig interpreter that converts Pig Latin scripts into MapReduce jobs. Pig runs as a client application." (Oracle)
About Me
- Adrian
- Koeln, NRW, Germany
- IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.