A Software Engineer and data professional's blog on SQL, data, databases, data architectures, data management, programming, Software Engineering, Project Management, ERP implementation and other IT-related topics.
27 February 2018
🔬Data Science: Data Modeling (Definitions)
24 February 2018
💎SQL Reloaded: Misusing Views and Pseudo-Constants
Views, as virtual tables, can be misused to replace tables in certain circumstances, for example by storing values within one or multiple rows, as in the examples below:
-- parameters for a BI solution
CREATE VIEW dbo.vLoV_Parameters
AS
SELECT Cast('ABC' as nvarchar(20)) AS DataAreaId
 , Cast(GetDate() as Date) AS CurrentDate
 , Cast(100 as int) AS BatchCount
GO

SELECT *
FROM dbo.vLoV_Parameters
GO

-- values for a dropdown
CREATE VIEW dbo.vLoV_DataAreas
AS
SELECT Cast('ABC' as nvarchar(20)) AS DataAreaId
 , Cast('Company ABC' as nvarchar(50)) AS Description
UNION ALL
SELECT 'XYZ' DataAreaId
 , 'Company XYZ'
GO

SELECT *
FROM dbo.vLoV_DataAreas
GO
These solutions aren’t elegant, and are typically not recommended because they go against one of the principles of good database design, namely “data belong in tables”, though they do the trick when needed. Personally, I have used them only in a handful of cases, e.g. when it wasn’t allowed to create tables, when something needed to be tested for a short period of time, or when creating a table for 2-3 values seemed an unnecessary overhead. Because of their scarce use, I hadn’t given them much thought, not until I discovered Jared Ko’s blog post on pseudo-constants. He considers the values from the first view as pseudo-constants, and advocates their use especially for easier dependency tracking, easier code refactoring, avoiding implicit data conversions and easier maintenance of values.
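To illustrate the dependency-tracking argument, the objects referencing such a view can be listed directly from the catalog; a minimal sketch, assuming the view above exists and is already referenced by other database objects:

-- objects referencing the pseudo-constants view (dependency tracking)
SELECT referencing_schema_name
 , referencing_entity_name
FROM sys.dm_sql_referencing_entities('dbo.vLoV_Parameters', 'OBJECT')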
All these are good reasons to consider them, therefore I tried to take the idea further and see whether it survives a reality check. For this I took Dynamics AX as testing environment, as it makes extensive use of enumerations (aka base enums) to store lists of values needed all over the application. Behind each table there are one or more enumerations, and the tables storing master data abound in them.
For exemplification let’s consider InventTrans, the table that stores the inventory transactions; the logic that governs the receipt and issue transactions is driven by three enumerations: StatusIssue, StatusReceipt and Direction.
-- Status Issue Enumeration
CREATE VIEW dbo.vLoV_StatusIssue
AS
SELECT cast(0 as int) AS None
 , cast(1 as int) AS Sold
 , cast(2 as int) AS Deducted
 , cast(3 as int) AS Picked
 , cast(4 as int) AS ReservPhysical
 , cast(5 as int) AS ReservOrdered
 , cast(6 as int) AS OnOrder
 , cast(7 as int) AS QuotationIssue
GO

-- Status Receipt Enumeration
CREATE VIEW dbo.vLoV_StatusReceipt
AS
SELECT cast(0 as int) AS None
 , cast(1 as int) AS Purchased
 , cast(2 as int) AS Received
 , cast(3 as int) AS Registered
 , cast(4 as int) AS Arrived
 , cast(5 as int) AS Ordered
 , cast(6 as int) AS QuotationReceipt
GO

-- Inventory Direction Enumeration
CREATE VIEW dbo.vLoV_InventDirection
AS
SELECT cast(0 as int) AS None
 , cast(1 as int) AS Receipt
 , cast(2 as int) AS Issue
GO
To see these views at work let’s construct the InventTrans table on the fly:
-- creating an ad-hoc table
SELECT *
INTO dbo.InventTrans
FROM (VALUES (1, 1, 0, 2, -1, 'A0001')
 , (2, 1, 0, 2, -10, 'A0002')
 , (3, 2, 0, 2, -6, 'A0001')
 , (4, 2, 0, 2, -3, 'A0002')
 , (5, 3, 0, 2, -2, 'A0001')
 , (6, 1, 0, 1, 1, 'A0001')
 , (7, 0, 1, 1, 50, 'A0001')
 , (8, 0, 2, 1, 100, 'A0002')
 , (9, 0, 3, 1, 30, 'A0003')
 , (10, 0, 3, 1, 20, 'A0004')
 , (11, 0, 1, 2, 10, 'A0001')
) A(TransId, StatusIssue, StatusReceipt, Direction, Qty, ItemId)
Here are two sets of examples using literals vs. pseudo-constants:
--example issue with literals
SELECT top 100 ITR.*
FROM dbo.InventTrans ITR
WHERE ITR.StatusIssue = 1
 AND ITR.Direction = 2
GO

--example issue with pseudo-constants
SELECT top 100 ITR.*
FROM dbo.InventTrans ITR
 JOIN dbo.vLoV_StatusIssue SI
   ON ITR.StatusIssue = SI.Sold
 JOIN dbo.vLoV_InventDirection ID
   ON ITR.Direction = ID.Issue
GO

--example receipt with literals
SELECT top 100 ITR.*
FROM dbo.InventTrans ITR
WHERE ITR.StatusReceipt = 1
 AND ITR.Direction = 1
GO

--example receipt with pseudo-constants
SELECT top 100 ITR.*
FROM dbo.InventTrans ITR
 JOIN dbo.vLoV_StatusReceipt SR
   ON ITR.StatusReceipt = SR.Purchased
 JOIN dbo.vLoV_InventDirection ID
   ON ITR.Direction = ID.Receipt
GO
As can be seen, the queries using pseudo-constants make the code somewhat more readable, though the gain is only relative, as each enumeration implies an additional join. In addition, when further business tables are added to the logic (e.g. items, purchase or sales orders), it complicates the logic, making it more difficult to separate the essential from the nonessential. Imagine a translation of the following query:
-- complex query
SELECT top 100 ITR.*
FROM dbo.InventTrans ITR
 <several tables here>
WHERE ((ITR.StatusReceipt<=3 AND ITR.Direction = 1)
  OR (ITR.StatusIssue<=3 AND ITR.Direction = 2))
 AND (<more constraints here>)
The more complex the constraints in the WHERE clause, the less likely a translation of the literals into pseudo-constants becomes. Considering that an average query contains 5-10 tables, each of them with 1-3 enumerations, using pseudo-constants would make the queries impracticable and their execution plans quite difficult to troubleshoot.
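For illustration only, a translation of the above query with the pseudo-constant views could look as follows (a sketch that keeps the same placeholders; the inequality and OR constraints force cross joins to the single-row enumeration views instead of inner joins):

-- complex query with pseudo-constants (sketch)
SELECT top 100 ITR.*
FROM dbo.InventTrans ITR
 <several tables here>
 CROSS JOIN dbo.vLoV_StatusReceipt SR
 CROSS JOIN dbo.vLoV_StatusIssue SI
 CROSS JOIN dbo.vLoV_InventDirection ID
WHERE ((ITR.StatusReceipt <= SR.Registered AND ITR.Direction = ID.Receipt)
  OR (ITR.StatusIssue <= SI.Picked AND ITR.Direction = ID.Issue))
 AND (<more constraints here>)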
The more I think about it, an enumeration data type available as a global variable in SQL Server (like the enumerations available in VB) would be more than welcome, especially because the values are used over and over again throughout the queries. Imagine, for example, the possibility of writing code as follows:
-- hypothetical query
SELECT top 100 ITR.*
FROM dbo.InventTrans ITR
WHERE ITR.StatusReceipt = @@StatusReceipt.Purchased
 AND ITR.Direction = @@InventDirection.Receipt
From my point of view this would make the code more readable and easier to maintain. Instead, in order to make the code more readable, one is usually forced to add comments to the code. This works as well, though the code can become cluttered with comments.
-- query with commented literals
SELECT top 100 ITR.*
FROM dbo.InventTrans ITR
WHERE ITR.StatusReceipt <= 3 -- Purchased, Received, Registered
 AND ITR.Direction = 1 -- Receipt
In conclusion, pseudo-constants’ usefulness is limited, and their usage goes against developers’ common sense; however, a data type with similar functionality in SQL Server would make code more readable and easier to maintain.
Notes:
1) It is possible to simulate an enumeration data type in a table’s definition by using a CHECK constraint (see the sketch after these notes).
2) The queries also work in SQL databases in Microsoft Fabric (see the file in the GitHub repository). You might want to use another schema (e.g. Test), so as not to interfere with the existing code.
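For Note 1, a minimal sketch of such a CHECK constraint, using a hypothetical table and the StatusIssue values from above:

-- simulating an enumeration via a CHECK constraint (hypothetical table)
CREATE TABLE dbo.T_InventIssues (
  TransId int NOT NULL
, StatusIssue int NOT NULL
  CONSTRAINT CK_T_InventIssues_StatusIssue
  CHECK (StatusIssue IN (0, 1, 2, 3, 4, 5, 6, 7)) -- None .. QuotationIssue
)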
Happy coding!
19 February 2018
🔬Data Science: Data Exploration (Definitions)
15 February 2018
🔬Data Science: Data Preparation (Definitions)
🔬Data Science: Data Augmentation (Definitions)
"1.The process of adding to something to make it more or greater than the original. 2.In logic, a relationship where if X leads to Y, then XZ will lead to YZ." (DAMA International, "The DAMA Dictionary of Data Management", 2011)
"A technique for improving the performance of a model by enriching the training data, e.g. by generating additional instances of minority classes." (Vincent Karas & Björn W Schuller, "Deep Learning for Sentiment Analysis: An Overview and Perspectives", 2021)
🔬Data Science: Feature Extraction (Definitions)
"A technique that attempts to combine or transform predictors to make clear the information contained within them. Feature extraction methods include factor analysis, principal components analysis, correspondence analysis, multidimensional scaling, partial least square methods, and singular value decomposition." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)
"Extracting or deriving some useful information from the initially obtained data." (Shouvik Chakraborty & Kalyani Mali, "An Overview of Biomedical Image Analysis From the Deep Learning Perspective", 2020)
"A process of finding features of words and map them to vector space." (Neha Garg & Kamlesh Sharma, "Machine Learning in Text Analysis", 2020)
"A digital signal processing algorithm, which extracts distinctive values from the input signal." (Andrej Zgank et al, "Embodied Conversation: A Personalized Conversational HCI Interface for Ambient Intelligence", 2021)
"Feature extraction is a procedure in dimensionality reduction of extracting principal variables (features) from some random variables under consideration, usually achieved by extracting one principal variable (feature) as mapping from multiple random variables." (Usama A Khan & Josephine M Namayanja, "Reevaluating Factor Models: Feature Extraction of the Factor Zoo", 2021)
🔬Data Science: Feature Selection (Definitions)
"A method by which to decide on which features (columns) to keep in the analysis that will be done by the data mining algorithms. One of the first things to be done in a data mining project; this uncovers the most important variables among the set of predictor variables. Many of the predictor variables in a data set may not really be important for making an accurate predictive model, and only dilute/reduce the accuracy score of the model if included." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)
"The process a cybersecurity engineer uses to choose the features in which a given attack may manifest." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)
"Feature selection is the process of selecting important principal variables (features) from some random variables under consideration, usually achieved by selecting a principal variable (feature) as one of the random variables." (Usama A Khan & Josephine M Namayanja, "Reevaluating Factor Models: Feature Extraction of the Factor Zoo", 2021)
"It is used to select appropriate features from the available data for improving efficiency of machine learning algorithms." (Gunjan Ansari et al, "Natural Language Processing in Online Reviews", 2021)
🔬Data Science: Optimization (Definitions)
"Term used to describe analytics that calculate and determine the most ideal scenario to meet a specific target. Optimization procedures analyze each scenario and supply a score. An optimization analytic can run through hundreds, even thousands, of scenarios and rank each one based on a target that is being achieved." (Brittany Bullard, "Style and Statistics", 2016)
"Optimization is the process of finding the most efficient algorithm for a given task." (Edward T Chen, "Deep Learning and Sustainable Telemedicine", 2020)
🔬Data Science: Speech Recognition (Definitions)
"Automatic decoding of a sound pattern into phonemes or words." (Guido Deboeck & Teuvo Kohonen (Eds), "Visual Explorations in Finance with Self-Organizing Maps" 2nd Ed., 2000)
"Speech recognition is a process through which machines convert words or phrases spoken into a machine-readable format." (Hari K Kondaveeti et al, "Deep Learning Applications in Agriculture: The Role of Deep Learning in Smart Agriculture", 2021)
13 February 2018
🔬Data Science: Data Model (Definitions)
"A model that describes in an abstract way how data is represented in an information system. A data model can be a part of ontology, which is a description of how data is represented in an entire domain" (Mark Olive, "SHARE: A European Healthgrid Roadmap", 2009)
"Description of the node structure that defines its entities, fields and relationships." (Roberto Barbera et al, "gLibrary/DRI: A Grid-Based Platform to Host Muliple Repositories for Digital Content", 2009)
"An abstract model that describes how data are presented, organized and related to." (Ali M Tanyer, "Design and Evaluation of an Integrated Design Practice Course in the Curriculum of Architecture", 2010)
"The first of a series of data models that more closely represented the real world, modeling both data and their relationships in a single structure known as an object. The SDM, published in 1981, was developed by M. Hammer and D. McLeod." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management 9th Ed", 2011)
"The way of organizing and representing data is data model." (Uma V & Jayanthi G, "Spatio-Temporal Hot Spot Analysis of Epidemic Diseases Using Geographic Information System for Improved Healthcare", 2019)
12 February 2018
🔬Data Science: Correlation (Definitions)
[correlation coefficient:] "A measure to determine how closely a scatterplot of two continuous variables falls on a straight line." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)
"A metric that measures the linear relationship between two process variables. Correlation describes the X and Y relationship with a single number (the Pearson’s Correlation Coefficient (r)), whereas regression summarizes the relationship with a line - the regression line." (Lynne Hambleton, "Treasure Chest of Six Sigma Growth Methods, Tools, and Best Practices", 2007)
[correlation coefficient:] "A measure of the degree of correlation between the two variables. The range of values it takes is between −1 and +1. A negative value of r indicates an inverse relationship. A positive value of r indicates a direct relationship. A zero value of r indicates that the two variables are independent of each other. The closer r is to +1 and −1, the stronger the relationship between the two variables." (Jae K Shim & Joel G Siegel, "Budgeting Basics and Beyond", 2008)
"The degree of relationship between business and economic variables such as cost and volume. Correlation analysis evaluates cause/effect relationships. It looks consistently at how the value of one variable changes when the value of the other is changed. A prediction can be made based on the relationship uncovered. An example is the effect of advertising on sales. A degree of correlation is measured statistically by the coefficient of determination (R-squared)." (Jae K Shim & Joel G Siegel, "Budgeting Basics and Beyond", 2008)
"A figure quantifying the correlation between risk events. This number is between negative one and positive one." (Annetta Cortez & Bob Yehling, "The Complete Idiot's Guide® To Risk Management", 2010)
"A mechanism used to associate messages with the correct workflow service instance. Correlation is also used to associate multiple messaging activities with each other within a workflow." (Bruce Bukovics, "Pro WF: Windows Workflow in .NET 4", 2010)
"Correlation is sometimes used informally to mean a statistical association between two variables, or perhaps the strength of such an association. Technically, the correlation can be interpreted as the degree to which a linear relationship between the variables exists (i.e., each variable is a linear function of the other) as measured by the correlation coefficient." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)
"The degree of relationship between two variables; in risk management, specifically the degree of relationship between potential risks." (Annetta Cortez & Bob Yehling, "The Complete Idiot's Guide® To Risk Management", 2010)
"A predictive relationship between two factors, such that when one factor changes, you can predict the nature, direction and/or amount of change in the other factor. Not necessarily a cause-and-effect relationship." (DAMA International, "The DAMA Dictionary of Data Management", 2011)
"Organizing and recognizing one related event threat out of several reported, but previously distinct, events." (Mark Rhodes-Ousley, "Information Security: The Complete Reference" 2nd Ed., 2013)
"Association in the values of two or more variables." (Meta S Brown, "Data Mining For Dummies", 2014)
[correlation coefficient:] "A statistic that quantifies the degree of association between two or more variables. There are many kinds of correlation coefficients, depending on the type of data and relationship predicted." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)
"The degree of association between two or more variables." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)
"A statistical measure that indicates the extent to which two variables are related. A positive correlation indicates that, as one variable increases, the other increases as well. For a negative correlation, as one variable increases, the other decreases." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)
11 February 2018
🔬Data Science: Parametric Estimating (Definitions)
[parametric:] "A statistical procedure that makes assumptions concerning the frequency distributions." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)
"A simplified mathematical description of a system or process, used to assist calculations and predictions. Generally speaking, parametric models calculate the dependent variables of cost and duration on the basis of one or more variables." (Project Management Institute, "Practice Standard for Project Estimating", 2010)
"An estimating technique that uses a statistical relationship between historical data and other variables (e.g., square footage in construction, lines of code in software development) to calculate an estimate for activity parameters, such as scope, cost, budget, and duration. An example for the cost parameter is multiplying the planned quantity of work to be performed by the historical cost per unit to obtain the estimated cost." (Project Management Institute, "Practice Standard for Project Estimating", 2010)
"A branch of statistics that assumes the data being examined comes from a variety of known probability distributions. In general, the tests sacrifice generalizability for speed of computation and precision, providing the requisite assumptions are met." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)
"An estimating technique in which an algorithm is used to calculate cost or duration based on historical data and project parameters." (For Dummies, "PMP Certification All-in-One For Dummies" 2nd Ed., 2013)
"Inferential statistical procedures that rely on sample statistics to draw inferences about population parameters, such as mean and variance." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)
🔬Data Science: Non-Parametric Tests (Definitions)
[nonparametric:] "A statistical procedure that does not require a normal distribution of the data." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)
"A branch of statistics that makes no assumptions on the underlying distributions of the data being examined. In general, the tests are far more generalizable but sacrifice precision and power." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)
"Inferential statistical procedures that do not rely on estimating population parameters such as the mean and variance." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)
"A family of methods which makes no assumptions about the population distribution. Non-parametric methods most commonly work by ignoring the actual values, and, instead, analyzing only their ranks. This approach ensures that the test is not affected much by outliers, and does not assume any particular distribution. The clear advantage of non-parametric tests is that they do not require the assumption of sampling from a Gaussian population. When the assumption of Gaussian distribution does not hold, non-parametric tests have more power than parametric tests to detect differences." (Soheila Nasiri & Bijan Raahemi, "Non-Parametric Statistical Analysis of Rare Events in Healthcare", 2017)
🔬Data Science: Gaussian Distribution (Definitions)
"Represents a conventional scale for a normally distributed bell-shaped curve that has a central tendency of zero and a standard deviation of one unit, wherein the units are called sigma (σ)." (Lynne Hambleton, "Treasure Chest of Six Sigma Growth Methods, Tools, and Best Practices", 2007)
"Also called the standard normal distribution, is the normal distribution with mean zero and variance one." (Dimitrios G Tsalikakis et al, "Segmentation of Cardiac Magnetic Resonance Images", 2009)
"A normal distribution with the parameters μ = 0 and σ = 1. The random variable for this distribution is denoted by Z. The z-tables (values of the random variable Z and the corresponding probabilities) are widely used for normal distributions." (Peter Oakander et al, "CPM Scheduling for Construction: Best Practices and Guidelines", 2014)
🔬Data Science: K-nearest neighbors (Definitions)
"A modeling technique that assigns values to points based on the values of the k nearby points, such as average value, or most common value." (DAMA International, "The DAMA Dictionary of Data Management", 2011)
"A simple and popular classifier algorithm that assigns a class (in a preexisting classification) to an object whose class is unknown. [...] From a collection of data objects whose class is known, the algorithm computes the distances from the object of unknown class to k (a number chosen by the user) objects of known class. The most common class (i.e., the class that is assigned most often to the nearest k objects) is assigned to the object of unknown class." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)
"A method used for classification and regression. Cases are analyzed, and class membership is assigned based on similarity to other cases, where cases that are similar (or 'near' in characteristics) are known as neighbors." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)
"A prediction method, which uses a function of the k most similar observations from the training set to generate a prediction, such as the mean." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)
"K-Nearest Neighbors classification is an instance-based supervised learning method that works well with distance-sensitive data." (Matthew Kirk, "Thoughtful Machine Learning", 2015)
"An algorithm that estimates an unknown data item as being like the majority of the k-closest neighbors to that item." (David Natingga, "Data Science Algorithms in a Week" 2nd Ed., 2018)
"K-nearest neighbourhood is a algorithm which stores all available cases and classifies new cases based on a similarity measure. It is used in statistical estimation and pattern recognition." (Aman Tyagi, "Healthcare-Internet of Things and Its Components: Technologies, Benefits, Algorithms, Security, and Challenges", 2021)
About Me

- Adrian
- Koeln, NRW, Germany
IT Professional with more than 25 years of experience in IT, covering the full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.