
29 March 2021

Notes: Team Data Science Process (TDSP)

Team Data Science Process (TDSP)
Acronyms:
Artificial Intelligence (AI)
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Data Mining (DM)
Knowledge Discovery in Databases (KDD)
Team Data Science Process (TDSP) 
Version Control System (VCS)
Visual Studio Team Services (VSTS)

Resources:
[1] Microsoft Azure (2020) What is the Team Data Science Process? [source]
[2] Microsoft Azure (2020) The business understanding stage of the Team Data Science Process lifecycle [source]
[3] Microsoft Azure (2020) Data acquisition and understanding stage of the Team Data Science Process [source]
[4] Microsoft Azure (2020) Modeling stage of the Team Data Science Process lifecycle [source]
[5] Microsoft Azure (2020) Deployment stage of the Team Data Science Process lifecycle [source]
[6] Microsoft Azure (2020) Customer acceptance stage of the Team Data Science Process lifecycle [source]

10 May 2018

🔬Data Science: Cross-validation (Definitions)

"A method for assessing the accuracy of a regression or classification model. A data set is divided up into a series of test and training sets, and a model is built with each of the training set and is tested with the separate test set." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A method for assessing the accuracy of a regression or classification model." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2007)

"A statistical method derived from cross-classification which main objective is to detect the outlying point in a population set." (Tomasz Ciszkowski & Zbigniew Kotulski, "Secure Routing with Reputation in MANET", 2008)

"Process by which an original dataset d is divided into a training set t and a validation set v. The training set is used to produce an effort estimation model (if applicable), later used to predict effort for each of the projects in v, as if these projects were new projects for which effort was unknown. Accuracy statistics are then obtained and aggregated to provide an overall measure of prediction accuracy." (Emilia Mendes & Silvia Abrahão, "Web Development Effort Estimation: An Empirical Analysis", 2008)

"A method of estimating predictive error of inducers. Cross-validation procedure splits that dataset into k equal-sized pieces called folds. k predictive function are built, each tested on a distinct fold after being trained on the remaining folds." (Gilles Lebrun et al, EA Multi-Model Selection for SVM, 2009)

"Method to estimate the accuracy of a classifier system. In this approach, the dataset, D, is randomly split into K mutually exclusive subsets (folds) of equal size (D1, D2, …, Dk) and K classifiers are built. The i-th classifier is trained on the union of all Dj ¤ j¹i and tested on Di. The estimate accuracy is the overall number of correct classifications divided by the number of instances in the dataset." (M Paz S Lorente et al, "Ensemble of ANN for Traffic Sign Recognition" [in "Encyclopedia of Artificial Intelligence"], 2009)

"The process of assessing the predictive accuracy of a model in a test sample compared to its predictive accuracy in the learning or training sample that was used to make the model. Cross-validation is a primary way to assure that over learning does not take place in the final model, and thus that the model approximates reality as well as can be obtained from the data available." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"Validating a scoring procedure by applying it to another set of data." (Dougal Hutchison, "Automated Essay Scoring Systems", 2009)

"A method for evaluating the accuracy of a data mining model." (Microsoft, "SQL Server 2012 Glossary", 2012)

"Cross-validation is a method of splitting all of your data into two parts: training and validation. The training data is used to build the machine learning model, whereas the validation data is used to validate that the model is doing what is expected. This increases our ability to find and determine the underlying errors in a model." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"A technique used for validation and model selection. The data is randomly partitioned into K groups. The model is then trained K times, each time with one of the groups left out, on which it is evaluated." (Simon Rogers & Mark Girolami, "A First Course in Machine Learning", 2017)

"A model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set." (Adrian Carballal et al, "Approach to Minimize Bias on Aesthetic Image Datasets", 2019)

27 April 2018

🔬Data Science: Validity (Definitions)

"An argument that explains the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of decisions made from an assessment." (Asao B Inoue, "The Technology of Writing Assessment and Racial Validity", 2009)

[external *:] "The extent to which the results obtained can be generalized to other individuals and/or contexts not studied." (Joan Hawthorne et al, "Method Development for Assessing a Diversity Goal", 2009)

[external *:] "A study has external validity when its results are generalizable to the target population of interest. Formally, external validity means that the causal effect based on the study population equals the causal effect in the target population. In counterfactual terms, external validity requires that the study population be exchangeable with the target population." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

[internal *:] "A study has internal validity when it provides an unbiased estimate of the causal effect of interest. Formally, internal validity means that the empirical effect from the study is equal to the causal effect in the study population." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

"Construct validity is a term developed by psychometricians to describe the ability of a variable to represent accurately an underlying characteristic of interest." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

[operational validity:] "is defined as a model's output behavior having enough correctness for the model's intended aim over the domain of the system's intended applicability." (Sattar J Aboud et al, "Verification and Validation of Simulation Models", 2010)

"Validity is the ability of the study to produce correct results. There are various specific types of validity (see internal validity, external validity, construct validity). Threats to validity include primarily what we have termed bias, but encompass a wider range of methodological problems, including random error and lack of construct validity." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

[internal validity:] "Accuracy of the research study in determining the relationship between independent and the dependent variables. Internal validity can be assured only if all potential confounding variables have been properly controlled." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

[external *:] "Extent to which the results of a study accurately indicate the true nature of a relationship between variables in the real world. If a study has external validity, the results are said to be generalisable to the real world." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"The degree to which inferences made from data are appropriate to the context being examined. A variety of evidence can be used to support interpretation of scores." (Anne H Cash, "A Call for Mixed Methods in Evaluating Teacher Preparation Programs", 2016)

[construct *:] "Validity of a theory is also known as construct validity. Most theories in science present broad conceptual explanations of relationship between variables and make many different predictions about the relationships between particular variables in certain situations. Construct validity is established by verifying the accuracy of each possible prediction that might be made from the theory. Because the number of predictions is usually infinite, construct validity can never be fully established. However, the more independent predictions for the theory verified as accurate, the stronger the construct validity of the theory." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

21 February 2017

⛏️Data Management: Validity (Definitions)

"A characteristic of the data collected that indicates they are sound and accurate." (Teri Lund & Susan Barksdale, "10 Steps to Successful Strategic Planning", 2006)

"Implies that the test measures what it is supposed to." (Robert McCrie, "Security Operations Management" 2nd Ed., 2006)

"The determination that values in the field are or are not within a set of allowed or valid values. Measured as part of the Data Integrity Fundamentals data quality dimension." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"A data quality dimension that reflects the confirmation of data items to their corresponding value domains, and the extent to which non-confirmation of certain items affects fitness to use. For example, a data item is invalid if it is defined to be integer but contains a non-integer value, linked to a finite set of possible values but contains a value not included in this set, or contains a NULL value where a NULL is not allowed." (G Shankaranarayanan & Adir Even, "Measuring Data Quality in Context", 2009)

"An aspect of data quality consisting in its steadiness despite the natural process of data obsolescence increasing in time." (Juliusz L Kulikowski, "Data Quality Assessment", 2009)

"An inherent quality characteristic that is a measure of the degree of conformance of data to its domain values and business rules." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Validity is a dimension of data quality, defined as the degree to which data conforms to stated rules. As used in the DQAF, validity is differentiated from both accuracy and correctness. Validity is the degree to which data conform to a set of business rules, sometimes expressed as a standard or represented within a defined data domain." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Validity is defined as the extent to which data corresponds to reference tables, lists of values from golden sources documented in metadata, value ranges, etc." (Rajesh Jugulum, "Competing with High Quality Data", 2014)

"the state of consistency between a measurement and the concept that a researcher intended to measure." (Meredith Zozus, "The Data Book: Collection and Management of Research Data", 2017)

[semantic validity:] "The compliance of attribute data to rules regarding consistency and truthfulness of association." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)

[syntactic validity:] "The compliance of attribute data to format and grammar rules." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)

"Validity is a data quality dimension that refers to information that doesn’t conform to a specific format or doesn’t follow business rules." (Precisely) [source]

06 February 2017

⛏️Data Management: Data Validation (Definitions)

"Evaluating and checking the accuracy, consistency, timeliness, and security of information, for example by evaluating the believability or reputation of its source." (Martin J Eppler, "Managing Information Quality" 2nd Ed., 2006)

"The process of ensuring accurate data based on data acceptance and exception handling rules." (Evan Levy & Jill Dyché, "Customer Data Integration", 2006)

"The process of ensuring that the values of data conform to specified formats and/or values." (Allen Dreibelbis et al, "Enterprise Master Data Management", 2008)

"(1) To confirm the validity of data. (2) A feature of data cleansing tools." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"The act of determining that data is sound. In security, generally used in the context of validating input." (Mark S Merkow & Lakshmikanth Raghavan, "Secure and Resilient Software Development", 2010)

"Determining and confirming that something satisfies or conforms to defined rules, business rules, integrity constraints, defined standards, etc. The system cannot perform any validating unless it first has a definition of the way things should be validity The degree to which data conforms to domain values and defined business rules." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"This involves demonstrating that the conclusions that come from data analyses fulfill their intended purpose and are consistent." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"The act of testing a model with data that was not used in the model-fitting process." (Meta S Brown, "Data Mining For Dummies", 2014)

[data integrity validation:] "Data integrity validation allows you to verify the integrity of the data that was secured by data protection operations." (CommVault, "Documentation 11.20", 2018)
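
Many of the rules invoked in the above definitions can also be enforced declaratively, so that invalid values are rejected on entry. A minimal T-SQL sketch, with a hypothetical table and hypothetical rules:

-- data validation rules enforced via NOT NULL and CHECK constraints
CREATE TABLE dbo.Customers (
CustomerId int NOT NULL PRIMARY KEY
, Email nvarchar(255) NOT NULL CHECK (Email LIKE '%_@_%._%') -- rough format rule
, Country char(2) NOT NULL CHECK (Country IN ('DE', 'US', 'GB')) -- allowed values
);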

25 March 2011

🧭Business Intelligence: Troubleshooting (Part II: Approaching a Query)

Business Intelligence Series

Introduction

You received a (long) query for troubleshooting, review, conversion or some similar task. In addition, you don’t know much about the underlying table structure or business logic. So, what do you do? Two things are intuitively clear: you don’t need to panic, and understanding the query will help you in your task. "Understanding the query" seems such a simple statement, though there is more to it. Here are some points on how to approach a query.

State your problem
 
“A problem well stated is a problem half solved” (Charles F. Kettering). Before performing any work, check what’s requested from you and whether you have the information required for the task(s) ahead, for example documentation, valid examples, all the code, etc. If something is missing, don’t hesitate to request all the information you need. While waiting for it, you can continue with the next steps. As we don’t live in a perfect world, there will also be cases in which you’ll have to fill the gaps yourself by performing additional research/work. When troubleshooting, it is important to understand what’s wrong and, when possible, to have data against which to compare the query’s output.

Save the work

Even if you have a copy of the query somewhere on the server, save the previous version of the query and, when possible, use versioning. It might seem a redundant task, however you never know when you’ll need to refer to it and, as you’ll see next, it can/should be used as a baseline for validating the changes. In case you haven’t saved the query, check whether your RDBMS tracks metadata about the queries run; if the metadata weren’t reset in the meantime, you might be lucky enough to find a copy of your query, as sketched below.
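
For example, on SQL Server the text of previously run queries can often be recovered from the plan cache. A minimal sketch, with a hypothetical search pattern:

-- searching the plan cache for the text of a previously run query
SELECT st.text
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
WHERE st.text LIKE '%PurchaseOrderHeader%';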

I found that it is important to save the daily work: the various analyses performed in order to understand a query, the various versions and even the data used for testing. All this work could help you later to review what you did and the steps you missed, or you can reuse one of the queries for further work, etc.

Break down

When the query is too complex, it can be useful to break it into chunks that can be run and understood in isolation. Typically such chunks derive from the query’s structure (e.g. inline views, subqueries, the members of a union). I found that often focusing on a single chunk of a query helps isolate issues, as in the sketch below.
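
For example, given a query in which an aggregated inline view is joined further to other tables, the inline view can be run and validated on its own (the example is based on AdventureWorks):

-- an inline view run in isolation
SELECT POH.VendorID
, SUM(POH.SubTotal) AS SubTotal
FROM Purchasing.PurchaseOrderHeader POH
GROUP BY POH.VendorID;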

Restructure

Many programmers still write queries using the old non-ANSI join syntax, in which the join constraints appear in the WHERE clause, making the understanding and troubleshooting of a query more difficult. Often I found myself in the position of first transforming a query to ANSI SQL syntax before performing further work on it. It’s actually a good occasion to gain a first understanding of the query’s structure, though I’d prefer not to do it so often. In addition, during the restructuring phase it makes sense to differentiate between the join and the filter constraints, which helps in isolating the issue(s). A simple example of such a restructuring follows.
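
A minimal sketch of such a restructuring, with hypothetical tables:

-- old non-ANSI syntax: join and filter constraints mixed in the WHERE clause
SELECT C.CustomerName
, O.OrderDate
FROM dbo.Customers C
, dbo.Orders O
WHERE C.CustomerId = O.CustomerId -- join constraint
AND O.OrderDate >= '20110101'; -- filter constraint

-- ANSI syntax: join constraints in ON, filter constraints in WHERE
SELECT C.CustomerName
, O.OrderDate
FROM dbo.Customers C
JOIN dbo.Orders O
ON C.CustomerId = O.CustomerId
WHERE O.OrderDate >= '20110101';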

Check cardinalities

Wrong join constraints lead to duplicates or to fewer records than expected, and such differences are difficult to track when the variances in the numbers of records are small. Even if RDBMS come to developers’ help by providing metadata about the join relations, the columns and predicates participating in a join are not always easy to identify. Therefore, in order to address this issue, the constraints between any two tables participating in a join need to be checked. Sometimes, when the query is based on the table with the lowest level of detail, it can be enough to check the variations in the number of records, for example as follows.
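
A minimal sketch of such a check, with hypothetical tables: if any count greater than 1 is returned, then dbo.Orders is not unique per CustomerId and a join on CustomerId alone will multiply the records:

-- checking the cardinality of the join column
SELECT O.CustomerId
, COUNT(*) AS NumberRecords
FROM dbo.Orders O
GROUP BY O.CustomerId
HAVING COUNT(*) > 1;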

Check filter constraints

Filter constraints are maybe more difficult to identify, especially when the logic built into applications needs to be reengineered. Many of the filter constraints are logical, though when you have no documentation about the schemas it’s like rambling in the dark: you have to check real examples and identify the various values and the impact they have on the behavior of your report. Listing the distinct values of the filtering columns, as below, is usually a good starting point.
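
A minimal sketch of such an exploration, with a hypothetical table and column, listing the distinct values used in filter constraints together with their frequency:

-- identifying the values used in filter constraints
SELECT O.Status
, COUNT(*) AS NumberRecords
FROM dbo.Orders O
GROUP BY O.Status
ORDER BY NumberRecords DESC;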

Validate changes
 
So, you made the changes and everything looks perfect. Is it so? Often your intuition might tell you that the logic of a query is correct, though as software is not based on magic, at least not all the time, check some of the records to ensure that the data are rendered as expected, check totals, compare the current with the previous version, identify variations, etc. You don’t need to use all the techniques you know, but choose the minimal set of tools that allows you to validate the query, for example by comparing the two versions directly.
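
A minimal sketch of such a comparison, with a hypothetical table and two hypothetical versions of the same query; the EXCEPT returns the records present in the previous version’s output but missing from the current one:

-- comparing the outputs of two versions of a query
WITH OldVersion AS (
SELECT CustomerId
, SUM(Amount) AS Total
FROM dbo.Orders
WHERE OrderDate >= '20110101'
GROUP BY CustomerId
)
, NewVersion AS (
SELECT CustomerId
, SUM(Amount) AS Total
FROM dbo.Orders
WHERE OrderDate >= '20110101'
AND Status <> 'canceled'
GROUP BY CustomerId
)
SELECT *
FROM OldVersion
EXCEPT
SELECT *
FROM NewVersion;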

Perform refactoring
    
Refactoring, the way to (continuous) code improvement, should become part of each developer’s philosophy about programming. A query, as any other piece of code, is rarely perfect, as technical and factual knowledge are relative, features get deprecated and new techniques are introduced. On the other side, there is an old saying in IT: don’t change something that’s already working. So, a balance should be kept between the two, the apparent stability and the need for change.

Document
    
I hope there’s no need to stress the importance of documentation. From versioning to the description of the logic, it’s a good practice to document the important parts of your work, especially the parts that will facilitate later work.

26 July 2010

💎SQL Reloaded: Porting 32 bit CLR UDFs on 64 bit Platforms

Today I tried to port a few of the CLR UDFs created in the previous posts to a 64 bit platform, this time being constrained to use Visual Basic 2010 Express to create and build the assembly on an x86 platform, and to install the assembly on an x64 SQL Server box. From the previous troubleshooting experience between the two platforms, I knew there would be some challenges; fortunately there was nothing complex. Under SSIS 2008 it’s possible to choose the targeted platform, therefore I was expecting something similar in VB 2010 Express, but after a review of the Project Properties, especially of the Compile settings, I found nothing relevant. I then tried the standard approach: I built the solution, copied the .dll to the target server and tried to register the assembly, though I got the following error:

Msg 6218: %s ASSEMBLY for assembly '%.*ls' failed because assembly '%.*ls' failed verification. Check if the referenced assemblies are up-to-date and trusted (for external_access or unsafe) to execute in the database. CLR Verifier error messages if any will follow this message%.*ls

After several attempts to google for a solution on how to port 32 bit CLR UDFs to a 64 bit platform, respectively on how to configure VB 2010 Express to target solutions for 64 bit platforms, I found a similar question (VB Express target x86 Platform) on MSDN. Johan Stenberg’s answer, completed by JohnWein’s hint, led me to the “Issues When Using Microsoft Visual Studio 2005” document, more specifically to section 1.44, "References to 32-bit COM components may not work in VB and C# Applications running on 64-bit platforms", of importance being the part talking about the Express editions. The document specifies how to modify the project by adding, in its first PropertyGroup section, a PlatformTarget tag with the text value x86; therefore what I had to do was to add the respective tag but with the value x64. After this change everything worked smoothly. It’s kind of a mystery why Microsoft hasn’t enabled this feature in the Express versions, but in the end I can live with it as long as there is a workaround.
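
For reference, this is how the change described above would look in the .vbproj file (all other tags omitted):

<PropertyGroup>
<PlatformTarget>x64</PlatformTarget>
</PropertyGroup>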

Happy coding!

14 February 2007

🌁Software Engineering: Validation (Definitions)

"An independent test process whereby the performance of the neural network is tested against the acceptance requirements." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

"Confirmation that the product, as provided (or as it will be provided), will fulfill its intended use. In other words, validation ensures that 'you built the right thing'." (Sandy Shrum et al, "CMMI®: Guidelines for Process Integration and Product Improvement", 2003)

"Confirmation or corroboration of something, such as a business need or an identified opinion or recommendation." (Teri Lund & Susan Barksdale, "10 Steps to Successful Strategic Planning", 2006)

"the process of checking that a system meets the user needs." (Bruce P Douglass, "Real-Time Agility: The Harmony/ESW Method for Real-Time and Embedded Systems Development", 2009)

"The assurance that a product, service, or system meets the needs of the customer and other identified stakeholders. It often involves acceptance and suitability with external customers. Contrast with verification." (Cynthia Stackpole, "PMP® Certification All-in-One For Dummies®", 2011)

"Testing if a development result fulfills the individual requirements for a specific use." (Tilo Linz et al, "Software Testing Foundations" 4th Ed., 2014)

"Determines if the product provides the necessary solution for the intended real-world problem." (Adam Gordon, "Official (ISC)2 Guide to the CISSP CBK" 4th Ed., 2015)

"Validation is the process of verifying that a document or data structure conforms with its schema or schemas." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed, 2016)

"The assurance that a product, service, or result meets the needs of the customer and other identified stakeholders." (Project Management Institute, "A Guide to the Project Management Body of Knowledge (PMBOK® Guide )", 2017)

"The assurance that a product, service, or system meets the needs of the customer and other identified stakeholders. It often involves acceptance and suitability with external customers." (Jeffrey K Pinto, "Project Management: Achieving Competitive Advantage" 5th Ed., 2018)

 "activity that ensures a new or changed service, process, plan or other deliverable, meets the needs of the business." (ITIL)

"Confirmation by examination and through provision of objective evidence that the requirements for a specific intended use or application have been fulfilled" [ISO 9000]

"Confirmation, through the provision of objective evidence, that the requirements for a specific intended use or application have been fulfilled." (NIST SP 800-160)

"Confirmation (through the provision of strong, sound, objective evidence) that requirements for a specific intended use or application have been fulfilled." (NIST SP 800-161)

"Confirmation (through the provision of strong, sound, objective evidence) that requirements for a specific intended use or application have been fulfilled (e.g., a trustworthy credential has been presented, or data or information has been formatted in accordance with a defined set of rules, or a specific process has demonstrated that an entity under consideration meets, in all respects, its defined attributes or requirements)." (CNSSI 4009-2015)

"The process of determining that an object or process is acceptable according to a pre-defined set of tests and the results of those tests." (NIST SP 800-152)

"The process of demonstrating that the system under consideration meets in all respects the specification of that system." (INCITS/M1-040211)
