28 February 2018

Data Science: Inference (Definitions)

"Drawing some form of conclusion about a measurable functional response based on representative or sample experimental data. Sample size, uncertainty, and the laws of probability play a major role in making inferences." (Clyde M Creveling, "Six Sigma for Technical Processes: An Overview for R&D Executives, Technical Leaders, and Engineering Managers", 2006)

"Reasoning from known propositions." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"In general, inference is the act or process of deriving new facts from facts known or assumed to be true. In Artificial Intelligence, researchers develop automated inference engines to automate human inference." (Michael Fellmann et al, "Supporting Semantic Verification of Process Models", 2012)

[statistical inference:] "A method that uses sample data to draw conclusions about a population." (Geoff Cumming, "Understanding The New Statistics", 2013)

"Any conclusion drawn on the basis of some set of information. In research, we draw inferences on the basis of empirical data we collect and ideas we construct." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

[causal inference:] "Conclusion that changes in the independent variable resulted in a change in the dependent variable. It may be drawn only if all potential confounding variables are properly controlled." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

[inductive inference] "A machine learning method for learning the rules that produced the actual data." (David Natingga, "Data Science Algorithms in a Week" 2nd Ed., 2018)

"The ability to derive information not explicitly available." (Shon Harris & Fernando Maymi, "CISSP All-in-One Exam Guide" 8th Ed., 2018)

27 February 2018

Data Science: Data Modeling (Definitions)

"The task of developing a data model that represents the persistent data of some enterprise." (Keith Gordon, "Principles of Data Management", 2007)

"An analysis and design method, building data models to 
a) define and analyze data requirements,
b) design logical and physical data structures that support these requirements, and
c) define business and technical meta-data." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The process of creating a data model by applying formal data model descriptions using data modeling techniques." (Christian Galinski & Helmut Beckmann, "Concepts for Enhancing Content Quality and eAccessibility: In General and in the Field of eProcurement", 2012)

"The process of creating the abstract representation of a subject so that it can be studied more cheaply (a scale model of an airplane in a wind tunnel), at a particular moment in time (weather forecasting), or manipulated, modified, and altered without disrupting the original (economic model)." (George Tillmann, "Usage-Driven Database Design: From Logical Data Modeling through Physical Schema Definition", 2017)

"A method used to define and analyze the data requirements needed to support an entity’s business processes, defining the relationship between data elements and structures." (Solutions Review)

"A method used to define and analyze data requirements needed to support the business functions of an enterprise. These data requirements are recorded as a conceptual data model with associated data definitions. Data modeling defines the relationships between data elements and data structures." (Microstrategy)

"A method used to define and analyze data requirements needed to support the business functions of an enterprise. These data requirements are recorded as a conceptual data model with associated data definitions. Data modeling defines the relationships between data elements and structures." (Information Management)

"Refers to the process of defining, analyzing, and structuring data within data models." (Insight Software)

"Data modeling is a way of mapping out and visualizing all the different places that a software or application stores information, and how these sources of data will fit together and flow into one another." (Sisense) [source]

"Data modeling is the process of documenting a complex software system design as an easily understood diagram, using text and symbols to represent the way data needs to flow. The diagram can be used to ensure efficient use of data, as a blueprint for the construction of new software or for re-engineering a legacy application." (Techtarget) [source]

Data Science: Predictive Analytics (Definitions)

"Includes a variety of statistical and data mining techniques to analyze historical and current data to make predictions about the future." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"An area of statistical analysis that deals with extracting information from data and using it to predict future trends and behavior patterns." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The branch of data mining that focuses on forecasting trends (e.g., regression analysis) and estimating probabilities of future events. Business analytics, as it is also called, provides the models, which are formulas or algorithms, and procedures to BI." (Linda Volonino & Efraim Turban, "Information Technology for Management 8th Ed", 2011)

"A statistical or data-mining solution consisting of algorithms and techniques that can be used on both structured and unstructured data (together or individually) to determine future outcomes. It can be deployed for prediction, optimization, forecasting, simulation, and many other uses" (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A methodology for forecasting future events and trends using a variety of technologies including statistics and artificial intelligence." (Owen P. Hall Jr., "Teaching and Using Analytics in Management Education", 2014)

"A set of data–driven tools and methods to study a system behavior over time and to predict the future outcomes." (Shokoufeh Mirzaei, "Defining a Business-Driven Optimization Problem", 2014) 

"An advanced form of analytics that uses business information to find patterns and predict future outcomes and trends; determining credit scores by looking at a customer’s credit history and other data is a typical use for predictive analytics." (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"Analytic methods used to make predictions. The practice of using mathematical modeling to predict outcomes." (Meta S Brown, "Data Mining For Dummies", 2014)

"Predictive analytics requires new methods and technologies by an organization to mine data to discover trends/patterns and test large numbers of variables for unexpected insight." (Avnish Rastogi, "New Payment Models and Big Data Analytics", 2014)

"The practice of using statistics and data mining to analyze current and historical information to make predictions about what will happen in the future. Predictive modeling, the fitting of some data to some model, is a step in predictive analytics. Typically, predictive analytics also includes applying a model to additional data." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"Predictive analytics and modeling are statistical and analytical tools that examine and capture the complex relationships and underlying patterns among variables in the existing data in efforts to predict the future organizational performances, risks, trends, and behavior patterns". (Sema A Kalaian & Rafa M Kasim, "Predictive Analytics", 2015)

"A technique used in many business areas to enable organizations and companies to make more informed business decisions by making inferences from analyzing patterns and relationships in consumer behavior data. A term that refers to the procedure and technique that enable researchers or businesses to extract information from existing datasets to identify consumer behavioral patterns and insights to predict future trends and outcomes." (Kenneth C C Yang & Yowei Kang, "Real-Time Bidding Advertising: Challenges and Opportunities for Advertising Curriculum, Research, and Practice", 2016)

"A branch of advanced analytics that is used to make forecasts about future events." (Jonathan Ferrar et al, "The Power of People", 2017)

"A general term for using simple and complex models to predict what will happen, to support decision making. A process of using a quantitative model and current real-time or historical data to generate a score that is predictive of future behavior. Statistical analysis of historical data identifies a predictive model to support a specific decision task." (Daniel J Power & Ciara Heavin, "Decision Support, Analytics, and Business Intelligence" 3rd Ed., 2017)

"General term for using simple and complex ­models to support anticipatory decision making. Often a process of using a ­quantitative model and current real-time or historical data to generate a score that is predictive of future behavior." (Daniel J. Power & Ciara Heavin, "Data-Based Decision Making and Digital Transformation", 2018)

"[...] predictive analytics is about predicting the future outcomes. It also involves forecasting demand, sales, and profits for a company. The commonly used techniques for predictive analytics are different types of regression and forecasting models. Some advanced techniques are data mining, machine learning, neural networks, and advanced statistical models." (Amar Sahay, "Business Analytics" Vol. I, 2018)

"Predictive analytics is the branch of data mining concerned with the prediction of future probabilities and trends. The central element of predictive analytics is the predictor, a variable that can be measured for an individual or other entity to predict future behavior." (Thomas Ochs & Ute A Riemann, "IT Strategy Follows Digitalization", 2018)

"A statistical or data mining solution consisting of algorithms and techniques that can be used for both structured and unstructured data to determine future outcomes." (K Hariharanath, "BIG Data: An Enabler in Developing Business Models in Cloud Computing Environments", 2019)

"Predictive analytics represent any solution that supports the identification of meaningful patterns and correlations among variables in complex, structured, unstructured, historical, and potential future data sets for the purposes of predicting events and assessing the attractiveness of various courses of action." (Satyadhyan Chickerur et al, "Forecasting the Demand of Agricultural Crops/Commodity Using Business Intelligence Framework", 2019)

"A process for analyzing data in a manner that seeks to predict a likely future scenario or outcome. It can be used to improve decision making, mitigate risk, improve operations, and identify best practices." (Mike Gregory & Cynthia Roberts, "Maturing an Information Technology Privacy Program: Assessment, Improvement, and Change Leadership", 2020)

"It is a statistical process for denoting the average relationship between two or more factors with the involvement of dependent and independent variables." (Selvan C & S R Balasundaram, "Data Analysis in Context-Based Statistical Modeling in Predictive Analytics", 2021)

"A type of data analytics which identifies trends in historical datasets and uses those trends to forecast future performance, such as predicted sales revenue or demand." (Board International)

"[...] describes the practice of using historical data to predict future outcomes. It combines mathematical models (or 'predictive algorithms') with historical data to calculate the likelihood (or degree to which) something will happen." (Accenture)

"Techniques, tools, and technologies that use data to find models - models that can anticipate outcomes with a significant probability of accuracy." (Forrester)

"the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends. Applied to business, predictive models and analysis are used to analyze current data and historical facts in order to better understand customers, products and partners and to identify potential risks and opportunities for a company." (KDnuggets)

"Predictive analytics is a form of advanced analytics that uses both new and historical data to forecast activity, behavior and trends. It involves applying statistical analysis techniques, analytical queries and automated machine learning algorithms to data sets to create predictive models that place a numerical value - or score - on the likelihood of a particular event happening." (Techtarget) [source]

"Predictive analytics is a set of methods and technologies that can be used to analyze current and historical data with the goal of making predictions about future events. Predictive analytics includes a wide variety of mathematical modeling and computer science techniques with the common goal of using past events to indicate the probability or likelihood of a future event." (Sumo Logic) [source]

"Predictive analytics is a sub-division of advanced analytics and focuses on the identification of future events and values with their respective probabilities." (BI Survey) [source]

"Predictive analytics is an area of data mining that is related to the overall prediction of future probabilities and trends. It uses historical data, machine learning, and AI to predict what will happen in the future." (Logi Analytics) [source]

"Predictive Analytics is the practice of employing statistics and modeling techniques to extract information from current and historical datasets in order to predict potential future outcomes and trends." (OmniSci) [source]

"Predictive analytics is the umbrella term for analyzing patterns found in data to predict future behavior or results. It includes techniques and algorithms found in statistics, machine learning, artificial intelligence, and data mining." (TDWI)

25 February 2018

Data Science: Data Processing (Definitions)

"The act of turning raw data into meaningful output, generally associated with computers." (Greg Perry, "Sams Teach Yourself Beginning Programming in 24 Hours" 2nd Ed., 2001)

"Any process that converts data into information. The processing is usually assumed to be automated and running on an information system." (Eleutherios A Papathanassiou & Xenia J Mamakou, "Privacy Issues in Public Web Sites", Handbook of Research on Public Information Technology, 2008) 

"Obtaining, recording or holding the data, or carrying out any operation on the data, including organising, adapting or altering it. Retrieval, consultation or use of the data, disclosure of the data, and alignment, combination, blocking, erasure or destruction of the data are all legally classed as processing." (Mark Olive, "SHARE: A European Healthgrid Roadmap", 2009)

"The operation performed on data through capture, transformation, and storage, in order to derive new information according to a given set of rules." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Collection and elaboration of sensing data with the aim to derivate/infer new knowledge from original raw data." (Paolo Bellavista et al, "Crowdsensing in Smart Cities: Technical Challenges, Open Issues, and Emerging Solution Guidelines", 2015)

"The act of data manipulation through integration of mathematical tools, statistics, and computer application to generate information." (Babangida Zubairu, "Security Risks of Biomedical Data Processing in Cloud Computing Environment", 2018)

"Any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction." (Yordanka Ivanova, "Data Controller, Processor, or Joint Controller: Towards Reaching GDPR Compliance in a Data- and Technology-Driven World", 2020)

"Data processing is any action performed to turn raw data into useful information." (Xplenty) [source]

"Data processing occurs when data is collected and translated into usable information. […] Data processing starts with data in its raw form and converts it into a more readable format (graphs, documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized by employees throughout an organization." (Talend) [source]

24 February 2018

SQL Reloaded: Misusing Views and Pseudo-Constants

   Views, as virtual tables, can be misused to replace tables in certain circumstances, by storing values within one or multiple rows, as in the examples below:

-- parameters for a BI solution
CREATE VIEW dbo.vLoV_Parameters
AS
SELECT Cast('ABC' as nvarchar(20)) AS DataAreaId
 , Cast(GetDate() as Date) AS CurrentDate 
 , Cast(100 as int) AS BatchCount 

GO

SELECT *
FROM dbo.vLoV_Parameters

GO

-- values for a dropdown 
 CREATE VIEW dbo.vLoV_DataAreas
 AS
 SELECT Cast('ABC' as nvarchar(20)) AS DataAreaId
 , Cast('Company ABC' as nvarchar(50)) AS Description 
 UNION ALL
 SELECT 'XYZ' DataAreaId 
 , 'Company XYZ'

GO

SELECT *
FROM dbo.vLoV_DataAreas

GO

    These solutions aren’t elegant, and are typically not recommended because they go against one of the principles of good database design, namely that data belong in tables. Still, they do the trick when needed. Personally, I used them only in a handful of cases, e.g. when creating tables wasn’t allowed, when something had to be tested for a short period of time, or when the overhead of creating a table for 2-3 values wasn’t warranted. Because of their scarce use, I hadn’t given them much thought, not until I discovered Jared Ko’s blog post on pseudo-constants. He considers the values from the first view as pseudo-constants, and advocates for their use, especially for easier dependency tracking, easier code refactoring, avoidance of implicit data conversions, and easier maintenance of values.


   All these are good reasons to consider them, so I tried to take the idea further and see whether it survives a reality check. For this I took Dynamics AX as testing environment, as it makes extensive use of enumerations (aka base enums) to store lists of values needed all over the application. Behind each table there are one or more enumerations, and the tables storing master data abound with them. For exemplification let’s consider InventTrans, the table that stores the inventory transactions; the receipt and issue transactions are governed by three enumerations: StatusIssue, StatusReceipt and Direction.

-- Status Issue Enumeration 
 CREATE VIEW dbo.vLoV_StatusIssue
 AS
 SELECT cast(0 as int) AS None
 , cast(1 as int) AS Sold
 , cast(2 as int) AS Deducted
 , cast(3 as int) AS Picked
 , cast(4 as int) AS ReservPhysical
 , cast(5 as int) AS ReservOrdered
 , cast(6 as int) AS OnOrder
 , cast(7 as int) AS QuotationIssue

GO

-- Status Receipt Enumeration 
 CREATE VIEW dbo.vLoV_StatusReceipt
 AS
SELECT cast(0 as int) AS None
 , cast(1 as int) AS Purchased
 , cast(2 as int) AS Received
 , cast(3 as int) AS Registered
 , cast(4 as int) AS Arrived
 , cast(5 as int) AS Ordered
 , cast(6 as int) AS QuotationReceipt

GO

-- Inventory Direction Enumeration 
 CREATE VIEW dbo.vLoV_InventDirection
 AS
 SELECT cast(0 as int) AS None
 , cast(1 as int) AS Receipt
 , cast(2 as int) AS Issue

GO

   To see these views at work let’s construct the InventTrans table on the fly:

-- creating an ad-hoc table  
 SELECT *
 INTO  dbo.InventTrans
 FROM (VALUES (1, 1, 0, 2, -1, 'A0001')
 , (2, 1, 0, 2, -10, 'A0002')
 , (3, 2, 0, 2, -6, 'A0001')
 , (4, 2, 0, 2, -3, 'A0002')
 , (5, 3, 0, 2, -2, 'A0001')
 , (6, 1, 0, 1, 1, 'A0001')
 , (7, 0, 1, 1, 50, 'A0001')
 , (8, 0, 2, 1, 100, 'A0002')
 , (9, 0, 3, 1, 30, 'A0003')
 , (10, 0, 3, 1, 20, 'A0004')
 , (11, 0, 1, 2, 10, 'A0001')
 ) A(TransId, StatusIssue, StatusReceipt, Direction, Qty, ItemId)


    Here are two sets of examples using literals vs. pseudo-constants:

--example issued with literals 
SELECT top 100 ITR.*
 FROM dbo.InventTrans ITR
 WHERE ITR.StatusIssue = 1 
   AND ITR.Direction = 2

GO
 --example issued with pseudo-constants
 SELECT top 100 ITR.*
 FROM dbo.InventTrans ITR
      JOIN dbo.vLoV_StatusIssue SI
        ON ITR.StatusIssue = SI.Sold
      JOIN dbo.vLoV_InventDirection ID
        ON ITR.Direction = ID.Issue

GO

--example receipt with literals 
 SELECT top 100 ITR.*
 FROM dbo.InventTrans ITR
 WHERE ITR.StatusReceipt = 1
   AND ITR.Direction = 1

GO

--example receipt with pseudo-constants
 SELECT top 100 ITR.*
 FROM dbo.InventTrans ITR
      JOIN dbo.vLoV_StatusReceipt SR
        ON ITR.StatusReceipt = SR.Purchased
      JOIN dbo.vLoV_InventDirection ID
        ON ITR.Direction = ID.Receipt

 
  As can be seen, the queries using pseudo-constants make the code somewhat more readable, though the gain is only relative, as each enumeration implies an additional join. In addition, when further business tables are added to the logic (e.g. items, purchase or sales orders), the extra joins complicate the logic, making it more difficult to separate the essential from the nonessential. Imagine a translation of the following query:

-- complex query 
  SELECT top 100 ITR.*
  FROM dbo.InventTrans ITR
              <several tables here>
  WHERE ((ITR.StatusReceipt<=3 AND ITR.Direction = 1)
    OR (ITR.StatusIssue<=3 AND ITR.Direction = 2))
    AND (<more constraints here>)


   The more difficult the constraints in the WHERE clause, the more improbable a translation of the literals into pseudo-constants becomes. Considering that an average query contains 5-10 tables, each of them with 1-3 enumerations, the queries would become impracticable when using pseudo-constants, and their execution plans quite difficult to troubleshoot.

    The more I think about it, an enumeration data type available as a global variable in SQL Server (like the enumerations available in VB) would be more than welcome, especially because the same values are used over and over again throughout the queries. Imagine, for example, the possibility of writing code as follows:

-- hypothetical query
SELECT top 100 ITR.*
FROM dbo.InventTrans ITR
WHERE ITR.StatusReceipt = @@StatusReceipt.Purchased
  AND ITR.Direction = @@InventDirection.Receipt

   From my point of view this would make the code more readable and easier to maintain. Instead, in order to make the code more readable, one is usually forced to add comments to the code. This works as well, though the code can become cluttered with comments.

-- query with commented literals
SELECT top 100 ITR.*
FROM dbo.InventTrans ITR
WHERE ITR.StatusReceipt <= 3 -- Purchased, Received, Registered 
   AND ITR.Direction = 1 -- Receipt

   In conclusion, pseudo-constants’ usefulness is limited, and their usage goes against developers’ common sense; however, a data type in SQL Server with similar functionality would make code more readable and easier to maintain.


PS: It is possible to simulate an enumeration data type in tables’ definition by using a CHECK constraint.
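
   As a minimal sketch of this (table and constraint names below are illustrative, not from an actual system), a CHECK constraint can restrict a column to the values of an enumeration:

-- simulating the Direction enumeration via a CHECK constraint
CREATE TABLE dbo.T_InventDirections (
  TransId int NOT NULL
, Direction int NOT NULL
  CONSTRAINT CK_T_InventDirections_Direction
  CHECK (Direction IN (0, 1, 2)) -- None, Receipt, Issue
)

   An INSERT or UPDATE with a Direction outside 0-2 then fails with a constraint violation, enforcing the enumeration at table level, though without giving the values names.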

19 February 2018

Data Science: Data Exploration (Definitions)

Data exploration: "The process of examining data in order to determine ranges and patterns within the data." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

Data Exploration: "The part of the data science process where a scientist will ask basic questions that help her understand the context of a data set. What you learn during the exploration phase will guide more in-depth analysis later. Further, it helps you recognize when a result might be surprising and warrant further investigation." (KDnuggets)

"Data exploration is the first step of data analysis used to explore and visualize data to uncover insights from the start or identify areas or patterns to dig into more." (Tibco) [source]

"Data exploration is the initial step in data analysis, where users explore a large data set in an unstructured way to uncover initial patterns, characteristics, and points of interest. This process isn’t meant to reveal every bit of information a dataset holds, but rather to help create a broad picture of important trends and major points to study in greater detail." (Sisense) [source]

"Data exploration is the process through which a data analyst investigates the characteristics of a dataset to better understand the data contained within and to define basic metadata before building a data model. Data exploration helps the analyst choose the most appropriate tool for data processing and analysis, and leverages the innate human ability to recognize patterns in data that may not be captured by analytics tools." (Qlik) [source]

"Data exploration provides a first glance analysis of available data sources. Rather than trying to deliver precise insights such as those that result from data analytics, data exploration focuses on identifying key trends and significant variables." (Xplenty) [source]

15 February 2018

Data Science: Data Preparation (Definitions)

Data preparation: "The process which involves checking or logging the data in; checking the data for accuracy; entering the data into the computer; transforming the data; and developing and documenting a database structure that integrates the various measures. This process includes preparation and assignment of appropriate metadata to describe the product in human readable code/format." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Data Preparation describes a range of processing activities that take place in order to transform a source of data into a format, quality and structure suitable for further analysis or processing. It is often referred to as Data Pre-Processing due to the fact it is an activity that organises the data for a follow-on processing stage." (Experian) [source]

"Data preparation [also] called data wrangling, it’s everything that is concerned with the process of getting your data in good shape for analysis. It’s a critical part of the machine learning process." (RapidMiner) [source]

"Data preparation is an iterative-agile process for exploring, combining, cleaning and transforming raw data into curated datasets for self-service data integration, data science, data discovery, and BI/analytics." (Gartner)

"Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. It is an important step prior to processing and often involves reformatting data, making corrections to data and the combining of data sets to enrich data." (Talend) [source]

Data Science: Data Visualization (Definitions)

"Technique for presentation and analysis of data through visual objects, such as graphs, charts, images, and specialized tabular formats." (Paulraj Ponniah, "Data Warehousing Fundamentals", 2001)

"Technique for presentation and analysis of data through visual objects, such as graphs, charts, images, and specialized tabular formats." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010) 

"Visual representation of data, aiming to convey as much information as possible through visual processes." (Alfredo Vellido & Iván Olier, "Clustering and Visualization of Multivariate Time Series", 2010)

"Techniques for graphical representation of trends, patterns and other information." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Information abstracted in a schematic form to provide visual insights into sets of data. Data visualization enables us to go from the abstract numbers in a computer program (ones and zeros) to visual interpretation of data. Text visualization means converting textual information into graphic representation, so we can see information without having to read the data, as tables, histograms, pie or bar charts, or Cartesian coordinates." (Anna Ursyn, "Visualization as Communication with Graphic Representation", 2015)

"Presenting data and summary information using graphics, animation, and three-dimensional displays. Tools for visually displaying information and relationships often using dynamic and interactive graphics." (Daniel J Power & Ciara Heavin, "Decision Support, Analytics, and Business Intelligence" 3rd Ed., 2017)

"Data Visualization is a way of representing the data collected in the form of figures and diagrams like tables, charts, graphs in order to make the data for common man more easily understandable." (Kirti R Bhatele, "Data Analysis on Global Stratification", 2020)

"Techniques for turning data into information by using the high capacity of the human brain to visually recognize patterns and trends. There are many specialized techniques designed to make particular kinds of visualization easy." (Information Management)

"The art of communicating meaningful data visually. This can involve infographics, traditional plots, or even full data dashboards." (KDnuggets)

"The practice of structuring and arranging data within a visual context to help users understand it. Patterns and trends that might be unrecognizable to the layman in text-based data can be easily viewed and digested by end users with the help of data visualization software." (Insight Software)

"Data visualization enables people to easily uncover actionable insights by presenting information and data in graphical, and often interactive graphs, charts, and maps." (Qlik) [source]

"Data visualization is the graphical representation of data to help people understand context and significance. Interactive data visualization enables companies to drill down to explore details, identify patterns and outliers, and change which data is processed and/or excluded." (Tibco) [source]

"Data visualization is the practice of translating information into a visual context, such as a map or graph, to make data easier for the human brain to understand and pull insights from." (Techtarget) [source]

"Data visualization is the process of graphically illustrating data sets to discover hidden patterns, trends, and relationships in order to develop key insights. Data visualization uses data points as a basis for the creation of graphs, charts, plots, and other images." (Talend) [source]

"Data visualization is the use of graphics to represent data. The purpose of these graphics is to quickly and concisely communicate the most important insights produced by data analytics." (Xplenty) [source]

12 February 2018

Data Science: Correlation (Definitions)

[correlation coefficient:] "A measure to determine how closely a scatterplot of two continuous variables falls on a straight line." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A metric that measures the linear relationship between two process variables. Correlation describes the X and Y relationship with a single number (the Pearson’s Correlation Coefficient (r)), whereas regression summarizes the relationship with a line - the regression line." (Lynne Hambleton, "Treasure Chest of Six Sigma Growth Methods, Tools, and Best Practices", 2007)

[correlation coefficient:] "A measure of the degree of correlation between the two variables. The range of values it takes is between −1 and +1. A negative value of r indicates an inverse relationship. A positive value of r indicates a direct relationship. A zero value of r indicates that the two variables are independent of each other. The closer r is to +1 and −1, the stronger the relationship between the two variables." (Jae K Shim & Joel G Siegel, "Budgeting Basics and Beyond", 2008)

"The degree of relationship between business and economic variables such as cost and volume. Correlation analysis evaluates cause/effect relationships. It looks consistently at how the value of one variable changes when the value of the other is changed. A prediction can be made based on the relationship uncovered. An example is the effect of advertising on sales. A degree of correlation is measured statistically by the coefficient of determination (R-squared)." (Jae K Shim & Joel G Siegel, "Budgeting Basics and Beyond", 2008)

"A figure quantifying the correlation between risk events. This number is between negative one and positive one." (Annetta Cortez & Bob Yehling, "The Complete Idiot's Guide® To Risk Management", 2010)

"A mechanism used to associate messages with the correct workflow service instance. Correlation is also used to associate multiple messaging activities with each other within a workflow." (Bruce Bukovics, "Pro WF: Windows Workflow in .NET 4", 2010)

"Correlation is sometimes used informally to mean a statistical association between two variables, or perhaps the strength of such an association. Technically, the correlation can be interpreted as the degree to which a linear relationship between the variables exists (i.e., each variable is a linear function of the other) as measured by the correlation coefficient." (Herbert I Weisberg, "Bias and Causation: Models and Judgment for Valid Comparisons", 2010)

"The degree of relationship between two variables; in risk management, specifically the degree of relationship between potential risks." (Annetta Cortez & Bob Yehling, "The Complete Idiot's Guide® To Risk Management", 2010)

"A predictive relationship between two factors, such that when one factor changes, you can predict the nature, direction and/or amount of change in the other factor. Not necessarily a cause-and-effect relationship." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Organizing and recognizing one related event threat out of several reported, but previously distinct, events." (Mark Rhodes-Ousley, "Information Security: The Complete Reference" 2nd Ed., 2013)

"Association in the values of two or more variables." (Meta S Brown, "Data Mining For Dummies", 2014)

[correlation coefficient:] "A statistic that quantifies the degree of association between two or more variables. There are many kinds of correlation coefficients, depending on the type of data and relationship predicted." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"The degree of association between two or more variables." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A statistical measure that indicates the extent to which two variables are related. A positive correlation indicates that, as one variable increases, the other increases as well. For a negative correlation, as one variable increases, the other decreases." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

11 February 2018

Data Science: K-nearest neighbors (Definitions)

"A modeling technique that assigns values to points based on the values of the k nearby points, such as average value, or most common value." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A simple and popular classifier algorithm that assigns a class (in a preexisting classification) to an object whose class is unknown. [...] From a collection of data objects whose class is known, the algorithm computes the distances from the object of unknown class to k (a number chosen by the user) objects of known class. The most common class (i.e., the class that is assigned most often to the nearest k objects) is assigned to the object of unknown class." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"A method used for classification and regression. Cases are analyzed, and class membership is assigned based on similarity to other cases, where cases that are similar (or 'near' in characteristics) are known as neighbors." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"A prediction method, which uses a function of the k most similar observations from the training set to generate a prediction, such as the mean." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"K-Nearest Neighbors classification is an instance-based supervised learning method that works well with distance-sensitive data." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"An algorithm that estimates an unknown data item as being like the majority of the k-closest neighbors to that item." (David Natingga, "Data Science Algorithms in a Week" 2nd Ed., 2018)

"K-nearest neighbourhood is a algorithm which stores all available cases and classifies new cases based on a similarity measure. It is used in statistical estimation and pattern recognition." (Aman Tyagi, "Healthcare-Internet of Things and Its Components: Technologies, Benefits, Algorithms, Security, and Challenges", 2021)

10 February 2018

Data Science: Data Mining (Definitions)

"The non-trivial extraction of implicit, previously unknown, and potentially useful information from data" (Frawley et al., "Knowledge discovery in databases: An overview", 1991)

"Data mining is the efficient discovery of valuable, nonobvious information from a large collection of data." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Data mining is the process of examining large amounts of aggregated data. The objective of data mining is to either predict what may happen based on trends or patterns in the data or to discover interesting correlations in the data." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"A data-driven approach to analysis and prediction by applying sophisticated techniques and algorithms to discover knowledge." (Paulraj Ponniah, "Data Warehousing Fundamentals", 2001)

"A class of undirected queries, often against the most atomic data, that seek to find unexpected patterns in the data. The most valuable results from data mining are clustering, classifying, estimating, predicting, and finding things that occur together. There are many kinds of tools that play a role in data mining. The principal tools include decision trees, neural networks, memory- and cased-based reasoning tools, visualization tools, genetic algorithms, fuzzy logic, and classical statistics. Generally, data mining is a client of the data warehouse." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"The discovery of information hidden within data." (William A Giovinazzo, "Internet-Enabled Business Intelligence", 2002)

"the process of extracting valid, authentic, and actionable information from large databases." (Seth Paul et al. "Preparing and Mining Data with Microsoft SQL Server 2000 and Analysis", 2002)

"Advanced analysis or data mining is the analysis of detailed data to detect patterns, behaviors, and relationships in data that were previously only partially known or at times totally unknown." (Margaret Y Chu, "Blissful Data", 2004)

"Analysis of detail data to discover relationships, patterns, or associations between values." (Margaret Y Chu, "Blissful Data ", 2004)

"An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques, and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"the process of analyzing large amounts of data in search of previously undiscovered business patterns." (William H Inmon, "Building the Data Warehouse", 2005)

"A type of advanced analysis used to determine certain patterns within data. Data mining is most often associated with predictive analysis based on historical detail, and the generation of models for further analysis and query." (Jill Dyché & Evan Levy, "Customer Data Integration", 2006)

"Refers to the process of identifying nontrivial facts, patterns and relationships from large databases. The databases have often been put together for a different purpose from the data mining exercise." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"Data mining is the process of discovering implicit patterns in data stored in data warehouse and using those patterns for business advantage such as predicting future trends." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"Digging through data (usually in a data warehouse or data mart) to identify interesting patterns." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"Intelligently analyzing data to extract hidden trends, patterns, and information. Commonly used by statisticians, data analysts and Management Information Systems communities." (Craig F Smith & H Peter Alesso, "Thinking on the Web: Berners-Lee, Gödel and Turing", 2008)

"The process of extracting valid, authentic, and actionable information from large databases." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"The process of retrieving relevant data to make intelligent decisions." (Robert D Schneider & Darril Gibson, "Microsoft SQL Server 2008 All-in-One Desk Reference For Dummies", 2008)

"A process that minimally has four stages: (1) data preparation that may involve 'data cleaning' and even 'data transformation', (2) initial exploration of the data, (3) model building or pattern identification, and (4) deployment, which means subjecting new data to the 'model' to predict outcomes of cases found in the new data." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"Automatically searching large volumes of data for patterns or associations." (Mark Olive, "SHARE: A European Healthgrid Roadmap", 2009)

"The use of machine learning algorithms to find faint patterns of relationship between data elements in large, noisy, and messy data sets, which can lead to actions to increase benefit in some form (diagnosis, profit, detection, etc.)." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"A data-driven approach to analysis and prediction by applying sophisticated techniques and algorithms to discover knowledge." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010) 

"A way of extracting knowledge from a database by searching for correlations in the data and presenting promising hypotheses to the user for analysis and consideration." (Toby J Teorey, "Database Modeling and Design" 4th Ed., 2010)

"The process of using mathematical algorithms (usually implemented in computer software) to attempt to transform raw data into information that is not otherwise visible (for example, creating a query to forecast sales for the future based on sales from the past)." (Ken Withee, "Microsoft Business Intelligence For Dummies", 2010)

"A process that employs automated tools to analyze data in a data warehouse and other sources and to proactively identify possible relationships and anomalies." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"Process of analyzing data from different perspectives and summarizing it into useful information (e.g., information that can be used to increase revenue, cuts costs, or both)." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

"The process of sifting through large amounts of data using pattern recognition, fuzzy logic, and other knowledge discovery statistical techniques to identify previously unknown, unsuspected, and potentially meaningful data content relationships and trends." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Data mining, a branch of computer science, is the process of extracting patterns from large data sets by combining statistical analysis and artificial intelligence with database management. Data mining is seen as an increasingly important tool by modern business to transform data into business intelligence giving an informational advantage." (T T Wong & Loretta K W Sze, "A Neuro-Fuzzy Partner Selection System for Business Social Networks", 2012)

"Field of analytics with structured data. The model inference process minimally has four stages: data preparation, involving data cleaning, transformation and selection; initial exploration of the data; model building or pattern identification; and deployment, putting new data through the model to obtain their predicted outcomes." (Gary Miner et al, "Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications", 2012)

"The process of identifying commercially useful patterns or relationships in databases or other computer repositories through the use of advanced statistical tools." (Microsoft, "SQL Server 2012 Glossary", 2012)

"The process of exploring and analyzing large amounts of data to find patterns." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"An umbrella term for analytic techniques that facilitate fast pattern discovery and model building, particularly with large datasets." (Meta S Brown, "Data Mining For Dummies", 2014)

"Analysis of large quantities of data to find patterns such as groups of records, unusual records, and dependencies" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"The practice of analyzing big data using mathematical models to develop insights, usually including machine learning algorithms as opposed to statistical methods."(Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"Data mining is the analysis of data for relationships that have not previously been discovered." (Piyush K Shukla & Madhuvan Dixit, "Big Data: An Emerging Field of Data Engineering", Handbook of Research on Security Considerations in Cloud Computing, 2015)

"A methodology used by organizations to better understand their customers, products, markets, or any other phase of the business." (Adam Gordon, "Official (ISC)2 Guide to the CISSP CBK" 4th Ed., 2015)

"Extracting information from a database to zero in on certain facts or summarize a large amount of data." (Faithe Wempen, "Computing Fundamentals: Introduction to Computers", 2015)

"It refers to the process of identifying and extracting patterns in large data sets based on artificial intelligence, machine learning, and statistical techniques." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"The process of exploring and analyzing large amounts of data to find patterns." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"Term used to describe analyzing large amounts of data to find patterns, correlations, and similarities." (Brittany Bullard, "Style and Statistics", 2016)

"The process of extracting meaningful knowledge from large volumes of data contained in data warehouses." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A class of analytical applications that help users search for hidden patterns in a data set. Data mining is a process of analyzing large amounts of data to identify data–content relationships. Data mining is one tool used in decision support special studies. This process is also known as data surfing or knowledge discovery." (Daniel J Power & Ciara Heavin, "Decision Support, Analytics, and Business Intelligence" 3rd Ed., 2017)

"The process of collecting, searching through, and analyzing a large amount of data in a database to discover patterns or relationships." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"Data mining involves finding meaningful patterns and deriving insights from large data sets. It is closely related to analytics. Data mining uses statistics, machine learning, and artificial intelligence techniques to derive meaningful patterns." (Amar Sahay, "Business Analytics" Vol. I, 2018)

"The analysis of the data held in data warehouses in order to produce new and useful information." (Shon Harris & Fernando Maymi, "CISSP All-in-One Exam Guide" 8th Ed., 2018)

"The process of collecting critical business information from a data source, correlating the information, and uncovering associations, patterns, and trends." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"The process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems." (Dmitry Korzun et al, "Semantic Methods for Data Mining in Smart Spaces", 2019)

"A technique using software tools geared for the user who typically does not know exactly what he's searching for, but is looking for particular patterns or trends. Data mining is the process of sifting through large amounts of data to produce data content relationships. It can predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. This is also known as data surfing." (Information Management)

"An analytical process that attempts to find correlations or patterns in large data sets for the purpose of data or knowledge discovery." (NIST SP 800-53)

"Extracting previously unknown information from databases and using that data for important business decisions, in many cases helping to create new insights." (Solutions Review)

"is the process of collecting data, aggregating it according to type and sorting through it to identify patterns and predict future trends." (Accenture)

"the process of analyzing large batches of data to find patterns and instances of statistical significance. By utilizing software to look for patterns in large batches of data, businesses can learn more about their customers and develop more effective strategies for acquisition, as well as increase sales and decrease overall costs." (Insight Software)

"The process of identifying commercially useful patterns or relationships in databases or other computer repositories through the use of advanced statistical tools." (Microsoft)

"The process of pulling actionable insight out of a set of data and putting it to good use. This includes everything from cleaning and organizing the data; to analyzing it to find meaningful patterns and connections; to communicating those connections in a way that helps decision-makers improve their product or organization." (KDnuggets)

"Data mining is the process of analyzing hidden patterns of data according to different perspectives for categorization into useful information, which is collected and assembled in common areas, such as data warehouses, for efficient analysis, data mining algorithms, facilitating business decision making and other information requirements to ultimately cut costs and increase revenue. Data mining is also known as data discovery and knowledge discovery." (Techopedia)

"Data mining is an automated analytical method that lets companies extract usable information from massive sets of raw data. Data mining combines several branches of computer science and analytics, relying on intelligent methods to uncover patterns and insights in large sets of information." (Sisense) [source]

"Data mining is the process of analyzing data from different sources and summarizing it into relevant information that can be used to help increase revenue and decrease costs. Its primary purpose is to find correlations or patterns among dozens of fields in large databases." (Logi Analytics) [source]

"Data mining is the process of analyzing massive volumes of data to discover business intelligence that helps companies solve problems, mitigate risks, and seize new opportunities." (Talend) [source]

"Data Mining is the process of collecting data, aggregating it according to type and sorting through it to identify patterns and predict future trends." (Accenture)

"Data mining is the process of discovering meaningful correlations, patterns and trends by sifting through large amounts of data stored in repositories. Data mining employs pattern recognition technologies, as well as statistical and mathematical techniques." (Gartner)

"Data mining is the process of extracting relevant patterns, deviations and relationships within large data sets to predict outcomes and glean insights. Through it, companies convert big data into actionable information, relying upon statistical analysis, machine learning and computer science." (snowflake) [source]

"Data mining is the work of analyzing business information in order to discover patterns and create predictive models that can validate new business insights. […] Unlike data analytics, in which discovery goals are often not known or well defined at the outset, data mining efforts are usually driven by a specific absence of information that can’t be satisfied through standard data queries or reports. Data mining yields information from which predictive models can be derived and then tested, leading to a greater understanding of the marketplace." (Informatica) [source]

07 February 2018

Data Science: Hadoop (Definitions)

"An Apache-managed software framework derived from MapReduce and Bigtable. Hadoop allows applications based on MapReduce to run on large clusters of commodity hardware. Hadoop is designed to parallelize data processing across computing nodes to speed computations and hide latency. Two major components of Hadoop exist: a massively scalable distributed file system that can support petabytes of data and a massively scalable MapReduce engine that computes results in batch." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"An open-source software platform developed by Apache Software Foundation for data-intensive applications where the data are often widely distributed across different hardware systems and geographical locations." (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)

"Technology designed to house Big Data; a framework for managing data" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"an Apache-managed software framework derived from MapReduce. Big Table Hadoop enables applications based on MapReduce to run on large clusters of commodity hardware. Hadoop is designed to parallelize data processing across computing nodes to speed up computations and hide latency. The two major components of Hadoop are a massively scalable distributed file system that can support petabytes of data and a massively scalable MapReduce engine that computes results in batch." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"An open-source framework that is built to process and store huge amounts of data across a distributed file system." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"Open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"A batch processing infrastructure that stores fi les and distributes work across a group of servers. The infrastructure is composed of HDFS and MapReduce components. Hadoop is an open source software platform designed to store and process quantities of data that are too large for just one particular device or server. Hadoop’s strength lies in its ability to scale across thousands of commodity servers that don’t share memory or disk space." (Benoy Antony et al, "Professional Hadoop®", 2016)

"Apache Hadoop is an open-source framework for processing large volume of data in a clustered environment. It uses simple MapReduce programming model for reliable, scalable and distributed computing. The storage and computation both are distributed in this framework." (Kaushik Pal, 2016)

"A framework that allow for the distributed processing for large datasets." (Neha Garg & Kamlesh Sharma, "Machine Learning in Text Analysis", 2020)

 "Hadoop is an open source implementation of the MapReduce paper. Initially, Hadoop required that the map, reduce, and any custom format readers be implemented and deployed to the cluster. Eventually, higher level abstractions were developed, like Apache Hive and Apache Pig." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"A batch processing infrastructure that stores files and distributes work across a group of servers." (Oracle)

"an open-source framework that is built to enable the process and storage of big data across a distributed file system." (Analytics Insight)

"Apache Hadoop is an open-source, Java-based software platform that manages data processing and storage for big data applications. Hadoop works by distributing large data sets and analytics jobs across nodes in a computing cluster, breaking them down into smaller workloads that can be run in parallel. Hadoop can process both structured and unstructured data, and scale up reliably from a single server to thousands of machines." (Databricks) [source]

"Hadoop is an open source software framework for storing and processing large volumes of distributed data. It provides a set of instructions that organizes and processes data on many servers rather than from a centralized management nexus." (Informatica) [source]

Data Science: Semantics (Definitions)

 "The meaning of a model that is well-formed according to the syntax of a language." (Anneke Kleppe et al, "MDA Explained: The Model Driven Architecture: Practice and Promise", 2003)

"The part of language concerned with meaning. For example, the phrases 'my mother’s brother' and 'my uncle' are two ways of saying the same thing and, therefore, have the same semantic value." (Craig F Smith & H Peter Alesso, "Thinking on the Web: Berners-Lee, Gödel and Turing", 2008)

"The study of meaning (often the meaning of words). In business systems we are concerned with making the meaning of data explicit (structuring unstructured data), as well as making it explicit enough that an agent could reason about it." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"The branch of philosophy concerned with describing meaning." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Having to do with meaning, usually of words and/or symbols (the syntax). Part of semiotic theory." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The study of the meaning behind the syntax (signs and symbols) of a language or graphical expression of something. The semantics can only be understood through the syntax. The syntax is like the encoded representation of the semantics." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The study of meaning. In the context of Big Data, semantics is the technique of creating meaningful assertions about data objects. A meaningful assertion, as used here, is a triple consisting of an identified data object, a data value, and a descriptor for the data value. In practical terms, semantics involves making assertions about data objects (i.e., making triples), combining assertions about data objects (i.e., merging triples), and assigning data objects to classes; hence relating triples to other triples. As a word of warning, few informaticians would define semantics in these terms, but I would suggest that most definitions for semantics would be functionally equivalent to the definition offered here." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"Set of mappings forming a representation in order to define the meaningful information of the data." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"Semantics is a branch of linguistics focused on the meaning communicated by language." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)
