SQL Troubles: data discovery

Showing posts with label data discovery. Show all posts

04 March 2024

🧭Business Intelligence: A Software Engineer's Perspective (Part VI: The Data Citizen)

Business Intelligence Series

More than a century ago, Jerbert G Wells wrote on mathematical literacy: "[...] the time may not be very remote when it will be understood that for complete initiation as an efficient citizen of one of the new great complex world-wide States that are now developing, it is as necessary to be able to compute, to think in averages and maxima and minima, as it is now to be able to read and write" [1]. The quote is occasionally misquoted as referring to Statistics, though frankly the boundaries of mathematical, statistical, numerical and data literacy tend to melt into each other, existing multiple dependencies between them.

In the age of big data, data citizens, business people able to use data, data processing and visualization tools for building solutions that enable their job, become steadily a necessity for businesses in their quest of making data-driven decisions, gaining insight and whatever valuable use data might have for the organizations. The need is not new, Microsoft Access and Excel were used for similar purposes already in the 90s, becoming a maintenance nightmare for IT, data islands without proper backup or documentation existing through the organizations, diverse numbers being reported and contradicting each other.

Then IT took over, trying to find alternatives for the data islands, implementing concepts like single source(s) of truth, quality gates and supporting processes, designing data models and infrastructures for self-service, allowing users to tap into the data for data exploration, discovery, reporting, etc. Getting all this right required to redesign existing infrastructures, making one step forward and a few steps back, in the end everything is a learning process. Such an effort can easily consume an organization's resources.

Microsoft and other vendors for data-driven solutions keep insisting on how much potential exist in their tools for the data citizen, how the citizens can bring competitive advantage for organizations, automating business and supporting processes. The potential is not to neglect, though it requires a considerable investment from organizations in training and mentoring data citizens, in building data warehouses or data meshes that focus on end-user self-service needs. The data citizen needs time to learn, to play with the data, build solutions, test their usefulness in the daily tasks, respectively incorporate and disseminate the knowledge gained within the organization.

There are many scenarios in which results can be obtained with a minimum of effort, however there are also hard limits. Besides the learning effort and the time available, there are cognitive, knowledge and ability limits that vary from person to person. Understanding what good architecture, design and techniques means is unfortunately not for everybody, and here's where the concept of citizen data analyst or citizen scientist breaks, and this independently of the tools used. There are also IT people who have similar challenges.

It must be also recognized that the solutions built in the early stages by data citizens are primarily personal solutions that need to be reviewed and brought to the standards adopted by the organization. In time, it's expected to reduce considerably such effort by evolving data citizen's knowledge and skillset. Without this further work, the solutions built will tend to display some of the shortcomings of the solutions built on MS Access or Excel.

The concept of data citizen can work as long the various assumptions and needs are adequately addressed, however progress will not happen overnight. The effort needs to become part of organization's long-term strategy, and the effort can be considerable for many organizations. Mentorship in terms of technical and non-technical support is needed. It's advisable to proceed in small iterative steps and integrate gradually the lessons learned.

Previous Post <<||>> Next Post

Resources:

[1] "Mankind in the Making", by Herbert G Wells, 1903 [Source]

13 June 2020

🧭☯Business Intelligence: Self-Service BI (The Good, the Bad and the Ugly)

Self-Service BI (SSBI) is a form of Business Intelligence (BI) in which the users are enabled and empowered to explore and analyze the data, respectively build reports and visualizations on their own, with minimal IT support.

The Good: Modern SSBI tools like PowerBI, Tableau or Qlik Sense provide easy to use and rich functionality for data preparation, exploration, discovery, integration, modelling, visualization, and analysis. Moreover, they integrated the advances made in graphics, data storage and processing (e.g. in-memory processing, parallel processing), which allow addressing most of data requirements. With just a few drag-and-drops users can display details, aggregate data, identify trends and correlations between data. Slice-and-dice or passthrough features allow navigating the data across dimensions and different levels of details. In addition, the tools can leverage the existing data models available in data warehouses, data marts and other types of data repositories, including the rich set of open data available on the web.

With the right infrastructure, knowledge and skills users can better understand and harness the business data, using them to address business questions, they can make faster and smarter decisions rooted in data. SSBI offers the potential of increasing the value data have for the organization, while improving the time to value for data products (data models, reports, visualizations).

The Bad: In the 90s products like MS Excel or Access allowed users to build personal solutions to address gaps existing in processes and reporting. Upon case, the personal solutions gained in importance, starting to be used by more users to the degree that they become essential for the business. Thus, these islands of data and knowledge started to become a nightmare for the IT department, as they were supposed to be kept alike and backed-up. In addition, issues like security of data, inefficient data processing, duplication of data and effort, different versions of truth, urged the business to consolidate such solutions in standardized solutions.

Without an adequate strategy and a certain control over the outcomes of the SSBI initiatives, organization risk of reaching to the same deplorable state, with SSBI initiatives having the potential to bring more damage than the issues they can solve. Insufficient data quality and integration, unrealistic expectations, the communication problems between business and IT, as well insufficient training and support have the potential of making SSBI’s adoption more difficult.

The investment in adequate SSBI tool(s) might be small compared with the further changes that need to be done within the technical and logistical BI infrastructure. In addition, even if the role of IT is minimized, it doesn’t mean that IT needs to be left out of the picture. IT is still the owner of the IT infrastructure, it still needs to oversight the self-service processes and the flow of data, information and knowledge within the organization. From infrastructure to skillset, there are aspects of the SSBI that need to be addressed accordingly. The BI professional can’t be replaced entirely, though the scope of his work may shift to address new types of challenges.

Not understanding that SSBI initiatives are iterative, explorative in nature and require time to bring value, can put unnecessary pressure on those being part of it. Renouncing to SSBI initiatives without attempting to address the issues and stir them in the right direction hinder an organization and its employees’ potential to grow, with all the implication deriving from it.

The Ugly: Despite the benefits SSBI can bring, its adoption within organizations remains low. Whether it’s business’ credibility in own forces, or the inherent technical or logistical challenges, SSBI follows the BI trend of being a promise that seldom reaches its potential.

Previous Post <<||>> Next Post

24 May 2020

🧮💫ERP: Proof-of-Concept (Part I: Migrating AdventureWorks to Dynamics 365 - Products)

ERP Implementations Series

Below is exemplified the migration of Products from AdventureWorks database to Dynamics 365 (D365), where a minimum of steps were considered. Variations (e.g. enrichment of data, successive migrations) and other Entities (Product Variants, Released Products, Released Product Variants, etc.) will be considered in future posts.

As the AdventureWorks database is available only for testing and exemplification purposes, there is no need for a data import layer, the data being prepared into the “Map” schema created for this purpose. In theory the same approach can be used in production systems as well, though usually it’s better to detach the migration layer from the source system(s) from performance or security reasons.

-- creating a schema into the AdventureWorks database 
CREATE SCHEMA [Map]

Step 1: Data Discovery

Within this step one attempts getting a sorrow understanding of the systems involved within the data migration, in this case AdventureWorks and D365. As basis for this will be analyzed the tables for each entity, respectively the relations existing between them, the values, the distribution as well the relations existing between attributes. Is needed to analyze the similarities as well differences between the involved data models at structural as well at value level.

In AdventureWorks the SKU (Stock Keeping Unit) has a Color, Size and Style as Dimensions, a Product being created for each SKU. In D365 one differentiates between Products and Dimensions associated with it, having thus two levels defined. In addition, in D365 a Product has also the Configuration as dimension:

In addition, the Products and their Dimensions are defined at master level with a minimal of attributes like the Dimension Group. After that the Products and their Dimensions can be released for each Business Unit (aka Data Area), where the detailed attributes for Purchasing, Sales, Production or Inventory are maintained. For those acquainted with Dynamics AX 2009 it’s the same structure.
Once the structural differences identified, one can start looking at the values that define a product in both systems. The following queries are based on the source system.

-- reviewing the sizes 
SELECT Size 
, count(*) NoSizes
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY Size
ORDER BY Size

-- reviewing the colors 
SELECT Color
, count(*) NoColors
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY Color
ORDER BY Color

-- reviewing the styles  
SELECT Style
, count(*) NoStyles
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY Style
ORDER BY Style

If the above queries show what values are used, the following shows the dependencies between them:

-- reviewing the sizes, colors, styles  
SELECT Size
, Color
, Style
, count(*) NoValues
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY Size
, Color
, Style
ORDER BY 1,2,3

-- reviewing the dependencies between sizes, colors, styles  
SELECT CASE WHEN IsNull(Size, '') != '' THEN 'x' ELSE '' END HasSize
, CASE WHEN IsNull(Color, '') != '' THEN 'x' ELSE '' END HasColor
, CASE WHEN IsNull(Style, '') != '' THEN 'x' ELSE '' END HasStyle
, count(*) NoValues
FROM [Production].[vProductDetails]
WHERE CultureId = 'en'
GROUP BY CASE WHEN IsNull(Size, '') != '' THEN 'x' ELSE '' END 
, CASE WHEN IsNull(Color, '') != '' THEN 'x' ELSE '' END 
, CASE WHEN IsNull(Style, '') != '' THEN 'x' ELSE '' END
ORDER BY 1,2,3

The last query is probably the most important, as it shows how the products need to be configured into the target system:

As can be seen a product can have only Color, Color and Style, Size and Color, respectively no dimensions or all dimensions. It will be needed to define a Dimension Group for each of these cases (e.g. Col, ColSty, SizCol, SizColSty, None). (More information on this in a future post.)

Unfortunately, unless the target system is already in use, there are no values usually, though one can attempt entering a few representative values manually over the user interface, at least to see what tables get populated.

Step 2: Data Mapping

Once the main attributes from source and target were identified, one can create the mapping at attribute level between them. Typically, one includes all the relevant information for a migration, from table, attribute, description to attributes’ definition (e.g. type, length, precision, mandatory) in all the systems:

The mapping was kept to a minimum to display only the most relevant information. Except a warning concerning the length of an attribute, respectively a new attribute (the old item number), the mapping doesn’t involve any challenges.

A data dictionary or even a metadata repository for the involved systems can help in the process, otherwise one needs to access the information from the available documentation or system’s metadata and prepare the data manually.

The relevant metadata for D365 can be obtained from the Microsoft documentation. The data can be loaded into system via the EcoResProductV2Entity (see also data entities or the AX 2012 documentation for tables and enumeration data types).

Step 3: Building the source entity

AdventureWorks already provides a view which models the Products entity, though because of its structure it needs to suffer some changes, or sometimes more advisable, do the changes in a separate view as follows:

-- Products source entity 
CREATE VIEW Map.vProductDetails 
AS 
SELECT CASE WHEN Size<>'' THEN dbo.CutLeft(ProductNumber, '-',1) ELSE ProductNumber End ItemIdOld 
, CASE WHEN Size<>'' THEN dbo.CutLeft(Name, '-',1) ELSE Name End Name 
, row_number() OVER(PARTITION BY CASE WHEN Size<>'' THEN dbo.CutLeft(ProductNumber, '-',1) ELSE ProductNumber End ORDER BY ProductNumber) Ranking
, ProductNumber 
, Description 
, Color
, Size
, Style
, CultureId 
, Subcategory 
, Category
, MakeFlag
, FinishedGoodsFlag
, SellStartDate 
, SellEndDate 
, StandardCost 
, ListPrice 
, SafetyStockLevel 
, ReorderPoint 
FROM [Production].[vProductDetails]

-- reviewing the data 
SELECT *
FROM Map.vProductDetails 
WHERE CultureId = 'en'
ORDER BY ProductNumber

To prepare the data for the migration the Product Number as well the Name were stripped from the Size, this being done with the help of dbo.CutLeft function. The row_number ranking window function was used to allow later selecting the first size for a given Product.

The discovery process continues, this time in respect to the target. Its useful to understand for example whether a Product has more than one Color or Style, whether the prices vary between Sizes, whether attributes like the Subcategory are consistent between Sizes, etc. It’s useful to prove anything that could have impact on the migration logic. The list of tests will be extended while building the logic, as new information are discovered.

-- checking dimensions' definition
SELECT ItemidOld 
, count(Size) NoSizes
, count(DISTINCT Color) NoColors
, count(DISTINCT Style) NoStyles
FROM Map.vProductDetails 
WHERE CultureId = 'en'
  --AND Ranking = 1
GROUP BY ItemidOld
ORDER BY ItemidOld

-- checking the price variances between dimensions 
SELECT ItemidOld 
, Min(StandardCost) MinStandardCost
, Max(StandardCost) MaxStandardCost
FROM Map.vProductDetails 
WHERE CultureId = 'en'
  AND Ranking = 1
GROUP BY ItemidOld
HAVING Min(IsNull(StandardCost, 0)) != Max(IsNull(StandardCost, 0)) 
ORDER BY ItemidOld

-- checking attribute's consistency between dimensions 
SELECT ItemidOld 
, Min(Subcategory) MinSubcategory
, Max(Subcategory) MaxSubcategory
FROM Map.vProductDetails 
WHERE CultureId = 'en'
  AND Ranking = 1
GROUP BY ItemidOld
HAVING Min(IsNull(Subcategory, '')) != Max(IsNull(Subcategory, '')) 
ORDER BY ItemidOld

When the view starts performing poorly, for example because of the number of joins or data’s volume, it might me useful to dump the data in a table and perform the tests on it.
Even if it’s not maybe the case, it’s useful to apply defensive techniques in the logic by handing adequately the nulls.

Step 4: Implementing the Mapping

The attributes which need to be considered here are based on the target entities. It might be needed to include also attributes that are further needed to build the logic.

-- Product Mapping 
CREATE VIEW [Map].vEcoResProductV2Entity
AS
SELECT ProductId 
, ItemidOld ItemId 
, 'Item' ProductType 
, 'ProductMaster' ProductSubtype 
, Left(Replace(Name, ' ', ''), 20) ProductSearchName 
, ItemidOld ProductNumber 
, Name ProductName 
, Description ProductDescription 
, CASE 
 WHEN IsNull(Size, '') !='' AND IsNull(Color, '') !='' AND IsNull(Style, '')!='' THEN 'SizColSty'
 WHEN IsNull(Size, '') !='' AND IsNull(Color, '') !=''  THEN 'SizCol'

        WHEN IsNull(Style, '') !='' AND IsNull(Color, '') !=''  THEN 'ColSty'
 WHEN IsNull(Color, '') !='' THEN 'Col'
 WHEN IsNull(Style, '')!='' THEN 'Sty'

        WHEN IsNull(Size, '')!='' THEN 'Siz'
 ELSE 'None'
  END ProductDimensionGroupName 
, 'WHS' StorageDimensionGroupName 
, CASE WHEN MakeFlag = 1 THEN 'SN' ELSE '' END TrackingDimensionGroupName 
, 'PredefinedVariants' VariantConfigurationTechnology 
, Subcategory ProductCategory 
, ItemidOld ItemIdOld 
, 1 IsNewItem 
FROM Map.vProductDetails 
WHERE CultureId = 'en'
  AND Ranking = 1

The Dimenstion Group is based on the above observation. It was supposed that all Products have inventory (see Storage Dimension), while the manufactured products will get a Serial Number (see Tracking Dimension). IsNewItem will be used further to migrate deltas (and thus to partition migrations).

Step 5: Building the Target Entity

The target entity is in the end only a table in which usually are kept only the attributes in scope. In this case the definition is given by the following DDL:

CREATE TABLE Map.EcoResProductV2Entity(
 Id int IDENTITY(1,1) NOT NULL,
 ProductId int NULL,
 ProductType nvarchar(20) NULL,
 ProductSubtype nvarchar(20) NULL,
 ProductsearchName nvarchar(255) NULL,
 ProductNumber nvarchar(20) NOT NULL,
 ProductName nvarchar(60) NULL,
 ProductDescription nvarchar(1000) NULL,
 ProductDimensionGroupName nvarchar(50) NULL,
 StorageDimensionGroupName nvarchar(50) NULL,
 TrackingDimensionGroupName nvarchar(50) NULL,
 VariantConfigurationTechnology nvarchar(50) NULL,
 ProductCategory nvarchar(255) NULL,
 ItemIdOld nvarchar(20) NULL,
 IsNewItem bit,
 CONSTRAINT I_EcoResProductV2Entity PRIMARY KEY CLUSTERED 
(
 Id ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

The table was build to match the name and definition from the target systems. The definition is followed by a few inserts based on the logic defined in the previous step:

-- preparing the data for EcoResProductV2Entity 
INSERT INTO [Map].EcoResProductV2Entity 
SELECT ITM.ProductId
, ITM.ProductType
, ITM.ProductSubtype
, ITM.ProductsearchName
, ITM.ProductNumber
, ITM.ProductName
, ITM.ProductDescription
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.VariantConfigurationTechnology
, ITM.ProductCategory
, ITM.ItemIdOld 
, ITM.IsNewItem
FROM [Map].VEcoResProductV2Entity ITM
WHERE ITM.IsnewItem = 1
ORDER BY ProductType 
, ITM.ItemIdOld 

-- reviewing the data 
SELECT *
FROM [Map].EcoResProductV2Entity 
ORDER BY ItemId

The business might decide to take over the Product Number into the target system as unique identifier, though it’s not always the case. It might opt to create a new sequence number, which could start e.g. with 10000000 (8 characters). In such a case is changed only the logic for the Product Number, the value being generated using a ranking window function:

-- preparing the data for EcoResProductV2Entity  
DECLARE @StartItemId as int = 10000000
--INSERT INTO [Map].EcoResProductV2Entity 
SELECT ...
, @StartItemId + Rank() OVER(ORDER BY ITM.ProductType, ITM.ItemIdOld) ProductNumber
, ...
FROM [Map].VEcoResProductV2Entity ITM
WHERE ITM.IsnewItem = 1
ORDER BY ProductType 
, ITM.ItemIdOld

Step 5: Reviewing the Data

Before exporting the data it makes sense to review the data from various perspectives: how many Products of a certain type will be created, whether the current and old product numbers are unique, etc. The scripts make sure that the consistency of the data in respect to the future systems was achieved.

-- checking values' frequency (overview, no implications)
SELECT ITM.ProductType
, ITM.ProductSubtype
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.VariantConfigurationTechnology
, count(*) NoRecords
FROM [Map].EcoResProductV2Entity ITM
WHERE IsNewItem = 1
GROUP BY ITM.ProductType
, ITM.ProductSubtype
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.VariantConfigurationTechnology

-- check ProductNumber's uniqueness (no duplicates allowed)
SELECT ProductNumber
, Min(ItemidOld) 
, max(ItemIdold)
, count(*)
FROM [Map].EcoResProductV2Entity
GROUP BY ProductNumber
HAVING count(*)>1

-- check old Product's uniqueness (no duplicates allowed)
SELECT ItemIdOld
, count(*)
FROM [Map].EcoResProductV2Entity
GROUP BY ItemIdOld
HAVING count(*)>1

This section will grow during the implementation, as further entities will be added.

Step 6: Exporting the data

The export query is usually reflecting the entity and can include further data’s formatting, when needed:

-- Export Products
SELECT ITM.ProductType
, ITM.ProductSubtype
, ITM.ProductsearchName
, ITM.ProductNumber
, ITM.ProductName
, ITM.ProductDescription
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.VariantConfigurationTechnology
, ITM.ProductCategory RetailProductCategoryName 
, ITM.ItemIdOld
FROM [Map].EcoResProductV2Entity ITM
WHERE ITM.isNewItem = 1
ORDER BY ProductNumber

Depending on the import needs, the data can be exported to Excel or a delimited text file (e.g. “|” pipe is an ideal delimiter.

Step 7: Validating the Data before Import

Before importing the data into the target system, it makes sense to have the data checked by the business or consultants. A visual check at this stage can help save time later.

Step 8: Validating the Data after Import

Unfortunately Microsoft doesn’t allow direct access to the D365 Production database, however one can still access various tables and entities’ content via the table browser. Anyway, the validation usually takes place into the UAT (User Acceptance Testing) system. So, if everything went well into the UAT and all measures were taken to have the same parameters across all systems, there should be no surprises during Go-Live.

09 February 2017

⛏️Data Management: Data Discovery (Definitions)

"Data discovery is the process where an organization obtains an understanding of which data matters the most and identifies challenges with that data. The outcome of data discovery is that the scope of a data quality initiative should be clear and data quality rules can be defined." (Robert Hawker, "Practical Data Quality", 2023)

"The process of analyzing the type, quality, accessibility, and location of data in all available data repositories. It's critical for determining the current state of a data environment, especially when a recent and accurate data dictionary doesn't exist." (Forrester)

"Data Discovery describes a range of techniques designed to collect and consolidate information before an alysing it to find relationships and outliers between entities (or data items) that may exist. This process may be done on data from the same database or across multiple, disparate databases. (experian) [source]

"Data discovery involves the collection and evaluation of data from various sources and is often used to understand trends and patterns in the data." (Tibco) [source]

Data discovery is not a tool. It is a business user oriented process for detecting patterns and outliers by visually navigating data or applying guided advanced analytics. Discovery is an iterative process that does not require extensive upfront model creation. (BI Survey) [source]

"Data discovery is the process of using a range of technologies that allow users to quickly clean, combine, and analyze complex data sets and get the information they need to make smarter decisions and impactful discoveries." (Qlik) [source]

15 April 2015

📊Business Intelligence: Text Analytics (Definitions)

"A technique whereby software employs linguistics and pattern detection techniques to impute some larger meaning to the words in a document. Entity extraction and document categorization are two emerging types of text analytics." (Mike Moran & Bill Hunt , "Search Engine Marketing, Inc", 2005)

"Transforms unstructured text into structured 'text data' that can then be searched, mined, or discovered." (Linda Volonino & Efraim Turban, "Information Technology for Management 8th Ed", 2011)

"The process of analyzing unstructured text, extracting relevant information, and transforming it into structured information that can be leveraged in various ways." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"Refers generally to the process of deriving patterns and trends from unstructured content such as notes, reports, and comments." (Jim Davis & Aiman Zeid, "Business Transformation: A Roadmap for Maximizing Organizational Insights", 2014)

"The practice of analyzing unstructured data." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"Text analytics a variety of computer-based techniques designed to deriving information from text sources." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information that can be leveraged in various ways." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"The process of deriving insights from large volumes of text, typically through the use of specialized software to identify patterns, trends, and sentiment. " (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

[AI-based text analytics:] "Machine-learning and rules-based analytics technology that mines semistructured and unstructured text data sources and extracts structured information (such as keywords, concepts, entities, topics, sentiment, emotion, and intent) to analyze the findings for correlations, trends, outliers, patterns, and anomalies." (Forrester)

"A subset of natural language processing (NLP) technologies that identifies structures and patterns in text and transforms them into actionable insights to drive better business outcomes." (Forrester)

"Text analytics is the process of deriving information from text sources. It is used for several purposes, such as: summarization (trying to find the key content across a larger body of information or a single document), sentiment analysis (what is the nature of commentary on an issue), explicative (what is driving that commentary), investigative (what are the particular cases of a specific issue) and classification (what subject or what key content pieces does the text talk about)." (Gartner)

SQL Troubles

Pages