SQL Troubles: tools

Showing posts with label tools. Show all posts

15 July 2025

🤖〽️Prompt Engineering: Copilot Unabridged (Part 53: The Future of Business Intelligence - Will AI Make It Obsolete?)

Prompt Engineering Series

Prompt: "write a blogpost of 600 words on how AI may start making business intelligence obsolete"

Introduction

Business intelligence (BI) has long been the backbone of data-driven decision-making, helping organizations analyze trends, optimize operations, and gain competitive advantages. However, as artificial intelligence (AI) continues to evolve, many wonder whether traditional BI tools and methodologies will become obsolete. AI’s ability to process vast amounts of data, generate insights autonomously, and adapt in real time is reshaping the landscape of business analytics. But does this mean BI will disappear entirely, or will it simply evolve?

The Shift from Traditional BI to AI-Driven Analytics

Traditional BI relies on structured data, dashboards, and human interpretation to extract meaningful insights. Analysts and business leaders use BI tools to generate reports, visualize trends, and make informed decisions. However, AI is introducing a new paradigm - one where data analysis is automated, predictive, and adaptive.

AI-driven analytics can:

Process unstructured data from sources like social media, emails, and customer interactions.
Identify patterns and correlations that human analysts might overlook.
Provide real-time insights without requiring manual report generation.
Predict future trends using machine learning models.

These capabilities suggest that AI is not just enhancing BI - it is fundamentally transforming it.

Why AI Might Replace Traditional BI Tools

Several factors indicate that AI could make traditional BI tools obsolete:

Automation of Data Analysis: AI eliminates the need for manual data processing, allowing businesses to generate insights instantly. Traditional BI tools require human intervention to clean, structure, and interpret data, whereas AI can automate these processes.
Predictive and Prescriptive Analytics: While BI focuses on historical data, AI-driven analytics predict future trends and prescribe actions. Businesses can move beyond reactive decision-making and adopt proactive strategies based on AI-generated forecasts.
Natural Language Processing (NLP) for Data Queries: AI-powered systems enable users to ask questions in natural language rather than navigating complex dashboards. This makes data analysis more accessible to non-technical users, reducing reliance on BI specialists.
Continuous Learning and Adaptation: AI models improve over time, refining their predictions and insights based on new data. Traditional BI tools require manual updates and adjustments, whereas AI evolves autonomously.

Challenges and Limitations of AI in Business Intelligence

Despite AI’s advancements, there are reasons why BI may not become entirely obsolete:

Data Governance and Compliance: AI-driven analytics must adhere to strict regulations regarding data privacy and security. Businesses need human oversight to ensure compliance with laws such as GDPR.
Interpretability and Trust: AI-generated insights can sometimes be opaque, making it difficult for business leaders to trust automated recommendations. Traditional BI tools provide transparency in data analysis.
Human Expertise in Decision-Making: AI can generate insights, but human intuition and strategic thinking remain essential for complex business decisions. AI should complement, not replace, human expertise.

The Future: AI-Augmented Business Intelligence

Rather than making BI obsolete, AI is likely to augment and enhance business intelligence. The future of BI will involve AI-powered automation, predictive analytics, and real-time decision-making, but human oversight will remain crucial.

Organizations that embrace AI-driven BI will gain a competitive edge, leveraging automation while maintaining strategic control. The key is to integrate AI as a collaborative tool rather than a complete replacement for traditional BI methodologies.

Conclusion

AI is revolutionizing business intelligence, but it is unlikely to make it entirely obsolete. Instead, BI will evolve into a more automated, predictive, and adaptive system powered by AI. Businesses that integrate AI-driven analytics will benefit from faster insights, improved decision-making, and enhanced efficiency.

The future of AI is not about replacement - it’s about transformation. AI will redefine how businesses analyze data, but human expertise will remain essential in shaping strategic decisions.

Disclaimer: The whole text was generated by Copilot (under Windows 10) at the first attempt. This is just an experiment to evaluate feature's ability to answer standard general questions, independently on whether they are correctly or incorrectly posed. Moreover, the answers may reflect hallucinations and other types of inconsistent or incorrect reasoning.

Previous Post <<||>> Next Post

08 February 2025

🌌🏭KQL Reloaded: First Steps (Part VIII: Translating SQL to KQL - Full Joins)

One of the great features of KQL is the possibility of translating SQL code to KQL via the "explain" keyword, allowing thus to port SQL code to KQL, respectively help translate knowledge from one programming language to another.

Let's start with a basic example:

// transform SQL to KQL code (to be run only the first part from --)
--
explain
SELECT top(10) CustomerKey, FirstName, LastName, CityName, CompanyName 
FROM Customers 
ORDER BY CityName DESC

// output: translated KQL code 
Customers
| project CustomerKey, FirstName, LastName, CityName, CompanyName
| sort by CityName desc nulls first
| take int(10)

The most interesting part of the translation is how "explain" translate joins from SQL to KQL. Let's start with a FULL JOIN from the set of patterns considered in a previous post on SQL joins:

--
explain
SELECT CST.CustomerKey
, CST.FirstName + ' ' + CST.LastName CustomerName
, Cast(SAL.DateKey as Date) DateKey
, SAL.TotalCost
FROM NewSales SAL
    JOIN Customers CST
      ON SAL.CustomerKey = CST.CustomerKey 
WHERE SAL.DateKey > '20240101' AND SAL.DateKey < '20240201'
ORDER BY CustomerName, DateKey, TotalCost DESC

And, here's the translation:

// translated code
NewSales
| project-rename ['SAL.DateKey']=DateKey
| join kind=inner (Customers
| project-rename ['CST.CustomerKey']=CustomerKey
    , ['CST.CityName']=CityName
    , ['CST.CompanyName']=CompanyName
    , ['CST.ContinentName']=ContinentName
    , ['CST.Education']=Education
    , ['CST.FirstName']=FirstName
    , ['CST.Gender']=Gender
    , ['CST.LastName']=LastName
    , ['CST.MaritalStatus']=MaritalStatus
    , ['CST.Occupation']=Occupation
    , ['CST.RegionCountryName']=RegionCountryName
    , ['CST.StateProvinceName']=StateProvinceName) 
    on ($left.CustomerKey == $right.['CST.CustomerKey'])
| where ((['SAL.DateKey'] > todatetime("20240101")) 
    and (['SAL.DateKey'] < todatetime("20240201")))
| project ['CST.CustomerKey']

    , CustomerName=__sql_add(__sql_add(['CST.FirstName']
    , " "), ['CST.LastName'])
    , DateKey=['SAL.DateKey']
    , TotalCost
| sort by CustomerName asc nulls first
    , DateKey asc nulls first
    , TotalCost desc nulls first
| project-rename CustomerKey=['CST.CustomerKey']

The code was slightly formatted to facilitated its reading. Unfortunately, the tool doesn't work well with table aliases, introduces also all the fields available from the dimension table, which can become a nightmare for the big dimension tables, the concatenation seems strange, and if one looks deeper, further issues can be identified. So, the challenge is how to write a query in SQL so it can minimize the further changed in QKL.

Probably, one approach is to write the backbone of the query in SQL and add the further logic after translation.

--
explain
SELECT NewSales.CustomerKey
, NewSales.DateKey 
, NewSales.TotalCost
FROM NewSales 
    INNER JOIN Customers 
      ON NewSales.CustomerKey = Customers.CustomerKey 
WHERE DateKey > '20240101' AND DateKey < '20240201'
ORDER BY NewSales.CustomerKey
, NewSales.DateKey

And the translation looks simpler:

// transformed query
NewSales
| join kind=inner 
(Customers
| project-rename ['Customers.CustomerKey']=CustomerKey
    , ['Customers.CityName']=CityName
    , ['Customers.CompanyName']=CompanyName
    , ['Customers.ContinentName']=ContinentName
    , ['Customers.Education']=Education
    , ['Customers.FirstName']=FirstName
    , ['Customers.Gender']=Gender
    , ['Customers.LastName']=LastName
    , ['Customers.MaritalStatus']=MaritalStatus
    , ['Customers.Occupation']=Occupation
    , ['Customers.RegionCountryName']=RegionCountryName
    , ['Customers.StateProvinceName']=StateProvinceName) 
    on ($left.CustomerKey == $right.['Customers.CustomerKey'])
| where ((DateKey > todatetime("20240101")) 
    and (DateKey < todatetime("20240201")))
| project CustomerKey, DateKey, TotalCost
| sort by CustomerKey asc nulls first
, DateKey asc nulls first

I would have written the query as follows:

// transformed final query
NewSales
| where (DateKey > todatetime("20240101")) 
    and (DateKey < todatetime("20240201"))
| join kind=inner (
    Customers
    | project CustomerKey
        , FirstName
        , LastName 
    ) on $left.CustomerKey == $right.CustomerKey
| project CustomerKey
    , CustomerName = strcat(FirstName, ' ', LastName)
    , DateKey
    , TotalCost
| sort by CustomerName asc nulls first
    , DateKey asc nulls first

So, it makes sense to create the backbone of a query, translate it to KQL via explain, remove the unnecessary columns and formatting, respectively add what's missing. Once the patterns were mastered, there's probably no need to use the translation tool, but could prove to be also some exceptions. Anyway, the translation tool helps considerably in learning. Big kudos for the development team!

Notes:
1) The above queries ignore the fact that Customer information is available also in the NewSales table, making thus the joins obsolete. The joins were considered only for exemplification purposes. Similar joins might be still needed for checking the "quality" of the data (e.g. for dimensions that change over time). Even if such "surprises" shouldn't appear by design, real life designs continue to surprise...
2) Queries should be less verbose by design (aka simplicity by design)! The more unnecessary code is added, the higher the chances for errors to be overseen, respectively the more time is needed to understand and validated the queries!

Happy coding!

Previous Post <<||>> Next Post

References
[1] Microsoft Lear n (2024) Azure: Query data using T-SQL [link]

14 October 2023

💠Azure Tools & Services: The Azure Diagrams Architecture Advisor (Test Drive)

There seems to be a new tool available for the Azure community, even if the tool doesn't belong yet to the Microsoft technology stack. I'm talking about Azure Diagrams Architecture Advisor, which enables users to "design diagrams in a collaborative manner and provides guidance on which services can be integrated and when they should be utilized in your architecture".

I took the tool for a test drive, trying to adapt an architecture I used recently for a Synapse Data Lakehouse. I used the Data Warehouse using Synapse Analytics example as source of inspiration. (You can use the links to check the latest versions.)

Overall, it was a good experience, the functionality being straightforward to use. I needed some time to identify what resources are available, and there are a few. Even if some of the resources were missing (e.g. Dataverse and Dynamics 365 services), I could use a custom resource to fill the gap.

One can easily drag and drop resources (see the left panel) on the canvas and create links between them. The diagram can be saved only as soon as you give it a name and sign in for an account, the email and a password are all you need for that. The diagram can be kept hidden or made available for the whole community, respectively can be saved locally as PNG or SVG files.

Serverless Data Lakehouse

Each standard (non-custom) resource has a link to documentation, respectively to the pricing. The arrows between resources show whether the integration is supported by default. One can create bidirectional arrows by creating a second arrow from destination to source.

The functionality is pretty basic, though I was able to generate the diagram I wanted, and I'm satisfied with the result. Unfortunately, I couldn't modify directly the example I used for inspiration. It would be great if further functionality like the one available in Visio Online could be made available (e.g. automatic alignment, copy pasting whole diagrams or parts of them, making each value editable, grouping several resources together, adding more graphical content, etc.).

The resources and icons available focus on the Microsoft Azure technology stack. It would be great if one could import at least new icons, though ideally further resources from other important software vendors should be supported (e.g. AWS, Oracle, Adobe, etc.).

The possibility to define further links (e.g. to the actual environments) or properties (e.g. environment names), respectively to change the views for the whole diagram could enhance the usability of such diagrams.

Microsoft should think seriously about including this tool in their stack, if not in Visio Online at least as standalone product!

Besides the examples and the shared diagrams available you can browse also through the collection of Azure Architectures.

To summarize, here are the improvements suggested:
- save and modify directly the examples;
- automatic alignment of resources within the canvas;
- copy pasting whole diagrams or parts of them;
- make each value editable;
- group several resources together (e.g. in a layer);
- adding more graphical content;
- import new icons;
- support resources from other important software vendors (e.g. AWS, Oracle, Adobe, etc.);
- define further links (e.g. to the actual environments);
- define further properties (e.g. environment names);
- change the views for the whole diagram;
- integration into Visio Online.

29 March 2021

Notes: Team Data Science Process (TDSP)

Team Data Science Process (TDSP)

an agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently [1]
{goal} help customers fully realize the benefits of their analytics program [1]
{component} data science lifecycle definition
- {description} a framework to structure the development of data science projects [1]
- {goal} designed for data science projects that ship as part of intelligent applications that deploy ML & AI models for predictive analytics [1]
- {benefit} can be used in the context of other DM methodologies as they have common ground [1]
  - e.g. CRISP-DM, KDD
- {benefit} exploratory data science projects or improvised analytics projects can also benefit from using this process [1]
{component} standardized project structure
- {description} a directory structure that includes templates for project documents
  - ⇒makes it easy for team members to find information [1]
  - ⇐templates for the folder structure and required documents are provided in standard locations [1]
  - all code and documents are stored in an agile VCS tracking repository [1]
    - {recommendation} create a separate repository for each project on the VCS for versioning, information security, and collaboration [1]
- {benefit} organizes the code for the various activities [1]
- {benefit} allows tracking the progress [1]
- {benefit} provides checklist with key questions for each project to guarantee process and deliverables’ quality [1]
- {benefit} enables team collaboration [1]
- {benefit} allows closer tracking of the code for individual features [1]
- {benefit} enables teams to obtain better cost estimates [1]
- {benefit} helps build institutional knowledge across the organization [1]
{component} recommended infrastructure
- {description} a set of recommendations for the infrastructure and resources needed for analytics and storage [1]
- {benefit} addresses cloud and/or on-premises requirements [1]
- {benefit} enables reproducible analysis [1]
- {benefit} avoids infrastructure duplication [1]
  - ⇒minimizes inconsistencies and unnecessary infrastructure costs [1]
- {tools} tools are provided to provision the shared resources, track them, and allow each team member to connect to those resources securely [1]
- {good practice} create a consistent compute environment [1]
  - ⇐allows team members replicate and validate experiments [1]
{component} recommended tools and utilities
- {description} a set of recommendations for the tools and utilities needed for project’s execution [1]
- {benefit} help lower the barriers and increase the consistency of their adoption [1]
- {benefit} provides an initial set of tools and scripts to jump-start methodology’s adoption [1]
- {benefit} helps automate some of the common tasks in the data science lifecycle [1]
  - e.g. data exploration and baseline modeling [1]
- {benefit} well-defined structure provided for individuals to contribute shared tools and utilities into their team's shared code repository [1]
  - ⇐ resources can then be leveraged by other projects [1]
{phase} 1: business understanding
- {goal} define and document the business problem, its objectives, the needed attributes, and the metric(s) used to determine project’s success
- {goal} identify and document the relevant data sources
- {step} 1.1: define project’s objectives
  - elicit together with the stakeholders the requirements, define and document the problem and its objectives, respectively the metric(s) used to determine project’s success
    - requires a good understanding of the business processes, data and further characteristics
- {step} 1.2: identify data sources
  - identify the attributes and the data sources relevant to the problem under study
- {step} 1.3: define project plan and team*
  - develop a high-level milestone plan and identify the resources needed for executing it
- {tool} project charter
  - standard template that documents the business problem, the scope of the project, the business objectives and metric(s) used to determine project’s success
{phase} 2: data acquisition & understanding
- {goal} prepare the base dataset(s) as needed by the modeling phase into the target repository
- {goal} build the data ETL/ELT architecture and processes needed for provisioning the basis data
- {step} 2.1: ingest data
  - make the required data available for the team in the repository where the analytics operations take place
- {step} 2.2: explore data
  - understand data’s characteristics by leveraging specific tools (visualization, analysis)
  - prepare the data as needed for further processing
- {step} 2.3: set up pipelines
  - build the pipelines needed for data actualization and qualitative assessment [3]
  - set up a process to score new data or refresh the data regularly [3]
- {step} 2.4: feasibility analysis*
  - reevaluate the project to determine whether the value expected is sufficient to continue pursuing it
- {tool} data quality report
  - report that includes data summaries, data mappings, variable ranking, data qualitative assessment(s) and further information [3]
- {tool} solution architecture
  - diagram and/or textual-based description of the data pipeline(s), technical assumptions and further aspects
- {tool} data reports
  - document the structure and statistics of the raw data
- {tool} checkpoint decision
  - decision template document that
    - summarizes the findings of the feasibility analysis step
    - includes a set of choices and recommendations for the next steps
    - serves as basis for the decision on whether to continue or not the project, respectively what the next steps are
{phase} 3: modeling
- {goal} create a machine-learning model that addresses the prediction requirements and that's suitable for production
- {step} 3.1: feature engineering
  - the inclusion, aggregation, and transformation of raw variables to create the features used in the analysis [4]
    - ⇐requires a good understanding of how the features relate to each other and how the ML algorithms use those features [4]
- {step} 3.2: model selection*
  - choose one or more modeling algorithms that address problem’s characteristics the best
- {step} 3.3: model training
  - involves the following steps:
    - split the input data into training and test datasets
    - build the models by using the training dataset
    - evaluate the training and the test data set
    - determine the optimal setup and methods
- {step} 3.4: model evaluation
  - evaluate the performance of the model(s)
- {step} 3.5: feasibility analysis*
  - evaluate the readiness of the models for use into production, respectively on whether they fulfill project’s objectives
- {tool} feature sets
  - describe the features developed for the modeling and how they were generated
  - contains pointers to the code used to generate the features
- {tool} model report
  - a standard, template-based report that provides details on each experiment’s outcomes
  - created for each model tried
- {tool} checkpoint decision
- {tool} model performance metrics
  - e.g. ROC curves or MSE
{phase} 4: deployment
- {goal} deploy the models and the data pipelines to the environment used for final user acceptance
- {step} 4.1: operationalize architecture
  - prepare the models and data pipelines for use into production
  - {best practice} expose the models over an open API interface
    - enables models’ consumption from various applications
  - {best practice} build telemetry and monitoring into the models and the data pipelines [5]
    - helps in monitoring and troubleshooting [5]
- {step} 4.2: deploy solution*
  - deploy the architecture into production
- {tool} status dashboard
  - displays data on system’s health and key metrics
- {tool} model report
  - the report in its final form with deployment information
- {tool} solution architecture
  - the document in its final form
{phase} 5: customer acceptance
- {goal} confirm that project’s objectives were fulfilled and get customer’s acceptance
- {step} 5.1: system validation
  - validate system’s performance and outcomes and confirm that it fulfills customer’s needs
- {step} 5.2: project signoff*
  - finalize and review documentation
  - handover the solution and afferent documentation to customer
  - evaluate the project against the defined objectives and get customer’ signoff
- {tool} exit report
- {tool} technical report
  - contains all the details of the project that are useful for learning about how to operate the system [6]

Acronyms:

Artificial Intelligence (AI)

Cross-Industry Standard Process for Data Mining (CRISP-DM)

Data Mining (DM)

Knowledge Discovery in Databases (KDD)

Team Data Science Process (TDSP)

Version Control System (VCS)

Visual Studio Team Services (VSTS)

Resources:

[1] Microsoft Azure (2020) What is the Team Data Science Process? [source]

[2] Microsoft Azure (2020) The business understanding stage of the Team Data Science Process lifecycle [source]

[3] Microsoft Azure (2020) Data acquisition and understanding stage of the Team Data Science Process [source]

[4] Microsoft Azure (2020) Modeling stage of the Team Data Science Process lifecycle [source]

[5] Microsoft Azure (2020) Deployment stage of the Team Data Science Process lifecycle [source]

[6] Microsoft Azure (2020) Customer acceptance stage of the Team Data Science Process lifecycle [source]

20 March 2021

🧭Business Intelligence: New Technologies, Old Challenges (Part I: An Introduction)

Each important technology has the potential of creating divides between the specialists from a given field. This aspect is more suggestive in the data-driven fields like BI/Analytics or Data Warehousing. The data professionals (engineers, scientists, analysts, developers) skilled only in the new wave of technologies tend to disregard the role played by the former technologies and their role in the data landscape. The argumentation for such behavior is rooted in the belief that a new technology is better and can solve any problem better than previous technologies did. It’s a kind of mirage professionals and customers can easily fall under.

Being bigger, faster, having new functionality, doesn’t make a tool the best choice by default. The choice must be rooted in the problem to be solved and the set of requirements it comes with. Just because a vibratory rammer is a new technology, is faster and has more power in applying pressure, this doesn’t mean that it will replace a hammer. Where a certain type of power is needed the vibratory rammer might be the best tool, while for situations in which a minimum of power and probably more precision is needed, like driving in a nail, then an adequately sized hammer will prove to be a better choice.

A technology is to be used in certain (business/technological) contexts, and even if contexts often overlap, the further details (aka requirements) should lead to the proper use of tools. It’s in a professional’s duties to be able to differentiate between contexts, requirements and the capabilities of the tools appropriate for each context. In this resides partially a professional’s mastery over its field of work and of providing adequate solutions for customers’ needs. Especially in IT, it’s not enough to master the new tools but also have an understanding about preceding tools, usage contexts, capabilities and challenges.

From an historical perspective each tool appeared to fill a demand, and even if maybe it didn’t manage to fill it adequately, the experience obtained can prove to be valuable in one way or another. Otherwise, one risks reinventing the wheel, or more dangerously, repeating the failures of the past. Each new technology seems to provide a deja-vu from this perspective.

Moreover, a new technology provides new opportunities and requires maybe to change our way of thinking in respect to how the technology is used and the processes or techniques associated with it. Knowledge of the past technologies help identifying such opportunities easier. How a tool is used is also a matter of skills, while its appropriate use and adoption implies an inherent learning curve. Having previous experience with similar tools tends to reduce the learning curve considerably, though hands-on learning is still necessary, and appropriate learning materials or tutoring is upon case needed for a smoother transition.

In what concerns the implementation of mature technologies, most of the challenges were seldom the technologies themselves but of non-technical nature, ranging from the poor understanding/knowledge about the tools, their role and the implications they have for an organization, to an organization’s maturity in leading projects. Even the most-advanced technology can fail in the hands of non-experts. Experience can’t be judged based only on the years spent in the field or the number of projects one worked on, but on the understanding acquired about implementation and usage’s challenges. These latter aspects seem to be widely ignored, even if it can make the difference between success and failure in a technology’s implementation.

Ultimately, each technology is appropriate in certain contexts and a new technology doesn’t necessarily make another obsolete, at least not until the old contexts become obsolete.

Previous Post <<||>>Next Post

31 December 2020

📊Graphical Representation: Graphics We Live by (Part V: Pie Charts in MS Excel)

Graphical Representation Series

From business dashboards to newspapers and other forms of content that capture the attention of average readers, pie charts seem to be one of the most used forms of graphical representation. Unfortunately, their characteristics make them inappropriate for displaying certain types of data, and of being misused. Therefore, there are many voices who advice against using them for any form of display.

It’s hard to agree with radical statements like ‘avoid (using) pie charts’ or ’pie charts are bad’. Each form of graphical representation (aka graphical tool, graphic) has advantages and disadvantages, which makes it appropriate or inappropriate for displaying data having certain characteristics. In addition, each tool can be easily misused, especially when basic representational practices are ignored. Avoiding one representational tool doesn’t mean that the use of another tool will be correct. Therefore, it’s important to make people aware of these aspects and let them decide which tools they should use.

From a graphical tool is expected to represent and describe a dataset in a small area without distorting the reality, while encouraging the reader to compare the different pieces of information, when possible at different levels of details [1] or how they change over time. As form of communication, they encode information and meaning; the reader needs to be able to read, understand and think critically about graphics and data – what is known as graphical/data literacy.

A pie chart consists of a circle split into wedge-shaped slices (aka edges, segments), each slice representing a group or category (aka component). It resembles with the spokes of a wheel, however with a few exceptions they are seldom equidistant. The size of each slice is proportional to the percentage of the component when compared to the whole. Therefore, pie charts are ideal when displaying percentages or values that can be converted into percentages. Thus, the percentages must sum up to 100% (at least that’s readers’ expectation).

Within or besides the slices are displayed components’ name and sometimes the percentages or other numeric or textual information associated with them (Fig. 1-4). The percentages become important when the slices seem to be of equal sizes. As long the slices have the same radius, comparison of the different components resumes in comparing arcs of circles or the chords defined by them, thing not always straightforward. 3-dimensional displays can upon case make the comparison more difficult.

The comparison increases in difficulty with the number of slices increases beyond a certain number. Therefore, it’s not recommended displaying more than 5-10 components within the same chart. If the components exceed this limit, the exceeding components can be summed up within an “other” component.

Within a graphic one needs a reference point that can be used as starting point for exploration. Typically for categorical data this reference point is the biggest or the smallest value, the other values being sorted in ascending, respectively descending order, fact that facilitates comparing the values. For pie charts, this would mean sorting the slices based on their sizes, except the slice for “others” which is typically considered last.

The slices can be filled optionally with meaningful colors or (hashing) patterns. When the same color pallet is used, the size can be reflected in colors’ hue, however this can generate confusion when not applied adequately. It’s recommended to provide further (textual) information when the graphical elements can lead to misinterpretations.

Pie charts can be used occasionally for comparing the changes of the same components between different points in time, geographies (Fig. 5-6) or other types of segmentation. Having the charts displayed besides each other and marking each component with a characteristic color or pattern facilitate the comparison.

Previous Post <<||>> Next Post

27 November 2020

🧊Data Warehousing: ETL (Part II: An Introduction)

ETL (Extract, Transform, Load) processes, technologies or tools are about extracting data from one or more data sources via a set of queries, performing changes on the data via conversions, aggregations, mappings or other types of transformations, respectively loading the data into target tables or other type of repositories. Thus, an ETL process allows moving and transforming data between predefined data structures on an ad-hoc basis or as part of stable repetitive processes, which makes ETL ideal for data warehousing, data integrations, data migrations or similar scenarios.

Extract: The extraction of data is done typically based on SQL queries from relational databases or any OLEDB or ODBC-based data repositories including flat or MS Office files, though modern ETL tools can support other type of queries (CAML, XQuery, DAX) or even NoSQL architectures (Handoop). This allows addressing a wide range of requirements, the complexity of the logic depending on the functionality provided by the query languages, respectively the extraction functionality available.

Transform: The transformation logic can be implemented based on the functionality provided by the ETL tool, and can involve after case any combination of aggregates, conditional splits, merges, lookups, multicasts, pivoting/unpivoting, cleansing, data conversions, sampling, mapping or any other transformations that can be performed on an in-transit dataset. On the other side, quite often the same can be achieved with the help of SQL-based manipulations directly in the extraction logic or later in the process. SQL can prove to be occasionally faster and more flexible than the transformations provided by the ETL tool, however despite the overlaps, the two approaches can complement each other when used adequately.

Load: The load is usually just a dump of the data into one or more final or intermediary tables with predefined structures. Unless the data don’t match the data type, format or further defined constraints, the load seldom involve further challenges as long the solution was designed adequately.

Within the logical model, extract, transform and load can be considered as process by themselves. Within the object model provided by the ETL tool, they are considered in the mentioned sequence within a data flow, which within a set of workflow constraints defines how the data move through the pipeline – the sequence of processing steps considered. The basic unit of work is the data flow and the workflow it belongs to, unit that can be encapsulated in one container for easier management or simply convenience. Several containers can be linked within a workflow to create more complex behavior.

The data flows and workflow constraints, together with the supporting connections and containers form an ETL package, the main unit of work for encapsulating and running ETL logic. ETL packages are scheduled and run as fit for the purpose.

With the right design, these building blocks allow enough flexibility in handling ad-hoc requests or of building complex solutions. This involves decisions on how to partition the ETL packages, respectively the data flows, in which order they should be run, where and in which sequence the data should be transformed, how to handle exceptions, how to build eventually intermediary data repositories, how to handles audit requirements, and so on. Each of these choices can prove to be important.

The knowledge of the ETL architecture and functionality is quintessential in providing the right solution for the problem considered, however once the basics were understood the challenges typically reside in understanding the source and/or target structures, the logical and physical entities available, identify the way the data can be partitioned horizontally or vertically, respectively what type of transformations are required for moving the data, as required by the solution.

Previous Post <<||>> Next Post

30 October 2020

Data Science: Data Strategy (Part II: Generalists vs Specialists in the Field)

Division of labor favorizes the tasks done repeatedly, where knowledge of the broader processes is not needed, where aspects as creativity are needed only at a small scale. Division invaded the IT domains as tools, methodologies and demands increased in complexity, and therefore Data Science and BI/Analytics make no exception from this.

The scale of this development gains sometimes humorous expectations or misbelieves when one hears headhunters asking potential candidates whether they are upfront or backend experts when a good understanding of both aspects is needed for providing adequate results. The development gains tragicomical implications when one is limited in action only to a given area despite the extended expertise, or when a generalist seems to step on the feet of specialists, sometimes from the right entitled reasons.

Headhunters’ behavior is rooted maybe in the poor understanding of the domain of expertise and implications of the job descriptions. It’s hard to understand how people sustain of having knowledge about a domain just because they heard the words flying around and got some glimpse of the connotations associated with the words. Unfortunately, this is extended to management and further in the business environment, with all the implications deriving from it.

As Data Science finds itself at the intersection between Artificial Intelligence, Data Mining, Machine Learning, Neurocomputing, Pattern Recognition, Statistics and Data Processing, the center of gravity is hard to determine. One way of dealing with the unknown is requiring candidates to have a few years of trackable experience in the respective fields or in the use of a few tools considered as important in the respective domains. Of course, the usage of tools and techniques is important, though it’s a big difference between using a tool and understanding the how, when, why, where, in which ways and by what means a tool can be used effectively to create value. This can be gained only when one’s exposed to different business scenarios across industries and is a tough thing to demand from a profession found in its baby steps.

Moreover, being a good data scientist involves having a deep insight into the businesses, being able to understand data and the demands associated with data – the various qualitative and quantitative aspects. Seeing the big picture is important in defining, approaching and solving problems. The more one is exposed to different techniques and business scenarios, with right understanding and some problem-solving skillset one can transpose and solve problems across domains. However, the generalist will find his limitations as soon a certain depth is reached, and the collaboration with a specialist is then required. A good collaboration between generalists and specialists is important in complex projects which overreach the boundaries of one person’s knowledge and skillset.

Complexity is addressed when one can focus on the important characteristic of the problem, respectively when the models built can reflect the demands. The most important skillset besides the use of technical tools is the ability to model problems and root the respective problems into data, to elaborate theories and check them against reality.

Complex problems can require specialization in certain fields, though seldom one problem is dependent only on one aspect of the business, as problems occur in overreaching contexts that span sometimes the borders of an organization. In addition, the ability to solve problems seem to be impacted by the diversity of the people involved into the task, sometimes even with backgrounds not directly related to organization’s activity. As in evolution, a team’s diversity is an important factor in achievement and learning, most gain being obtained when knowledge gets shared and harnessed beyond the borders of teams.

Note:
Written as answer to a Medium post on Data Science generalists vs specialists.

25 August 2019

🛡️Information Security: Cybersecurity (Definitions)

"The art of ensuring the existence and continuity of the Information Society of a nation, guaranteeing and protecting, in Cyberspace, its information assets and critical infrastructure." (Claudia Canongia & Raphael Mandarino, "Cybersecurity: The New Challenge of the Information Society", 2012)

"The act of protecting technology, information, and networks from attacks." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"The practice of protecting computers and electronic communication systems as well as the associated information." (Weiss, "Auditing IT Infrastructures for Compliance" 2nd Ed., 2015)

"Cybersecurity deals with damage to, unauthorized use of, exploitation of electronic information and communications systems that ensure confidentiality, integrity and availability." (Sanjukta Pookulangara, "Cybersecurity: What Matters to Consumers - An Exploratory Study", 2016)

"Focuses on protecting computers, networks, programs and data from unintended or unauthorized access, change or destruction." (Kimberly Lukin, "Russian Cyberwarfare Taxonomy and Cybersecurity Contradictions between Russia and EU", 2016)

"The activity or process, ability or capability, or state whereby information and communications systems and the information contained therein are protected from and/or defended against damage, unauthorized use or modification, or exploitation." (Olivera Injac & Ramo Šendelj, "National Security Policy and Strategy and Cyber Security Risks", 2016)

"The ability to protect against the unauthorized use of electronic data and malicious activity. This electronic data can be personal customer information such as names, addresses, social security numbers, credit cards, and debit cards, to name a few." (Brittany Bullard, "Style and Statistics", 2016)

"A trustworthiness property concerned with the protection of systems from cyberattacks." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)

"Information security (infosec) but broadly referring to technology and human systems that are built around the secure exchange, storage, and management of information." (Shalin Hai-Jew, "Safe Distances: Online and RL Hyper-Personal Relationships as Potential Attack Surfaces", 2018)

"Is defined as the collection of tools, policies, security concepts, security safeguards, guidelines, risk management approaches, actions, training, best practices, assurance and technologies that can be used to protect the cyber environment, organization, and user assets." (Thokozani I Nzimakwe, "Government's Dynamic Approach to Addressing Challenges of Cybersecurity in South Africa", 2018)

"Protection against criminal access to one’s data and information and against criminal manipulation of computer networks/data/systems." (Shalin Hai-Jew, "Beware!: A Multimodal Analysis of Cautionary Tales in Strategic Cybersecurity Messaging Online", 2018)

"The collection of tools, policies, security concepts, security safeguards, guidelines, risk management approaches, actions, training, best practices, assurance, and technologies that can be used to protect the cyberspace environment and organization and users’ assets." (William Stallings, "Effective Cybersecurity: A Guide to Using Best Practices and Standards", 2018)

"The organization and collection of resources, processes, and structures used to protect cyberspace from occurrences that misalign de jure from de facto property rights." (Mika Westerlund et al, "A Three-Vector Approach to Blind Spots in Cybersecurity", 2018)

"A computing-based discipline involving technology, people, information, and processes to enable assured operations. It involves the creation, operation, analysis, and testing of secure computer systems. It is an interdisciplinary course of study, including aspects of law, policy, human factors, ethics, and risk management in the context of adversaries." (Matt Bishop et al, "Cybersecurity Curricular Guidelines", 2019)

"Acts taken, technologies created and deployed, policies written and enacted, to protect computer systems and networks against misuse, intrusion, and exploitation." (Shalin Hai-Jew, "The Electronic Hive Mind and Cybersecurity: Mass-Scale Human Cognitive Limits to Explain the “Weakest Link” in Cybersecurity", 2019)

"Also known as computer security or IT security, is the protection of computer systems from the theft or damage to the hardware, software or the information on them, as well as from disruption or misdirection of the services they provide." (Soraya Sedkaoui, "Big Data Analytics for Entrepreneurial Success", 2019)

"Includes process, procedures, technologies, and controls designed to protect systems, networks, and data." (Sandra Blanke et al, "How Can a Cybersecurity Student Become a Cybersecurity Professional and Succeed in a Cybersecurity Career?", 2019)

"The protection of computer systems from theft and damage to their assets and from manipulation and distraction of their services." (Viacheslav Izosimov & Martin Törngren, "Security Awareness in the Internet of Everything", 2019)

"The protection of internet-connected systems including hardware, software, and data from cyberattacks." (Semra Birgün & Zeynep Altan, "A Managerial Perspective for the Software Development Process: Achieving Software Product Quality by the Theory of Constraints", 2019)

"Cybersecurity is seen where security alerts and cyber-attacks are becoming more frequent and malicious, these threats include private access attempts and exploitation software or phishing, malware, web application attacks, and network penetration." (Theunis G Pelser & Garth Gaffley, "Implications of Digital Transformation on the Strategy Development Process for Business Leaders", 2020)

"Is the protection of internet-connected systems, including hardware, software and data, from cyberattacks. In a computing context, security comprises cybersecurity and physical security - both are used by enterprises to protect against unauthorized access to data centers and other computerized systems." (Alexander A Filatov, "Sovereign Bureaucrats vs. Global Tech Companies: Ethical and Regulatory Challenges", 2020)

"It is a general term which describes technologies, processes, methods, and practices for the purpose of protection of internet-connected information systems from attacks, i.e., cyberattacks. Cybersecurity can refer to security of data, software or hardware within information systems." (Ana Gavrovska & Andreja Samčović, "Intelligent Automation Using Machine and Deep Learning in Cybersecurity of Industrial IoT: CCTV Security and DDoS Attack Detection", 2020)

"Cybersecurity is an act to protect data, devices, applications, servers, network from the malicious attack through various tools and techniques. The process also ensures the confidentiality, integrity, availability, and non-repudiation of the content." (Shafali Agarwal, "Preserving Information Security Using Fractal-Based Cryptosystem", 2021)

"Cybersecurity refers to the set of technologies, processes, and practices designed to safeguard networks, devices, programs, and data from attack, threats, or unauthorized access." (Sanjeev Rao et al, "Online Social Networks Misuse, Cyber Crimes, and Counter Mechanisms", 2021)

"It is the organization and collection of resources, processes, and structures used to protect cyberspace from security events." (Carlos A M S Teles et al, "A Black-Box Framework for Malicious Traffic Detection in ICT Environments", Handbook of Research on Cyber Crime and Information Privacy, 2021)

"Prevention of damage to, protection of, and restoration of computers, electronic communications systems, electronic communications services, wire communication, and electronic communication, including information contained therein, to ensure its availability, integrity, authentication, confidentiality, and nonrepudiation." (CNSSI 4009-2015)

"The ability to protect or defend the use of cyberspace from cyber attacks." (NISTIR 8170)

"The prevention of damage to, unauthorized use of, exploitation of, and - if needed - the restoration of electronic information and communications systems, and the information they contain, in order to strengthen the confidentiality, integrity and availability of these systems." (NISTIR 8074 Vol. 2)

"The process of protecting information by preventing, detecting, and responding to attacks." (NISTIR 8183)

10 July 2019

🧱IT: Product Information Management [PIM] (Definitions)

"The management of product master data, usually via a PIM hub, to avail a single version of the truth about product data to the business." (Evan Levy & Jill Dyché, "Customer Data Integration", 2006)

"MDM Systems that focus exclusively on managing the descriptions of products are also call PIM systems." (Martin Oberhofer et al, "Enterprise Master Data Management", 2008)

"Processes and technologies focused on centrally managing information about products, with a focus on the data required to market and sell the products through one or more distribution channels. A central set of product data can be used to feed consistent, accurate, and up-to-date information to multiple output media such as websites, print catalogs, ERP systems, and electronic data feeds to trading partners. PIM systems generally need to support multiple geographic locations, multilingual data, and maintenance and modification of product information within a centralized catalog to provide consistently accurate information to multiple channels in a cost-effective manner." (Janice M Roehl-Anderson, "IT Best Practices for Financial Managers", 2010)

"Processes and tools used to predict and evaluate success of products through marketing and sales efforts." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Product Information Management (PIM) is the process, techniques and technology of gaining control over a company's product marketing information. The objective of PIM solutions is to remove inefficiency in the marketing supply chain by delivering information to sales channels more quickly and with fewer mistakes." (Digital Asset Management)

"Product information management (PIM) is the process of managing all the information required to market and sell products through distribution channels." (Wikipedia) [source]

"Product information management (PIM) is the software-based orchestration of data dissemination related to a business’s products and its suppliers’ products. PIM coordinates changing product information across all channels of communication, thus ensuring that a business’s entire ecosystem has consistent and up-to-date information." (Informatica)

"The processes and tools for managing product information, including: 1) data centralization and governance; 2) data onboarding from partners; 3) data and content creation and enrichment; and 4) content distribution/syndication." (Forrester)

12 May 2019

#️⃣Software Engineering: Programming (Part XIII: Misconceptions about Programming II)

Continuation

One of the organizational stereotypes is having a big room full of cubicles filled with employees. Even if programmers can work in such settings, improperly designed environments restrict to a certain degree the creativity and productivity, making more difficult employees' collaboration and socialization. Despite having dedicated meeting rooms, an important part of the communication occurs ad-hoc. In open spaces each transient interruption can easily lead inadvertently to loss of concentration, which leads to wasted time, as one needs retaking thoughts’ thread and reviewing the last written code, and occasionally to bugs.

Programming is expected to be a 9 to 5 job with the effective working time of 8 hours. Subtracting the interruptions, the pauses one needs to take, the effective working time decreases to about 6 hours. In other words, to reach 8 hours of effective productivity one needs to work about 10 hours or so. Therefore, unless adequately planned, each project starts with a 20% of overtime. Moreover, even if a task is planned to take 8 hours, given the need of information the allocated time is split over multiple days. The higher the need for further clarifications the higher the chances for effort to expand. In extremis, the effort can double itself.

Spending extensive time in front of the computer can have adverse effects on programmers’ physical and psychical health. Same effect has the time pressure and some of the negative behavior that occurs in working environments. Also, the communication skills can suffer when they are not properly addressed. Unfortunately, few organizations give importance to these aspects, few offer a work free time balance, even if a programmer’s job best fits and requires such approach. What’s even more unfortunate is when organizations ignore the overtime, taking it as part of job’s description. It’s also one of the main reasons why programmers leave, why competent workforce is lost. In the end everyone’s replaceable, however what’s the price one must pay for it?

Trainings are offered typically within running projects as they can be easily billed. Besides the fact that this behavior takes time unnecessarily from a project’s schedule, it can easily make trainings ineffective when the programmers can’t immediately use the new knowledge. Moreover, considering resources that come and go, the unwillingness to invest in programmers can have incalculable effects on an organization performance, respectively on their personal development.

Organizations typically look for self-motivated resources, this request often encompassing organization’s whole motivational strategy. Long projects feel like a marathon in which is difficult to sustain the same rhythm for the whole duration of the project. Managers and team leaders need to work on programmers’ motivation if they need sustained performance. They must act as mentors and leaders altogether, not only to control tasks’ status and rave and storm each time deviations occur. It’s easy to complain about the status quo without doing anything to address the existing issues (challenges).

Especially in dysfunctional teams, programmers believe that management can’t contribute much to project’s technical aspects, while management sees little or no benefit in making developers integrant part of project's decisional process. Moreover, the lack of transparence and communication lead to a wide range of frictions between the various parties.

Probably the most difficult to understand is people’s stubbornness in expecting different behavior by following the same methods and of ignoring the common sense. It’s bewildering the easiness with which people ignore technological and Project Management principles and best practices. It resides in human nature the stubbornness of learning on the hard way despite the warnings of the experienced, however, despite the negative effects there’s often minimal learning in the process...

To be eventually continued…

Previous Post <<||>> Next Post

06 May 2019

🧭Business Intelligence: Key Performance Indicators [KPI] (Part I: An Introduction)

Key Performance Indicators (KPIs) are quantifiable measurements (aka metrics) that reflect the critical success factor of an organization in respect to their strategic goals and objectives. They allow measuring the progress toward reaching the defined goals and, to some degree, forecasting the further evolution. They help keeping the focus on the goals, increases awareness in what concerns the goals and provide visibility into the business.

As they reflect an organization’s objectives, KPIs need to be anchored and aligned with them. If there’s no association with an objective then one doesn’t deal with a KPI but with other form of performance metric. Therefore KPIs need to change with the objectives, they are not fix.

One important requirement for a KPI is to be defined using SMART (specific, measurable, attainable, relevant, time-bound) criteria. Thus a KPI needs to be clear and unambiguous (specific), needs to measure the progress against a goal (measurable), needs to be realistic (attainable), needs to be relevant for the business and its current strategy (relevant), and needs to specify when the result(s) can be achieved (time-bound). To the SMART criteria some consider also the requirement for a KPI to be periodically and consistently evaluated and reviewed (trackable) and agreed between the parties afected by it (agreed).

A KPI needs to be visible within an organization, understandable and non-redundant. Even if KPIs are a tool for the upper management, their definition and impact needs to be visible and understood by all the people working with it, even if this can lead to unexpected behavior. The requirement for non-redundancy implies a partition of the KPIs to limit the cases in which two or more KPIs provide the same information.

A KPI needs to be supported by actions and needs to trigger actions. It’s nice to have KPIs reported periodically to the upper management, though as long no action is triggered, there’s no value in it. A KPI is kind of reinforcement for questions like: “why are we doing good/bad?”. The negative variations must trigger some form of action, however also the positive variation could involve further analysis to understand what caused the improvement.

The variation of a KPI needs to be supported by facts – each variation needs to be explainable in one form or another. A number without a story remains a number that can or not be trusted. Therefore, it might be needed to have further metrics or reports that support the KPIs, that can be used to identify the sources for variation, in order to understand the data.

Last but not the least KPIs need to be documented. The documentation needs to include at minimum a rough definition that includes the rationale, the boundary as well the critical values, metric’s owners, unit of measure, etc. In addition, one can add historical information about the KPI in respect to when and what caused variations, respectively how the variations were brought under control.

KPIs vary from an organization to another, the variation in not only influenced by the different goals organizations might have, but also based on the fact that organizations tend to measure different things, often the wrong things. It’s in general recommended to have a small number of KPIs that reflect in one dasboard how the business is doing and what is important for the business.

KPIs provide a basis for change by providing insights into what needs to change to improve some aspects of the business. When adequately defined and measured, KPIs provide a good perspective over an organization’s effort in achieving its goals and objectives, and therefore a good tool for monitoring and stirring organization’s strategy.

Previous Post <<||>> Next Post

SQL Troubles

Pages

15 July 2025

🤖〽️Prompt Engineering: Copilot Unabridged (Part 53: The Future of Business Intelligence - Will AI Make It Obsolete?)

08 February 2025

🌌🏭KQL Reloaded: First Steps (Part VIII: Translating SQL to KQL - Full Joins)

14 October 2023

💠Azure Tools & Services: The Azure Diagrams Architecture Advisor (Test Drive)

29 March 2021

Notes: Team Data Science Process (TDSP)

20 March 2021

🧭Business Intelligence: New Technologies, Old Challenges (Part I: An Introduction)

31 December 2020

📊Graphical Representation: Graphics We Live by (Part V: Pie Charts in MS Excel)

27 November 2020

🧊Data Warehousing: ETL (Part II: An Introduction)

30 October 2020

Data Science: Data Strategy (Part II: Generalists vs Specialists in the Field)

25 August 2019

🛡️Information Security: Cybersecurity (Definitions)

10 July 2019

🧱IT: Product Information Management [PIM] (Definitions)

12 May 2019

#️⃣Software Engineering: Programming (Part XIII: Misconceptions about Programming II)

06 May 2019

🧭Business Intelligence: Key Performance Indicators [KPI] (Part I: An Introduction)

About Me