SQL Troubles: December 2024

25 December 2024

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part II: Under the Hood) 🆕

The first tests performed with Microsoft Fabric's SQL databases (see previous post) revealed that with small exceptions many of the objects and administration scripts I built for SQL Server 2005 up to 2022 worked until now with minimal changes, though the tests made focused only on the standard basic functionality - the use of user-defined functions, stored procedures, views, respectively DMVs. As usual, the devil probably hides in details.

First of all, one can easily check current database's configuration settings via the sys.configurations:

-- database configuration values
SELECT cnf.configuration_id
, cnf.name
, cnf.description
, cnf.value 
, cnf.minimum 
, cnf.maximum 
, cnf.value_in_use 
, cnf.is_dynamic 
, cnf.is_advanced 
FROM sys.configurations cnf
WHERE value <> 0 
--WHERE name LIKE 'data%'
ORDER BY cnf.name

One can look also at the database-scoped configuration via the sys.database_scoped_configurations, especially at the records whose value deviates from the default:

-- SQL database - database scoped configuration
SELECT name
, value
, value_for_secondary
, is_value_default 
FROM sys.database_scoped_configurations
WHERE is_value_default <>1 --search for deviations from default
--WHERE name=N'MAXDOP' -- search for individual values
ORDER BY name

Output:

name	value	value_for_secondary	is_value_default
MAXDOP	8		False

Querying the sys.databases revealed that there seems to be only one system database available, namely the master database used to retrieve the definition of the system objects. In what concerns the various properties, it's probably important to check the compatibility level, the collation name, and other settings that apply:

-- database information
SELECT db.database_id
, db.[name] AS database_name
, db.recovery_model_desc 
, db.log_reuse_wait_desc 
, db.[compatibility_level] 
, db.page_verify_option_desc 
, db.collation_name
, db.user_access_desc
, db.is_change_feed_enabled
, db.is_data_lake_replication_enabled
FROM sys.databases AS db

Output:

database_id	database_name	recovery_model_desc	log_reuse_wait_desc	compatibility_level	page_verify_option_desc	collation_name	user_access_desc	is_change_feed_enabled	is_data_lake_replication_enabled
1	master	FULL	NOTHING	160	CHECKSUM	SQL_Latin1_General_CP1_CI_AS	MULTI_USER	False	False
26	AdventureWorks01-...	FULL	NOTHING	160	CHECKSUM	SQL_Latin1_General_CP1_CI_AS	MULTI_USER	True	True

Unfortunately, it's not possible to change a database's collation, though other collations on individual columns are supported [2]. Compared to the standard SQL Server, it's not possible to change the compatibility level to a previous value. It will be interesting to see whether an SQL database will use always the latest version of SQL Server or whether future compatibility levels are supported in parallel.

There is no tempdb listed, though querying directly the objects from tempdb by using 3-part references allows to retrieve their content. For example, the following query retrieves the various database files available currently:

-- SQL database: tempdb files 
SELECT dbf.file_id
, dbf.name file_name
--, dbf.physical_name
, dsp.name file_group
--, type 
, dbf.type_desc file_type
--, dbf.growth growth_kb
, Cast(dbf.growth/128.0  as decimal(18,2)) growth_mb
--, dbf.is_percent_growth
--, dbf.max_size max_size_kb
, Cast(NullIf(dbf.max_size, -1)/128.0  as decimal(18,2)) max_size_mb
--, dbf.size file_size_kb
, Cast(dbf.size/128.0 as decimal(18,2)) file_size_mb
, dbf.state_desc 
, dbf.is_read_only 
FROM tempdb.sys.database_files dbf
     LEFT JOIN tempdb.sys.data_spaces dsp
       ON dbf.data_space_id = dsp.data_space_id
ORDER BY dbf.Name

Output:

file_id	file_name	file_group	file_type	growth_mb	max_size_mb	file_size_mb	state_desc	is_read_only
1	tempdev	PRIMARY	ROWS	256.00	32768.00	16.00	ONLINE	False
11	tempdev10	PRIMARY	ROWS	256.00	32768.00	16.00	ONLINE	False
12	tempdev11	PRIMARY	ROWS	256.00	32768.00	16.00	ONLINE	False
13	tempdev12	PRIMARY	ROWS	256.00	32768.00	16.00	ONLINE	False
14	tempdev13	PRIMARY	ROWS	256.00	32768.00	16.00	ONLINE	False
15	tempdev14	PRIMARY	ROWS	256.00	32768.00	16.00	ONLINE	False
16	tempdev15	PRIMARY	ROWS	256.00	32768.00	16.00	ONLINE	False
17	tempdev16	PRIMARY	ROWS	256.00	32768.00	16.00	ONLINE	False
3	tempdev2	PRIMARY	ROWS	256.00	32768.00	16.00	ONLINE	False
4	tempdev3	PRIMARY	ROWS	256.00	32768.00	16.00	ONLINE	False
5	tempdev4	PRIMARY	ROWS	256.00	32768.00	16.00	ONLINE	False
6	tempdev5	PRIMARY	ROWS	256.00	32768.00	16.00	ONLINE	False
7	tempdev6	PRIMARY	ROWS	256.00	32768.00	16.00	ONLINE	False
8	tempdev7	PRIMARY	ROWS	256.00	32768.00	16.00	ONLINE	False
9	tempdev8	PRIMARY	ROWS	256.00	32768.00	16.00	ONLINE	False
10	tempdev9	PRIMARY	ROWS	256.00	32768.00	16.00	ONLINE	False
2	templog		LOG	64.00	216256.00	16.00	ONLINE	False

By removing the reference to tempdb from the query, one can get the same information for the current database. For the SalesLT were created only two database and one log file.

There seems to be no msdb database, used primarily by the SQL Server Agent for scheduling alerts and jobs. This implies that the functionality based on it won't work and one needs to find alternatives!

Looking at the feature comparison from [2], the most important are the lack of support for always encrypted, auditing, bulk inserts, change data capture (CDC), elastic queries, full-text search, in-memory optimization, ledger, OPENROWSET. In addition it supports cross-database three-part name queries via the SQL analytics endpoint [2].

Until now, the most annoying fact is that in the web UI results are returned in different "pages", and thus makes it a bit more challenging to navigate the output. However, using a cursor to iterate through the results and saving the content to a table solves the problem (see link). The last query from the post was used to retrieve the system objects together with the number of records returned.

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Frequently asked questions for SQL database in Microsoft Fabric (preview) [link]
[2] Microsoft Learn (2024) Features comparison: Azure SQL Database and SQL database in Microsoft Fabric (preview) [link]

💎🏭SQL Reloaded: Number of Records IV (via sys.partitions DMV)

To get the exact number of records in a table one can use the COUNT (see post) or the more recent COUNT_BIG function, though for big tables this can be an inefficient operation for the database engine:

-- number of records via COUNT
SELECT count(*) row_count
FROM SalesLT.Product

Moreover, sometimes the operation needs to be repeated for a number of tables, e.g. dropdown tables in Dynamics 365 for Finance and Operations (D365 F&O). Writing the query as a UNION allows to export the data as a single table and do comparisons (e.g. in Excel). The same approach can be used also when multiple columns are used for grouping, though one must account for the additional columns in the other subqueries. However, the more tables are involved, the more difficult it becomes to maintain the query over time.

-- number of records via COUNT for multiple tables
SELECT 'SalesLT.Product' table_name
, count(*) row_count
FROM SalesLT.Product
UNION ALL
SELECT 'SalesLT.ProductDescription' table_name
, count(*) row_count
FROM SalesLT.ProductDescription
UNION ALL
SELECT 'SalesLT.ProductModel' table_name
, count(*) row_count
FROM SalesLT.ProductModel

There are many scenarios in which it's needed to get an approximate of the number of records available in a table and doing a record count might prove to be too expensive. For a quick and dirty solution one can use the sys.partitions DMV instead:

-- number of records via DMV for single object
SELECT object_id
, OBJECT_NAME(object_id) object_name
, OBJECT_SCHEMA_NAME(object_id) schema_name
, SUM(Rows) AS row_count
, data_compression_desc AS compression_type
, COUNT(*) partitions_count
FROM sys.partitions 
WHERE index_id < 2 --ignore the partitions from the non-clustered index if any
  AND OBJECT_ID('SalesLT.Product') = object_id
GROUP BY object_id
, data_compression_desc
ORDER BY row_count DESC;

The query is based on sys.partitions table [1] which contains a row for each partition of all the tables and most types of indexes in the database. The documentation mentions that "rows" indicates the approximate number of rows in the considered partition.

Alternatively, one can bring more tables into the query to extend its range of applicability.

-- number of records via DMVs
SELECT S.name + '.' + T.name SearchName
, S.Name SchemaName
, T.name TableName
, P.row_count
, P.compression_type
, P.partitions_count
FROM sys.tables T
     LEFT JOIN (
        SELECT object_id
        , SUM(Rows) AS row_count
        , data_compression_desc AS compression_type
        , COUNT(*) partitions_count
        FROM sys.partitions 
        WHERE index_id < 2 --ignore the partitions from the non-clustered index if any
        --AND OBJECT_ID('SalesLT.Product') = object_id
        GROUP BY object_id
        , data_compression_desc
     ) P
    ON T.object_id = P.object_id
     JOIN sys.schemas as S
	   on S.schema_id = T.schema_id
WHERE S.Name = 'SalesLT'
  AND T.Name LIKE 'Product%'
ORDER BY row_count DESC;

The data can be exported regularly to give an idea how tables' cardinality changes over time. One can find this useful as part of the loading process in data warehouses or other solutions (e.g. data migrations).

By using a FULL JOIN instead of a LEFT JOIN one can retrieve only the tables that have records.

One should consider only the tables in scope, and eventually remove the records associated with the system objects (e.g. sys or information_schema upon case).

 -- constraints to be added in the WHERE clause to remove the records related to system objects
 AND OBJECT_NAME(object_id) NOT LIKE 'sys%'
 AND OBJECT_NAME(object_id) NOT LIKE 'queue_%' 
 AND OBJECT_NAME(object_id) NOT LIKE 'filestream_tombstone%'

There are also scenarios in which the count is needed only for a subset of the data. It's the case of D365 F&O (in which the number of records is needed by DataAreaId (aka company) or another field. A solution can be built using the sp_MSForEachTable stored procedure (see the last query from this post) and a cursor.

Notes:
1) Unfortunately, in Microsoft Fabric warehouses the sys.partitions.rows is 0 for all user tables and currently also the sp_MSForEachTable stored procedure can't be used to retrieve the number of records for all tables. However, one can create an old-fashioned cursor for iterating though the collection of tables in scope.
2) The code used in this post is available also in the GitHub repository.

Happy coding and Merry Christmas!

Previous Post <<||>> Next Post

Resources:
[1] Microsoft Learn (2024) sys.partitions (Transact-SQL) [link]
[2] Microsoft Learn (2024) COUNT_BIG (Transact-SQL) [link]

22 December 2024

#️⃣Software Engineering: Mea Culpa (Part VI: A Look Back)

Software Engineering Series

Looking back at my university years, I'd say that there are three teachers, respectively courses that made a considerable impact on students' lives. In the second year I learned Category Algebra, which despite the fact that it reflected past knowledge and the topics were too complex for most of us, it provided us with a unprecedented layer of abstraction that showed us that Mathematics is not what we thought it to be!

The second course was related to the Complex plane theory, a course in which, the decan of the university at those times, challenged our way of thinking about relatively basic concepts. It was a big gap between what we thought about Mathematics, and what the subject proved to be. The course was thought in a post-university year together with a course on Relativity Theory, in which even we haven't understood much about the concepts and theories, it was the first time (except the Graph theory), we saw applied Mathematics to a broader context. Please don't misunderstand me! There were many other valuable teachers and courses, though these were the three courses that made the most important impact for me!

During those times, we attended also courses on Fortran, Pascal, C++, HTML and even dBase, and, even if each programming language brought something new in the landscape, I can't say they changed how we thought about the world (some of us had similar courses during the lyceum years) and problem solving. That's what for example SQL or more generally a database-related course brought, even if I had to wait for the first MooC courses to appear. Equally important was also Scott E Page's course on Model Theory, which introduced the model-thinking approach, a structured way of thinking about models, with applicability to the theoretical and practical aspects of life.

These are the courses that anybody interested in programming and/or IT should attend! Of course, there are also courses on algorithms, optimization, linear and non-linear programming, and they bring an arsenal of concepts and techniques to think about, though, even if they might have a wide impact, I can't compare them with the courses mentioned above. A course should (ideally) change the way we think about the world to make a sensible difference! Same goes for programming and theoretical concepts too!...

Long after I graduated, I found many books and authors that I wished I had met earlier! Quotable Math reflects some of the writings I found useful, though now it seems already too late for those books to make a considerable impact! Conversely, it's never too late to find new ways to look at life, and this is what some books achieve! This is also a way of evaluating critically what we want to read or what is worth reading!

Of course, there are many courses, books or ideas out there, though if they haven't changed the way you think about life, directly or indirectly, are they worth attending, respectively reading? Conversely, if one hasn't found a new perspective brought by a topic, probably one barely scratched the surface of the subject, independently if we talk here about students or teachers. For some topics, it's probably too much to ask, though pragmatically talking, that's the intrinsic value of what we learn!

That's a way to think about life and select the books worth reading! I know, many love reading for the sake of reading, though the value of a book, theory, story or other similar artifacts should be judged especially by the impact they have on our way of thinking, respectively on our lives. Just a few ideas that's maybe worth reflective upon...

Previous Post <<||>> Next Post

21 December 2024

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part I: Creating a View) 🆕

At this year's Ignite conference it was announced that SQL databases are available now in Fabric in public preview (see SQL Databases for OLTP scenarios, [1]). To test the functionality one can import the SalesLT database in a newly created empty database, which made available several tables:

-- tables from SalesLT schema (queries should be run individually)
SELECT TOP 100 * FROM SalesLT.Address
SELECT TOP 100 * FROM SalesLT.Customer
SELECT TOP 100 * FROM SalesLT.CustomerAddress
SELECT TOP 100 * FROM SalesLT.Product ITM 
SELECT TOP 100 * FROM SalesLT.ProductCategory
SELECT TOP 100 * FROM SalesLT.ProductDescription 
SELECT TOP 100 * FROM SalesLT.ProductModel  
SELECT TOP 100 * FROM SalesLT.ProductModelProductDescription 
SELECT TOP 100 * FROM SalesLT.SalesOrderDetail
SELECT TOP 100 * FROM SalesLT.SalesOrderHeader

The schema seems to be slightly different than the schemas used in previous tests made in SQL Server, though with a few minor changes - mainly removing the fields not available - one can create the below view:

-- drop the view (cleaning step)

-- DROP VIEW IF EXISTS SalesLT.vProducts

-- create the view
CREATE OR ALTER VIEW SalesLT.vProducts
-- Products (view) 
AS 
SELECT ITM.ProductID 
, ITM.ProductCategoryID 
, PPS.ParentProductCategoryID 
, ITM.ProductModelID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, PPM.Name ProductModel 
, PPS.Name ProductSubcategory 
, PPC.Name ProductCategory  
, ITM.Color 
, ITM.StandardCost 
, ITM.ListPrice 
, ITM.Size 
, ITM.Weight 
, ITM.SellStartDate 
, ITM.SellEndDate 
, ITM.DiscontinuedDate 
, ITM.ModifiedDate 
FROM SalesLT.Product ITM 
     JOIN SalesLT.ProductModel PPM 
       ON ITM.ProductModelID = PPM.ProductModelID 
     JOIN SalesLT.ProductCategory PPS 
        ON ITM.ProductCategoryID = PPS.ProductCategoryID 
         JOIN SalesLT.ProductCategory PPC 
            ON PPS.ParentProductCategoryID = PPC.ProductCategoryID

-- review the data
SELECT top 100 *
FROM SalesLT.vProducts

In the view were used FULL JOINs presuming thus that a value was provided for each record. It's always a good idea to test the presumptions when creating the queries, and eventually check from time to time whether something changed. In some cases it's a good idea to always use LEFT JOINs, though this might have impact on performance and probably other consequences as well.

-- check if all models are available
SELECT top 100 ITM.*
FROM SalesLT.Product ITM 
    LEFT JOIN SalesLT.ProductModel PPM 
       ON ITM.ProductModelID = PPM.ProductModelID 
WHERE PPM.ProductModelID IS NULL

-- check if all models are available
SELECT top 100 ITM.*
FROM SalesLT.Product ITM 
    LEFT JOIN SalesLT.ProductCategory PPS 
        ON ITM.ProductCategoryID = PPS.ProductCategoryID 
WHERE PPS.ProductCategoryID IS NULL

-- check if all categories are available
SELECT PPS.*
FROM SalesLT.ProductCategory PPS 
     LEFT JOIN SalesLT.ProductCategory PPC 
       ON PPS.ParentProductCategoryID = PPC.ProductCategoryID
WHERE PPC.ProductCategoryID IS NULL

Because the Product categories have an hierarchical structure, it's a good idea to check the hierarchy as well:

-- check the hierarchical structure 
SELECT PPS.ProductCategoryId 
, PPS.ParentProductCategoryId 
, PPS.Name ProductCategory
, PPC.Name ParentProductCategory
FROM SalesLT.ProductCategory PPS 
     LEFT JOIN SalesLT.ProductCategory PPC 
       ON PPS.ParentProductCategoryID = PPC.ProductCategoryID
--WHERE PPC.ProductCategoryID IS NULL
ORDER BY IsNull(PPC.Name, PPS.Name)

This last query can be consolidated in its own view and the previous view changed, if needed.

One can then save all the code as a file.

Except some small glitches in the editor, everything went smoothly.

Notes:
1) One can suppose that many or most of the queries created in the previous versions of SQL Server work also in SQL databases. The future and revised posts on such topics are labelled under sql database.
2) During the various tests I got the following error message when trying to create a table:

"The external policy action 'Microsoft.Sql/Sqlservers/Databases/Schemas/Tables/Create' was denied on the requested resource."

At least in my case all I had to do was to select "SQL Database" instead of "SQL analytics endpoint" in the web editor. Check the top right dropdown below your user information.

[3] For a full least of the available features see [2].

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) SQL database in Microsoft Fabric (Preview) [link]
[2] Microsoft Learn (2024) Features comparison: Azure SQL Database and SQL database in Microsoft Fabric (preview) [link]

18 December 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part VII: Data Stores Comparison)

Business Intelligence Series

Microsoft made available a reference guide for the data stores supported for Microsoft Fabric workloads [1], including the new Fabric SQL database (see previous post). Here's the consolidated table followed by a few aspects to consider:

Area	Lakehouse	Warehouse	Eventhouse	Fabric SQL database	Power BI Datamart
Data volume	Unlimited	Unlimited	Unlimited	4 TB	Up to 100 GB
Type of data	Unstructured, semi-structured, structured	Structured, semi-structured (JSON)	Unstructured, semi-structured, structured	Structured, semi-structured, unstructured	Structured
Primary developer persona	Data engineer, data scientist	Data warehouse developer, data architect, data engineer, database developer	App developer, data scientist, data engineer	AI developer, App developer, database developer, DB admin	Data scientist, data analyst
Primary dev skill	Spark (Scala, PySpark, Spark SQL, R)	SQL	No code, KQL, SQL	SQL	No code, SQL
Data organized by	Folders and files, databases, and tables	Databases, schemas, and tables	Databases, schemas, and tables	Databases, schemas, tables	Database, tables, queries
Read operations	Spark, T-SQL	T-SQL, Spark*	KQL, T-SQL, Spark	T-SQL	Spark, T-SQL
Write operations	Spark (Scala, PySpark, Spark SQL, R)	T-SQL	KQL, Spark, connector ecosystem	T-SQL	Dataflows, T-SQL
Multi-table transactions	No	Yes	Yes, for multi-table ingestion	Yes, full ACID compliance	No
Primary development interface	Spark notebooks, Spark job definitions	SQL scripts	KQL Queryset, KQL Database	SQL scripts	Power BI
Security	RLS, CLS**, table level (T-SQL), none for Spark	Object level, RLS, CLS, DDL/DML, dynamic data masking	RLS	Object level, RLS, CLS, DDL/DML, dynamic data masking	Built-in RLS editor
Access data via shortcuts	Yes	Yes	Yes	Yes	No
Can be a source for shortcuts	Yes (files and tables)	Yes (tables)	Yes	Yes (tables)	No
Query across items	Yes	Yes	Yes	Yes	No
Advanced analytics	Interface for large-scale data processing, built-in data parallelism, and fault tolerance	Interface for large-scale data processing, built-in data parallelism, and fault tolerance	Time Series native elements, full geo-spatial and query capabilities	T-SQL analytical capabilities, data replicated to delta parquet in OneLake for analytics	Interface for data processing with automated performance tuning
Advanced formatting support	Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format	Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format	Full indexing for free text and semi-structured data like JSON	Table support for OLTP, JSON, vector, graph, XML, spatial, key-value	Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format
Ingestion latency	Available instantly for querying	Available instantly for querying	Queued ingestion, streaming ingestion has a couple of seconds latency	Available instantly for querying	Available instantly for querying

It can be used as a map for what is needed to know for using each feature, respectively to identify how one can use the previous experience, and here I'm referring to the many SQL developers. One must consider also the capabilities and limitations of each storage repository.

However, what I'm missing is some references regarding the performance for data access, especially compared with on-premise workloads. Moreover, the devil hides in details, therefore one must test thoroughly before committing to any of the above choices. For the newest overview please check the referenced documentation!

For lakehouses, the hardest limitation is the lack of multi-table transactions, though that's understandable given its scope. However, probably the most important aspect is whether it can scale with the volume of reads/writes as currently the SQL endpoint seems to lag.

The warehouse seems to be more versatile, though careful attention needs to be given to its design.

The Eventhouse opens the door to a wide range of time-based scenarios, though it will be interesting how developers cope with its lack of functionality in some areas.

Fabric SQL databases are a new addition, and hopefully they'll allow considering a wide range of OLTP scenarios. Starting with 28th of March 2025, SQL databases will be ON by default and tenant admins must manually turn them OFF before the respective date [3].

Power BI datamarts have been in preview for a couple of years.

Previous Post <<||>> Next Post

References:
[1] Microsoft Fabric (2024) Microsoft Fabric decision guide: choose a data store [link]

[2] Reitse's blog (2024) Testing Microsoft Fabric Capacity: Data Warehouse vs Lakehouse Performance [link]

[3] Microsoft Fabric Update Blog (2025) Extending flexibility: default checkbox changes on tenant settings for SQL database in Fabric [link]

[4] Microsoft Fabric Update Blog (2025) Enhancing SQL database in Fabric: share your feedback and shape the future [link]

[5] Microsoft Fabric Update Blog (2025) Why SQL database in Fabric is the best choice for low-code/no-code Developers [link]

14 December 2024

🧭💹Business Intelligence: Perspectives (Part 21: Data Visualization Revised)

Data Visualization Series

Creating data visualizations nowadays became so easy that anybody can do it with a minimum of effort and knowledge, which on one side is great for the creators but can be easily become a nightmare for the readers, respectively users. Just dumping data in visuals can be barely called data visualization, even if the result is considered as such. The problems of visualization are multiple – the lack of data culture, the lack of understanding processes, data and their characteristics, the lack of being able to define and model problems, the lack of educating the users, the lack of managing the expectations, etc.

There are many books on data visualization though they seem an expensive commodity for the ones who want rapid enlightenment, and often the illusion of knowing proves maybe to be a barrier. It's also true that many sets of data are so dull, that the lack of information and meaning is compensated by adding elements that give a kitsch look-and-feel (aka chartjunk), shifting the attention from the valuable elements to decorations. So, how do we overcome the various challenges?

Probably, the most important step when visualizing data is to define the primary purpose of the end product. Is it to inform, to summarize or to navigate the data, to provide different perspectives at macro and micro level, to help discovery, to explore, to sharpen the questions, to make people think, respectively understand, to carry a message, to be artistic or represent truthfully the reality, or maybe is just a filler or point of attraction in a textual content?

Clarifying the initial purpose is important because it makes upfront the motives and expectations explicit, allowing to determine the further requirements, characteristics, and set maybe some limits in what concern the time spent and the qualitative and/or qualitative criteria upon which the end result should be eventually evaluated. Narrowing down such aspects helps in planning and the further steps performed.

Many of the steps are repetitive and past experience can help reduce the overall effort. Therefore, professionals in the field, driven by intuition and experience probably don't always need to go through the full extent of the process. Conversely, what is learned and done poorly, has high chances of delivering poor quality.

A visualization can be considered as effective when it serves the intended purpose(s), when it reveals with minimal effort the patterns, issues or facts hidden in the data, when it allows people to explore the data, ask questions and find answers altogether. One can talk also about efficiency, especially when readers can see at a glance the many aspects encoded in the visualization. However, the more the discovery process is dependent on data navigation via filters or other techniques, the more difficult it becomes to talk about efficiency.

Better criteria to judge visualizations is whether they are meaningful and useful for the readers, whether the readers understood the authors' intent, the further intrinsic implication, though multiple characteristics can be associated with these criteria: clarity, specificity, correctedness, truthfulness, appropriateness, simplicity, etc. All these are important in lower or higher degree depending on the broader context of the visualization.

All these must be weighted in the bigger picture when creating visualizations, though there are probably also exceptions, especially on the artistic side, where artists can cut corners for creating an artistic effect, though also in here the authors need to be truthful to the data and make sure that their work don't distort excessively the facts. Failing to do so might not have an important impact on the short term considerably, though in time the effects can ripple with unexpected effects.

Previous Post <<||>> Next Post

13 December 2024

🧭💹Business Intelligence: Perspectives (Part 20: From BI to AI)

Business Intelligence Series

No matter how good data visualizations, reports or other forms of BI artifacts are, they only serve a set of purposes for a limited amount of time, limited audience or any other factors that influence their lifespan. Sooner or later the artifacts become thus obsolete, being eventually disabled, archived and/or removed from the infrastructure.

Many artifacts require a considerable number of resources for their creation and maintenance over time. Sometimes the costs can be considerably higher than the benefits brought, especially when the data or the infrastructure are used for a narrow scope, though there can be other components that need to be considered in the bigger picture. Having a report or visualization one can use when needed can have an important impact on the business in correcting issues, sizing opportunities or filling the knowledge gaps.

Even if it’s challenging to quantify the costs associated with the loss of opportunities rooted in the lack of data, respectively information, the amounts can be considerable high, greater even than building a whole BI infrastructure. Organization’s agility in addressing the important gaps can make a considerable difference, at least in theory. Having the resources that can be pulled on demand can give organizations the needed competitive boost. Internal or external resources can be used altogether, though, pragmatically speaking, there will be always a gap between demand and supply of knowledgeable resources.

The gap in BI artefacts can be addressed nowadays by AI-driven tools, which have the theoretical potential of shortening the gap between needs and the availability of solutions, respectively a set of answers that can be used in the process. Of course, the processes of sense-making and discovery are not that simple as we’d like, though it’s a considerable step forward.

Having the possibility of asking questions in natural language and guiding the exploration process to create visualizations and other artifacts using prompt engineering and other AI-enabled methods offers new possibilities and opportunities that at least some organizations started exploring already. This however presumes the existence of an infrastructure on which the needed foundation can be built upon, the knowledge required to bridge the gap, respectively the resources required in the process.

It must be stressed out that the exploration processes may bring no sensible benefits, at least no immediately, and the whole process depends on organizations’ capabilities of identifying and sizing the respective opportunities. Therefore, even if there are recipes for success, each organization must identify what matters and how to use technologies and the available infrastructure to bridge the gap.

Ideally to make progress organizations need besides the financial resources the required skillset, a set of projects that support learning and value creation, respectively the design and execution of a business strategy that addresses the steps ahead. Each of these aspects implies risks and opportunities altogether. It will be a test of maturity for many organizations. It will be interesting to see how many organizations can handle the challenge, respectively how much past successes or failures will weigh in the balance.

AI offers a set of capabilities and opportunities, however the chance of exploring and failing fast is of great importance. AI is an enabler and not a magic wand, no matter what is preached in technical workshops! Even if progress follows an exponential trajectory, it took us more than half of century from the first steps until now and probably many challenges must be still overcome.

The future looks interesting enough to be pursued, though are organizations capable to size the opportunities, respectively to overcome the challenges ahead? Are organizations capable of supporting the effort without neglecting the other priorities?

Previous Post <<||>> Next Post

12 December 2024

🧭💹Business Intelligence: Perspectives (Part 19: Data Visualization between Art, Pragmatism and Kitsch)

Business Intelligence Series

The data visualizations (aka dataviz) presented in the media, especially the ones coming from graphical artists, have the power to help us develop what is called graphical intelligence, graphical culture, graphical sense, etc., though without a tutor-like experience the process is suboptimal because it depends on our ability of identifying what is important and which are the steps needed for decoding and interpreting such work, respectively for integrating their messages in our overall understanding about the world.

When such skillset is lacking, without explicit annotations or other form of support, the reader might misinterpret or fail to observe important visual cues even for simple visualizations, with all the implications deriving from this – a false understanding, and further aspects deriving from it, this being probably the most important aspect to consider. Unfortunately, even the most elaborate work can fail if the reader doesn’t have a basic understanding of all that’s implied in the process.

The books of Willard Brinton, Ana Rogers, Jacques Bertin, William Cleveland, Leland Wilkinson, Stephen Few, Albert Cairo, Soctt Berinato and many others can help the readers build a general understanding of the dataviz process and how data visualizations or simple graphics can be used/misused effectively, though each reader must follow his/her own journey. It’s also true that the basics can be easily learned, though the deeper one dives, the more interesting and nontrivial the journey becomes. Fortunately, the average reader can stick to the basics and many visualizations are simple enough to be understood.

To grasp the full extent of the implications, one can make comparisons with the domain of poetry where the author uses basic constructs like metaphor, comparisons, rhythm and epithets to create, communicate and imprint in reader’s mind old and new meanings, images and feelings altogether. Artistic data visualizations tend to offer similar charge as poetry does, even if the impact might not appeal so much to our artistic sensibility. Though dataviz from this perspective is or at least resembles an art form.

Many people can write verses, though only a fraction can write good meaningful poetry, from which a smaller fraction get poems, respectively even fewer get books published. Conversely, not everything can be expressed in verses unless one finds good metaphors and other aspects that can be leveraged in the process. Same can be said about good dataviz.

One can argue that in dataviz the author can explore and learn especially by failing fast (seeing what works and what doesn’t). One can also innovate, though the creator has probably a limited set of tools and rules for communication. Enabling readers to see the obvious or the hidden in complex visualizations or contexts requires skill and some kind of mastery of the visual form.

Therefore, dataviz must be more pragmatic and show the facts. In art one has the freedom to distort or move things around to create new meanings, while in dataviz it’s important for the meaning to be rooted in 'truth', at least by definition. The more the creator of a dataviz innovates, the higher the chances of being misunderstood. Moreover, readers need to be educated in interpreting the new meanings and get used to their continuous use.

Kitsch is a term applied to art and design that is perceived as naïve imitation to the degree that it becomes a waste of resources even if somebody pays the tag price. There’s a trend in dataviz to add elements to visualizations that don’t bring any intrinsic value – images, colors and other elements can be misused to the degree that the result resembles kitsch, and the overall value of the visualization is diminished considerably.

Previous Post <<||>> Next Post

09 December 2024

🏭🗒️Microsoft Fabric: Microsoft Fabric [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 8-Dec-2024

Microsoft Fabric

{goal}complete (end-to-end) analytics platform [6]

{characteristic} unified

{objective} provides a single, integrated environment for all the organization

{benefit} data professionals and the business users can collaborate on data projects [5] and solutions

{characteristic}serverless SaaS model (aka SaaS-ified)

{objective} provisioned automatically with the tenant [6]
{objective} highly scalable [5]
{objective} cost-effectiveness [5]
{objective} accessible

⇐ from anywhere with an internet connection [5]

{objective} continuous updates

⇐ provided by Microsoft

{objective} continuous maintenance

⇐ provided by Microsoft

provides a set of integrated services that enable to ingest, store, process, and analyze data in a single environment [5]

{objective} secure
{objective} governed

{goal} lake-centric

{characteristic} OneLake-based

all workloads automatically store their data in the OneLake workspace folders [6]
all the data is organized in an intuitive hierarchical namespace [6]
data is automatically indexed [6]
provides a set of features

discovery
MIP labels
lineage
PII scans
sharing
governance
compliance

{characteristic} one copy

available for all computes
all compute engines store their data automatically in OneLake

⇐ the data is stored in a (single) common format

⇐ delta parquet file format

open standards format
the storage format for all tabular data in Microsoft Fabric

⇐ the data is directly accessible by all the engines [6]

⇐ no import/export needed

all compute engines are fully optimized to work with Delta Parquet as their native format [6]
a shared universal security model is enforced across all the engines [6]

{characteristic} open at every tier

{goal} empowering

{characteristic} intuitive
{characteristic} built into M365
{characteristic} insight to action

{goal} AI-powered

{characteristic} Copilot accelerated
{characteristic} ChatGPT enabled
{characteristic} AI-driven insights

complete analytics platform

addresses the needs of all data professionals and business users who target harnessing the value of data

{feature} scales automatically

the system automatically allocates an appropriate number of compute resources based on the job size
the cost is proportional to total resource consumption, rather than size of cluster or number of resources allocated
jobs in general complete faster (and usually, at less overall cost)

⇒ not need to specify cluster sizes

natively supports

Spark
data science
log-analytics
real-time ingestion and messaging
alerting
data pipelines
Power BI reporting
interoperability with third-party services

from other vendors that support the same open

data virtualization mechanisms

{feature} mirroring [notes]
{feature} shortcuts [notes]

allow users to reference data without copying it
{benefit} make other domain data available locally without the need for copying data

{feature} tenant (aka Microsoft Fabric tenant, MF tenant)

a single instance of Fabric for an organization that is aligned with a Microsoft Entra ID
can contain any number of workspaces

{feature} workspaces

{definition} a collection of items that brings together different functionality in a single environment designed for collaboration
associated with a domain [3]

{feature} domains [notes]

{definition} a way of logically grouping together data in an organization that is relevant to a particular area or field [1]
subdomains

a way for fine tuning the logical grouping data under a domain [1]

subdivisions of a domain

Previous Post <<||>> Next Post

Resources:
[1] Microsoft Learn (2023) Administer Microsoft Fabric [link]
[2] Microsoft Learn: Fabric (2024) Governance overview and guidance [link]
[3] Microsoft Learn: Fabric (2023) Fabric domains [link]
[4] Establishing Data Mesh architectural pattern with Domains and OneLake on Microsoft Fabric, by Maheswaran Arunachalam [link]
[5] Microsoft Learn: Fabric (2024) Introduction to end-to-end analytics using Microsoft Fabric [link]
[6] Microsoft Fabric (2024) Fabric Analyst in a Day [course notes]

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
API - Application Programming Interface
M365 - Microsoft 365
MF - Microsoft Fabric
PII - Personal Identification Information
SaaS - software-as-a-service

SQL Troubles

Pages

25 December 2024

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part II: Under the Hood) 🆕

💎🏭SQL Reloaded: Number of Records IV (via sys.partitions DMV)

22 December 2024

#️⃣Software Engineering: Mea Culpa (Part VI: A Look Back)

21 December 2024

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part I: Creating a View) 🆕

18 December 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part VII: Data Stores Comparison)

14 December 2024

🧭💹Business Intelligence: Perspectives (Part 21: Data Visualization Revised)

13 December 2024

🧭💹Business Intelligence: Perspectives (Part 20: From BI to AI)

12 December 2024

🧭💹Business Intelligence: Perspectives (Part 19: Data Visualization between Art, Pragmatism and Kitsch)

09 December 2024

🏭🗒️Microsoft Fabric: Microsoft Fabric [Notes]

About Me