01 January 2025

💎🏭SQL Reloaded: SQL Server Metadata (Part I: Indexes Overview)

There are scenarios in which it's useful to gather information about the available indexes and their definitions as a first step for troubleshooting or index maintenance. Moreover, it's useful to take a baseline of the defined indexes and update it accordingly when indexes change. A minimum of information can be gathered in Excel files or similar repositories, though metadata tools can be used for the same purpose, at least when they're easy to use and the associated costs aren't prohibitive. 

Usually, there are two levels at which the information is needed - at index level, respectively at column level. Sometimes it's useful to have an independent query for each level of detail, though in data warehouses and similar use cases it's useful to provide a model on top of the metadata, at least to improve the queries' maintainability.

The first view encapsulates the logic needed to export the data and is based on the sys.indexes, sys.objects, sys.schemas and sys.database_principals system objects. 

-- create schema for metadata
CREATE SCHEMA meta;

-- clean after
DROP VIEW IF EXISTS meta.vIndexes

-- create view
CREATE OR ALTER VIEW meta.vIndexes
AS
-- sys metadata - indexes
SELECT DB_ID() db_id
, DB_NAME() db_name
, OBJ.schema_id
, IND.object_id 
, IND.index_id 
, SCH.name schema_name
, OBJ.name table_name
, IND.name index_name
, IND.type_desc index_type
, IND.is_primary_key
, IND.is_unique_constraint
, IND.fill_factor
, IND.has_filter
, IND.auto_created
, IND.is_unique is_unique_index
, DBP.type principal_type
, DBP.type_desc principal_type_desc
--, INC.*
FROM sys.indexes IND WITH (NOLOCK)
     JOIN sys.objects OBJ WITH (NOLOCK)
       ON IND.object_id = OBJ.object_id
          JOIN sys.schemas SCH WITH (NOLOCK)
            ON OBJ.schema_id = SCH.schema_id
               JOIN sys.database_principals DBP WITH (NOLOCK)
                 ON SCH.principal_id = DBP.principal_id 
WHERE DBP.type IN ('S', 'E')

Depending on the case, the needed information can be exported via a query like the one below:

-- sys metadata - indexes
SELECT IND.db_name
, IND.schema_name
, IND.table_name
, IND.index_name
, IND.index_type
, IND.principal_type
FROM meta.vIndexes IND
WHERE IND.schema_name = 'SalesLT'
  AND IND.table_name IN ('Address', 'Customer')
ORDER BY IND.table_name
, IND.index_name

Output:

db_name schema_name table_name index_name index_type principal_type
AdventureWorks01... SalesLT Address AK_Address_rowguid NONCLUSTERED S
AdventureWorks01... SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion NONCLUSTERED S
AdventureWorks01... SalesLT Address IX_Address_StateProvince NONCLUSTERED S
AdventureWorks01... SalesLT Address PK_Address_AddressID CLUSTERED S
AdventureWorks01... SalesLT Customer AK_Customer_rowguid NONCLUSTERED S
AdventureWorks01... SalesLT Customer IX_Customer_EmailAddress NONCLUSTERED S
AdventureWorks01... SalesLT Customer PK_Customer_CustomerID CLUSTERED S
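
As mentioned above, it's useful to persist a baseline of the index metadata and refresh it periodically. Below is a minimal sketch, assuming a hypothetical meta.IndexBaselines table used to store the snapshots together with a capture date:

-- create the baseline table (hypothetical name) only with the structure
SELECT Cast(GetDate() as date) capture_date
, IND.*
INTO meta.IndexBaselines
FROM meta.vIndexes IND
WHERE 1 = 0;

-- append a new snapshot
INSERT INTO meta.IndexBaselines
SELECT Cast(GetDate() as date) capture_date
, IND.*
FROM meta.vIndexes IND;

Comparing two capture dates reveals then the indexes created, dropped or changed in between.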

Similarly, a second object can be built on top of the above view to provide also the column-related information, by adding the sys.index_columns and sys.columns system objects to the logic:

-- clean after
DROP VIEW IF EXISTS meta.vIndexColumns

-- create 
CREATE OR ALTER VIEW meta.vIndexColumns
AS
-- sys metadata - index columns
SELECT IND.db_id
, IND.schema_id
, INC.object_id 
, INC.index_id 
, INC.index_column_id
, INC.column_id
, IND.db_name
, IND.schema_name
, IND.table_name
, IND.index_name
, COL.name column_name
, INC.key_ordinal
, INC.partition_ordinal
, IND.index_type
, IND.is_primary_key
, IND.is_unique_constraint
, IND.fill_factor
, IND.has_filter
, IND.auto_created
, IND.is_unique_index
, INC.is_descending_key
, INC.is_included_column
, IND.principal_type
, IND.principal_type_desc
FROM sys.index_columns INC WITH (NOLOCK)
     JOIN sys.columns COL WITH (NOLOCK)
       ON INC.object_id = COL.object_id 
      AND INC.column_id = COL.column_id 
     JOIN meta.vIndexes IND WITH (NOLOCK)
       ON INC.object_id = IND.object_id
      AND INC.index_id = IND.index_id

And here's an example of a query based on this view:

-- sys metadata - index columns
SELECT INC.schema_name
, INC.table_name
, INC.index_name
, INC.column_name
, INC.key_ordinal
, INC.index_type
, INC.principal_type
FROM meta.vIndexColumns INC
WHERE INC.schema_name = 'SalesLT'
  AND INC.table_name IN ('Address', 'Customer')
ORDER BY INC.table_name
, INC.index_name
, INC.key_ordinal

Output:
schema_name table_name index_name column_name key_ordinal index_type principal_type
SalesLT Address AK_Address_rowguid rowguid 1 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion AddressLine1 1 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion AddressLine2 2 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion City 3 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion StateProvince 4 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion PostalCode 5 NONCLUSTERED S
SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion CountryRegion 6 NONCLUSTERED S
SalesLT Address IX_Address_StateProvince StateProvince 1 NONCLUSTERED S
SalesLT Address PK_Address_AddressID AddressID 1 CLUSTERED S
SalesLT Customer AK_Customer_rowguid rowguid 1 NONCLUSTERED S
SalesLT Customer IX_Customer_EmailAddress EmailAddress 1 NONCLUSTERED S
SalesLT Customer PK_Customer_CustomerID CustomerID 1 CLUSTERED S

Notes:
1) As a DBA it's useful to take a baseline of the defined indexes and reevaluate their usefulness over time (see the sketch after these notes). This is one of the checks that should be done when one becomes responsible for the administration of a database server, independently of the vendor.
2) The definitions of the views can be extended as needed, though one should try to keep the overall complexity to a minimum. 
3) There are voices against the use of NOLOCK. Feel free to change the objects accordingly!
4) It's useful to work in a dedicated schema (e.g. meta) and use a naming convention that deviates slightly from the one defined by Microsoft, so that no confusion with the system objects arises.
5) Azure SQL takes over many of the responsibilities for index maintenance. Even if indexes are managed automatically by the system, a baseline is still needed, at least to evaluate the functionality's performance, respectively the changes that occurred in the environment. 
6) Except for attributes added over time for specific functionality, the queries should work starting with SQL Server 2005 onward.
7) See also the notes on clustered vs nonclustered indexes.
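
For the reevaluation mentioned in the first note, the view can be joined with the sys.dm_db_index_usage_stats DMV. A minimal sketch, keeping in mind that the usage counters are reset when the server restarts:

-- index usage since the last restart
SELECT IND.schema_name
, IND.table_name
, IND.index_name
, IND.index_type
, USG.user_seeks
, USG.user_scans
, USG.user_lookups
, USG.user_updates
, USG.last_user_seek
, USG.last_user_scan
FROM meta.vIndexes IND
     LEFT JOIN sys.dm_db_index_usage_stats USG
       ON IND.object_id = USG.object_id
      AND IND.index_id = USG.index_id
      AND USG.database_id = DB_ID()
ORDER BY IND.table_name
, IND.index_name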

Happy coding!

Previous Post <<||>> Next Post

25 December 2024

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part II: Under the Hood) [new feature]

The first tests performed with Microsoft Fabric's SQL databases (see previous post) revealed that, with small exceptions, many of the objects and administration scripts I built for SQL Server 2005 up to 2022 worked so far with minimal changes, though the tests focused only on the standard basic functionality - the use of user-defined functions, stored procedures, views, respectively DMVs. As usual, the devil probably hides in the details.

First of all, one can easily check the current database's configuration settings via sys.configurations:

-- database configuration values
SELECT cnf.configuration_id
, cnf.name
, cnf.description
, cnf.value 
, cnf.minimum 
, cnf.maximum 
, cnf.value_in_use 
, cnf.is_dynamic 
, cnf.is_advanced 
FROM sys.configurations cnf
WHERE value <> 0 
--WHERE name LIKE 'data%'
ORDER BY cnf.name 

One can look also at the database-scoped configuration via the sys.database_scoped_configurations, especially at the records whose value deviates from the default:

-- SQL database - database scoped configuration
SELECT name
, value
, value_for_secondary
, is_value_default 
FROM sys.database_scoped_configurations
WHERE is_value_default <>1 --search for deviations from default
--WHERE name=N'MAXDOP' -- search for individual values
ORDER BY name
Output:
name value value_for_secondary is_value_default
MAXDOP 8 False

Querying the sys.databases revealed that there seems to be only one system database available, namely the master database used to retrieve the definition of the system objects. In what concerns the various properties, it's probably important to check the compatibility level, the collation name, and other settings that apply:

-- database information
SELECT db.database_id
, db.[name] AS database_name
, db.recovery_model_desc 
, db.log_reuse_wait_desc 
, db.[compatibility_level] 
, db.page_verify_option_desc 
, db.collation_name
, db.user_access_desc
, db.is_change_feed_enabled
, db.is_data_lake_replication_enabled
FROM sys.databases AS db
Output:
database_id database_name recovery_model_desc log_reuse_wait_desc compatibility_level page_verify_option_desc collation_name user_access_desc is_change_feed_enabled is_data_lake_replication_enabled
1 master FULL NOTHING 160 CHECKSUM SQL_Latin1_General_CP1_CI_AS MULTI_USER False False
26 AdventureWorks01-... FULL NOTHING 160 CHECKSUM SQL_Latin1_General_CP1_CI_AS MULTI_USER True True

Unfortunately, it's not possible to change a database's collation, though other collations can be used on individual columns [2]. Compared to the standard SQL Server, it's not possible to change the compatibility level to a previous value. It will be interesting to see whether an SQL database will always use the latest version of SQL Server or whether future compatibility levels will be supported in parallel.
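
For example, even if the database collation is fixed, a different collation can still be specified at column level when creating a table. A minimal sketch, based on a hypothetical test table:

-- column-level collation (hypothetical table)
CREATE TABLE dbo.TestCollation (
  Id int NOT NULL
, NameDefault nvarchar(50) -- inherits the database collation
, NameExplicit nvarchar(50) COLLATE Latin1_General_100_CI_AS -- explicit column collation
);

-- clean after
DROP TABLE IF EXISTS dbo.TestCollation;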

There is no tempdb listed, though querying the objects from tempdb directly by using 3-part references allows retrieving their content. For example, the following query retrieves the various database files currently available:

-- SQL database: tempdb files 
SELECT dbf.file_id
, dbf.name file_name
--, dbf.physical_name
, dsp.name file_group
--, type 
, dbf.type_desc file_type
--, dbf.growth growth_kb
, Cast(dbf.growth/128.0  as decimal(18,2)) growth_mb
--, dbf.is_percent_growth
--, dbf.max_size max_size_kb
, Cast(NullIf(dbf.max_size, -1)/128.0  as decimal(18,2)) max_size_mb
--, dbf.size file_size_kb
, Cast(dbf.size/128.0 as decimal(18,2)) file_size_mb
, dbf.state_desc 
, dbf.is_read_only 
FROM tempdb.sys.database_files dbf
     LEFT JOIN tempdb.sys.data_spaces dsp
       ON dbf.data_space_id = dsp.data_space_id
ORDER BY dbf.Name
Output:
file_id file_name file_group file_type growth_mb max_size_mb file_size_mb state_desc is_read_only
1 tempdev PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
11 tempdev10 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
12 tempdev11 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
13 tempdev12 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
14 tempdev13 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
15 tempdev14 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
16 tempdev15 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
17 tempdev16 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
3 tempdev2 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
4 tempdev3 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
5 tempdev4 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
6 tempdev5 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
7 tempdev6 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
8 tempdev7 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
9 tempdev8 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
10 tempdev9 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
2 templog LOG 64.00 216256.00 16.00 ONLINE False

By removing the reference to tempdb from the query, one can get the same information for the current database. For the SalesLT database, only two data files and one log file were created. 

There seems to be no msdb database, which is used primarily by the SQL Server Agent for scheduling alerts and jobs. This implies that the functionality based on it won't work and one needs to find alternatives!

Looking at the feature comparison from [2], the most important limitations are the lack of support for Always Encrypted, auditing, bulk inserts, change data capture (CDC), elastic queries, full-text search, in-memory optimization, ledger, and OPENROWSET. In addition, cross-database three-part name queries are supported via the SQL analytics endpoint [2].
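
Such a cross-database query would thus look similar to the one below, where OtherDatabase is a hypothetical name standing for another database from the same workspace, addressed via its SQL analytics endpoint:

-- cross-database query via three-part names (hypothetical database name)
SELECT TOP 100 ITM.*
FROM OtherDatabase.SalesLT.Product ITM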

Until now, the most annoying fact is that in the web UI the results are returned in different "pages", which makes it a bit more challenging to navigate the output. However, using a cursor to iterate through the results and saving the content to a table solves the problem (see link). The last query from the post was used to retrieve the system objects together with the number of records returned. 
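
A minimal sketch of such a cursor, which iterates over the user tables and saves the record counts into a temporary table (the table and cursor names are illustrative):

-- record counts per table via a cursor (sketch)
DECLARE @table nvarchar(261)
DECLARE @sql nvarchar(max)

DROP TABLE IF EXISTS #counts;
CREATE TABLE #counts (
  table_name nvarchar(261)
, row_count bigint);

DECLARE TAB CURSOR FAST_FORWARD FOR
    SELECT QUOTENAME(SCHEMA_NAME(schema_id)) + '.' + QUOTENAME(name)
    FROM sys.tables
    ORDER BY name;

OPEN TAB;
FETCH NEXT FROM TAB INTO @table;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- count the records of the current table via dynamic SQL
    SET @sql = N'INSERT INTO #counts SELECT N''' + @table + N''', COUNT_BIG(*) FROM ' + @table + N';';
    EXEC sp_executesql @sql;
    FETCH NEXT FROM TAB INTO @table;
END

CLOSE TAB;
DEALLOCATE TAB;

SELECT *
FROM #counts
ORDER BY row_count DESC;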

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Frequently asked questions for SQL database in Microsoft Fabric (preview) [link]
[2] Microsoft Learn (2024) Features comparison: Azure SQL Database and SQL database in Microsoft Fabric (preview) [link]

💎🏭SQL Reloaded: Number of Records IV (via sys.partitions)

To get the exact number of records in a table one can use the COUNT (see post) or the more recent COUNT_BIG function, though for big tables this can be an inefficient operation for the database engine:

-- number of records via COUNT
SELECT count(*) row_count
FROM SalesLT.Product

Moreover, sometimes the operation needs to be repeated for a number of tables, e.g. the dropdown tables in Dynamics 365 for Finance and Operations (D365 F&O). Writing the query as a UNION allows exporting the data as a single table and doing comparisons (e.g. in Excel). The same approach can be used also when multiple columns are used for grouping, though one must account for the additional columns in the other subqueries. However, the more tables are involved, the more difficult it becomes to maintain the query over time. 
 
-- number of records via COUNT for multiple tables
SELECT 'SalesLT.Product' table_name
, count(*) row_count
FROM SalesLT.Product
UNION ALL
SELECT 'SalesLT.ProductDescription' table_name
, count(*) row_count
FROM SalesLT.ProductDescription
UNION ALL
SELECT 'SalesLT.ProductModel' table_name
, count(*) row_count
FROM SalesLT.ProductModel

There are many scenarios in which it's enough to get an approximation of the number of records available in a table, and doing a record count might prove to be too expensive. For a quick and dirty solution one can use the sys.partitions catalog view instead:

-- number of records via DMV for single object
SELECT object_id
, OBJECT_NAME(object_id) object_name
, OBJECT_SCHEMA_NAME(object_id) schema_name
, SUM(Rows) AS row_count
, data_compression_desc AS compression_type
, COUNT(*) partitions_count
FROM sys.partitions 
WHERE index_id < 2 --ignore the partitions from the non-clustered index if any
  AND OBJECT_ID('SalesLT.Product') = object_id
GROUP BY object_id
, data_compression_desc
ORDER BY row_count DESC;

The query is based on the sys.partitions catalog view [1], which contains a row for each partition of all the tables and most types of indexes in the database. The documentation mentions that "rows" indicates the approximate number of rows in the considered partition.

Alternatively, one can bring more tables into the query to extend its range of applicability. 

-- number of records via DMVs
SELECT S.name + '.' + T.name SearchName
, S.Name SchemaName
, T.name TableName
, P.row_count
, P.compression_type
, P.partitions_count
FROM sys.tables T
     LEFT JOIN (
        SELECT object_id
        , SUM(Rows) AS row_count
        , data_compression_desc AS compression_type
        , COUNT(*) partitions_count
        FROM sys.partitions 
        WHERE index_id < 2 --ignore the partitions from the non-clustered index if any
        --AND OBJECT_ID('SalesLT.Product') = object_id
        GROUP BY object_id
        , data_compression_desc
     ) P
    ON T.object_id = P.object_id
     JOIN sys.schemas as S
	   on S.schema_id = T.schema_id
WHERE S.Name = 'SalesLT'
  AND T.Name LIKE 'Product%'
ORDER BY row_count DESC;

The data can be exported regularly to give an idea of how tables' cardinality changes over time. One can find this useful as part of the loading process in data warehouses or other solutions (e.g. data migrations). 

By adding a filter on row_count > 0 one can retrieve only the tables that have records. 

One should consider only the tables in scope, and, if needed, remove the records associated with the system objects (e.g. sys or information_schema, upon case).

 -- constraints to be added in the WHERE clause to remove the records related to system objects
 AND OBJECT_NAME(object_id) NOT LIKE 'sys%'
 AND OBJECT_NAME(object_id) NOT LIKE 'queue_%' 
 AND OBJECT_NAME(object_id) NOT LIKE 'filestream_tombstone%' 

There are also scenarios in which the count is needed only for a subset of the data. It's the case of D365 F&O, in which the number of records is needed by DataAreaId (aka company) or another field. A solution can be built using the sp_MSForEachTable stored procedure (see the last query from this post) and a cursor.
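
A minimal sketch of this approach, in which the counts are grouped by DataAreaId and collected into a temporary table (sp_MSForEachTable is undocumented, and the existence check is needed because not all tables contain a DataAreaId column):

-- record counts by company (DataAreaId) via sp_MSForEachTable
DROP TABLE IF EXISTS #counts;
CREATE TABLE #counts (
  table_name nvarchar(261)
, DataAreaId nvarchar(10)
, row_count bigint);

-- '?' is replaced by the procedure with the current table's name
EXEC sp_MSForEachTable @command1 = N'
IF EXISTS (SELECT 1 FROM sys.columns WHERE object_id = OBJECT_ID(''?'') AND name = ''DataAreaId'')
   INSERT INTO #counts (table_name, DataAreaId, row_count)
   SELECT ''?'', DataAreaId, COUNT_BIG(*)
   FROM ?
   GROUP BY DataAreaId;';

SELECT *
FROM #counts
ORDER BY table_name
, DataAreaId;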

Notes:
The code used in this post is available also in the GitHub repository.

Happy coding and Merry Christmas!

Previous Post <<||>> Next Post

Resources:
[1] Microsoft Learn (2024) sys.partitions (Transact-SQL) [link]
[2] Microsoft Learn (2024) COUNT_BIG (Transact-SQL) [link]

22 December 2024

#️⃣Software Engineering: Mea Culpa (Part VI: A Look Back)

Software Engineering Series

Looking back at my university years, I'd say that there are three teachers, respectively courses, that made a considerable impact on students' lives. In the second year I learned Category Algebra, which, despite the fact that it reflected past knowledge and the topics were too complex for most of us, provided us with an unprecedented layer of abstraction that showed us that Mathematics is not what we thought it to be!

The second course was related to Complex plane theory, a course in which the dean of the university at that time challenged our way of thinking about relatively basic concepts. There was a big gap between what we thought about Mathematics and what the subject proved to be. The course was taught in a post-university year together with a course on Relativity Theory, in which, even if we didn't understand much about the concepts and theories, it was the first time (except the Graph theory) that we saw Mathematics applied to a broader context. Please don't misunderstand me! There were many other valuable teachers and courses, though these were the three courses that made the most important impact for me!

During those times, we also attended courses on Fortran, Pascal, C++, HTML and even dBase, and, even if each programming language brought something new to the landscape, I can't say they changed how we thought about the world (some of us had similar courses during the lyceum years) and problem solving. That's what, for example, SQL or more generally a database-related course brought, even if I had to wait for the first MOOC courses to appear. Equally important was also Scott E Page's Model Thinking course, which introduced a structured way of thinking about models, with applicability to the theoretical and practical aspects of life.

These are the courses that anybody interested in programming and/or IT should attend! Of course, there are also courses on algorithms, optimization, linear and non-linear programming, and they bring an arsenal of concepts and techniques to think about, though, even if they might have a wide impact, I can't compare them with the courses mentioned above. A course should (ideally) change the way we think about the world to make a sensible difference! The same goes for programming and theoretical concepts too!...

Long after I graduated, I found many books and authors that I wished I had met earlier! Quotable Math reflects some of the writings I found useful, though now it seems already too late for those books to make a considerable impact! Conversely, it's never too late to find new ways to look at life, and this is what some books achieve! This is also a way of evaluating critically what we want to read or what is worth reading!

Of course, there are many courses, books or ideas out there, though if they haven't changed the way you think about life, directly or indirectly, are they worth attending, respectively reading? Conversely, if one hasn't found a new perspective brought by a topic, probably one has barely scratched the surface of the subject, independently of whether we talk here about students or teachers. For some topics, it's probably too much to ask, though pragmatically speaking, that's the intrinsic value of what we learn! 

That's a way to think about life and select the books worth reading! I know, many love reading for the sake of reading, though the value of a book, theory, story or other similar artifact should be judged especially by the impact it has on our way of thinking, respectively on our lives. Just a few ideas that are maybe worth reflecting upon... 

21 December 2024

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part I: Creating a View) [new feature]

At this year's Ignite conference it was announced that SQL databases are now available in Fabric in public preview (see SQL Databases for OLTP scenarios, [1]). To test the functionality, one can import the SalesLT database into a newly created empty database, which makes several tables available:
 
-- tables from SalesLT schema (queries should be run individually)
SELECT TOP 100 * FROM SalesLT.Address
SELECT TOP 100 * FROM SalesLT.Customer
SELECT TOP 100 * FROM SalesLT.CustomerAddress
SELECT TOP 100 * FROM SalesLT.Product ITM 
SELECT TOP 100 * FROM SalesLT.ProductCategory
SELECT TOP 100 * FROM SalesLT.ProductDescription 
SELECT TOP 100 * FROM SalesLT.ProductModel  
SELECT TOP 100 * FROM SalesLT.ProductModelProductDescription 
SELECT TOP 100 * FROM SalesLT.SalesOrderDetail
SELECT TOP 100 * FROM SalesLT.SalesOrderHeader

The schema seems to be slightly different from the schemas used in previous tests made in SQL Server, though with a few minor changes - mainly removing the fields not available - one can create the below view:
 
-- drop the view (cleaning step)
-- DROP VIEW IF EXISTS SalesLT.vProducts 

-- create the view
CREATE OR ALTER VIEW SalesLT.vProducts
-- Products (view) 
AS 
SELECT ITM.ProductID 
, ITM.ProductCategoryID 
, PPS.ParentProductCategoryID 
, ITM.ProductModelID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, PPM.Name ProductModel 
, PPS.Name ProductSubcategory 
, PPC.Name ProductCategory  
, ITM.Color 
, ITM.StandardCost 
, ITM.ListPrice 
, ITM.Size 
, ITM.Weight 
, ITM.SellStartDate 
, ITM.SellEndDate 
, ITM.DiscontinuedDate 
, ITM.ModifiedDate 
FROM SalesLT.Product ITM 
     JOIN SalesLT.ProductModel PPM 
       ON ITM.ProductModelID = PPM.ProductModelID 
     JOIN SalesLT.ProductCategory PPS 
        ON ITM.ProductCategoryID = PPS.ProductCategoryID 
         JOIN SalesLT.ProductCategory PPC 
            ON PPS.ParentProductCategoryID = PPC.ProductCategoryID

-- review the data
SELECT top 100 *
FROM SalesLT.vProducts

In the view, INNER JOINs were used, presuming thus that a matching value exists for each record. It's always a good idea to test the presumptions when creating the queries, and to check from time to time whether something changed. In some cases it's a good idea to always use LEFT JOINs, though this might have an impact on performance and probably other consequences as well.
 
-- check if all models are available
SELECT top 100 ITM.*
FROM SalesLT.Product ITM 
    LEFT JOIN SalesLT.ProductModel PPM 
       ON ITM.ProductModelID = PPM.ProductModelID 
WHERE PPM.ProductModelID IS NULL

-- check if all categories are available
SELECT top 100 ITM.*
FROM SalesLT.Product ITM 
    LEFT JOIN SalesLT.ProductCategory PPS 
        ON ITM.ProductCategoryID = PPS.ProductCategoryID 
WHERE PPS.ProductCategoryID IS NULL

-- check if all parent categories are available
SELECT PPS.*
FROM SalesLT.ProductCategory PPS 
     LEFT JOIN SalesLT.ProductCategory PPC 
       ON PPS.ParentProductCategoryID = PPC.ProductCategoryID
WHERE PPC.ProductCategoryID IS NULL

Because the Product categories have a hierarchical structure, it's a good idea to check the hierarchy as well:
 
-- check the hierarchical structure 
SELECT PPS.ProductCategoryId 
, PPS.ParentProductCategoryId 
, PPS.Name ProductCategory
, PPC.Name ParentProductCategory
FROM SalesLT.ProductCategory PPS 
     LEFT JOIN SalesLT.ProductCategory PPC 
       ON PPS.ParentProductCategoryID = PPC.ProductCategoryID
--WHERE PPC.ProductCategoryID IS NULL
ORDER BY IsNull(PPC.Name, PPS.Name)

This last query can be consolidated into its own view, and the previous view changed accordingly, if needed.
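
A minimal sketch of such a view, with an illustrative name:

-- Product categories hierarchy (view)
CREATE OR ALTER VIEW SalesLT.vProductCategories
AS
SELECT PPS.ProductCategoryId 
, PPS.ParentProductCategoryId 
, PPS.Name ProductCategory
, PPC.Name ParentProductCategory
FROM SalesLT.ProductCategory PPS 
     LEFT JOIN SalesLT.ProductCategory PPC 
       ON PPS.ParentProductCategoryID = PPC.ProductCategoryID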

One can then save all the code as a file. 
Except for some small glitches in the editor, everything went smoothly. 

Notes:
1) One can suppose that many or most of the queries created in the previous versions of SQL Server work also in SQL databases. The future and revised posts on such topics are labelled under sql database.
2) During the various tests I got the following error message when trying to create a table:
"The external policy action 'Microsoft.Sql/Sqlservers/Databases/Schemas/Tables/Create' was denied on the requested resource."
At least in my case all I had to do was to select "SQL Database" instead of "SQL analytics endpoint" in the web editor. Check the top right dropdown below your user information.
3) For a full list of the available features see [2].

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) SQL database in Microsoft Fabric (Preview) [link]
[2] Microsoft Learn (2024) Features comparison: Azure SQL Database and SQL database in Microsoft Fabric (preview) [link]

18 December 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part VI: Data Stores Comparison)

Business Intelligence Series

Microsoft made available a reference guide for the data stores supported for Microsoft Fabric workloads [1], including the new Fabric SQL database (see previous post). Here's the consolidated table followed by a few aspects to consider: 

Area | Lakehouse | Warehouse | Eventhouse | Fabric SQL database | Power BI Datamart
Data volume | Unlimited | Unlimited | Unlimited | 4 TB | Up to 100 GB
Type of data | Unstructured, semi-structured, structured | Structured, semi-structured (JSON) | Unstructured, semi-structured, structured | Structured, semi-structured, unstructured | Structured
Primary developer persona | Data engineer, data scientist | Data warehouse developer, data architect, data engineer, database developer | App developer, data scientist, data engineer | AI developer, App developer, database developer, DB admin | Data scientist, data analyst
Primary dev skill | Spark (Scala, PySpark, Spark SQL, R) | SQL | No code, KQL, SQL | SQL | No code, SQL
Data organized by | Folders and files, databases, and tables | Databases, schemas, and tables | Databases, schemas, and tables | Databases, schemas, tables | Database, tables, queries
Read operations | Spark, T-SQL | T-SQL, Spark* | KQL, T-SQL, Spark | T-SQL | Spark, T-SQL
Write operations | Spark (Scala, PySpark, Spark SQL, R) | T-SQL | KQL, Spark, connector ecosystem | T-SQL | Dataflows, T-SQL
Multi-table transactions | No | Yes | Yes, for multi-table ingestion | Yes, full ACID compliance | No
Primary development interface | Spark notebooks, Spark job definitions | SQL scripts | KQL Queryset, KQL Database | SQL scripts | Power BI
Security | RLS, CLS**, table level (T-SQL), none for Spark | Object level, RLS, CLS, DDL/DML, dynamic data masking | RLS | Object level, RLS, CLS, DDL/DML, dynamic data masking | Built-in RLS editor
Access data via shortcuts | Yes | Yes | Yes | Yes | No
Can be a source for shortcuts | Yes (files and tables) | Yes (tables) | Yes | Yes (tables) | No
Query across items | Yes | Yes | Yes | Yes | No
Advanced analytics | Interface for large-scale data processing, built-in data parallelism, and fault tolerance | Interface for large-scale data processing, built-in data parallelism, and fault tolerance | Time Series native elements, full geo-spatial and query capabilities | T-SQL analytical capabilities, data replicated to delta parquet in OneLake for analytics | Interface for data processing with automated performance tuning
Advanced formatting support | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format | Full indexing for free text and semi-structured data like JSON | Table support for OLTP, JSON, vector, graph, XML, spatial, key-value | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format
Ingestion latency | Available instantly for querying | Available instantly for querying | Queued ingestion, streaming ingestion has a couple of seconds latency | Available instantly for querying | Available instantly for querying

The table can be used as a map of what one needs to know for using each data store, respectively to identify how previous experience can be leveraged, and here I'm referring to the many SQL developers. One must consider also the capabilities and limitations of each storage repository.

However, what I'm missing are some references regarding the performance of data access, especially compared with on-premises workloads. Moreover, the devil hides in the details, therefore one must test thoroughly before committing to any of the above choices. For the newest overview please check the referenced documentation!

For lakehouses, the hardest limitation is the lack of multi-table transactions, though that's understandable given their scope. However, probably the most important aspect is whether they can scale with the volume of reads/writes, as currently the SQL analytics endpoint seems to lag. 

The warehouse seems to be more versatile, though careful attention needs to be given to its design. 

The Eventhouse opens the door to a wide range of time-based scenarios, though it will be interesting to see how developers cope with its lack of functionality in some areas. 

Fabric SQL databases are a new addition, and hopefully they'll allow considering a wide range of OLTP scenarios. 

Power BI datamarts have been in preview for a couple of years.

References:
[1] Microsoft Fabric (2024) Microsoft Fabric decision guide: choose a data store [link]
[2] Reitse's blog (2024) Testing Microsoft Fabric Capacity: Data Warehouse vs Lakehouse Performance [link]

14 December 2024

🧭💹Business Intelligence: Perspectives (Part XXI: Data Visualization Revised)

Data Visualization Series

Creating data visualizations nowadays has become so easy that anybody can do it with a minimum of effort and knowledge, which on one side is great for the creators, but can easily become a nightmare for the readers, respectively users. Just dumping data in visuals can barely be called data visualization, even if the result is considered as such. The problems of visualization are multiple - the lack of data culture, the lack of understanding of processes, data and their characteristics, the lack of ability to define and model problems, the lack of educating the users, the lack of managing the expectations, etc.

There are many books on data visualization, though they seem an expensive commodity for the ones who want rapid enlightenment, and often the illusion of knowing proves maybe to be a barrier. It's also true that many sets of data are so dull that the lack of information and meaning is compensated by adding elements that give a kitsch look-and-feel (aka chartjunk), shifting the attention from the valuable elements to decorations. So, how do we overcome the various challenges? 

Probably the most important step when visualizing data is to define the primary purpose of the end product. Is it to inform, to summarize or to navigate the data, to provide different perspectives at macro and micro level, to help discovery, to explore, to sharpen the questions, to make people think, respectively understand, to carry a message, to be artistic or to represent the reality truthfully, or is it maybe just a filler or point of attraction in a textual content?

Clarifying the initial purpose is important because it makes the motives and expectations explicit upfront, allowing to determine the further requirements and characteristics, and maybe to set some limits in what concerns the time spent and the quantitative and/or qualitative criteria upon which the end result should eventually be evaluated. Narrowing down such aspects helps in planning and in the further steps performed. 

Many of the steps are repetitive, and past experience can help reduce the overall effort. Therefore, professionals in the field, driven by intuition and experience, probably don't always need to go through the full extent of the process. Conversely, what is learned and done poorly has high chances of delivering poor quality. 

A visualization can be considered as effective when it serves the intended purpose(s), when it reveals with minimal effort the patterns, issues or facts hidden in the data, when it allows people to explore the data, ask questions and find answers altogether. One can talk also about efficiency, especially when readers can see at a glance the many aspects encoded in the visualization. However, the more the discovery process is dependent on data navigation via filters or other techniques, the more difficult it becomes to talk about efficiency.

Better criteria for judging visualizations are whether they are meaningful and useful for the readers, whether the readers understood the author's intent and its further intrinsic implications, though multiple characteristics can be associated with these criteria: clarity, specificity, correctness, truthfulness, appropriateness, simplicity, etc. All these are important to a lower or higher degree, depending on the broader context of the visualization.

All these must be weighted in the bigger picture when creating visualizations, though there are probably also exceptions, especially on the artistic side, where artists can cut corners to create an artistic effect, though even here the authors need to be truthful to the data and make sure that their work doesn't distort the facts excessively. Failing to do so might not have an important impact in the short term, though in time the effects can ripple with unexpected consequences.

