Showing posts with label Metadata. Show all posts
Showing posts with label Metadata. Show all posts

29 March 2024

Data Management: Data (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources. 
Last updated: 29-Mar-2024

Data

  • {definition} raw, unrelated numbers or entries that represent facts, concepts,  events,  and/or associations 
  • categorized by
    • domain
      • {type} transactional data
      • {type} master data
      • {type} configuration data
        • {subtype}hierarchical data
        • {subtype} reference data
        • {subtype} setup data
        • {subtype} policy
      • {type} analytical data
        • {subtype} measurements
        • {subtype} metrics
        • {subtype} 
    • structuredness
      • {type} structured data
      • {type} semi-structured data
      • {type} unstructured data
    • statistical usage as variable
      • {type} categorical data (aka qualitative data)
        • {subtype} nominal data
        • {subtype} ordinal data
        • {subtype} binary data
      • {type} numerical data (aka quantitative data)
        • {subtype} discrete data
        • {subtype} continuous data
    • size
      • {type} small data
      • {type} big data
  • {concept} transactional data
    • {definition} data that describe business transactions and/or events
    • supports the daily operations of an organization
    • commonly refers to data created and updated within operational systems
    • support applications that automated key business processes 
    • usually stored in normalized tables
  • {concept} master data
    • {definition}"data that provides the context for business activity data in the form of common and abstract concepts that relate to the activity" [2]
      • the key business entities on which transaction are executed
    • the dimensions around on which analysis is conducted
      • used to categorize, evaluate and aggregate transactional data
    • can be shared across more than one transactional applications
    • there are master data similar to most organizations, but also master data specific to certain industries 
    • often appear in more than one area within the business
    • represent one version of the truth
    • can be further divided into specialized subsets
    • {concept} master data entity
      • core business entity used in different applications across the organization, together with their associated metadata, attributes, definitions, roles, connections and taxonomies 
      • may be classified within a hierarchy
        • the way they describe, characterize and classify business concepts may actually cross multiple hierarchies in different ways
          • e.g. a party can be an individual, customer, employee, while a customer might be an individual, party or organization
    • do not change as frequent like transactional data
      • less volatile than transactional data 
      • there are master data that don’t change at all 
        • e.g. geographic locations
    • strategic asset of the business 
    • needs to be managed with the same diligence as other strategic assets
  • {concept} metadata 
    • {definition} "data that defines and describes the characteristics of other data, used to improve both business and technical understanding of data and data-related processes" [2]
      • data about data
    • refers to 
      • database schemas for OLAP & OLTP systems
      • XML document schemas
      • report definitions
      • additional database table and column descriptions stored with extended properties or custom tables provided by SQL Server
      • application configuration data
  • {concept} analytical data
    • {definition} data that supports analytical activities 
      • e.g. decision making, reporting queries and analysis 
    • comprises
      • numerical values
      • metrics
      • measurements
    • stored in OLAP repositories 
      • optimized for decision support 
      • enterprise data warehouses
      • departmental data marts
      • within table structures designed to support aggregation, queries and data mining 
  • {concept} hierarchical data 
    • {definition} data that reflects a hierarchy 
      • relationships between data are represented in hierarchies
    • typically appears in analytical applications
    • {concept} hierarchy
      • "a classification structure arranged in levels of detail from the broadest to the most detailed level" [2]
      • {concept} natural hierarchy
        • stem from domain-based attributes
        • represent an intrinsic structure of the dat
          • they are natural for the data
            • e.g. product taxonomy (categories/subcategories)
        • useful for drilling down from a general to a detailed level in order to find reasons, patterns, and problems
          • common way of analyzing data in OLAP applications
          • common way of filtering data in OLTP applications
      • {concept} explicit hierarchy
        • organize data according to business needs
        • entity members can be organized in any way
        • can be ragged
          • the hierarchy can end at different levels
      • {concept} derived hierarchy
        • domain-based attributes form natural hierarchies 
        • relationships between entities must already exist in a model
        • can be recursive
  • {concept} structured data
    • {definition} "data that has a strict metadata defined"
  • {concept} unstructured data 
    • {definition} data that doesn't follow predefined metadata
    • involves all kinds of documents 
    • can appear in a database, in a file, or even in printed material
  • {concept} semi-structured data 
    • {definition} structured data stored within unstructured data,
    • data typically in XML form
      • XML is widely used for data exchange
    • can appear in stand-alone files or as part of a database (as a column in a table)
    • useful when metadata (the schema) changes frequently, or there’s no need for a detailed relational schema
References:
[1] The Art of Service (2017) Master Data Management Course 
[2] DAMA International (2011) "The DAMA Dictionary of Data Management", 

06 February 2024

SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Data Objects Metadata

There seem to be four main sources for learning about the functionality available in the Delta Lake especially in what concerns the SQL dialect used by the Spark SQL: Databricks [1], Delta Lake [2], Azure Databricks [3], respectively the Data Engineering documentation in Microsoft Fabric [4] and the afferent certification material. Unfortunately, the latter focuses more on PySpark. So, until Microsoft addresses the gap, one can consult the other sources, check what's working and built thus the required knowledge for handling the various tasks. 

First of all, it's important to understand which the data objects available in Microsoft Fabric are. Based on [5] I could identify the following hierarchy:


According to the same source [5] the metastore contains all of the metadata that defines data objects in the lakehouse, while the catalog is the highest abstraction in the lakehouse. The database, called also a schema, keeps its standard definition - a collection of data objects, such as tables or views (aka relations) and functions. The tables, views and functions keep their standard definitions. Except the metastore, these are also the (securable) objects on which permissions can be set. 

One can explore the structure in Spark SQL by using the SHOW command:

-- explore the data objects from the lakehouse
SHOW CATALOGS;

SHOW DATABASES;

SHOW SCHEMAS;

SHOW CATALOGS;

SHOW VIEWS;

SHOW TABLES;

SHOW FUNCTIONS;

Moreover, one can list only the objects from an object from the parent (e.g. the tables existing in a database):
 
-- all tables from a database
SHOW TABLES FROM Testing;

-- all tables from a database matching a pattern
SHOW TABLES FROM Testing LIKE 'Asset*';

-- all tables from a database matching multiple patterns
SHOW TABLES FROM Testing LIKE 'Asset*|cit*';

Notes:
1) Same syntax applies for views and functions, respectively for the other objects in respect to their parents (from the hierarchy). 
2) Instead of FROM one can use the IN keyword, though I'm not sure what's the difference given that they seem to return same results.
3) For databases and tables one can use also the SQL Server to export the related metadata. Unfortunately, it's not the case for views and functions because when the respective objects are created in Spark SQL, they aren't visible over the SQL Endpoint, and vice versa.

4) Given that there are multiple environments that deal with delta tables, to minimize the confusion that might result from their use, it makes sense to use database as term instead of schema. 

References:
[1] Databricks (2024) SQL language reference (link)
[2] Delta Lake (2023) Delta Lake documentation (link)
[3] Microsoft Learn (2023) Azure Databricks documentation (link)
[4] Microsoft Learn (2023) Data Engineering documentation in Microsoft Fabric (link)
[5] Databricks (2023) Data objects in the Databricks lakehouse (link)

SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Table Metadata II (Updating a table's COMMENT attributes)

While using the DESCRIBE TABLE metadata I observed that its output shown also a Comment attribute which was NULL for all the columns. Browsing through the Databricks documentation (see [1]), I found that the Comment can be provided in a delta table's definition as follows (code to be run in a notebook):

-- drop the table (if already exists)
DROP TABLE IF EXISTS Assets2;

--create the table
CREATE TABLE Assets2 (
 Id int NOT NULL COMMENT 'Asset UID',
 CreationDate timestamp NOT NULL COMMENT 'Creation Date',
 Vendor string NOT NULL COMMENT 'Vendors name',
 Asset string NOT NULL COMMENT 'Asset type',
 Model string NOT NULL COMMENT 'Asset model',
 Owner string NOT NULL COMMENT 'Current owner',
 Tag string NOT NULL COMMENT 'Assets tag',
 Quantity decimal(13, 2) NOT NULL COMMENT 'Quantity'
) 
USING DELTA
COMMENT 'Vendor assets';

-- show table's definition
DESCRIBE TABLE Assets2;

-- show table's details
DESCRIBE DETAIL Assets;

So, I thought there must be a way to update Assets table's definition as well. And here's the solution split into two steps, as different syntax is required for modifying the columns, respectively the table:

-- modify columns' COMMENT for an existing table
ALTER TABLE Assets ALTER COLUMN ID COMMENT 'Asset UID';
ALTER TABLE Assets ALTER COLUMN CreationDate COMMENT 'Creation Date';
ALTER TABLE Assets ALTER COLUMN Vendor COMMENT 'Vendors name';
ALTER TABLE Assets ALTER COLUMN Asset COMMENT 'Asset type';
ALTER TABLE Assets ALTER COLUMN Model COMMENT 'Asset model';
ALTER TABLE Assets ALTER COLUMN Owner COMMENT 'Current owner';
ALTER TABLE Assets ALTER COLUMN Tag COMMENT 'Assets tag';
ALTER TABLE Assets ALTER COLUMN Quantity COMMENT 'Quantity';

-- show table's definition
DESCRIBE TABLE Assets;

The operation generated a log file with the following content:

{"commitInfo":{"timestamp":1707180621522,"operation":"CHANGE COLUMN","operationParameters":{"column":"{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{\"comment\":\"Quantity\"}}"},"readVersion":13,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"42590d18-e04a-4fb2-bbfa-94414b23fb07"}}
{"metaData":{"id":"83e87b3c-28f4-417f-b4f5-842f6ba6f26d","description":"Vendor assets","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[
{\"name\":\"Id\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset UID\"}},
{\"name\":\"CreationDate\",\"type\":\"timestamp\",\"nullable\":false,\"metadata\":{\"comment\":\"Creation Date\"}},
{\"name\":\"Vendor\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Vendors name\"}},
{\"name\":\"Asset\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset type\"}},
{\"name\":\"Model\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset model\"}},
{\"name\":\"Owner\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Current owner\"}},
{\"name\":\"Tag\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Assets tag\"}},
{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{\"comment\":\"Quantity\"}}]}","partitionColumns":[],"configuration":{},"createdTime":1707149934697}}

And the second step (see [2]):

-- modify a table's COMMENT
COMMENT ON TABLE Assets IS 'Vendor assets';

-- describe table's details
DESCRIBE DETAIL Assets;

    The operation generated a log file with the following content:

{"commitInfo":{"timestamp":1707180808138,"operation":"SET TBLPROPERTIES","operationParameters":{"properties":"{\"comment\":\"Vendor assets\"}"},"readVersion":14,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"21c315b5-a81e-4107-8a19-6256baee7bd5"}}
{"metaData":{"id":"83e87b3c-28f4-417f-b4f5-842f6ba6f26d","description":"Vendor assets","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[
{\"name\":\"Id\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset UID\"}}
,{\"name\":\"CreationDate\",\"type\":\"timestamp\",\"nullable\":false,\"metadata\":{\"comment\":\"Creation Date\"}}
,{\"name\":\"Vendor\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Vendors name\"}}
,{\"name\":\"Asset\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset type\"}}
,{\"name\":\"Model\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset model\"}}
,{\"name\":\"Owner\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Current owner\"}}
,{\"name\":\"Tag\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Assets tag\"}}
,{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{\"comment\":\"Quantity\"}}]}","partitionColumns":[],"configuration":{},"createdTime":1707149934697}}

Notes:
1) Don't forget to remove the Assets2 table!
2) Updating the COMMENT for a single delta table is simple, though for updating the same for a list of tables might be task for a PySpark job. It makes sense to provide the respective metadata from the start, when that's possible. Anyway, the information should be available in lakehouse's data dictionary.
3) One can reset the COMMENT to NULL or to an empty string.

Happy coding!

Resources:
[1] Databaricks (2023) ALTER TABLE (link)
[2] Databaricks (2023) COMMENT ON (link)

SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Table Metadata I (General Information)

In a previous post I've created a delta table called Assets. For troubleshooting and maintenance tasks it would be useful to retrieve a table's metadata - properties, definition, etc. There are different ways to extract the metadata, depending on the layer and method used.

INFORMATION_SCHEMA in SQL Endpoint

Listing all the delta tables available in a Lakehouse can be done using the INFORMATION_SCHEMA via the SQL Endpoint:

-- retrieve the list of tables
SELECT * 
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_SCHEMA  

Output:
TABLE_CATALOG TABLE_SCHEMA TABLE_NAME TABLE_TYPE
Testing dbo city BASE TABLE
Testing dbo assets BASE TABLE

The same schema can be used to list columns' definition:
 
-- retrieve column metadata
SELECT TABLE_CATALOG
, TABLE_SCHEMA
, TABLE_NAME
, COLUMN_NAME
, ORDINAL_POSITION
, DATA_TYPE
, CHARACTER_MAXIMUM_LENGTH
, NUMERIC_PRECISION
, NUMERIC_SCALE
, DATETIME_PRECISION
, CHARACTER_SET_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'assets'
ORDER BY ORDINAL_POSITION

Note:
Given that the INFORMATION_SCHEMA is an ANSI-standard schema for providing metadata about a database's objects (including views, schemata), probably this is the best way to retrieve general metadata as the above (see [3]). More information over the SQL Endpoint can be obtained by querying directly the SQL Server metadata.

Table's Properties

The Delta Lake documentation reveals that a table's properties can be retrieved via the DESCRIBE command, which can be used in a notebook's cell (see [1] for attributes' definition):

-- describe table's details
DESCRIBE DETAIL Assets;

Output (transposed):
Attribute Value
format delta
id 83e87b3c-28f4-417f-b4f5-842f6ba6f26d
name spark_catalog.testing.assets
description NULL
location abfss://[…]@onelake.dfs.fabric.microsoft.com/[…]/Tables/assets
createdAt 2024-02-05T16:18:54Z
lastModified 2024-02-05T16:29:52Z
partitionColumns
numFiles 1
sizeInBytes 3879
properties [object Object]
minReaderVersion 1
minWriterVersion 2
tableFeatures appendOnly,invariants

One can export the table metadata also from the sys.tables via the SQL Endpoint, though will be considered also SQL Server based metadata:

-- table metadata
SELECT t.object_id
, schema_name(t.schema_id) schema_name
, t.name table_name
, t.type_desc
, t.create_date 
, t.modify_date
, t.durability_desc
, t.temporal_type_desc
, t.data_retention_period_unit_desc
FROM sys.tables t
WHERE name = 'assets'

Output (transposed):
Attribute Value
object_id 1264723558
schema_name dbo
table_name assets
type_desc USER_TABLE
create_date 2024-02-05 16:19:04.810
modify_date 2024-02-05 16:19:04.810
durability_desc SCHEMA_AND_DATA
temporal_type_desc NON_TEMPORAL_TABLE
data_retention_period_unit_desc INFINITE

Note:
1) There seem to be thus two different repositories for storing the metadata, thing reflected also in the different timestamps.

Table's Definition 

A table's definition can be easily exported via the SQL Endpoint in SQL Server Management Studio. Multiple tables' definition can be exported as well via the Object explorer details within the same IDE. 

In Spark SQL you can use the DESCRIBE TABLE command:

-- show table's definition
DESCRIBE TABLE Assets;

Output:
col_name data_type comment
Id int NULL
CreationDate timestamp NULL
Vendor string NULL
Asset string NULL
Model string NULL
Owner string NULL
Tag string NULL
Quantity decimal(13,2) NULL

Alternatively, you can use the EXTENDED keyword with the previous command to show further table information:
 
-- show table's definition (extended)
DESCRIBE TABLE EXTENDED Assets;

Output (only the records that appear under '# Detailed Table Information'):
Label Value
Name spark_catalog.testing.assets
Type MANAGED
Location abfss://[...]@onelake.dfs.fabric.microsoft.com/[...]/Tables/assets
Provider delta
Owner trusted-service-user
Table Properties [delta.minReaderVersion=1,delta.minWriterVersion=2]

Note that the table properties can be listed individually via the SHOW TBLPROPERTIES command:
 
--show table properties
SHOW TBLPROPERTIES Assets;

A column's definition can be retrieved via a DESCRIBE command following the syntax:

-- retrieve column metadata
DESCRIBE Assets Assets.CreationDate;

Output:
info_name info_value
col_name CreationDate
data_type timestamp
comment NULL

In PySpark it's trickier to retrieve a table's definition (code adapted after Dennes Torres' presentation at Toboggan Winter edition 2024):

%%pyspark
import pyarrow.dataset as pq
import os
import re

def show_metadata(file_path, table_name):
    #returns a delta table's metadata 

    print(f"\{table_name}:")
    schema_properties = pq.dataset(file_path).schema.metadata
    if schema_properties:
        for key, value in schema_properties.items():
            print(f"{key.decode('utf-8')}:{value.decode('utf-8')}")
    else:
        print("No properties available!")

#main code
path = "/lakehouse/default/Tables"
tables = [f for f in os.listdir(path) if re.match('asset',f)]

for table in tables:
    show_metadata(f"/{path}/"+ table, table)

Output:
\assets:
org.apache.spark.version:3.4.1
org.apache.spark.sql.parquet.row.metadata:{"type":"struct","fields":[
{"name":"Id","type":"integer","nullable":true,"metadata":{}},
{"name":"CreationDate","type":"timestamp","nullable":true,"metadata":{}},
{"name":"Vendor","type":"string","nullable":true,"metadata":{}},
{"name":"Asset","type":"string","nullable":true,"metadata":{}},
{"name":"Model","type":"string","nullable":true,"metadata":{}},
{"name":"Owner","type":"string","nullable":true,"metadata":{}},
{"name":"Tag","type":"string","nullable":true,"metadata":{}},
{"name":"Quantity","type":"decimal(13,2)","nullable":true,"metadata":{}}]}
com.microsoft.parquet.vorder.enabled:true
com.microsoft.parquet.vorder.level:9

Note:
One can list the metadata for all tables by removing the filter on the 'asset' table:

tables = os.listdir(path)

Happy coding!

References:
[1] Delta Lake (2023) Table utility commands (link)
[2] Databricks (2023) DESCRIBE TABLE (link)
[3] Microsoft Learn (2023) System Information Schema Views (link)

04 February 2024

SQL Reloaded: Microsoft Fabric's Delta Tables in Action - CRUD Operations I (Basic Operations)

When building a data-related solution, the most basic functionality for a SQL developer is the CRUD (create, read, update, delete) operations. That's why I decided to test the feature on Lakehouse delta tables, for a first demonstrative post on Microsoft Fabric. 

Unfortunately, at least for the moment, the SQL Endpoint allows only read access to the delta tables, which frankly for an SQL developer is an important limitation. There's however the possibility of running SQL over Spark SQL, which supports a basic dialect of SQL (similar to HiveSQL, though it might lack some features). To run the code, you'll need to create a notebook and you can set the language to Spark SQL for the whole notebook. Make sure, that you have a Lakehouse assigned in the notebook. Of course, you'll need access to Microsoft Fabric.

Create a cell for each of the following blocks of code. To make sure that the code works, it might be a good idea to run the cells individually!

-- drop the test table (if exists already)
DROP TABLE IF EXISTS Assets;

--create the test table
CREATE TABLE Assets(
 Id int NOT NULL,
 CreationDate timestamp NOT NULL,
 Vendor string NOT NULL,
 Asset string NOT NULL,
 Model string NOT NULL,
 Owner string NOT NULL,
 Tag string NOT NULL,
 Quantity decimal(13, 2) NOT NULL
) 
USING DELTA;

-- insert test data
INSERT INTO Assets
VALUES (1, '2024-03-01T00:00:00Z', 'IBM', 'Laptop 1','Model 1','Owner 1','XX0001','1');

-- insert more test records 
INSERT INTO Assets
VALUES ('2', '2024-02-01T00:00:00Z','IBM','Laptop 2','Model 2','Owner 2','XX0001','1')
, ('3', '2024-02-01T00:00:00Z','Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('4', '2024-01-01T00:00:00Z','Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('5', '2024-01-15T00:00:00Z','Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
, ('6', '2024-01-16T00:00:00Z','Dell','Laptop 4','Model 4','Owner 4','DD0001','1');

-- retrieving all the records
SELECT *
FROM Assets
ORDER BY Vendor;

-- updating a record
UPDATE Assets 
SET CreationDate = '2024-01-19T00:00:00Z'
WHERE Id = 6;

-- deleting a record
DELETE FROM Assets
WHERE Id = 2;

-- reviewing the changes
SELECT *
FROM Assets
ORDER BY Vendor;

Comments:
[1] There seems to be no way to provide a schema (see the SQL syntax for Spark), however the dbo schema seems to be used in the background (see table metadata post). Calling a table with the schema in SQL Spark results in an error:
"[TABLE_OR_VIEW_NOT_FOUND] The table or view `<schema>`.`<object_name>` cannot be found. Verify the spelling and correctness of the schema and catalog."

[2] The data types are different from the ones in SQL Server (e.g. timestamp for dates, string instead of varchars). Dates and decimals support precision, while strings translate to a varchar with 8000 characters (see table metadata post).

[3] Implicit conversions seem to occur when the format is correct. Otherwise, an explicit conversion is needed.

[4] Consider using semicolons at the end of each statement. That's mandatory when running multiple statements within the same cell. Otherwise, an error results.

[5] It would be interesting to test what's the performance when doing CRUD operations on large datasets. 

[6] One can use the LIMIT x clause as alternative for TOP x from T-SQL:

-- reviewing the last 3 records
SELECT *
FROM Assets
ORDER BY CreationDate
LIMIT 3;

[7] SELECT FROM VALUES seems to work as well, though the data types must be the same (no implicit conversions occur):

SELECT *
FROM (
VALUES ('1', '2024-03-01T00:00:00Z', 'IBM', 'Laptop 1','Model 1','Owner 1','XX0001','1')
, ('2', '2024-02-01T00:00:00Z','IBM','Laptop 2','Model 2','Owner 2','XX0001','1')
) DAT(Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity);

Therefore, when bringing the two INSERTs together, you'll need to change the first value from the first INSERT for the ID from numeric to string.

You can use the DESCRIBE QUERY command to troubleshoot the differences:

 -- retrieve query's output metadata
 DESCRIBE QUERY 
 VALUES (1, '2024-03-01T00:00:00Z', 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1') 
 AS DAT(Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity)

Output:
col_name data_type comment
Id int NULL
CreationDate string NULL
Vendor string NULL
Asset string NULL
Model string NULL
Owner string NULL
Tag string NULL
Quantity string NULL

Happy coding!

Resources:
[1] Microsoft Learn (2023) How to use Microsoft Fabric notebooks (link)
[2] Apache Spark (2023) SQL syntax (link)

10 October 2023

Base Enums Metadata in Dynamics 365 for Finance and Operations

The list of values that don't have their own tables are managed within the application as (Base) Enums and, naturally, only the numeric values being saved to the database. Even if this is practical for the application, it's a nightmare for the people using the data exported from database as is needed to convert the codes to meaningful values.

Fortunately, there's a way to go around this limitation. First, there's the old obsolete AX 2012 documentation for Base Enums, which documents the Names and Descriptions for the Enum Values (e.g., search for CustVendNegInstStatus). As Microsoft stopped maintaining the documentation, the changes made in D365 are not reflected.

Secondly, the mapping between Values and Names is stored in two of the tables available in D365, the following query allowing to retrieve the mapping:

-- Metadata - Base Enum Values 
SELECT EIT.Id 
, EIT.Name EnumName
, EVT.EnumValue EnumValueId
, EVT.Name EnumValueName 
, ' WHEN ' + Cast(EVT.EnumValue as nvarchar(10)) + ' THEN ''' + EVT.Name + '''' [CASE Statement]
FROM dbo.EnumValueTable EVT
     JOIN dbo.EnumIdTable EIT
       ON EVT.EnumId = EIT.Id
WHERE EIT.Name LIKE 'CustVendNegInstStatus%'
--WHERE EVT.Name LIKE '%Redrawn%'
ORDER BY EIT.Name 
, EVT.EnumValue 

The [Case Statement] column generates already the statement for the CASE. One can search for the Base Enum, respectively for the value itself. 

Unfortunately, the metadata are not complete and, as there's no timestamp, it's not possible to identify what and when changed. Therefore, one needs to regularly check the changes manually, or store a copy of the respective data as baseline, thus being able to compare the latest metadata against the baseline. The solution is not ideal, but it does the trick.

To identify the Base Enum corresponding to a field one can export the script behind a form when the functionality is available. Otherwise, one can use the Table Browser and generate the script from the corresponding source table. 

Occasionally, the value stored in the above tables may deviate slightly from the expectation. It's useful then to search only for a substring, use both sources and even search on the internet for more information.

When building reports or a semantic layer on top of D365 data model, for almost all important tables is needed to retrieve such mappings. It's a good idea to encapsulate the mappings in single "base views" and reuse it when needed, avoiding thus to replicate the mappings for each use of the source table (see the logic for TDM.vEcoResProduct).

The data generated by the above script could be dumped to a table and thus the table joined in the queries when the values for a Base Enum are needed, though I don't recommend such an approach because it would increase the complexity of the queries, already complex in certain scenarios. Also the misuse of views for creating pseudoconstants is not recommended. 

Happy coding!

25 January 2023

SQL Reloaded: Documenting External Tables

When using serverless SQL pool, CETAS (aka Create External Table as Select) are the main mechanism of making data available from the Data Lake for queries. In case is needed to document them, sys.external_tables DMV can be used to export their metadata, much like sys.tables or sys.views can be used for the same:

-- CETAS metadata
SELECT TOP (50) ext.object_id
, ext.name
, schema_name(ext.schema_id) [schema_name]
, ext.type_desc
, ext.location
, ext.data_source_id
, ext.file_format_id
, ext.max_column_id_used
, ext.uses_ansi_nulls
, ext.create_date
, ext.modify_date
FROM sys.external_tables ext

It's interesting that CETAS have type_desc = 'USER_TABLE' in sys.all_objects, same like user-defined tables in SQL Server have:

-- CETAS' metadata via sys.all_objects
SELECT *
FROM sys.all_objects
WHERE object_id = object_id('<schema_name>.<CETAS name>')

The data source and file format can be retrieved via the sys.external_data_sources and sys.external_file_formats DMVs. Moreover, it's useful to include the logic into a view, like the one below:
 
-- drop view
--DROP VIEW IF EXISTS dbo.vAdminExternalTables

-- create view
CREATE VIEW dbo.vAdminExternalTables
AS
-- external tables - metadata 
SELECT ext.object_id
, sch.name + '.' + ext.name [unique_identifier]
, sch.name [schema]
, ext.name [object]
, ext.type_desc [type]
, ext.max_column_id_used 
, ext.location
, eds.name data_source 
, eff.name file_format 
, ext.create_date 
, ext.modify_date
FROM sys.external_tables ext
     JOIN sys.schemas sch
       ON ext.schema_id = sch.schema_id 
     JOIN sys.external_data_sources eds 
       ON ext.data_source_id = eds.data_source_id
     JOIN sys.external_file_formats eff
       ON ext.file_format_id = eff.file_format_id

-- testing the view
SELECT top 10 *
FROM dbo.vAdminExternalTables

The view can be used then for further queries, for example checking the CETAS created or modified starting with a given date:
 
-- external tables created after a certain date
SELECT *
FROM dbo.vAdminExternalTables ext
WHERE ext.create_date >= '20230101'
  OR ext.modify_date >= '20230101';

Or, when the CETAS are deployed from one environment to another, one can compare the datasets returned by the same view between environments, something like in the below query:
 
-- comparison external tables metadata between two databases
SELECT *
FROM (
    SELECT *
    FROM <test_database>.dbo.vAdminExternalTables
    WHERE [Schema] = '<schema_name>'
	) PRD
	FULL OUTER JOIN (
    SELECT *
    FROM <prod_database>.dbo.vAdminExternalTables
    WHERE [Schema] = '<schema_name>'
	) UAT
	ON PRD.[unique_identifier] = UAT.[unique_identifier]
-- WHERE PRD.[unique_identifier] IS NULL OR UAT.[unique_identifier] IS NULL

The definitions for multiple CETAS can be exported from the source database in one step via the Object Explorer Details >> Tables >> External tables >> (select CETAS) >> Script Table as >> ... . 

Happy coding!
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.