Showing posts with label transaction log. Show all posts
Showing posts with label transaction log. Show all posts

06 February 2024

SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Table Metadata II (Updating a table's COMMENT attributes)

While using the DESCRIBE TABLE metadata I observed that its output shown also a Comment attribute which was NULL for all the columns. Browsing through the Databricks documentation (see [1]), I found that the Comment can be provided in a delta table's definition as follows (code to be run in a notebook):

-- drop the table (if already exists)
DROP TABLE IF EXISTS Assets2;

--create the table
CREATE TABLE Assets2 (
 Id int NOT NULL COMMENT 'Asset UID',
 CreationDate timestamp NOT NULL COMMENT 'Creation Date',
 Vendor string NOT NULL COMMENT 'Vendors name',
 Asset string NOT NULL COMMENT 'Asset type',
 Model string NOT NULL COMMENT 'Asset model',
 Owner string NOT NULL COMMENT 'Current owner',
 Tag string NOT NULL COMMENT 'Assets tag',
 Quantity decimal(13, 2) NOT NULL COMMENT 'Quantity'
) 
USING DELTA
COMMENT 'Vendor assets';

-- show table's definition
DESCRIBE TABLE Assets2;

-- show table's details
DESCRIBE DETAIL Assets;

So, I thought there must be a way to update Assets table's definition as well. And here's the solution split into two steps, as different syntax is required for modifying the columns, respectively the table:

-- modify columns' COMMENT for an existing table
ALTER TABLE Assets ALTER COLUMN ID COMMENT 'Asset UID';
ALTER TABLE Assets ALTER COLUMN CreationDate COMMENT 'Creation Date';
ALTER TABLE Assets ALTER COLUMN Vendor COMMENT 'Vendors name';
ALTER TABLE Assets ALTER COLUMN Asset COMMENT 'Asset type';
ALTER TABLE Assets ALTER COLUMN Model COMMENT 'Asset model';
ALTER TABLE Assets ALTER COLUMN Owner COMMENT 'Current owner';
ALTER TABLE Assets ALTER COLUMN Tag COMMENT 'Assets tag';
ALTER TABLE Assets ALTER COLUMN Quantity COMMENT 'Quantity';

-- show table's definition
DESCRIBE TABLE Assets;

The operation generated a log file with the following content:

{"commitInfo":{"timestamp":1707180621522,"operation":"CHANGE COLUMN","operationParameters":{"column":"{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{\"comment\":\"Quantity\"}}"},"readVersion":13,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"42590d18-e04a-4fb2-bbfa-94414b23fb07"}}
{"metaData":{"id":"83e87b3c-28f4-417f-b4f5-842f6ba6f26d","description":"Vendor assets","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[
{\"name\":\"Id\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset UID\"}},
{\"name\":\"CreationDate\",\"type\":\"timestamp\",\"nullable\":false,\"metadata\":{\"comment\":\"Creation Date\"}},
{\"name\":\"Vendor\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Vendors name\"}},
{\"name\":\"Asset\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset type\"}},
{\"name\":\"Model\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset model\"}},
{\"name\":\"Owner\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Current owner\"}},
{\"name\":\"Tag\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Assets tag\"}},
{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{\"comment\":\"Quantity\"}}]}","partitionColumns":[],"configuration":{},"createdTime":1707149934697}}

And the second step (see [2]):

-- modify a table's COMMENT
COMMENT ON TABLE Assets IS 'Vendor assets';

-- describe table's details
DESCRIBE DETAIL Assets;

    The operation generated a log file with the following content:

{"commitInfo":{"timestamp":1707180808138,"operation":"SET TBLPROPERTIES","operationParameters":{"properties":"{\"comment\":\"Vendor assets\"}"},"readVersion":14,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"21c315b5-a81e-4107-8a19-6256baee7bd5"}}
{"metaData":{"id":"83e87b3c-28f4-417f-b4f5-842f6ba6f26d","description":"Vendor assets","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[
{\"name\":\"Id\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset UID\"}}
,{\"name\":\"CreationDate\",\"type\":\"timestamp\",\"nullable\":false,\"metadata\":{\"comment\":\"Creation Date\"}}
,{\"name\":\"Vendor\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Vendors name\"}}
,{\"name\":\"Asset\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset type\"}}
,{\"name\":\"Model\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset model\"}}
,{\"name\":\"Owner\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Current owner\"}}
,{\"name\":\"Tag\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Assets tag\"}}
,{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{\"comment\":\"Quantity\"}}]}","partitionColumns":[],"configuration":{},"createdTime":1707149934697}}

Notes:
1) Don't forget to remove the Assets2 table!
2) Updating the COMMENT for a single delta table is simple, though for updating the same for a list of tables might be task for a PySpark job. It makes sense to provide the respective metadata from the start, when that's possible. Anyway, the information should be available in lakehouse's data dictionary.
3) One can reset the COMMENT to NULL or to an empty string.

Happy coding!

Resources:
[1] Databaricks (2023) ALTER TABLE (link)
[2] Databaricks (2023) COMMENT ON (link)

05 February 2024

SQL Reloaded: Microsoft Fabric's Delta Tables in Action - CRUD Operations II (Looking at the Files)

Delta Table's Files

In a previous post, I tested the CRUD operations against a Microsoft Fabric's delta table. To view the files behind, you'll need to navigate to the Lakehouse, right click on the Assets table and pick 'View files' from the floating menu:

Assets' delta table files in Lakehouse

One can navigate within the folder structure, look at the transaction log files under _delta_log folder, though there seems to be no way to download the respective files. For this purpose, alternatively, you can use the Azure Storage Explorer (downloadable here). After installing the Explorer use the 'ADLSGen2 container or directory' option to connect to the folder where the Assets table was created (see [1]).

Assets' delta table files in Azure Storage Explorer

And, here's the content of the _delta_log from where the files can be downloaded:

Assets' log files in Azure Storage Explorer

The _temporary folder is empty.

It seems that for each cell with a DML/DDL command a log file was created. Therefore, it should make no difference if you run the cells individually or as a whole, respectively whether the statements are included in only one cell. 

The first file holds the content generated when creating the table:

{"commitInfo":{"timestamp":1707006399711,"operation":"CREATE TABLE","operationParameters":{"isManaged":"true","description":null,"partitionBy":"[]","properties":"{}"},"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"c3e6f24e-fa78-4941-a29d-28aac3a25926"}}
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"a9a45d67-bcf9-4bef-8f48-c6b8f7a64f58","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[
{\"name\":\"Id\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"CreationDate\",\"type\":\"timestamp\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Vendor\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Asset\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Model\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Owner\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Tag\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1707006399640}}

The second file holds the content generated when inserting the first line in the table:

{"commitInfo":{"timestamp":1707006405806,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"readVersion":0,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{"numFiles":"1","numOutputRows":"1","numOutputBytes":"3713"},"tags":{"VORDER":"true"},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"ac51fa78-8384-4e53-bb02-40d65632a218"}}
{"add":{"path":"part-00000-b50df7ae-66d8-41fd-a54d-6238e053d269-c000.snappy.parquet","partitionValues":{},"size":3713,"modificationTime":1707006405683,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":
{\"Id\":1,\"CreationDate\":\"2024-03-01T00:00:00.000Z\",\"Vendor\":\"IBM\",\"Asset\":\"Laptop 1\",\"Model\":\"Model 1\",\"Owner\":\"Owner 1\",\"Tag\":\"XX0001\",\"Quantity\":1.00},\"maxValues\":{\"Id\":1,\"CreationDate\":\"2024-03-01T00:00:00.000Z\",\"Vendor\":\"IBM\",\"Asset\":\"Laptop 1\",\"Model\":\"Model 1\",\"Owner\":\"Owner 1\",\"Tag\":\"XX0001\",\"Quantity\":1.00},\"nullCount\":{\"Id\":0,\"CreationDate\":0,\"Vendor\":0,\"Asset\":0,\"Model\":0,\"Owner\":0,\"Tag\":0,\"Quantity\":0}}","tags":{"VORDER":"true"}}}

The third file holds the content for inserting the multiple lines, while the fourth file contains the content for the UPDATE statement:

{"commitInfo":{"timestamp":1707006417538,"operation":"UPDATE","operationParameters":{"predicate":"[\"(Id#1952 = 6)\"]"},"readVersion":2,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numRemovedBytes":"3870","numCopiedRows":"4","numAddedChangeFiles":"0","executionTimeMs":"2297","scanTimeMs":"1613","numAddedFiles":"1","numUpdatedRows":"1","numAddedBytes":"3870","rewriteTimeMs":"681"},"tags":{"VORDER":"true"},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"70f0f718-331a-48fc-8419-9f80599ca889"}}
{"remove":{"path":"part-00000-0bd7b054-1bde-44cd-a852-80332d98ae8b-c000.snappy.parquet","deletionTimestamp":1707006417533,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":3870,"tags":{"VORDER":"true"}}}
{"add":{"path":"part-00000-b5e4e62f-c938-4db9-a83a-d85850f310c0-c000.snappy.parquet","partitionValues":{},"size":3870,"modificationTime":1707006417436,"dataChange":true,"stats":"{\"numRecords\":5,\"minValues\":{\"Id\":2,\"CreationDate\":\"2024-01-01T00:00:00.000Z\",\"Vendor\":\"Dell\",\"Asset\":\"Laptop 2\",\"Model\":\"Model 2\",\"Owner\":\"Owner 2\",\"Tag\":\"DD0001\",\"Quantity\":1.00},\"maxValues\":{\"Id\":6,\"CreationDate\":\"2024-02-01T00:00:00.000Z\",\"Vendor\":\"Microsoft\",\"Asset\":\"Laptop 4\",\"Model\":\"Model 4\",\"Owner\":\"Owner 4\",\"Tag\":\"XX0001\",\"Quantity\":1.00},\"nullCount\":{\"Id\":0,\"CreationDate\":0,\"Vendor\":0,\"Asset\":0,\"Model\":0,\"Owner\":0,\"Tag\":0,\"Quantity\":0}}","tags":{"VORDER":"true"}}}

The fifth file, holds the content related to the DELETE statement:

{"commitInfo":{"timestamp":1707006422147,"operation":"DELETE","operationParameters":{"predicate":"[\"(Id#2665 = 2)\"]"},"readVersion":3,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numRemovedBytes":"3870","numCopiedRows":"4","numAddedChangeFiles":"0","executionTimeMs":"1425","numDeletedRows":"1","scanTimeMs":"908","numAddedFiles":"1","numAddedBytes":"3837","rewriteTimeMs":"517"},"tags":{"VORDER":"true"},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"1f61449f-6c4a-40ce-80ed-6ff1a9a834e9"}}
{"remove":{"path":"part-00000-b5e4e62f-c938-4db9-a83a-d85850f310c0-c000.snappy.parquet","deletionTimestamp":1707006422134,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":3870,"tags":{"VORDER":"true"}}}
{"add":{"path":"part-00000-bee2e3a5-7def-4466-94ff-ccd86aa37b8c-c000.snappy.parquet","partitionValues":{},"size":3837,"modificationTime":1707006422055,"dataChange":true,"stats":"{\"numRecords\":4,\"minValues\":{\"Id\":3,\"CreationDate\":\"2024-01-01T00:00:00.000Z\",\"Vendor\":\"Dell\",\"Asset\":\"Laptop 3\",\"Model\":\"Model 3\",\"Owner\":\"Owner 2\",\"Tag\":\"DD0001\",\"Quantity\":1.00},\"maxValues\":{\"Id\":6,\"CreationDate\":\"2024-02-01T00:00:00.000Z\",\"Vendor\":\"Microsoft\",\"Asset\":\"Laptop 4\",\"Model\":\"Model 4\",\"Owner\":\"Owner 4\",\"Tag\":\"WX0001\",\"Quantity\":1.00},\"nullCount\":{\"Id\":0,\"CreationDate\":0,\"Vendor\":0,\"Asset\":0,\"Model\":0,\"Owner\":0,\"Tag\":0,\"Quantity\":0}}","tags":{"VORDER":"true"}}}

Before looking at each file, let's note the engineInfo's information that denotes the Spark, respectively the Delta Lake versions: "Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8". It's important to check these versions against the documentation when troubleshooting!

In each file, besides the "timestamp", observe the values of "operation", "operationParameters", "operationMetrics", "dataChange" and "VORDER". 

Operation indicate the DML or DDL statements run. As per my understanding dataChange and VORDER are set to true. 

The DESCRIBE HISTORY Command

You can retrieve a history of the actions performed on a table for the past 30 days via the DESCRIBE command, which can be entered in a notebook's cell:

-- describe table's history (30 days of data)
DESCRIBE HISTORY Assets;

Attributes' definition can be found in [3], see 'History schema' section. For big tables, it might be a good idea to limit command's output only to several records via the LIMIT keyword:

-- describe table's history (30 days of data, last 5 records)
DESCRIBE HISTORY Assets LIMIT 5;

Table Maintenance

Running multiple change commands on a delta table will result in many such small files, which impacts the read access to data and metadata. The files can be compacted via the table maintenance (see also the notes on delta tables). The OPTIMIZE command allows to consolidate multiple small Parquet files into larger files:

-- compacting the files
OPTIMIZE Assets;

Alternatively, you can do the same from the Lakehouse, navigate to the Assets table, right click on it and from the floating menu select Maintenance and run the job with the provided (see also [4]):

Run maintenance commands for a table

Running the command generated a new file:

{"commitInfo":{"timestamp":1707150591870,"operation":"OPTIMIZE","operationParameters":{"predicate":"[]","zOrderBy":"[]","auto":false},"readVersion":4,"isolationLevel":"SnapshotIsolation","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"2","numRemovedBytes":"7550","p25FileSize":"3879","numDeletionVectorsRemoved":"0","minFileSize":"3879","numAddedFiles":"1","maxFileSize":"3879","p75FileSize":"3879","p50FileSize":"3879","numAddedBytes":"3879"},"tags":{"VORDER":"true"},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"b6466509-9f1a-467d-ad99-a0e95ed694ec"}}
{"add":{"path":"part-00000-453e9218-fe59-4138-a7dc-0e71c94c07d2-c000.snappy.parquet","partitionValues":{},"size":3879,"modificationTime":1707150591746,"dataChange":false,"stats":"{\"numRecords\":5,\"minValues\":{\"Id\":1,\"CreationDate\":\"2024-01-01T00:00:00.000Z\",\"Vendor\":\"Dell\",\"Asset\":\"Laptop 1\",\"Model\":\"Model 1\",\"Owner\":\"Owner 1\",\"Tag\":\"DD0001\",\"Quantity\":1.00},\"maxValues\":{\"Id\":6,\"CreationDate\":\"2024-03-01T00:00:00.000Z\",\"Vendor\":\"Microsoft\",\"Asset\":\"Laptop 4\",\"Model\":\"Model 4\",\"Owner\":\"Owner 4\",\"Tag\":\"XX0001\",\"Quantity\":1.00},\"nullCount\":{\"Id\":0,\"CreationDate\":0,\"Vendor\":0,\"Asset\":0,\"Model\":0,\"Owner\":0,\"Tag\":0,\"Quantity\":0}}","tags":{"VORDER":"true"}}}
{"remove":{"path":"part-00000-cbaaf2dc-b8d6-44bc-a4e1-97db0ba96074-c000.snappy.parquet","deletionTimestamp":1707150580358,"dataChange":false,"extendedFileMetadata":true,"partitionValues":{},"size":3713,"tags":{"VORDER":"true"}}}
{"remove":{"path":"part-00000-6714ec80-1bd8-41f3-b8b7-5197025ec08f-c000.snappy.parquet","deletionTimestamp":1707150580358,"dataChange":false,"extendedFileMetadata":true,"partitionValues":{},"size":3837,"tags":{"VORDER":"true"}}}

Happy coding!

Resources:
[1] Microsoft Azure (2023) Azure Storage Explorer (link)
[2] Sam Debruyn (2023) Exploring OneLake with Microsoft Azure Storage Explorer (link)
[3] Delta Lake (2023) Table utility commands (link)
[4] Microsoft Learn (2023) Use table maintenance feature to manage delta tables in Fabric (link)

08 January 2023

SQL Server: DELETE vs. TRUNCATE TABLE Cheat Sheet

The comparison between the DELETE and TRUNCATE TABLE commands resumes to more than saying that one method is faster than the other or that one should always use TRUNCATE TABLE when deleting all the records from a table, which typically is not advisable in production environments. I tried to provide an overview of the two commands, though this should be considered as "work in progress".
 
Disclaimer: please refer to the SQL Server Docs for the complete set of features broken down on version.
 
DELETETRUNCATE TABLE
DefinitionDML command that removes one or more rows from a table or viewDDL command that removes all rows from a table or specified partitions of a table, without logging the individual row deletions
Scopetables, views, memory-optimized tables, common table expressions, MERGE, sp_MSforEachTable, linked serverstables, partitions
Behavior• removes rows one at a time and records an entry in the transaction log for each deleted row
  - for big tables the transaction log fills fast and may reach its limit
• fully logged ⇒rollback supported
• executed using a row lock, though locks might be escalated to a larger scope ⇒ performance may degrade
• [views] deletes the records only from the base table
• allows outputting the deleted records (via OUTPUT clause)
• [heaps] pages made empty may remain allocated ⇒ can’t be reused by other objects
• equivalent of a DROP & CREATE TABLE
• uses an optimized logging mode:
   - removes the data by deallocating the data pages used to store the table data and records only the page deallocations in the transaction log
   - a deferred-drop mechanism unhooks the allocations for the table and putting them on the ‘deferred-drop queue’ for later processing by a background task deallocates all the pages and extents
• fully logged, however rollback supported only with explicit transactions
• leaves zero pages in the table
• resets the identity property
Syntax (simple form)
DELETE <table_name>
FROM <database>
[WHERE <search_condition>]   
[OPTION (<query_options>)]
TRUNCATE TABLE <table_name>
[ WITH ( PARTITIONS ( { 
<partition_number_expression> 
| <range> }))]
Performance• degrades with the numbers of records
• [large tables] can cause the transaction log to become full
• the operation completes almost instantaneously
• best practice because is faster and uses fewer system and transaction log resources
Constraints• a DELETE may fail if - violates a trigger - tries to remove a row referenced by data in another table with a FOREIGN KEY constraint
• TOP can’t be used in a DELETE statement against partitioned views • doesn’t reset the identity property
• can’t be used with views
• can’t be used on tables:
   - referenced by a FOREIGN KEY constraint, except self-references
   - participate in an indexed view - published using transactional or merge replication - system-versioned temporal.
   - referenced by an EDGE constraint
• can’t activate a trigger
• can’t be used on views & memory-optimized tables
Permissions• [minimum] DELETE permission on target table, and SELECT permission, it if includes a WHERE clause
• default to
   - table owner
   - members of the sysadmin fixed server role   
   - db_owner and db_datawriter fixed database roles
• table owners & members of sysadmin, db_owner & db_securityadmin roles can transfer permissions to other users
• [minimum] ALTER permission on target table
• default to
   - table owner - members of the sysadmin fixed server role
   - db_owner and db_ddladmin fixed database roles
• permissions are not transferable
• doesn’t support direct permissions (workaround: use TRUNCATE in stored procedure, and assign the required permission to it using the EXECUTE AS clause)
Scenarios• delete a set of records from a table
   - based on fix constraints
   - based on records from another table
   - based on a join with a source table
• empty a set of tables from a database
Recommendations• use a TRUNCATE when is safe to delete all the records ⇒ make sure that a backup or copy of the data is available
• [large tables] consider dropping the indexes before performing a DELETE when this covers all or most of the data, and recreate them afterwards
• [large tables] if the volume of data to be deleted is big compared with the remaining data, consider moving the data to a table with a similar structure, perform a TRUNCATE and then move the data back (see [5])
• [large tables] consider deleting data in batches with log truncation in single-user mode (be careful in production environments)
• [heaps] specify the TABLOCK hint in the DELETE statement
• [heaps] create a clustered index on the heap before deleting the row
Myths•TRUNCATE TABLE is a non-logged operation (see [4])
•TRUNCATE TABLE is a minimally logged operation (see [3])
Related conceptsnon-logged/minimally-logged/fully-logged operations, deferred-drop mechanism, sp_MSforEachTable stored procedure, transaction log, MERGE, DDL, DML

Resources:
[1] Microsoft SQL Docs (2022) DELETE [link]
[2] Microsoft SQL Docs (2022) TRUNCATE TABLE [link]
[3] Microsoft TechNet (2017) SQL Server: Understanding Minimal Logging Under Bulk-Logged Recovery Model vs. Logging in Truncate Operation [link]
[4] Paul Randal (2013) The Myth that DROP and TRUNCATE TABLE are Non-Logged [link]
[5] SQL-troubles (2018) ERP Systems: Dynamics AX 2009 – Deleting Obsolete Companies [link]
[6] Microsoft TechNet (2014) SQL Server: An Examination of Logging in Truncate Table Statement and Its Comparison with Delete Statement [link]

20 June 2018

ERP Systems: Dynamics AX 2009 – Deleting Obsolete Companies

Introduction   

    During implementations, migrations and other projects are created in Dynamics AX temporary companies (aka legal entities, data areas) that aren’t needed anymore once they fulfilled their purpose. Excepting the fact that obsolete companies occupy space in the data center, under certain circumstances they can lead to performance problems. The logical thing to do would be to delete the obsolete companies as long there’s no further demand from the business.    

   In what follows we will look at several methods for deleting obsolete companies. The scripts were tested in Dynamics AX 2009, and more likely they’ll work in coming versions as long the data model behind was kept.

Warning:
    Please note that the scripts are provided “AS IS” only to exemplify a technique and they come without any warranty! Before attempting any of the methods described here, review the comments from “Further Considerations” section!


Method 1: Using DynamcsAX Built-In Functionality   

   Dynamics AX 2009 provides built-in functionality for deleting a company, however when the volume of data in the system goes above a certain limit the functionality starts to perform poorly, even when run directly on an AOS. (It is recommended to run long-running administration jobs directly on the AOS rather than clients.)    For example, it was attempted to use this method to delete several companies in Dynamics AX Test environment. By the first company the deletion job needed a few hours, while by the second company the job hasn’t finished after two days, being thus forced to stop it. After two further failed attempts it came the time to look for another solution.

Warning:
     It seems that this solution can lead to orphaned data (see [1]). So, even if you are using this method, you might need to consider one of the following methods as well.


Method 2: Using sp_MSforEachTable   

  In almost all tables in AX the company is stored in a DataAreaId attribute. Over this attribute the records belonging to a company are logically partitioned. This allows writing a script via the undocumented sp_MSforEachTable stored procedure:

--delete the data for one data area
sp_MSforEachTable @command1 = 'DELETE FROM ? WHERE DataAreaId = ''m01'''


An error with be thrown for the tables that don’t contain the DataAreaId attribute:
Msg 207, Level 16, State 1, Line 1


Invalid column name 'DataAreaId'.The script can be extended to delete in the same step two or more companies:

--delete the data for multiple data areas
 sp_MSforEachTable @command1 = 'DELETE FROM ? WHERE DataAreaId IN (''m01'', ''m02'')'


     During the first test the script needed half of hour to run, however a few tables  in which the company is stored in other attributes remained untouched. One can either search for such tables manually, via a script, or run the built-in AX functionality. We opted for running the built-in functionality, which managed to delete the remaining data relatively fast.

Warning:
Microsoft doesn’t support this method and can be used when the volume of obsolete data is relatively small!    What does it mean relatively small? The most important limitation of this method is the transaction log, considering that the deleted data are logged. One can either change log’s size to accommodate the volume of data to be deleted or run the deletion only for a subset of the tables. (Changing the recovery model to “simple” or “bulk-logged” won’t make a difference.)

   The second important limitation is the available memory, once the available memory is reached SQL Server having to paginate the data, fact that could lead to further disk space consumed.    Other limitations have more with the performance to do, e.g. each deletion is reflected also in the indexes. One might consider for example dropping the indexes before deletion and recreating them afterwards.


Method 3: Using a Cursor    

  Instead of using the undocumented sp_MSforEachTable stored procedure, the loop can be performed via a cursor (see [1]). This method is advantageous when the deletion needs to be performed only for a subset of tables one could use a cursor. The deletion can be grouped together with other activities and run together.


Method 4: Using „Shadow“ Tables    

   When the volume of data available is huge, and the volume of data that remain in the table is small compared with the overall data, it might be useful to consider using “shadow” tables. One can take advantage of the fact that a truncate command performs incomparable better than a delete command.  To use a truncate on a table, the records that need to be kept could be saved temporarily to a copy (aka “shadow”) of the table, the truncate then applied, and the copied records could be moved back. The following scripts exemplify the logic needed to delete the records from InventDim (inventory dimensions) table:

-- (optional) prove the number of records
SELECT count(*) 
FROM dbo.InventDim 
WHERE DataAreaId = 'm01'

-- create the “shadow” table
CREATE TABLE [dbo].[INVENTDIM_Dump](
[INVENTDIMID] [nvarchar](30) NOT NULL,
[INVENTBATCHID] [nvarchar](21) NOT NULL,
[WMSLOCATIONID] [nvarchar](12) NOT NULL,
[INVENTSERIALID] [nvarchar](21) NOT NULL,
[INVENTLOCATIONID] [nvarchar](10) NOT NULL,
[CONFIGID] [nvarchar](10) NOT NULL,
[INVENTSIZEID] [nvarchar](10) NOT NULL,
[INVENTCOLORID] [nvarchar](10) NOT NULL,
[INVENTSITEID] [nvarchar](10) NOT NULL,
[DATAAREAID] [nvarchar](4) NOT NULL,
[RECVERSION] [int] NOT NULL,
[RECID] [bigint] NOT NULL,
[WMSPALLETID] [nvarchar](18) NOT NULL,
[INVENTSTYLEID] [nvarchar](10) NOT NULL
) ON [PRIMARY]

-- copy the data into the “shadow” table
INSERT INTO [dbo].[InventDim_Dump] WITH (TABLOCK)
SELECT *
FROM [dbo].[InventDim] 
WHERE DataAreaId = 'm01'

-- truncate the data frome the main table 
--TRUNCATE TABLE [dbo].[InventDim]

-- copy the data back
INSERT INTO [dbo].[InventDim] WITH (TABLOCK)
SELECT *
FROM [dbo].[InventDim_Dump]

-- (optional) prove whether the IDs were correctly copied 
SELECT count(*)
FROM [dbo].[InventDim] A
JOIN [dbo].[InventDim_Dump] B
ON A.recid = B.RECID 
AND A. DATAAREAID = B.DATAAREAID 
WHERE A.DataAreaId = 'm01'

-- drop the „shadow“ table 
--DROP TABLE[dbo].[InventDim_Dump]

  

   As can be seen the “shadow” tables are simplified versions of the original tables, without constraints or indexes. They can be eventually created in another schema or even other database.   

   Except the script for table’s creation in the other scripts table’s name can be easily replaced in the editor via the search and replace functionality, trick that reduces considerably the time needed for development. I needed on average 5 minutes for each table, plus 3-4 hours for further tests.    

   The optional steps are more for exemplification and can be eventually removed.  

   The Tablock hint used in inserts provides better performance and minimizes the volume of data logged.    

   I used this method only for the tables having more than 3 million records, around 50 tables in total. Between them there were a few tables having 20-200 GB worth of records. I started with these big tables and figured out that also smaller tables could benefit from this method. A few minutes gained for each small table resulted in the end in a gain of a couple of hours.

   The remained records were 0-25% of the initial tables.   

   In theory, these steps could be performed within a cursor in which the creation of the “shadow” tables could be automated via table metadata as well. This approach will pay-off especially when the schema is not fixed, or the procedure needs to be repeated on different schemas.


Method 5: Delete Records in Batches    

   There will be a point beyond which the performance provided by the fourth method will deprecate considerably. This point is based on the volume of records available in the table, and the records needed to be inserted back and forth. Without further tests, I suppose that this point lies in the 50-75% interval. Beyond this point for big tables in range of 10x or 100x GB it might be useful to delete the data in batches. A push in this direction might be constrained by the need to shrink the transaction log in between the deletes. The query could be written as follow:

-- deleting top x records 
DELETE top 10000
FROM dbo.InventDim WITH (TABLOCK)
WHERE DataAreaId = m01

   The query can be included in a loop or run manually until no records are returned. It can be tested with different batch sizes to determine the best solution. In between is recommended to check also the growth of the log file and truncate it accordingly when needed.


Method 6: Using X++ Code  

    For those having some basic knowledge of X++ and Dynamics AX classes, a solution based on deleting data via AX code could prove to be a better solution as standard functionality can be leveraged, functionality that eventually considers also the business logic implemented. The downside is the code that need to be written for this purpose, however there are already some examples available on the web (see [4]).


Hint:
In AX 2012 built-in support for batch deletes was added via the delete_from statement (see [3]).


Further Considerations    

   Before attempting a deletion, it might be useful to analyze how many records will be deleted from each table, and eventually devise different scenarios for specific table categories. To get the number of records one can use either the built-in functionality from AX or use the sp_MSforEachTable stored procedure and export the results to text, following to overwork the data further in Excel:

-- listing the number of records per company 
sp_MSforEachTable @command1 = 'SELECT dataareaid, ''?'' table_name, count(*) no_records FROM ? WHERE DataAreaId IN (''m01'', ''m02'') GROUP BY dataareaid'

The results can be used also to approximate the space occupied by the data.   

   Independently of the method used it is recommended to restrict users‘ access to the system and to deactivate the scheduled AX or SQL Server jobs. This will ensure that no blockings will occur in the system during the respective time.    

   As data are synchronized between the AOS’s and the database, it is recommended to shut down the not needed AOS services before the deletions are performed, and restart them once all activities were performed.   

   To minimize the risks associated with the loss of data it’s recommended to perform a backup of the database(s) before performing any changes.    

   By deleting the data directly on the database, the business logic from AX (including customizations) is skipped. In theory this can lead to logical inconsistencies, however considering that all the data for a company are deleted, the risks are very small, unless intercompanies are involved.   

   After the data are deleted it is recommended to recreate the indexes and update the statistics on the tables.  

   Check whether the transaction log can accommodate the volume of records to be deleted! In extreme cases your SQL Server might crash! From this consideration it might be advantageous to delete only a company at a time.    

   Based on the volume of data available in the transaction log it might be needed to truncate the log(s) between the steps, as well at the end.  

   After the principle “better safe than sorrow”, it might be a good idea to check the physical and logical consistency of the data before letting the users in.   

  To minimize the impact on the business, it is recommended to perform the deletion outside the working hours, otherwise the action can lead to blocking and even deadlocks in the system.     Always attempt to use standard functionality and resort to other methods only when there’s no way around it.

  It is recommended to always test the scripts thoroughly in the test environment before attempting their productive usage!

References:
[1] Microsoft Dynamics AX Technical Support Blog (2010) How to delete orphaned data remained from deleted company?, by Martin Falta [Online] Available from: https://blogs.msdn.microsoft.com/emeadaxsupport/2010/12/09/how-to-delete-orphaned-data-remained-from-deleted-company/
[2] Art of Creation (2010) Delete an AX company on SQL [Online] Available from: http://www.artofcreation.be/2010/02/03/delete-an-ax-company-on-sql/
[3] MSDN (2012) delete_from Statement [Online] Available from: https://msdn.microsoft.com/en-us/library/aa624886.aspx[
4] Kevin’s blog (2017) Dynamics Ax 2012 History cleanup, by Kevin Roos [Online] Available from: https://www.kevinroos.be/2017/07/dynamics-ax-2012-history-cleanup/

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.