
31 January 2024

🏭🗒️Microsoft Fabric: Parquet Format [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 31-Jan-2024

[Microsoft Fabric] Parquet format

  • {definition} open source, column-oriented data file format designed for efficient data storage and retrieval [1]
    • provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk [1]
    • designed to be a common interchange format for both batch and interactive workloads [1]
  • {characteristic} open source file format
    • similar to other columnar-storage file formats available in Hadoop [1]
      • e.g. RCFile, ORC
    • became an industry standard 
      •  {benefit} provides interoperability across multiple tools
  • {characteristic} language agnostic [1]
    • different programming languages can be used to manipulate the data
  • {characteristic} column-based format [1]
    • files are organized by column
      • ⇐ rather than by row
      • ⇒ saves storage space and speeds up analytics queries [1]
    •  reads only the needed columns 
      • ⇐ non-relevant data are skipped
      • ⇒ greatly minimizes the IO [1]
        • aggregation queries are less time-consuming compared to row-oriented databases [1]
    • {benefit} increased data throughput and performance [1]
      • ⇒ recommended for analytical workloads
  • {characteristic} highly efficient data compression/decompression [1]
    • supports flexible compression options and efficient encoding schemes [1]
      • data can be compressed by using one of the several codecs available [1]
        • ⇒ different data files can be compressed differently [1]
    •  reduced storage requirements [1]
      • by at least one-third on large datasets
      • ⇒ {benefit} saves on cloud storage space
    •  greatly improves scan and deserialization time [1]
      • ⇒ {benefit} reduces the processing costs
    • {downside} can be slower to write than row-based file formats
      • primarily because they contain metadata about the file contents 
      • though read times remain fast
  • {characteristic} supports complex data types and advanced nested data structures [1]
    • implemented using the record-shredding and assembly algorithm
      • accommodates complex data structures that can be used to store the data [1]
      • optimized to work with complex data in bulk and features different ways for efficient data compression and encoding types [1]
        • the approach works best for queries that need to read only certain columns from a large table [1]
  • {characteristic} cloud-ready
    • works best with interactive and serverless technologies [1] (see the serverless SQL query sketch after these notes)
  • {characteristic} immutable
    • a file can't be updated to modify a column name, or to reorder or drop columns [2]
      • ⇐ requires rewriting the whole file [2]
  • {characteristic} binary-based file
    • ⇒ not easily readable (by humans)
  • {characteristic} self-describing 
    •  contains metadata about schema and structure
    • {concept} row groups (aka segments) 
      • contains data from the same columns
        • {constraint} column names are case sensitive
    • {concept} file footer 
      • stores metadata statistics for each row group [2]
        • min/max statistics 
        • the number of rows
        • can be leveraged by data processing engines to run queries more efficiently [2]
          • ⇐ depending on the query, entire row group can be skipped [2]
    • {concept} file header
  •  large datasets can be split across multiple parquet files
    • ⇐ the structure can be flat or hierarchical 
    • managing multiple files has several challenges
    • the files can be used to define a table (aka parquet table)
      • ⇐ {constraint} the files must have the same definition
        • ⇐ schema enforcement must be coded manually [2]
      • {limitation} [Data Lake] no support for ACID transactions [2]
        • ⇒ easy to corrupt [2]
          • partially written files will break any subsequent read operations
            • the compute engine will try to read in the corrupt files and error out [2]
            • corrupted files must be identified and deleted manually to fix the issues [2]
      • {limitation} it's not easy to delete rows from it [2]
        • requires reading all the data, filtering out the data not needed, and then rewriting the entire table [2]
      • {limitation} doesn't support DML transactions [2]
      • {limitation} there is no change data feed [2]
      • {limitation} slow file listing [2]
        • small files require excessive I/O overhead
          • ideally the files should be between 64 MB and 1 GB
          • ideally the files should be compacted into larger files (aka small file compaction, bin-packing)
      • {limitation} expensive footer reads to gather statistics for file skipping [2]
        • fetching all the footers and building the file-level metadata for the entire table is slow [2]
          • ⇐ it requires a file-listing operation [2]
        • the effectiveness of data skipping depends on how many files can be skipped when performing a query [2]
      • {limitation} doesn't support schema enforcement [2]
      • {limitation} doesn't support check constraints [2]
      • {limitation} doesn't support data versioning [2]
    • {concept} table partitioning
      • {definition} common optimization approach used to store the data of the same table in different directories, with partitioning column values encoded in the path of each partition directory [6]
      • {recommendation} avoid partitioning by columns with very high cardinality (see the Spark SQL sketch after these notes)
    • {concept} bin-packing (aka compaction, bin-compaction)
      • aims to produce evenly-balanced data files with respect to their size on disk, 
        • ⇐ but not necessarily with respect to the number of tuples per file [7]
      • requires an algorithm that efficiently organizes the files into equal size containers [6]
      • {characteristic} idempotent
        •  if it is run twice on the same dataset, the second run has no effect [7]
  • {feature} [Microsoft Fabric] V-order
    • {definition} write time optimization to the parquet file format that enables lightning-fast reads under the MF compute engines [3]
    • applies special sorting, row group distribution, dictionary encoding and compression on parquet files [3]
      • requires fewer compute engine resources to read it [3]
        • provides further cost efficiency and performance
          • has a 15% impact on average write times but provides up to 50% more compression [3]
    • {characteristic} open-source parquet format compliant
      • all parquet engines can read it as a regular parquet file [3]
      • ⇐ table properties and optimization commands can be used to control V-Order on its partitions [3]
      • compatible with other features [3]
    • applied at parquet file level [3]
    • enabled by default
  • {command} OPTIMIZE
    • merges all changes into bigger, consolidated parquet files (aka bin-compaction) [3] (see the Spark SQL sketch after these notes)
    • [Spark] dynamically optimizes partitions while generating files with a default 128 MB size [5]
      • the target file size may be changed per workload requirements using configurations [5]
    • properly designing the table physical structure based on the ingestion frequency and expected read patterns is likely more important than running the optimization command [3]
    • running the compaction operation brings the data lake into an unusable state for readers [7]
    • {warning} manually compacting the files is inefficient and error prone [7]
      • no way to differentiate files that contain new data from files that contain existing data that was just compacted into new files [7]
  • [Delta Lake] when ZORDER and VORDER are used together, Apache Spark performs bin-compaction, ZORDER, VORDER sequentially [3]
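
To ground the partitioning, bin-compaction and V-Order notes above, here's a minimal Spark SQL sketch (e.g. run in a Fabric notebook). The sales table and its columns are illustrative assumptions; OPTIMIZE with VORDER applies to Delta tables in Microsoft Fabric [3], so the second statement presumes a hypothetical Delta table:

-- create a parquet-backed table partitioned by a low-cardinality column
CREATE TABLE sales (
    order_id   BIGINT,
    order_date DATE,
    amount     DECIMAL(10, 2),
    country    STRING        -- low cardinality ⇒ suitable partitioning column
)
USING PARQUET
PARTITIONED BY (country);

-- bin-compaction plus Z-Order and V-Order on a hypothetical Delta table
-- (per the note above, Spark runs bin-compaction, ZORDER, VORDER sequentially)
OPTIMIZE sales_delta ZORDER BY (order_date) VORDER;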
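
Parquet files can also be queried directly, e.g. with an Azure Synapse serverless SQL pool via OPENROWSET [4]. A minimal T-SQL sketch, where the storage URL is a placeholder:

SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage_account>.dfs.core.windows.net/<container>/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;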

Acronyms:
ACID - atomicity, consistency, isolation, durability
DML - Data Manipulation Language
IO - Input/Output
MF - Microsoft Fabric
ORC - Optimized Row Columnar
RCFile - Record Columnar File

Resources:
[1] Databricks (2023) What is Parquet? (link)
[2] Data Lake (2023) Delta Lake vs. Parquet Comparison (link)
[3] Data Mozart (2023) Parquet file format – everything you need to know! (link)
[4] Microsoft Learn (2023) Query Parquet files using serverless SQL pool in Azure Synapse Analytics (link)
[5] Microsoft Learn (2023) Lakehouse tutorial: Prepare and transform data in the lakehouse (link)
[6] Apache Spark (2023) Spark SQL Guide (link)
[7] Delta Lake (2023) Delta Lake Small File Compaction with OPTIMIZE (link)
[8] Delta Lake (2023) Optimizations (link)

08 January 2023

💠🛠️SQL Server: DELETE vs. TRUNCATE TABLE Cheat Sheet

The comparison between the DELETE and TRUNCATE TABLE commands comes down to more than saying that one method is faster than the other, or that one should always use TRUNCATE TABLE when deleting all the records from a table, which is typically not advisable in production environments. I tried to provide an overview of the two commands, though this should be considered "work in progress".
 
Disclaimer: please refer to the SQL Server documentation for the complete set of features broken down by version.
 
Definition
• DELETE: DML command that removes one or more rows from a table or view.
• TRUNCATE TABLE: DDL command that removes all rows from a table or from specified partitions of a table, without logging the individual row deletions.

Scope
• DELETE: tables, views, memory-optimized tables, common table expressions, MERGE, sp_MSforEachTable, linked servers.
• TRUNCATE TABLE: tables, partitions.

Behavior
• DELETE:
   - removes rows one at a time and records an entry in the transaction log for each deleted row
      - for big tables the transaction log fills fast and may reach its limit
   - fully logged ⇒ rollback supported
   - executed using a row lock, though locks might be escalated to a larger scope ⇒ performance may degrade
   - [views] deletes the records only from the base table
   - allows outputting the deleted records (via the OUTPUT clause)
   - [heaps] pages made empty may remain allocated ⇒ they can't be reused by other objects
• TRUNCATE TABLE:
   - equivalent of a DROP & CREATE TABLE
   - uses an optimized logging mode:
      - removes the data by deallocating the data pages used to store the table data and records only the page deallocations in the transaction log
      - a deferred-drop mechanism unhooks the allocations for the table and puts them on the 'deferred-drop queue'; a background task later deallocates all the pages and extents
   - fully logged, however rollback is supported only within explicit transactions
   - leaves zero pages in the table
   - resets the identity property

Syntax (simple form)
• DELETE:
   DELETE FROM <table_name>
   [WHERE <search_condition>]
   [OPTION (<query_options>)]
• TRUNCATE TABLE:
   TRUNCATE TABLE <table_name>
   [WITH (PARTITIONS ({<partition_number_expression> | <range>}))]

Performance
• DELETE:
   - degrades with the number of records
   - [large tables] can cause the transaction log to become full
• TRUNCATE TABLE:
   - completes almost instantaneously
   - considered best practice when applicable because it is faster and uses fewer system and transaction log resources

Constraints
• DELETE:
   - may fail if it violates a trigger or tries to remove a row referenced by data in another table with a FOREIGN KEY constraint
   - TOP can't be used in a DELETE statement against partitioned views
   - doesn't reset the identity property
• TRUNCATE TABLE:
   - can't be used on views or memory-optimized tables
   - can't be used on tables that:
      - are referenced by a FOREIGN KEY constraint (except self-references)
      - participate in an indexed view
      - are published using transactional or merge replication
      - are system-versioned temporal tables
      - are referenced by an EDGE constraint
   - can't activate a trigger

Permissions
• DELETE:
   - [minimum] DELETE permission on the target table, plus SELECT permission if a WHERE clause is used
   - defaults to:
      - the table owner
      - members of the sysadmin fixed server role
      - the db_owner and db_datawriter fixed database roles
   - table owners & members of the sysadmin, db_owner & db_securityadmin roles can transfer permissions to other users
• TRUNCATE TABLE:
   - [minimum] ALTER permission on the target table
   - defaults to:
      - the table owner
      - members of the sysadmin fixed server role
      - the db_owner and db_ddladmin fixed database roles
   - permissions are not transferable
   - doesn't support direct permissions (workaround: wrap TRUNCATE TABLE in a stored procedure and assign the required permission to it using the EXECUTE AS clause; see the sketch after this cheat sheet)

Scenarios
• DELETE: delete a set of records from a table
   - based on fixed constraints
   - based on records from another table
   - based on a join with a source table
• TRUNCATE TABLE: empty a set of tables from a database

Recommendations
• use TRUNCATE TABLE only when it is safe to delete all the records ⇒ make sure that a backup or copy of the data is available
• [large tables] consider dropping the indexes before performing a DELETE that covers all or most of the data, and recreate them afterwards
• [large tables] if the volume of data to be deleted is big compared with the remaining data, consider moving the data to a table with a similar structure, performing a TRUNCATE, and then moving the data back (see [5])
• [large tables] consider deleting data in batches with log truncation in single-user mode (be careful in production environments; see the batched DELETE sketch after this cheat sheet)
• [heaps] specify the TABLOCK hint in the DELETE statement
• [heaps] create a clustered index on the heap before deleting the rows

Myths
• TRUNCATE TABLE is a non-logged operation (see [4])
• TRUNCATE TABLE is a minimally logged operation (see [3])

Related concepts
non-logged/minimally-logged/fully-logged operations, deferred-drop mechanism, sp_MSforEachTable stored procedure, transaction log, MERGE, DDL, DML
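
A minimal T-SQL sketch of the batched-delete recommendation above; dbo.SalesLog, its OrderDate column and the cutoff value are illustrative assumptions:

DECLARE @BatchSize int = 10000;

WHILE 1 = 1
BEGIN
    -- delete at most @BatchSize rows per iteration to limit transaction log growth
    DELETE TOP (@BatchSize)
    FROM dbo.SalesLog
    WHERE OrderDate < '2020-01-01';

    IF @@ROWCOUNT < @BatchSize
        BREAK;   -- the last batch was processed

    -- under the SIMPLE recovery model the log space can be reused between batches;
    -- under FULL, schedule log backups so the log can be truncated
END;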
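
And a sketch of the workaround mentioned under Permissions: wrapping TRUNCATE TABLE in a stored procedure so that callers don't need ALTER on the table (dbo.StagingOrders and SomeRole are hypothetical names):

CREATE PROCEDURE dbo.TruncateStagingOrders
WITH EXECUTE AS OWNER   -- runs under the owner's permissions, which include ALTER
AS
BEGIN
    TRUNCATE TABLE dbo.StagingOrders;
END;
GO

-- callers need only EXECUTE on the procedure, not ALTER on the table
GRANT EXECUTE ON dbo.TruncateStagingOrders TO SomeRole;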

Resources:
[1] Microsoft SQL Docs (2022) DELETE [link]
[2] Microsoft SQL Docs (2022) TRUNCATE TABLE [link]
[3] Microsoft TechNet (2017) SQL Server: Understanding Minimal Logging Under Bulk-Logged Recovery Model vs. Logging in Truncate Operation [link]
[4] Paul Randal (2013) The Myth that DROP and TRUNCATE TABLE are Non-Logged [link]
[5] SQL-troubles (2018) ERP Systems: Dynamics AX 2009 – Deleting Obsolete Companies [link]
[6] Microsoft TechNet (2014) SQL Server: An Examination of Logging in Truncate Table Statement and Its Comparison with Delete Statement [link]

01 January 2011

🔖Book Review: Lynn Beighley's "Head First SQL" (2011)

I love the Head First O’Reilly book series which, besides the book on SQL, has acquired over its short existence several other titles, like the ones on Data Analysis, Statistics and Excel, topics strongly correlated with the thematic of the current blog, but also titles on tangential topics like Project Management, Design Patterns, Object-Oriented Analysis and Design, Software Development, Web Design, Programming, PHP & MySQL, HTML with CSS & XHTML, C#, Java, AJAX, JavaScript, Python, Servlets and JSP, EJB or Rails. (The book series contains more titles, which seem to be available at Head First Labs together with other supporting information.)

What I like about this book series is that it approaches the themes from a basic level, using a rich set of "brain-friendly" visual representations, a conversational and personalized style, challenging food for thought, plus some advice on how to use the book’s format, all of which allows the reader to make the most of the learning process. In fact the books are approachable for any newbie in IT, so I recommend the whole series to beginners as well as to experienced programmers, considering that everybody has something to learn, at least regarding the presentational format of the books.

Coming back to the current title, Head First SQL, it covers almost everything somebody has to know about database design, DML and DDL statements, database objects and security in a nutshell. It focuses only on the SQL standard (I wish I knew which one, as there are several versions/revisions, see SQL 1999), so it could be used for all types of RDBMS that implement the standard, even if the book is built around MySQL. A consequence of this is that vendor-specific flavors won’t be found here, so the book should be used in combination with the documentation or books that target the vendor-specific functionality. There are also several topics not covered but only briefly mentioned: reserved words and special characters, temporary tables, data conversion, some numeric functions and operators, indexes, etc.

I must remark that some of the definitions are somewhat ambiguous, and I feel that in some cases additional clarifications are necessary. Examples in this direction are the very first important definitions: the one of the database, seen as "a container that holds tables and SQL structures related to those tables"; the one of the column, seen as a piece of data stored by a table; or the one of the row, seen as a single set of columns that describe the attributes of a single thing. The first definition is quite acceptable if the existence of non-relational databases is ignored (it’s also true that the book targets RDBMS, even if that's not evident), however the other two definitions are too vague, at least from my point of view.

It’s also true that more rigor and detail would involve the introduction of other concepts (e.g. entity, matrix, etc.), requiring more space and maybe complicating the topics. The reader could intuit from the explanatory text and examples what it’s all about, however expecting the reader to bridge such gaps should be avoided, especially considering the intrinsic nature of this book in particular, and of IT books in general. Anyway, this shouldn’t stop you from discovering and enjoying the book!

Notes:
1. As I learned the hard way, not all book formats and writing styles match everybody’s taste and expectations. The simple, visually rich format of the book could annoy as well, so it might be a good idea to browse the book first on Scribd before buying it.
2. The source code is available here.

09 March 2009

🛢DBMS: Trigger (Definitions)

"A special form of stored procedure that goes into effect when a user gives a change command such as insert, delete, or update to a specified table or column. Triggers are often used to enforce referential integrity." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"A special form of stored procedure that goes into effect when data within a table is modified. Triggers are often created to enforce integrity or consistency among logically related data in different tables." (Patrick Dalton, "Microsoft SQL Server Black Book", 1997)

"A special type of stored procedure that is set off by actions taken on a table. Triggers allow for complex relationships between tables and complex business rules to be checked automatically." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"A stored procedure that executes automatically when data in a specified table is modified. Triggers are often created to enforce referential integrity or consistency among logically related data in different tables." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Code stored in the database that executes automatically when certain events occur. Traditionally associated only with table write events such as INSERT, UPDATE, or DELETE, newer versions of Oracle provide the ability to define triggers on views and on other system events such as logon, logoff, and system error." (Bill Pribyl & Steven Feuerstein, "Learning Oracle PL/SQL", 2001)

"A stored procedure that executes when data in a specified table is modified. Triggers are often created to enforce referential integrity or consistency among logically related data in different tables." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"A trigger is a stored procedure that is fired when data is modified from a table using any of the three modification statements: DELETE, INSERT, or UPDATE. FOR and AFTER are synonymous, and are usually implied when referring to triggers, rather than INSTEAD OF triggers. Triggers are often created to enforce referential integrity or consistency among logically related data in different tables." (Thomas Moore, "EXAM CRAM™ 2: Designing and Implementing Databases with SQL Server 2000 Enterprise Edition", 2005)

"A chunk of code that executes when a specified event occurs, usually before or after an INSERT, UPDATE, or DELETE command." (Gavin Powell, "Beginning Database Design", 2006)

"A database method that is automatically invoked as the result of Data Manipulation Language (DML) activity within a persistence mechanism." (Pramod J Sadalage & Scott W Ambler, "Refactoring Databases: Evolutionary Database Design", 2006)

"A stored procedure that is fired when data is modified from a table using any of the three modification statements DELETE, INSERT, or UPDATE. FOR and AFTER are synonymous and are usually implied when referring to triggers rather than INSTEAD OF triggers. Triggers are often created to enforce referential integrity or consistency among logically related data in different tables." (Thomas Moore, "MCTS 70-431: Implementing and Maintaining Microsoft SQL Server 2005", 2006)

"A stored procedure that executes when certain conditions occurs such as when a record is created, modified, or deleted. Triggers can perform special actions such as creating other records or validating changes." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"A type of stored procedure that fires in response to action on a table. DML triggers are associated with INSERT, UPDATE, and DELETE statements. DDL triggers are associated with CREATE, ALTER, and DROP statements." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"Stored in, and managed by, your database server, this software is executed when a certain event occurs. These events can range from information creation or modification to structural changes to your database. When the event occurs, the trigger is executed, causing a pre-determined set of actions to take place. These actions can encompass data validation, alerts, warnings, and other administrative operations. Triggers can invoke other triggers and stored procedures." (Robert D. Schneider and Darril Gibson, "Microsoft SQL Server 2008 All-In-One Desk Reference For Dummies", 2008)

"A stored procedure that executes in response to a Data Manipulation Language (DML) or Data Definition Language (DDL) event." (Jim Joseph, "Microsoft SQL Server 2008 Reporting Services Unleashed", 2009)

"A SQL program module that is executed when a specific data modification activity occurs. Triggers are stored in the database they manipulate." (Jan L Harrington, "SQL Clearly Explained" 3rd Ed., 2010)

"A stored procedure that can be triggered and executed automatically when a database operation such as insert, update, or delete takes place." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A procedural SQL code that is automatically invoked by the relational database management system upon the occurrence of a data manipulation event." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"A software routine guaranteed to execute when an event occurs. Often a trigger will monitor changes to data values. A trigger includes a monitoring procedure, a set or range of values to check data integrity, and one or more procedures invoked in response, which may update other data or fulfill a data subscription." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"An event that causes a handler to be run." (Jon Orwant et al, "Programming Perl, 4th Ed.", 2012)

"An event-driven specialized procedure that is attached to database tables; typically implemented to support data integrity requirements." (Craig S Mullins, "Database Administration", 2012)

"A database object that is associated with a single base table or view and that defines a rule. The rule consists of a set of SQL statements that runs when an insert, update, or delete database operation occurs on the associated base table or view." (IBM, "Informix Servers 12.1", 2014)

"A database object that is associated with a single base table or view and that defines a rule. The rule consists of a set of SQL statements that runs when an insert, update, or delete database operation occurs on the associated base table or view." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

 "A PL/SQL or Java procedure that fires when a table or view is modified or when specific user or database actions occur. Procedures are explicitly run, whereas triggers are implicitly run." (Oracle, "Oracle Database Concepts")

"A stored procedure that executes in response to a data manipulation language (DML) or data definition language (DDL) event." (Microsoft Technet,)
