SQL Troubles: TOP

Showing posts with label TOP. Show all posts

04 February 2024

SQL Reloaded: Microsoft Fabric's Delta Tables in Action - CRUD Operations I (Basic Operations)

When building a data-related solution, the most basic functionality for a SQL developer is the CRUD (create, read, update, delete) operations. That's why I decided to test the feature on Lakehouse delta tables, for a first demonstrative post on Microsoft Fabric.

Unfortunately, at least for the moment, the SQL Endpoint allows only read access to the delta tables, which frankly for an SQL developer is an important limitation. There's however the possibility of running SQL over Spark SQL, which supports a basic dialect of SQL (similar to HiveSQL, though it might lack some features). To run the code, you'll need to create a notebook and you can set the language to Spark SQL for the whole notebook. Make sure, that you have a Lakehouse assigned in the notebook. Of course, you'll need access to Microsoft Fabric.

Create a cell for each of the following blocks of code. To make sure that the code works, it might be a good idea to run the cells individually!

-- drop the test table (if exists already)
DROP TABLE IF EXISTS Assets;

--create the test table
CREATE TABLE Assets(
 Id int NOT NULL,
 CreationDate timestamp NOT NULL,
 Vendor string NOT NULL,
 Asset string NOT NULL,
 Model string NOT NULL,
 Owner string NOT NULL,
 Tag string NOT NULL,
 Quantity decimal(13, 2) NOT NULL
) 
USING DELTA;

-- insert test data
INSERT INTO Assets
VALUES (1, '2024-03-01T00:00:00Z', 'IBM', 'Laptop 1','Model 1','Owner 1','XX0001','1');

-- insert more test records 
INSERT INTO Assets
VALUES ('2', '2024-02-01T00:00:00Z','IBM','Laptop 2','Model 2','Owner 2','XX0001','1')
, ('3', '2024-02-01T00:00:00Z','Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('4', '2024-01-01T00:00:00Z','Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('5', '2024-01-15T00:00:00Z','Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
, ('6', '2024-01-16T00:00:00Z','Dell','Laptop 4','Model 4','Owner 4','DD0001','1');

-- retrieving all the records
SELECT *
FROM Assets
ORDER BY Vendor;

-- updating a record
UPDATE Assets 
SET CreationDate = '2024-01-19T00:00:00Z'
WHERE Id = 6;

-- deleting a record
DELETE FROM Assets
WHERE Id = 2;

-- reviewing the changes
SELECT *
FROM Assets
ORDER BY Vendor;

Comments:
[1] There seems to be no way to provide a schema (see the SQL syntax for Spark), however the dbo schema seems to be used in the background (see table metadata post). Calling a table with the schema in SQL Spark results in an error:
"[TABLE_OR_VIEW_NOT_FOUND] The table or view `<schema>`.`<object_name>` cannot be found. Verify the spelling and correctness of the schema and catalog."

[2] The data types are different from the ones in SQL Server (e.g. timestamp for dates, string instead of varchars). Dates and decimals support precision, while strings translate to a varchar with 8000 characters (see table metadata post).

[3] Implicit conversions seem to occur when the format is correct. Otherwise, an explicit conversion is needed.

[4] Consider using semicolons at the end of each statement. That's mandatory when running multiple statements within the same cell. Otherwise, an error results.

[5] It would be interesting to test what's the performance when doing CRUD operations on large datasets.

[6] One can use the LIMIT x clause as alternative for TOP x from T-SQL:

-- reviewing the last 3 records
SELECT *
FROM Assets
ORDER BY CreationDate
LIMIT 3;

[7] SELECT FROM VALUES seems to work as well, though the data types must be the same (no implicit conversions occur):

SELECT *
FROM (
VALUES ('1', '2024-03-01T00:00:00Z', 'IBM', 'Laptop 1','Model 1','Owner 1','XX0001','1')
, ('2', '2024-02-01T00:00:00Z','IBM','Laptop 2','Model 2','Owner 2','XX0001','1')
) DAT(Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity);

Therefore, when bringing the two INSERTs together, you'll need to change the first value from the first INSERT for the ID from numeric to string.

You can use the DESCRIBE QUERY command to troubleshoot the differences:

 -- retrieve query's output metadata
 DESCRIBE QUERY 
 VALUES (1, '2024-03-01T00:00:00Z', 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1') 
 AS DAT(Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity)

Output:

col_name	data_type	comment
Id	int	NULL
CreationDate	string	NULL
Vendor	string	NULL
Asset	string	NULL
Model	string	NULL
Owner	string	NULL
Tag	string	NULL
Quantity	string	NULL

Happy coding!

Resources:
[1] Microsoft Learn (2023) How to use Microsoft Fabric notebooks (link)
[2] Apache Spark (2023) SQL syntax (link)

05 February 2011

SQL Reloaded: Deleting Sequential Data From a Table

    Last week I run into an interesting solution to a simple problem, problem I dealt with in several occasions too: how to delete (efficiently) sequential data, or how LuborK calls it in his post, “a problem of efficient ordered delete”. In this case he’s talking about the deletion of the first n records from a dataset of sequential data. It sounds like a simple problem considering that we can easily select in a first step the first n records using the TOP function and then delete the respective matched records. That’s also what LuborK does, however the novelty of the solution proposed it’s a little unexpected, and this not from the point of view of the approach, but of the gain in performance. The bottom line: by encapsulating the inner query within a view, the deletion of sequential data becomes more efficient.

    While reading the respective post I thought: what if the inner query is encapsulated in a inline table-valued function?! Would the performance remain the same or deprecate? Does it make sense to add the overhead of creating a view for each such scenario in order to gain some performance? Then, reading the comments, one of them was pointing something so obvious: the inner query can be encapsulated within a common table expression, following to perform on it the deletion. An elegant solution I haven’t had available under SQL Server 2000 at the times I had to solve the same problem. Actually my approach was a little different, at that time identifying and storing the greatest value of a new data set, building thus a collection of values that would allow me to select a given range. This approach was quite useful in the respective scenarios and quite acceptable as performance, plus the fact that I was reusing the greatest value in further logic – two birds with one shot.

    This being said, I was thinking that except the methods provided by LuborK, I could provide in this post several other simple techniques. Please note that I’m not focusing on performance, but on the techniques. Eventually you can display the statistics related to the disk activity and time required to parse, compile and execute each batch of statements. So, let’s start with the creation and population of the table! In order to minimize the differences, I took the same example and added just a little make-up – comments and a small change in the way the values of the second column are inserted.

-- dropping the table 
DROP TABLE dbo.TestTable 

--declaring the table 
CREATE TABLE dbo.TestTable ( 
  id int primary key 
, value varchar (50))  

-- inserting the records in a loop 
DECLARE @index int  
SET @index=1  
SET NOCOUNT ON  
WHILE (@index<100000)  
BEGIN 
    INSERT INTO dbo.TestTable  
    VALUES (@index,'test '+cast(@index as varchar(10))) 
    SET @index=@index+1  
END

Supposing we are interested in the first 10000 of records of a sequential data set, it would be much easier if we would know the maximum value from the respective data set. Then we could use the value to perform a simple deletion:

-- deletion based on stored max value DECLARE @max int 
SELECT @max = MAX(ID) 
FROM( 
    SELECT top(10000) id 
    FROM TestTable  
    ORDER BY ID 
) A 

--deleting the data  
DELETE FROM TestTable 
WHERE id <= @max

If we don’t need the maximum value for further logic, the two statements in can be combined in one query:

-- deletion based on max value - within one query DELETE FROM TestTable 
WHERE id <= ( 
     SELECT MAX(ID) 
     FROM ( 
          SELECT top(10000) id 
          FROM dbo.TestTable  
          ORDER BY ID 
     ) A 
) 

--deleting the data  
DELETE FROM TestTable 
WHERE id <= @max

When dealing with huge volumes of data, and not only then, in case of sequential processing we can store the maximum value in a given table, and pull the values as required. This approach allows us to process any interval within the data set, in this case the delete resumes to a simple delete statement in which @start and @end were retrieved from the respective table:

-- deleting the data within a given interval 
DELETE FROM dbo.TestTable 
WHERE id BETWEEN @start AND @end

If this seems trivial, let’s spice up things! As I observed, few people know that it’s possible to delete data from a given table using an inline table-valued function. Huh, what’s that?! If it seems impossible to you, take a look at documentation! I’m referring here to the GRANT object permissions section in which of interest is the following statement that refers to the ANSI 92 permissions applicable to the various database objects, for table-valued function being listed the following permissions: DELETE, INSERT, REFERENCES, SELECT, UPDATE. In order to exemplify the idea, let’s create the UDF:

-- creating the UDF function 
CREATE FUNCTION dbo.TestUDF1( 
@start int 
, @end int) 
RETURNS TABLE 
AS RETURN ( 
    SELECT id, value  
    FROM dbo.TestTable  
    WHERE id BETWEEN @start AND @end 
)  

--deleting the data 
 DELETE FROM dbo.TestUDF1(20001, 30000)

Voila! The example is supposed to work, unless you are having a problem with the permissions. Let’s consider this functionality from the perspective of our problem! What LuborK does is to include the inner query in a view and then delete the data:

-- creating the view 

CREATE VIEW dbo.vTestView 
AS 
SELECT top(10000) id 
FROM dbo.TestTable  
ORDER BY ID  


--deleting the data 
DELETE FROM dbo.vTestView

One of the problems with this approach is that the value representing the number of records to be returned is hardcoded. And here comes to help an inline table-valued function, also referred as a “parameterized view”:

 -- creating the UDF function 
 CREATE FUNCTION dbo.TestUDF2( 
@number_records int) 
RETURNS TABLE 
AS RETURN ( 
    SELECT top (@number_records) id, value  
    FROM dbo.TestTable  
    ORDER BY id 
)        

--deleting the data 
DELETE FROM dbo.TestUDF2(10001)

And, as closure, the example based on the common table expression:

 -- deletion using a CTE 
;WITH CTE  
AS ( 
     SELECT top(10000) id 
     FROM dbo.TestTable  
     ORDER BY ID 
)  
DELETE FROM CTE

03 February 2010

SQL Reloaded: Ways of Looking at Data III

In the previous posts from the same cycle (see Part I and Part II) I shown how the ranking window functions can be used in SQL Server in order to show the first/last record appearing in a certain partition. A broader set of reports are the ones focusing on the top n records matching a certain criteria, for example the top 10 Vendors/Products based on the amount purchased. Before the introduction of ranking window functions in SQL Server 2005 such requirements were previously addressed using the TOP clause. The following two queries are showing the top 10 Vendors, respectively the top 10 Vendors/Products based on the amount purchased, the first query focusing on the active Vendors while the second on all Vendors – now it depends on the requirements whether only the active or all Vendors are needed, therefore the Users should be always asked about this concern.

In contrast the use of ranking/aggregated window functions allows more flexibility in selecting not only the first n Vendors, but any interval, thus being possible to use this feature as a simple pagination technique. Here is the first query modified.

As for the second query the TOP clause allows only to select the top 10 Vendors/Products based on the amount purchased, not being possible to select each top n Vendors for each purchased Product as ranking functions allow. Here is the query for the top 3 Vendors for a Product.

The logic for the current example could be reused in further analysis, therefore it makes sense to encapsulated it in a database object. A view would do, though if we would like to use the number of Vendors as parameter, then it’s more appropriate to use a table-valued function or a stored procedure for this. Even if in theory stored procedures provide better performance, if we like to include the above logic in further queries then creating a table-valued function is the best choice.

Once the table-valued function was created we can go on and use it in other queries, for example here is a simple query showing maximum 5 vendors for each product.

Most probably the Users used with the pivoting functionality of Excel would ask you if it’s possible to show all the Vendors together in one line. Starting with SQL Server 2005 Microsoft introduced the PIVOT operator though it can pivot only one attribute, fortunately there is an alternative using a GROUP BY together with the CASE function.

The downside of this technique is that you can work only with a predefined number of Vendors per Product (here 3), on the other side there are few the cases in which more than 5 Vendors exist for the same Product. An alternative would be to process the data using a programming language as VBA, not a complicated thing to do at all. In almost 10 years of experience I found only a few cases in which the functionality provided by RDBMS needed to be extended with additional coding outside of the RDBMS (e.g. Excel, ASP), mainly when the data needed to be organized beyond the simple tabular structure.

Previous Post <<||>> Next Post

31 August 2008

SQL Server New Features: ROWCOUNT in action

Especially when working with big tables, the default behaviour of Query Analyzer is to not show the output until the last record has been fetched. This can be time and resource consuming and therefore I’ve appreciated the fact that TOAD and SQL Developer are fetching only a certain number of records. Now I can see that same can be done starting with SQL Server 2005 onward by modifying ROWCOUNT server property using Query/Query Options menu functionality.

Query Options under SQL Server 2008 or by running the command: SET ROWCOUNT <number of records>; Of course somebody may limit the number of records returned by a query using TOP function when working with SQL Server and ROWNUM in Oracle, though I find it not always handy – it depends from case to case. There are also technical implications between the two types of usage, according SQL Server Books online it is recommended to TOP with SELECT over ROWCOUNT with regard to scope and query optimization, however in this context only the later makes sense:

"As a part a SELECT statement, the query optimizer can use the value of expression in the TOP clause as part of generating an execution plan for a query. Because SET ROWCOUNT is used outside a statement that executes a query, its value cannot be used to generate a query plan for a query."

Notes: 1. Do not mix the ROWNUM with @@ROWNUM function which returns the number of rows affected by the last statement. 2. Some of us list all the records in order to see the number of records returned by a query, though that’s totally not advisable!

SQL Troubles

Pages