04 February 2011

Deleting Sequential Data From a Table

    Last week I run into an interesting solution to a simple problem, problem I dealt with in several occasions too: how to delete (efficiently) sequential data, or how LuborK calls it in his post, “a problem of efficient ordered delete”.  In this case he’s talking about the deletion of the first n records from a dataset of sequential data. It sounds like a simple problem considering that we can easily select in a first step the first n records using the TOP function and then delete the respective matched records. That’s also what LuborK does, however the novelty of the solution proposed it’s a little unexpected, and this not from the point of view of the approach, but of the gain in performance. The bottom line: by encapsulating the inner query within a view, the deletion of sequential data becomes more efficient.

    While reading the respective post I thought: what if the inner query is encapsulated in a inline table-valued function?! Would the performance remain the same or deprecate? Does it make sense to add the overhead of creating a view for each such scenario in order to gain some performance? Then, reading the comments, one of them was pointing something so obvious: the inner query can be encapsulated within a common table expression, following to perform on it the deletion. An elegant solution I haven’t had available under SQL Server 2000 at the times I had to solve the same problem. Actually my approach was a little different, at that time identifying and storing the greatest value of a new data set, building thus a collection of values that would allow me to select a given range. This approach was quite useful in the respective scenarios and quite acceptable as performance, plus the fact that I was reusing the greatest value in further logic – two birds with one shot.

    This being said, I was thinking that except the methods provided by LuborK, I could provide in this post several other simple techniques. Please note that I’m not focusing on performance, but on the techniques. Eventually you can display the statistics related to the disk activity and time required to parse, compile and execute each batch of statements. So, let’s start with the creation and population of the table! In order to minimize the differences, I took the same example and added just a little make-up – comments and a small change in the way the values of the second column are inserted.
-- dropping the table
DROP TABLE dbo.TestTable

--declaring the table
CREATE TABLE dbo.TestTable (
  id int primary key

,
value varchar (50))


-- inserting the records in a loop
DECLARE @index int
SET
@index=1

SET
NOCOUNT ON

WHILE
(@index<100000)

BEGIN

    INSERT INTO dbo.TestTable
    VALUES (@index,'test '+cast(@index as varchar(10)))

    SET @index=@index+1

END


    Supposing we are interested in the first 10000 of records of a sequential data set, it would be much easier if we would know the maximum value from the respective data set. Then we could use the value to perform a simple deletion:
-- deletion based on stored max value
DECLARE @max int
SELECT
@max = MAX(ID)

FROM
(

    SELECT top(10000) id
    FROM TestTable
    ORDER BY ID

)
A


--deleting the data
DELETE FROM TestTable
WHERE
id <= @max


    If we don’t need the maximum value for further logic, the two statements in can be combined in one query:
-- deletion based on max value - within one query
DELETE FROM TestTable
WHERE
id <= (

     SELECT MAX(ID)

     FROM (

          SELECT top(10000) id
          FROM dbo.TestTable
          ORDER BY ID
     ) A

)

    When dealing with huge volumes of data, and not only then, in case of sequential processing we can store the maximum value in a given table, and pull the values as required. This approach allows us to process any interval within the data set, in this case the delete resumes to a simple delete statement in which @start and @end were retrieved from the respective table:
-- deleting the data within a given interval
DELETE FROM dbo.TestTable
WHERE
id BETWEEN @start AND @end


   If this seems trivial, let’s spice up things! As I observed, few people know that it’s possible to delete data from a given table using an inline table-valued function. Huh, what’s that?! If it seems impossible to you, take a look at documentation! I’m referring here to the GRANT object permissions section in which of interest is the following statement that refers to the ANSI 92 permissions applicable to the various database objects, for table-valued function being listed the following permissions: DELETE, INSERT, REFERENCES, SELECT, UPDATE. In order to exemplify the idea, let’s create the UDF:
-- creating the UDF function
CREATE FUNCTION dbo.TestUDF1(
@start int

,
@end int)

RETURNS
TABLE
AS
RETURN (
    SELECT id, value
    FROM dbo.TestTable
    WHERE id BETWEEN @start AND @end
)


--deleting the data
DELETE FROM dbo.TestUDF1(20001, 30000)

    Voila! The example is supposed to work, unless you are having a problem with the permissions. Let’s consider this functionality from the perspective of our problem! What LuborK does is to include the inner query in a view and then delete the data:
-- creating the view
CREATE VIEW dbo.vTestView
AS
SELECT
top(10000) id

FROM
dbo.TestTable

ORDER
BY ID


--deleting the data
DELETE FROM dbo.vTestView

    One of the problems with this approach is that the value representing the number of records to be returned is hardcoded. And here comes to help an inline table-valued function, also referred as a “parameterized view”:
-- creating the UDF function
CREATE FUNCTION dbo.TestUDF2(
@number_records int)

RETURNS
TABLE
AS
RETURN (
    SELECT top (@number_records) id, value
    FROM dbo.TestTable
    ORDER BY id
)


--deleting the data
DELETE FROM dbo.TestUDF2(10001)

    And, as closure, the example based on the common table expression:
-- deletion using a CTE
;WITH CTE
AS
(

     SELECT top(10000) id
     FROM dbo.TestTable
     ORDER BY ID

)

DELETE
FROM CTE

No comments: