SQL Troubles: data preparation

Showing posts with label data preparation. Show all posts

11 February 2023

💎🏭SQL Reloaded: Misusing Views in Serverless SQL Pool

The lack of user-defined tables in serverless SQL pool is probably one of the biggest gaps one needs to overcome as developer in Azure Synapse. On the other side, given that views are "virtual tables", views can be misused to store small chunks of data that don't change that often.

In the process there are two activities involved - preparing the data, respectively creating the view. Some developers might prefer preparing the data in Excel though, when possible, I prepare the data directly from the source system. For example, the first four columns from the below query can be used for the first approach, while the last column prepares the data as needed by a VALUES.

-- Product subcategories denormalized
SELECT PCT.ProductSubcategoryID
, PCT.ProductCategoryID
, PSC.Name Category
, PCT.Name Subcategory 
, CONCAT(', (''', PCT.ProductSubcategoryID, ''', ''', PCT.ProductCategoryID, ''', ''', PSC.Name, ''', ''', PCT.Name, ''')') [Values] 
FROM Production.ProductSubcategory PCT
     JOIN Production.ProductCategory PSC
	   ON PCT.ProductCategoryID = PSC.ProductCategoryID

Independently of the method used for data preparation, the view can be built as follows:

-- creating the view
CREATE VIEW dbo.ProductCategories
AS
-- Product categories
SELECT Cast(DAT.ProductSubcategoryID as int) ProductSubcategoryID
, Cast(DAT.ProductCategoryID as int) ProductCategoryID
, Cast(DAT.Category as nvarchar(255)) Category
, Cast(DAT.Subcategory  as nvarchar(255)) Subcategory
, Dense_Rank() OVER (PARTITION BY DAT.Category, DAT.Subcategory

         ORDER BY DAT.ProductSubcategoryID, DAT.ProductCategoryID) Ranking
FROM (-- prepared data
VALUES ('1', '1', 'Bikes', 'Mountain Bikes')
, ('2', '1', 'Bikes', 'Road Bikes')
, ('3', '1', 'Bikes', 'Touring Bikes')
, ('4', '2', 'Components', 'Handlebars')
, ('5', '2', 'Components', 'Bottom Brackets')
, ('6', '2', 'Components', 'Brakes')

) DAT(ProductSubcategoryID, ProductCategoryID, Category, Subcategory)

-- testing the view
SELECT *
FROM dbo.ProductCategories

Observe that for each column was defined explicitly a data type and, even if the definition might change, it's still a better idea than relying on the database engine for inferring the data type. Secondly, I prefer to include also a "Ranking" column based on some attributes I'm expecting to be unique over the whole dataset. This, just in case some duplicate might make its way into the dataset. This is not a must, but something to consider in exceptional cases.

The alternative would be to save the same data into a file and make it available as a CETAS. For small datasets the overhead can be minimal. This can be a better idea if the users need to modify the data manually. On the other side, storing the definition of a view into an SQL file and making corrections as needed can prove to be faster for a developer, at least until the dataset stabilizes and less changes are needed.

It's important to understand that in this way a feature is misused and there are more likely some penalties deriving from it. As mentioned in the beginning, the dataset must be small (e.g. maximum a few hundreds of records) and change seldom. In extremis, it's not advisable to build a solution based on this!

One scenario I had to use this option was creating a dataset based on recursive logic built inside of a stored procedure, which would output the data in the form needed by the view, as done above. This was used as workaround, as I couldn't save the data via a pipeline. (I had a deja vu from the first attempts of exporting data in SSIS 2005!)

In theory the maintenance of such a view can be automated via a pipeline, if the effort makes sense, and the solution is stable enough.

I blogged some years back on Misusing Views and Pseudo-Constants.

Happy coding!

Previous Post <<||>> Next Post

15 February 2018

🔬Data Science: Data Preparation (Definitions)

Data preparation: "The process which involves checking or logging the data in; checking the data for accuracy; entering the data into the computer; transforming the data; and developing and documenting a database structure that integrates the various measures. This process includes preparation and assignment of appropriate metadata to describe the product in human readable code/format." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Data Preparation describes a range of processing activities that take place in order to transform a source of data into a format, quality and structure suitable for further analysis or processing. It is often referred to as Data Pre-Processing due to the fact it is an activity that organises the data for a follow-on processing stage." (experian) [source]

"Data preparation [also] called data wrangling, it’s everything that is concerned with the process of getting your data in good shape for analysis. It’s a critical part of the machine learning process." (RapidMiner) [source]

"Data preparation is an iterative-agile process for exploring, combining, cleaning and transforming raw data into curated datasets for self-service data integration, data science, data discovery, and BI/analytics." (Gartner)

"Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. It is an important step prior to processing and often involves reformatting data, making corrections to data and the combining of data sets to enrich data." (Talend) [source]

SQL Troubles

Pages

11 February 2023

💎🏭SQL Reloaded: Misusing Views in Serverless SQL Pool

15 February 2018

🔬Data Science: Data Preparation (Definitions)

About Me