A Software Engineer and data professional's blog on SQL, data, databases, data architectures, data management, programming, Software Engineering, Project Management, ERP implementation and other IT related topics.
05 February 2010
🎡SSIS: Wizarding an SSIS Package (First Magic Class)
Step 1: Start SQL Server Import and Export Wizard
From SQL Server Management Studio choose the database from which you want to export the data (e.g. AdventureWorks), right click on it and from the floating menu choose Tasks/Export Data. This action brings up the ‘SQL Server Import and Export Wizard’, used, as its name indicates, for importing and exporting data on the fly from/to SQL Server or any other source.
Step 2: Choose a Data Source
In the ‘Choose a Data Source’ step select the ‘Data Source’ (SQL Server Native Client 10.0 for exporting data from SQL Server), choose the ‘Server name’ from the list of available SQL Servers, select the Authentication mode and the Database (e.g. AdventureWorks), then proceed to the next step by clicking ‘Next’.
Step 3: Choose a Destination
In ‘Choose a Destination’ step select the Destination, in this case ‘Microsoft Excel’, browse for the Excel file to which you want to export the data (e.g. Products.xls) and check ‘First row has column names’ in case you want to include the column names, then proceed to the next step by clicking ‘Next’.
Step 4: Specify Table Copy or Query
In the ‘Specify Table Copy or Query’ step there are two options you could choose: ‘Copy data from one or more tables or views’ or ‘Write a query to specify the data to transfer’. Their names speak for themselves. The second option allows for more flexibility, and you could just copy and paste the query used for your export. In this example just go with the first option and then proceed to the next step by clicking ‘Next’.
Step 5: Select Source Tables and Views
In ‘Select Source Tables and Views’ step select the database objects (e.g. Production.vProducts) from which you’ll export the data, then proceed to the next step by clicking ‘Next’. If the destination allows it, it’s possible to choose more than one database object.
Step 6: Review Data Type Mapping
In the ‘Review Data Type Mapping’ step you can typically go with the provided defaults, so just proceed to the next step by clicking ‘Next’.
Step 7: Save and Run Package
In the ‘Save and Run Package’ step you could choose to ‘Run immediately’ the package and/or ‘Save SSIS Package’ to the ‘SQL Server’ or locally to the ‘File System’. Saving the package locally allows you to modify and rerun the package at a later date. For this example check ‘Run immediately’ as well as the ‘Save SSIS Package’ and ‘File System’ options, then proceed to the next step by clicking ‘Next’.
Step 8: Save SSIS Package
In ‘Save SSIS Package’ step provide the intended Name (e.g. Export Products) or Description of the Package, choose the location where the package will be saved, then proceed to the next step by clicking ‘Next’.
Step 9: Complete the Wizard
The ‘Complete the Wizard’ step allows reviewing the choices made in the previous steps and, if needed, navigating back to them in order to make the necessary changes. Once you have checked the details, proceed to the next step by clicking ‘Next’.
Step 10: Executing the Package
In the ‘Executing the Package’ step the Package is run, the progress being shown for each step. Just close the wizard once you have reviewed the steps.
If the package ran successfully you can go on and check the exported data, apply additional formatting, etc.
Step 11: Inspecting the Package
The package can be reopened and modified in Microsoft Visual Studio or SQL Server Business Intelligence Development Studio. For this just open one of the two environments and from the Main Menu select File/Open File, then browse for the location where the Package (e.g. Export Products.dtsx) was saved and open the file. Here is the package created by the Wizard:
As can be seen the package contains two tasks, the ‘Preparation SQL Task 1’ which creates the table (e.g. vProducts) in the Excel file, and the ‘Data Flow Task 1’ which dumps the data in the created table. By double clicking the ‘Data Flow Task 1’ you could see its content:
- the ‘Source - vProducts’ OLEDB Data Source holding the connectivity information to the SQL Server and the list of Columns in scope.
- the ‘Data Conversion’ Transformation that allows converting the data between Source and Destination, given the fact that Excel has different data types than SQL Server.
- the ‘Destination - vProducts’ Excel Destination holding the connectivity information for the Excel file and the Mappings in place.
You could explore the properties of each object in order to learn more about the attributes used and the values they took.
Ok, so you’ve opened the package, how do you run it again? For this it is enough to double click on the locally saved Package (e.g. Export Products.dtsx), an action that brings up the ‘Execute Package Utility’, and click on the ‘Execute’ button to run the Package.
If you haven’t made any changes to the Excel file to which you moved the data (e.g. Products.xls) and haven’t moved the file, then the package will fail when it attempts to run the ‘Preparation SQL Task 1’ task because a table with the expected name already exists in the file.
In order to avoid this error you’ll have to delete from the Excel file the sheet (e.g. vProducts) where the data were dumped and save the file. Now you could execute the package again and it should run without additional issues.
If you’ve removed/deleted the Excel file, even if you get a warning that the Excel file is not available, the package will run in the end without problems. Instead of deleting the sheet manually you could use the File System Task to move the file to another location, though that’s a topic for another post.
🕋Data Warehousing: Star Schema (Definitions)
"A relational database structure in which data is maintained in a single fact table at the center of the schema with additional dimension data stored in dimension tables. Each dimension table is directly related to the fact table by a key column." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)
"A star schema is a dimensional data model implemented on a relational database." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)
"A star schema is a set of tables comprised of a single, central fact table surrounded by dimension tables. Each dimension is represented by a single dimension table. Star schemas implement dimensional data structures with denormalized dimensions. Snowflake schemas are an alternative to a star schema design." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)
"A single fact table surrounded by a single hierarchical layer of dimensional tables, in a data warehouse database." (Gavin Powell, "Beginning Database Design", 2006)
"A single fact table which joins to many dimension tables, each of which is a single denormalized dimension table." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)
"The instantiation of a dimensional model in a relational database. A star schema consists of a fact table and the dimension tables that it references. The fact table contains facts and foreign keys; the dimension tables contain dimensional attributes by which the facts will be filtered, rolled up, or grouped." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)
"The implementation of a dimensional model in a relational database. The tables are organized around a single central fact table possessing a multi-part key, and each surrounding dimension table has its own primary key." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)
"The basic form of data organization for a data warehouse, consisting of a single large fact table and many smaller dimension tables." (Toby J Teorey, "Database Modeling and Design" 4th Ed., 2010)
"The arrangement of the collection of fact and dimension tables in the dimensional data model, resembling a star formation, with the fact table placed in the middle surrounded by the dimension tables. Each dimension table is in a one-to-many relationship with the fact table." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)
"A common form of a dimensional data model, where a fact table is directly linked by foreign keys to several dimension tables." (Craig S Mullins, "Database Administration", 2012)
"A relational database structure in which data is maintained in a single fact table at the center of the schema with additional dimension data stored in dimension tables. Each dimension table is directly related to and usually joined to the fact table by a key column." (Microsoft, "SQL Server 2012 Glossary", 2012)
"A relational schema whose design represents a dimensional data model. The star schema consists of one or more fact tables and one or more dimension tables that are related through foreign keys." (Oracle, "Database SQL Tuning Guide Glossary", 2013)
"A type of relational database schema that is composed of a set of tables comprising a single, central fact table surrounded by dimension tables. See also dimension table, star join." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)
💎SQL Reloaded: Just in CASE IV (Other Scenarios)
As can be seen, on more than one branch of a CASE there is a correlated sub-query based on the same table, the aggregated value being used for further calculations. Most probably the respective query only had the purpose of demonstrating a technique, though I have two main observations related to it:
1. The query returns a label showing whether the Stock is High, Normal, Low or there is no Stock, though I have no measure of what the Stock actually is, thus in a first phase I can’t validate whether the query shows the correct data. In case a report is based on the respective query, this also considerably reduces the report’s usability.
2. A select is replicated across several CASE branches, a fact that I find increases the query’s complexity and makes it harder to understand and maintain, while the need to keep the logic in synchronization increases the chances of making a mistake. At least in this situation, with only one table involved, the logic is quite simple, though what do you do when the correlated sub-query is more complex?
How about encapsulating the query’s logic in a sub-query and joining it to the main tables using a left join? This approach would address both issues mentioned above:
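The original listings are not reproduced here, so what follows is only a minimal sketch of the left join approach, run against SQLite through Python's standard sqlite3 module rather than against SQL Server. The two toy tables stand in for Production.Product and Production.ProductInventory, and the rows, the thresholds (100 and 500) and the labels are assumptions made up for the illustration.

```python
import sqlite3

# Toy stand-ins for Production.Product and Production.ProductInventory;
# the rows and the stock thresholds are invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product (ProductID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE ProductInventory (ProductID INTEGER, LocationID INTEGER, Quantity INTEGER);
    INSERT INTO Product VALUES (1, 'Bearing'), (2, 'Blade'), (3, 'Chain'), (4, 'Crank');
    -- product 1 has 700 on hand, product 2 has 150, product 3 has 50,
    -- product 4 has no inventory rows at all
    INSERT INTO ProductInventory VALUES
        (1, 1, 400), (1, 6, 300),
        (2, 1, 150),
        (3, 1, 40), (3, 6, 10);
""")

stock_query = """
    SELECT p.ProductID
         , p.Name
         , COALESCE(i.Quantity, 0) AS Quantity   -- the actual stock is visible
         , CASE
               WHEN COALESCE(i.Quantity, 0) = 0 THEN 'No Stock'
               WHEN i.Quantity < 100 THEN 'Low Stock'
               WHEN i.Quantity < 500 THEN 'Normal Stock'
               ELSE 'High Stock'
           END AS Stock
    FROM Product p
         LEFT JOIN ( -- inventory aggregated only once per product
             SELECT ProductID, SUM(Quantity) AS Quantity
             FROM ProductInventory
             GROUP BY ProductID
         ) i ON p.ProductID = i.ProductID
    ORDER BY p.ProductID
"""
stock_rows = conn.execute(stock_query).fetchall()
for row in stock_rows:
    print(row)
```

The aggregated quantity is computed in one place, it can be shown next to the label, and it can be reused in further calculations.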
When creating a query there is always an important fact that needs to be considered: the query’s performance. The second query is simpler and in theory it should be easier to process, so it is expected to have at least similar performance to the first version. In this case the ProductID is a foreign key in Production.ProductInventory, the search performed on the respective table having minimum impact on performance. If no index is available on the searched attribute, most likely the first query’s performance will decrease considerably. The best approach for assessing the performance differences between the two queries is to look at the Client Statistics and the Execution Plan. As far as the Client Statistics are concerned both queries have similar performance, while the Execution Plan of the second query is simpler and, without going into details, seems to be the better one.
Maybe it makes sense to use correlated sub-queries in a CASE only when the correlated sub-queries have distinct logic, though even then I would recommend using a left join instead, this technique allowing more flexibility, as it is possible to show the actual values in the query and even reuse them if needed in additional calculations.
Talking about correlated queries, I found cases in which two distinct correlated sub-queries with the same logic were used to pull two distinct attributes available in the same table, something like in the below example.
What if the correlated sub-queries are used again with the CASE function like in the below example?
Even if the number of calls to the correlated sub-queries is maybe reduced by using a CASE, I would recommend using a left join instead, this technique offering more flexibility, the logic also becoming much easier to maintain and debug.
04 February 2010
💎SQL Reloaded: Just In CASE III (Quest for Performance)
Given the fact that the CASE function stops after the first match, in case there are no two expressions that evaluate to true for the same value, the CASE could be optimized for the best performance by arranging the branches in descending order of the number of matches. For example taking the logic from the 3rd example from Part I and checking the number of occurrences for each value, surprisingly (or maybe not) there are 257 NULL values, 97 ‘L’ values, 82 ‘H’ values, respectively 68 ‘M’ values. If we change the branches in this order, the number of evaluations will decrease from 82*1+68*2+97*3+257*3=1280 to 257*1+97*2+82*3+68*3=901. The difference is small, though when applied to big data sets the improvement could be worth the effort. Here is the modified logic for Example 3:
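The arithmetic above can be double checked with a few lines of Python. This is only a sketch of the counting model used in the text: a value matched on the k-th WHEN branch costs k comparisons, while a value handled by ELSE costs one comparison per WHEN branch. The function name and the assumption that both versions have three WHEN branches (with ‘M’ falling through to ELSE in the reordered version) are mine, inferred from the two sums.

```python
# Occurrences of each Class value, as quoted in the text (None stands for NULL).
counts = {"H": 82, "M": 68, "L": 97, None: 257}

def total_evaluations(branch_order, counts):
    """Sum the WHEN comparisons needed over all rows.

    A value matched on the k-th branch costs k comparisons; a value
    matching no branch costs one comparison per branch before the
    ELSE branch is taken.
    """
    total = 0
    for value, occurrences in counts.items():
        if value in branch_order:
            cost = branch_order.index(value) + 1
        else:
            cost = len(branch_order)
        total += occurrences * cost
    return total

original = total_evaluations(["H", "M", "L"], counts)    # NULL handled by ELSE
reordered = total_evaluations([None, "L", "H"], counts)  # 'M' handled by ELSE
print(original, reordered)  # 1280 901
```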
Rewriting Simple Unions
Have you ever seen a set of similar queries whose output is merged with the help of the UNION (ALL) operator? I’ve seen so many queries like the one below, in which the only difference between the sub-queries resided in the WHERE constraint(s) applied and maybe a few additional calculations!
The example is quite simple, though I think it demonstrates the point. The UNION could be translated to a simple CASE as in the 3rd example. The problem with such queries is that if there is no index on the attribute(s) used in the WHERE constraint, a full table scan will be performed for each sub-query, which is quite expensive from a performance standpoint. Even if the respective attribute is indexed, there’s still a price to pay in performance. Such an approach could maybe be acceptable when in each scenario different tables are joined to the main table, though even then it should be checked which one of the two approaches has better performance. A recent example I can recall, in which several CASE functions could have been used instead of the UNION, was based on the JOIN of two tables, the query looking something like:
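The original queries are not reproduced here. As a stand-in for the simple single-table scenario described above (not the two-table JOIN from the last sentence), here is a hedged sketch that runs both forms against SQLite via Python's sqlite3 module and compares their output. The Product table and its four rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product (ProductID INTEGER PRIMARY KEY, Name TEXT, Class TEXT);
    INSERT INTO Product VALUES
        (1, 'Frame', 'H'), (2, 'Fork', 'M'), (3, 'Seat', 'L'), (4, 'Bolt', NULL);
""")

# One sub-query per Class value: the table is read once per sub-query.
union_query = """
    SELECT ProductID, Name, 'High' AS Class FROM Product WHERE Class = 'H'
    UNION ALL
    SELECT ProductID, Name, 'Medium' FROM Product WHERE Class = 'M'
    UNION ALL
    SELECT ProductID, Name, 'Low' FROM Product WHERE Class = 'L'
    UNION ALL
    SELECT ProductID, Name, 'n/a' FROM Product WHERE Class IS NULL
    ORDER BY ProductID
"""

# The same output from a single pass over the table.
case_query = """
    SELECT ProductID, Name
         , CASE Class
               WHEN 'H' THEN 'High'
               WHEN 'M' THEN 'Medium'
               WHEN 'L' THEN 'Low'
               ELSE 'n/a'
           END AS Class
    FROM Product
    ORDER BY ProductID
"""

union_rows = conn.execute(union_query).fetchall()
case_rows = conn.execute(case_query).fetchall()
print(union_rows == case_rows)
```

Note that the two forms are equivalent only as long as Class takes no other values than the ones listed: the ELSE branch would also label an unexpected value as ‘n/a’, while the UNION version would silently drop the row.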
Rewriting Complex Unions
Even when the base table is used in several queries merged through unions, it doesn’t make sense to apply this technique all the time, especially when dealing with complex queries using multiple joins. It depends: there are scenarios and scenarios, and the trade-off between performance, readability and usability of a query must always be considered when applying any technique. The Person.Address table from AdventureWorks is a good way to exemplify two scenarios, one in which it makes sense and one in which it doesn’t make sense to combine the logic of union queries. The respective table stores the addresses for Customers (Sales & Individuals), Vendors and Employees, thus supposing that we would like to see the Owners and Owner Type for each Address we would need to create several unions.
Let’s consider first the Customer-based Addresses. As there are two types of Customers and because the details for each type are stored in different tables, we might end up creating the following query:
Similar queries could be written in order to get the Vendor and Employee details, and as can be seen these queries also share many similarities.
Pivoting Behavior
Another technique in which a CASE could help eliminate several joins to the same table is presented in a post on List Based Aggregations, the CASE function being used together with aggregate functions in order to cumulate the On Hand for several locations and show it on the same line. This approach was quite useful before the PIVOT operator was introduced in SQL Server 2005 (see Pivot operator example), and it still is in case more than 1 attribute needs to be selected for each line.
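The referenced post is not included here, so the following is just a minimal sketch of the CASE plus aggregate idea, executed on SQLite through Python's sqlite3 module. The table, the location ids (1, 6, 50) and the quantities are assumptions chosen for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ProductInventory (ProductID INTEGER, LocationID INTEGER, Quantity INTEGER);
    INSERT INTO ProductInventory VALUES
        (1, 1, 400), (1, 6, 300), (1, 50, 25),
        (2, 1, 150), (2, 50, 10);
""")

# Each CASE keeps the quantity of one location and zeroes out the others,
# so a single GROUP BY replaces one join per location.
pivot_query = """
    SELECT ProductID
         , SUM(CASE WHEN LocationID = 1  THEN Quantity ELSE 0 END) AS OnHandLocation1
         , SUM(CASE WHEN LocationID = 6  THEN Quantity ELSE 0 END) AS OnHandLocation6
         , SUM(CASE WHEN LocationID = 50 THEN Quantity ELSE 0 END) AS OnHandLocation50
         , SUM(Quantity) AS OnHandTotal
    FROM ProductInventory
    GROUP BY ProductID
    ORDER BY ProductID
"""
pivot_rows = conn.execute(pivot_query).fetchall()
for row in pivot_rows:
    print(row)
```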
💎SQL Reloaded: Just In CASE II (Clauses and Joins)
The use of CASE functions in the WHERE clause is very rare because it could be replaced with several OR operators; the only benefit comes maybe from the clarity of the syntax or from checking the type of the values before using them, as in the case of numeric and date values. Even if the below example doesn’t make sense from a business point of view, it shows a possible use of CASE within the WHERE clause.
03 February 2010
🕋Data Warehousing: Snowflake Schema (Definitions)
"An extension of a star schema such that one or more dimensions are defined by multiple tables. In a snowflake schema, only primary dimension tables are joined to the fact table. Additional dimension tables are joined to primary dimension tables. " (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)
"A variation on dimensional schema design in which dimension tables are further normalized, split into multiple tables based on hierarchies in the data." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)
"A single fact table which joins to many dimension tables, with each dimension normalized as two or more tables." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)
"A variation of the star schema in which the business dimensions are implemented as a set of normalized tables. The resulting diagram resembles a snowflake." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)
"A normalized version of the STAR schema in which dimension tables are partially or fully normalized. Not generally recommended because it compromises query performance and simplicity for understanding." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)
"A star schema in which dimension tables reference other tables." (Oracle, "Database SQL Tuning Guide Glossary", 2013)
💎SQL Reloaded: Ways of Looking at Data III
In contrast, the use of ranking/aggregate window functions allows more flexibility in selecting not only the first n Vendors, but any interval, thus making it possible to use this feature as a simple pagination technique. Here is the first query modified.
As for the second query, the TOP clause allows selecting only the top 10 Vendors/Products based on the amount purchased, it not being possible to select the top n Vendors for each purchased Product, as ranking functions allow. Here is the query for the top 3 Vendors for a Product.
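Since the original query is not shown here, below is a hedged sketch of the ranking approach, run on SQLite via Python's sqlite3 module (window functions need SQLite 3.25 or later). The Purchases table, its columns and the amounts are invented; in the posts the data would come from the purchase order tables of AdventureWorks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Purchases (ProductID INTEGER, VendorID INTEGER, Amount INTEGER);
    INSERT INTO Purchases VALUES
        (1, 10, 500), (1, 11, 900), (1, 12, 100), (1, 13, 700),
        (2, 10, 300), (2, 14, 800);
""")

top_vendors_query = """
    SELECT ProductID, VendorID, Amount, Ranking
    FROM ( -- rank the vendors of each product by the purchased amount
        SELECT ProductID, VendorID, Amount
             , ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY Amount DESC) AS Ranking
        FROM Purchases
    ) AS ranked
    WHERE Ranking BETWEEN 1 AND 3   -- any other interval works too, e.g. 2 AND 3
    ORDER BY ProductID, Ranking
"""
top_vendors = conn.execute(top_vendors_query).fetchall()
for row in top_vendors:
    print(row)
```

Changing the interval in the WHERE clause is what turns the same query into the simple pagination technique mentioned above.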
The logic for the current example could be reused in further analysis, therefore it makes sense to encapsulate it in a database object. A view would do, though if we would like to use the number of Vendors as a parameter, then it’s more appropriate to use a table-valued function or a stored procedure for this. Even if in theory stored procedures provide better performance, if we would like to include the above logic in further queries then creating a table-valued function is the best choice.
Once the table-valued function has been created we can go on and use it in other queries, for example here is a simple query showing a maximum of 5 vendors for each product.
Most probably the Users accustomed to the pivoting functionality of Excel would ask you if it’s possible to show all the Vendors together in one line. Starting with SQL Server 2005 Microsoft introduced the PIVOT operator, though it can pivot only one attribute; fortunately there is an alternative using a GROUP BY together with the CASE function.
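As the original listing is missing, here is only a sketch of how the GROUP BY plus CASE alternative could look, again on SQLite through Python's sqlite3 module (SQLite 3.25 or later for the window function). The Purchases table and its data are assumptions; a second attribute, here the Amount, is pivoted together with the VendorID to show what a single PIVOT could not do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Purchases (ProductID INTEGER, VendorID INTEGER, Amount INTEGER);
    INSERT INTO Purchases VALUES
        (1, 10, 500), (1, 11, 900), (1, 12, 100), (1, 13, 700),
        (2, 10, 300), (2, 14, 800);
""")

# A CASE without ELSE returns NULL for the other rankings and MAX ignores
# NULLs, so each column picks exactly one vendor (or stays NULL).
vendors_per_line_query = """
    SELECT ProductID
         , MAX(CASE WHEN Ranking = 1 THEN VendorID END) AS Vendor1
         , MAX(CASE WHEN Ranking = 1 THEN Amount END)   AS Amount1
         , MAX(CASE WHEN Ranking = 2 THEN VendorID END) AS Vendor2
         , MAX(CASE WHEN Ranking = 2 THEN Amount END)   AS Amount2
         , MAX(CASE WHEN Ranking = 3 THEN VendorID END) AS Vendor3
         , MAX(CASE WHEN Ranking = 3 THEN Amount END)   AS Amount3
    FROM (
        SELECT ProductID, VendorID, Amount
             , ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY Amount DESC) AS Ranking
        FROM Purchases
    ) AS ranked
    WHERE Ranking <= 3
    GROUP BY ProductID
    ORDER BY ProductID
"""
vendors_per_line = conn.execute(vendors_per_line_query).fetchall()
for row in vendors_per_line:
    print(row)
```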
The downside of this technique is that you can work only with a predefined number of Vendors per Product (here 3); on the other hand there are few cases in which more than 5 Vendors exist for the same Product. An alternative would be to process the data using a programming language such as VBA, not a complicated thing to do at all. In almost 10 years of experience I found only a few cases in which the functionality provided by the RDBMS needed to be extended with additional coding outside of the RDBMS (e.g. Excel, ASP), mainly when the data needed to be organized beyond the simple tabular structure.
🕋Data Warehousing: Data Mart [DM] (Definitions)
💎SQL Reloaded: Just In CASE I (Introduction)
According to SQL Server Books Online, a CASE could have one or more WHEN branches and at most one ELSE branch, the WHEN branches being evaluated in the order they were given, while the ELSE branch is always evaluated last. Once an expression on a WHEN branch evaluates to true, the function returns the value evaluated from result_expression and stops the search. If no match was found then the function evaluates the result_expression from the ELSE branch, and if the ELSE branch is missing a value of NULL is returned. Actually it’s simpler than it sounds.
CASE combined with other operators and functions allows creating powerful transformations, and many of the transformations done during ETL processing could be solved with a CASE. The following SQL script based on the Production.Product table from the AdventureWorks database relies heavily on the use of the CASE function in order to transform the data to a form intelligible for the Users or any other consumers.
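The evaluation rules can be tried out with a small sketch, here on SQLite through Python's sqlite3 module; the CASE semantics shown are standard SQL, while the table T and its values are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T (Val INTEGER);
    INSERT INTO T VALUES (5), (50), (500), (NULL);
""")

case_rows = conn.execute("""
    SELECT Val
         , CASE   -- for 500 both WHEN branches are true, the first one wins
               WHEN Val > 10  THEN 'first'
               WHEN Val > 100 THEN 'second'
               ELSE 'else'
           END AS WithElse
         , CASE WHEN Val > 10 THEN 'matched' END AS WithoutElse  -- NULL if no match
    FROM T
    ORDER BY rowid
""").fetchall()
for row in case_rows:
    print(row)
```

The NULL row ends up on the ELSE branch because a comparison with NULL never evaluates to true.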
The simplest use of the CASE function is the translation of bit values to more meaningful values, typically ‘Y’/’N’ or ‘Yes’/’No’. It’s the case of the code chunks marked with Example 1 & 2 in the above query, the first example showing the use of the Simple CASE syntax, while the second one that of a Searched CASE.
The third example shows the Simple CASE syntax for multiple values of the same attribute, a branch being provided for each of the values taken by the Class attribute (‘H’, ‘M’, ‘L’), translating them to more meaningful values (‘High’, ‘Medium’, ‘Low’), the ‘n/a’ (not available) value being returned for the other cases.
The alternative syntax could be used too, though for such simple evaluations the previous syntax form is easier to use.
For those who know the IsNull or COALESCE functions the 4th example might not make sense, a simpler way of writing the same being IsNull(Color, ‘n/a’). What the example tries to demonstrate is the comparison against NULL values using the IS NULL clause.
In the 5th and 6th example the matching expressions make use of the IN and BETWEEN operators, though any other operators could be used in this context. In the 5th example the BETWEEN operator works with a character data type, therefore the lexical comparison is used. For such cases it has to be taken into account that if Size had also taken the value ‘334’, it would be evaluated on the second branch and return the value ‘S’!
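The lexical comparison pitfall can be reproduced with a one-line check, sketched here on SQLite via Python's sqlite3 module. The bounds ‘30’ and ‘40’ are hypothetical, as the branches of the original example are not shown here; the point is only that ‘334’ sorts between them as text, but not as a number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite returns 1 for true and 0 for false.
lexical, numeric = conn.execute("""
    SELECT '334' BETWEEN '30' AND '40'                -- compared as text
         , CAST('334' AS INTEGER) BETWEEN 30 AND 40   -- compared as numbers
""").fetchone()
print(lexical, numeric)
```

Converting the value to a numeric type before the comparison, where the data allows it, avoids the surprise.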
More generally, the CASE function could be used to split a set of alphanumeric values or an interval of numeric values into buckets, or to evaluate specific expressions on each branch, like functions that have different formulas on each interval. While the previous examples work with sets of alphanumeric values, the 6th example uses the StandardCost to derive several Cost Classes (bucket split), the intervals considered being (-infinity, 0], (0, 50], (50, 250], (250, 1000], (1000, 10000] and (10000, infinity).
Somebody would remark: “ok, but the values 50, 250 and 1000 are considered on more than one branch!”. Correct, but it should not be forgotten that the CASE function stops after the first expression evaluated to true, so if the StandardCost is 50 the second branch will be the last branch processed, so the respective piece of code is correct. In order to avoid any confusion we could have written the following lengthier and equivalent code:
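The original listings are not available here, so this is only a sketch of the two forms, compared on SQLite through Python's sqlite3 module. The bucket labels and the sample costs are assumptions; the boundaries follow the intervals listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product (ProductID INTEGER PRIMARY KEY, StandardCost REAL);
    INSERT INTO Product (StandardCost)
    VALUES (0), (50), (50.01), (250), (1000), (10000), (12000);
""")

bucket_query = """
    SELECT StandardCost
         , CASE   -- shared boundaries: the first matching branch wins
               WHEN StandardCost <= 0 THEN 'n/a'
               WHEN StandardCost BETWEEN 0 AND 50 THEN '0-50'
               WHEN StandardCost BETWEEN 50 AND 250 THEN '50-250'
               WHEN StandardCost BETWEEN 250 AND 1000 THEN '250-1000'
               WHEN StandardCost BETWEEN 1000 AND 10000 THEN '1000-10000'
               ELSE '>10000'
           END AS OverlappingForm
         , CASE   -- lengthier form with explicit, non-overlapping boundaries
               WHEN StandardCost <= 0 THEN 'n/a'
               WHEN StandardCost > 0 AND StandardCost <= 50 THEN '0-50'
               WHEN StandardCost > 50 AND StandardCost <= 250 THEN '50-250'
               WHEN StandardCost > 250 AND StandardCost <= 1000 THEN '250-1000'
               WHEN StandardCost > 1000 AND StandardCost <= 10000 THEN '1000-10000'
               ELSE '>10000'
           END AS ExplicitForm
    FROM Product
    ORDER BY ProductID
"""
bucket_rows = conn.execute(bucket_query).fetchall()
for row in bucket_rows:
    print(row)
```

Both columns return the same label for every row, including the boundary values 50, 250 and 1000.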
The 7th example is another example working with intervals based on the difference in days between the SellEndDate and the current date.
02 February 2010
🕋Data Warehousing: Data Warehouse [DWH] (Definitions)
💎SQL Reloaded: Ways of Looking at Data II
Following the PO Header example and logical partitioning, a view could be created based only on PO Details and the child tables, though as in this case there are no child tables, does it make sense to create a view? Yes, why not?! The logic encapsulated by a view isn’t limited to joining tables, but can also provide a restricted list of attributes, computed columns, or limit the scope by using constraints.
As can be seen from the above query, several computed columns have been introduced: OpenQty, the Quantity remaining to be received, calculated as the difference between OrderedQty and ReceiptQty; ReceiptTotal, the total value of the quantities received; and OutstandingTotal, the total value corresponding to OpenQty. These three columns are quite often met in analysis, and even if their definition is quite easy to follow and implement, it makes sense to include them in the view, thus reducing logic replication. In query writing best practices I was recommending to encapsulate formulas and business logic in functions, however in this case the logic is quite simple and it doesn’t make sense to create a function for each formula used in a query; the recommendation applies mainly to (complex) formulas that are often reused.
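As the view itself is not reproduced here, the following is a minimal sketch of such computed columns, built on SQLite through Python's sqlite3 module. The table layout, the view name vPurchaseOrderDetail and the sample lines are assumptions; the column names follow the ones used in the text, and the two totals are assumed to be quantity times UnitPrice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PurchaseOrderDetail (
        PurchaseOrderID INTEGER, ProductID INTEGER,
        UnitPrice INTEGER, OrderedQty INTEGER, ReceiptQty INTEGER);
    INSERT INTO PurchaseOrderDetail VALUES
        (1, 100, 10, 5, 5),
        (1, 101, 20, 10, 4);

    -- the formulas live in one place and are reused by every query on the view
    CREATE VIEW vPurchaseOrderDetail AS
    SELECT PurchaseOrderID, ProductID, UnitPrice, OrderedQty, ReceiptQty
         , OrderedQty - ReceiptQty AS OpenQty
         , ReceiptQty * UnitPrice AS ReceiptTotal
         , (OrderedQty - ReceiptQty) * UnitPrice AS OutstandingTotal
    FROM PurchaseOrderDetail;
""")

# Example of reuse: the lines that are still open.
open_lines = conn.execute("""
    SELECT PurchaseOrderID, ProductID, OpenQty, ReceiptTotal, OutstandingTotal
    FROM vPurchaseOrderDetail
    WHERE OpenQty > 0
""").fetchall()
print(open_lines)
```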
The alternative to the partitioned view is to create an extended view that includes PO Header and PO Details, such a type of view being really easy to use by most of the users. I prefer to use both approaches, thus having more flexibility.
Do not forget to check the variations in cardinality! This exercise also allows identifying whether the referential integrity is kept. Even if it’s hard to believe, I found referential integrity broken in databases designed to enforce it, therefore it’s a good idea to check whether the referential integrity is kept between the various elements.
The same query could also be created by using logical partitioned views, though before doing that we have to create a view also for Products:
The final query based on logical partitioned views:
Before using the data there are 1-2 tests that need to be done in order to assure ourselves that the data available at the various levels of detail are synchronized; it is the case of the SubTotal, which should match the cumulated LineTotal for each line.
Simple reports based on the PO Detail level come down to retrieving the POs matching certain criteria: closed POs, open POs, POs with overreceipt, with rejections or any other business case.
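Such a synchronization test could be sketched as follows, on SQLite through Python's sqlite3 module. The two tables are simplified stand-ins for the PO Header and PO Details, and the data are invented so that two of the three POs fail the check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PurchaseOrderHeader (PurchaseOrderID INTEGER PRIMARY KEY, SubTotal INTEGER);
    CREATE TABLE PurchaseOrderDetail (PurchaseOrderID INTEGER, LineTotal INTEGER);
    INSERT INTO PurchaseOrderHeader VALUES (1, 300), (2, 500), (3, 75);
    -- PO 1 matches, PO 2 is 50 short, PO 3 has no lines at all
    INSERT INTO PurchaseOrderDetail VALUES (1, 100), (1, 200), (2, 450);
""")

# Returns only the POs whose SubTotal differs from the cumulated LineTotal.
mismatch_query = """
    SELECT h.PurchaseOrderID, h.SubTotal, COALESCE(d.LineTotal, 0) AS LineTotal
    FROM PurchaseOrderHeader h
         LEFT JOIN (
             SELECT PurchaseOrderID, SUM(LineTotal) AS LineTotal
             FROM PurchaseOrderDetail
             GROUP BY PurchaseOrderID
         ) d ON h.PurchaseOrderID = d.PurchaseOrderID
    WHERE h.SubTotal <> COALESCE(d.LineTotal, 0)
    ORDER BY h.PurchaseOrderID
"""
mismatches = conn.execute(mismatch_query).fetchall()
print(mismatches)
```

An empty result means the two levels are synchronized. With decimal amounts a small tolerance may be needed instead of the strict inequality.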
Note:
The * (ALL) has been used only in order to minimize query’s length, as a good practice avoid using the “*” and provide instead only the list of attributes needed!
Other types of reports show the data at different levels of aggregation, for example calculating the Total Amount spent for each Vendor/Product, the Outstanding Amount, etc.
More complex reports can be built with the help of window functions, for example showing the last PO Unit Price per Vendor/Product, such a report being quite useful when reevaluating the Standard Costs.
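A report of the last PO Unit Price per Vendor/Product could be sketched as below, on SQLite through Python's sqlite3 module (SQLite 3.25 or later for the window function). The flattened PurchaseOrders table and its rows are assumptions made for the illustration; in AdventureWorks the data would come from the PO Header and PO Details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PurchaseOrders (
        PurchaseOrderID INTEGER PRIMARY KEY, VendorID INTEGER,
        ProductID INTEGER, OrderDate TEXT, UnitPrice INTEGER);
    INSERT INTO PurchaseOrders VALUES
        (1, 10, 1, '2009-01-15', 40),
        (2, 10, 1, '2009-06-01', 42),
        (3, 11, 1, '2009-03-10', 39),
        (4, 10, 2, '2009-02-20', 15);
""")

# The most recent PO of each Vendor/Product pair gets Ranking = 1;
# the PurchaseOrderID breaks ties between POs placed on the same date.
last_price_query = """
    SELECT VendorID, ProductID, OrderDate, UnitPrice
    FROM (
        SELECT VendorID, ProductID, OrderDate, UnitPrice
             , ROW_NUMBER() OVER (PARTITION BY VendorID, ProductID
                                  ORDER BY OrderDate DESC, PurchaseOrderID DESC) AS Ranking
        FROM PurchaseOrders
    ) AS ranked
    WHERE Ranking = 1
    ORDER BY VendorID, ProductID
"""
last_prices = conn.execute(last_price_query).fetchall()
for row in last_prices:
    print(row)
```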
Happy coding!
About Me
- Adrian
- Koeln, NRW, Germany
- IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.