05 February 2010

🎡SSIS: Wizarding an SSIS Package (First Magic Class)

I'm sorry to disappoint you, this it’s not another story about Harry Potter! It’s about how to create an SSIS package without any knowledge about SSIS or the components of an SSIS Package with the help of ‘SQL Server Import and Export Wizard’ available from the SQL Server 2008 Management Studio. Here are the steps needed to be followed in order to create a Package that exports the data from a local table to an Excel file.

Step 1: Start SQL Server Import and Export Wizard

From SQL Server Management Studio choose the database from which you want to export the data (e.g. AdventureWorks), right click on it and from the floating menu choose Tasks/Export Data. This action will bring the ‘SQL Server Import and Export Wizard’ used, as its name indicates, for importing and exporting data on the fly from/to SQL Server or any other source.
Step 2: Choose a Data Source

In ‘Choose a Data Source’ step select the ‘Data Source’, SQL Server Native Client 10.0 for exporting data from SQL Server, choose ‘Server name’ from the list of SQL Server available, select the Authentication mode and the Database (e. g. AdventureWorks), then proceed to the next step by clicking ‘Next’.
Step 3: Choose a Destination

In ‘Choose a Destination’ step select the Destination, in this case ‘Microsoft Excel’, browse for the Excel file to which you want to export the data (e.g. Products.xls) and check ‘First row has column names’ in case you want to include the column names, then proceed to the next step by clicking ‘Next’.
Step 4: Specify Table Copy or Query

In ‘Specify Table Copy or Query’ step there are two options you could choose: ‘Copy data from one or more tables or views’ or ‘Write a query to specify the data to transfer’ – their names are speaking for itself. The second option allows for more flexibility and you could just copy paste the query used for your expert. In this example just go with the first option and then proceed to the next step by clicking ‘Next’.
Step 5: Select Source Tables and Views

In ‘Select Source Tables and Views’ step select the database objects (e.g. Production.vProducts) from which you’ll export the data, then proceed to the next step by clicking ‘Next’. If the destination allows it, it’s possible to choose more than one database object.
Step 6: Review Data Type Mapping

In ‘Review Data Type Mapping’ typically you could go with the provided defaults thus you can proceed to the next step by clicking ‘Next’.
Step 7: Save and Run Package

In ‘Save and Run Package’ step you could choose to ‘Run immediately’ the package and/or ‘Save SSIS Package’ to the ‘SQL Server’ or locally to the ‘File System’. Saving the package locally allows you to modify and rerun the package at a later date. For this example check the ‘Run immediately’, respectively the ‘Save SSIS Package’ and ‘File System’ option, then proceed to the next step by clicking ‘Next’.
Step 8: Save SSIS Package

In ‘Save SSIS Package’ step provide the intended Name (e.g. Export Products) or Description of the Package, choose the location where the package will be saved, then proceed to the next step by clicking ‘Next’.
Step 9: Complete the Wizard

The ‘Complete the Wizard’ step allows reviewing the choices made in the previous steps and eventually navigating to the previous steps in order and to the needful changes. Therefore once you proved the details proceed to the next step by clicking ‘Next’.
Step 10: Executing the Package

In ‘Executing the Package’ step the Package is run, the progress being shown as each step. Just close the Package once you reviewed the steps.
If the package run successfully you can go on and check the exported data, apply additional formatting, etc.

Step 11: Inspecting the Package

The package can be reopened and modified in Microsoft Visual Studio or SQL Server Business Intelligence Development Studio, for this just open one of the two environments and from the Main Menu select File/Open File, then browse for the location where the Package (e.g. Export Products.dtsx) was saved and open the file. Here is the package created by the Wizard:
As can be seen the package contains two tasks, the ‘Preparation SQL Task 1’ which creates the table (e.g. vProducts) in the Excel file, and ‘Data Flow Task 1’ which dumps the data in the created table. By double clicking the ‘Data Flow Task 1’ you could see its content: - the ‘Source - vProducts’ OLEDB Data Source holding the connectivity information to the SQL Server and the list of Columns in scope. - the ‘Data Conversion’ Transformation that allows converting the data between Source and Destination given the fact that Excel has different data types than SQL Server - the ‘Destination - vProducts’ Excel Destination holding the connectivity information for the Excel file and the Mappings in place.
You could explore the properties of each object in order to learn more about its attributes used and the values they took.

Ok, so you’ve open the package, how do you run it again? For this is enough to double click on the locally saved Package (e.g. Export Products.dtsx), action that will bring the ‘Execute Package Utility’, and click on the ‘Execute’ button to run the Package.
If you haven’t done any changes to the Excel file to which you moved the data (e.g. Products.xls) and haven’t moved the file, then the package will fail when it attempts to run the ‘the ‘Preparation SQL Task 1’ task because a table with the expected name already exists in the file.
In order to avoid this error to appear you’ll have to delete from the Excel file the sheet (e.g. vProducts) where the data were dumped and save the file. Now you could execute again the package and it should execute without additional issues.

If you’ve removed/deleted the Excel file, even if you get a warning that the Excel file is not available, the package will run in the end without problems. Instead of deleting the sheet manually you could use the File System Task to move the file to other location, though that’s a topic for another post.

🕋Data Warehousing: Star Schema (Definitions)

 "A relational database structure in which data is maintained in a single fact table at the center of the schema with additional dimension data stored in dimension tables. Each dimension table is directly related to the fact table by a key column." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A star schema is a dimensional data model implemented on a relational database." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"A star schema is a set of tables comprised of a single, central fact table surrounded by dimension tables. Each dimension is represented by a single dimension table. Star schemas implement dimensional data structures with denormalized dimensions. Snowflake schemas are an alternative to a star schema design." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"A single fact table surrounded by a single hierarchical layer of dimensional tables, in a data warehouse database." (Gavin Powell, "Beginning Database Design", 2006)

"A single fact table which joins to many dimension tables, each of which is a single denormalized dimension table." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"The instantiation of a dimensional model in a relational database. A star schema consists of a fact table and the dimension tables that it references. The fact table contains facts and foreign keys; the dimension tables contain dimensional attributes by which the facts will be filtered, rolled up, or grouped." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A single fact table which joins to many dimension tables, each of which is a single denormalized dimension table." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"The implementation of a dimensional model in a relational database. The tables are organized around a single central fact table possessing a multi-part key, and each surrounding dimension table has its own primary key." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"The basic form of data organization for a data warehouse, consisting of a single large fact table and many smaller dimension tables." (Toby J Teorey, ", Database Modeling and Design 4th Ed", 2010)

"The arrangement of the collection of fact and dimension tables in the dimensional data model, resembling a star formation, with the fact table placed in the middle surrounded by the dimension tables. Each dimension table is in a one-to-many relationship with the fact table." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"The basic form of data organization for a data warehouse, consisting of a single large fact table and many smaller dimension tables." (Toby J Teorey, ", Database Modeling and Design" 4th Ed., 2010)

"A common form of a dimensional data model, where a fact table is directly linked by foreign keys to several dimension tables." (Craig S Mullins, "Database Administration", 2012)

"A relational database structure in which data is maintained in a single fact table at the center of the schema with additional dimension data stored in dimension tables. Each dimension table is directly related to and usually joined to the fact table by a key column." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A relational schema whose design represents a dimensional data model. The star schema consists of one or more fact tables and one or more dimension tables that are related through foreign keys." (Oracle, "Database SQL Tuning Guide Glossary", 2013)

"A type of relational database schema that is composed of a set of tables comprising a single, central fact table surrounded by dimension tables. See also dimension table, star join." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

💎SQL Reloaded: Just in CASE IV (Other Scenarios)

    Two days ago I was looking on the Web on random posts to see whether I missed something important related to CASE function and I stumble over an old post of Craig S. Mullins, in one of the examples (See SQL Statement #5) there is a query similar with the one below:
    As can be seen on more than one branch of a CASE there is a correlated sub-query based on the same table, the aggregated value being used for further calculations. Most probably the respective query had only the purpose to demonstrate a technique, though I’m having mainly two observations related to it:
1. The query returns a label of whether the Stock is High, Normal, Low or there is no Stock, though I have no measure of what the Stock actually is, thus in a first phase I can’t validate whether the query shows the correct data. In case a report is based on the respective query this reduces considerably also report's usability.
2. A select is replicated across several CASE branches fact that I find it complicates query’s complexity reflected in query’s understanding and maintainability, while considering that the logic needs to be kept in synchronization increases the chances for making a mistake. At least in this situation, being involved only one table, the logic is quite simple though what do you do when the correlated sub-query is more complex?

    How about encapsulating query’s logic in a sub-query and join it to the main tables using a left join? This approach would address both issues mentioned above:
    When creating a query there is always an important facts that needs to be considered - query’s performance. The second query is simpler and in theory it should be easier to process, expecting to have at least similar performance as the first version. In this case the ProductID is a foreign key in Production.ProductInventory, the search performed on the respective table having minimum impact on performance. If no index is available on the searched attribute, more likely first query’s performance will decrease considerably. The best approach for mitigating the performance differences between the two queries is to look at Client Statistics and Execution Plan. In what concerns the Client Statistics both queries are having similar performance, while the Execution Plan of the second query excepting the fact that is simpler and, without going into details, it seems the second plan is better.

    Maybe it makes sense to use correlated sub-queries in a CASE only when the correlated sub-queries have distinct logic, though even then I would recommend using a left join instead, this technique allowing more flexibility being possible to show the actual values in the query and even reuse them if needed in additional calculations.

    Talking about correlated queries, I found cases in which two distinct correlated sub-queries with the same logic were used to pull two distinct attributes available in the same table, something like in the below example.
    What if the correlated sub-queries are used again with the CASE function like in the below example?
    Even if maybe the number of calls to the correlated sub-queries is reduced by using a CASE, I would recommend using a left join instead, this technique offering more flexibility, the logic becoming also much easier to maintain and debug.

04 February 2010

💎SQL Reloaded: Just In CASE III (Quest for Performance)

Arranging branches

    Given the fact that the CASE function stops after the first match, in case there are no two expressions that evaluate to true for any same value, the CASE could be optimized for the best performance by arranging the branches in the order of the highest number of matches. For example taking the logic from the 3rd example from Part I and checking the number of occurrences for each value, surprisingly (or maybe not) there are 257 NULL values, 97 ‘L’ values, 82 ‘H’ values, respectively 68 ‘M’ values. If we change the branches in this order, the number of evaluations will decrease from 82*1+68*2+97*3+257*3=1280 to 257*1+97*2+82*3+68*3=901, the difference is small though when applied to big data sets the improvement could pay the effort. Here is the modified logic for Example 3:
    This technique is unfortunately not bulletproof – even if less probable in many cases, the order of the values’ occurrences could change overnight, the number of total evaluations varying in time on the number of occurrences for each value.

Rewriting Simple Unions

    Have you ever seen a set of similar queries whose output is merged with the help of UNION (ALL) operator? I’ve seen so many queries like the below one, in which the only difference between the sub-queries was residing in the WHERE constraint(s) applied and maybe a few additional calculations!

    The example is quite simple, though I think it demonstrates the point. The UNION could be translated to a simple CASE as in the 3rd example. The problem with such queries is that if there is no index on the attribute(s) used is the WHERE constraint, it will be performed a full table scan for each sub-query, quite expensive from a performance standpoint. Even if the respective attribute is indexed, there’s still a pay in performance. Such an approach could be maybe acceptable when in each scenario different tables are joined to the main table, though even then should be checked which one of the two approaches has better performance. A recent example I can recall and in which could have been used several CASE functions instead of the UNION was based on the JOIN of two tables, the query looking something like:


Rewriting Complex Unions

    Even if when the base table is used in several union-based merged queries, it doesn’t sense to apply this technique all the times, especially when dealing with complex queries using multiple joins. Now it depends, there are scenarios and scenarios, is must be always considered the trade in performance, readability and usability of a query when applying any technique. The Person.Address from AdventureWorks is the best way to exemplify two scenarios in which it makes sense and doesn’t makes sense to combine the logic in two union queries. The respective table stored the addresses for Customers (Sales & Individuals), Vendors and Employees, thus supposing that we would like to see the Owners and Owner Type for each Address we would need to create several unions.

    Let’s consider first the Customer-based Addresses – as there are two types of Customers and because the details for each type are stored in different tables, we might end up creating the following query:

    As can be seen the two queries are quite similar, most of the tables joined are the same with 3 exceptions: Sales.Store vs. Sales.Individual & Person.Contact. The query could be rewritten using a CASE and left joins instead of inner joins as follows:


    Similar queries could be written in order to get the Vendor and Employee details, and as can be seen also these queries share many similarities.
    Does it makes sense to use the above technique also in this case? Maybe it would be acceptable to merge the Vendor with Employee Addresses queries, though does it make sense to do merge also the Customer Addresses using CASE & left joins. The queries are not so complex so it’s possible to do that, though I think a decision should be taken only when has been studied the performance of the two approaches.

Pivoting Behavior

    Another technique when a CASE could help eliminate several joins to the same table is presented in a post on List Based Aggregations, the CASE function being used together with aggregated functions in order to cumulate the On Hand for several locations and show it on the same line. This approach was quite useful before the PIVOT operator was introduced in SQL Server 2005 (see Pivot operator example), and is still is in case is needed to select more than 1 attribute for each line.

💎SQL Reloaded: Just In CASE II (Clauses and Joins)

The CASE function could appear in all typical clauses – SELECT, WHERE, GROUP BY, ORDER BY and even in JOIN predicates. If the impact on performance is quite small when CASE is used inside of a SELECT clause, the decrease in performance could be quite high when the CASE is used in searches – WHERE constraints or JOIN predicates. Oracle provides function-based indexes, the smart use of such functions could improve in theory searches performed with CASE/DECODE functions, unfortunately as far I know there is no similar feature available on MS SQL Server family.

The use of CASE functions in WHERE clause is very rare because it could be replaced with several OR operators, the only benefit results maybe from syntax’ clearness or by checking values’ type before using them – the case of numeric and date values. Even if the below example doesn’t make sense from a business point of view, it shows a possible use of CASE within the WHERE clause. 

 
There are systems that misuse an alphanumeric attribute in order to store a numeric or date value, therefore before using the respective values in computations it’s necessary/recommended to test whether the values have the expected data type. I met also cases in which the range of dates covered by a date type in a RDBMS is narrower than a similar date type of other RDBMS (e.g. SQL Server vs. Oracle), in such case the test for valid dates is also necessary. For this purpose the CASE could be used together with the IsDate, respectively IsNumeric, functions as in below examples. 


The same data type handling could be used also in joins when with the misuse of alphanumeric data types, the numeric values represent references to other tables, or the date values could be used in the logic of a join. When the misused attribute is on the main table then the impact on performance is minimal, though when the misused attribute appears in the referenced table then the performance could be impacted considerably because is less likely than an index is built on the respective alphanumeric attribute, a full table scan being done. In order to exemplify this situation I will use the dbo.SplitList table-valued function, which takes a list of value and creates a table out of it, and join it with the Production.Product table. I used to Cast the Value attribute from dbo.SplitList to an integer in order to simulate the handling of numeric values. 


 Even if this use might be look strange for some people, it isn’t utopic, I met many cases in which Users wanted to pull for example all the Products available in the system having the values for a given attribute in a given list, and they wanted to see also the values that had no match. In theory could be easier to rewrite the query as follows, and use in Excel a VLOOKUP function to determine the attributes with no match. 


Of course, in this case it would have been much easier to just use the IN operator and provide the list of values.

03 February 2010

🕋Data Warehousing: Snowflake Schema (Definitions)

"An extension of a star schema such that one or more dimensions are defined by multiple tables. In a snowflake schema, only primary dimension tables are joined to the fact table. Additional dimension tables are joined to primary dimension tables. " (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A variation on dimensional schema design in which dimension tables are further normalized, split into multiple tables based on hierarchies in the data." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A single fact table which joins to many dimension tables, with each dimension normalized as two or more tables." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"A variation of the star schema in which the business dimensions are implemented as a set of normalized tables. The resulting diagram resembles a snowflake." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"A normalized version of the STAR schema in which dimension tables are partially or fully normalized. Not generally recommended because it compromises query performance and simplicity for understanding." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A star schema in which dimension tables reference other tables." (Oracle, "Database SQL Tuning Guide Glossary", 2013)

💎SQL Reloaded: Ways of Looking at Data III

In the previous posts from the same cycle (see Part I and Part II) I shown how the ranking window functions can be used in SQL Server in order to show the first/last record appearing in a certain partition. A broader set of reports are the ones focusing on the top n records matching a certain criteria, for example the top 10 Vendors/Products based on the amount purchased. Before the introduction of ranking window functions in SQL Server 2005 such requirements were previously addressed using the TOP clause. The following two queries are showing the top 10 Vendors, respectively the top 10 Vendors/Products based on the amount purchased, the first query focusing on the active Vendors while the second on all Vendors – now it depends on the requirements whether only the active or all Vendors are needed, therefore the Users should be always asked about this concern. 

In contrast the use of ranking/aggregated window functions allows more flexibility in selecting not only the first n Vendors, but any interval, thus being possible to use this feature as a simple pagination technique. Here is the first query modified.
As for the second query the TOP clause allows only to select the top 10 Vendors/Products based on the amount purchased, not being possible to select each top n Vendors for each purchased Product as ranking functions allow. Here is the query for the top 3 Vendors for a Product.
The logic for the current example could be reused in further analysis, therefore it makes sense to encapsulated it in a database object. A view would do, though if we would like to use the number of Vendors as parameter, then it’s more appropriate to use a table-valued function or a stored procedure for this. Even if in theory stored procedures provide better performance, if we like to include the above logic in further queries then creating a table-valued function is the best choice.
Once the table-valued function was created we can go on and use it in other queries, for example here is a simple query showing maximum 5 vendors for each product.
Most probably the Users used with the pivoting functionality of Excel would ask you if it’s possible to show all the Vendors together in one line. Starting with SQL Server 2005 Microsoft introduced the PIVOT operator though it can pivot only one attribute, fortunately there is an alternative using a GROUP BY together with the CASE function.
The downside of this technique is that you can work only with a predefined number of Vendors per Product (here 3), on the other side there are few the cases in which more than 5 Vendors exist for the same Product. An alternative would be to process the data using a programming language as VBA, not a complicated thing to do at all. In almost 10 years of experience I found only a few cases in which the functionality provided by RDBMS needed to be extended with additional coding outside of the RDBMS (e.g. Excel, ASP), mainly when the data needed to be organized beyond the simple tabular structure.

🕋Data Warehousing: Data Mart [DM] (Definitions)

"A subset of the contents of a data warehouse, stored within its database. A data mart tends to contain data focused at the department level, or on a specific business area. It is frequently implemented to manage the volume and scope of data." (Microsoft Corporation, "Microsoft SQL Server 7.0 System Administration Training Kit", 1999)

"A data warehouse, or repository, whose scope is limited to a single subject area." (William A Giovinazzo, "Internet-Enabled Business Intelligence", 2002)

"A type of data warehouse with data specifically designed for a defined set of functions." (Margaret Y Chu, "Blissful Data ", 2004)

[dependent data mart:] "A data mart that obtains its source data from an Enterprise Data Warehouse." (Margaret Y Chu, "Blissful Data ", 2004)

[independent data mart:] "A data mart that obtains its source data from operational systems or other external media." (Margaret Y Chu, "Blissful Data ", 2004)

"A database that contains a copy of operational data, organized to support analysis of a business process. A data mart may be a subject area within an enterprise data warehouse, or an analytic database that is departmentally focused. When not planned as part of an enterprise data warehouse, a data mart may become a stovepipe. When deployed as an adjunct to a normalized data warehouse, a data mart may contain aggregated data. When built around a conformance bus, the data mart is neither a stovepipe nor an aggregation." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

[stovepipe data mart:] "A departmentally focused data warehouse implementation that does not interoperate with other subject areas. Stovepipes are avoided through the design of a data warehouse bus - a set of conformed dimensions used consistently across subject areas." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A platform that maintains data for analysis by a single organization or user group for a specific set of business purposes." (Jill Dyché & Evan Levy, "Customer Data Integration", 2006)

"A departmentalized structure of data feeding from the data warehouse where data is denormalized based on the department’s need for information." (William H Inmon & Anthony Nesavich, "Tapping into Unstructured Data", 2007)

"A data mart is a specialized database containing a subset of data from a data warehouse that is needed for a particular business purpose. A data mart is used for reporting and analysis of business data." (Allen Dreibelbis et al, "Enterprise Master Data Management", 2008)

"A smaller data warehouse that holds data of interest to a particular group." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"A specialized type of data warehouse that works with a specific set of data to answer a specific need. A data mart is designed to provide quick, easy access to crucial data." (Stuart Mudie et al, "BusinessObjects™ XI Release 2 for Dummies", 2008) 

"A collection of related data from internal and external sources, transformed, integrated, and stored for the purpose of providing strategic information to a specific set of users in an enterprise." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A database in a data warehouse configuration that holds a subset of data specifically organized for a particular kind of reporting." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"A database structured for specific analysis and historical reporting needs." (David Lyle & John G Schmidt, "Lean Integration", 2010)

"A smaller, more specialized version of a data warehouse that includes data from a specific functional area or department." (Ken Withee, "Microsoft Business Intelligence For Dummies", 2010)

"A decision support database supporting Business Intelligence in a limited subject area, using a dimensional data model design." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Small data warehouse designed to support a department or SBU." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed, 2011)

"A subset of a data warehouse that is designed to focus on a specific set of business information." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A subset of a data warehouse that’s usually oriented to a business group or team." (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

[dependent data mart:] "A data mart whose sole source of data is the data warehouse; a dependent data mart is a component of the corporate information factory" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

[independent data mart:] "A data mart whose source data comes directly from legacy systems, rather than being sourced by a data warehouse" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"Data organized to support specific needs of a user community." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"An analytical database built for and used by a business unit or department to slice and dice for analytical reporting and analysis." (Andrew Pham et al, "From Business Strategy to Information Technology Roadmap", 2016)

"A focused collection of operational data that is usually confined to a specific aspect or subject of a business, such as customers, products, or suppliers. It is a more focused decision support data store than a data warehouse." (Daniel J Power & Ciara Heavin, "Decision Support, Analytics, and Business Intelligence" 3rd Ed., 2017)

"A subset of a data warehouse that allows data to be accessed and customized by specific business functions." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"An analytical database built for and used by a business unit or department to slice and dice for analytical reporting and analysis." (Tiffany Pham et al, "From Business Strategy to Information Technology Roadmap", 2018)

"A subset of a data warehouse that contains data that is tailored and optimized for the specific reporting needs of a department or team. A data mart can be a subset of a warehouse for an entire organization, such as data that is contained in online analytical processing (OLAP) tools." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"a subset of the data warehouse used for a specific purpose. Data marts are then department-specific or related to a single line of business (LoB)."  (Francesco Corea, "An Introduction to Data: Everything You Need to Know About AI, Big Data and Data Science", 2019)

"A data structure that is optimized for access. It is designed to facilitate end-user analysis of data. It typically supports a single, analytic application used by a distinct set of workers." (The Data Warehousing Institute)

"A database, usually smaller than a data warehouse, designed to help managers make strategic decisions about their business by focusing on a specific subject or department." (Microstrategy)

"A subset of the contents of a data warehouse that tends to contain data focused at the department level, or on a specific business area." (Microsoft)

"A simple data repository that houses data of a specific discipline." (Solutions Review)

"A data mart is a curated subset of data often generated for analytics and business intelligence users. Data marts are often created as a repository of pertinent information for a subgroup of workers or a particular use case." (snowflake) [source]

"A data mart is a subject-oriented database that is often a partitioned segment of an enterprise data warehouse. The subset of data held in a data mart typically aligns with a particular business unit like sales, finance, or marketing." (Talend) [source]

"A data mart is a subset of data from an enterprise data warehouse in which the relevance is limited to a specific business unit or group of users." (Informatica) [source]

"A data mart is a subset of data stored within the overall data warehouse, for the needs of a specific team, section or department within the business enterprise. […] Data marts make it much easier for individual departments to access key data insights more quickly and helps prevent departments within the business organization from interfering with each other’s data." (Sisense) [source]

"A data mart is the access layer of a data warehouse that is used to provide users with data. Data marts are often seen as small slices of the data warehouse. Data warehouses typically house enterprise-wide data, and information stored in a data mart usually belongs to a specific department or team." (Logi Analytics) [source]

"A data mart serves the same role as a data warehouse, but it is intentionally limited in scope. It may serve one particular department or line of business." (Oracle)

"The data mart is a subject-oriented slice of the data warehouse logical model serving a narrow group of users. Many data marts only need a subset of data from the full tables in the data warehouse." (Teradata) [source]

💎SQL Reloaded: Just In CASE I (Introduction)

    One of the functions I often use in the my daily work is CASE, a powerful combination between “Iif” function and “Case” conditional statement available in VB/VBA. It evaluates a list of conditions reducible to false/true and returns a value on the branch for which a match was found, allowing thus to create conditional-based code. There are two syntax forms in use: the Simple CASE that can be used to evaluate the values of an attribute, and the Searched CASE used to evaluate expressions.
SQL Server Books Online     A CASE could have one or more WHEN branches and only one or no ELSE branch, the WHEN branches being evaluated in the order they were given, while the ELSE branch is always last evaluated. Once an expression on WHEN branch is evaluated to true, the function returns the value evaluated from result_expression and stops the searching. If no match was found then the function evaluates the result_expression from ELSE branch, and if the ELSE branch is missing a value of NULL is return. Actually it’s simpler than is sounds.

    CASE combined with other operators and functions enable to create powerful transformations, many of the transformations done during ETL processing could be solved with a CASE. The following SQL script based on Production.Product table from AdventureWorks database relies heavily on the use of CASE function in order to transform the data to a form intelligible for the Users or any other consumers.
    The simplest use of CASE function is the translation of bit values to more meaningful values, typically ‘Y’/’N’ or ‘Yes’/’No’. It’s the case of the code chunks marked with Example 1 & 2 in the above query, the first example showing the use of Simple CASE syntax, while the second one of a Searched CASE.
    The third example shows the Simple CASE syntax for multiple values of the same attribute, being provided a branch for each of the values taken by Class attribute (‘H’, ‘M’, ‘L’) and translating them to more meaningful values (‘High’, ‘Medium’, ‘Low’), the ‘n/a’ (not available) value being returned for the other cases.
    The alternative syntax could be used too, though for such simple evaluations the previous syntax form is easier to use.
    For the ones knowing the IsNull or COALESCE function the 4th example doesn’t make sense, a simpler writing for the same would be IsNull(Color, ‘n/a’). What the example tries to demonstrate is the use of comparison against NULL values using the IS NULL clause.
    In the 5th and 6th example the matching expressions make use of IN and BETWEEN operators, any other operators could be used in this context. In the 5th example the BETWEEN operator is working with character data type therefore is used the lexical comparison – for such cases it has to be taken into account that if Size would have taken also the value ‘334’ it would be evaluated on the second branch and return the value ‘S’!
    More general, the CASE function could be used to split a set of alphanumeric values or an interval of numeric values in buckets or evaluate specific expressions on each branch, like functions that have different formulas on each interval. If the previous examples work with sets of alphanumeric values, the 6th example uses the StandardCost to derive several Cost Classes (buckets split), the intervals considered are (-infinite, 0], (0, 50], (50, 250], (250, 1000], [1000, 10000] and (10000, infinite).
    Somebody would remark: “ok, but the values, 50, 250, and 1000 are considered on more than one branch!”. Correct, but should not be forgotten that the CASE function stops after the first expression evaluated to true, so if the StandardCost is 50 the second branch will be the last branch processed, so the respective piece of code is correct. In order to avoid any confusion we could have written the following lengthier and equivalent code:
    The 7th example is another example working with intervals based on the difference in days between the SellEndDate and the current date.

02 February 2010

🕋Data Warehousing: Data Warehouse [DWH] (Definitions)

"A subject oriented, integrated, time variant, and non-volatile collection of summary and detailed historical data used to support the strategic decision-making processes for the corporation." (William H Inmon, "What is a Data Warehouse?", Prism Vol. 1 (1), 1995)

"A copy of transaction data specifically structured for query and analysis." (Ralph Kimball, The Data Warehouse Toolkit, 1996)

"A database specifically structured for query and analysis. A data warehouse typically contains data representing the business history of an organization. Data in a data warehouse is usually less detailed and longer lived than data from an OLTP system." (Microsoft Corporation, "Microsoft SQL Server 7.0 System Administration Training Kit", 1999)

"The conglomeration of an organization’s data warehouse staging and presentation areas, where operational data is specifically structured for query and analysis performance and ease-of-use." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed, 2002)

"The data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data used to support the strategic decision-making process for the enterprise. It is the central point of data integration for business intelligence and is the source of data for the data marts, delivering a common view of enterprise data." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"A data warehouse is a set of computer databases specifically designed with related, historical blissful [high quality] data that assist in formulating decisions and taking action." (Margaret Y Chu, "Blissful Data", 2004)

"A set of computer databases specifically designed with related, historical blissful data that assist in formulating decisions and taking action." (Margaret Y Chu, "Blissful Data ", 2004)

"A collection of integrated, subject-oriented databases designed to support the DSS function, where each unit of data is relevant to some moment in time. The data warehouse contains atomic data and lightly summarized data." (William H Inmon, "Building the Data Warehouse", 2005)

"A database that can follow a Third Normal Form (3NF) or dimensional design and that houses a time-variant collection of data from multiple sources. It’s generally used to collect and store integrated sets of historical data from multiple operational systems and then feed one or more dependent data marts. In some cases, a data warehouse may also provide end user access to support enterprise views of data." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"A database containing a copy of operational data that is organized for analytic purposes." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

[Enterprise Data Warehouse] "A database that contains a copy of enterprise data, reorganized for analytic purposes. Subject areas within the enterprise data warehouse are called data marts." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A database specifically structured for query and analysis. A data warehouse typically contains data representing the business history of an organization." (Thomas Moore, "MCTS 70-431: Implementing and Maintaining Microsoft SQL Server 2005", 2006)

"A relational database used as a repository for storing and analyzing numerical information that has been cleansed and verified." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"A technology platform that stores business data for the purpose of strategic decision making. Data warehouses are normally the central integration point for large amounts of detailed, historical data from heterogeneous systems across the company to avail data for business intelligence." (Jill Dyché & Evan Levy, "Customer Data Integration", 2006)

"Central repository holding cleaned and transformed information needed by an organization to make decisions, usually extracted from an operational database." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A collection of integrated, subject-oriented databases designed to support the DSS function, where each unit of data is relevant to some moment in time. The data warehouse contains atomic data and lightly summarized data." (William H Inmon & Anthony Nesavich, "Tapping into Unstructured Data", 2007)

"A data structure that is optimized for distribution. It collects and stores integrated sets of historical data from multiple operational systems and feeds them to one or more data marts. (Standard definition from The Data Warehousing Institute)" (Steve Williams & Nancy Williams, "The Profit Impact of Business Intelligence", 2007)

"A specialised database containing consolidated historical data drawn from a number of existing databases to support strategic decision making." (Keith Gordon, "Principles of Data Management", 2007)

"Central repository holding cleaned and transformed information needed by an organization to make decisions, usually extracted from an operational database." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2007)

"Data warehouse is a central repository for summarized and integrated data from operational databases and external data sources." (S Sumathi & S Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"A logical warehouse of data that gathers production and operational information from various departments of a corporation into a single data entity. This information is loaded regularly, which allows for careful analysis over a period of time." (Stuart Mudie et al, "BusinessObjects™ XI Release 2 for Dummies", 2008)

"A repository of data for offline use in building reports and analyzing historical data." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"A data warehouse is a system of records (a business intelligence gathering system) that takes data from a company's operational databases and other data sources and transforms it into a structure conducive to business analysis." (Sivakumar Harinath et al, "Professional Microsoft® SQL Server® Analysis Services 2008 with MDX", 2009)

"A database designed for reporting and data analysis. A data warehouse typically contains data representing the business history of an organization." (Jim Joseph, "Microsoft SQL Server 2008 Reporting Services Unleashed", 2009)

"A large data store containing the organization’s historical data, which is used primarily for data analysis and data mining." (Judith Hurwitz et al, "Service Oriented Architecture For Dummies" 2nd Ed., 2009)

"An integrated, centralized decision support database and the related software programs used to collect, cleanse, transform, and store data from a variety of operational sources to support Business Intelligence. A Data Warehouse may also include dependent data marts." (DAMA International, The DAMA Guide to the Data Management Body of Knowledge 1st Ed., 2009)

"(1) A centralized database for collecting the data from numerous other systems so that they can be made available for management reporting. The database is close to 3rd Normal Form.  (2)  A system that includes the central database described in 1; plus procedures for extracting, transforming, and loading data from other systems; and one or more data marts that organize subsets of the data for particular reporting purposes." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"[A] data warehouse is a collection of data designed to support management in the decision-making process. It is a subject oriented, integrated, time-variant, non-up datable collection of data used in support of management decision-making processes and business intelligence. It contains a wide variety of data that present a coherent picture of business condition at a single point of time. It is a unique kind of database which focuses on business intelligence, external data and time-variant data." (Vijay K Pallaw, "Database Management Systems" 2nd Ed., 2010)

"A data warehouse is a large, enterprise-wide database that acts as a central storage location for data that has been through the Extract, Transform, and Load (ETL) process. A data warehouse often includes historical data as well." (Ken Withee, "Microsoft Business Intelligence For Dummies", 2010)

[Active Data Warehouse (ADW):] "A data warehouse that is generally capable of supporting near-real-time updates, fast response times, and mixed workloads by leveraging well-architected data models, optimized ETL processes, and the use of workload management." (Martin Oberhofer et al, "The Art of Enterprise Information Architecture", 2010)

"A large, often enterprise-wide repository of data used for reporting and analysis. Data warehouses generally collect and manage data from a large number of operational and financial systems across an enterprise." (Janice M Roehl-Anderson, "IT Best Practices for Financial Managers", 2010) 

"A subject-oriented, integrated, time-variant, and historical collection of summary and detailed data used to support the decision-making and other reporting and analysis needs that require historical, point-in-time information. Data, once captured within the warehouse, is nonvolatile and relevant to a point in time." (David Lyle & John G Schmidt, "Lean Integration", 2010)

"A specialized type of database that is used to aggregate data from transaction databases for data analysis purposes, such as identifying and examining business trends, to support planning and decision making." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed, 2011)

[Enterprise Data Warehousing (EDW):] "A data repository of organizational data that is organized, analyzed, and used to enable more informed decision making and planning." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed, 2011)

[federated data warehouse:] "1.A conceptual Data Warehouse made up of multiple decision support databases, potentially on multiple servers, but presented transparently to Business Intelligence users as a unified schema for query, analysis, and reporting. 2.An Enterprise Data Warehouse fed by extracts from departmental Data Warehouses and/or legacy Data Warehouses prior to their incorporation and/or retirement." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The database that stores operations data for long periods of time." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A database used for reporting and analysis." (Craig S Mullins, "Database Administration", 2012)

"A large data store containing the organization’s historical data, which is used primarily for data analysis and data mining. It is the data system of record." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A shared repository of data, often used to support the centralized consolidation of information for decision support." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"A shared repository of data, often used to support the centralized consolidation of information for decision support." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"A store that provides data from the originating source or the operational data stores; it contains historical and derived data. Also known as an information warehouse." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"A subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management’s decisions" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"A large data store containing the organization’s historical data, which is used primarily for data analysis and data mining. It is the data system of record." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"A centralized database system which captures data and allows later analysis of the collected data." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"A secondary database that holds older data for analysis. In some applications, you may want to analyze the data and store modified or aggregated forms in the warehouse instead of keeping every outdated record." (Rod Stephens, "Beginning Software Engineering", 2015)

"Decision-making data base containing the totality of a business’ decision-making data (all subjects)." (Fernando Iafrate, "From Big Data to Smart Data", 2015) 

[Enterprise Data Warehouse (EDW):] "A clean data store created to merge and store data from different sources for enterprise data analysis." (David K Pham, "From Business Strategy to Information Technology Roadmap", 2016)

"A granular, time-variant, structured store of historical data in a neutral, nonredundant format for multiple uses. Its purpose is the reuse of data." (Gregory Lampshire et al, "The Data and Analytics Playbook", 2016)

"Electronic storehouses where vast amounts of data are arrayed, integrated, categorised, stored, and sold." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A repository for storing business-relevant data." (Jonathan Ferrar et al, "The Power of People", 2017)

"A very large database designed to support decision making in organizations. It is usually batch updated and structured for rapid online queries and managerial summaries. A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data." (Daniel J Power & Ciara Heavin, "Decision Support, Analytics, and Business Intelligence" 3rd Ed., 2017)

"A data warehouse is a repository for all the data that an enterprise collects from internal and external sources." (Amar Sahay, "Business Analytics" Vol. I, 2018)

"A data warehouse is a repository of enterprise data used for reporting and analysis." (Michelle Gutzait et al, "Hands-On Data Warehousing with Azure Data Factory", 2018) 

"A subject-oriented collection of data that is used to support strategic decision making. The warehouse is the central point of data integration for business intelligence. It is the source of data for data marts within an enterprise and delivers a common view of enterprise data." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

[Enterprise Data Warehouse (EDW):] "system used for analysis and reporting that consists of central repositories of integrated data from a wide spectrum of different sources." (Francesco Corea, "An Introduction to Data: Everything You Need to Know About AI, Big Data and Data Science", 2019)

"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data that supports management’s decision-making process." (Piethein Strengholt, "Data Management at Scale", 2020)

"A cloud data warehouse is an enterprise data warehouse offered as a managed service (PaaS) on public clouds with optimized integrations for data ingestion, analytics processing, and BI analytics." (Rukmani Gopalan, "The Cloud Data Lake: A Guide to Building Robust Cloud Data Architecture", 2022) 

"A data warehouse is a central storage for all data that an enterprise’s various business systems collect." (Logi Analytics) [source]

"A data warehouse is a data management solution to store large quantities of historical business data, performing queries to support various business intelligence and analytics use cases." (Qlik) [source]

"A data warehouse is a database designed to enable business intelligence activities: it exists to help users understand and enhance their organization's performance. It is designed for query and analysis rather than for transaction processing, and usually contains historical data derived from transaction data, but can include data from other sources. Data warehouses separate analysis workload from transaction workload and enable an organization to consolidate data from several sources." (Oracle)

"A data warehouse is a relational database that is designed for analytical rather than transactional work. It collects and aggregates data from one or many sources so it can be analyzed to produce business insights. It serves as a federated repository for all or certain data sets collected by a business’s operational systems." (snowflake) [source]

"A data warehouse is a repository for data generated and collected by an enterprise's various operational systems." (Techtarget) [source]

"A data warehouse is a storage architecture designed to hold data extracted from transaction systems, operational data stores and external sources. The warehouse then combines that data in an aggregate, summary form suitable for enterprise-wide data analysis and reporting for predefined business needs." (Gartner)

"A data warehouse is a system used to do quick analysis of business trends using data from many sources." (KDnuggets)

[Big data warehouse:] "A specialized, cohesive set of data repositories andplatforms that supports a broad variety of analytics running on-premises, inthe cloud, or in a hybrid environment. BDW leverages traditional and new bigdata technologies such as Hadoop, Spark, columnar and row-based datawarehouses, ETL and streaming, and elastic in-memory and storage frameworks." (Forrester)

[Cloud data warehouse:] "An on-demand, secure, and scalable self-service data warehouse that automates the provisioning, administration, tuning, backup, and recovery to accelerate analytics and actionable insights while minimizing administration requirements." (Forrester)

"A database, typically very large, containing the historical data of an enterprise. Used for decision support or business intelligence, it organizes data and allows coordinated updates and loads." (Microstrategy)

"A large store of data drawing from a wide range of sources that can be processed, split, and analyzed to extract insights that guide management decisions. Data warehouses are typically relational databases that contain historical data and are designed for query and analysis." (Insight Software)

"A record of an enterprise’s past transactional and operational information, stored in a database. Data warehousing is not meant for current 'live' data; rather, data from the production databases are copied to the data warehouse so that queries can be performed without disturbing the performance or the stability of the production systems." (Appian)

"A system used for data analytics. They are a central location of integrated data from other more disparate sources, storing both current (real-time) and historical data which can then be used to create trends reports." (Solutions Review)

"An implementation of an informational database used to store sharable data sourced from an operational database-of-record. It is typically a subject database that allows users to tap into a company's vast store of operational data to track and respond to business trends and facilitate forecasting and planning efforts." (Information Management)

"The database that stores operations data for long periods of time." (Microsoft)

[Enterprise data warehouse:] "A repository of information that is used for reporting and analytics. It includes key data management functions, such as concurrency, security, storage, processing, SQL access, and integration." (Forrester)

[Enterprise data warehouse:] "a single database or set of databases that allow all of an organisation’s data to be stored in a central repository ideally in relationship to each other." (BI System Builders)

"In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons. The data stored in the warehouse is uploaded from the operational systems (such as marketing, sales, etc.)" (Wikipedia)

💎SQL Reloaded: Ways of Looking at Data II

In a previous post on the same topic I’ve started to exemplify how the logic for reports could be built using the PO Header/Details example provided by AdventureWorks. The queries were looking mainly at Vendor and PO Header data, the next logical step being the introduction of PO Details and Products. If the header contains general information about a PO that applies to all PO Lines, the PO Lines contain the reference to the purchased Product, Quantities (e.g. Ordered, Receipt, Rejected, etc.) & UOM, Prices (e.g. Unit Price), Reference Dates (e.g. Due Date), each PO Header having one or more PO Lines.

Following PO Header’s example and logical partitioning , could be created a view based only on PO Details and the child tables, though as in this case there are no child tables – does it makes sense to create a view? Yes, why not?! The logic encapsulated by a view doesn’t resume only to joining tables, but also to provide a restrained list of attributes, computed columns or limit the scope by using constraints.
As can be seen from the above query, has been introduced several computed columns: OpenQty – the Quantity remaining to be receipt calculated as the difference between OrderedQty and ReceiptQty, ReceiptTotal – the total value of quantities receipt, OutstandingTotal – the total value corresponding to OpenQty. These three columns are quite often met in analysis and even if their definition is quite easy to follow and implement, it makes sense to include them in the view, thus reducing logic replication. In query writing best practices I was recommending to encapsulate formulas and business logic in functions, however in this case the logic is quite simple and it doesn’t makes sense to create a function for each formula used in a query, the recommendation should be used in theory (complex) formulas that are often reused.

The alternative for partitioned view is to create an extended view that include PO Header and PO Details, such type of view is really easy to use by most of the users. I prefer to use both approaches, having thus more flexibility.
    Do not forget to check the variations in cardinality! This exercise allow also to identify whether the referential integrity is kept. Even if it’s hard to believe, I found referential integrity broken in databases designed to enforce it, therefore it’s a good idea to check whether the referntial integrity is kept between the various elements.
The same query could be also created also by using logical partitioned view, though before doing that we have to create a view also for Products:
    The final query based on logical partitioned views:
Before using the data there are 1-2 tests that needs to be done in order to assure ourselves that the data available at the various levels of details are synchronized: is the case of the SubTotal which should match the cumulated LineTotal for each line.
Simple reports based on PO Detail level resume in retrieving the POs matching a certain criteriea: closed POs, Open POs, POs with overreceipt, with rejections or any other business case.
Note:
The * (ALL) has been used only in order to minimize query’s length, as a good practice avoid using the “*” and provide instead only the list of attributes needed!

Other types of reports are showing the data at different aggregations, for example calculating the Total Amount spent for each Vendor/Product, the Outstanding Amount, etc.
More complex reports can be built with the help of window functions, for example showing the last PO Unit Price per Vendor/Product, such a report being quite useful when reevaluating the Standard Costs.
Happy coding!

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.