03 February 2010

💎SQL Reloaded: Ways of Looking at Data III

In the previous posts from the same cycle (see Part I and Part II) I shown how the ranking window functions can be used in SQL Server in order to show the first/last record appearing in a certain partition. A broader set of reports are the ones focusing on the top n records matching a certain criteria, for example the top 10 Vendors/Products based on the amount purchased. Before the introduction of ranking window functions in SQL Server 2005 such requirements were previously addressed using the TOP clause. The following two queries are showing the top 10 Vendors, respectively the top 10 Vendors/Products based on the amount purchased, the first query focusing on the active Vendors while the second on all Vendors – now it depends on the requirements whether only the active or all Vendors are needed, therefore the Users should be always asked about this concern. 

In contrast the use of ranking/aggregated window functions allows more flexibility in selecting not only the first n Vendors, but any interval, thus being possible to use this feature as a simple pagination technique. Here is the first query modified.
As for the second query the TOP clause allows only to select the top 10 Vendors/Products based on the amount purchased, not being possible to select each top n Vendors for each purchased Product as ranking functions allow. Here is the query for the top 3 Vendors for a Product.
The logic for the current example could be reused in further analysis, therefore it makes sense to encapsulated it in a database object. A view would do, though if we would like to use the number of Vendors as parameter, then it’s more appropriate to use a table-valued function or a stored procedure for this. Even if in theory stored procedures provide better performance, if we like to include the above logic in further queries then creating a table-valued function is the best choice.
Once the table-valued function was created we can go on and use it in other queries, for example here is a simple query showing maximum 5 vendors for each product.
Most probably the Users used with the pivoting functionality of Excel would ask you if it’s possible to show all the Vendors together in one line. Starting with SQL Server 2005 Microsoft introduced the PIVOT operator though it can pivot only one attribute, fortunately there is an alternative using a GROUP BY together with the CASE function.
The downside of this technique is that you can work only with a predefined number of Vendors per Product (here 3), on the other side there are few the cases in which more than 5 Vendors exist for the same Product. An alternative would be to process the data using a programming language as VBA, not a complicated thing to do at all. In almost 10 years of experience I found only a few cases in which the functionality provided by RDBMS needed to be extended with additional coding outside of the RDBMS (e.g. Excel, ASP), mainly when the data needed to be organized beyond the simple tabular structure.

🕋Data Warehousing: Data Mart [DM] (Definitions)

"A subset of the contents of a data warehouse, stored within its database. A data mart tends to contain data focused at the department level, or on a specific business area. It is frequently implemented to manage the volume and scope of data." (Microsoft Corporation, "Microsoft SQL Server 7.0 System Administration Training Kit", 1999)

"A data warehouse, or repository, whose scope is limited to a single subject area." (William A Giovinazzo, "Internet-Enabled Business Intelligence", 2002)

"A type of data warehouse with data specifically designed for a defined set of functions." (Margaret Y Chu, "Blissful Data ", 2004)

[dependent data mart:] "A data mart that obtains its source data from an Enterprise Data Warehouse." (Margaret Y Chu, "Blissful Data ", 2004)

[independent data mart:] "A data mart that obtains its source data from operational systems or other external media." (Margaret Y Chu, "Blissful Data ", 2004)

"A database that contains a copy of operational data, organized to support analysis of a business process. A data mart may be a subject area within an enterprise data warehouse, or an analytic database that is departmentally focused. When not planned as part of an enterprise data warehouse, a data mart may become a stovepipe. When deployed as an adjunct to a normalized data warehouse, a data mart may contain aggregated data. When built around a conformance bus, the data mart is neither a stovepipe nor an aggregation." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

[stovepipe data mart:] "A departmentally focused data warehouse implementation that does not interoperate with other subject areas. Stovepipes are avoided through the design of a data warehouse bus - a set of conformed dimensions used consistently across subject areas." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A platform that maintains data for analysis by a single organization or user group for a specific set of business purposes." (Jill Dyché & Evan Levy, "Customer Data Integration", 2006)

"A departmentalized structure of data feeding from the data warehouse where data is denormalized based on the department’s need for information." (William H Inmon & Anthony Nesavich, "Tapping into Unstructured Data", 2007)

"A data mart is a specialized database containing a subset of data from a data warehouse that is needed for a particular business purpose. A data mart is used for reporting and analysis of business data." (Allen Dreibelbis et al, "Enterprise Master Data Management", 2008)

"A smaller data warehouse that holds data of interest to a particular group." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"A specialized type of data warehouse that works with a specific set of data to answer a specific need. A data mart is designed to provide quick, easy access to crucial data." (Stuart Mudie et al, "BusinessObjects™ XI Release 2 for Dummies", 2008) 

"A collection of related data from internal and external sources, transformed, integrated, and stored for the purpose of providing strategic information to a specific set of users in an enterprise." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A database in a data warehouse configuration that holds a subset of data specifically organized for a particular kind of reporting." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"A database structured for specific analysis and historical reporting needs." (David Lyle & John G Schmidt, "Lean Integration", 2010)

"A smaller, more specialized version of a data warehouse that includes data from a specific functional area or department." (Ken Withee, "Microsoft Business Intelligence For Dummies", 2010)

"A decision support database supporting Business Intelligence in a limited subject area, using a dimensional data model design." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Small data warehouse designed to support a department or SBU." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed, 2011)

"A subset of a data warehouse that is designed to focus on a specific set of business information." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A subset of a data warehouse that’s usually oriented to a business group or team." (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

[dependent data mart:] "A data mart whose sole source of data is the data warehouse; a dependent data mart is a component of the corporate information factory" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

[independent data mart:] "A data mart whose source data comes directly from legacy systems, rather than being sourced by a data warehouse" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"Data organized to support specific needs of a user community." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"An analytical database built for and used by a business unit or department to slice and dice for analytical reporting and analysis." (Andrew Pham et al, "From Business Strategy to Information Technology Roadmap", 2016)

"A focused collection of operational data that is usually confined to a specific aspect or subject of a business, such as customers, products, or suppliers. It is a more focused decision support data store than a data warehouse." (Daniel J Power & Ciara Heavin, "Decision Support, Analytics, and Business Intelligence" 3rd Ed., 2017)

"A subset of a data warehouse that allows data to be accessed and customized by specific business functions." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"An analytical database built for and used by a business unit or department to slice and dice for analytical reporting and analysis." (Tiffany Pham et al, "From Business Strategy to Information Technology Roadmap", 2018)

"A subset of a data warehouse that contains data that is tailored and optimized for the specific reporting needs of a department or team. A data mart can be a subset of a warehouse for an entire organization, such as data that is contained in online analytical processing (OLAP) tools." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"a subset of the data warehouse used for a specific purpose. Data marts are then department-specific or related to a single line of business (LoB)."  (Francesco Corea, "An Introduction to Data: Everything You Need to Know About AI, Big Data and Data Science", 2019)

"A data structure that is optimized for access. It is designed to facilitate end-user analysis of data. It typically supports a single, analytic application used by a distinct set of workers." (The Data Warehousing Institute)

"A database, usually smaller than a data warehouse, designed to help managers make strategic decisions about their business by focusing on a specific subject or department." (Microstrategy)

"A subset of the contents of a data warehouse that tends to contain data focused at the department level, or on a specific business area." (Microsoft)

"A simple data repository that houses data of a specific discipline." (Solutions Review)

"A data mart is a curated subset of data often generated for analytics and business intelligence users. Data marts are often created as a repository of pertinent information for a subgroup of workers or a particular use case." (snowflake) [source]

"A data mart is a subject-oriented database that is often a partitioned segment of an enterprise data warehouse. The subset of data held in a data mart typically aligns with a particular business unit like sales, finance, or marketing." (Talend) [source]

"A data mart is a subset of data from an enterprise data warehouse in which the relevance is limited to a specific business unit or group of users." (Informatica) [source]

"A data mart is a subset of data stored within the overall data warehouse, for the needs of a specific team, section or department within the business enterprise. […] Data marts make it much easier for individual departments to access key data insights more quickly and helps prevent departments within the business organization from interfering with each other’s data." (Sisense) [source]

"A data mart is the access layer of a data warehouse that is used to provide users with data. Data marts are often seen as small slices of the data warehouse. Data warehouses typically house enterprise-wide data, and information stored in a data mart usually belongs to a specific department or team." (Logi Analytics) [source]

"A data mart serves the same role as a data warehouse, but it is intentionally limited in scope. It may serve one particular department or line of business." (Oracle)

"The data mart is a subject-oriented slice of the data warehouse logical model serving a narrow group of users. Many data marts only need a subset of data from the full tables in the data warehouse." (Teradata) [source]

💎SQL Reloaded: Just In CASE I (Introduction)

    One of the functions I often use in the my daily work is CASE, a powerful combination between “Iif” function and “Case” conditional statement available in VB/VBA. It evaluates a list of conditions reducible to false/true and returns a value on the branch for which a match was found, allowing thus to create conditional-based code. There are two syntax forms in use: the Simple CASE that can be used to evaluate the values of an attribute, and the Searched CASE used to evaluate expressions.
SQL Server Books Online     A CASE could have one or more WHEN branches and only one or no ELSE branch, the WHEN branches being evaluated in the order they were given, while the ELSE branch is always last evaluated. Once an expression on WHEN branch is evaluated to true, the function returns the value evaluated from result_expression and stops the searching. If no match was found then the function evaluates the result_expression from ELSE branch, and if the ELSE branch is missing a value of NULL is return. Actually it’s simpler than is sounds.

    CASE combined with other operators and functions enable to create powerful transformations, many of the transformations done during ETL processing could be solved with a CASE. The following SQL script based on Production.Product table from AdventureWorks database relies heavily on the use of CASE function in order to transform the data to a form intelligible for the Users or any other consumers.
    The simplest use of CASE function is the translation of bit values to more meaningful values, typically ‘Y’/’N’ or ‘Yes’/’No’. It’s the case of the code chunks marked with Example 1 & 2 in the above query, the first example showing the use of Simple CASE syntax, while the second one of a Searched CASE.
    The third example shows the Simple CASE syntax for multiple values of the same attribute, being provided a branch for each of the values taken by Class attribute (‘H’, ‘M’, ‘L’) and translating them to more meaningful values (‘High’, ‘Medium’, ‘Low’), the ‘n/a’ (not available) value being returned for the other cases.
    The alternative syntax could be used too, though for such simple evaluations the previous syntax form is easier to use.
    For the ones knowing the IsNull or COALESCE function the 4th example doesn’t make sense, a simpler writing for the same would be IsNull(Color, ‘n/a’). What the example tries to demonstrate is the use of comparison against NULL values using the IS NULL clause.
    In the 5th and 6th example the matching expressions make use of IN and BETWEEN operators, any other operators could be used in this context. In the 5th example the BETWEEN operator is working with character data type therefore is used the lexical comparison – for such cases it has to be taken into account that if Size would have taken also the value ‘334’ it would be evaluated on the second branch and return the value ‘S’!
    More general, the CASE function could be used to split a set of alphanumeric values or an interval of numeric values in buckets or evaluate specific expressions on each branch, like functions that have different formulas on each interval. If the previous examples work with sets of alphanumeric values, the 6th example uses the StandardCost to derive several Cost Classes (buckets split), the intervals considered are (-infinite, 0], (0, 50], (50, 250], (250, 1000], [1000, 10000] and (10000, infinite).
    Somebody would remark: “ok, but the values, 50, 250, and 1000 are considered on more than one branch!”. Correct, but should not be forgotten that the CASE function stops after the first expression evaluated to true, so if the StandardCost is 50 the second branch will be the last branch processed, so the respective piece of code is correct. In order to avoid any confusion we could have written the following lengthier and equivalent code:
    The 7th example is another example working with intervals based on the difference in days between the SellEndDate and the current date.

02 February 2010

🕋Data Warehousing: Data Warehouse [DWH] (Definitions)

"A subject oriented, integrated, time variant, and non-volatile collection of summary and detailed historical data used to support the strategic decision-making processes for the corporation." (William H Inmon, "What is a Data Warehouse?", Prism Vol. 1 (1), 1995)

"A copy of transaction data specifically structured for query and analysis." (Ralph Kimball, The Data Warehouse Toolkit, 1996)

"A database specifically structured for query and analysis. A data warehouse typically contains data representing the business history of an organization. Data in a data warehouse is usually less detailed and longer lived than data from an OLTP system." (Microsoft Corporation, "Microsoft SQL Server 7.0 System Administration Training Kit", 1999)

"The conglomeration of an organization’s data warehouse staging and presentation areas, where operational data is specifically structured for query and analysis performance and ease-of-use." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed, 2002)

"The data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data used to support the strategic decision-making process for the enterprise. It is the central point of data integration for business intelligence and is the source of data for the data marts, delivering a common view of enterprise data." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"A data warehouse is a set of computer databases specifically designed with related, historical blissful [high quality] data that assist in formulating decisions and taking action." (Margaret Y Chu, "Blissful Data", 2004)

"A set of computer databases specifically designed with related, historical blissful data that assist in formulating decisions and taking action." (Margaret Y Chu, "Blissful Data ", 2004)

"A collection of integrated, subject-oriented databases designed to support the DSS function, where each unit of data is relevant to some moment in time. The data warehouse contains atomic data and lightly summarized data." (William H Inmon, "Building the Data Warehouse", 2005)

"A database that can follow a Third Normal Form (3NF) or dimensional design and that houses a time-variant collection of data from multiple sources. It’s generally used to collect and store integrated sets of historical data from multiple operational systems and then feed one or more dependent data marts. In some cases, a data warehouse may also provide end user access to support enterprise views of data." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"A database containing a copy of operational data that is organized for analytic purposes." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

[Enterprise Data Warehouse] "A database that contains a copy of enterprise data, reorganized for analytic purposes. Subject areas within the enterprise data warehouse are called data marts." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A database specifically structured for query and analysis. A data warehouse typically contains data representing the business history of an organization." (Thomas Moore, "MCTS 70-431: Implementing and Maintaining Microsoft SQL Server 2005", 2006)

"A relational database used as a repository for storing and analyzing numerical information that has been cleansed and verified." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"A technology platform that stores business data for the purpose of strategic decision making. Data warehouses are normally the central integration point for large amounts of detailed, historical data from heterogeneous systems across the company to avail data for business intelligence." (Jill Dyché & Evan Levy, "Customer Data Integration", 2006)

"Central repository holding cleaned and transformed information needed by an organization to make decisions, usually extracted from an operational database." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A collection of integrated, subject-oriented databases designed to support the DSS function, where each unit of data is relevant to some moment in time. The data warehouse contains atomic data and lightly summarized data." (William H Inmon & Anthony Nesavich, "Tapping into Unstructured Data", 2007)

"A data structure that is optimized for distribution. It collects and stores integrated sets of historical data from multiple operational systems and feeds them to one or more data marts. (Standard definition from The Data Warehousing Institute)" (Steve Williams & Nancy Williams, "The Profit Impact of Business Intelligence", 2007)

"A specialised database containing consolidated historical data drawn from a number of existing databases to support strategic decision making." (Keith Gordon, "Principles of Data Management", 2007)

"Central repository holding cleaned and transformed information needed by an organization to make decisions, usually extracted from an operational database." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2007)

"Data warehouse is a central repository for summarized and integrated data from operational databases and external data sources." (S Sumathi & S Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"A logical warehouse of data that gathers production and operational information from various departments of a corporation into a single data entity. This information is loaded regularly, which allows for careful analysis over a period of time." (Stuart Mudie et al, "BusinessObjects™ XI Release 2 for Dummies", 2008)

"A repository of data for offline use in building reports and analyzing historical data." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"A data warehouse is a system of records (a business intelligence gathering system) that takes data from a company's operational databases and other data sources and transforms it into a structure conducive to business analysis." (Sivakumar Harinath et al, "Professional Microsoft® SQL Server® Analysis Services 2008 with MDX", 2009)

"A database designed for reporting and data analysis. A data warehouse typically contains data representing the business history of an organization." (Jim Joseph, "Microsoft SQL Server 2008 Reporting Services Unleashed", 2009)

"A large data store containing the organization’s historical data, which is used primarily for data analysis and data mining." (Judith Hurwitz et al, "Service Oriented Architecture For Dummies" 2nd Ed., 2009)

"An integrated, centralized decision support database and the related software programs used to collect, cleanse, transform, and store data from a variety of operational sources to support Business Intelligence. A Data Warehouse may also include dependent data marts." (DAMA International, The DAMA Guide to the Data Management Body of Knowledge 1st Ed., 2009)

"(1) A centralized database for collecting the data from numerous other systems so that they can be made available for management reporting. The database is close to 3rd Normal Form.  (2)  A system that includes the central database described in 1; plus procedures for extracting, transforming, and loading data from other systems; and one or more data marts that organize subsets of the data for particular reporting purposes." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"[A] data warehouse is a collection of data designed to support management in the decision-making process. It is a subject oriented, integrated, time-variant, non-up datable collection of data used in support of management decision-making processes and business intelligence. It contains a wide variety of data that present a coherent picture of business condition at a single point of time. It is a unique kind of database which focuses on business intelligence, external data and time-variant data." (Vijay K Pallaw, "Database Management Systems" 2nd Ed., 2010)

"A data warehouse is a large, enterprise-wide database that acts as a central storage location for data that has been through the Extract, Transform, and Load (ETL) process. A data warehouse often includes historical data as well." (Ken Withee, "Microsoft Business Intelligence For Dummies", 2010)

[Active Data Warehouse (ADW):] "A data warehouse that is generally capable of supporting near-real-time updates, fast response times, and mixed workloads by leveraging well-architected data models, optimized ETL processes, and the use of workload management." (Martin Oberhofer et al, "The Art of Enterprise Information Architecture", 2010)

"A large, often enterprise-wide repository of data used for reporting and analysis. Data warehouses generally collect and manage data from a large number of operational and financial systems across an enterprise." (Janice M Roehl-Anderson, "IT Best Practices for Financial Managers", 2010) 

"A subject-oriented, integrated, time-variant, and historical collection of summary and detailed data used to support the decision-making and other reporting and analysis needs that require historical, point-in-time information. Data, once captured within the warehouse, is nonvolatile and relevant to a point in time." (David Lyle & John G Schmidt, "Lean Integration", 2010)

"A specialized type of database that is used to aggregate data from transaction databases for data analysis purposes, such as identifying and examining business trends, to support planning and decision making." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed, 2011)

[Enterprise Data Warehousing (EDW):] "A data repository of organizational data that is organized, analyzed, and used to enable more informed decision making and planning." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed, 2011)

[federated data warehouse:] "1.A conceptual Data Warehouse made up of multiple decision support databases, potentially on multiple servers, but presented transparently to Business Intelligence users as a unified schema for query, analysis, and reporting. 2.An Enterprise Data Warehouse fed by extracts from departmental Data Warehouses and/or legacy Data Warehouses prior to their incorporation and/or retirement." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The database that stores operations data for long periods of time." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A database used for reporting and analysis." (Craig S Mullins, "Database Administration", 2012)

"A large data store containing the organization’s historical data, which is used primarily for data analysis and data mining. It is the data system of record." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A shared repository of data, often used to support the centralized consolidation of information for decision support." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"A shared repository of data, often used to support the centralized consolidation of information for decision support." (Evan Stubbs, "Delivering Business Analytics: Practical Guidelines for Best Practice", 2013)

"A store that provides data from the originating source or the operational data stores; it contains historical and derived data. Also known as an information warehouse." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"A subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management’s decisions" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"A large data store containing the organization’s historical data, which is used primarily for data analysis and data mining. It is the data system of record." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"A centralized database system which captures data and allows later analysis of the collected data." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"A secondary database that holds older data for analysis. In some applications, you may want to analyze the data and store modified or aggregated forms in the warehouse instead of keeping every outdated record." (Rod Stephens, "Beginning Software Engineering", 2015)

"Decision-making data base containing the totality of a business’ decision-making data (all subjects)." (Fernando Iafrate, "From Big Data to Smart Data", 2015) 

[Enterprise Data Warehouse (EDW):] "A clean data store created to merge and store data from different sources for enterprise data analysis." (David K Pham, "From Business Strategy to Information Technology Roadmap", 2016)

"A granular, time-variant, structured store of historical data in a neutral, nonredundant format for multiple uses. Its purpose is the reuse of data." (Gregory Lampshire et al, "The Data and Analytics Playbook", 2016)

"Electronic storehouses where vast amounts of data are arrayed, integrated, categorised, stored, and sold." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A repository for storing business-relevant data." (Jonathan Ferrar et al, "The Power of People", 2017)

"A very large database designed to support decision making in organizations. It is usually batch updated and structured for rapid online queries and managerial summaries. A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data." (Daniel J Power & Ciara Heavin, "Decision Support, Analytics, and Business Intelligence" 3rd Ed., 2017)

"A data warehouse is a repository for all the data that an enterprise collects from internal and external sources." (Amar Sahay, "Business Analytics" Vol. I, 2018)

"A data warehouse is a repository of enterprise data used for reporting and analysis." (Michelle Gutzait et al, "Hands-On Data Warehousing with Azure Data Factory", 2018) 

"A subject-oriented collection of data that is used to support strategic decision making. The warehouse is the central point of data integration for business intelligence. It is the source of data for data marts within an enterprise and delivers a common view of enterprise data." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

[Enterprise Data Warehouse (EDW):] "system used for analysis and reporting that consists of central repositories of integrated data from a wide spectrum of different sources." (Francesco Corea, "An Introduction to Data: Everything You Need to Know About AI, Big Data and Data Science", 2019)

"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data that supports management’s decision-making process." (Piethein Strengholt, "Data Management at Scale", 2020)

"A data warehouse is a central storage for all data that an enterprise’s various business systems collect." (Logi Analytics) [source]

"A data warehouse is a data management solution to store large quantities of historical business data, performing queries to support various business intelligence and analytics use cases." (Qlik) [source]

"A data warehouse is a database designed to enable business intelligence activities: it exists to help users understand and enhance their organization's performance. It is designed for query and analysis rather than for transaction processing, and usually contains historical data derived from transaction data, but can include data from other sources. Data warehouses separate analysis workload from transaction workload and enable an organization to consolidate data from several sources." (Oracle)

"A data warehouse is a relational database that is designed for analytical rather than transactional work. It collects and aggregates data from one or many sources so it can be analyzed to produce business insights. It serves as a federated repository for all or certain data sets collected by a business’s operational systems." (snowflake) [source]

"A data warehouse is a repository for data generated and collected by an enterprise's various operational systems." (Techtarget) [source]

"A data warehouse is a storage architecture designed to hold data extracted from transaction systems, operational data stores and external sources. The warehouse then combines that data in an aggregate, summary form suitable for enterprise-wide data analysis and reporting for predefined business needs." (Gartner)

"A data warehouse is a system used to do quick analysis of business trends using data from many sources." (KDnuggets)

[Big data warehouse:] "A specialized, cohesive set of data repositories andplatforms that supports a broad variety of analytics running on-premises, inthe cloud, or in a hybrid environment. BDW leverages traditional and new bigdata technologies such as Hadoop, Spark, columnar and row-based datawarehouses, ETL and streaming, and elastic in-memory and storage frameworks." (Forrester)

[Cloud data warehouse:] "An on-demand, secure, and scalable self-service data warehouse that automates the provisioning, administration, tuning, backup, and recovery to accelerate analytics and actionable insights while minimizing administration requirements." (Forrester)

"A database, typically very large, containing the historical data of an enterprise. Used for decision support or business intelligence, it organizes data and allows coordinated updates and loads." (Microstrategy)

"A large store of data drawing from a wide range of sources that can be processed, split, and analyzed to extract insights that guide management decisions. Data warehouses are typically relational databases that contain historical data and are designed for query and analysis." (Insight Software)

"A record of an enterprise’s past transactional and operational information, stored in a database. Data warehousing is not meant for current 'live' data; rather, data from the production databases are copied to the data warehouse so that queries can be performed without disturbing the performance or the stability of the production systems." (Appian)

"A system used for data analytics. They are a central location of integrated data from other more disparate sources, storing both current (real-time) and historical data which can then be used to create trends reports." (Solutions Review)

"An implementation of an informational database used to store sharable data sourced from an operational database-of-record. It is typically a subject database that allows users to tap into a company's vast store of operational data to track and respond to business trends and facilitate forecasting and planning efforts." (Information Management)

"The database that stores operations data for long periods of time." (Microsoft)

[Enterprise data warehouse:] "A repository of information that is used for reporting and analytics. It includes key data management functions, such as concurrency, security, storage, processing, SQL access, and integration." (Forrester)

[Enterprise data warehouse:] "a single database or set of databases that allow all of an organisation’s data to be stored in a central repository ideally in relationship to each other." (BI System Builders)

"In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons. The data stored in the warehouse is uploaded from the operational systems (such as marketing, sales, etc.)" (Wikipedia)

💎SQL Reloaded: Ways of Looking at Data II

In a previous post on the same topic I’ve started to exemplify how the logic for reports could be built using the PO Header/Details example provided by AdventureWorks. The queries were looking mainly at Vendor and PO Header data, the next logical step being the introduction of PO Details and Products. If the header contains general information about a PO that applies to all PO Lines, the PO Lines contain the reference to the purchased Product, Quantities (e.g. Ordered, Receipt, Rejected, etc.) & UOM, Prices (e.g. Unit Price), Reference Dates (e.g. Due Date), each PO Header having one or more PO Lines.

Following PO Header’s example and logical partitioning , could be created a view based only on PO Details and the child tables, though as in this case there are no child tables – does it makes sense to create a view? Yes, why not?! The logic encapsulated by a view doesn’t resume only to joining tables, but also to provide a restrained list of attributes, computed columns or limit the scope by using constraints.
As can be seen from the above query, has been introduced several computed columns: OpenQty – the Quantity remaining to be receipt calculated as the difference between OrderedQty and ReceiptQty, ReceiptTotal – the total value of quantities receipt, OutstandingTotal – the total value corresponding to OpenQty. These three columns are quite often met in analysis and even if their definition is quite easy to follow and implement, it makes sense to include them in the view, thus reducing logic replication. In query writing best practices I was recommending to encapsulate formulas and business logic in functions, however in this case the logic is quite simple and it doesn’t makes sense to create a function for each formula used in a query, the recommendation should be used in theory (complex) formulas that are often reused.

The alternative for partitioned view is to create an extended view that include PO Header and PO Details, such type of view is really easy to use by most of the users. I prefer to use both approaches, having thus more flexibility.
    Do not forget to check the variations in cardinality! This exercise allow also to identify whether the referential integrity is kept. Even if it’s hard to believe, I found referential integrity broken in databases designed to enforce it, therefore it’s a good idea to check whether the referntial integrity is kept between the various elements.
The same query could be also created also by using logical partitioned view, though before doing that we have to create a view also for Products:
    The final query based on logical partitioned views:
Before using the data there are 1-2 tests that needs to be done in order to assure ourselves that the data available at the various levels of details are synchronized: is the case of the SubTotal which should match the cumulated LineTotal for each line.
Simple reports based on PO Detail level resume in retrieving the POs matching a certain criteriea: closed POs, Open POs, POs with overreceipt, with rejections or any other business case.
Note:
The * (ALL) has been used only in order to minimize query’s length, as a good practice avoid using the “*” and provide instead only the list of attributes needed!

Other types of reports are showing the data at different aggregations, for example calculating the Total Amount spent for each Vendor/Product, the Outstanding Amount, etc.
More complex reports can be built with the help of window functions, for example showing the last PO Unit Price per Vendor/Product, such a report being quite useful when reevaluating the Standard Costs.
Happy coding!

01 February 2010

🕋Data Warehousing: Cube (Definitions)

"A subset of data, usually constructed from a data warehouse, that is organized and summarized into a multidimensional structure defined by a set of dimensions and measures. A cube's data is stored in one or more partitions." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Name for a dimensional structure on a multidimensional or online analytical processing (OLAP) database platform, originally referring to the simple three-dimension case of product, market, and time." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed, 2002)

"Proprietary data structure used to store data for an online analytical processing (OLAP) end user data access and analysis tool." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"A multidimensional data structure that represents the intersections of each unique combination of dimensions. At each intersection there is a cell that contains a data value." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"Used with online analytical processing (OLAP), data cubes are multidimensional structures built from one or more tables in a relational database(s)." (Sara Morganand & Tobias Thernstrom , "MCITP Self-Paced Training Kit : Designing and Optimizing Data Access by Using Microsoft SQL Server 2005 - Exam 70-442", 2007)

"A multidimensional structure that contains dimensions and measures." (Robert D Schneider & Darril Gibson, "Microsoft SQL Server 2008 All-in-One Desk Reference For Dummies", 2008)

"A multidimensional structure that contains dimensions and measures. Cubes are a denormalized version of either the entire database or part of the database and are used within SQL Server Analysis Services (SSAS)." (Robert D. Schneider and Darril Gibson, "Microsoft SQL Server 2008 All-In-One Desk Reference For Dummies", 2008)

"A set of data that is organized and summarized into a multidimensional structure defined by a set of dimensions and measures." (Jim Joseph, "Microsoft SQL Server 2008 Reporting Services Unleashed", 2009)

"A database object that organizes data for accessibility in an OLAP database." (Ken Withee, "Microsoft® Business Intelligence For Dummies®", 2010)

"A multi-dimensional data structure that contains an aggregate value at each point, i.e., the result of applying an aggregate function to an underlying relation. Data cubes are used to implement OLAP." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Refers to the multidimensional data structure used to store and manipulate data in a multidimensional DBMS. The location of each data value in the data cube is based on the x-, y-, and z-axes of the cube. Data cubes are static (must be created before they are used), so they cannot be created by an ad hoc query." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed, 2011)

"A set of data that is organized and summarized into a multidimensional structure that is defined by a set of dimensions and measures." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A multidimensional representation of data needed for online analytical processing, multidimensional reporting, or multidimensional planning applications." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"Cubes, also known as OLAP cubes, are preprocessed and presummarized collections of data that drastically improve query time. [...] OLAP cubes are logical structures as defined by the metadata." (Piethein Strengholt, "Data Management at Scale", 2020)

🕋Data Warehousing: Data Warehousing (Definitions)

Data Warehousing (DW): "The operational extract, cleansing, transformation, and load processes, and associated control processes, that maintain the data containing within a Data Warehouse." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

Data Warehousing: "creating a central database, extracting data from source systems, and then cleaning and loading the data in to the database. The database is designed in such a way that business users can then access the data to answer their business questions." (BI System Builders)

Data Warehousing: "The process of aggregating data from disparate sources for the purpose of building a data warehouse. Data warehousing involves design, development, testing, deployment, operations, impact analysis, and change management." (Insight Software)

"Data warehousing is a technology that aggregates structured data from one or more sources so that it can be compared and analyzed for greater business intelligence." (Informatica) [source]

🧭Business Intelligence: Enterprise Reporting (Part V: Filtering)

Business Intelligence
Business Intelligence Series

Generalities

Reports are often based on data sets ranging from a few thousands to millions and even billions of records, though the Users are interested most of the times in only smaller chunks of data that could be retrieved with the help of well-designed filters – controls that allow the input/selection of values applied for retrieving the needed data from the initial data set. In a reporting solution the filters are usually available in a screen of their own, called Filter, Input, Parameter or Prompt screen, the provided values being fetched into the query/queries on which the report is based on, or the queries are constructed dynamically.

Most of the values used in query constraints can be considered as parameters whose values are provided by filters; it’s even recommended to use filters instead of hard-coding values, improving thus reports’ flexibility. There are optional and mandatory parameters, in the later case a value needing to be provided in the corresponding filter, therefore such filter controls should be marked correspondingly. The actual values, especially in case of date values, should be validated before they are fetched as early as possible and should be avoided the not necessary roundtrips to the server.

In case of reports requested by Users the filters are typically specified beforehand, though in time other Users arrive to use the same report for other purposes, requesting additional filters in order to select the needed chunk of data. Adding a new filter of modifying an existing one requires a change to be done to report too, requiring in the end the availability of technical resources to develop, test and document the changes. Such gradual changes arrive to be quite expensive, therefore it should be targeted to capture and address the filtering requirements early in the design phase. 

A first problem related to filters is to identify the attributes more likely to be used by Users for filtering, though that’s not so easy as it seems because this requires a good understanding of the data and business as a whole. On the other side could be provided a filter for each attribute from the report, much like Excel lets Users do, though also this approach could be expensive as performance and costs, many attributes arriving to be never used, and overloads unnecessarily the filter screen, in some cases having a direct impact on report’s performance.

Talking from experience the focus for filtering is typically on Master Data (e.g. Customers, Vendors, Items, etc.), Document Numbers (e.g. PO Number, Invoice Number), Document Dates (e.g. Order Date, Invoice Date) or any other Dates important for the business, list of values (e.g. Statuses), and occasionally amounts or quantities, though maybe 50-70% of the attributes from a report match these criteria, the average report having between 20 to 40 attributes. The list of attributes used for filtering could be different for reports based on the same data, attribute’s selection depending on report’s type and level of detail. For example in case of a PO report it would be useful to filter for the PO Number, Vendor, Buyer, Item Number, Open/Closed Pos, etc.; in exchange in summary reports it might be needed to select the top x vendors based on the amounts spent or to select the POs with the price variation in a certain interval, etc.

A second problem is how to make use of the existing controls (text boxes, dropdowns, check boxes, radio buttons, mask controls, etc.) in order to provide Users easy-to-use and flexible filtering. The fact is that an inappropriate use of controls could decrease considerably reports’ performance, making a report even unusable. There is actually no big philosophy about how to choose the controls, this being depending mainly or attributes’ data type, of weather they are numeric, alphanumeric, date or boolean, and whether one or multi selection of values is required. List boxes, dropdowns, checkboxes and even textboxes, when the values are comma delimited, allow filtering on multiple of values, another technique for the same purpose being the use of wildcards.

Alphanumeric Attributes

As a general guideline for alphanumeric (text) values (e.g. PO Number, Serial Number) that have high value variability, it’s preferable to use a free entry text box, which could allow in addition the use of wildcards. For alphanumeric value with relatively small variability it’s preferable to use a (multi-selection) dropdown control or controls with similar features (e.g. list box control, text box + list box); such controls are populated with values from database each time the screen is loaded, fact that doesn’t recommend their use for huge list of values. Typically if the list of values have more than 100-200 values then it should be definitely considered the use of text boxes, eventually with the use of search capabilities or, given the fact that the rate of change of such lists can be ignored, the list could be cached.

Date Attributes

For date attributes it’s advisable to use a From Date/To Date combination of controls, which allow Users to limit report’s scope for a certain time interval. If one of the values is missing, then there will be a limit imposed only for the specified value. Date’s format could be enforced with the help of a mask control, which allow the selection of valid dates, or at least specify the expected format.

Numeric Attributes

For numeric values the normal text boxes would do when masked controls can’t be used, the former allowing using wildcards for the attributes that typically could be also alphanumeric (e.g. document numbers). For numeric intervals should be used a pair of controls for the upper and lower bound, having similar behavior as the date intervals. It is needed to differentiate between the various types of decimal delimiters, therefore it’s recommended to specify the delimiter used or at least transform in the background the numeric value to the format expected by database/tool. The use of controls that allow multi-selection of numeric values, excepting delimiter separated values provided in text boxes, doesn’t really makes sense unless the list of possible it’s quite small.

Boolean Attributes

Boolean data types are used to store the value of attributes that have false or true state, and eventually Null values, therefore checkboxes and radio buttons could be used for this purpose. The bi-state could be easily implemented only with a checkbox when is interested only one of the aspects, while two or three checkboxes can be used for selecting more than one value. On the other side radio buttons allow the selection of only one value, thus select either the false, true or missing values. Such filters could be also used to activate hardcoded complex logic based on multiple attributes and reduced to Boolean values.

31 January 2010

🧭Business Intelligence: Enterprise Reporting (Part IV: Choosing Report’s Attributes)

Business Intelligence
Business Intelligence Series

Introduction

How are chosen the attributes of a report? Attributes are added primarily based on users’ specifications, however often they can be too high level or the user ignored willingly or by mistake certain aspects. In general in a report is need to be shown the attributes of high relevance to a certain topic, for example Document information (Document Number, Type, Dates, Statuses, etc.), Product main information (Product Number, Description, Type, Status, etc.), Quantities, Prices, Amounts, Responsible Users (e.g. Buyers, Preparers, Managers, etc.) or Responsible Third Parties (e.g. Customers, Vendors, Carriers).

When choosing the attributes for a report, there are several important sets of attributes which needs to be considered:

Unique identifiers

Together with the various Names (e.g. Vendor Name, Customer Name) associated with entities, a report should include also the “unique identifier” (UID) for each entity, even if formed from one or more attributes. The UID allows identifying for example if duplicate records appear in report or it could be used to match/join the data from the reports with other data sets in order to pull details or for further analysis of data. For example in a PO report over PO Shipments a unique (natural) key could be identified by using the PO Number, Line Number and Shipment Number; for a Vendor could be used the Vendor Name or the GSL (Global Supplier Location) Number, though the later it’s more adequate because it’s more general and accurate, making easier Vendor’s identification. In theory, for the same scope could be used also the database (surrogate) unique identifier from PO Shipments table, the elements dictating report’s level of detail, respectively the Vendor ID, though even if surrogate UID are easier to use in joins, they could create confusion and overload the reports, given that surrogate UIDs need to be provided also for the other elements.

Documents like Invoices include an external and internal unique identifier, the Invoice Number together with the Vendor, typically unique in a system, form the external UID, while the Document or Voucher Number is used as internal UID. The external UID it’s easier to use for external-based considerations, while the internal UID it’s easier to use for internal needs, so it makes sense to include both types of unique identifiers.

Quantities & Related Attributes

In Item-related reports, most of the times it makes sense to include also the quantities (e.g. Transaction, Ordered, Delivered, Invoiced, On-Hand Quantities) together with the Unit of Measure (UOM) in which they are represented. It has to be made distinction between the Primary UOM, the UOM in which the item is stored, and the Transactional UOM, the UOM in which the Item is transacted; for example the Purchasing UOM, Sales UOM or Transaction UOM could be different than the Primary UOM in which the item is stored in Warehouse. In such cases together with the Transactional UOM should be provided also the Primary UOM and eventually the UOM Conversion Rate, when applicable.

Prices/Amounts & Related Attributes

For Item-related reports and not only, include the various Prices (e.g. Sales, Purchase, Standard Price) together with the Currency Code used even if only one Currency is used, same rule applying also for the amounts stored (e.g. Invoice, Sales Order, Purchase amounts). For financial reports it’s advisable to show both functional amounts, the amounts in the Currency used by GL (General Ledger), and transactional amounts, the Currency used in the transaction. When the level of details allows it, show also the Quantity, Price Unit used to calculate the amounts, the eventual Exchange Rate or UOM Conversion Rate used. When available, include also the Period when the Amount was booked in the system.

Dates

Typically should be included the Document Date (e.g. Invoice Date, Order Date) and Document Creation Date, together with the other Dates important for the business or data analysis (e.g. Need By Date, GL Date, Value Date). In general the Document Date or Document Creation Date, and GL Date for financial reports, should be mandatory attributes because they could be used to segment (partition) a data set in time units (e.g. days, weeks, months, periods, years, etc.).

Statuses

The various record statuses and document statuses should be again mandatory attributes in reports. Record statuses show whether a record is active, was cancelled or marked as deleted, while document statuses show documents’ processing status, often being associated with a workflow (e.g. approval or processing workflows). The record statuses could be synchronized and even merged with the document statuses.

Either expressed as flags or list of values, statuses are essential in delimiting the data set that needs to be considered for further calculations, because often not approved documents or cancelled records have low or no relevance for the business. Not approved documents are typically not considered for the various calculations until they were not approved, while cancelled records are associated with mistakes or the lack of need. Not being able to identify the active records can mess things pretty badly, because for example there are reports that show only active, while others show all the data available in a system. Therefore showing of statuses in reports can be important in the mitigation of differences between reports, especially when dealing with calculations.

It’s advisable to have the possibility to see also the cancelled records, for example in order to analyze the amount of waste expressed as overwork or for identifying the records that were cancelled by mistake.

In reports with multiple levels of details, it can be useful to show the statuses from all levels, as statuses might not be in sync or because they have different meaning. In theory, when the statuses are in synch and especially when considering cancellations, it should be enough to consider the status from the lowest level of detail from each logical entity (e.g. PO Shipment Status when considering PO, Invoice Line status when considering Invoices, both mentioned statuses when considering POs together with Invoices), though reality can prove to be a tough world for statuses, as programming errors and other business scenarios need to be considered.

Action Owners

Include Requestors, Document Preparers, Buyers, Managers or any other type of action owners, so a user can track the direct or indirect issues back to them.

Note:
Such attributes can be used as base to calculate/reflect action owner’s performance, fact that can infringe country or organization regulations so you need to check if there are any constraints in this direction and which set of attributes might be impacted. For example might be no problem to show the Buyer, though might be a problem to show information about who created/modified the record. Eventually if needed to calculate the performance at action owner level, substitute any attribute that can be used to identify a person with a random value, however if the mapping between the action owner and value used as substitute is known (in case unique identifiers are used) or easy to get (by checking records in the system), the data might be misused.

🧭Business Intelligence: Enterprise Reporting (Part IV: Design Best Practices)

Business Intelligence
Business Intelligence Series

If the query writing best practices are almost always clear, in what concerns the reporting/report’s best practices looks like a fuzzy area, the topics being centered around design (naming conventions, level of detail, what attributes need to be chosen, ownership, layout, formatting, filters, sorting, formatting, exporting functionality), security, deployment and report’s management.

Report’s Design pest practices: 
- Design for performance, usability, reusability, readability, accessibility, reliability, availability, personalization, interactivity, automation, adaptability and security.
- Reduce unnecessary network traffic.
- Group together the attributes coming from the same table/element/module.
- Use uniform coding style, formatting and naming conventions.
- Handle missing values (NULLs).
- Avoid hard-coding values.
- Include filter information, the date when the report was run and eventually the name of the Use who run the report.
- Design the report to fit screen’s resolution (e.g. 1280x1024) [1].
- Use existing reports as templates [1].
- Consider local date/time [1], currencies, decimal formatting.
- Highlight important data accordingly (e.g. styles & conditional formatting).
- Separating labels text from expressions [1].
- Use pagination in order to improve performance for large reports.
- Avoid blank pages.
- Keep objects together.
- Use flexible easy-to-use filters.
- Validate input filter data in early phases
- Provide a report for each level of detail.
- Backup aggregated reports with detailed reports - Link reports by using drill-down, drill-though, sub-reports techniques.
- Include unique identifiers for each important data element.
- Include a running number.
- Don’t overload the reports.

Report’s Management best practices:
- Provide an integrated reporting solution and integrate/align it with Data/Information/Knowledge Management strategy/vision.
- Align reports with organization’ strategic, financial and operational plans.
- Align reports with decisions making and performance management .
- Assign an Owner for each report.
- Provide metadata (e.g. Owner, Scope, Business Case, Attributes’ meaning, when the data were last time updated etc).
- Stabilize the requirements before creating the report.
- Train the Users how to use efficiently the reports/reporting tools. - Create processes for report’s creation, modification, testing, issues reporting and mitigation.
- Track report’s usage and performance
- Write unit tests & test the reports.
- Use versioning & keep older versions.
- Document reports (queries and dependent objects, filtering parameters, scope, business case, etc).
- Minimize reporting server’s workload.
- Archive/Backup important reports.

Report’s Security best practices:
- Use role-based security.
- Protect sensitive data.
- Enforce password best practices.
- Educate users about security concerns related to reporting tool’s security and security of data usage.
- Export data to formats that don’t allow data tempering.

Previous Post <<||>> Next Post   

References:
[1] Microsoft TechNet. (2010). Best Practices for Reporting. [Online] Available from: http://technet.microsoft.com/en-us/library/cc180385.aspx (Accessed: 31 January 2010).

30 January 2010

💎SQL Reloaded: Ways of Looking at Data I (Some Basics)

In two previous posts I was talking about the Level of Detail and the Level of Accuracy used in Reports, though from theory to practice is a long way and things are not always so clear as we would like them to be. Initially I wanted to exemplify the ideas with some queries based on Invoice Header/Lines example, though such an attempt supposes the construction and population of several tables– not the best thing to do when introducing concepts. AdventureWorks database’s tables seems the easiest way to start, the Purchase Orders (PO) Header/Details tables offering similar scenarios as the Invoice Header/Lines, in addition there are other tables that could be used together with them in order to demonstrate various techniques and reporting needs.

  First things first, even if in a Parent/Child situation as the one of the POs most of the examples deal with reports based on Parent & Chidren information together, it’s a good idea to start understanding first the Parent and then the Child element, in this case the PO Header and then the PO Lines. A query on PO Headers would include also the referenced tables – Vendors, Shipment Methods, Employees, etc. In case of big tables a query would typically include the most important attributes, while from referenced tables are included the most used elements.

Before further using the query it’s always a good idea to check the variations in cardinality starting from the main table, and increasingly with each join, this test allowing identifying the incorrect joins. When a WHERE clause is involved, it’s easier to use the comment characters (/* … */) and uncomment each join gradually.
Sometimes it makes sense to encapsulate the query in a view, allowing thus to reuse the logic when needed, minimize maintainability, enforce vertical and horizontal security – to mentioned just a few of views’ benefits. It’s said that views are coming with a slight decrease in performance, though there is always a trade between performance and reusability, including in OOP. It’s recommended to use query writing best practices and target to design your queries for the best performance. Excepting the columns from the initial query, in a view it’s recommended to add also the IDs used in the joins, for faster records filtering or for easier troubleshooting. I prefer to add them in front of the other attributes, though their use could be facilitated when grouped together with the attributes from the table referenced.
Probably you wonder what’s the meaning of Ex postfix, as you probably deducted from the comment, it stands for Extended. In general I prefer to create a view for each level of detail, in this case one for PO Headers and another one for PO Details, and to use nested views when the decrease in performance is minimal, or especially when the views include complex logic I don’t want to replicate. In addition the views could be reused by Users who don’t have to replicate the logic already included in view, making the analysis of data much easier. On the other side, when using views from various modules (e.g. Account Payables vs Pos, Sales vs. Pos), there are many elements which arrive to be used more than once, in general Master Data (e.g. Products, Vendors, Customers, Locations, etc.), thus in order to avoid these type of not necessary joins, I prefer to partition logically the views – two views won’t contain similar data, though this doesn’t mean that won’t be views that contradict this rule.

 Using the logical partitioning of views, an alternative view on PO Headers would be created without showing the Vendor end Employee information, following to use if necessary the corresponding views when needed. AdventureWorks includes a view on Vendors called Purchasing.vVendor though the level of detail is at Contact level, and a Vendor could have more than one contacts, therefore the view can’t be used for this purpose, following to create a second view called Purchasing.vVendors (if the name creates confusion you could rename it or rename the Purchasing.vVendor to reflect the fact that the level is Vendor Contacts, for example Purchasing.vVendorContacts).
The problem with the Purchasing.vPurchaseOrderHeaderEx view is that if a user needs additional information about Vendors not available in the view, I’ll have to add the attributes, arriving in the end to create most of the logic available in Purchasing.vVendors view, or the User will rejoin the Purchasing.vVendors view with Purchasing.vPurchaseOrderHeaderEx, having a redundant link to Vendors. The simplified version of the view is the following:
Now the three views - Purchasing.vVendors, HumanResources.vEmployee and Purchasing.vPurchaseOrderHeader could be joined in a query to obtain the same result as Purchasing.vPurchaseOrderHeaderEx view, this approach offering much more flexibility, even if the decrease in performance is higher, though the performance should be in theory better than joining Purchasing.vPurchaseOrderHeader_Ex with Purchasing.vVendors and HumanResources.vEmployee in order to cover the missing attributes. Unfortunately there is another small hitch – the HumanResources.vEmployee view doesn’t contain the ContactID, therefore the view needing to be modified in case is needed.
The logical partitioning allows also more flexibility and relatively increased performance when doing data analysis, for example in order to get the volume purchased per Vendor, the Purchasing.vPurchaseOrderHeader could be used to aggregate the data and join the resulting dataset with Purchasing.vVendors view.
A second look at the same sets of data, Vendors and PO Headers, could be based on a left join between the two, thus seeing also the Vendors for which no PO was approved or completed; the query could be used to return the same data as the previous data set (see ‘Vendors with Pos’ constraint). This second query offers greater flexibility, coming with a small decrease in performance and, in addition, must be taken care of NULL values.
The above examples were showing the cumulated values at Vendor level, what happens when we need to aggregate the data at CountryRegionName level? For this purpose could be used any of the two above queries, and simply aggregate the data at CountryRegionName level. Given the fact that the second query provides greater flexibility, it could be encapsulated in view (e.g. , thus allowing to easily aggregate the data at any level we want – CountryRegionName, StateProvinceName or even City.
 
Sometimes simple aggregations are not enough, Users needing more from the data – for example seeing which is the last/first (open) PO placed in the system for each Vendor. Such a report looks simple as request, though not always so easy to provide. Before the introduction of window functions in SQL Server and analytic functions in Oracle, the techniques for such requests were not so elegant… Here it is the SQL server implementation using RANK ranking window function in order to get the Last/First PO, the same aggregated amounts from the previous query being shown with the help of aggregate window functions.
Note:
Most of the data analysis focuses only for a certain list of vendors, countries or regions, for a certain time interval, or for any other special requirements, for example pending Pos , in order to approximate the future expenses, or for the rejected Pos, in order to see the volume of waste/rework or to quantify the lost opportunities. The above queries were created for general cases and they could be modified for specific requirements if needed.

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.