
04 March 2010

🕋Data Warehousing: Drill Down (Definitions)

"A technique for navigating through levels of data ranging from the most summarized (up) to the most detailed (down). For example, to view the details of sales data by year, a user can drill down to display sales data by quarter, and drill down further to display data by month." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"The act of adding a row header or replacing a row header in a report to break down the rows of the answer set more finely." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"To delve deeper into data by going from a summary value to more detailed values." (Margaret Y Chu, "Blissful Data ", 2004)

"The process of exposing progressively more detail by making selections of items in a dynamic report or further enhancing a query." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"Method of analysis for retrieving lower levels of detailed data starting from summary data." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A method of exploring detailed data that was used in creating a summary level of data. Drill down levels depend on the granularity of data within a dimension." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"To decompose data into more atomic components, that is, data at lower levels of aggregation." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"An analytical technique that lets a DSS user navigate among levels of data ranging from the most summarized (up) to the most detailed (down)." (Ciara Heavin & Daniel J Power, "Decision Support, Analytics, and Business Intelligence 3rd Ed.", 2017)

"A method of exploring detailed data that was used in creating a summary level of data. Drill Down levels depend on the granularity of the data in the data warehouse." (Intrafocus) 

01 March 2010

🕋Data Warehousing: Granularity (Definitions)

"The degree of specificity of information contained in a data element. A fact table that has fine granularity contains many discrete facts, such as individual sales transactions. A table that has coarse granularity stores facts that are summaries of individual elements, such as sales totals per day." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"The level of detail captured in the data warehouse." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"Granularity level is the level of detail of the data in a data warehouse or data mart." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"A classification of data based upon the level of detail at which data is recorded. Atomic, base, and summary are all classifications of data granularity." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"The level of detail contained in a unit of data. The more detail there is, the lower the level of granularity. The less detail there is, the higher the level of granularity." (William H Inmon, "Building the Data Warehouse", 2005)

"Refers to the level of detail represented by the values stored in a table’s row. Data stored at their lowest level of granularity are said to be atomic data." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed, 2011)

"Granularity refers to the smallest reusable element of information. See also physical granularity." (Charles Cooper & Ann Rockley, "Managing Enterprise Content: A Unified Content Strategy, 2nd Ed.", 2012)

"A term sometimes used to describe the level of detail used in obtaining information." (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)

"The level of detail found in a record of data" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"Granularity refers to the level of detail or precision for a specific information resource property. For instance, the postal address of a particular location might be represented as several different data items, including the number, street name, city, state, country and postal code (a high-granularity model). It might also be represented in one single line including all of the information above (a low-granularity model)." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed, 2016)

19 January 2010

🧭Business Intelligence: Enterprise Reporting (Part III: Levels of Accuracy)


Correlated with the level of detail (aka granularity), another aspect that needs to be considered in reports is the results' level of accuracy: how well the data reflect reality at each level of detail. Because the integration between modules of the same or of different systems brings some of the attributes from one module into the other (e.g. Document Numbers, UIDs, Dates), the reports can use those attributes in queries without involving the modules they come from. Thus, for example, in order to create an AP (Accounts Payable) Invoices report, it could be sufficient to use the GL (General Ledger) Date stored in AP without using the GL data directly; on the other hand, a more accurate report is obtained when using the AP and GL data together, as such a report can also include the manual postings made in GL that are not available in AP.
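As a minimal illustration, assuming hypothetical table and column names (APInvoices with a redundant GLDate column, and GLJournalLines), the two approaches could look as follows; the first query relies on the AP data alone, while the second uses the GL data and can therefore also pick up manual GL postings:

-- Approach 1: use the GL Date stored redundantly in AP (AP data only)
SELECT YEAR(ap.GLDate) AS FiscalYear
     , SUM(ap.InvoiceAmount) AS TotalAmount
FROM APInvoices ap
GROUP BY YEAR(ap.GLDate);

-- Approach 2: use the GL data directly; manual postings without a matching
-- AP invoice are included as well
SELECT YEAR(gl.GLDate) AS FiscalYear
     , SUM(gl.Amount) AS TotalAmount
FROM GLJournalLines gl
     LEFT JOIN APInvoices ap ON ap.InvoiceId = gl.SourceDocumentId
GROUP BY YEAR(gl.GLDate);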

Typically the data from one module that are stored in other modules should be in sync, though there can be exceptions. In the end it is the developer's task to understand the level of accuracy/detail the users need; it doesn't always make sense to provide the highest level of accuracy/detail when the user only needs some rough numbers, as a higher level of accuracy/detail equates with more effort and more tables added to a report.

When considering the different levels of accuracy/detail within reports, given two modules/entities A and B, we might end up creating a set of reports with (a minimal sketch for cases c and f follows the list):
a.  all records from A;
b.  all records from B;
c.  all records from A and the matching records from B, and vice versa;
d.  all records from A and aggregated data from B when the level of detail is changed;
e.  aggregated data from A vs. aggregated data from B in case of many-to-many cardinality, the aggregation being made at a level of detail at which the amounts are not duplicated;
f.  mismatches between A and B for the same scope (these can be shown at different levels of detail).
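As a minimal sketch, assuming two hypothetical tables InvoiceHeaders (A) and Payments (B) linked by InvoiceId, cases c and f could be written as:

-- c. all records from A and the matching records from B
SELECT a.InvoiceId, a.InvoiceAmount, b.PaymentId, b.PaymentAmount
FROM InvoiceHeaders a
     LEFT JOIN Payments b ON b.InvoiceId = a.InvoiceId;

-- f. mismatches between A and B (invoices without payments and vice versa)
SELECT a.InvoiceId, b.PaymentId
FROM InvoiceHeaders a
     FULL OUTER JOIN Payments b ON b.InvoiceId = a.InvoiceId
WHERE a.InvoiceId IS NULL
   OR b.PaymentId IS NULL;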

These scenarios would allow users to choose the report with the needed level of detail/accuracy, though covering all existing scenarios could be quite expensive, not to mention that there are reports spanning more than two modules. On the other hand, two reports, one for A and one for B, would be sufficient as long as the reference attributes between the modules are provided, leaving it to the users to match and aggregate the data; this alternative could prove problematic, though, because of the volume of data, synchronization issues between data sets, or the lack of an adequate skill set. The first solution is the ideal one, the second is workable; in the end it depends on what makes the users happy.


17 January 2010

🧭Business Intelligence: Enterprise Reporting (Part II: Levels of Detail)


Working with data mainly from the frontend (User Interface), users generally have limited knowledge of how the data are physically stored in the various systems existing in an organization. When a new report is needed they point out the attributes they know from the screens they are working with, and it falls to the developers to figure out whether the “soup of attributes” really makes sense and to find a workable solution. Once the developer has identified the attributes and the tables they are stored in, he/she can go on and create the query or queries on which the report will be based. For this task it is important to know how the various tables relate to each other, in other words the attributes that can be used to link the tables (or the relational path linking them) and the relations' cardinality, which reflects how the number of records of a table or data set changes when it is joined with another table or data set.

Between two tables, and by extension two data sets, the relations can have any of four types of cardinality: many-to-many (represented as m:n), one-to-one (1:1), one-to-many (1:n) and the reverse many-to-one (n:1). A definition could be given for each cardinality type, though it's easier to remember that in the x-to-y compound, between two tables or data sets A and B, x has the value 'one' when each record from A is referenced at most once by records from B (there can be records not referenced at all), and 'many' when at least one record can be referenced more than once; the same logic applies for y, with the tables' perspectives inverted.

The level of detail (LOD) of a report (aka data granularity) is directly correlated with changes in cardinality: when adding a data set to an existing data set, if one-to-many or many-to-many cardinality is involved then the level of detail changes too. In other words, if a record in a given data set is referenced more than once in the newly added table then the LOD changes. A simple example of an LOD change occurs in parent-child (master-detail) relations. For example, if the Invoice Lines are added to a data set based on the Invoice Header, then the LOD changes from Invoice Headers to Invoice Lines. If the Payments are then added to the resulting data set, the LOD changes from Invoice Lines to Payments only if there are multiple Payments for an Invoice (one-to-many or many-to-many cardinality); otherwise the LOD remains the same.
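As a minimal sketch with hypothetical InvoiceHeaders, InvoiceLines and Payments tables, each join below can multiply the rows of the previous data set and thus lower the report's LOD:

-- data set at Invoice Header LOD (one row per invoice)
SELECT h.InvoiceId, h.InvoiceDate
FROM InvoiceHeaders h;

-- adding the lines lowers the LOD to Invoice Line (one row per invoice line)
SELECT h.InvoiceId, h.InvoiceDate, l.LineNumber, l.Quantity
FROM InvoiceHeaders h
     JOIN InvoiceLines l ON l.InvoiceId = h.InvoiceId;

-- adding the payments lowers the LOD again only when an invoice has several payments
SELECT h.InvoiceId, l.LineNumber, p.PaymentId, p.PaymentAmount
FROM InvoiceHeaders h
     JOIN InvoiceLines l ON l.InvoiceId = h.InvoiceId
     LEFT JOIN Payments p ON p.InvoiceId = h.InvoiceId;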

Summarizing, the level of detail of a report is the lowest level at which a cardinality change occurs at entity level. One can speak of a lower or higher LOD relative to a given level of detail; for example, Invoice Payments have a lower LOD than Invoice Lines, while the Invoice Header has a higher LOD.

Why is it necessary to introduce the LOD, especially when it rests on a relatively complex notion such as cardinality? It is worth defining mainly because the term is used when defining and mitigating reporting requirements, its use being intuitive rather than well defined and understood. When creating reports it's important to find the adequate LOD for a report and to know what methods can be used to pull in data that would normally change the report's LOD without actually changing it.

The limitations imposed by the need to report at a LOD higher than the one implied by the underlying data can be overcome with the help of aggregate functions used with GROUP BY constructs. Such constructs make it possible to bring into a report data found at a lower LOD than the reporting LOD by grouping the data on a set of attributes. The aggregate functions calculate, for a set of values, the maximum, minimum, sum, count of records, standard deviation and other statistical values; most of them work on numeric data types, while maximum/minimum and the count of records also work with dates or alphanumeric data types.

For example, using aggregate functions the Invoice amounts can be aggregated and shown at Invoice Header level. In contrast, it's not possible to do the same with quantities, because typically each Invoice Line refers to a specific Product, and as apples can't be counted together with peaches, seeing the quantities requires changing the level of detail from Invoice Header to Invoice Line. The same technique can be applied to similar parent-child (master-detail) relations with one-to-many or one-to-one cardinality, as well as to many-to-many relations, which involve higher reporting complexity.
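A minimal sketch with the same hypothetical tables, summing the line amounts per invoice so that the report stays at Invoice Header LOD:

-- the line amounts are rolled up to Invoice Header level
SELECT h.InvoiceId
     , h.InvoiceDate
     , SUM(l.LineAmount) AS InvoiceAmount
     , COUNT(*) AS NumberOfLines
FROM InvoiceHeaders h
     JOIN InvoiceLines l ON l.InvoiceId = h.InvoiceId
GROUP BY h.InvoiceId, h.InvoiceDate;
-- quantities would only make sense summed per Product, i.e. at Invoice Line LOD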

Direct many-to-many relations are seldom encountered; they involve mainly data sets, the attributes used in the relation appearing more than once in each data set. Taking an example from the ERP world, there can be multiple Receipts and Invoices for the same Purchase Order; if there is no direct match between Receipts and Invoices that identifies uniquely which Invoice belongs to which Receipt, a report involving all three entities could be barely usable, with records being duplicated (e.g. 2 Invoices and 3 Receipts result in 6 records in such a scenario). For such reports to be usable the LOD needs to be raised to a higher level on at least one side; reconsidering the mentioned example, a report could show the Purchase Order details and aggregate the Invoice and Receipt quantities/amounts at Purchase Order level, or show detailed Invoices with aggregated Receipt information, respectively detailed Receipts with aggregated Invoice information.
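A minimal sketch under the same assumptions (hypothetical PurchaseOrders, Invoices and Receipts tables keyed by POId); aggregating both sides to Purchase Order level before joining avoids the record duplication:

-- Invoices and Receipts are aggregated separately to Purchase Order LOD, then joined
SELECT po.POId
     , po.OrderedAmount
     , inv.InvoicedAmount
     , rec.ReceivedQuantity
FROM PurchaseOrders po
     LEFT JOIN (SELECT POId, SUM(Amount) AS InvoicedAmount
                  FROM Invoices
                 GROUP BY POId) inv ON inv.POId = po.POId
     LEFT JOIN (SELECT POId, SUM(Quantity) AS ReceivedQuantity
                  FROM Receipts
                 GROUP BY POId) rec ON rec.POId = po.POId;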

Unfortunately, even if the aggregate functions are quite handy, they have their own limitations; it is difficult to use them to answer questions like “what was the last product sold to each customer” or “which was the latest invoice placed by each customer”, at least not without considerable effort on the developer's side. Fortunately, database vendors have implemented their own more specialized types of functions (analytic functions in Oracle, window functions in SQL Server), which allow modeling such questions within a report. The downside for developers is that each database vendor comes with its own philosophy, so techniques and features working in one database might not work in another.
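As a minimal sketch of the window-function approach (SQL Server syntax; the Invoices table and its columns are hypothetical), ROW_NUMBER can mark the latest invoice per customer:

-- latest invoice per customer via a window (analytic) function
WITH RankedInvoices AS (
    SELECT CustomerId
         , InvoiceId
         , InvoiceDate
         , ROW_NUMBER() OVER (PARTITION BY CustomerId
                              ORDER BY InvoiceDate DESC) AS RowNo
    FROM Invoices
)
SELECT CustomerId, InvoiceId, InvoiceDate
FROM RankedInvoices
WHERE RowNo = 1;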

29 August 2009

🛢DBMS: Lock Granularity (Definitions)

"The size of the locked area - database, file, table, page, row, or column." (Peter Gulutzan & Trudy Pelzer, "SQL Performance Tuning", 2002)

"The size of the database element on which a lock is placed (usually a table or a row within a table)." (Jan L Harrington, "SQL Clearly Explained" 3rd Ed., 2010)

"Indicates the level of lock use. Locking can take place at the following levels: database, table, page, row, and field (attribute)." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed, 2011)

"The level of data that is involved in a lock. Lock granularity can be at any level of physical data: database, block, set of tables, row, or column level, and include any amount of data contained within those objects." (DAMA International, "The DAMA Dictionary of Data Management", 2011)



01 April 2009

🛢DBMS: Atomic(-level) Data (Definitions)

"Data elements that represent the lowest level of granularity. Depending on the context, this term may refer all the way back to the transactions from the operational systems, or it may refer to the base granularity held in a data warehouse." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"The most detailed granular data captured by a business process. Atomic data must be made available in the data presentation area to respond to unpredictable ad hoc queries." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit 2nd Ed ", 2002)

"Data elements that represent the lowest level of detail. For example, in a daily sales report, the individual items sold would be atomic data, and roll-ups such as invoice and summary totals from invoices are aggregate data." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling 2nd Ed.", 2005)

"Data with the lowest level of granularity. Atomic-level data sits in a data warehouse and is time-variant (that is, accurate as of some moment in time now passed)." (William H Inmon, "Building the Data Warehouse", 2005)

"1.Data at the lowest chosen level of detail (granularity). The level of detail chosen depends on the information requirements of the enterprise. For example, address could be one atomic item, or address could be split into further composite items such as house identifier and city. Opposite of aggregate data. 2.Non-aggregated observations, or measurements of characteristics of individual units, which cannot be further decomposed and retain any useful meaning." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Data at the lowest available level of detail or granularity." (Craig S Mullins, "Database Administration", 2012)

06 April 2007

🌁Software Engineering: Granularity (Definitions)

"An important software design concept, especially in relation to components. The amount of detail or functionality - from fine to coarse - provided within a service component. One software component can do something quite simple, such as calculate a square root. Or, it can have a great deal of detail and functionality to represent a complex business rule or workflow. The first is fine grained, and the second coarse grained. Developers often aggregate fine-grained services into coarse-grained services to create a business service." (Judith Hurwitz et al, "Service Oriented Architecture For Dummies" 2nd Ed., 2009)

"Loosely defined as the amount of computation done before synchronization is needed. The longer the time between synchronizations, the coarser the granularity will be. Fine-grained parallelism runs the danger of not having enough work assigned to threads to overcome the overhead costs of using threads. Adding more threads, when the amount of computation doesn’t change, only exacerbates the problem. Coarse-grained parallelism has lower overhead costs and tends to be more readily scalable to an increase in the number of threads." (Clay Breshears, "The Art of Concurrency", 2009)

"A description, from "coarse" to "fine", of a computer activity or feature (such as screen resolution, searching and sorting, or time slice allocation) in terms of the size of the units it handles (pixels, sets of data, or time slices). The larger the pieces, the coarser the granularity." (Microsoft, "SQL Server 2012 Glossary", 2012)

"The amount of decomposition applied to the parallelization of an algorithm, and the grain size of that decomposition. If the granularity is too coarse, there are not enough parallel tasks to effectively make use of all parallel hardware units and hide latency. If the granularity is too fine, there are too many parallel tasks and overhead may dominate the computation." (Michael McCool et al, "Structured Parallel Programming", 2012)

