10 January 2009

🛢DBMS: Function (Definitions)

"A set of instructions that operates as a single logical unit, can be called by name, accepts input parameters, and returns information. In programming languages such as C, a function is a named subroutine of a program that encapsulates some logic. The function can be called by name, using parameters to pass data into the function and retrieve data produced by the function. In Transact-SQL, a function is a unit of syntax consisting of a keyword and, usually, a set of parameters. There are several categories of Transact-SQL functions: string, math, system, niladic, text and image, date, aggregate, and conversion functions." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A program that returns a value to the program or environment from which it is called." (Bill Pribyl & Steven Feuerstein, "Learning Oracle PL/SQL", 2001)

"A section of code that operates as a single logical unit. Transact-SQL supports both built-in functions and user-defined functions. The built-in functions cannot be modified by SQL Server 2000 users, whereas users can create and modify user- defined functions." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"A named set of predefined programming language commands that performs a specific task given zero, one, or more arguments and returns a value." (Bob Bryla, "Oracle Database Foundations", 2004)

"A programming unit or expression returning a single value, also allowing determinant values to be passed in as parameters. Thus, parameter values can change the outcome or return result of a function. The beauty of a function is that it is self-contained and can thus be embedded into an expression." (Gavin Powell, "Beginning Database Design", 2006)

"A set of instructions that operates as a single logical unit." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"A centralized, server-based routine that can be included as part of your Transact-SQL statements. Typically used to streamline logic and reduce the amount of required programming effort, you can build your own functions." (Robert D Schneider & Darril Gibson, "Microsoft SQL Server 2008 All-in-One Desk Reference For Dummies", 2008)

"Functions are routines that can accept parameters, perform an action, and return the result of that action. SQL Server includes many built-in functions. User- defined functions can be created to meet specific needs." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"A piece of code that operates as a single logical unit. A function is called by name, accepts optional input parameters, and returns a status and optional output parameters. Many programming languages support functions, including C, Visual Basic, and Transact-SQL. Transact-SQL supplies built-in functions, which cannot be modified, and supports user-defined functions, which can be created and modified by users." (Jim Joseph, "Microsoft SQL Server 2008 Reporting Services Unleashed", 2009)

"A small program that performs one task and returns a single value. It may be built into the SQL language or written by a user, database administrator, or application programmer." (Jan L Harrington, "SQL Clearly Explained" 3rd Ed., 2010)

"A piece of code that operates as a single logical unit. A function is called by name, accepts optional input parameters, and returns a status and optional output parameters. Many programming languages support functions." (SQL Server 2012 Glossary, "Microsoft", 2012)

"A predefined algorithm supported by the DBMS for converting, manipulating, or calculating data based upon input. Functions can be used within SQL statements any place an expression can be used." (Craig S Mullins, "Database Administration", 2012)

"A relationship between a set of input data values and a set of result values that is used to extend and customize SQL or XQuery. Functions are invoked from elements of SQL statements such as the select list or the FROM clause and from XQuery as primary expressions." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

04 January 2009

🛢DBMS: User-defined Functions [UDFs] (Definitions)

"A Transact- SQL function defined by a user. Functions encapsulate frequently performed logic in a named entity that can be called by Transact-SQL statements instead of recoding the logic in each statement." (Microsoft Corporation, Microsoft SQL Server 2000 system administration, 2001)

"Functions that you write as opposed to built-in functions supplied by a programming language." (Greg Perry, "Sams Teach Yourself Beginning Programming in 24 Hours" 2nd Ed., 2001)

"A user-defined function is a Transact-SQL function defined by a user. Functions encapsulate frequently performed logic in a named entity that can be called by Transact-SQL statements instead of recoding the logic in each statement." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"A function that is written by an analyst, user, or database administrator and does not come as part of the default installation of the Oracle server software." (Bob Bryla, "Oracle Database Foundations", 2004)

"A collection of T-SQL statements with a well-defined set of input parameters, but only one output which can be a scalar value or a table. User-defined functions allow the encapsulation of various logical and database operations, but cannot be used to affect changes to a database." (Thomas Moore, "EXAM CRAM™ 2: Designing and Implementing Databases with SQL Server 2000 Enterprise Edition", 2005)

"A custom function written by developers in SQL Server." (Marilyn Miller-White et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance 70-444" , 2007)

"A Transact-SQL function defined by a user. Functions encapsulate frequently performed logic in a named entity that can be called by Transact-SQL statements instead of recoding the logic in each statement." (Microsoft Docs, SQL Server 2005)

"Extensible code added to the DBMS to be callable like built-in database functions." (Craig S Mullins, "Database Administration" 2nd Ed, 2012)

"A function that is defined to the DB2 database system by using the CREATE FUNCTION statement and that can be referenced thereafter in SQL statements." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"In SQL Server, a Transact-SQL function defined by a user." (Microsoft Technet)

"A user-defined routine that returns at least one value. (IBM)

03 January 2009

🛢DBMS: Stored Procedures [SP] (Definitions)

"A collection of SQL statements and optional control-of-flow statements stored under a name. SQL Server-supplied stored procedures are called system procedures." (Karen Paulsell et al,  "Sybase SQL Server: Performance and Tuning Guide", 1996)

"Transact-SQL statements stored under a name and processed as a unit. Stored procedures are stored within a database and can be executed with one request from an application. A stored procedure can also allow user-declared variables, conditional execution, and other powerful programming features." (Patrick Dalton, Microsoft SQL Server Black Book, 1997)

"Transact-SQL programs stored on the server in a compiled form. Processing for these objects is done at the server level, making them as fast as any other tables the procedure references on the local machine. They can be used to enforce entity integrity, referential integrity, and domain integrity." (Rob Scrimger et al, "MCSE TestPrep: SQL Server 6.5  Administration", 1998)

"A precompiled collection of Transact-SQL statements and optional control-of-flow statements stored under a name and processed as a unit. Stored procedures are stored within a database, can be executed with one call from an application, and allow user-declared variables, conditional execution, and other powerful programming features." (Microsoft Corporation, "Microsoft SQL Server 7.0 System Administration Training Kit", 1999)

"A stored procedure is a group of Transact-SQL statements compiled into a single execution plan." (Microsoft Corporation, "SQL Server 2000 Books Online", 2000) 

"A program that resides and executes inside a database system. In Oracle, the term generally encompasses not only procedures but also functions and packages." (Bill Pribyl & Steven Feuerstein, "Learning Oracle PL/SQL", 2001)

"A software program stored in the database itself to be executed on the server based on stipulated conditions." (Paulraj Ponniah, "Data Warehousing Fundamentals", 2001)

"A precompiled collection of Transact-SQL statements stored under a name and processed as a unit." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"A collection of T-SQL statements with a well-defined set of inputs, called input parameters, and a well-defined set of outputs, which may be output parameters, return values, or cursors. Stored procedures allow the encapsulation of various database operations." (Thomas Moore, "EXAM CRAM™ 2: Designing and Implementing Databases with SQL Server 2000 Enterprise Edition", 2005)

"A batch of SQL commands that are precompiled and saved as a procedure in SQL Server" (Joseph L Jorden & Dandy Weyn, "MCTS Microsoft SQL Server 2005: Implementation and Maintenance Study Guide - Exam 70-431", 2006)

"A collection of T-SQL statements with a well-defined set of inputs, called input parameters, and a well-defined set of outputs, which may be output parameters, return values, or cursors. Stored procedures allow the encapsulation of various database operations." (Thomas Moore, "MCTS 70-431: Implementing and Maintaining Microsoft SQL Server 2005", 2006)

"Also called a database procedure, a chunk of code stored within and executed from within a database, typically on data stored in a database (but not always)." (Gavin Powell, "Beginning Database Design", 2006)

"An executable code module in SQL Server." (Marilyn Miller-White et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance 70-444" , 2007)

"A piece of code stored in the database that can be executed by various pieces of code such as check constraints or application code. Stored procedures are a good place to store business logic that should not be built into the database's structure." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"Centralized, server-based application code. Typically used to standardize business logic and reduce the amount of required programming effort, you can build your own stored procedures or leverage the many built-in stored procedures offered by SQL Server." (Robert D. Schneider & Darril Gibson, "Microsoft SQL Server 2008 All-In-One Desk Reference For Dummies", 2008)

"A precompiled collection of Transact-SQL statements stored under a name and processed as a unit." (Jim Joseph et al, "Microsoft SQL Server 2008 Reporting Services Unleashed", 2009)

"A set of SQL statements (subroutine) available to applications accessing a relational database system. Stored procedures are physically stored in the database." (John Goodson & Robert A Steward, "The Data Access Handbook", 2009)

"A software program stored in the database itself to be executed on the server based on stipulated conditions." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"An executable module stored in a database instance, to be invoked as part of the verification process for particular data elements." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"A SQL program module that is invoked by an application program using the SQL CALL command." (Jan L Harrington, "SQL Clearly Explained" 3rd Ed., 2010)

"(1) A named collection of procedural and SQL statements. (2) Business logic stored on a server in the form of SQL code or some other DBMS-specific procedural language." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"A precompiled code routine stored within a Database Management System." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A precompiled routine or program stored within the DBMS and executable using DBMS commands." (Craig S Mullins, "Database Administration" 2nd Ed, 2012)

"A precompiled collection of SQL statements and optional control-of-flow statements stored under a name and processed as a unit. They are stored in an SQL database and can be run with one call from an application." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A database object that can contain executable database server logic. Permissions can be assigned to stored procedures in order to prevent unwanted data modifications and to provide more granular control of security." (Mark Rhodes-Ousley, "Information Security: The Complete Reference" 2nd Ed., 2013)

"A precompiled collection of SQL statements and optional control-of-flow statements stored under a name and processed as a unit. They are stored in an SQL database and can be run with one call from an application." (Microsoft Docs, SQL Server 2014)

"A block of code that is stored in an SQL database and can be called by applications. In some systems the code is written in an extended SQL dialect, in others it is written in a standard programming language and contains Embedded SQL. Stored procedures are used for one or more of the following reasons: to improve performance; to provide greater security; or to improve manageability." (Microfocus)

"A block of procedural constructs and embedded SQL statements that is stored in a database and that can be called by name. Stored procedures allow an application program to be run in two parts, one on the client and the other on the server, so that one call can produce several accesses to the database." (IBM)

"A named PL/SQL block or Java program that Oracle Database stores in the database. Applications can call stored procedures by name." (Oracle)

"A precompiled collection of code such as SQL statements and optional control-of-flow statements stored under a name and processed as a unit. Stored procedures are stored within a database; they can be executed with one call from an application and allow user-declared variables, conditional execution, and other powerful programming features." (Microsoft)

"A precompiled collection of Transact-SQL statements that are stored under a name and processed as a unit." (Microsoft Technet)

"A stored procedure is a combination of SQL statements and control and conditional handling statements that provides an interface to the database." (Teradata)

"A stored procedure is a precompiled executable object that contains one or more Transact-SQL statements." (Microsoft) [source

"A stored procedure is an executable object stored in a database." (Microsoft) [source]

02 January 2009

🛢DBMS: Views (Definitions)

"An alternative way of looking at the data in one or more tables. Usually created as a subset of columns from one or more tables." (Karen Paulsell et al,  "Sybase SQL Server: Performance and Tuning Guide", 1996)

"Predefined query that provides an alternative method of looking at the data in one or more underlying tables. Typically views are used to focus a users view of data and for security." (Rob Scrimger et al, "MCSE TestPrep: SQL Server 6.5  Administration", 1998)

"An alternate way of looking at data from one or more tables in the database. A view is a virtual table, usually created as a subset of columns from one or more tables, which is expressed as a SELECT statement. Depending on the definition, data in base tables can be modified through views." (Microsoft Corporation, "Microsoft SQL Server 7.0 System Administration Training Kit", 1999)

"A 'virtual table', usually referring to a database object that has been named and created with SQL's CREATE VIEW statement. Usually created for read-only purposes, it is possible to update the database through some views; as of Oracle8, it is also possible to associate an INSTEAD OF trigger with a view to allow INSERT, UPDATE, and DELETE operations on the view." (Bill Pribyl & Steven Feuerstein, "Learning Oracle PL/SQL", 2001)

"SQL statement that creates logical copies of a table or a complete query that can be used separately in a SELECT statement. Views are semantically independent, so the separate roles of a dimension usually are implemented as views." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"A database object that can be referenced the same way as a table in SQL statements. Views are defined using a SELECT statement and are analogous to an object that contains the result set of this statement." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"A view is a relational database object that can be referenced and built by using SELECT statements to join data from one or more base tables. Views are similar to tables in that data can be retrieved and modified and indexes can be built." (Thomas Moore, "MCSE Training Guide 70-229: Designing and Implementing Databases with Microsoft® SQL Server™ 2000 Enterprise Edition", 2003)

"A database object that is based on a SELECT statement against one or more tables or other views in the database. A regular view does not store any data in the database; only the definition is stored. Views are also known as stored queries." (Bob Bryla, "Oracle Database Foundations", 2004)

"A relational database object that can be referenced and built by using SELECT statements to join data from one or more base tables. Views are similar to tables in that data can be retrieved and modified and indexes can be built." (Thomas Moore, "MCTS 70-431: Implementing and Maintaining Microsoft SQL Server 2005", 2006) 

"A query that is stored in the data dictionary and is rerun when called for. A view appears to a user to be a table." (Sikha S Bagui & Richard W Earp, "Learning SQL on SQL Server 2005", 2006)

"virtual representations of a table whose contents are defined by a query" (Joseph L Jorden & Dandy Weyn, "MCTS Microsoft SQL Server 2005: Implementation and Maintenance Study Guide - Exam 70-431", 2006)

 "A database object in SQL Server used to encapsulate a query. Commonly referred to as a virtual table." (Marilyn Miller-White et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance 70-444" , 2007)

"A virtual table which is derived from base table using a query." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"A virtual grouping of one or more tables, often done to reduce complexity while increasing security and reliability." (Robert D. Schneider & Darril Gibson, "Microsoft SQL Server 2008 All-In-One Desk Reference For Dummies", 2008)

"A stored SQL query from which a virtual table is created for use each time the name of the view is used." (Jan L Harrington, "SQL Clearly Explained" 3rd Ed., 2010)

"A virtual table consisting of columns and selected instances from other tables assembled in a systematic way." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"A presentation of a set of data from one or more physical tables as one logical table. A view can include some or all the rows and columns from each contributing table, and can be defined as the result table from a SELECT statement." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A virtual table, defined as a SQL SELECT statement, to provide a subset of data from one or more tables." (Craig S Mullins, "Database Administration" 2nd Ed, 2012)

"A virtual table that uses the result set of a saved query." (Faisal Akkawi et al, "MCSA Guide to Microsoft SQL Server 2012 - Exam #70-462", 2014)

"A database object that is created from one or more tables and can be used the same as a table. A view is a virtual table that has no storage requirements of its own." (Arie D Jones, "SQL in 24 Hours" 6th Ed., 2015)

"A virtual relation defined by the database administrator in order to keep subjects from viewing certain data." (Adam Gordon, "Official (ISC)2 Guide to the CISSP CBK 4th Ed.", 2015)

01 January 2009

🛢DBMS: Database Object (Definitions)

"One of the components of a database: table, view, index, procedure, trigger, column, default, or rule." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"One of the components of a database: a table, index, trigger, view, key, constraint, default, rule, user-defined data type, or stored procedure." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Any structure or entity that exists in an Oracle database, such as a table, index, PL/SQL program, or view. For a list of database objects owned by the current user, look in the data dictionary's USEROBJECTS view." (Bill Pribyl & Steven Feuerstein, "Learning Oracle PL/SQL", 2001)

"Any database component. It could be a table, index, trigger, view, key, constraint, default, rule, user-defined data type, or stored procedure in a database." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"Any of the various items included in a database including tables, views, diagrams, and so on." (Victor Isakov et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance (70-444) Study Guide", 2007)

"Any object in a database, such as a table, a view, an index, a stored procedure, or a trigger." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"An object that exists in an installation of a database system, such as an instance, a database, a database partition group, a buffer pool, a table, or an index." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

02 December 2008

🧭Business Intelligence: Perspectives (Part I: General Issues)

Business Intelligence Series

Introduction

BI projects are noble in intent, though many managers and data professionals ignore their implications and prerequisites: data quality (incl. availability), cooperation, maturity, infrastructure, adequate tools, and knowledge.

Data Quality

The problem with data usually starts at the source: ERP and other information systems (IS). In theory, the source system should cover all the basic reporting requirements in an enterprise, though that's seldom the case. Therefore, basic reporting needs end up being covered by tools developed ad hoc, often MS Excel/Access solutions, which are difficult to integrate and manage across the organization.

Data Quality (DQ) is perhaps the most ignored component in the attempt to build flexible, secure, and reliable BI solutions. DQ depends on the validation implemented in the source systems and on the mechanisms used to cleanse the data before it is reported, as well as on the efficiency and effectiveness of the existing business processes and best practices.

DQ must be guaranteed for accurate decisions. If the quality is not validated and reviewed periodically, users will be reluctant to use the reports! The reports must be validated as part of the UAT process. Aggregated BI reports need detailed reports that can be used for validation, while the logic and data of the two need to be kept in sync.

The quality of decisions depends on the degree to which the data were understood and presented to the decision makers, though that’s not enough; a complete perspective is also needed, and maybe that’s why some business users prefer to prepare and aggregate the data themselves, a process that in theory allows them to get a deeper understanding of what’s happening.

Cooperation

A BI initiative doesn’t depend only on the effort of one department (usually IT), but on the business as a whole. Unfortunately, the so-called partnership is more a theoretical term than a fact, while the involvement of managers and business users is often suboptimal.

BI implementations also depend on the consultants’ skills and the degree to which they understand the business requirements, on the team’s cohesion and other project (management) related prerequisites, as well as on knowledge transfer and training.

Tools

Most of the BI tools available on the market don’t satisfy all the requirements of the business and its users. Even if they excel in some features, they fall short in others. Usually, more than one BI tool is needed to cover (most of) the requirements. When features are not available, are not mature enough, or are difficult to learn, users will prefer the tools they already know.

Another important consideration is that BI tools rely on data models that are often inflexible in terms of the data they provide and that make it hard to integrate additional datasets, algorithms, and customizations. More recently, the overall requirements also need to be considered from the perspective of cloud computing technologies, which are steadily becoming a requirement for today's business dynamics.

Maturity 

Besides the fact that Capability Maturity Models (CMMs) are difficult to implement, organizations lack the know-how for transforming data into knowledge, respectively for understanding data and evolving it further into wisdom and competitive advantage.

Most of the fancy words used by salespeople to sell a product don’t become reality overnight. Of course, a BI tool might have the potential to fulfill the various technical and nontechnical goals, though between a theoretical potential and harnessing it lies a long road that needs to be addressed at strategic, tactical, and operational levels.

Infrastructure

Infrastructure refers to the human and technical components and the way they interact in getting the job done. It's not only about "breaking habits" and using the best tools, but also about aligning people and technologies to the desired level of performance, and about retaining and diffusing knowledge.

11 November 2008

🗄️Data Management: Data Quality (Part I: Information Systems' Perspective)

Data Management Series

One LinkedIn user brought to attention the fact that, according to top IT managers, the top two reasons why CRM investments fail are: (1) managing resistance within the organization; (2) bad data quality.

The two reasons are common not only to CRM or BI solutions but also to other information systems, though of the two, data quality usually has the bigger impact. Data quality continues to be a problem especially in ERP systems, and here are a few reasons:
  • Processes span different functions and/or roles, each of them maintaining the data they are interested in, without any agreement or coordination on ownership. The lack of ownership is, in general, management’s fault.
  • Within an enterprise, many systems end up being integrated, and the quality of the data depends on the quality and scope of the integrations, i.e. on whether they were addressed fully or only superficially. Few integrations are stable and properly designed. While stability can be achieved over time, the scope is seldom changed, as that involves further investments; thus, the remaining data need to be maintained manually, and the issues need to be either troubleshot or left to accumulate in the backlog.
  • There are systems that are not integrated but use the same data, so users need to duplicate their effort and often focus only on their immediate needs. Moreover, the lack of mappings between systems makes data analysis and review difficult.
  • The lack of knowledge about the systems used, in terms of processes, procedures, best practices, policies, etc. Users usually try to do their best based on the knowledge they have, and, despite their best intentions, the systems end up being misused just to get things done.
  • Basic or nonexistent validation of data entry at each important entry point (UI, integration interfaces, bulk upload functionality), system permissiveness (allowing workarounds), and issues with stability and reliability (bugs/defects).
  • The absence of data quality control mechanisms or quality methodologies, respectively of a Data and/or Quality Management strategy. If data quality is not kept under review, it can easily decrease over time.
  • The lack of a data culture and of processes that support data quality.
  • People lack the consistency and/or self-discipline to follow the processes and to update the data as the processes require, not only the data needed to move to the next or final step. Therefore, the gap between reality and the picture presented by the system is considerable.
  • People are not motivated to improve data quality, even if they may recognize its importance.

Data quality is usually ignored in BI projects because few people search for the causes, and it is easier to blame the BI solution or the technical team than to do something about it. This is one of the reasons why users are reluctant to use a BI solution, along with the solution’s flexibility and the degree to which it satisfies users’ needs. On the other hand, BI solutions are often abused, either by including reports that have OLTP characteristics or by providing too much unstructured or inadequate content that needs to be reworked further.

Data quality gets on the managers' agenda especially during ERP implementations. Unfortunately, it disappears from the agenda as quickly as it appeared, even though managers are warned of the consequences poor data quality might have on the implementation and on further data use. An ERP implementation is supposed to be an opportunity for improving data quality, though for many organizations it remains only an opportunity. Once this opportunity passes, organizations need more financial and human resources to achieve a fraction of what they missed.

The above topics are complex and need further discussion (see [1], [2]).


Written: Nov-2008, Last Reviewed: Mar-2024

Resources:
[1] SQL-Troubles (2010) Data Management: Data Quality - An Introduction
[2] SQL-Troubles (2012) Data Migration: Data Quality’s Perspective I - A Bird’s-Eye View

08 November 2008

💎SQL Reloaded: Dealing with data duplicates on SQL Server

Duplicates or duplications can be defined as "result rows that are identical to one or more other rows" [1]. They occur frequently in data collections (e.g. Excel, Access) accessed concurrently with minimal or no validation. The more users there are and the more frequent the updates to the same information, the higher the risk of duplicated data. Duplicates also occur in modern architectures when validation wasn’t addressed adequately, or because of usage-related issues; for example, in information systems the huge volume of data entry makes it easy to overlook that a document (e.g. purchase order, invoice, payment) was already processed.

Whole records, a group of attributes (fields), or only single attributes can be subject to duplication; it depends on the case. Often duplicates are easy to identify: it’s enough to let somebody with the proper knowledge look over them. But what do you do when the volume of data is too large, or when the process needs to be automated as much as possible? Using the DISTINCT keyword in a SELECT statement might do the trick, while other times more complicated validation is required, ranging from simple checks to Data Mining techniques.

I will try to exemplify the techniques I use to deal with duplicates with the help of a simple example based on a table that tracks information about Assets:

-- create test table
CREATE TABLE [dbo].[Assets](
 [ID] [int] NOT NULL,
 [CreationDate] smalldatetime NOT NULL,
 [Vendor] [varchar](50) NULL,
 [Asset] [varchar](50) NULL,
 [Model] [varchar](50) NULL,
 [Owner] [varchar](50) NULL,
 [Tag] [varchar](50) NULL,
 [Quantity] [decimal](13, 2) NULL
) ON [PRIMARY]

Here's some test data:

-- insert test data (SQL Server 2000+)
-- numeric literals are used for ID and Quantity to avoid implicit conversions
INSERT INTO dbo.Assets
VALUES (1, DATEADD(d,-5, GetDate()),'IBM','Laptop 1','Model 1','Owner 1','XX0001',1)
INSERT INTO dbo.Assets
VALUES (2, DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002',1)
INSERT INTO dbo.Assets
VALUES (3, DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001',1)
INSERT INTO dbo.Assets
VALUES (4, DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001',1)
INSERT INTO dbo.Assets
VALUES (5, DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001',1)
INSERT INTO dbo.Assets
VALUES (6, DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001',1)

    
Let’s check table’s content:

-- review the data
SELECT ID, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets

Output:
ID CreationDate Vendor Asset Model Owner Tag Quantity
1 1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
2 1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
3 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
5 1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

Normally, a Tag or a Serial Number should uniquely identify a product coming from the same Vendor, so in this case identifying the duplicates reduces to finding the records in which the Tag and Vendor combination appears more than once:

-- retrieve the duplicates
SELECT Vendor, Tag
FROM dbo.Assets A    
GROUP BY Vendor, Tag
HAVING COUNT(*)>1

Output:
Vendor Tag
Dell DD0001
Microsoft WX0001

In many cases showing only the duplicated values is enough. However, there are cases in which it is required to see the whole records in which the duplicates appear, for example to clarify the context in which the issues occur. A simple JOIN with the base table provides the affected records:

-- retrieve duplicates with details
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN (-- duplicates
	  SELECT Vendor, Tag
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	  HAVING COUNT(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag

Output:
Id CreationDate Vendor Asset Model Owner Tag Quantity
5 1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1
3 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1

There are two scenarios in which we need to remove the duplicates – in a result set or within an existing table.

In a result set normally it's enough to use the DISTINCT keyword to remove duplicated rows:

-- select unique records
SELECT DISTINCT CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets

Output:
CreationDate Vendor Asset Model Owner Tag Quantity
1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
1/31/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 3 DD0001 1
1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

In our example only some combinations of attributes are duplicated, while the other attributes might differ slightly, and therefore another approach is needed. First of all, we need to identify which record is the most reliable. In some cases the latest entered record should be the most accurate or the closest to reality, but that’s not necessarily true. There are also cases in which we don’t care which record is selected, but from experience these cases are few.

Oracle and SQL Server introduced the dense_rank() analytic function, which returns the rank of rows within the partition of a result set, without any gaps in the ranking. In our case the partition is determined by Vendor and Tag, and what remains is to identify the logic used for ranking. Supposing that we are always interested in the last record entered, the query would look like this:

-- retrieve the last record per Vendor and Tag via ranking functions
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM (--subquery 
 SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
 , dense_rank() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC , Id DESC) RANKING 
 FROM dbo.Assets 
) A 
WHERE RANKING = 1

Output:
Id CreationDate Vendor Asset Model Owner Tag Quantity
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1
1 1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
2 1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1

Things are simple when the records’ creation date is available and unique for the key used to validate the data, because a descending order allows selecting the last record first. An Id can be added to the ORDER BY clause for the cases in which multiple records have the same date. At least in this example, the same output is obtained by using the row_number() analytic function, which returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition. The difference is that dense_rank() assigns the same rank to rows with the same values in the ORDER BY clause (within the same partition), and thus forces the developer to choose the exact attributes that make a record unique within a partition, while row_number() returns a distinct sequence number no matter which attributes are used in the ORDER BY clause.

Both functions should provide similar performance. However, when it is not important which record is selected within a partition, it’s better to use the row_number() function, which should need less processing.
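
As a minimal sketch of the row_number() variant mentioned above (the RowNo alias is only an illustrative name), the query can be written against the same test table:

```sql
-- last record per Vendor and Tag via row_number() (SQL Server 2005+)
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM ( -- numbered records
 SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
 , ROW_NUMBER() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC, Id DESC) RowNo -- RowNo: illustrative alias
 FROM dbo.Assets
) A
WHERE RowNo = 1
```

With the test data it returns the records with the Ids 1, 2, 4 and 6, the same as the dense_rank() query. The two functions start to differ when the ORDER BY clause doesn't make the rows unique: ordering only by CreationDate DESC, dense_rank() would return both Microsoft records (Ids 3 and 4), while row_number() would still return just one of them, without any guarantee which one.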

Unfortunately, this technique doesn’t work in SQL Server 2000, where a different approach is needed. In most cases the unique identifier of a record is a sequential unique number, the highest Id corresponding to the latest entered record. This allows selecting the latest entered record by using the MAX function:

-- nonduplicated records (SQL server 2000+)
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN ( -- last entry
	  SELECT Vendor, Tag, MAX(Id) MaxId
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	 -- HAVING count(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND A.ID = B.MaxId

Output:
Id CreationDate Vendor Asset Model Owner Tag Quantity
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
2 1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
1 1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1

The same technique can be used to delete the duplicates from a table:

-- delete the duplicates, keeping the last entry (SQL Server 2000+)
DELETE A
FROM dbo.Assets A
     JOIN ( -- last entry
	  SELECT Vendor, Tag, MAX(Id) MaxId
	  FROM dbo.Assets
	  GROUP BY Vendor, Tag
	  HAVING COUNT(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND A.ID <> B.MaxId
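
On SQL Server 2005 and later, the deletion could also be sketched with row_number() and a common table expression. This is only one possible approach, and the names Ranked and RowNo are illustrative; it is safer to run it inside a transaction or on a copy of the table first:

```sql
-- delete all but the last entry per Vendor and Tag (SQL Server 2005+)
-- note: the preceding statement in the batch must be terminated with a semicolon
WITH Ranked AS ( -- Ranked and RowNo are illustrative names
 SELECT Id
 , ROW_NUMBER() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC, Id DESC) RowNo
 FROM dbo.Assets
)
DELETE FROM Ranked
WHERE RowNo > 1
```

With the test data this removes the records with the Ids 3 and 5 and keeps the latest record of each Vendor and Tag combination.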

When an Id is not available, we most probably have to create a temporary table with a sequence (numeric unique identifier), sort the records based on the criteria chosen for selection, and then apply the Id-based technique as above.
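
The temporary table approach could look roughly like the sketch below. The names #Assets, Seq and MaxSeq are illustrative, and the sketch assumes that the identity values are assigned in the order given by the ORDER BY clause of the INSERT ... SELECT statement, which is the behavior Microsoft documents for this form of insert:

```sql
-- temporary table with a sequence (illustrative names: #Assets, Seq, MaxSeq)
CREATE TABLE #Assets (
  Seq int IDENTITY(1,1) NOT NULL,
  CreationDate smalldatetime NOT NULL,
  Vendor varchar(50) NULL,
  Asset varchar(50) NULL,
  Model varchar(50) NULL,
  [Owner] varchar(50) NULL,
  Tag varchar(50) NULL,
  Quantity decimal(13, 2) NULL
)

-- the ORDER BY reflects the selection criteria: the latest record gets the highest Seq
INSERT INTO #Assets (CreationDate, Vendor, Asset, Model, [Owner], Tag, Quantity)
SELECT CreationDate, Vendor, Asset, Model, [Owner], Tag, Quantity
FROM dbo.Assets
ORDER BY Vendor, Tag, CreationDate

-- last entry per Vendor and Tag, based on the generated sequence
SELECT A.Seq, A.CreationDate, A.Vendor, A.Asset, A.Model, A.[Owner], A.Tag, A.Quantity
FROM #Assets A
     JOIN ( -- last entry
	  SELECT Vendor, Tag, MAX(Seq) MaxSeq
	  FROM #Assets
	  GROUP BY Vendor, Tag
     ) B
      ON A.Vendor = B.Vendor
     AND A.Tag = B.Tag
     AND A.Seq = B.MaxSeq

-- cleanup
DROP TABLE #Assets
```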

Notes:
1. In other scenarios it’s important to select all the records matching the extreme values (first, last), and the dense_rank() function comes in handy. For versions that don’t support it, a creation date attribute saves the day, when available:

-- all records matching the last creation date (SQL Server 2000+)  
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM dbo.Assets A
     JOIN (-- last entry
	  SELECT Vendor, Tag, MAX(CreationDate) LastCreationDate
	  FROM dbo.Assets A    
	  GROUP BY Vendor, Tag
	  -- HAVING count(*)>1
     ) B
      ON A.Vendor = B.Vendor 
     AND A.Tag = B.Tag 
     AND DateDiff(d, A.CreationDate, B.LastCreationDate)=0

Output:
Id CreationDate Vendor Asset Model Owner Tag Quantity
6 2/2/2024 10:46:00 PM Dell Laptop 4 Model 4 Owner 4 DD0001 1
1 1/29/2024 10:46:00 PM IBM Laptop 1 Model 1 Owner 1 XX0001 1
2 1/30/2024 10:46:00 PM IBM Laptop 2 Model 2 Owner 2 XX0002 1
3 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1
4 1/31/2024 10:46:00 PM Microsoft Laptop 3 Model 3 Owner 2 WX0001 1

2. Duplicated records can be created by merging two sets of data with UNION ALL. In such cases a simple UNION removes all the duplicated rows from the combined result set, however this burdens the database with more checks. It is recommended to find other solutions, when possible!
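
One possible alternative, sketched here under the assumption that the duplicates are determined by the Vendor and Tag combination, is to keep the UNION ALL and filter the second query explicitly. The table dbo.AssetsArchive is hypothetical (a table with the same structure as dbo.Assets) and serves only for illustration:

```sql
-- dbo.AssetsArchive is a hypothetical table with the same structure as dbo.Assets
SELECT A.Vendor, A.Tag, A.Asset
FROM dbo.Assets A
UNION ALL
SELECT B.Vendor, B.Tag, B.Asset
FROM dbo.AssetsArchive B
WHERE NOT EXISTS ( -- skip the combinations already returned by the first query
	  SELECT 1
	  FROM dbo.Assets A
	  WHERE A.Vendor = B.Vendor
	    AND A.Tag = B.Tag
     )
```

Unlike UNION, this doesn't remove the duplicates that exist within each of the two sets, so whether it fits depends on the data.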

3. Instead of using a single multi-row insertion I used multiple insertion statements because I preferred to make the tutorial usable also on SQL Server 2000. The multi-row VALUES clause is available starting with SQL Server 2008. Here’s the single multi-row insertion statement:

-- insert test data (SQL Server 2008+)
INSERT INTO dbo.Assets
VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
, ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
, ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
, ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

4. The above techniques should also work in Oracle with a few amendments: the attributes’ types must be adapted to the Oracle ones, the dbo schema prefix must be dropped or replaced with the corresponding Oracle schema, and Oracle’s SYSDATE must be used instead of SQL Server’s GetDate() function, as below:
 
-- insert test data (Oracle)
INSERT INTO Assets
VALUES ('1', SYSDATE-1, 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')

References:
[1] S Hull (1998) Business Objects Glossary
[2] Microsoft Learn (2023) dense_rank()
[3] Microsoft Learn (2023) row_number()

ERP Systems: Learning about Oracle APPS internals I

How can I learn more about Oracle APPS (Oracle Applications), and from where? It’s a normal question for every beginner, but also for experienced professionals.

Oracle made documentation about their products available through Oracle Technology Network and Metalink. The first source contains documents mainly as pdf files, while Metalink provides richer content and is easier to use. However, in order to access it, your company has to purchase an Oracle Support Identifier.

In Metalink, Oracle Applications’ documentation is grouped under the eTRM (Electronic Technical Reference Manuals) section, while the pdf documents can be found under Oracle 11i Documentation Library. Many of them, especially for older versions, can also be found on the web with a simple search by table or file name.

Both sources are far from complete and there are many gaps. In addition, many Oracle implementations also involve some customization, and information about these changes can be found, at best, in the documentation made during the implementation/customization process.

Lately many blogs on Oracle Applications internals have appeared, and even if many of them limit themselves to copying material from Metalink or other documents, there are also professionals who publish original content.

People can learn a lot by checking the objects that unveil the APPS internals: APPS.FND_TABLES provides the list of tables used, while APPS.FND_VIEWS provides the list of views. The problem with the latter is that the field that stores the views’ script can’t be searched directly, but the data can be exported to a text file and searched there (exporting the data completely to Excel won’t work). In time, developers get an intuition for how the views could be named, so a search on their names can help narrow down the results.
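
A search by name could look like the sketch below. The column names TABLE_NAME and VIEW_NAME are assumptions that should be checked against the actual definitions of the two objects, and the 'PO%' prefix is only an illustrative filter:

```sql
-- tables whose name starts with a given prefix (column name assumed)
SELECT TABLE_NAME
FROM APPS.FND_TABLES
WHERE TABLE_NAME LIKE 'PO%'
ORDER BY TABLE_NAME;

-- views whose name starts with the same prefix (column name assumed)
SELECT VIEW_NAME
FROM APPS.FND_VIEWS
WHERE VIEW_NAME LIKE 'PO%'
ORDER BY VIEW_NAME;
```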

Other professionals might be willing to help, so often it's a good idea to post questions on blogs, forums or social networks for professionals. Not all questions get answered, so rather than waiting for indirect enlightenment, it’s better to do some research in parallel too.

There will be cases in which none of the specified sources will help you. Most probably you'll then have to reverse engineer Oracle Applications' internals by studying various business scenarios, and in this case the experienced users could help a lot.

🧭Business Intelligence: Enterprise Reporting (Part I: An Introduction)

Business Intelligence Series

Introduction

Let's suppose that your company invested a lot of money in an ERP system, and besides the complex setup many customizations were made. To increase the ERP system's value, monitor the operations and make accurate decisions you'll need some reports out of it. What do you do then?

In general, there are 5 types of reporting needs: 
  • OLTP (On-Line Transaction Processing) system providing reports with actual (live) data;
  • OLAP (On-Line Analytical Processing) reports with drill-down, roll-up, slice and dice or pivoting functionality, working with historical data, the data source(s) being refreshed periodically;
  • ad-hoc reports – reports provided on request, often satisfying one time reports or reports with sporadic needs;
  • Data Mining tool(s) focusing on knowledge discovery (aka Data Science);
  • direct data access and analysis (aka self-service BI).

Standard Reports

ERP systems like Oracle Applications, Dynamics AX/365 or SAP come by default with a set of (predefined) standard reports, which in theory cover basic reporting needs. Unfortunately the standard reports are not as flexible as expected, e.g. they can be exported only to text and/or in a non-tabular format, and are therefore impossible to reuse for detailed analysis, or they have inadequate filtering parameters/constraints, behavior or scope. If existing functionality has been customized, most probably the existing reports need to be adapted to the new logic. In the end customers need to change the existing reports or adopt an OLAP solution.
    
Vendors tend to keep their solutions secret and/or don't invest much time into documenting the systems' functionality. Therefore, the information about an ERP’s internals is limited, while good developers are hard to find or really expensive, and often they need to reinvent the wheel. ERP vendors do provide some documentation about their systems' internals, though there are still many gaps concerning the tables’ structure and functionality. Fortunately, armed with enough patience and some knowledge about existing business processes and databases, a developer can reverse engineer an important part of the logic, though there's always a shade of doubt whether the logic is entirely correct or complete. The other good news is that more and more professionals blog on ERP topics, however few are the sources that bring something new.

OLAP Reporting  

OLAP solutions presume the existence of a data warehouse that reflects the business model, and when intelligently built it can satisfy an important percentage of the BI requirements. Building a data warehouse or a set of data marts is an expensive and time-consuming endeavor and rarely manages to satisfy everybody’s needs. There are also vendors that provide commercial off-the-shelf data models and solutions, and at first sight they look like a good deal, however such models are inflexible and seldom cover all requirements. One can end up customizing and extending the model, running into all kinds of issues involving the model’s design, flexibility, quality, resources and costs.
 
There are many ways in which things can go wrong or be misused. One such scenario is when an OLAP system is used to satisfy OLTP reporting needs. It’s like using a city car in a cross-country race: you might manage to compete or even finish the race, if you are lucky enough, but don’t expect to make a success out of it!

Ad-hoc Reporting   

The need for ad-hoc reports will be there no matter how complete and flexible your existing reports are. There are always new requirements that must be fulfilled in a timely manner and can't rely on the long cycle time needed for an OLTP/OLAP report. Actually, many reports start as ad-hoc reports, and once their scope and logic have stabilized they are moved to the reporting solution. Talking about requirements for new reports, it is worth mentioning that many users don’t know exactly what they want, what is possible to get, what information makes sense to show and at what level of detail in order to have a report that reflects the reality.

In theory a person is needed who facilitates the communication between users and the development team, especially when the work is outsourced. Such a person should have a deep understanding of the business, of the ERP system and of the reporting possibilities: the deeper the knowledge, the shorter the delivery cycle time. Maybe such a person could be dispensable if the users and developers have the required skill set and knowledge to define and interpret the requirements clearly, however I doubt that’s achievable on a large scale. On the other side, such responsibilities could be taken over by the IM or functional leaders that support the ERP system. It might work, at least in theory.

Data Mining   

Data Mining tools and models are supposed to leverage the value of an ERP system beyond the functionality provided by analytic reports, by helping to find hidden patterns and trends in data and to elaborate predictions and estimates. Here I limit myself to saying that DM makes sense only when the business has reached a certain maturity, and I’m considering here mainly the costs/value ratio (the expected benefits needing to be greater than the costs) and the effort required from the business side in pursuing such a project.

Self-Service BI   

There are situations in which the functionality provided by reporting tools doesn’t fulfill users’ requirements, one such situation being when users (aka data citizens) need to analyze data by themselves and to link data from different sources, especially Excel sheets. It’s true that vendors tried to address such requirements, though I don’t think the tools are mature enough, easy to use or allow users to go beyond their skills and knowledge.
 
Most such scenarios reduce to accessing various sources over ODBC or directly using Excel or MS Access, such solutions being more adequate for personal use. The negative side is that people come to misuse them, often ending up with a multitude of such solutions, which maybe would make more sense to have implemented as reports.

There are managers who believe that such tools would eliminate the need for ad-hoc reports. It might be possible in isolated cases, though don’t expect users to be a Bill Inmon or a Bill Gates!

Conclusion   

All the tools have their limitations, no matter how complex they are, and I believe that a single reporting tool or platform will not always address all requirements. Each such tool needs a support team and even a center of excellence, so make sure that you have the resources, knowledge and infrastructure to support them!

