02 December 2008

Business Intelligence: Issues

    BI projects are noble in intent, though many managers and technical consultants ignore their implications and prerequisites – data availability and quality, cooperation, maturity, infrastructure, adequate tools and knowledge.

Data availability and quality

    The problem with data starts at the source, usually the ERP and other information systems. In theory they should cover all the requirements existing in an enterprise, though it’s not always possible to reach that state of art. The remaining needs end up being covered by ad-hoc developed tools, often Excel and MS Access based solutions, whose data are difficult to integrate.

    Data quality is maybe the most ignored component in achieving an effective BI solution. The quality of data is based primarily on the validation implemented in the source systems and the mechanisms used to cleanse the data before being reported, and secondarily on the efficiency and effectiveness of the existing business processes and best practices.

    Data quality must be guaranteed for accurate decisions; if the quality is not proved, no wonder that users are reluctant to use the reports! Normally the data must be validated, and for that are needed users with adequate knowledge and enough data to back up the BI reports. I would advance here that each BI report needs corresponding supporting OLTP reports; a good synchronization between the two data sets could offer the base for an adequate validation.

    The quality of decisions is based on the degree to which data were understood and presented to the decision makers, though that’s not enough; a complete perspective is also needed, and maybe that’s why some professionals prefer to prepare and aggregate data by themselves, a process that allows them to get a deeper feeling of what’s happening (even if that’s still not enough).

Cooperation

    A BI initiative doesn’t depend only on the effort of a department (usually IT), but on the business as a whole. Unfortunately the so-called partnership is more a theoretical term than a fact, same as managers’ and users’ involvement.

    BI implementations are also dependent on consultants’ skills, the degree to which they understood the business’ requirements, the team’s cohesion and other project (management) related issues. Everything culminates with the knowledge transfer and training, which often happen to be missed for who knows what occult reason.

Adequate tools

    Most of the BI tools available on the market don’t satisfy all business/users’ requirements; even if they excel in some features, they lack in others. I would say that more than one BI tool is needed to cover most of the requirements. When features are not available or not handy enough, there is no wonder that users prefer the tools they already know and whose immediate benefit they can see.

    Another important consideration is that BI tools rely on a data model that is often inflexible from the point of view of the information it provides, but also of integrating additional data sets, algorithms and customizations. It’s not an accident that customers find the concept of data cloud more appealing – it’s a requirement of nowadays’ business dynamics.

Maturity

    Jumping over the complexity of CMM/CMMI and other such models, I would simplify and say that organizations lack the knowledge of transforming data into knowledge, understanding it and evolving it further into wisdom and competitive advantage. Most of the fancy words used by sales people to sell a product don’t become reality overnight; of course a BI tool might have the potential of fulfilling such goals, though it’s a long way until reaching that level.

Infrastructure

    Infrastructure refers here to human and technical components and the way they interact in getting the job done. Some of the issues were already covered in the previous subsections, while many others could be the subject of more elaborate work. I resume in saying that it’s not only about “breaking the habits” but about aligning people and technologies to the desired level of performance, of retaining and diffusing knowledge. Unfortunately we deal with too much philosophy and too little action.

11 November 2008

Data Management: Data Quality - Information Systems' Perspective

 

   One of the LinkedIn users brought to our attention the fact that “according to top managers in the IT business the top 2 reasons why CRM investments fails is:
1. Managing resistance within the organisation
2. Bad data quality” (B. Hellgren, 2008), and I think that’s valid not only for CRM or BI solutions but also for other information system solutions.

    Especially in ERP systems data quality continues to be a problem, and from my experience I can point out the following reasons:
- processes span across different functions, each of them maintaining the data they are interested in, the cases in which a full agreement is met between functions being rare – and here is also management’s fault;
- within an enterprise many systems arrive to be integrated, the quality of the data depending on the integrations’ stability and scope; if stability can be obtained in time, the scope resumes to the volume of data available, the delta needing to be maintained mainly manually;
- there are systems which are not integrated but use the same data, users needing to duplicate their effort, so they often focus on their immediate needs;
- lack of knowledge about the system used, the processes in place and best practices (each of them could be further detailed);
- basic or inexistent validation for data entry at each important entry point (user interface, integration interfaces, bulk upload), system permissiveness (allowing workarounds), stability and reliability (bugs/defects);
- inexistence of data quality control mechanisms or quality methodologies.

    Data quality is usually ignored in BI projects, and this because few are the ones who go and search for the causes, it being easier to blame the BI solution or the technical team. This is one of the reasons for which users are reluctant to use a BI solution, to which add up the solution’s flexibility and the degree to which the solution satisfies users’ needs. On the other side, BI solutions are often abused, including reports with OLTP characteristics or providing too much, unstructured or inadequate content that needs to be further reworked.

    I think that data quality arrives on managers’ dashboards especially during ERP implementations, when the destination system has some strong validation in place. And then there is no wonder when each record contains at minimum one defect and the whole dashboard is in nuances of red (red must be "en vogue").

08 November 2008

SQL Reloaded: Dealing with data duplicates on SQL Server

    Duplicates or duplications can be defined as “result rows that are identical to one or more other rows” (S. Hull, 1998). They occur frequently in data collections (e.g. Excel, Access, etc.) that have minimal or no validation – the more users, the higher the risk of having duplicated data. Duplicates occur also in modern architectures when validation wasn’t addressed correspondingly or because of usage related issues – for example in information systems, the huge amount of data entry makes it hard to spot documents already processed (e.g. purchase orders, invoices, payments, etc.).

    Subject to duplication are whole records, a group of attributes (fields) or only single attributes; it depends from case to case. Often duplicates are easy to identify – it’s enough to let somebody with the proper knowledge look over them. But what do you do when the volume of data is too large or when the process needs to be automated as much as possible? Using the DISTINCT keyword in a SELECT statement might do the trick, while other times more complicated validation is required, ranging from simple checks to data mining techniques.

    I will try to exemplify the techniques I use to deal with duplicates with the help of a self-constructed example – a table tracking information about Assets, having the following structure:

CREATE TABLE [dbo].[Assets](
[ID] [int] NOT NULL,
[CreationDate] [smalldatetime] NOT NULL,
[Vendor] [varchar](50) NULL,
[Asset] [varchar](50) NULL,
[Model] [varchar](50) NULL,
[Owner] [varchar](50) NULL,
[Tag] [varchar](50) NULL,
[Quantity] [decimal](13, 2) NULL
) ON [PRIMARY]

    Next step is to provide the test data:
INSERT INTO Assets
VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
INSERT INTO Assets
VALUES ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
INSERT INTO Assets
VALUES ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
INSERT INTO Assets
VALUES ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
INSERT INTO Assets
VALUES ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
INSERT INTO Assets
VALUES ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

    Now let’s check the table’s content:

SELECT ID, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM Assets

Output: [All records selection]

    Normally a Tag or a Serial Number should uniquely identify a product coming from the same Vendor, so in this case duplicates’ identification will resume to the records in which the Tag-Vendor combination appears more than once:

SELECT Vendor, Tag
FROM Assets A
GROUP BY Vendor, Tag
HAVING COUNT(*)>1

Output: [Vendor-Product duplications]

    In many cases showing only the duplicated values would be enough, however there are cases in which it is required to see the whole record in which the duplicates appeared, for example to clarify the context in which the issues appear. A simple JOIN with the base table would provide the affected records:

SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM Assets A
JOIN (
SELECT Vendor, Tag
FROM Assets A
GROUP BY Vendor, Tag
HAVING COUNT(*)>1
) B
ON A.Vendor = B.Vendor
AND A.Tag = B.Tag

Output: [selection of duplicates]

    There are two scenarios in which we need to remove the duplicates – in a result set or within an existing table.

In a result set it is normally enough to use the DISTINCT keyword to remove duplicated rows:

SELECT DISTINCT CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM Assets

Output: [selecting distinct records]

    In our example only some combinations are duplicated, while the other attributes might slightly differ, and therefore another approach is needed. First of all we need to identify which one is the most reliable record; in some cases the latest entered record should be the most accurate or closest to reality, but that’s not necessarily the truth. There are also cases in which we don’t care which record is selected, but from experience these cases are few.

    Oracle and SQL Server introduced the dense_rank() analytic function, which “returns the rank of rows within the partition of a result set, without any gaps in the ranking” (MSDN, 2008 a). In our case the partition is determined by Vendor and Tag, and what remains is to identify the logic used for ranking. Supposing that we are always interested in the last record entered, the query would look like this:

SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM (--subquery
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
, dense_rank() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC , Id DESC) RANKING
FROM Assets
) A
WHERE RANKING = 1

Output: [using dense_rank() function]

    Things are simple when the records’ creation date is available, because a descending order allows selecting the last record first, the Id being added to the clause for the cases in which multiple records have the same date. At least in this example the same output is obtained by using the row_number() analytic function, which “returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition” (MSDN, 2008 b). Actually the dense_rank function assigns the same rank for the same occurrence of the values appearing in the ORDER BY clause (within the same partition), and thus forces the developer to choose the exact attributes that make a record unique within a partition, while row_number() will return a sequence no matter of the attributes used in the ORDER BY clause. In this case I suppose that both functions should have the same performance; on the other side, when the selection of records within a partition is not important, it’s better to use the row_number() function, which needs to do less processing and thus provides better performance.
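
For comparison, here is the same selection rewritten with row_number() – a minimal sketch which, on this data set, returns the same output:

SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM (--subquery
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
, row_number() OVER(PARTITION BY Vendor, Tag ORDER BY CreationDate DESC, Id DESC) RANKING
FROM Assets
) A
WHERE RANKING = 1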

Unfortunately this technique doesn’t work in SQL Server 2000, so a different approach is required. In most of the cases the unique identifier for a record is a running number, the highest Id corresponding to the latest entered record. This allows selecting the latest entered record by using the MAX function:

SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM Assets A
JOIN (
SELECT Vendor, Tag, MAX(Id) MaxId
FROM Assets A
GROUP BY Vendor, Tag
) B
ON A.Vendor = B.Vendor
AND A.Tag = B.Tag
AND A.ID = B.MaxId
Output: [SQL Server 2000 technique]

    The same technique can be used to delete the duplicates from a table:
DELETE Assets
FROM (
SELECT Vendor, Tag, MAX(Id) MaxId
FROM Assets A
GROUP BY Vendor, Tag
) B
WHERE Assets.Vendor = B.Vendor
AND Assets.Tag = B.Tag
AND Assets.ID <> B.MaxId
    When an Id is not available, then most probably we have to create a temporary table with a sequence (numeric unique identifier), sort the records based on the criteria chosen for selection, and then apply the technique based on the Id as above.
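
Here is a minimal sketch of this approach, keeping only a few of the attributes; in SQL Server the sequence can be generated with an IDENTITY attribute, the identity values being assigned according to the ORDER BY of the INSERT (here the latest record per Vendor-Tag combination gets the smallest RowId):

CREATE TABLE #AssetsRanked (
[RowId] [int] IDENTITY(1,1) NOT NULL,
[CreationDate] [smalldatetime] NOT NULL,
[Vendor] [varchar](50) NULL,
[Tag] [varchar](50) NULL
)

INSERT INTO #AssetsRanked (CreationDate, Vendor, Tag)
SELECT CreationDate, Vendor, Tag
FROM Assets
ORDER BY CreationDate DESC

--the Id-based technique applied on RowId (MIN instead of MAX because of the descending sort)
SELECT A.RowId, A.CreationDate, A.Vendor, A.Tag
FROM #AssetsRanked A
JOIN (
SELECT Vendor, Tag, MIN(RowId) MinRowId
FROM #AssetsRanked
GROUP BY Vendor, Tag
) B
ON A.Vendor = B.Vendor
AND A.Tag = B.Tag
AND A.RowId = B.MinRowId
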
Notes:
1. In other scenarios it’s important to select all the records matching extreme values (first, last), the dense_rank function becoming handy; however, for versions that don’t support it, a creation date attribute saves the day, when available:
SELECT A.Id, A.CreationDate, A.Vendor, A.Asset, A.Model, A.Owner, A.Tag, A.Quantity
FROM Assets A
JOIN (
SELECT Vendor, Tag, MAX(CreationDate) LastCreationDate
FROM Assets A
GROUP BY Vendor, Tag
) B
ON A.Vendor = B.Vendor
AND A.Tag = B.Tag
AND DateDiff(d, A.CreationDate, B.LastCreationDate)=0

Output: [record matching extreme values]

2. Duplicated records can be created by merging two sets of data with UNION ALL; in such cases a simple UNION would remove the duplicates introduced by the queries coming after the first one, however this burdens the database with more checks. It is recommended to find other solutions, when possible!
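
As a quick illustration, supposing the Assets data were split over two hypothetical tables with the same structure:

--UNION ALL keeps the duplicates coming from both tables
SELECT Vendor, Tag FROM AssetsCurrent
UNION ALL
SELECT Vendor, Tag FROM AssetsArchive

--UNION removes them, at the cost of an additional implicit duplicate-removal step
SELECT Vendor, Tag FROM AssetsCurrent
UNION
SELECT Vendor, Tag FROM AssetsArchive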

3. Instead of using a single multi-row insertion I used multiple insertion statements because I preferred to make the tutorial portable also on SQL Server 2000. Here’s the single multi-row insertion statement:

INSERT INTO Assets VALUES ('1', DATEADD(d,-5, GetDate()), 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')
, ('2', DATEADD(d,-4, GetDate()),'IBM','Laptop 2','Model 2','Owner 2','XX0002','1')
, ('3', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('4', DATEADD(d,-3, GetDate()),'Microsoft','Laptop 3','Model 3','Owner 2','WX0001','1')
, ('5', DATEADD(d,-3, GetDate()),'Dell','Laptop 4','Model 4','Owner 3','DD0001','1')
, ('6', DATEADD(d,-1, GetDate()),'Dell','Laptop 4','Model 4','Owner 4','DD0001','1')

4. The above techniques should work also in Oracle with two amendments: the attributes’ types must be adapted to the Oracle ones, while instead of the SQL Server GetDate() function the corresponding Oracle SYSDATE function should be used, as below:
INSERT INTO Assets
VALUES ('1', SYSDATE-1, 'IBM','Laptop 1','Model 1','Owner 1','XX0001','1')

References:
S. Hull, 1998. Business Objects Glossary. [Online] Available from: http://planning.ucsc.edu/IRPS/dwh/BOBGLOSS.HTM (Accessed: 7 November 2008)
MSDN, 2008 a. SQL Server 2008 Books Online, dense_rank(). [Online] Available from: http://msdn.microsoft.com/en-us/library/ms173825.aspx (Accessed: 7 November 2008)
MSDN, 2008 b. SQL Server 2008 Books Online, row_number(). [Online] Available from: http://msdn.microsoft.com/en-us/library/ms186734.aspx (Accessed: 7 November 2008)

ERP Systems: Learning about Oracle APPS internals I

How can I learn more about Oracle APPS (Oracle Applications) and from where? A normal question for every beginner, but also for experienced professionals.

Oracle made documentation about their products available through the Oracle Technology Network and Metalink. The first source contains documents mainly as PDF files, while Metalink provides richer content and is easier to use; however, in order to access it, your company has to purchase an Oracle Support Identifier.

In Metalink, Oracle Applications’ documentation is grouped under the eTRM (Electronic Technical Reference Manuals) section, while the PDF documents can be found under the Oracle 11i Documentation Library; many of them, especially for older versions, can be found also on the web, revealed with a simple search on a table’s or file’s name.

Both sources are by far incomplete – there are many gaps, not to forget that many Oracle implementations involve also some customization, information about such changes being found maybe in the documentation made during the implementation/customization process.

Lately many blogs on Oracle Applications internals have appeared, and even if many of them resume to copying some material from Metalink or other documents, there are also professionals who respect themselves.

People can learn a lot by checking the objects that unveil the APPS internals: APPS.FND_TABLES provides the list of tables used, while APPS.FND_VIEWS provides the list of views. The problem with the latter is that no search can be done on the field that stores the views’ script, though the data can be exported to a text file and searched in there (it won’t work to export the data completely to Excel). In time developers arrive to intuit how the views could be named, so a search on the views’ name could help narrowing down the research.
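
For example, two simple lookups – the column names are from memory and might differ between versions:

-- tables whose name matches a pattern
SELECT TABLE_NAME
FROM APPS.FND_TABLES
WHERE TABLE_NAME LIKE 'PA_PROJECTS%';

-- views matching a naming pattern (the TEXT attribute stores the view's script
-- as a LONG, therefore it can't be searched directly with LIKE)
SELECT VIEW_NAME
FROM APPS.FND_VIEWS
WHERE VIEW_NAME LIKE 'PA_%';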

Other professionals might be willing to help, so often it’s a good idea to post questions on blogs, forums or social networks for professionals. Not all the questions get answered, so rather than waiting for indirect enlightenment, it’s better to do some research in parallel too.

There will be cases in which none of the specified sources will help you; most probably you’ll have to reengineer Oracle Applications’ internals by studying various business scenarios, and in this case the experienced users could help a lot.

Business Intelligence: Enterprise Reporting


    Supposing that your company invested a lot of money in an ERP system, and moreover, around the nice expensive setup it needed to customize much of the existing functionality. You end up with an ERP system which provides more or less the functionality you needed, though in order to increase its value and make accurate decisions you need some reports out of it. What do you do then?

    From my point of view there are 4 types of reporting needs:
- OLTP (On-Line Transaction Processing) reports providing actual (live) data;
- OLAP (On-Line Analytical Processing) reports with drill-down, roll-up, slice and dice or pivoting functionality, working with historical data, the data source(s) being refreshed periodically;
- ad-hoc reports – reports provided on request, often satisfying one-time or sporadic needs;
- Data Mining tool(s) focusing on knowledge discovery.
    Actually I would add a 5th requirement, related to direct data access and analysis.


    ERP systems like Oracle Applications or SAP come by default with a set of (predefined) standard reports, which in theory satisfy the prerequisites of an OLTP system. Unfortunately the standard reports satisfy only basic needs and are not as flexible as expected – it could happen that they can be exported only to text, moreover in a non-tabular format, and therefore impossible to reuse for detailed analysis. Another problem could be caused by inadequate filtering constraints, behavior or scope. If existing functionality has been customized, most probably the existing reports need to be adapted to the new logic. In the end you realize that you either change the existing reports or adopt an OLTP solution.
    The bad news doesn’t stop here: as the vendors try to keep the secrecy around their solutions, the information about the ERP’s internals is limited, good developers are hard to find or really expensive, and they often need to reinvent the wheel. It’s true that Oracle provided through Metalink a nice set of documentation about Oracle Applications’ internals, and even if they recently updated the web site’s look and feel, there are still many gaps concerning tables’ structure and functionality. Fortunately, armed with enough patience and some knowledge about the existing business processes and databases, a developer can reengineer an important part of the logic. Other good news is the increasing number of postings focusing on ERP solutions, however few are the sources that bring something new.

    OLAP solutions presume the existence of a data warehouse that reflects the business model, and when intelligently built it can satisfy an important percentage of the business intelligence requirements. Building a data warehouse or a set of data marts is an expensive and time consuming endeavor and rarely arrives to satisfy everybody’s needs. There are also vendors that provide already made models, and at first view they look like an important deal; however, excepting the fact that the vendor retains the ownership of the model, I bet that such models have many gaps. Again you end up customizing and extending the model, running into all kind of issues involving the model’s design, flexibility, quality, resources and costs.

    There are many ways of making things go wrong, one of the interesting situations being when an OLAP system is used to satisfy OLTP reporting needs. It’s like using a city car in a cross-country race – you might make it to compete or even end the race, if you are lucky enough, but don’t expect to make a success out of it!

    The need for ad-hoc reports will be there no matter how complete and flexible your existing reports are. There are always new requirements that must be fulfilled in useful time and cannot rely on the long cycle time needed for an OLTP/OLAP report. Actually many of the reports start as ad-hoc reports and, once their scope and logic stabilized, they are moved to the reporting solution. Talking about new report requirements, it’s worth mentioning that many of the users don’t know exactly what they want, what is possible to get, and what information it makes sense to show and at what level of detail in order to have a report that reflects the reality. In theory a person is needed to facilitate the communication between users and the development team, especially when the work is outsourced. Such a person should have in theory a deep understanding of the business, of the ERP system and of the reporting possibilities – the deeper the knowledge, the shorter the delivery cycle time. Maybe such a person could be dispensable if the users and developers have the required skill set and knowledge to define and interpret the requirements clearly, however I doubt that’s achievable on a large scale. On the other side such attributions could be taken over by the IM or functional leaders that support the ERP system – it might work, at least in theory.

    Data Mining tools and models are supposed to leverage the value of an ERP system beyond the functionality provided by analytic reports, by helping to find hidden patterns and trends in data, to elaborate predictions and estimates. Here I resume to saying that DM makes sense only when the business reached a certain maturity, and I’m considering here mainly the costs/value ratio (the expected benefits needing to be greater than the costs) and the effort required from the business side in pursuing such a project.

    There are situations in which the functionality provided by reporting tools doesn’t fulfill the users’ requirements, one of them being when users need to analyze data by themselves, to link data from different sources, especially Excel sheets. It’s true that vendors tried to address such requirements, though I don’t think the tools are mature enough, easy to use, or allow users to go beyond their skills and knowledge.
    Most of such scenarios resume to accessing various sources over ODBC or directly using Excel or MS Access, such solutions being adequate more for personal use. The negative side is that people arrive to misuse them, often ending up with a multitude of such solutions which maybe would make more sense implemented as reports.
    There are managers who believe that such tools would eliminate the need for ad-hoc reports; it might be possible in isolated cases, though don’t expect every user to be a Bill Inmon or a Bill Gates!

    All tools have their limitations, no matter how complex they are, and I believe that a single tool doesn’t always address all requirements. Each of such tools needs a support team and even a center of excellence, so assure yourself that you have the resources, knowledge and infrastructure to support them!

29 October 2008

W3: Resource Description Framework (Definitions)

"A framework for constructing logical languages that can work together in the Semantic Web. A way of using XML for data rather than just documents." (Craig F Smith & H Peter Alesso, "Thinking on the Web: Berners-Lee, Gödel and Turing", 2008)

"An application of XML that enables the creation of rich, structured, machinereadable resource descriptions." (J P Getty Trust, "Introduction to Metadata" 2nd Ed., 2008)

"An example of ‘metadata’ language (metadata = data about data) used to describe generic ‘things’ (‘resources’, according to the RDF jargon) on the Web. An RDF document is a list of statements under the form of triples having the classical format: <object, property, value>, where the elements of the triples can be URIs (Universal Resource Identifiers), literals (mainly, free text) and variables. RDF statements are normally written into XML format (the so-called ‘RDF/XML syntax’)." (Gian P Zarri, "RDF and OWL for Knowledge Management", 2011)

"The basic technique for expressing knowledge on The Semantic Web." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A graph model for describing formal Web resources and their metadata, to enable automatic processing of such descriptions." (Mahdi Gueffaz, "ScaleSem Approach to Check and to Query Semantic Graphs", 2015)

"Specified by W3C, is a conceptual data modeling framework. It is used to specify content over the World Wide Web, most commonly used by Semantic Web." (T R Gopalakrishnan Nair, "Intelligent Knowledge Systems", 2015)

"Resource Description Framework (RDF) is a framework for expressing information about resources. Resources can be anything, including documents, people, physical objects, and abstract concepts." (Fu Zhang & Haitao Cheng, "A Review of Answering Queries over Ontologies Based on Databases", 2016)

"Resource Description Framework (RDF) is a W3C (World Wide Web Consortium) recommendation which provides a generic mechanism for representing information about resources on the Web." (Hairong Wang et al, "Fuzzy Querying of RDF with Bipolar Preference Conditions", 2016)

"Resource Description Framework (RDF) is a W3C recommendation that provides a generic mechanism for giving machine readable semantics to resources. Resources can be anything we want to talk about on the Web, e.g., a single Web page, a person, a query, and so on." (Jingwei Cheng et al, "RDF Storage and Querying: A Literature Review", 2016)

"The Resource Description Framework (RDF) metamodel is a directed graph, so it identifies one node (the one from which the edge is pointing) as the subject of the triple, and the other node (the one to which the edge is pointing) as its object. The edge is referred to as the predicate of the triple." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed., 2016)

"Resource description framework (RDF) is a family of world wide web consortium (W3C) specifications originally designed as a metadata data model." (Senthil K Narayanasamy & Dinakaran Muruganantham, "Effective Entity Linking and Disambiguation Algorithms for User-Generated Content (UGC)", 2018)

"A framework for representing information on the web." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"Resource description framework (RDF) is a W3C (World Wide Web Consortium) recommendation which provides a generic mechanism for representing information about resources on the web." (Zongmin Ma & Li Yan, "Towards Massive RDF Storage in NoSQL Databases: A Survey", 2019)

"It is a language that allows to represent knowledge using triplets of the subject-predicate-object type." (Antonio Sarasa-Cabezuelo & José Luis Fernández-Vindel, "A Model for the Creation of Academic Activities Based on Visits", 2020)

"The RDF is a standard for representing knowledge on the web. It is primarily designed for building the semantic web and has been widely adopted in database and datamining communities. RDF models a fact as a triple which consists of a subject (s), a predicate (p), and an object (o)." (Kamalendu Pal, "Ontology-Assisted Enterprise Information Systems Integration in Manufacturing Supply Chain", 2020)

"It is a language that allows to represent knowledge using triplets of the subject-predicate-object type." (Antonio Sarasa-Cabezuelo, "Creation of Value-Added Services by Retrieving Information From Linked and Open Data Portals", 2021)

"Resource Description Framework, the native way of describing linked data. RDF is not exactly a data format; rather, there are a few equivalent formats in which RDF can be expressed, including an XML-based format. RDF data takes the form of ‘triples’ (each atomic piece of data has three parts, namely a subject, predicate and object), and can be stored in a specialised database called a triple store." ("Open Data Handbook")

26 October 2008

GSCM: Kanban (Definitions)

"In lean cellular manufacturing, a visual device, such as a card, floor space (kanban square), or production bin, which communicates to a cell that additional materials or products are demanded from the subsequent cell." (Leslie G Eldenburg & Susan K Wolcott, "Cost Management" 2nd Ed., 2011)

"A card-based techniques for authorizing the replenishment of materials." (Daryl Powell, "Integration of MRP Logic and Kanban Shopfloor Control", 2014)

"A just-in-time technique that uses kanban cards to indicate when a production station needs more parts. When a station is out of parts (or is running low), a kanban card is sent to a supply station to request more parts." (Rod Stephens, "Beginning Software Engineering", 2015)

"A note, card, or signal, a Kanban used to trigger a series of processes, usually downstream in the supply chain, in order complete tasks, products, and/or services. As part of a workflow management systems, timely Kanbans allow for efficient operations that enable agile, just-in-time (JIT), and lean philosophies to work." (Alan D Smith, "Lean Principles and Optimizing Flow: Interdisciplinary Case Studies of Best Business Practices", 2019)

"Agile method to manage work by limiting work in progress. Team members pull work as capacity permits, rather than work being pushed into the process when requested. Stimulates continuous, incremental changes. Aims at facilitating change by minimizing resistance to it." (Jurgen Janssens, "Managing Customer Journeys in a Nimble Way for Industry 4.0", 2019)

"This tool is used in pull systems as a signaling device to trigger action. Traditionally it used cards to signal the need for an item. It can trigger the movement, production, or supply of a unit in a production chain." (Parminder Singh Kang et al, "Continuous Improvement Philosophy in Higher Education", 2020)

"A signal that communicates a requirement for a quantity of product." (Microsoft, "Dynamics for Finance and Operations Glossary")

"A signaling device that gives instruction for production or conveyance of items in a pull system. Can also be used to perform kaizen by reducing the number of kanban in circulation, which highlights line problems." (Lean Enterprise Institute)

25 October 2008

GSCM: Supply Chain Management (Definitions)

"The practice of designing and optimizing supply chain business processes to provide superior service to those customers who drive the bulk of one’s profit." (Steve Williams & Nancy Williams, "The Profit Impact of Business Intelligence", 2007)

"The management of business units in the provision of products and services. It spans the movement and storage of raw materials, work-in-process inventory, and finished goods from point-of-origin to point-of-consumption." (Tony Fisher, "The Data Asset", 2009)

"Software tools or modules used in the planning, scheduling, and control of supply chain transactions (spanning raw materials to finished goods from point of origin to point of consumption), managing supplier relationships, and controlling associated business processes." (Janice M Roehl-Anderson, "IT Best Practices for Financial Managers", 2010)

"To provision products or services to a network of interconnected businesses." (Martin Oberhofer et al, "The Art of Enterprise Information Architecture", 2010)

"The management of all of the activities along the supply chain, from suppliers, to internal logistics within a company, to distribution, to customers. This includes ordering, monitoring, and billing." (Linda Volonino & Efraim Turban, "Information Technology for Management 8th Ed", 2011)

"The process of ensuring optimal flow of inputs and outputs." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"In basic terms, supply chain is the system of organizations, people, activities, information and resources involved in moving a product or service from supplier to customer. The configuration and management of supply chain operations is a key way companies obtain and maintain a competitive advantage." (Alan D Smith, "Lean Principles and Optimizing Flow: Interdisciplinary Case Studies of Best Business Practices", 2019)

"Supply chain management (SCM) refers to the processes of creating and fulfilling demands for goods and services. It encompasses a trading partner community engaged in the common goal of satisfying end customers." (Gartner)

24 October 2008

GSCM: Supply Chain (Definitions)

"Fulfillment process from customer purchase through manufacturing, factory, raw material, and component supplier." (Timothy J  Kloppenborg et al, "Project Leadership", 2003)

"The network of suppliers that provides raw materials, components, subassemblies, subsystems, software, or complete systems to your company." (Clyde M Creveling, "Six Sigma for Technical Processes: An Overview for R Executives, Technical Leaders, and Engineering Managers", 2006)

"The supply chain refers to the processes and methods supporting the physical existence of a product from the procurement of materials through the production, storage (creating inventory), and movement (logistics) of the product into its chosen distribution channels." (Steven Haines, "The Product Manager's Desk Reference", 2008)

"A pipeline composed of multiple companies that perform any of the following functions: procurement of materials, transformation of materials into intermediate or finished products, distribution of finished products to retailers or customers, recycling or disposal in a landfill." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed, 2011)

"Flow of resources from the initial suppliers (internal or external) through the delivery of goods and services to customers and clients. (510, 646)" (Leslie G Eldenburg & Susan K Wolcott, "Cost Management" 2nd Ed, 2011)

"The optimal flow of product from site of production through intermediate locations to the site of final use." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The people and processes involved in the production and distribution of goods or services. " (DK, "The Business Book", 2014)

"The channel of distribution that enables products to be delivered from the supplier to the final buyer."(Gökçe Ç Ceyhun, "An Assessment for Classification of Distribution Network Design", 2020)

"A system of organizations, people, activities, information, and resources, possibly international in scope, that provides products or services to consumers." (CNSSI 4009-2015)

"Linked set of resources and processes between multiple tiers of developers that begins with the sourcing of products and services and extends through the design, development, manufacturing, processing, handling, and delivery of products and services to the acquirer." (NIST SP 800-37)

"The network of retailers, distributors, transporters, storage facilities, and suppliers that participate in the sale, delivery, and production of a particular product." (NIST SP 800-98)

28 September 2008

W3: Semantic Web (Definitions)

"The Web of data with meaning in the sense that a computer program can learn enough about what the data  means to process it." (Tim Berners-Lee, "Weaving the Web", 1999)

"An evolving, collaborative effort led by the W3C whose goal is to provide a common framework that will allow data to be shared and re-used across various applications as well as across enterprise and community boundaries." (J P Getty Trust, "Introduction to Metadata" 2nd Ed, 2008)

"Communication protocols and standards that would include descriptions of the item on the Web such as people, documents, events, products, and organizations, as well as, relationship between documents and relationships between people." (Craig F Smith & H Peter Alesso, "Thinking on the Web: Berners-Lee, Gödel and Turing", 2008)

"The Web of data with meaning in the sense that a computer program can learn enough about what the data means to process it. The principle that one should represent separately the essence of a document and the style is presented." (Craig F Smith & H Peter Alesso, "Thinking on the Web: Berners-Lee, Gödel and Turing", 2008)

"A machine-processable web of smart data, [where] smart data is data that is application-independent, composeable, classified, and part of a larger information ecosystem (ontology)." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"An evolving extension of the Web in which Web content can be expressed not only in natural language but also in a form that can be understood, interpreted, and used by intelligent computer software agents, permitting them to find, share, and integrate information more easily." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

"The next-generation Internet in which all content is tagged with semantic tags defined in published ontologies. Interlinking these ontologies will allow software agents to reason about information not directly connected by document creators." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"is a term coined by World Wide Web Consortium (W3C) director Sir Tim Berners-Lee. It describes methods and technologies to allow machines to understand the meaning - or 'semantics'- of information on the World Wide Web." (Jingwei Cheng et al, "RDF Storage and Querying: A Literature Review", 2016)

"The vision of a Semantic Web world builds upon the web world, but adds some further prescriptions and constraints for how to structure descriptions. The Semantic Web world unifies the concept of a resource as it has been developed in this book, with the web notion of a resource as anything with a URI. On the Semantic Web, anything being described must have a URI. Furthermore, the descriptions must be structured as graphs, adhering to the RDF metamodel and relating resources to one another via their URIs. Advocates of Linked Data further prescribe that those descriptions must be made available as representations transferred over HTTP." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed., 2016)

"A collaborative effort to enable the publishing of semantic machine-readable and shareable data on the Web." (Panos Alexopoulos, "Semantic Modeling for Data", 2020)

16 September 2008

W3: Cyberspace (Definitions)

"A term used to describe the nonphysical, virtual world of computers." (Andy Walker, "Absolute Beginner’s Guide To: Security, Spam, Spyware & Viruses", 2005)

"A metaphoric abstraction for a virtual reality existing inside computers and on computer networks." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The online world of computer networks where people can interact with others without physically being with them. People commonly interact with cyberspace via the Internet." (Darril Gibson, "Effective Help Desk Specialist Skills", 2014)

"The interdependent network of information technology infrastructures, which includes the Internet, telecommunications networks, computer systems, and embedded processors and controllers." (Olivera Injac & Ramo Šendelj, "National Security Policy and Strategy and Cyber Security Risks", 2016)

"A complex hyper-dimensional space involving the state of many mutually dependent computer and network systems with complex and often surprising properties as compared to physical space." (O Sami Saydjari, "Engineering Trustworthy Systems: Get Cybersecurity Design Right the First Time", 2018)

"Artifacts based on or dependent on computer and communications technology; the information that these artifacts use, store, handle, or process; and the interconnections among these various elements." (William Stallings, "Effective Cybersecurity: A Guide to Using Best Practices and Standards", 2018)

"Refers to a physical and non-physical terrain created by and/or composed of some or all of the following: computers, computer systems, networks, and their computer programs, computer data, content data, traffic data, and users." (Thokozani I Nzimakwe, "Government's Dynamic Approach to Addressing Challenges of Cybersecurity in South Africa", 2018)

"Cyberspace, is supposedly 'virtual' world/network created by links between computers, Internet-enabled devices, servers, routers, and other components of the Internet’s infrastructure." (Sanjeev Rao et al, "Online Social Networks Misuse, Cyber Crimes, and Counter Mechanisms", 2021)

31 August 2008

SQL Server New Features: ROWCOUNT in action

 Especially when working with big tables, the default behaviour of Query Analyzer is to not show any output until the last record has been fetched. This can be time and resource consuming, and therefore I’ve appreciated the fact that TOAD and SQL Developer fetch only a certain number of records. The same can be done starting with SQL Server 2005 onward by modifying the ROWCOUNT server property using the Query/Query Options menu functionality:

  Query Options under SQL Server 2008

or by running the command:

SET ROWCOUNT <number of records>;

Of course somebody may limit the number of records returned by a query by using the TOP clause in SQL Server, respectively ROWNUM in Oracle, though I find that not always handy – it depends from case to case. There are also technical implications between the two types of usage: according to SQL Server Books Online it is recommended to use TOP with SELECT over ROWCOUNT with regard to scope and query optimization, however in this context only the latter makes sense:
"As a part a SELECT statement, the query optimizer can use the value of expression in the TOP clause as part of generating an execution plan for a query. Because SET ROWCOUNT is used outside a statement that executes a query, its value cannot be used to generate a query plan for a query."
Notes:
1. Do not confuse ROWCOUNT with the @@ROWCOUNT function, which returns the number of rows affected by the last statement.
2. Some of us list all the records just to see the number of records returned by a query, though that’s totally not advisable!

AdventureWorks requires FILESTREAM enabled

 Surprises, surprises, surprises – programmers’ world is full of them! When you say that everything is ok, you just discover that something went wrong. I was expecting to have the AdventureWorks database installed, though I hadn’t checked that. I realized today that it’s missing, so I tried to reinstall it, enabling this time the “Restore AdventureWorks DBs” feature, though I got another nice error:

  Setup failed for MSSQLSERVER. The following features are missing: FILESTREAM
  Fix the problems and re-run setup.

 Guy Burstein, in his blog, wrote that the FILESTREAM support can be enabled using the following SQL command: exec [dbo.sp_filestream_configure] @enable_level = 3; I tried that and another error came in:

  Msg 2812, Level 16, State 62, Line 1
  Could not find stored procedure 'sp_filestream_configure'

 Checking my local installation of SQL Server Books Online, I found no trace of the sp_filestream_configure stored procedure, but I found that I can enable the FILESTREAM support using the sp_configure stored procedure as below:

EXEC sp_configure 'filestream access level', 2
RECONFIGURE
GO

 Once I executed the 3 lines together, I got the following confirmation message which, amusingly, still recommends running the RECONFIGURE statement even if I just did that. Anyway, better redundant information than nothing…

  Configuration option 'filestream access level' changed from 2 to 2. Run the RECONFIGURE statement to install.

30 August 2008

Oracle Troubleshooting: No records returned by queries (Checklist)

No records returned by a query even if there should be results? Usually I’m using the following checklist:
1. Check if the tables contain data. Silly but effective, especially in Oracle APPS, in which some tables got deprecated and replaced by tables with similar names (PA_PROJECTS_ALL vs. PA_PROJECTS), though that could happen in other environments too;
2. Check if the JOIN syntax is correct;
3. Check if one of the columns used in a JOIN has only NULL values;
4. Check if the constraints used in the WHERE clause make sense (e.g. wrong values or syntax);
5. For Oracle flavored queries, check if in the WHERE clause there is a column not referenced with the table name or alias, while the column is available in more than one of the tables used in the query. This Oracle bug is really dangerous when doing fast query checks!
6. For Oracle (APPS), check whether the query or view uses the USERENV function with the LANG or LANGUAGE text parameter, normally in a constraint like: TABLE1.LANGUAGE = USERENV(‘LANG’)
The problem with such queries comes when the user’s system language is other than the one expected, and thus the query’s output might not be as expected. Usually it is preferable to hardcode the value, when possible: TABLE1.LANGUAGE = ‘US’

Note: Actually, also the tools used to run a query could create issues; for example a query run under Oracle’s SQL Developer was not returning records even if in TOAD it did. The problem was solved with the installation of a newer version of SQL Developer.
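
Coming back to the checklist, points 1 and 3 can be quickly verified with simple counts – PA_PROJECTS_ALL and PROJECT_ID serving only as examples:

-- 1. does the table contain data?
SELECT COUNT(*)
FROM PA_PROJECTS_ALL;

-- 3. has the column used in the JOIN only NULL values?
-- (COUNT over a column counts only the non-NULL values)
SELECT COUNT(PROJECT_ID) NonNullValues
FROM PA_PROJECTS_ALL;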

Oracle Troubleshooting: ANSI 92 JOIN syntax error

 Lately I’ve been working a lot with Oracle APPS, doing mainly ad-hoc reporting. One of my nightmares is an Oracle bug related to ANSI 92 syntax:

  “ORA-01445: cannot select ROWID from, or sample, a join without a key-preserved table”

 Unfortunately, even if the bug was solved by Oracle, it seems the update was missed on some servers and the bug haunts my queries almost on a daily basis. 

 Having a SQL Server background, and for code clarity, I prefer the ANSI 92 JOIN syntax:

SELECT A.column1, B.column2
FROM table1 A JOIN table2 B
 ON A.column1 = B.column2 

instead of using the old-fashioned writing:

SELECT A.column1, B.column2
FROM table1 A , table2 B
WHERE A.column1 = B.column2

 In theory the two queries should provide the same output and have, hopefully, similar performance. The problem with ANSI 92 syntax is that, on some Oracle installations, when the number of joins exceeds a certain limit, usually greater than 7, the above error is thrown.

What one can do is to reduce the number of joins to the main table by restructuring the query, grouping multiple tables into subqueries which are then joined to the main table. For the tables from which only one column is returned, the table can be moved into the SELECT statement as a subquery.
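
A minimal sketch of both workarounds, reusing the generic names from above:

--grouping related tables into a subquery joined to the main table
SELECT A.column1, B.column2
FROM table1 A
JOIN (
SELECT table2.column1, table3.column2
FROM table2
JOIN table3
ON table2.column2 = table3.column1
) B
ON A.column1 = B.column1

--moving a single-column table into the SELECT statement
--(the MAX ensures a single value is returned)
SELECT A.column1
, (SELECT MAX(B.column2)
FROM table2 B
WHERE A.column1 = B.column1) column2
FROM table1 A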

Happy coding! 

SQL Server Troubleshooting: Adventure Works installation error on Vista

 I tried to install the Adventure Works OLTP & DW databases on SQL Server 2008 RTM from CodePlex, though I got an error:

  “The installer has encountered an unexpected error installing this package. This may indicate a problem with this package. The error code is 2738.”

Initially I thought that the problem was caused by existing Adventure Works installations made on a previous CTP version of SQL Server 2008, which I forgot to uninstall when I uninstalled the CTP version. Totally wrong! Doing a little research, I found first a post on the CodePlex Discussion forum mentioning that the problem could be caused by the VBScript runtime because, as the Toms’Tricks blog highlights, VBScript and JScript are not registered on Windows Vista. Wonderful! I just ran the regsvr32 vbscript.dll command and it worked! Another situation in which regsvr32 saved the day!

 I wonder why I didn’t get the same problem when I previously installed the Adventure Works database on the CTP version! Could it be because of the Windows Vista SP1 changes (I installed Windows Vista SP1 after the SQL Server CTP)?

SQL Server Troubleshooting: Microsoft SQL Server 2008 installation error

 This week I tried to install SQL Server 2008 however I got the following error: 
  “A previous release of Microsoft Visual Studio 2008 is installed on this computer. Upgrade Microsoft Visual Studio 2008 to the SP1 before installing SQL Server 2008.” 

 It’s true that I have previously installed Visual Studio 2008, though once I did that I checked for updates and thus installed SP1 too. I did a quick search on Google and the first results pointed me to an article on the Microsoft Help and Support website: Visual Studio 2008 SP1 may be required for SQL Server 2008 installations. It didn’t make sense; in the end I’ve installed the server but skipped the installation of the following components: 
 • Management Tools (Basic or Complete) 
 • Integration Services 
 • Business Intelligence Development Studio 

 The server was installed without problems, so I tried to install the remaining components, getting the same error as above. I had to stop at that point, and today, giving more thought to the problem, I realized that the error could be caused by the Microsoft Visual Studio 2008 Express edition, which I had installed a few months back. Instead of uninstalling Microsoft Visual Studio 2008 it looked easier to uninstall the Express version, and once I did that, I managed to install the remaining components. Actually I checked before whether there is a SP1 for Microsoft Visual Studio 2008 Express and arrived at the Microsoft Visual Studio 2008 Express Editions with SP1 page, though I remembered that I would have to install Web Developer, Visual Basic and C# 2008 separately, and in the end I presumed that maybe it would be easier to uninstall the existing versions and then try to install the SQL Server remaining components. I haven’t tried to install the Express editions with SP1 as now I have the Professional edition.

04 August 2008

Application Architecture: Enterprise Service Bus (Definitions)

"A layer of middleware that enables the delivery and sharing of services across and between business applications. ESBs are typically used to support communication, connections, and mediation in a service-oriented architecture." (Evan Levy & Jill Dyché, "Customer Data Integration", 2006)

"The infrastructure of a SOA landscape that enables the interoperability of services. Its core task is to provide connectivity, data transformations, and (intelligent) routing so that systems can communicate via services. The ESB might provide additional abilities that deal with security, reliability, service management, and even process composition. However, there are different opinions as to whether a tool to compose services is a part of an ESB or just an additional platform to implement composed and process services outside the ESB." (Nicolai M Josuttis, "SOA in Practice", 2007)

"A middleware software architecture construct that provides foundational services for more complex architectures via an event-driven and standards-based messaging engine (the bus). An ESB generally provides an abstraction layer on top of an implementation of an enterprise messaging system. |" (Alex Berson & Lawrence Dubov, "Master Data Management and Data Governance", 2010)

"The infrastructure of an SOA landscape that enables the interoperability of services. Its core task is to provide connectivity, data transformations, and routing so that systems can communicate via services." (David Lyle & John G Schmidt, "Lean Integration", 2010)

"A software layer that provides data between services on an event-driven basis, using standards for data transmission between the services." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A packaged set of middleware services that are used to communicate between business services in a secure and predictable manner." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

22 February 2008

Business Applications: Customer Relationship Management (Definitions)

"Operational and analytic processes that focus on better understanding and servicing customers in order to maximize mutually beneficial relationships with each customer." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"a popular DSS application designed to streamline customer and/or corporate relationships." (William H Inmon, "Building the Data Warehouse", 2005)

"A database system containing information on interactions with customers." (Glenn J Myatt, "Making Sense of Data", 2006)

"The infrastructure that enables the delineation of and increase in customer value, and the correct means by which to increase customer value and motivate valuable customers to remain loyal, indeed, to buy again." (Jill Dyché & Evan Levy, Customer Data Integration, 2006)

"The tracking and management of all the organization’s interactions with its customers in order to provide better service, encourage customer loyalty, and increase the organization’s long-term profit per customer." (Steve Williams & Nancy Williams, "The Profit Impact of Business Intelligence", 2007)

"A strategy devoted to the development and management of close relationships between customers and the company. In many cases, CRM is referred to as the automation tool that helps bring about this strategy." (Steven Haines, "The Product Manager's Desk Reference", 2008)

"Software intended to help you run your sales force and customer support." (Judith Hurwitz et al, "Service Oriented Architecture For Dummies" 2nd Ed., 2009)

"The technology and processes used to capture the details of interactions with customers and analyze that data to improve customer interaction, assess customer value, and build value and further loyalty." (Tony Fisher, "The Data Asset", 2009)

"A set of technologies and business processes designed to understand a customer, improve customer experience, and optimize customer-facing business processes across marketing, sales, and servicing channels." (Alex Berson & Lawrence Dubov, "Master Data Management and Data Governance", 2010)

"Refers to the set of procedures and computer applications designed to manage and improve customer service in an enterprise. Data warehousing, with integrated data about each customer, is eminently suitable for CRM." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"This is a packaged solution that delivers an end-to-end solution around contacting, understanding, and serving particular customer needs." (Martin Oberhofer et al, "The Art of Enterprise Information Architecture", 2010)

"Establishing relationships with individual customers and then using that information to treat different customers differently. Customer buying profiles and churn analysis are examples of decision support activities that can affect the success of customer relationships. Effective CRM is dependent on high quality master data about individuals and organizations (customer data integration)." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The entire process of maximizing the value proposition to the customer through all interactions, both online and traditional. Effective CRM advocates one-to-one relationships and participation of customers in related business decisions." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

"Software designed to help you run your sales force and customer support operations." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"The management of current and future customer interactions with a business. This can include sales support, warranty and technical support activity, Internet website, marketing, and product advertising." (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)

"Customer relationship management is a model for managing a company’s interactions with current and future customers. It involves using technology to organize, automate, and synchronize sales, marketing, customer service, and technical support." (Keith Holdaway, "Harness Oil and Gas Big Data with Analytics", 2014)

"System that manages the customer-related data along with their past, present, and future interactions with the organization." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"A set of tools, techniques, and methodologies for understanding the needs and characteristics of customers in order to better serve them." (Robert M Grant, "Contemporary Strategy Analysis" 10th Ed., 2018)

13 January 2008

SQL Reloaded: Advice on Splitting SQL Logic in Web Applications

Yesterday I remembered my first two small fights over SQL-related topics, in both of which I had to temporarily renounce my opinions for the sake of armistice. It’s understandable, we can all make mistakes; unfortunately, what we know hurts us more than what we don’t know. It’s amusing to think now about the two issues in discussion, though at the time it was a little disturbing. What were they about? Both issues are related to web applications, my first “professional encounter” with programming.

Issue 1 – using JOINs, or backend vs. middle-tier processing

JOINs are a powerful feature of SQL, combining related data from multiple tables in a single query, though this comes with a little (or more) overhead on the database server side.
Web applications perform a lot of data access operations, data being pulled from a database each time a user requests a page – at least whenever the page needs data from the database(s) or executes commands against it, the CRUD (Create/Read/Update/Delete) gamut. That can become costly in time, depending on the requirements and on how data access was architected. The target is to pull the smallest chunk of data possible (rule 1), with a minimum of round trips to the database (rule 2).

Supposing that we need Employee data from a database for a summary screen with all employees, it could contain First Name, Last Name, Department and contact information – City, Country, Email Address and Phone Number. Normally the information could be stored in 4 tables – Employees, Departments, Addresses and Countries, like in the sketch below.
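A minimal T-SQL sketch of the four tables; the exact columns and data types are assumptions based on the queries that follow:

-- Countries lookup table
CREATE TABLE Countries (
  CountryID int PRIMARY KEY
, Country nvarchar(50))

-- Departments lookup table
CREATE TABLE Departments (
  DepartmentID int PRIMARY KEY
, Department nvarchar(50))

-- Employees, each assigned to a Department
CREATE TABLE Employees (
  EmployeeID int PRIMARY KEY
, DepartmentID int REFERENCES Departments(DepartmentID)
, FirstName nvarchar(50)
, LastName nvarchar(50)
, EmailAddress nvarchar(128))

-- Addresses, one or more per Employee, each linked to a Country
CREATE TABLE Addresses (
  EmployeeID int REFERENCES Employees(EmployeeID)
, CountryID int REFERENCES Countries(CountryID)
, City nvarchar(50)
, Phone nvarchar(25))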



The easiest and best way to pull the needed Employee data is to do a JOIN between the tables:
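A possible form of the query, assuming the columns used in the step-by-step queries below:

SELECT E.EmployeeID
, E.FirstName
, E.LastName
, D.Department
, A.City
, C.Country
, E.EmailAddress
, A.Phone
FROM Employees E
    JOIN Departments D
      ON E.DepartmentID = D.DepartmentID
    JOIN Addresses A
      ON E.EmployeeID = A.EmployeeID
    JOIN Countries C
      ON A.CountryID = C.CountryID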



More than likely, two or more employees will have the same country or department, resulting in “duplication” of small pieces of information within the whole data set, contradicting rule 1. Smaller chunks of data targeting only the content of a single table can be pulled instead: first all the Employees, or only the ones matching a set of constraints, then all the Departments, or only the ones for which an Employee was returned, and the same with Addresses and Countries. In the end we will have 4 queries and the same number of round trips (or more). In the web page the code will have to follow the steps below:

Step 1: Pull the Employee data matching the query:
SELECT E.EmployeeID
, E.DepartmentID
, E.FirstName
, E.LastName
, E.EmailAddress
FROM Employees E
WHERE <constraints, if any>

Step 2: Build the (distinct) list of Department IDs and the (distinct) list of Employee IDs.

Step 3: Pull the Department data matching the query:
SELECT D.DepartmentID
, D.Department
FROM Departments D
WHERE DepartmentID IN (<list of Department IDs>)

Step 4: Pull the Address data matching the query:
SELECT A.EmployeeID
, A.CountryID
, A.City
, A.Phone
FROM Addresses A
WHERE EmployeeID IN (<list of Employee IDs>)

Step 5: Build the (distinct) list of Country IDs.

Step 6: Pull the Country data matching the query:
SELECT C.CountryID
, C.Country
FROM Countries C
WHERE CountryID IN (<list of Country IDs>)

And if this doesn’t look like overhead to you, take into account that for each Employee we need to search for the matching Department in the data set returned in Step 3, and the same for Addresses and Countries. It’s exactly what the database server does, but done on the web server, with no built-in capabilities for data matching.
In order to overcome the problems raised by matching, somebody could execute, for each employee returned in Step 1, a query like the one defined in Step 4 but limited to the respective Employee, thus resulting in an additional number of round trips equal to the number of Employees. Quite a monster, isn’t it? Please don’t do something like this!

It’s true that we always need to balance between a minimum of data and a minimum of round trips between the web server and the database, though we also have to take into account the overhead created by pushing either extreme, and balance them in an optimum manner by implementing the logic on the right tier. So, do data matching as much as possible on the database server, because it was designed for that, and do, where possible, data enrichment (e.g. formatting) only on the web server.

In theory the easiest way of achieving something is the best, as long as the quality remains the same, so try to avoid writing expensive code that’s hard to write, maintain and debug!

Issue 2 – LEFT vs. INNER JOINs

Normally each Employee should be linked to a Department and have at least one Address, and each Address should be linked to a Country. That can be enforced at database and application level, though it’s not always the case. There could be Employees that are not assigned to a Department, or without an Address; in such cases, instead of an INNER JOIN you have to consider a LEFT or, as the case may be, a RIGHT (OUTER) JOIN, so that the unmatched Employees are not dropped from the result. So, I’ve rewritten the first query, this time using LEFT JOINs.
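A sketch of the rewritten query, under the same assumed schema:

SELECT E.EmployeeID
, E.FirstName
, E.LastName
, D.Department
, A.City
, C.Country
, E.EmailAddress
, A.Phone
FROM Employees E
    -- LEFT JOINs keep Employees without a Department or Address
    LEFT JOIN Departments D
      ON E.DepartmentID = D.DepartmentID
    LEFT JOIN Addresses A
      ON E.EmployeeID = A.EmployeeID
    LEFT JOIN Countries C
      ON A.CountryID = C.CountryID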



You don’t always need to use LEFT JOINs unless the business case requires it, and don’t abuse them, as they can come with a performance decrease!

12 January 2008

SQL Reloaded: SQL Server and Excel Data

Looking for information about SQL Server 2008, I stumbled over Bob Beauchemin’s blog; the first posting I read, Another use for SQL Server 2008 row constructors, demonstrates a new use of the VALUES clause that allows inserting multiple rows in a table using only one query, or grouping a set of values in a table-like structure and using them as the source for a JOIN. I was waiting for this feature to be available under SQL Server 2005, though better late than never!

The feature is useful when you need to limit the output of a query based on matrix (tabular) values coming from an Excel or text file. For exemplification I’ll use the HumanResources.vEmployee view from the AdventureWorks database that comes with SQL Server 2005; you can download it from CodePlex in case you don’t have it for SQL Server 2008.
Let’s suppose that you have an Excel file with Employees for which you need contact information from a table available on SQL Server. You have the FirstName, MiddleName and LastName, and you need the EmailAddress and Phone. In SQL Server 2008 you can do that by creating a temporary table-like structure on the fly using the VALUES clause, and then use it in JOIN or INSERT statements.

Query 1
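A sketch of how such a query might look; the names in the VALUES list are hypothetical samples:

SELECT B.FirstName
, B.MiddleName
, B.LastName
, EMP.EmailAddress
, EMP.Phone
FROM HumanResources.vEmployee EMP
    JOIN ( -- the on-the-fly table built with the VALUES clause
    VALUES ('Guy', 'R', 'Gilbert')
    , ('Kevin', 'F', 'Brown')
    , ('Rob', NULL, 'Walters')) B(FirstName, MiddleName, LastName)
      ON EMP.FirstName = B.FirstName
     AND IsNull(EMP.MiddleName, '') = IsNull(B.MiddleName, '')
     AND EMP.LastName = B.LastName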

The heart of the query is the below structure, where B(FirstName, MiddleName, LastName) is the new table, each row in its definition being specified by a comma-delimited triple of the form ('FirstName', 'MiddleName', 'LastName'):
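For example, with the same hypothetical names:

(VALUES ('Guy', 'R', 'Gilbert')
, ('Kevin', 'F', 'Brown')
, ('Rob', NULL, 'Walters')) B(FirstName, MiddleName, LastName)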



The construct is time-consuming to build manually, especially when the number of lines is considerably big, though you can generate it in Excel with the help of an easy formula.



The formula in column D is = ", ('" & A2 & "','" & B2 & "','" & C2 & "')" and it can be applied to the other lines too. You now just need to copy the data from column D into the query, with a few small changes. Of course, you can also create a custom function (macro) in Excel to obtain the whole structure in a single cell.
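With the hypothetical names above, column D would produce lines like the following, ready to be pasted into the VALUES list (the leading comma of the first row has to be dropped):

, ('Guy','R','Gilbert')
, ('Kevin','F','Brown')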

You can do something similar under older versions of SQL Server (or other databases) using a simple trick – concatenating the values of each row’s columns into a single string using a delimiter like “/”, “~” or “|”, though you have to be sure that the delimiter isn’t found in your data sources (Excel and table). Using “/”, the formula is = ", '" & A2 & "/" & B2 & "/" & C2 & "'".



Then you have to apply the same trick on the SQL Server side and concatenate the columns from the table, the query becoming:

Query 2
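A sketch, again with hypothetical sample values; the IsNull call keeps a NULL middle name from turning the whole concatenation into NULL:

SELECT EMP.FirstName
, EMP.MiddleName
, EMP.LastName
, EMP.EmailAddress
, EMP.Phone
FROM HumanResources.vEmployee EMP
WHERE EMP.FirstName + '/' + IsNull(EMP.MiddleName, '') + '/' + EMP.LastName IN ('Guy/R/Gilbert'
, 'Kevin/F/Brown'
, 'Rob//Walters')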

This technique involves small difficulties when:
• The data used for searching have types other than string-derived data types; however, that can be overcome by casting the values to strings before concatenation.
• The string values contain spaces at the extremities, so it’s better to trim the values using the LTrim and RTrim functions.
• The values from the two sources are slightly different, for example diacritics vs. their standard Latin character equivalents; for this a transformation of the values to the same format is necessary.