Yesterday I remembered about my first two small fights on SQL related topics, though in both situations I had to renounce temporarily to my opinions for the sake of armistice. It’s understandable, we can all make mistakes, unfortunately what we know hurts us more then what we don’t know. It’s amusing when I’m thinking about the two issues in discussion at those times, though then it was a little disturbing. What was about?! Actually both issues are related to web applications, my first “professional encounter” with programming.
Issue 1 – using JOINs or backend vs middle tier processing
JOINs are a powerful feature of SQL in combing related data from multiple tables in only one query, this coming with a little (or more) overhead from the database server side.
Web applications make use of lot of data access operations, data being pulled from a database each time a user requests a page, of course that happening when the page needs data from database(s) or execute commands on it, the CRUD (Create/Read/Update/Delete) gamma. That can become costly in time, depending on how data access was architected and requirements. The target is to pull smallest chunk of data possible (rule 1), with a minimum of trips to the database (rule 2).
Supposing that we need Employees data from a database for a summary screen with all employees, it could contain First Name, Last Name, Department and Contact information – City, Country, Email Address and Phone Number. Normally the information could be stored in 4 tables – Employees, Departments, Address and Countries, like in the below diagram.
The easiest and best way to pull the Employee needed data is to do a JOIN between tables:
More likely that two or more employees will have the same country or department, resulting in “duplication” of small pieces of information within the whole data set, contradicting rule 1. Can be pulled smaller chunks of data targeting only the content of a table, that meaning that we have to pull first all Employees or the ones matching a set of constraints, then all the departments or the only the ones for which an Employee was returned, and same with Addresses and Countries. In the end will have 4 queries and same number of roundtrips (or more). In the web page the code will have to follow the below steps:
Step 1: Pull the Employee data matching the query:
FROM Employees E
Step 2: Build the (distinct) list of Department IDs and a (distrinct) list of Employee IDs.
Step 3: Pull the Department data matching the query:
FROM Departments D
WHERE DepartmentID IN (<list of Department IDs>)
Step 4: Pull the Address data matching the query:
FROM Addresses A
WHERE EmployeeID IN (<list of Employee IDs>)
Step 5: Build the (distinct) list of Country IDs.
Step 6: Pull the Country data matching the query:
FROM Countries C
WHERE CountryID IN (<list of Country IDs>)
And if this doesn’t look like an overhead for you, you have to take into account that for each Employee is needed to search the right Department from the set of data returned in Step 3, and same thing for Addresses and Countries. It’s exactly what the database server does but done on the web server, with no built in capabilities for data matching.
In order to overcome the problems raised by matching, somebody could go and execute for each employee returned in Step 1 a query like the one defined in Step 4, but limited only to the respective Employee, thus resulting an additional number of new roundtrips matching the number of Employees. Quite a monster, isn’t it? Please don’t do something like this!
It’s true that we always need to mitigate between minimum of data and minimum of roundtrips to a web server, though we have to take into account also the overhead created by achieving extremities and balance them in an optimum manner, implementing the logic on the right tier. So, do data matching as much as possible on the database server because it was designed for that, and do, when possible, data enrichment (e.g. formatting) only on the web server.
In theory the easiest way of achieving something it’s the best as long the quality remains the same, so try to avoid writing expensive code that’s hard to write, maintain and debug!
Issue 2: - LEFT vs FULL JOINs
Normally each employee should be linked to a Department, have at least one Address, and the Address should be linked to a Country. That can be enforced at database and application level, though it’s not always the case. There could be Employees that are not assigned to a Department, or without an Address; in such cases then instead of a FULL JOIN you have to consider a LEFT or after case a RIGHT (OUTER) JOIN. So, I’ve rewritten the first query, this time using LEFT JOINs.
You don’t need to use always LEFT JOINs unless the business case requires it, and don’t abuse of them as they come with performance decrease!