According to Wikipedia, computer programming (shortly programming or coding) is “the process of designing, writing, testing, debugging, and maintaining the source code of computer programs”. That’s an extensive definition, because typically programming refers mainly to the writing of a set of instructions understandable by a computer or any other electronic device. At least that’s what programming was at its beginnings. With time, giving the increasing complexity of software, programming included also activities like gathering requirements, architecting, designing, testing, debugging and troubleshooting, refactoring, documenting, configuring, deploying, performing maintenance, etc. Each of these activities comes with their own set of methods, procedures, processes, models, methodologies, best/good practices, standards and tools. In addition, when we look at the architecture of an application, we can delimit several layers: server vs. client, front-end (user interface), business layer, backend (database), transport (network), communication or hardware – they coming with their own set of technologies and knowledge luggage, and requiring some specialization too.
However, making abstraction of all these, programming implies the (partial) knowledge of a programming language, an artificial language used to communicate with machines, in terms of language syntax, semantics and built-in libraries, and of a IDE (Integrated Development Environment), an application in which the code is written, compiled/interpreted and debugged. As programming can be often a redundant task, being necessary to solve the same kind of problems or to write the same kind of instructions, in addition to the various structures and techniques made available in order to minimize redundancy, a programmer can take advantage of a huge collection of algorithms, abstracted step-by-step instructions, and afferent technical literature. The deeper needs to go their understanding, the broader the set of knowledge to be acquired for it.
And even if we consider all above, that’s not enough because programming is used in order to model and solve business-specific problems. So is required some minimal knowledge of the respective business domains, and that’s quite a lot if we consider that each project may address one or more business domains. Talking about projects, as most of the programming is performed within projects, a programmer needs to have some knowledge of the procedures, methods and methodologies for project management and team management. That’s not programming anymore, but it’s part of the landscape and nowadays is kind of a must because programming is performed within projects and teams. This means also that a programmer needs to cover several important interpersonal skills, to which add up customer oriented, social and thinking skills. They are important because they impact directly or indirectly the act of programming, and many ignore this.
It’s important to stress that programming is not only the knowledge of languages, algorithms, tools, methods, models, practices, methodologies, standards, but also their adequate use in order to make most of the programming experience. Or as a long time ago retrieved quotes puts it: “programming is 10% science, 20% ingenuity, and 70% getting the ingenuity to work with the science” (anonymous). We all (or almost all) master our native language to the degree of writing sentences or communicating, though it takes skills to communicate effectively and efficiently, or of making from language a tool of expression through poetry or other forms o literature. Fantasizing a little, programming is like writing poetry, is one thing to write chunks of words, and another thing to write something meaningful. And programming is a lot about interpretation and representation of meaning in order to solve problems, is about understanding and breaking down complexity to a level that can be translated in meaning to machines.
Programming is an art, to the same degree each endeavor can be transformed in art. It requires skills, knowledge, dedication, creativity, and most of all the pleasure of programming. Programming is a state of spirit, is a way or model of thinking, of seeing the whole world in computable terms.
SQL Troubles
A blog on SQL, data and database related topics
Saturday, February 18, 2012
Programming Reviewed – Part 1: What Programming Is About
Friday, December 02, 2011
Window Functions – Part I: The Case for Window Functions
Introduction
In the past, in the absence or in parallel with other techniques, aggregate functions proved to be quite useful in order to solve several types of problems that involve the retrieval of first/last record or the display of details together with averages and other aggregates. Typically their use involves two or more joins between a dataset and an aggregation based on the same dataset or a subset of it. An aggregation can involve one or more columns that make the object of analysis. Sometimes it might be needed multiple such aggregations based on different sets of columns. Each such aggregation involves at least a join. Such queries can become quite complex, though they were a price to pay in order to solve such problems.
Partitions
The introduction of analytic functions in Oracle and of window functions, a similar concept, in SQL Server, allowed the approach of such problems from a different simplified perspective. Central to this feature it’s the partition (of a dataset), its meaning being same as of mathematical partition of a set, defined as a division of a set into non-overlapping and non-empty parts that cover the whole initial set. The introduction of partitions it’s not necessarily something new, as the columns used in a GROUP BY clause determines (implicitly) a partition in a dataset. The difference in analytic/window functions is that the partition is defined explicitly inline together with a ranking or average function evaluated within a partition. If the concept of partition is difficult to grasp, let’s look at the result-set based on two Products (the examples are based on AdventureWorks database):
-- Price Details for 2 Products
SELECT A.ProductID
, A.StartDate
, A.EndDate
, A.StandardCost
FROM [Production].[ProductCostHistory] A
WHERE A.ProductID IN (707, 708)
ORDER BY A.ProductID
, A.StartDate
In this case a partition is “created” based on the first Product (ProductId = 707), while a second partition is based on the second Product (ProductId = 708). As a parenthesis, another partitioning could be created based on ProductId and StartDate; considering that the two attributes are a key in the table, this will partition the dataset in partitions of 1 record (each partition will have exactly one record).
Details and Averages
In order to exemplify the use of simple versus window aggregate functions, let’s consider a problem in which is needed to display Standard Price details together with the Average Standard Price for each ProductId. When a GROUP BY clause is applied in order to retrieve the Average Standard Cost, the query is written under the form:
-- Average Price for 2 Products
SELECT A.ProductID
, AVG(A.StandardCost) AverageStandardCost
FROM [Production].[ProductCostHistory] A
WHERE A.ProductID IN (707, 708)
GROUPBY A.ProductID
ORDERBY A.ProductID
In order to retrieve the details, the query can be written with the help of a FULL JOIN as follows:
-- Price Details with Average Price for 2 Products - using JOINs
SELECT A.ProductID
, A.StartDate
, A.EndDate
, A.StandardCost
, B.AverageStandardCost
, A.StandardCost - B.AverageStandardCost DiffStandardCost
FROM [Production].[ProductCostHistory] A
JOIN ( -- average price
SELECT A.ProductID
, AVG(A.StandardCost) AverageStandardCost
FROM [Production].[ProductCostHistory] A
WHERE A.ProductID IN (707, 708)
GROUP BY A.ProductID
) B
ON A.ProductID = B.ProductID
WHERE A.ProductID IN (707, 708)
ORDERBY A.ProductID
, A.StartDate
As pointed above the partition is defined by ProductId. The same query written with window functions becomes:
-- Price Details with Average Price for 2 Products - using AVG window function
SELECT A.ProductID
, A.StartDate
, A.EndDate
, A.StandardCost
, AVG(A.StandardCost) OVER(PARTITION BY A.ProductID) AverageStandardCost
, A.StandardCost - AVG(A.StandardCost) OVER(PARTITION BY A.ProductID) DiffStandardCost
FROM [Production].[ProductCostHistory] A
WHERE A.ProductID IN (707, 708)
ORDER BY A.ProductID
, A.StartDate
As can be seen, in the second example, the AVG function is defined using the OVER clause with PartitionId as partition. Even more, the function is used in a formula to calculate the Difference Standard Cost. More complex formulas can be written making use of multiple window functions.
The Last Record
Let’s consider the problem of retrieving the nth record. Because with aggregate functions is easier to retrieve the first or last record, let’s consider that is needed to retrieve the last Standard Price for each ProductId. The aggregate function helps to retrieve the greatest Start Date, which farther helps to retrieve the record containing the Last Standard Price.
-- Last Price Details for 2 Products - using JOINs
SELECT A.ProductID
, A.StartDate
, A.EndDate
, A.StandardCost
FROM [Production].[ProductCostHistory] A
JOIN ( -- average price
SELECT A.ProductID
, Max(A.StartDate) LastStartDate
FROM [Production].[ProductCostHistory] A
WHERE A.ProductID IN (707, 708)
GROUP BY A.ProductID
) B
ON A.ProductID = B.ProductID
AND A.StartDate = B.LastStartDate
WHERE A.ProductID IN (707, 708)
ORDERBY A.ProductID
,A.StartDate
With window functions the query can be rewritten as follows:
-- Last Price Details for 2 Products - using AVG window function
SELECT *
FROM (—ordered prices
SELECT A.ProductID
, A.StartDate
, A.EndDate
, A.StandardCost
, RANK() OVER(PARTITION BY A.ProductID ORDER BY A.StartDate DESC) Ranking
FROM [Production].[ProductCostHistory] A
WHERE A.ProductID IN (707, 708)
) A
WHERE Ranking = 1
ORDER BY A.ProductID
, A.StartDate
As can be seen, in order to retrieve the Last Standard Price, was considered the RANK function, the results being ordered descending by StartDate. Thus, the Last Standard Price will be always positioned on the first record. Because window functions can’t be used in WHERE clauses, it’s needed to encapsulate the initial logic in a subquery. Similarly could be retrieved the First Standard Price, this time ordering ascending the StartDate. The last query can be easily modified to retrieve the nth records (this can prove to be more difficult with simple average functions), the first/last nth records.
Conclusion
Without going too deep into details, I shown above two representative scenarios in which solutions based on average functions could be simplified by using window functions. In theory the window functions provide greater flexibility but they have their own trade offs too. In the next posts I will attempt to further detail their use, especially in the context of Statistics.
In the past, in the absence or in parallel with other techniques, aggregate functions proved to be quite useful in order to solve several types of problems that involve the retrieval of first/last record or the display of details together with averages and other aggregates. Typically their use involves two or more joins between a dataset and an aggregation based on the same dataset or a subset of it. An aggregation can involve one or more columns that make the object of analysis. Sometimes it might be needed multiple such aggregations based on different sets of columns. Each such aggregation involves at least a join. Such queries can become quite complex, though they were a price to pay in order to solve such problems.
Partitions
The introduction of analytic functions in Oracle and of window functions, a similar concept, in SQL Server, allowed the approach of such problems from a different simplified perspective. Central to this feature it’s the partition (of a dataset), its meaning being same as of mathematical partition of a set, defined as a division of a set into non-overlapping and non-empty parts that cover the whole initial set. The introduction of partitions it’s not necessarily something new, as the columns used in a GROUP BY clause determines (implicitly) a partition in a dataset. The difference in analytic/window functions is that the partition is defined explicitly inline together with a ranking or average function evaluated within a partition. If the concept of partition is difficult to grasp, let’s look at the result-set based on two Products (the examples are based on AdventureWorks database):
-- Price Details for 2 Products
SELECT A.ProductID
, A.StartDate
, A.EndDate
, A.StandardCost
FROM [Production].[ProductCostHistory] A
WHERE A.ProductID IN (707, 708)
ORDER BY A.ProductID
, A.StartDate
In this case a partition is “created” based on the first Product (ProductId = 707), while a second partition is based on the second Product (ProductId = 708). As a parenthesis, another partitioning could be created based on ProductId and StartDate; considering that the two attributes are a key in the table, this will partition the dataset in partitions of 1 record (each partition will have exactly one record).
Details and Averages
In order to exemplify the use of simple versus window aggregate functions, let’s consider a problem in which is needed to display Standard Price details together with the Average Standard Price for each ProductId. When a GROUP BY clause is applied in order to retrieve the Average Standard Cost, the query is written under the form:
-- Average Price for 2 Products
SELECT A.ProductID
, AVG(A.StandardCost) AverageStandardCost
FROM [Production].[ProductCostHistory] A
WHERE A.ProductID IN (707, 708)
GROUPBY A.ProductID
ORDERBY A.ProductID
In order to retrieve the details, the query can be written with the help of a FULL JOIN as follows:
-- Price Details with Average Price for 2 Products - using JOINs
SELECT A.ProductID
, A.StartDate
, A.EndDate
, A.StandardCost
, B.AverageStandardCost
, A.StandardCost - B.AverageStandardCost DiffStandardCost
FROM [Production].[ProductCostHistory] A
JOIN ( -- average price
SELECT A.ProductID
, AVG(A.StandardCost) AverageStandardCost
FROM [Production].[ProductCostHistory] A
WHERE A.ProductID IN (707, 708)
GROUP BY A.ProductID
) B
ON A.ProductID = B.ProductID
WHERE A.ProductID IN (707, 708)
ORDERBY A.ProductID
, A.StartDate
As pointed above the partition is defined by ProductId. The same query written with window functions becomes:
-- Price Details with Average Price for 2 Products - using AVG window function
SELECT A.ProductID
, A.StartDate
, A.EndDate
, A.StandardCost
, AVG(A.StandardCost) OVER(PARTITION BY A.ProductID) AverageStandardCost
, A.StandardCost - AVG(A.StandardCost) OVER(PARTITION BY A.ProductID) DiffStandardCost
FROM [Production].[ProductCostHistory] A
WHERE A.ProductID IN (707, 708)
ORDER BY A.ProductID
, A.StartDate
As can be seen, in the second example, the AVG function is defined using the OVER clause with PartitionId as partition. Even more, the function is used in a formula to calculate the Difference Standard Cost. More complex formulas can be written making use of multiple window functions.
The Last Record
Let’s consider the problem of retrieving the nth record. Because with aggregate functions is easier to retrieve the first or last record, let’s consider that is needed to retrieve the last Standard Price for each ProductId. The aggregate function helps to retrieve the greatest Start Date, which farther helps to retrieve the record containing the Last Standard Price.
-- Last Price Details for 2 Products - using JOINs
SELECT A.ProductID
, A.StartDate
, A.EndDate
, A.StandardCost
FROM [Production].[ProductCostHistory] A
JOIN ( -- average price
SELECT A.ProductID
, Max(A.StartDate) LastStartDate
FROM [Production].[ProductCostHistory] A
WHERE A.ProductID IN (707, 708)
GROUP BY A.ProductID
) B
ON A.ProductID = B.ProductID
AND A.StartDate = B.LastStartDate
WHERE A.ProductID IN (707, 708)
ORDERBY A.ProductID
,A.StartDate
With window functions the query can be rewritten as follows:
-- Last Price Details for 2 Products - using AVG window function
SELECT *
FROM (—ordered prices
SELECT A.ProductID
, A.StartDate
, A.EndDate
, A.StandardCost
, RANK() OVER(PARTITION BY A.ProductID ORDER BY A.StartDate DESC) Ranking
FROM [Production].[ProductCostHistory] A
WHERE A.ProductID IN (707, 708)
) A
WHERE Ranking = 1
ORDER BY A.ProductID
, A.StartDate
As can be seen, in order to retrieve the Last Standard Price, was considered the RANK function, the results being ordered descending by StartDate. Thus, the Last Standard Price will be always positioned on the first record. Because window functions can’t be used in WHERE clauses, it’s needed to encapsulate the initial logic in a subquery. Similarly could be retrieved the First Standard Price, this time ordering ascending the StartDate. The last query can be easily modified to retrieve the nth records (this can prove to be more difficult with simple average functions), the first/last nth records.
Conclusion
Without going too deep into details, I shown above two representative scenarios in which solutions based on average functions could be simplified by using window functions. In theory the window functions provide greater flexibility but they have their own trade offs too. In the next posts I will attempt to further detail their use, especially in the context of Statistics.
Labels:
AdventureWorks,
aggregate functions,
aggregate window functions,
AVG,
JOIN,
nth record,
OVER,
RANK,
SQL Server 2005
| Reactions: |
Subscribe to:
Posts (Atom)