## 23 September 2007

### 6 out of 49 – Data Analysis (Part 1)

One of the interesting things to see in a data set of drawings is the distance between two drawings in which appears the same numbers. It doesn’t look too easy to compare two consecutive records within the same query; actually this can be done easy with the help of DENSE_RANK function (works only on SQL Server 2005+ and Oracle), which ranks the data within a partition, thus if two values are identical, they have the same ranking. Let’s see how DENSE_RANK function works: SELECT Number , DrawingDate , DENSE_RANK() OVER (PARTITION BY Number ORDER BY DrawingDate) Ranking FROM vLottoNumbers     The partition is created by Number, while the ranking is delimited by DrawingDate. Now all we have to do is to join two such queries by Number with two consecutive Rankings: --creating vLottoConsecutiveDrawings view CREATE VIEW vLottoConsecutiveDrawings AS SELECT A.Number , A.DrawingDate , B.DrawingDate NextDrawingDate , IsNull(DateDiff(d, A.DrawingDate, B.DrawingDate), 0) DifferenceDays , IsNull(DateDiff(wk, A.DrawingDate, B.DrawingDate), 0) DifferenceWeeks FROM (     SELECT Number     , DrawingDate     , DENSE_RANK() OVER (PARTITION BY Number ORDER BY DrawingDate) Ranking     FROM vLottoNumbers ) A LEFT JOIN (     SELECT Number     , DrawingDate     , DENSE_RANK() OVER (PARTITION BY Number ORDER BY DrawingDate) Ranking     FROM vLottoNumbers ) B ON A.Number = B.Number AND A.Ranking = B.Ranking – 1 --checking the data SELECT * FROM vLottoConsecutiveDrawings ORDER BY Number, DrawingDate     The first interesting thing to find out is:     What’s the average number of occurrences?     What’s the minimum/maximum number of consecutive drawings in which the number hasn’t appeared?     How many times a number appeared?     The following query answers to these questions, by doing a simple grouping by Number over vLottoConsecutiveDrawings output: SELECT Number , Min(NullIf(DifferenceWeeks, 0)) Minimum , Max(NullIf(DifferenceWeeks, 0)) Maximum , Avg(NullIf(DifferenceWeeks, 0)) Average , count(*) NumberDrawings FROM vLottoConsecutiveDrawings GROUP BY Number ORDER BY Number     The output doesn’t give too much information, but it’s a good start. The study interval can be decreased by using the DrawingDate in GROUP and/or WHERE clause: SELECT Number , Year(DrawingDate) DrawingYear , Min(NullIf(DifferenceWeeks, 0)) Minimum , Max(NullIf(DifferenceWeeks, 0)) Maximum , Avg(NullIf(DifferenceWeeks, 0)) Average , count(*) NumberDrawings FROM vLottoConsecutiveDrawings --WHERE Year(DrawingDate) IN (2000, 2001) GROUP BY Number , Year(DrawingDate) ORDER BY Number, DrawingYear