22 September 2007

6 out of 49 – Data Analysis (Part 2)

    In a posting I read long time ago, it was mentioned that’s interesting to see in a data set in which value intervals fall the data. Thus it was recommended to split the interval in 10 buckets, the only problem – we don’t know which is the right most extremity for our interval. We could use the maximum value: -- get maximum number of consecutive drawings for which any number haven’t appeared SELECT Max(DifferenceWeeks) FROM vLottoConsecutiveDrawings     But this doesn’t guarantees that there will be cases in which the extremity is exceeded. From my point of view, if the maximum is 63, then I’ll take the extremity 70 and split it in 10 buckets: first interval will be 1 … 7, second 8 … 14, etc. Thus, I’m taking the GROUP query I used in the previous copy and add to it additional logic for the split: SELECT Number , Min(NullIf(DifferenceWeeks, 0)) Minimum , Max(NullIf(DifferenceWeeks, 0)) Maximum , Avg(NullIf(DifferenceWeeks, 0)) Average , count(*) NumberDrawings , SUM(CASE WHEN DifferenceWeeks BETWEEN 1 AND 7 THEN 1 END) Range1_7 , SUM(CASE WHEN DifferenceWeeks BETWEEN 8 AND 14 THEN 1 END) Range8_14 , SUM(CASE WHEN DifferenceWeeks BETWEEN 15 AND 21 THEN 1 END) Range15_21 , SUM(CASE WHEN DifferenceWeeks BETWEEN 22 AND 28 THEN 1 END) Range22_28 , SUM(CASE WHEN DifferenceWeeks BETWEEN 29 AND 35 THEN 1 END) Range29_35 , SUM(CASE WHEN DifferenceWeeks BETWEEN 36 AND 42 THEN 1 END) Range36_42 , SUM(CASE WHEN DifferenceWeeks BETWEEN 43 AND 49 THEN 1 END) Range43_49 , SUM(CASE WHEN DifferenceWeeks BETWEEN 50 AND 56 THEN 1 END) Range50_56 , SUM(CASE WHEN DifferenceWeeks BETWEEN 57 AND 63 THEN 1 END) Range57_63 , SUM(CASE WHEN DifferenceWeeks BETWEEN 64 AND 70 THEN 1 END) Range64_70 , 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 1 AND 7 THEN 1 END)/count(*) Percentage1_7 , 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 8 AND 14 THEN 1 END)/count(*) Percentage8_14 , 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 15 AND 21 THEN 1 END)/count(*) Percentage15_21 , 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 22 AND 28 THEN 1 END)/count(*) Percentage22_28 , 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 29 AND 35 THEN 1 END)/count(*) Percentage29_35 , 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 36 AND 42 THEN 1 END)/count(*) Percentage36_42 , 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 43 AND 49 THEN 1 END)/count(*) Percentage43_49 , 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 50 AND 56 THEN 1 END)/count(*) Percentage50_56 , 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 57 AND 63 THEN 1 END)/count(*) Percentage57_63 , 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 64 AND 70 THEN 1 END)/count(*) Percentage64_70 FROM vLottoConsecutiveDrawings GROUP BY Number ORDER BY Number     As can be seen I just summed the number of cases in which a value fallen in the given interval: SUM(CASE WHEN DifferenceWeeks BETWEEN 1 AND 7 THEN 1 END) Range1_7     Additionally, I wanted to see what was the percentage for a value to fall in a given interval from the total number of occurences: 100*SUM(CASE WHEN DifferenceWeeks BETWEEN 1 AND 7 THEN 1 END)/count(*) Percentage1_7     How can we benefit from such output?     We can check for each number what’s the bucket with the highest percentage, usually there are only 1-2 intervals which qualify. The percentage the is key in finding useful information. For example if there is only one interval with a 75% occurences, then most probably the number will appear in a drawing from that interval, if the number didn’t appeared in the respective interval, then the probability for that number to appear increases. When we have multiple intervals with a considerable high percentage, if the right extremity of the highest interval is exceeded, the chances for a number to appear are considerbly higher.

No comments: