Wednesday, November 30, 2016

Check your lotto chances: what do the numbers say?

This post is part 2 of using probability to determine our chances of winning the lottery. In the previous post we checked the probability of drawing the numbers.

Oh, the lottery win, could it be? We have determined that our chances are very slim, but we also know that the lotto people have to use a machine, or a bunch of balls in a rolling drum, to get the numbers. Maybe there is an issue with the balls, or the machine, or the environment it's in? Maybe it's not really random? Maybe there are patterns to how the numbers get drawn.


Let's get all the numbers from previous draws and see what comes out of that. In Canada the lottery publishes the list of winning numbers, so we are going to grab those files and find out if there is (just maybe) a way to predict the winning numbers. Maybe one number is more likely to be drawn than the others.

To find out, let's load up the numbers from a CSV file and see what's there:


import os, csv
from collections import Counter

class LottoStats(object):
    def __init__(self, file_name='Lotto649.csv'):
        '''
        We want to get the numbers out of a spreadsheet; its columns are like so:
        Draw Date,No1,No2,No3,No4,No5,No6,Bonus,
        (other files, like 'Atlantic649.csv', use the same layout)
        '''
        lotto_file = open(os.path.abspath(file_name), 'rb')
        reader = csv.reader(lotto_file, dialect=csv.excel)
        self.lotto_numbers = []
        self.draws = 0
        for row in reader:
            self.draws += 1
            for i, cell in enumerate(row):
                # skip the draw date (column 0) and the bonus number (column 7)
                if (i != 0 and i != 7 and cell):
                    self.lotto_numbers.append(cell)
        lotto_file.close()
        '''
        Python has the handy Counter object that
        will return the elements and their counts
        '''
        self.counts = Counter(self.lotto_numbers)
 
    def get_numbers(self):
        return self.lotto_numbers

    def get_counts(self):
        return self.counts.items()

    def count_most_common(self, n=9):
        return self.counts.most_common(n)

    def count_least_common(self, n=9):
        # most_common() sorts descending, so slice backwards from the end to get the least common n
        return self.counts.most_common(len(self.lotto_numbers))[:-n-1:-1]

    def get_draws(self):
        return self.draws

stats = LottoStats('Lotto649.csv') 
# let's validate the expected counts with a test: all 49 numbers should appear
assert len(stats.get_counts()) == 49

print stats.get_draws() # 2950
print stats.count_most_common(6) #[('34', 406), ('31', 405), ('45', 397), ('43', 394), ('40', 393), ('47', 386)]
print stats.count_least_common(6) #[('28', 331), ('14', 333), ('15', 339), ('16', 340), ('22', 341), ('13', 342)]

counts = stats.get_counts()
print(counts)
occurrences = [occ[1] for occ in counts]
print(str(occurrences))


Just by checking the most and least common numbers, we see that some seem to be drawn more often than others. Is that a big deal?

From the occurrences, the number 34 appears 406 times, while 28 appears only 331 times. That seems like it could be a big spread, but remember that we had 2950 draws.

We need to turn those counts into the chance of each number appearing in a draw. Out of 2950 draws, 34 showed up 406 times, which is 406/2950 or about 0.1376, a 13.8% chance of appearing, while 28 comes in at 331/2950 or about 0.1122.

What's a good tool to add some insight here? What does this data tell us? There are two really valuable statistical tools that can give us some understanding.

One is the standard deviation, which indicates how far apart the values are. This isn't really useful on its own, but it can be combined with the average (the mean) to find the coefficient of variation, which tells us how close all the values are to the average, relative to the size of that average.

A set like [10,10,10] has a coefficient of variation of 0; there is no spread at all. What does the set of our probabilities tell us?
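As a quick illustration, here is a minimal hand-rolled sketch of that idea (it uses the population standard deviation, separate from the standard_deviation helper imported in the next block):

def coefficient_of_variation(values):
    # spread of the values, relative to their average
    mean = sum(values) / float(len(values))
    variance = sum((v - mean) ** 2 for v in values) / float(len(values))
    return (variance ** 0.5) / mean

print(coefficient_of_variation([10, 10, 10]))  # 0.0 -- no spread at all
print(coefficient_of_variation([5, 10, 15]))   # ~0.41 -- a lot of spread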

from stats import standard_deviation

num_chances = [(num / float(2950)) for num in occurrences]
print(num_chances)
sd = standard_deviation(num_chances, False)  # The standard deviation is 0.00614129698134.
mean = sum(num_chances) / len(num_chances)
print("Mean " + str(mean))
variation = sd / float(mean)
print("coefficient of variation: " + str(variation))  # 0.0501539253476

The standard deviation is 0.00614129698134 and the mean (or average) is 0.122448979592, so the coefficient of variation is 0.0501539253476.

The 0.05 is extremely low, close to 0, and that tells us the chances for each number are all very, very close together. This means the actual drawing of the numbers seems to have no meaningful effect on the odds. It's pretty much a random draw, so we can conclude that the spinning drum with the balls flying around does a pretty good job.

Predictions?

Now that we have described the situation, let's try some predictive or prescriptive analytics on this issue.

Can we predict the lottery? With the long odds and the measured fairness of the process, it's just not predictable enough to make pursuing specific numbers worthwhile. The prescriptive advice then follows: you shouldn't buy a lottery ticket expecting to make money doing it. But what are the odds of winning over 10 or 20 years, even just once? The odds in each individual draw are the same every week; they don't change.
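To put a rough number on playing for years, here is a small back-of-the-envelope sketch (assuming one ticket for each of the two weekly draws, and using the 1 in 13,983,816 odds worked out in the previous post):

p = 1 / 13983816.0            # chance of winning any single draw
draws = 2 * 52 * 20           # 2080 draws over 20 years
at_least_once = 1 - (1 - p) ** draws
print(at_least_once)          # ~0.000149, still only about a 0.015% chance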

Statistics tell a story about data. To apply the methods and tools of stats properly, you have to understand the data itself and the types of questions you are asking of it.

Aside

There is a way to make money from the lottery. Let's calculate the cost of playing both weekly draws for 10 years; that's about 104 draws a year at, I think, $2 a ticket. Then let's say I bought an investment that gave 5% every year instead.

year_cost = 2 * 52 * 2  # $2 a ticket, two draws a week: $208 per year
total_gain = 0
for y in range(10):
    # apply a flat 5% gain to each year's contribution (kept simple, no compounding)
    total_gain += year_cost * 1.05
print(total_gain)  # 2184.0


If you kept that under your mattress, you would have $2080, or $2184 if you bought the investment, and the best part is that you have a 100% chance of getting that money, instead of the 0.00000715112% chance of winning the lotto on any given draw.
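As an aside on the arithmetic: the loop above applies a flat 5% to each year's $208. If each year's balance also earned 5% (real compounding), the total comes out a bit higher; a quick sketch, assuming deposits at the start of each year:

year_cost = 2 * 52 * 2  # $208 per year, as above
total = 0
for y in range(10):
    # add this year's savings, then grow the whole pot by 5%
    total = (total + year_cost) * 1.05
print(round(total, 2))  # ~2747.01 after ten years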

Maybe think of it this way: what are the chances that you will need a spare $2000 in the future?

Saturday, November 26, 2016

Checking my lottery chances with Probability

Check your lottery chances with statistics and probability; maybe we can find a way to get some lucky numbers!

In this post we show how methods of statistics and probability can be used to understand the chances of actually winning the lottery. The various methods will be worked out in Python.

Analytics

This is really 'descriptive' analytics: we are just telling the story of what's there. Afterwards we can get into predictive and prescriptive analytics, but before that the foundation needs some building.

First, let's get an understanding of the odds of winning the lottery, and how we find out what those odds are.

There are actually two interpretations of probability, two different approaches to working out how likely an event is to happen. One is called "classic" and has evolved a bit to be termed "frequentist". The other is usually called "subjective" or "Bayesian". Let's see how these approaches compare on our lotto chances problem.

Counting

First, the classic approach. We have a set of 49 numbers and we want to find out the chances of matching a particular set of 6 unique numbers. We can count this as a combination; it's 49 'choose' 6.

Combinations are expressed as:

$$ \binom{n}{r} = \dfrac{n!}{(n-r)! \, r!} $$

We use combinations instead of permutations because the actual order of the result doesn't matter. Whether we draw 1, 20, 33, 41, 5, 6 or 1, 33, 41, 20, 6, 5 makes no difference; we just need the same 6 unique numbers.

A solution in Python could be something like:

import math

n = 49
r = 6
denominator = math.factorial(n - r) * math.factorial(r)
num_combos = math.factorial(n) / denominator
print(num_combos)  # 13983816

So that's about 14 million, meaning you have a 1 in 14 million chance of getting those numbers.
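To see what ignoring the order buys us, compare the count of ordered draws (permutations) with the combinations we just computed:

import math

n, r = 49, 6
permutations = math.factorial(n) // math.factorial(n - r)  # ordered draws of 6 from 49
combinations = permutations // math.factorial(r)           # divide out the 6! orderings
print(permutations)  # 10068347520
print(combinations)  # 13983816 -- 720 (that is, 6!) times smaller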

Bayesian

Now let's use a Bayesian probability approach to find out how good our chances are. In Bayesian probability you start from what you currently know, and that prior knowledge is taken into account to determine the probability of the next event. The entire set of data doesn't need to be known up front; you just venture forward with what you have for each new case.

Do the numbers drawn previously matter? They do: we are looking for unique numbers, so we can't draw the same number twice. These are dependent events. Two events are dependent if the outcome or occurrence of the first affects the outcome or occurrence of the second so that the probability is changed.

When two events, A and B, are dependent, the probability of both occurring is expressed by:

$$ P(A \text{ and } B) = P(A) \cdot P(B \mid A) $$

It's the probability of the first event combined with the second, and that result combined with the third, and so on. The order of the result still doesn't matter.

In Python it could be expressed like so:


select = 49   # numbers left in the drum
choose = 6    # numbers we still need to match
total_prob = choose / float(select)  # chance the first ball drawn is one of ours: 6/49
for x in range(5):
    # one of our numbers is gone, so one less ball in the drum and one less to match
    select = select - 1
    choose = choose - 1
    prob = choose / float(select)
    total_prob = total_prob * prob

odds = 1 / float(total_prob)
print str(odds)  # 13983816.0


Just under 14 million again, and when you look closer you see that the result is exactly the same (13983816). It's the same answer, arrived at a different way.
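A quick sanity check, assuming both snippets above were run in the same session:

# the chained conditional probabilities land back on exactly 49 choose 6
assert int(round(odds)) == num_combos == 13983816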

What is interesting about the Bayesian-style solution is that we loop until we get to the end, without needing to know how big the initial set is. That way of interpreting probability lends itself nicely to computational models, and is why Bayesian methods are a big part of the emerging field of data science. This isn't so obvious in the descriptive part of analytics, but Bayesian methods really shine when the data isn't complete or when the data set is extremely large.

Using frequentist methods with large data sets, you need to take samples and understand what constitutes a sample that is large enough to be significant.

Is one better than the other? I don't think it's a matter of better or worse, just different. Each has advantages depending on the problem and the conditions around it, and it's interesting to see how different methods can arrive at the same result.

To get really into it, read this very good book:

  • http://xcelab.net/rm/statistical-rethinking/ 
It's also a good idea to read posts from those who do this all the time:
  • http://www.fharrell.com/2017/02/my-journey-from-frequentist-to-bayesian.html


But let's save the theory for another day; we have to get back to our lottery problem!

We now know that the chances are long, no matter how they are measured. How about the actual drawing of the numbers? In the next post, we are going to get some historical data on the numbers that have been drawn, and see if we can find some patterns to better our odds. 






Tuesday, November 15, 2016

Algorithms intro: Profiling python code


When developing software and implementing algorithms it's always important to be as efficient as possible: to have an elegant solution that is concise and understandable to others, but that most importantly uses the resources it has in the most efficient way.

This post extends the previous one, http://jseller.blogspot.ca/2016/11/algorithms-intro-implementing-math-proof.html, and profiles the algorithms we created to implement Euclid's GCD method.

Profiling

Python, like most every language, has profiling tools to check various computer resources; most frequently this means the memory used and the CPU cycles consumed. For this post we are just concerned with the time our algorithms take to execute.


import cProfile, pstats, io

# you can run any code passed in as a string:
cProfile.run('fill_rectangle_with_squares(Rectangle(24000,19453))')

# but the Profile object can also be enabled and disabled around many lines:
pr = cProfile.Profile()
pr.enable()
fill_rectangle_with_squares(Rectangle(2400000,194530))
pr.disable()
pr.print_stats()
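The pstats module imported above can sort and trim that report; for example, reusing the pr profiler object from the block above:

import pstats

stats = pstats.Stats(pr)
# show only the ten most expensive entries, sorted by cumulative time
stats.sort_stats('cumulative').print_stats(10)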

With

If there are many lines, it can get a little difficult to keep the enable and disable calls in the right spots, and the cleanup part can accidentally get deleted. Python has a 'with' statement to make this nice and clean: http://effbot.org/zone/python-with-statement.htm

import cProfile

class profile_block:
    def __enter__(self):
        # set up: start profiling when the block is entered
        self.profile = cProfile.Profile()
        self.profile.enable()
    def __exit__(self, type, value, traceback):
        # tear down: always stop profiling and print, even if the block raised
        self.profile.disable()
        self.profile.print_stats()

Try running a few versions and look at the output. It will probably take some larger numbers before you can see a change. The times depend on the machine, but by using the 'with' block we know the disable and printing will always happen, and the code is nicer.

with profile_block():
    fill_rectangle_with_squares(Rectangle(24000000,19453003))

with profile_block():
    fill_rectangle_with_squares_iterative(24000000,19453003)

Running the different approaches with the large example shows close to a 3x difference with the optimized version.


with profile_block():
    fill_rectangle_with_squares_iterative(2400000223334219423423424234240,1945303434230234234242322)

with profile_block():
    greatest_common_divisor(2400000223334219423423424234240,1945303434230234234242322)

The output below shows a big improvement between our two versions.


fill 2400000223334219423423424234240, 1945303434230234234242322 with squares 
smallest square found 2, 2 
         6 function calls in 0.989 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 euclid.py:104(__exit__)
        1    0.000    0.000    0.000    0.000 euclid.py:22(__init__)
        2    0.000    0.000    0.000    0.000 euclid.py:25(__str__)
        1    0.989    0.989    0.989    0.989 euclid.py:55(fill_rectangle_with_squares_iterative)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

greatest common divisor 2
         3 function calls in 0.386 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 euclid.py:104(__exit__)
        1    0.386    0.386    0.386    0.386 euclid.py:77(greatest_common_divisor)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


Wrap it up

In the first part of this post we solved the problem with computation, and in this case it was a classic: Euclid's greatest common divisor method. His proof in the Elements uses only geometry, but we also used geometry as a handy tool to visualize and understand the proof.
What we wanted to find out, though, was how to validate this method with computation, and to understand which approach works best.
The recursive implementation models the real-world scenario more closely, but since this is being solved with computation, our iterative approach is able to take advantage of the computing power at hand.

Finding out how well that actually performed is the job of profiling, and a well measured system, or just a small part of it, can go a long way to understanding how to write efficient code.



Algorithms intro: implementing a math proof


What's a nice way of using a visual example to better understand how a math proof can be implemented as an algorithm in code? Here we implement one, comparing recursive and iterative versions of the same method, and then validate the results with profiling.

In problem solving, a great methodology to use is: Understand, Plan, Execute and Review.
We will use a bit of geometry to understand the problem, and then plan how to solve it; then we will implement code to execute that plan and review how things worked out.

Understand

The Euclidean algorithm calculates the greatest common divisor (GCD) of two natural numbers a and b (or x and y).

We can use some geometry to visualize this; check out this video of it in action. From that, it looks like a matter of:
  1. Fill with squares along the shortest side of the rectangle
  2. For the remaining rectangle, fill with squares again, until the smallest square is found.

Plan

Let's write down the steps in more detail:
  • fill rectangle with squares
    • find the shortest side
    • divide the long side by the square side, get the remainder
    • if the remainder is 0, the smallest square is found
  • for the remainder, fill the rectangle with squares again

Execute

To model the visualization, let's make a Rectangle object and use that with a function. Let's start with what we know, the Rectangle:

class Rectangle(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __str__(self):
        return "%s, %s" % (self.x, self.y)

Now let's fill the rectangle just like the visualization does: the next rectangle is filled, and so on until the end. This style of implementation is recursive.
  • fill_rectangle_with_squares(2,4)
    • fill_rectangle_with_squares(2,2)
      • 2 is the smallest square
We keep calling the fill_rectangle_with_squares function until the x and y sides are equal. This is the stopping condition.

def fill_rectangle_with_squares(rectangle):
    print("Fill " + str(rectangle))
    # check the sides, if they aren't equal make a new rectangle with the remaining
    if rectangle.x == rectangle.y:
        print('smallest square found %s ' % str(rectangle))
        return rectangle.x
    elif rectangle.x > rectangle.y:
        return fill_rectangle_with_squares(Rectangle(rectangle.x - rectangle.y, rectangle.y))
    elif rectangle.y > rectangle.x:
        return fill_rectangle_with_squares(Rectangle(rectangle.x, rectangle.y - rectangle.x))

Now we call the function with asserts to test the results.

To see if the function returns what we want, we assert that a 2 by 4 rectangle gives a smallest square of 2, and test some other sizes as well:

assert fill_rectangle_with_squares(Rectangle(2,4)) == 2
assert fill_rectangle_with_squares(Rectangle(2,2)) == 2
assert fill_rectangle_with_squares(Rectangle(140,68)) == 4
assert fill_rectangle_with_squares(Rectangle(24,12)) == 12

Review

Let's keep reviewing our solution with other sizes, like Rectangle(143,62) and Rectangle(270,192). These work fine, but what happens when larger values are involved?

Rectangle(2400000223334219423423424234240,1945303434230234234242322)

Exactly when this fails will vary from machine to machine, but on my home laptop I get a:

RuntimeError: maximum recursion depth exceeded


This means the call stack limit has been reached, a limit set by the design of the Python language itself. Some languages, like LISP, don't have this limitation and just end up using whatever resources the machine has. The limit is imposed because the memory available for the call stack is a finite space on the machine. If you need more recursion depth than that and still aren't getting results, the way forward is an approach that is bound only by time and the running of the CPU.
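Python's limit is adjustable if you need a little more headroom, though raising it only postpones the problem; a small sketch using the standard sys module:

import sys

print(sys.getrecursionlimit())  # commonly 1000 by default
sys.setrecursionlimit(5000)     # raise the ceiling, at the cost of more stack memory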

How can we make this better? Let's implement the algorithm using iteration.

def fill_rectangle_with_squares_iterative(x, y):
    rectangle = Rectangle(x, y)
    print('fill %s with squares ' % str(rectangle))
    found = False
    while not found:
        # keep shrinking the longer side until both sides are equal
        found = rectangle.x == rectangle.y
        if rectangle.x > rectangle.y:
            rectangle.x = rectangle.x - rectangle.y
        elif rectangle.y > rectangle.x:
            rectangle.y = rectangle.y - rectangle.x
    print('smallest square found %s ' % str(rectangle))
    return rectangle.x

So far, so good, and the tests work with fill_rectangle_with_squares_iterative(24000000,19453003)

Can the iterative version take the massive difficult rectangle that blew the call stack with the recursive version?

fill_rectangle_with_squares_iterative(2400000223334219423423424234240,1945303434230234234242322)

It does, but it takes a couple of seconds.

Optimize

I know it took a couple of seconds because my console reports how long each code execution takes. What's really happening here? To find out more, I need to profile the code, which we will do in the second part of this post.
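If your console doesn't report execution times, a quick way to check is to wrap the call yourself; a minimal sketch using the standard time module:

import time

start = time.time()
fill_rectangle_with_squares_iterative(2400000223334219423423424234240,
                                      1945303434230234234242322)
print('took %.2f seconds' % (time.time() - start))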

Before we do, though, let's take a crack at optimizing the code even further.

When you look at the iterative function, you can see that we are just modifying the Rectangle object's x and y values. We can keep track of x and y directly and not bother making the Rectangle object at all. We could make one object and reuse it rather than creating a new one each time, but let's go a step further: with only two variables the code stays concise.

We used the Rectangle as a way to visualize the problem, but a terser notation is possible and ends up being very clear, so let's boil the code down to the original description:

Given two (natural) numbers not prime to one another, to find their greatest common measure.
(The Elements: Book VII: Proposition 2)


def greatest_common_divisor(x, y):
    while x != y:
        if x > y:
            x = x - y
        elif y > x:
            y = y - x
    return x

print('greatest common divisor found %s' % (str(greatest_common_divisor(24,10))))

That's much more straightforward, and can probably be more concise, but I'd like to stop here to keep it readable and easier to understand.
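For the curious, the usual next refinement swaps the repeated subtraction for the remainder operator, which collapses many subtractions into a single step; a small sketch:

def greatest_common_divisor_mod(x, y):
    # x % y does in one step what repeated subtraction does in many
    while y:
        x, y = y, x % y
    return x

assert greatest_common_divisor_mod(24, 10) == 2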

Wrap it up

Let's review what we did in this post. We understood the problem and used some geometry to show a solution; since it was Euclid's proof, that seemed like the right thing to do. We then planned a solution using recursion, as it matched our real-world model.
This worked to a point, but we reviewed our options and implemented an iterative method that could handle much larger rectangles. Once we had that solved, we optimized for the computational abilities of the language and the machine we are using.

Our last implementation works for all the tests above, and certainly seems to run faster. How much faster?
We will profile all these functions and find out in the next post:  http://jseller.blogspot.ca/2016/11/algorithms-intro-profiling-python-code.html